cs.LG - 2023-08-23

The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation Detection

  • paper_url: http://arxiv.org/abs/2308.12215
  • repo_url: https://github.com/ramybaly/News-Media-Reliability
  • paper_authors: Madelyne Xiao, Jonathan Mayer
  • for: This paper examines the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study.
  • methods: The authors systematize the literature on automated misinformation detection across a corpus of 270 well-cited papers, then analyze subsets of these papers for data and code availability, design missteps, reproducibility, and generalizability.
  • results: The study finds significant shortcomings in the literature: detection tasks are often meaningfully distinct from the challenges that online services actually face, datasets and model evaluation are often non-representative of real-world contexts, and models do not generalize well to out-of-domain data. These shortcomings call into question claimed performance and practicality.
    Abstract We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We systematize literature on automated detection of misinformation across a corpus of 270 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. We find significant shortcomings in the literature that call into question claimed performance and practicality. Detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. Data and code availability is poor. Models do not generalize well to out-of-domain data. Based on these results, we offer recommendations for evaluating machine learning applications to trust and safety problems. Our aim is for future work to avoid the pitfalls that we identify.

Learning to Learn Financial Networks for Optimising Momentum Strategies

  • paper_url: http://arxiv.org/abs/2308.12212
  • repo_url: None
  • paper_authors: Xingyue Pu, Stefan Zohren, Stephen Roberts, Xiaowen Dong
  • for: This paper addresses network momentum, a novel type of risk premium that exploits the interconnections among assets in a financial network to predict future returns.
  • methods: The paper proposes L2GMOM, an end-to-end machine learning framework that simultaneously learns financial networks and optimises trading signals for network momentum strategies, using a neural network with an interpretable forward propagation architecture derived from algorithm unrolling.
  • results: Backtesting on 64 continuous futures contracts demonstrates a significant improvement in portfolio profitability and risk control, with a Sharpe ratio of 1.74 across a 20-year period.
    Abstract Network momentum provides a novel type of risk premium, which exploits the interconnections among assets in a financial network to predict future returns. However, the current process of constructing financial networks relies heavily on expensive databases and financial expertise, limiting accessibility for small-sized and academic institutions. Furthermore, the traditional approach treats network construction and portfolio optimisation as separate tasks, potentially hindering optimal portfolio performance. To address these challenges, we propose L2GMOM, an end-to-end machine learning framework that simultaneously learns financial networks and optimises trading signals for network momentum strategies. The model of L2GMOM is a neural network with a highly interpretable forward propagation architecture, which is derived from algorithm unrolling. The L2GMOM is flexible and can be trained with diverse loss functions for portfolio performance, e.g. the negative Sharpe ratio. Backtesting on 64 continuous futures contracts demonstrates a significant improvement in portfolio profitability and risk control, with a Sharpe ratio of 1.74 across a 20-year period.
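The loss-function flexibility mentioned in the abstract is easy to illustrate. Below is a minimal sketch (not the paper's code) of a negative-Sharpe-ratio training loss in PyTorch; the tensor shapes and the 252-day annualisation are assumptions, and L2GMOM's unrolled network architecture is not reproduced here.

```python
import torch

def negative_sharpe_loss(positions: torch.Tensor, returns: torch.Tensor,
                         trading_days: float = 252.0,
                         eps: float = 1e-8) -> torch.Tensor:
    """Negative annualised Sharpe ratio, usable as a portfolio training loss.

    positions: (time, assets) trading signals produced by the model.
    returns:   (time, assets) realised next-period asset returns.
    """
    portfolio_returns = (positions * returns).sum(dim=1)   # (time,)
    mean = portfolio_returns.mean()
    std = portfolio_returns.std() + eps                    # avoid division by zero
    sharpe = mean / std * trading_days ** 0.5
    return -sharpe  # minimising this maximises the Sharpe ratio
```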

ULDP-FL: Federated Learning with Across Silo User-Level Differential Privacy

  • paper_url: http://arxiv.org/abs/2308.12210
  • repo_url: https://github.com/fumiyukikato/uldp-fl
  • paper_authors: Fumiyuki Kato, Li Xiong, Shun Takagi, Yang Cao, Masatoshi Yoshikawa
  • for: This paper focuses on providing user-level differential privacy (DP) in cross-silo federated learning (FL) settings, where a single user’s data may belong to multiple silos.
  • methods: The proposed algorithm, called ULDP-FL, ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. The algorithm also utilizes cryptographic building blocks to enhance its utility and provide private implementation.
  • results: The paper provides a theoretical analysis of the algorithm’s privacy and utility, and empirical experiments on real-world datasets show substantial improvements in privacy-utility trade-offs under user-level DP compared to baseline methods.
    Abstract Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record-level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present ULDP-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm directly ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. Additionally, we enhance the algorithm's utility and showcase its private implementation using cryptographic building blocks. Empirical experiments on real-world datasets show substantial improvements in our methods in privacy-utility trade-offs under user-level DP compared to baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting.
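Per-user weighted clipping, the core mechanism described above, can be sketched as follows. This is a simplified illustration (not the authors' code), assuming each silo holds gradient contributions tagged with user IDs and that each user's weights sum to 1 across silos; the cryptographic building blocks for private aggregation are omitted.

```python
import numpy as np

def per_user_weighted_clip(user_grads: dict, user_weights: dict,
                           clip_norm: float) -> np.ndarray:
    """Clip each user's weighted contribution to bound per-user sensitivity.

    user_grads:   per-user summed gradient contributions within one silo.
    user_weights: weight assigned to this silo for each user; across all
                  silos a user's weights sum to 1, so the total per-user
                  contribution stays bounded by clip_norm.
    """
    total = None
    for uid, g in user_grads.items():
        contribution = user_weights[uid] * g
        norm = np.linalg.norm(contribution)
        if norm > clip_norm:  # scale down so ||contribution|| <= clip_norm
            contribution = contribution * (clip_norm / norm)
        total = contribution if total is None else total + contribution
    return total

# The aggregator would then add Gaussian noise calibrated to clip_norm to
# the summed update to obtain a user-level DP guarantee.
```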

Curriculum Learning with Adam: The Devil Is in the Wrong Details

  • paper_url: http://arxiv.org/abs/2308.12202
  • repo_url: None
  • paper_authors: Lucas Weber, Jaap Jumelet, Paul Michel, Elia Bruni, Dieuwke Hupkes
  • for: This paper aims to explore the limitations of curriculum learning (CL) methods in natural language processing (NLP) and understand why they often fail to achieve expected results.
  • methods: The paper uses a combination of replication and extension of recent CL methods, as well as a deep dive into the (in)effectiveness of these curricula in different scenarios.
  • results: The paper finds that CL methods often learn to adapt to suboptimally chosen optimisation parameters for the Adam algorithm, and that none of the common hand-crafted and automated CL approaches outperforms optimisation with only Adam and well-chosen hyperparameters. The results contribute to understanding why CL methods work, but also urge caution when claiming positive results.
    Abstract Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of recent curriculum methods, we find that their results are surprisingly brittle when applied to NLP. A deep dive into the (in)effectiveness of the curricula in some scenarios shows us why: when curricula are employed in combination with the popular Adam optimisation algorithm, they oftentimes learn to adapt to suboptimally chosen optimisation parameters for this algorithm. We present a number of different case studies with different common hand-crafted and automated CL approaches to illustrate this phenomenon, and we find that none of them outperforms optimisation with only Adam with well-chosen hyperparameters. As such, our results contribute to understanding why CL methods work, but at the same time urge caution when claiming positive results.

Predicting Drug Solubility Using Different Machine Learning Methods – Linear Regression Model with Extracted Chemical Features vs Graph Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2308.12325
  • repo_url: None
  • paper_authors: John Ho, Zhao-Heng Yin, Colin Zhang, Henry Overhauser, Kyle Swanson, Yang Ha
  • for: This study aims to improve understanding of how chemical structure influences chemical properties such as solubility, which is crucial when designing new drugs.
  • methods: The study applies two machine learning models, a linear regression model and a graph convolutional neural network (GCNN) model, to multiple experimental datasets.
  • results: The GCNN model had the best performance across the experimental datasets, but it is a black box, whereas feature importance analysis from the linear regression model offers more insight into the underlying chemical influences, showing how each functional group affects overall solubility.
    Abstract Predicting the solubility of given molecules is an important task in the pharmaceutical industry, and consequently this is a well-studied topic. In this research, we revisited this problem with the advantage of modern computing resources. We applied two machine learning models, a linear regression model and a graph convolutional neural network model, on multiple experimental datasets. Both methods can make reasonable predictions while the GCNN model had the best performance. However, the current GCNN model is a black box, while feature importance analysis from the linear regression model offers more insights into the underlying chemical influences. Using the linear regression model, we show how each functional group affects the overall solubility. Ultimately, knowing how chemical structure influences chemical properties is crucial when designing new drugs. Future work should aim to combine the high performance of GCNNs with the interpretability of linear regression, unlocking new advances in next generation high throughput screening.
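The interpretability argument above is easy to make concrete. The sketch below (an illustration, not the paper's code) fits a linear model on hypothetical functional-group count features and reads each group's contribution to solubility directly off the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix: counts of functional groups per molecule.
feature_names = ["hydroxyl", "carboxyl", "amine", "aromatic_ring", "halogen"]
X = np.array([
    [2, 0, 1, 1, 0],
    [0, 1, 0, 2, 1],
    [1, 1, 0, 0, 0],
    [3, 0, 2, 1, 0],
    [0, 0, 0, 3, 2],
])
y = np.array([-1.2, -3.5, -0.8, -0.3, -4.9])  # hypothetical log-solubility values

model = LinearRegression().fit(X, y)

# Each coefficient is the marginal effect of one more occurrence of that
# functional group on predicted log-solubility, the kind of insight a
# black-box GCNN does not directly provide.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name:14s} {coef:+.3f}")
```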

Self-Supervised Knowledge-Driven Deep Learning for 3D Magnetic Inversion

  • paper_url: http://arxiv.org/abs/2308.12193
  • repo_url: None
  • paper_authors: Yinshuo Li, Zhuo Jia, Wenkai Lu, Cao Song
  • for: This paper proposes a self-supervised deep learning method for magnetic inversion, a non-destructive geophysical technique that estimates the subsurface susceptibility distribution from surface magnetic anomaly data.
  • methods: The proposed self-supervised knowledge-driven 3D magnetic inversion method (SSKMI) learns on the target field data through a closed loop between the inversion and forward models, with a knowledge-driven module that makes the deep learning model more explicable.
  • results: Comparative experiments show that the proposed method reliably estimates subsurface susceptibility from magnetic anomaly data, and that the knowledge-driven module accelerates training and achieves better results.
    Abstract The magnetic inversion method is one of the non-destructive geophysical methods, which aims to estimate the subsurface susceptibility distribution from surface magnetic anomaly data. Recently, supervised deep learning methods have been widely utilized in lots of geophysical fields including magnetic inversion. However, these methods rely heavily on synthetic training data, whose performance is limited since the synthetic data is not independently and identically distributed with the field data. Thus, we proposed to realize magnetic inversion by self-supervised deep learning. The proposed self-supervised knowledge-driven 3D magnetic inversion method (SSKMI) learns on the target field data by a closed loop of the inversion and forward models. Given that the parameters of the forward model are preset, SSKMI can optimize the inversion model by minimizing the mean absolute error between observed and re-estimated surface magnetic anomalies. Besides, there is a knowledge-driven module in the proposed inversion model, which makes the deep learning method more explicable. Meanwhile, comparative experiments demonstrate that the knowledge-driven module can accelerate the training of the proposed method and achieve better results. Since magnetic inversion is an ill-posed task, SSKMI proposes to constrain the inversion model by a guideline in the auxiliary loop. The experimental results demonstrate that the proposed method is a reliable magnetic inversion method with outstanding performance.
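The closed training loop described above can be sketched in a few lines. This is a schematic illustration (not the authors' code), assuming `inversion_net` is a trainable network and `forward_model` is a fixed, differentiable physics operator with preset parameters mapping susceptibility to surface anomalies.

```python
import torch

def self_supervised_step(inversion_net: torch.nn.Module,
                         forward_model,            # fixed physics operator
                         observed_anomaly: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    """One closed-loop update: invert, re-simulate, and match the field data."""
    optimizer.zero_grad()
    susceptibility = inversion_net(observed_anomaly)   # inversion model
    re_estimated = forward_model(susceptibility)       # forward model
    # Train by minimising the mean absolute error between observed and
    # re-estimated surface magnetic anomalies; no synthetic labels needed.
    loss = torch.mean(torch.abs(re_estimated - observed_anomaly))
    loss.backward()
    optimizer.step()
    return loss.item()
```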

Robustness Analysis of Continuous-Depth Models with Lagrangian Techniques

  • paper_url: http://arxiv.org/abs/2308.12192
  • repo_url: None
  • paper_authors: Sophie A. Neubauer, Radu Grosu
  • for: This work presents, in a unified fashion, deterministic and statistical Lagrangian verification techniques that formally quantify the behavioral robustness of any time-continuous process formulated as a continuous-depth model.
  • methods: The study reviews LRT-NG, SLR, and GoTube, algorithms for constructing a tight reachtube (an over-approximation of the set of states reachable within a given time horizon), and provides deterministic and statistical guarantees for the reachtube bounds.
  • results: Experiments demonstrate the superior performance of the Lagrangian techniques compared to LRT, Flow*, and CAPD, and illustrate their use in the robustness analysis of various continuous-depth models.
    Abstract This paper presents, in a unified fashion, deterministic as well as statistical Lagrangian-verification techniques. They formally quantify the behavioral robustness of any time-continuous process, formulated as a continuous-depth model. To this end, we review LRT-NG, SLR, and GoTube, algorithms for constructing a tight reachtube, that is, an over-approximation of the set of states reachable within a given time-horizon, and provide guarantees for the reachtube bounds. We compare the usage of the variational equations, associated to the system equations, the mean value theorem, and the Lipschitz constants, in achieving deterministic and statistical guarantees. In LRT-NG, the Lipschitz constant is used as a bloating factor of the initial perturbation, to compute the radius of an ellipsoid in an optimal metric, which over-approximates the set of reachable states. In SLR and GoTube, we get statistical guarantees, by using the Lipschitz constants to compute local balls around samples. These are needed to calculate the probability of having found an upper bound, of the true maximum perturbation at every timestep. Our experiments demonstrate the superior performance of Lagrangian techniques, when compared to LRT, Flow*, and CAPD, and illustrate their use in the robustness analysis of various continuous-depth models.
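As a concrete illustration of the Lipschitz-constant bloating idea used in LRT-NG, the sketch below (a simplification, not the paper's algorithms) grows the radius of a ball around a nominal trajectory at the rate exp(L·dt), which by a Gronwall-type bound over-approximates how far perturbed trajectories of an L-Lipschitz vector field can drift; discretization error of the Euler step is ignored here.

```python
import numpy as np

def euler_reachtube(f, x0: np.ndarray, r0: float, lipschitz: float,
                    t_horizon: float, dt: float = 1e-2):
    """Over-approximate reachable states of x' = f(x) by balls around a
    nominal trajectory, bloating the radius with the Lipschitz constant.

    Returns a list of (center, radius) pairs: every state reachable from
    the initial ball B(x0, r0) stays inside the corresponding ball.
    """
    tube = [(x0.copy(), r0)]
    x, r = x0.copy(), r0
    for _ in range(int(t_horizon / dt)):
        x = x + dt * f(x)                 # Euler step of the nominal trajectory
        r = r * np.exp(lipschitz * dt)    # Gronwall-style bloating factor
        tube.append((x.copy(), r))
    return tube

# Example: a damped oscillator, with the Lipschitz constant taken as the
# spectral norm of its linear dynamics matrix.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
tube = euler_reachtube(lambda x: A @ x, np.array([1.0, 0.0]),
                       r0=0.05, lipschitz=np.linalg.norm(A, 2), t_horizon=2.0)
```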

Development and external validation of a lung cancer risk estimation tool using gradient-boosting

  • paper_url: http://arxiv.org/abs/2308.12188
  • repo_url: https://github.com/plbenveniste/lungcancerrisk
  • paper_authors: Pierre-Louis Benveniste, Julie Alberge, Lei Xing, Jean-Emmanuel Bibault
  • for: This study develops a machine-learning tool to estimate the likelihood of lung cancer occurrence within five years.
  • methods: The study uses data from the PLCO Cancer Screening Trial for training and the NLST trial for external validation, and performs feature selection, hyper-parameter optimization, and model calibration with XGBoost, an ensemble learning algorithm combining gradient boosting and decision trees.
  • results: The model was well-calibrated (Brier score = 0.044), with a ROC-AUC of 82% on the PLCO dataset and 70% on the NLST dataset. Compared with the USPSTF guidelines for lung cancer screening, the model provided the same recall with higher precision (13.1% vs. 9.3% on PLCO).
    Abstract Lung cancer is a significant cause of mortality worldwide, emphasizing the importance of early detection for improved survival rates. In this study, we propose a machine learning (ML) tool trained on data from the PLCO Cancer Screening Trial and validated on the NLST to estimate the likelihood of lung cancer occurrence within five years. The study utilized two datasets, the PLCO (n=55,161) and NLST (n=48,595), consisting of comprehensive information on risk factors, clinical measurements, and outcomes related to lung cancer. Data preprocessing involved removing patients who were not current or former smokers and those who had died of causes unrelated to lung cancer. Additionally, a focus was placed on mitigating bias caused by censored data. Feature selection, hyper-parameter optimization, and model calibration were performed using XGBoost, an ensemble learning algorithm that combines gradient boosting and decision trees. The ML model was trained on the pre-processed PLCO dataset and tested on the NLST dataset. The model incorporated features such as age, gender, smoking history, medical diagnoses, and family history of lung cancer. The model was well-calibrated (Brier score=0.044). ROC-AUC was 82% on the PLCO dataset and 70% on the NLST dataset. PR-AUC was 29% and 11% respectively. When compared to the USPSTF guidelines for lung cancer screening, our model provided the same recall with a precision of 13.1% vs. 9.3% on the PLCO dataset and 3.2% vs. 3.1% on the NLST dataset. The developed ML tool provides a freely available web application for estimating the likelihood of developing lung cancer within five years. By utilizing risk factors and clinical data, individuals can assess their risk and make informed decisions regarding lung cancer screening. This research contributes to the efforts in early detection and prevention strategies, aiming to reduce lung cancer-related mortality rates.
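A minimal sketch of the modeling recipe described above (illustrative only; the actual features, preprocessing, and hyper-parameter search are in the linked repository, and the synthetic data and hyperparameters below are placeholders):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tabular risk-factor data (age, smoking
# history, diagnoses, family history, ...); the real features come from
# the PLCO and NLST trials.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    eval_metric="logloss",
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_val)[:, 1]
print("Brier score:", brier_score_loss(y_val, proba))  # calibration
print("ROC-AUC:   ", roc_auc_score(y_val, proba))      # discrimination
```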

Unsupervised anomalies detection in IIoT edge devices networks using federated learning

  • paper_url: http://arxiv.org/abs/2308.12175
  • repo_url: None
  • paper_authors: Niyomukiza Thamar, Hossam Samy Elsaid Sharara
  • for: This study addresses data privacy concerns for IoT/IIoT devices by using federated learning (FL), a distributed machine learning approach that trains models on the devices themselves without transmitting the data to a central server.
  • methods: The study implements the FedAvg algorithm, in which participating devices train the model locally and send the results to a coordinating server, which averages the models from the devices that finished training.
  • results: FedAvg achieves results nearly matching the centralized machine learning approach while addressing the data privacy concerns. However, the study also identifies shortcomings of FedAvg, such as unfairness when struggling devices do not participate in every stage of training, which could cause a high number of false alarms in intrusion detection systems; to address this, the authors propose a Fair FedAvg algorithm for future evaluation.
    Abstract In a network of many IoT devices that each collect data, training a machine learning model would normally involve transmitting the data to a central server, which requires strict privacy rules. However, some owners are reluctant to release their data outside the company due to data security concerns. Federated learning (FL), a distributed machine learning approach, performs training of a machine learning model on the device that gathered the data itself. In this scenario, data is not shared over the network for training purposes. FedAvg, one of the FL algorithms, permits a model to be copied to participating devices during a training session. The devices can be chosen at random, and a device can be aborted. The resulting models are sent to the coordinating server, which then averages the models from the devices that finished training. The process is repeated until a desired model accuracy is achieved. In this way, the FL approach solves the privacy problem for IoT/IIoT devices that hold data sensitive to their owners. In this paper, we leverage the benefits of FL and implement the FedAvg algorithm on a recent dataset that represents modern IoT/IIoT device networks. The results were almost the same as those of the centralized machine learning approach. We also evaluate some shortcomings of FedAvg, such as the unfairness that arises during training when struggling devices do not participate in every stage of training. This inefficient training of the local or global model could lead to a high number of false alarms in intrusion detection systems for IoT/IIoT devices developed using FedAvg. Hence, after evaluating the FedAvg deep autoencoder against a centralized deep autoencoder, we further propose and design a Fair FedAvg algorithm to be evaluated in future work.
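The FedAvg aggregation step described above is compact enough to sketch directly. The following is a generic illustration (not the authors' implementation), assuming each client returns its updated parameter vector and local sample count, with the standard FedAvg rule of weighting each model by its client's data size.

```python
import numpy as np

def fedavg_round(global_params: np.ndarray, clients, local_train,
                 frac: float = 0.5,
                 rng=np.random.default_rng(0)) -> np.ndarray:
    """One FedAvg round: sample clients, train locally, average by data size.

    clients:     list of local datasets (never sent over the network).
    local_train: function(params, dataset) -> (updated_params, n_samples).
    """
    selected = rng.choice(len(clients), size=max(1, int(frac * len(clients))),
                          replace=False)
    updates, weights = [], []
    for i in selected:
        params_i, n_i = local_train(global_params.copy(), clients[i])
        updates.append(params_i)
        weights.append(n_i)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # The weighted average of the client models becomes the new global model.
    return sum(w * p for w, p in zip(weights, updates))
```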

Data-driven decision-focused surrogate modeling

  • paper_url: http://arxiv.org/abs/2308.12161
  • repo_url: https://github.com/ddolab/decfocsurrmod
  • paper_authors: Rishabh Gupta, Qi Zhang
  • for: Solving computationally challenging nonlinear optimization problems in real-time settings.
  • methods: A data-driven framework that learns a simpler (e.g. convex) surrogate optimization model trained to minimize the decision prediction error, formulated as a bilevel program and solved with a decomposition-based algorithm.
  • results: The framework is validated through numerical experiments on common nonlinear chemical processes and compared with standard data-driven surrogate modeling methods, showing significantly higher data efficiency while producing simple surrogate models with high decision prediction accuracy.
    Abstract We introduce the concept of decision-focused surrogate modeling for solving computationally challenging nonlinear optimization problems in real-time settings. The proposed data-driven framework seeks to learn a simpler, e.g. convex, surrogate optimization model that is trained to minimize the decision prediction error, which is defined as the difference between the optimal solutions of the original and the surrogate optimization models. The learning problem, formulated as a bilevel program, can be viewed as a data-driven inverse optimization problem to which we apply a decomposition-based solution algorithm from previous work. We validate our framework through numerical experiments involving the optimization of common nonlinear chemical processes such as chemical reactors, heat exchanger networks, and material blending systems. We also present a detailed comparison of decision-focused surrogate modeling with standard data-driven surrogate modeling methods and demonstrate that our approach is significantly more data-efficient while producing simple surrogate models with high decision prediction accuracy.
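To make "decision prediction error" concrete, the sketch below (a toy illustration under strong assumptions, not the paper's decomposition algorithm) fits a convex quadratic surrogate f(x; c) = ½xᵀQx + (Wc)ᵀx whose minimizer has the closed form x* = -Q⁻¹(Wc), so the surrogate parameters can be trained by gradient descent to match the optimal solutions of the original problem.

```python
import torch

# Hypothetical data: problem-instance contexts c and the corresponding
# optimal solutions x_star of the original (expensive) nonlinear problem.
contexts = torch.tensor([[0.2], [0.5], [0.8]])
x_star_data = torch.tensor([[1.0, -0.5], [0.7, -0.2], [0.4, 0.1]])

# Q = L L^T + eps*I is positive definite, so the surrogate is convex and
# its argmin is available in closed form.
L = torch.eye(2, requires_grad=True)
W = torch.zeros(2, 1, requires_grad=True)
opt = torch.optim.Adam([L, W], lr=0.05)

for step in range(500):
    opt.zero_grad()
    Q = L @ L.T + 1e-3 * torch.eye(2)
    q = (W @ contexts.T).T                         # (instances, 2)
    x_surrogate = torch.linalg.solve(Q, -q.T).T    # closed-form minimizers
    # Decision prediction error: mismatch between the surrogate's optima
    # and the original problem's optima, averaged over instances.
    loss = ((x_surrogate - x_star_data) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()
```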

A Probabilistic Fluctuation based Membership Inference Attack for Generative Models

  • paper_url: http://arxiv.org/abs/2308.12143
  • repo_url: None
  • paper_authors: Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
  • for: This paper studies membership inference attacks (MIA), which identify whether a record was part of a machine learning model’s training set, focusing on generative models.
  • methods: The paper observes that existing MIAs for generative models mainly depend on overfitting of the target model, and instead exploits memorization, proposing PFAMI, a black-box MIA that infers membership by detecting probabilistic fluctuations around given records.
  • results: Extensive experiments across multiple generative models and datasets show that PFAMI improves the attack success rate (ASR) by about 27.9% compared with the best baseline.
    Abstract Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on the classic classification models have been well-studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on the overfitting in target models. However, overfitting can be avoided by employing various regularization techniques, whereas existing MIAs demonstrate poor performance in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models leads to an increasing trend in the probability distribution of generating records around the member record. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers memberships by detecting these trends via analyzing the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate PFAMI can improve the attack success rate (ASR) by about 27.9% when compared with the best baseline.
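The core scoring idea, that memorized member records sit near local peaks of the model's probability landscape, can be sketched as follows. This is a schematic reconstruction (not the authors' code), assuming black-box access to some per-record likelihood proxy `log_prob` (for instance, a reconstruction-based estimate for a diffusion model or VAE).

```python
import numpy as np

def fluctuation_membership_score(record: np.ndarray, log_prob,
                                 n_neighbors: int = 16, sigma: float = 0.05,
                                 rng=np.random.default_rng(0)) -> float:
    """Membership score from local probabilistic fluctuation.

    If the model memorized `record`, its likelihood proxy should exceed
    that of nearby perturbed variants, i.e. the record sits on a local
    peak of the probability landscape.
    """
    center = log_prob(record)
    neighbors = [log_prob(record + sigma * rng.standard_normal(record.shape))
                 for _ in range(n_neighbors)]
    # Positive score: the record is more probable than its neighborhood,
    # suggesting memorization and hence membership.
    return float(center - np.mean(neighbors))

# Membership is decided by thresholding the score, with the threshold
# picked on a calibration set of known non-members.
```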

Masking Strategies for Background Bias Removal in Computer Vision Models

  • paper_url: http://arxiv.org/abs/2308.12127
  • repo_url: https://github.com/ananthu-aniraj/masking_strategies_bias_removal
  • paper_authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
  • for: This paper investigates the impact of background-induced bias on fine-grained image classification, where class differences can be extremely subtle and samples per class tend to be few, and studies masking strategies to mitigate this bias.
  • methods: Two masking strategies are explored: early masking, which removes background information at the (input) image level, and late masking, which selectively masks high-level spatial features corresponding to the background.
  • results: Extensive experiments with CNN and ViT models show that both strategies improve out-of-distribution (OOD) performance, with early masking consistently exhibiting the best OOD performance; a ViT variant employing GAP-pooled patch-token-based classification combined with early masking achieves the highest OOD robustness.
    Abstract Models for fine-grained image classification tasks, where the difference between some classes can be extremely subtle and the number of samples per class tends to be low, are particularly prone to picking up background-related biases and demand robust methods to handle potential examples with out-of-distribution (OOD) backgrounds. To gain deeper insights into this critical problem, our research investigates the impact of background-induced bias on fine-grained image classification, evaluating standard backbone models such as Convolutional Neural Network (CNN) and Vision Transformers (ViT). We explore two masking strategies to mitigate background-induced bias: Early masking, which removes background information at the (input) image level, and late masking, which selectively masks high-level spatial features corresponding to the background. Extensive experiments assess the behavior of CNN and ViT models under different masking strategies, with a focus on their generalization to OOD backgrounds. The obtained findings demonstrate that both proposed strategies enhance OOD performance compared to the baseline models, with early masking consistently exhibiting the best OOD performance. Notably, a ViT variant employing GAP-Pooled Patch token-based classification combined with early masking achieves the highest OOD robustness.
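Early masking is conceptually simple: zero out the background before the network sees the image. A minimal sketch follows (illustrative; the paper's pipeline and the source of the masks may differ), assuming a binary foreground mask is available per image.

```python
import torch

def early_mask(images: torch.Tensor, fg_masks: torch.Tensor) -> torch.Tensor:
    """Remove background information at the input level.

    images:   (batch, 3, H, W) input batch.
    fg_masks: (batch, 1, H, W) binary masks, 1 = foreground object.
    """
    return images * fg_masks  # background pixels become zero

# Late masking would instead zero spatial positions of a high-level feature
# map, e.g. features * downsampled_mask after the backbone's last stage.
batch = torch.rand(8, 3, 224, 224)
masks = (torch.rand(8, 1, 224, 224) > 0.5).float()  # stand-in segmentation masks
masked_batch = early_mask(batch, masks)
```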

An Accelerated Block Proximal Framework with Adaptive Momentum for Nonconvex and Nonsmooth Optimization

  • paper_url: http://arxiv.org/abs/2308.12126
  • repo_url: None
  • paper_authors: Weifeng Yang, Wenwen Min
  • for: Solving nonconvex and nonsmooth optimization problems with block variables using adaptive momentum.
  • methods: An accelerated block proximal linear framework with adaptive momentum (ABPL$^+$) that enhances the comparison process between the proximal gradient step and the linear extrapolation step, and allows the update order of the variable blocks to be randomly shuffled in each cycle.
  • results: Under mild assumptions, the algorithm monotonically decreases the function value, converges globally with linear and sublinear rates, and the derivative set of the generated sequence is a critical point set; numerical experiments on $\ell_0$-norm non-negative matrix factorization and tensor decomposition confirm its effectiveness and efficiency.
    Abstract We propose an accelerated block proximal linear framework with adaptive momentum (ABPL$^+$) for nonconvex and nonsmooth optimization. We analyze the potential causes of the extrapolation step failing in some algorithms, and resolve this issue by enhancing the comparison process that evaluates the trade-off between the proximal gradient step and the linear extrapolation step in our algorithm. Furthermore, we extends our algorithm to any scenario involving updating block variables with positive integers, allowing each cycle to randomly shuffle the update order of the variable blocks. Additionally, under mild assumptions, we prove that ABPL$^+$ can monotonically decrease the function value without strictly restricting the extrapolation parameters and step size, demonstrates the viability and effectiveness of updating these blocks in a random order, and we also more obviously and intuitively demonstrate that the derivative set of the sequence generated by our algorithm is a critical point set. Moreover, we demonstrate the global convergence as well as the linear and sublinear convergence rates of our algorithm by utilizing the Kurdyka-Lojasiewicz (K{\L}) condition. To enhance the effectiveness and flexibility of our algorithm, we also expand the study to the imprecise version of our algorithm and construct an adaptive extrapolation parameter strategy, which improving its overall performance. We apply our algorithm to multiple non-negative matrix factorization with the $\ell_0$ norm, nonnegative tensor decomposition with the $\ell_0$ norm, and perform extensive numerical experiments to validate its effectiveness and efficiency.
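The "comparison process" between the proximal gradient step and the extrapolation step can be illustrated with a generic safeguarded accelerated proximal gradient sketch (this is not ABPL$^+$ itself; block shuffling and the adaptive momentum rule are omitted): the extrapolated candidate is accepted only when its objective is at least as good as the plain proximal step's, which keeps the objective from increasing whenever the step size satisfies the standard proximal-gradient descent condition.

```python
import numpy as np

def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def monotone_accel_prox(grad_f, f, lam: float, x0: np.ndarray, step: float,
                        n_iter: int = 200, momentum: float = 0.9) -> np.ndarray:
    """Minimize f(x) + lam*||x||_1 with a safeguarded extrapolation step."""
    x_prev = x0.copy()
    x = x0.copy()
    obj = lambda z: f(z) + lam * np.abs(z).sum()
    for _ in range(n_iter):
        y = x + momentum * (x - x_prev)                       # extrapolation
        x_extra = soft_threshold(y - step * grad_f(y), step * lam)
        x_plain = soft_threshold(x - step * grad_f(x), step * lam)
        x_prev = x
        # Comparison step: keep the extrapolated iterate only if it is at
        # least as good; otherwise fall back to the plain proximal step.
        x = x_extra if obj(x_extra) <= obj(x_plain) else x_plain
    return x
```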

An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

  • paper_url: http://arxiv.org/abs/2308.12120
  • repo_url: None
  • paper_authors: Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng
  • for: This paper enables design space exploration (DSE) for hardware accelerators of deep learning and non-DNN ML algorithms.
  • methods: A physical-design-driven, learning-based prediction framework that combines backend power, performance, and area (PPA) analysis with frontend performance simulation, yielding realistic estimates of both backend PPA and system metrics such as runtime and energy; a fully automated DSE technique then optimizes these metrics through an automated search of architectural and backend parameters.
  • results: Experimental studies show that the framework predicts backend PPA and system metrics with an average prediction error of 7% or less for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.
    Abstract Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines backend power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.

Less is More – Towards parsimonious multi-task models using structured sparsity

  • paper_url: http://arxiv.org/abs/2308.12114
  • repo_url: None
  • paper_authors: Richa Upadhyay, Ronald Phlypo, Rajkumar Saini, Marcus Liwicki
  • for: This paper aims to incorporate structured group sparsity into a Multi-Task Learning (MTL) framework to develop parsimonious models that can effectively address multiple tasks with fewer parameters while maintaining comparable or superior performance to a dense model.
  • methods: The paper uses channel-wise l1/l2 group sparsity in the shared layers of a Convolutional Neural Network (CNN) to facilitate the elimination of extraneous groups (channels) and impose a penalty on the weights, thereby enhancing the learning of all tasks.
  • results: The paper compares the outcomes of single-task and multi-task experiments under group sparsity on two publicly available MTL datasets, NYU-v2 and CelebAMask-HQ, and investigates how changing the sparsification degree impacts both the performance of the model and the sparsity of groups.
    Abstract Group sparsity in Machine Learning (ML) encourages simpler, more interpretable models with fewer active parameter groups. This work aims to incorporate structured group sparsity into the shared parameters of a Multi-Task Learning (MTL) framework, to develop parsimonious models that can effectively address multiple tasks with fewer parameters while maintaining comparable or superior performance to a dense model. Sparsifying the model during training helps decrease the model's memory footprint, computation requirements, and prediction time during inference. We use channel-wise l1/l2 group sparsity in the shared layers of the Convolutional Neural Network (CNN). This approach not only facilitates the elimination of extraneous groups (channels) but also imposes a penalty on the weights, thereby enhancing the learning of all tasks. We compare the outcomes of single-task and multi-task experiments under group sparsity on two publicly available MTL datasets, NYU-v2 and CelebAMask-HQ. We also investigate how changing the sparsification degree impacts both the performance of the model and the sparsity of groups.
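Channel-wise l1/l2 group sparsity has a compact formulation: an l2 norm over each channel's weight group, summed (l1) across channels, drives whole channels to zero. A minimal PyTorch sketch (illustrative, not the paper's code; the layer sizes and penalty weight are arbitrary):

```python
import torch
import torch.nn as nn

def channel_group_sparsity(conv: nn.Conv2d) -> torch.Tensor:
    """l1/l2 group-lasso penalty over the output channels of a conv layer.

    Each output channel's weights (a slice of shape (in_ch, kH, kW)) form
    one group; the l2 norm within a group and the l1 sum across groups
    push entire channels to exactly zero.
    """
    w = conv.weight                                 # (out_ch, in_ch, kH, kW)
    return w.flatten(start_dim=1).norm(dim=1).sum()

shared = nn.Conv2d(64, 128, kernel_size=3, padding=1)  # a shared MTL layer
task_loss = torch.tensor(0.0)   # placeholder for the sum of per-task losses
lam = 1e-4
total_loss = task_loss + lam * channel_group_sparsity(shared)
# After training, channels whose group norm is ~0 can be pruned, shrinking
# the shared backbone for all tasks at once.
```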

Generalized Continual Category Discovery

  • paper_url: http://arxiv.org/abs/2308.12112
  • repo_url: None
  • paper_authors: Daniel Marczak, Grzegorz Rypeść, Sebastian Cygert, Tomasz Trzciński, Bartłomiej Twardowski
  • for: This work studies continual learning (CL) in a setting closer to real life, where a learning agent has access to a vast amount of unlabeled data encompassing both novel (entirely unlabeled) classes and examples from known classes.
  • methods: Drawing inspiration from Generalized Category Discovery (GCD), the paper introduces Generalized Continual Category Discovery (GCCD), a framework that allows novel and known classes in any task and uses continual versions of unsupervised learning methods to discover them; the proposed method incorporates both supervised and unsupervised signals and mitigates forgetting through centroid adaptation.
  • results: Experiments show that existing CL methods fail to accumulate knowledge from subsequent tasks containing unlabeled samples of novel classes, whereas the proposed method surpasses strong CL methods adapted for GCD techniques and achieves superior representation learning performance.
    Abstract Most of Continual Learning (CL) methods push the limit of supervised learning settings, where an agent is expected to learn new labeled tasks and not forget previous knowledge. However, these settings are not well aligned with real-life scenarios, where a learning agent has access to a vast amount of unlabeled data encompassing both novel (entirely unlabeled) classes and examples from known classes. Drawing inspiration from Generalized Category Discovery (GCD), we introduce a novel framework that relaxes this assumption. Precisely, in any task, we allow for the existence of novel and known classes, and one must use continual version of unsupervised learning methods to discover them. We call this setting Generalized Continual Category Discovery (GCCD). It unifies CL and GCD, bridging the gap between synthetic benchmarks and real-life scenarios. With a series of experiments, we present that existing methods fail to accumulate knowledge from subsequent tasks in which unlabeled samples of novel classes are present. In light of these limitations, we propose a method that incorporates both supervised and unsupervised signals and mitigates the forgetting through the use of centroid adaptation. Our method surpasses strong CL methods adopted for GCD techniques and presents a superior representation learning performance.

Constrained Stein Variational Trajectory Optimization

  • paper_url: http://arxiv.org/abs/2308.12110
  • repo_url: None
  • paper_authors: Thomas Power, Dmitry Berenson
  • for: This paper presents Constrained Stein Variational Trajectory Optimization (CSVTO), an algorithm for performing trajectory optimization with constraints on a set of trajectories in parallel.
  • methods: The algorithm frames constrained trajectory optimization as constrained functional minimization over trajectory distributions and uses Stein Variational Gradient Descent (SVGD) to find a set of particles that approximates a distribution over low-cost trajectories while obeying constraints, with a novel particle resampling step to escape local minima.
  • results: CSVTO outperforms baselines on challenging, highly constrained tasks such as a 7-DoF wrench manipulation task, where it succeeds in 20/20 trials versus 13/20 for the closest baseline; generating diverse constraint-satisfying trajectories improves robustness to disturbances and initialization.
    Abstract We present Constrained Stein Variational Trajectory Optimization (CSVTO), an algorithm for performing trajectory optimization with constraints on a set of trajectories in parallel. We frame constrained trajectory optimization as a novel form of constrained functional minimization over trajectory distributions, which avoids treating the constraints as a penalty in the objective and allows us to generate diverse sets of constraint-satisfying trajectories. Our method uses Stein Variational Gradient Descent (SVGD) to find a set of particles that approximates a distribution over low-cost trajectories while obeying constraints. CSVTO is applicable to problems with arbitrary equality and inequality constraints and includes a novel particle resampling step to escape local minima. By explicitly generating diverse sets of trajectories, CSVTO is better able to avoid poor local minima and is more robust to initialization. We demonstrate that CSVTO outperforms baselines in challenging highly-constrained tasks, such as a 7DoF wrench manipulation task, where CSVTO succeeds in 20/20 trials vs 13/20 for the closest baseline. Our results demonstrate that generating diverse constraint-satisfying trajectories improves robustness to disturbances and initialization over baselines.
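For readers unfamiliar with SVGD, the update that moves a set of particles toward a target distribution (here, a distribution over low-cost trajectories) looks as follows. This is the standard SVGD rule with an RBF kernel and a simple median bandwidth heuristic, a generic sketch without CSVTO's constraint handling or resampling step.

```python
import numpy as np

def svgd_step(particles: np.ndarray, grad_logp, step: float = 1e-2) -> np.ndarray:
    """One SVGD update: particles follow the kernelized Stein gradient.

    particles: (n, d) set of candidate trajectories (flattened).
    grad_logp: function (n, d) -> (n, d), gradient of the log target
               density, e.g. the negative cost gradient of each trajectory.
    """
    n = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]    # (n, n, d)
    sq = (diff ** 2).sum(-1)                                # (n, n)
    h = np.median(sq) + 1e-8                                # bandwidth heuristic
    K = np.exp(-sq / h)
    # gradK[i, j] is the gradient of k(x_j, x_i) w.r.t. x_j; summing over j
    # gives the repulsive force that keeps the particle set diverse.
    gradK = 2.0 / h * diff * K[:, :, None]
    phi = (K @ grad_logp(particles) + gradK.sum(axis=1)) / n
    return particles + step * phi
```

The first term pulls every particle toward high-probability (low-cost) regions, weighted by the kernel; the second term pushes particles apart, which is what produces the diverse trajectory sets the paper relies on.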

Quantifying degeneracy in singular models via the learning coefficient

  • paper_url: http://arxiv.org/abs/2308.12108
  • repo_url: https://github.com/edmundlth/scalable_learning_coefficient_with_sgld
  • paper_authors: Edmund Lau, Daniel Murfet, Susan Wei
  • for: This paper studies how to quantify degeneracy in deep neural networks (DNNs), which are singular statistical models.
  • methods: The paper uses the learning coefficient from singular learning theory to quantify the degree of degeneracy in DNNs, and proposes a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics (SGLD).
  • results: The paper shows that the learning coefficient captures degeneracy in a way that cannot be accounted for by simply counting "flat" directions, validates the approximation in low-dimensional models with known theoretical values, and shows that the local learning coefficient correctly recovers the ordering of degeneracy between parameter regions; an MNIST experiment shows it can reveal the inductive bias of stochastic optimizers toward more or less degenerate critical points.
    Abstract Deep neural networks (DNN) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the \emph{learning coefficient} introduced in singular learning theory quantifies precisely the degree of degeneracy in deep neural networks. Importantly, we will demonstrate that degeneracy in DNN cannot be accounted for by simply counting the number of "flat" directions. We propose a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics. To validate our approach, we demonstrate its accuracy in low-dimensional models with known theoretical values. Importantly, the local learning coefficient can correctly recover the ordering of degeneracy between various parameter regions of interest. An experiment on MNIST shows the local learning coefficient can reveal the inductive bias of stochastic opitmizers for more or less degenerate critical points.
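One way to approximate a localized learning coefficient is a WBIC-style estimator: sample weights near a trained optimum w* with SGLD at inverse temperature β = 1/log n, then read the coefficient off the gap between the tempered expected loss and the loss at w*. The sketch below is a heavily simplified illustration of that recipe (assumptions: a generic differentiable `loss_fn` over the full dataset of size n, a quadratic localizing term that keeps the chain near w*, and arbitrarily chosen hyperparameters), not the linked repository's implementation.

```python
import math
import torch

def local_learning_coefficient(model_params: torch.Tensor, loss_fn, n: int,
                               steps: int = 2000, eps: float = 1e-5,
                               gamma: float = 100.0) -> float:
    """WBIC-style estimate: lambda ~ (E_beta[n L(w)] - n L(w*)) / log n."""
    beta = 1.0 / math.log(n)
    w_star = model_params.detach().clone()
    w = w_star.clone().requires_grad_(True)
    samples = []
    for _ in range(steps):
        loss = loss_fn(w)
        # Tempered posterior potential plus a localizing term that keeps
        # the chain in the neighborhood of w*.
        potential = n * beta * loss + gamma * ((w - w_star) ** 2).sum() / 2
        grad, = torch.autograd.grad(potential, w)
        with torch.no_grad():
            w -= eps / 2 * grad                        # SGLD drift
            w += torch.randn_like(w) * eps ** 0.5      # SGLD noise
            samples.append(loss_fn(w).item())
    tail = samples[steps // 2:]                        # discard burn-in
    expected_nL = n * sum(tail) / len(tail)
    return (expected_nL - n * loss_fn(w_star).item()) / math.log(n)
```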

Cached Operator Reordering: A Unified View for Fast GNN Training

  • paper_url: http://arxiv.org/abs/2308.12093
  • repo_url: None
  • paper_authors: Julia Bazinska, Andrei Ivanov, Tal Ben-Nun, Nikoli Dryden, Maciej Besta, Siyuan Shen, Torsten Hoefler
  • for: This paper addresses performance optimization for Graph Neural Network (GNN) training, where the sparse nature of GNN computation poses challenges beyond those of traditional deep neural networks.
  • methods: The paper provides a unified view of GNN computation, I/O, and memory, analyzes the computational graphs of the Graph Convolutional Network (GCN) and Graph Attention (GAT) layers, and proposes alternative computation strategies, including adaptive operator reordering with caching.
  • results: Adaptive operator reordering with caching achieves a speedup of up to 2.43x for GCN over the state of the art, and exploring different caching schemes for GAT yields a speedup of up to 1.94x; the optimizations save memory, are easily implemented across hardware platforms, and can alleviate performance bottlenecks when training large-scale GNN models.
    Abstract Graph Neural Networks (GNNs) are a powerful tool for handling structured graph data and addressing tasks such as node classification, graph classification, and clustering. However, the sparse nature of GNN computation poses new challenges for performance optimization compared to traditional deep neural networks. We address these challenges by providing a unified view of GNN computation, I/O, and memory. By analyzing the computational graphs of the Graph Convolutional Network (GCN) and Graph Attention (GAT) layers -- two widely used GNN layers -- we propose alternative computation strategies. We present adaptive operator reordering with caching, which achieves a speedup of up to 2.43x for GCN compared to the current state-of-the-art. Furthermore, an exploration of different caching schemes for GAT yields a speedup of up to 1.94x. The proposed optimizations save memory, are easily implemented across various hardware platforms, and have the potential to alleviate performance bottlenecks in training large-scale GNN models.
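A simple instance of operator reordering in a GCN layer: the product A·X·W can be computed as (A·X)·W or A·(X·W), and when W shrinks the feature dimension, multiplying by W first makes the expensive sparse product cheaper. The sketch below illustrates this idea generically (it is not the paper's adaptive policy, which also decides what intermediates to cache across the forward and backward passes):

```python
import torch

def gcn_layer(A: torch.Tensor, X: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One GCN propagation A @ X @ W, with the cheaper ordering chosen.

    A: (n, n) sparse normalized adjacency, X: (n, f_in), W: (f_in, f_out).
    """
    f_in = X.shape[1]
    f_out = W.shape[1]
    if f_out < f_in:
        # Shrink features first: the sparse matmul then operates on the
        # smaller (n, f_out) matrix.
        return torch.sparse.mm(A, X @ W)
    # Otherwise aggregate first, then project.
    return torch.sparse.mm(A, X) @ W

A = torch.eye(4).to_sparse()            # stand-in adjacency
X = torch.rand(4, 64)
W = torch.rand(64, 16)                  # f_out < f_in, so reordering pays off
out = gcn_layer(A, X, W)
```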

Stabilizing RNN Gradients through Pre-training

  • paper_url: http://arxiv.org/abs/2308.12075
  • repo_url: None
  • paper_authors: Luca Herranz-Celotti, Jean Rouat
  • for: Preventing the gradient variance from growing exponentially with depth or time, to stabilize and improve training.
  • methods: Pre-training the network to local stability whenever the architecture is too complex for an analytical initialization, and extending known stability theories to a broader family of deep recurrent networks under minimal assumptions on data and parameter distribution, a theory the authors call the Local Stability Condition (LSC).
  • results: Pre-training feed-forward and recurrent networks to fulfill the LSC often improves final performance across models, providing a way to stabilize networks of any complexity, e.g. as an additional step before pre-training on large augmented datasets.
    Abstract Numerous theories of learning suggest to prevent the gradient variance from exponential growth with depth or time, to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, that consists on giving a weight of a half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.

Identifying Reaction-Aware Driving Styles of Stochastic Model Predictive Controlled Vehicles by Inverse Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12069
  • repo_url: None
  • paper_authors: Ni Dang, Tao Shi, Zengjie Zhang, Wanxin Jin, Marion Leibold, Martin Buss
  • for: This paper identifies the driving styles of autonomous vehicles (AVs), describing a driving style as a cost function over a series of weighted features and identifying it with a Maximum Entropy Inverse Reinforcement Learning (ME-IRL) method.
  • methods: The paper designs additional novel features to capture an AV’s reaction-aware characteristics, an important indicator that previous ME-IRL feature designs did not fully incorporate, and applies a modified ME-IRL method with these features.
  • results: The driving styles are successfully identified from demonstration trajectories generated by Stochastic Model Predictive Control (SMPC), validated in MATLAB simulation and an off-the-shelf experiment.
    Abstract The driving style of an Autonomous Vehicle (AV) refers to how it behaves and interacts with other AVs. In a multi-vehicle autonomous driving system, an AV capable of identifying the driving styles of its nearby AVs can reliably evaluate the risk of collisions and make more reasonable driving decisions. However, there has not been a consistent definition of driving styles for an AV in the literature, although it is considered that the driving style is encoded in the AV's trajectories and can be identified using Maximum Entropy Inverse Reinforcement Learning (ME-IRL) methods as a cost function. Nevertheless, an important indicator of the driving style, i.e., how an AV reacts to its nearby AVs, is not fully incorporated in the feature design of previous ME-IRL methods. In this paper, we describe the driving style as a cost function of a series of weighted features. We design additional novel features to capture the AV's reaction-aware characteristics. Then, we identify the driving styles from the demonstration trajectories generated by the Stochastic Model Predictive Control (SMPC) using a modified ME-IRL method with our newly proposed features. The proposed method is validated using MATLAB simulation and an off-the-shelf experiment.
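At its core, ME-IRL fits the weights of a linear cost over features by matching feature expectations. A generic sketch of the weight update follows (not the paper's modified variant; `expected_features` stands in for the feature expectations of a soft-optimal policy under the current weights, e.g. obtained by rolling out the SMPC model):

```python
import numpy as np

def me_irl_update(theta: np.ndarray, demo_features: np.ndarray,
                  expected_features, lr: float = 0.1) -> np.ndarray:
    """One maximum-entropy IRL gradient step on the feature weights theta.

    demo_features:     mean feature vector of the demonstration trajectories.
    expected_features: function theta -> mean feature vector of trajectories
                       sampled from the soft-optimal policy under theta.
    """
    # Gradient of the max-entropy log-likelihood: the demonstrations'
    # feature expectations minus the current policy's feature expectations.
    grad = demo_features - expected_features(theta)
    return theta + lr * grad

# Iterating until the two expectations match yields weights whose induced
# behavior reproduces the demonstrated driving style.
```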

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

  • paper_url: http://arxiv.org/abs/2308.12067
  • repo_url: None
  • paper_authors: Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun
  • for: This paper aims to improve the instruction-following capabilities of multimodal large language models by fine-tuning them on high-quality, limited data.
  • methods: The authors use a two-stage training process, pre-training on image-text pairs and fine-tuning on supervised vision-language instruction data, and propose several metrics to evaluate the quality of multimodal instruction data.
  • results: The fine-tuned model, InstructionGPT-4, outperforms the original MiniGPT-4 on various evaluations (e.g., visual question answering, GPT-4 preference), demonstrating that less but high-quality instruction tuning data can efficiently enable multimodal large language models to generate better output.
    Abstract Multimodal large language models acquire their instruction-following capabilities through a two-stage training process: pre-training on image-text pairs and fine-tuning on supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. We first propose several metrics to access the quality of multimodal instruction data. Based on these metrics, we present a simple and effective data selector to automatically identify and filter low-quality vision-language data. By employing this method, InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations (e.g., visual question answering, GPT-4 preference). Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient to enable multimodal large language models to generate better output.
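The data selector reduces to scoring each instruction example on several quality metrics and keeping the best-scoring subset. A schematic sketch (the metric functions here are hypothetical placeholders; the paper defines its own set of multimodal quality metrics):

```python
import numpy as np

def select_instruction_data(examples: list, metrics: list, keep: int = 200) -> list:
    """Rank multimodal instruction examples by combined quality, keep top-k.

    examples: list of (image, instruction, response) tuples.
    metrics:  list of functions example -> float, each a quality score
              (e.g. image-text relevance, response fluency proxies).
    """
    scores = np.array([[m(ex) for m in metrics] for ex in examples])
    # Normalize each metric to zero mean / unit variance so no single
    # metric dominates, then combine by simple averaging.
    scores = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
    combined = scores.mean(axis=1)
    top = np.argsort(combined)[::-1][:keep]
    return [examples[i] for i in top]

# With keep=200 this mirrors the paper's setting of fine-tuning on roughly
# 6% of the original instruction-following data.
```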

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

  • paper_url: http://arxiv.org/abs/2308.12066
  • repo_url: None
  • paper_authors: Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang, Minsoo Rhu
  • for: This paper tackles the computational and memory requirements of transformer-based large language models (LLMs), which pose unprecedented challenges despite these models’ algorithmic performance.
  • methods: Building on the Mixture-of-Experts (MoE) architecture, which scales model size without proportionally scaling compute, the paper proposes Pre-gated MoE, an algorithm-system co-design featuring a novel pre-gating function that alleviates the dynamic nature of sparse expert activation.
  • results: Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality, allowing large-scale LLMs to be deployed cost-effectively on just a single GPU with high performance.
    Abstract Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.
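The key systems idea is that if the experts needed by block l+1 are decided one block early, their parameters can be fetched from CPU memory while block l is still computing. A schematic sketch of that overlap follows (an interpretation of the described design, not the authors' system; `prefetch_async` and `expert_pool` are hypothetical placeholders, and token-level routing is collapsed to one gate per sequence for brevity):

```python
import torch

class PreGatedMoEBlock(torch.nn.Module):
    """Sketch of an MoE block whose gate picks experts for the NEXT block.

    Computing block l+1's expert selection inside block l lets the runtime
    prefetch those experts' weights (e.g. from CPU memory) while block l
    is still computing, hiding the transfer latency that makes offloaded
    MoE inference slow.
    """

    def __init__(self, dim: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.pre_gate = torch.nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x, current_expert_ids, expert_pool, prefetch_async):
        # 1. Decide now which experts the next block will need, and start
        #    copying them to the GPU on a side stream.
        next_ids = self.pre_gate(x.mean(dim=1)).topk(self.top_k, dim=-1).indices
        prefetch_async(next_ids)
        # 2. Run this block with the experts chosen one block earlier;
        #    they are already GPU-resident thanks to the previous prefetch.
        outputs = [expert_pool[int(i)](x) for i in current_expert_ids.unique()]
        return torch.stack(outputs).mean(dim=0), next_ids
```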

Ensembling Uncertainty Measures to Improve Safety of Black-Box Classifiers

  • paper_url: http://arxiv.org/abs/2308.12065
  • repo_url: None
  • paper_authors: Tommaso Zoppi, Andrea Ceccarelli, Andrea Bondavalli
  • for: Improving the safety of systems built around machine learning classifiers, whose misclassifications may have cascading effects and possibly result in critical failures.
  • methods: SPROUT, a Safety wraPper thROugh ensembles of UncertainTy measures, suspects misclassifications by computing uncertainty measures on the inputs and outputs of a black-box classifier; if a misclassification is detected, it blocks the propagation of the classifier’s output to the encompassing system.
  • results: SPROUT transforms erratic outputs (misclassifications) into data omission failures, which can easily be managed at the system level; it always identifies a large fraction of the misclassifications of supervised classifiers and detects all misclassifications in specific cases.
    Abstract Machine Learning (ML) algorithms that perform classification may predict the wrong class, experiencing misclassifications. It is well-known that misclassifications may have cascading effects on the encompassing system, possibly resulting in critical failures. This paper proposes SPROUT, a Safety wraPper thROugh ensembles of UncertainTy measures, which suspects misclassifications by computing uncertainty measures on the inputs and outputs of a black-box classifier. If a misclassification is detected, SPROUT blocks the propagation of the output of the classifier to the encompassing system. The resulting impact on safety is that SPROUT transforms erratic outputs (misclassifications) into data omission failures, which can be easily managed at the system level. SPROUT has a broad range of applications as it fits binary and multi-class classification, comprising image and tabular datasets. We experimentally show that SPROUT always identifies a huge fraction of the misclassifications of supervised classifiers, and it is able to detect all misclassifications in specific cases. SPROUT implementation contains pre-trained wrappers, it is publicly available and ready to be deployed with minimal effort.

HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing

  • paper_url: http://arxiv.org/abs/2308.12061
  • repo_url: None
  • paper_authors: Jonathan Xu, Amna Elmustafa, Liya Weldegebriel, Emnet Negash, Richard Lee, Chenlin Meng, Stefano Ermon, David Lobell
  • For: The paper aims to improve the accuracy of cropland mapping in smallholder farming regions, specifically in sub-Saharan Africa.
  • Methods: The authors use a new approach based on the detection of harvest piles, which are characteristic of many smallholder systems. They use expert knowledge and satellite images to collect a dataset called HarvestNet, which includes 7,000 hand-labeled images and 2,000 ground-collected labels, and they benchmark a set of baseline models, including state-of-the-art remote sensing models.
  • Results: The authors achieve high accuracy on the hand-labeled data (around 80%) and on ground truth data (90% and 98% accuracy in Tigray and Amhara, respectively). They also visually compare their model with a pre-existing coverage map, showing that it detects an additional 56,621 hectares of cropland in Tigray, and conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food-insecure regions.
    Abstract Small farms contribute to a large share of the productive land in developing countries. In regions such as sub-Saharan Africa, where 80% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity. However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping. Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world. We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totaling 7k hand-labeled images and 2k ground-collected labels. We also benchmark a set of baselines including SOTA models in remote sensing, with our best models having around 80% classification performance on hand-labeled data and 90% and 98% accuracy on ground truth data for Tigray and Amhara, respectively. We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray. We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food-insecure regions.

Manipulating Embeddings of Stable Diffusion Prompts

  • paper_url: http://arxiv.org/abs/2308.12059
  • repo_url: https://github.com/webis-de/arxiv23-prompt-embedding-manipulation
  • paper_authors: Niklas Deckers, Julia Peters, Martin Potthast
  • for: This paper focuses on improving the process of generating images using text-to-image models, specifically by changing the embedding of a prompt instead of the prompt text.
  • methods: The proposed method treats the generative text-to-image model as a continuous function and passes gradients between the image space and the prompt embedding space, allowing for more fine-grained and targeted control of the generated images based on user intentions.
  • results: The authors demonstrate the feasibility of the proposed methods through experiments in three scenarios: optimization of a metric defined in image space, assistance of users in creative tasks, and inclusion of information in the prompt that the user has seen but finds difficult to describe.
    Abstract Generative text-to-image models such as Stable Diffusion allow users to generate images based on a textual description, the prompt. Changing the prompt is still the primary means for the user to change a generated image as desired. However, changing the image by reformulating the prompt remains a difficult process of trial and error, which has led to the emergence of prompt engineering as a new field of research. We propose and analyze methods to change the embedding of a prompt directly instead of the prompt text. It allows for more fine-grained and targeted control that takes into account user intentions. Our approach treats the generative text-to-image model as a continuous function and passes gradients between the image space and the prompt embedding space. By addressing different user interaction problems, we can apply this idea in three scenarios: (1) Optimization of a metric defined in image space that could measure, for example, image style. (2) Assistance of users in creative tasks by enabling them to navigate the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that the user has seen in a particular seed but finds difficult to describe in the prompt. Our experiments demonstrate the feasibility of the described methods.
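A toy sketch of the core mechanism, with a small differentiable network standing in for the frozen text-to-image generator and a placeholder image-space metric (both assumptions); gradients flow from the image space back into the prompt embedding, which is optimized directly:

```python
import torch

torch.manual_seed(0)
d_embed, d_image = 32, 64

# Stand-in for a frozen, differentiable text-to-image generator; in the
# paper this would be the diffusion model, which we do not reproduce here.
G = torch.nn.Sequential(torch.nn.Linear(d_embed, 128), torch.nn.Tanh(),
                        torch.nn.Linear(128, d_image))
for p in G.parameters():
    p.requires_grad_(False)

def image_metric(img):
    # Placeholder for a metric defined in image space (e.g., a style score).
    return ((img - 0.5) ** 2).mean()

# Start from the embedding of some prompt and optimize it directly,
# rather than reformulating the prompt text.
emb = torch.randn(d_embed, requires_grad=True)
opt = torch.optim.Adam([emb], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = image_metric(G(emb))  # gradients flow image space -> embedding space
    loss.backward()
    opt.step()
print(f"final image-space loss: {loss.item():.4f}")
```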

Sample Complexity of Robust Learning against Evasion Attacks

  • paper_url: http://arxiv.org/abs/2308.12054
  • repo_url: None
  • paper_authors: Pascale Gourdeau
  • For: The paper is written to study the feasibility of adversarially robust learning from the perspective of learning theory, specifically considering sample complexity.
  • Methods: The paper uses the exact-in-the-ball notion of robustness and explores the setting where the learner has access to random examples only, as well as learning models where the learner is given more power through local membership queries and local equivalence query oracles.
  • Results: The paper shows that robustly learning monotone conjunctions has sample complexity at least exponential in the adversary's budget, but if the adversary is restricted to perturbing $O(\log n)$ bits, then one can robustly learn conjunctions and decision lists w.r.t. log-Lipschitz distributions. Additionally, the paper develops robust empirical risk minimization algorithms in the distribution-free setting with general query complexity upper and lower bounds, as well as for concrete concept classes.
    Abstract It is becoming increasingly important to understand the vulnerability of machine learning models to adversarial attacks. One of the fundamental problems in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks, where data is corrupted at test time. In this thesis, we work with the exact-in-the-ball notion of robustness and study the feasibility of adversarially robust learning from the perspective of learning theory, considering sample complexity. We first explore the setting where the learner has access to random examples only, and show that distributional assumptions are essential. We then focus on learning problems with distributions on the input data that satisfy a Lipschitz condition and show that robustly learning monotone conjunctions has sample complexity at least exponential in the adversary's budget (the maximum number of bits it can perturb on each input). However, if the adversary is restricted to perturbing $O(\log n)$ bits, then one can robustly learn conjunctions and decision lists w.r.t. log-Lipschitz distributions. We then study learning models where the learner is given more power. We first consider local membership queries, where the learner can query the label of points near the training sample. We show that, under the uniform distribution, the exponential dependence on the adversary's budget to robustly learn conjunctions remains inevitable. We then introduce a local equivalence query oracle, which returns whether the hypothesis and target concept agree in a given region around a point in the training sample, and a counterexample if it exists. We show that if the query radius is equal to the adversary's budget, we can develop robust empirical risk minimization algorithms in the distribution-free setting. We give general query complexity upper and lower bounds, as well as for concrete concept classes.
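A brute-force illustration of the exact-in-the-ball notion on a toy monotone conjunction (the instance and budget are hypothetical): the hypothesis is robustly correct at a point only if it outputs the point's true label everywhere within the adversary's Hamming-ball budget:

```python
from itertools import combinations

def conjunction(relevant, x):
    # Monotone conjunction: true iff every relevant bit of x is 1.
    return all(x[i] for i in relevant)

def robust_at(hypothesis, target, x, budget):
    """Exact-in-the-ball robustness at x: the hypothesis must output the
    *true* label of x on every point within Hamming distance <= budget."""
    n, true_label = len(x), target(x)
    for k in range(budget + 1):
        for flips in combinations(range(n), k):
            z = list(x)
            for i in flips:
                z[i] ^= 1
            if hypothesis(z) != true_label:
                return False
    return True

# Toy instance: the target conjunction is x0 AND x1 over 6 bits.
target = lambda x: conjunction([0, 1], x)
hyp = lambda x: conjunction([0], x)   # an over-general hypothesis
x = [0, 0, 0, 0, 0, 0]                # a negative point far from the boundary
print(robust_at(target, target, x, budget=1))  # True: target is robust here
print(robust_at(hyp, target, x, budget=1))     # False: flipping x0 fools hyp
```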

Layer-wise Feedback Propagation

  • paper_url: http://arxiv.org/abs/2308.12053
  • repo_url: None
  • paper_authors: Leander Weber, Jim Berend, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
  • for: An explanation-based training approach for neural-network-like predictors that uses Layer-wise Relevance Propagation (LRP) to assess the contribution of individual connections.
  • methods: Uses LRP to assign rewards to individual connections without computing gradients; LFP distributes the reward signal throughout the model, strengthening structures that receive positive feedback and reducing the influence of structures that receive negative feedback.
  • results: Establishes the convergence of LFP theoretically and empirically and shows performance comparable to gradient descent on various models and datasets; LFP also overcomes some limitations of gradient-based methods, such as the reliance on meaningful derivatives.
    Abstract In this paper, we present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors that utilizes explainability, specifically Layer-wise Relevance Propagation (LRP), to assign rewards to individual connections based on their respective contributions to solving a given task. This differs from traditional gradient descent, which updates parameters towards an estimated loss minimum. LFP distributes a reward signal throughout the model without the need for gradient computations. It then strengthens structures that receive positive feedback while reducing the influence of structures that receive negative feedback. We establish the convergence of LFP theoretically and empirically, and demonstrate its effectiveness in achieving comparable performance to gradient descent on various models and datasets. Notably, LFP overcomes certain limitations associated with gradient-based methods, such as reliance on meaningful derivatives. We further investigate how the different LRP-rules can be extended to LFP, what their effects are on training, as well as potential applications, such as training models with no meaningful derivatives, e.g., step-function activated Spiking Neural Networks (SNNs), or for transfer learning, to efficiently utilize existing knowledge.
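A loose sketch of the idea as we read it: relevance is propagated LRP-style to individual connections and used as a feedback signal instead of a gradient. The single-layer setup, the reward design, and the update rule below are all guesses for illustration, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_in, n_out))
x = rng.random(n_in)

# Forward pass through a single linear layer (toy stand-in for a network).
z = x @ W
pred = int(z.argmax())

# Reward at the output: +1 for the chosen unit if correct, -1 otherwise.
# (The paper's reward design is richer; this is a stand-in.)
target = 1
reward = np.zeros(n_out)
reward[pred] = 1.0 if pred == target else -1.0

# LRP-epsilon-style decomposition: split each output unit's reward over the
# connections feeding it, in proportion to their contribution x_i * w_ij.
eps = 1e-6
contrib = x[:, None] * W                    # z_ij, shape (n_in, n_out)
col_sums = contrib.sum(axis=0)
denom = col_sums + eps * np.sign(col_sums + eps)
R_conn = contrib / denom * reward[None, :]  # per-connection feedback

# Guessed update rule: strengthen connections with positive feedback,
# weaken those with negative feedback -- no gradients involved.
lr = 0.1
W += lr * R_conn
```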

A multiobjective continuation method to compute the regularization path of deep neural networks

  • paper_url: http://arxiv.org/abs/2308.12044
  • repo_url: https://github.com/aamakor/continuation-method
  • paper_authors: Augustina C. Amakor, Konstantin Sonntag, Sebastian Peitz
  • for: Promoting sparsity in deep neural networks (DNNs) to improve numerical efficiency, interpretability, and robustness.
  • methods: Treats the empirical loss and sparsity (the $\ell^1$ norm) as two conflicting objectives and solves the resulting multiobjective optimization problem, extending the concept of regularization paths to DNNs.
  • results: Proposes an algorithm that efficiently approximates the entire Pareto front and shows, through numerical examples, that knowledge of the regularization path allows for a well-generalizing network parametrization.
    Abstract Sparsity is a highly desired feature in deep neural networks (DNNs) since it ensures numerical efficiency, improves the interpretability of models (due to the smaller number of relevant features), and robustness. In machine learning approaches based on linear models, it is well known that there exists a connecting path between the sparsest solution in terms of the $\ell^1$ norm (i.e., zero weights) and the non-regularized solution, which is called the regularization path. Very recently, there was a first attempt to extend the concept of regularization paths to DNNs by means of treating the empirical loss and sparsity ($\ell^1$ norm) as two conflicting criteria and solving the resulting multiobjective optimization problem. However, due to the non-smoothness of the $\ell^1$ norm and the high number of parameters, this approach is not very efficient from a computational perspective. To overcome this limitation, we present an algorithm that allows for the approximation of the entire Pareto front for the above-mentioned objectives in a very efficient manner. We present numerical examples using both deterministic and stochastic gradients. We furthermore demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization.
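As a point of reference, the classical $\ell^1$ regularization path can be traced for a small linear model by scanning the trade-off weight with proximal gradient descent (ISTA); this weighted-sum scan is a crude stand-in for the paper's continuation method on DNNs, not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

def ista(lam, steps=500, lr=0.05):
    """Proximal gradient for 0.5 * ||Xw - y||^2 / n + lam * ||w||_1."""
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n
        w = w - lr * g
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft threshold
    return w

# Scan lambda to trace (loss, sparsity) trade-off points: a crude
# approximation of the Pareto front between the two objectives.
for lam in [0.0, 0.01, 0.1, 0.5, 1.0]:
    w = ista(lam)
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    nnz = int((np.abs(w) > 1e-6).sum())
    print(f"lam={lam:<5} loss={loss:.4f}  nonzeros={nnz}")
```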

IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning

  • paper_url: http://arxiv.org/abs/2308.12043
  • repo_url: https://github.com/feiyuzhang98/increlora
  • paper_authors: Feiyu Zhang, Liangzhi Li, Junhao Chen, Zhouqiang Jiang, Bowen Wang, Yiming Qian
  • for: Improving the efficiency of fine-tuning large pre-trained language models (PLMs), especially when a large number of downstream tasks makes full fine-tuning prohibitively expensive in training and storage.
  • methods: Builds on Low-Rank Adaptation (LoRA), which injects adaptation matrices into every target module but ignores the varying importance of parameters across modules. Rather than pruning, IncreLoRA incrementally adds trainable parameters during training based on the importance score of each module, so it is not limited by the initial number of training parameters and each parameter matrix has a higher rank upper bound for the same training overhead.
  • results: Extensive experiments on GLUE demonstrate the effectiveness of IncreLoRA: it achieves higher parameter efficiency than the baselines, with a particularly clear advantage in low-resource settings. The code is publicly available.
    Abstract With the increasing size of pre-trained language models (PLMs), fine-tuning all the parameters in the model is not efficient, especially when there are a large number of downstream tasks, which incur significant training and storage costs. Many parameter-efficient fine-tuning (PEFT) approaches have been proposed, among which, Low-Rank Adaptation (LoRA) is a representative approach that injects trainable rank decomposition matrices into every target module. Yet LoRA ignores the importance of parameters in different modules. To address this problem, many works have been proposed to prune the parameters of LoRA. However, under limited training conditions, the upper bound of the rank of the pruned parameter matrix is still affected by the preset values. We, therefore, propose IncreLoRA, an incremental parameter allocation method that adaptively adds trainable parameters during training based on the importance scores of each module. This approach is different from the pruning method as it is not limited by the initial number of training parameters, and each parameter matrix has a higher rank upper bound for the same training overhead. We conduct extensive experiments on GLUE to demonstrate the effectiveness of IncreLoRA. The results show that our method owns higher parameter efficiency, especially when under the low-resource settings where our method significantly outperforms the baselines. Our code is publicly available.
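A sketch of a LoRA module with an incremental rank-growing step in the spirit of IncreLoRA; the importance score, initialization, and growth schedule below are placeholders, not the paper's design:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update A @ B."""
    def __init__(self, d_in, d_out, r=1, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_in))  # zero init: update starts at 0
        self.alpha = alpha

    def forward(self, x):
        return self.base(x) + self.alpha * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def grow_rank(self):
        """Allocate one more rank-1 component (IncreLoRA-style growth;
        the exact init and scheduling here are guesses)."""
        self.A = nn.Parameter(torch.cat(
            [self.A, torch.randn(self.A.shape[0], 1) * 0.01], dim=1))
        self.B = nn.Parameter(torch.cat(
            [self.B, torch.zeros(1, self.B.shape[1])], dim=0))
        return self.A.shape[1]

def importance(layer):
    # Placeholder importance score: magnitude of the current low-rank update
    # (zero at init; in training this would be computed after updates).
    return (layer.A @ layer.B).norm().item()

layers = [LoRALinear(16, 16) for _ in range(4)]
# During training one would periodically grow the most important module:
scores = torch.tensor([importance(l) for l in layers])
new_rank = layers[int(scores.argmax())].grow_rank()
print("new rank:", new_rank)
```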

CACTUS: a Comprehensive Abstraction and Classification Tool for Uncovering Structures

  • paper_url: http://arxiv.org/abs/2308.12031
  • repo_url: None
  • paper_authors: Luca Gherardini, Varun Ravi Varma, Karol Capala, Roger Woods, Jose Sousa
  • for: Improving the reliability and efficiency of secure analytics by using explainable artificial intelligence to make model decisions transparent.
  • methods: The CACTUS model handles small datasets effectively, preserves the original meaning of categorical attributes, optimizes memory usage, and speeds up computation through parallelization.
  • results: Applied to the Wisconsin diagnostic breast cancer and Thyroid0387 datasets, CACTUS classifies data effectively and reports the frequency of attributes in each class, ranked by their discriminative power.
    Abstract The availability of large data sets is providing an impetus for driving current artificial intelligence developments. There are, however, challenges for developing solutions with small data sets, due to the requirements of practical and cost-effective deployment and the opacity of deep learning models. The Comprehensive Abstraction and Classification Tool for Uncovering Structures (CACTUS) is presented for improved secure analytics by effectively employing explainable artificial intelligence. It provides additional support for categorical attributes, preserving their original meaning, optimising memory usage, and speeding up the computation through parallelisation. It shows to the user the frequency of the attributes in each class and ranks them by their discriminative power. Its performance is assessed by application to the Wisconsin diagnostic breast cancer and Thyroid0387 data sets.
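A rough reconstruction of the reporting CACTUS provides, on a tiny hypothetical categorical dataset: per-class attribute frequencies, plus a naive discriminative-power ranking (the actual tool's scoring may differ):

```python
import pandas as pd

# Tiny synthetic categorical dataset (hypothetical; stands in for e.g. the
# Wisconsin diagnostic breast cancer attributes after discretisation).
df = pd.DataFrame({
    "texture": ["coarse", "fine", "coarse", "coarse", "fine", "fine"],
    "margin":  ["sharp", "blurred", "blurred", "blurred", "sharp", "sharp"],
    "label":   ["malignant", "benign", "malignant", "malignant", "benign", "benign"],
})

# Frequency of each attribute value within each class, as CACTUS reports.
for col in ["texture", "margin"]:
    freq = pd.crosstab(df[col], df["label"], normalize="columns")
    print(f"\n{col}:\n{freq}")

# Naive discriminative power: how differently an attribute's values are
# distributed across classes (total variation distance between columns).
def discriminative_power(col):
    freq = pd.crosstab(df[col], df["label"], normalize="columns")
    return 0.5 * (freq.iloc[:, 0] - freq.iloc[:, 1]).abs().sum()

ranking = sorted(["texture", "margin"], key=discriminative_power, reverse=True)
print("\nattributes ranked by discriminative power:", ranking)
```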

Prompt-Based Length Controlled Generation with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12030
  • repo_url: None
  • paper_authors: Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu
  • for: Improving the length-controlled generation capability of GPT-style models to meet the needs of different scenarios.
  • methods: Uses reinforcement learning with a reward signal given by either a trainable or a rule-based reward model to evaluate and steer the length of the generated output.
  • results: On popular summarization datasets, the method accurately controls the output length of GPT-style models and in some cases improves accuracy.
    Abstract Recently, large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising improvement and performance. Length-controlled generation of LLMs emerges as an important topic, which enables users to fully leverage the capability of LLMs in more real-world scenarios like generating a proper answer or essay of a desired length. In addition, the autoregressive generation in LLMs is extremely time-consuming, while the ability to control the generated length can reduce the inference cost by limiting the length, and thus satisfy different needs. Therefore, we aim to propose a prompt-based length control method to achieve length-controlled generation, which can also be widely applied in GPT-style LLMs. In particular, we adopt reinforcement learning with the reward signal given by either a trainable or rule-based reward model, which further affects the generation of LLMs via rewarding a pre-defined target length. Experiments show that our method significantly improves the accuracy of prompt-based length control for the summarization task on popular datasets like CNNDM and NYT. We believe this length-controllable ability can provide more potential towards the era of LLMs.
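A minimal example of the rule-based reward-model option: a reward that peaks at the pre-defined target length and decays with the deviation (the shaping and tolerance constant are assumptions):

```python
def length_reward(generated_tokens: int, target_tokens: int, tau: float = 10.0) -> float:
    """Rule-based reward: 1 at the exact target length, decaying linearly
    with the absolute deviation (tau controls tolerance; an assumption)."""
    return max(0.0, 1.0 - abs(generated_tokens - target_tokens) / tau)

# In a policy-gradient setup this reward would weight the log-likelihood of
# the sampled summary; here we just score a few hypothetical generations.
for n in (45, 50, 58, 80):
    print(n, "tokens ->", round(length_reward(n, target_tokens=50), 2))
```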

A Scale-Invariant Task Balancing Approach for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2308.12029
  • repo_url: None
  • paper_authors: Baijiong Lin, Weisen Jiang, Feiyang Ye, Yu Zhang, Pengguang Chen, Ying-Cong Chen, Shu Liu
  • for: Addressing the task-balancing problem in multi-task learning (MTL) by ensuring scale invariance at both the loss and gradient levels.
  • methods: Proposes Scale-Invariant Multi-Task Learning (SI-MTL), which applies a logarithm transformation to all task losses, together with a gradient balancing method, SI-G, that normalizes all task gradients to the magnitude of the maximum gradient norm.
  • results: Extensive experiments on several benchmark datasets consistently demonstrate the effectiveness of SI-G and the state-of-the-art performance of SI-MTL.
    Abstract Multi-task learning (MTL), a learning paradigm to learn multiple related tasks simultaneously, has achieved great success in various fields. However, task-balancing remains a significant challenge in MTL, with the disparity in loss/gradient scales often leading to performance compromises. In this paper, we propose a Scale-Invariant Multi-Task Learning (SI-MTL) method to alleviate the task-balancing problem from both loss and gradient perspectives. Specifically, SI-MTL contains a logarithm transformation which is performed on all task losses to ensure scale-invariant at the loss level, and a gradient balancing method, SI-G, which normalizes all task gradients to the same magnitude as the maximum gradient norm. Extensive experiments conducted on several benchmark datasets consistently demonstrate the effectiveness of SI-G and the state-of-the-art performance of SI-MTL.
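A small sketch of the two components on a toy hard-parameter-sharing model (the model itself is an assumption): log-transformed task losses, and SI-G rescaling every task gradient on the shared parameters to the magnitude of the largest gradient norm:

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(8, 8)
heads = [torch.nn.Linear(8, 1) for _ in range(3)]
x = torch.randn(32, 8)
targets = [torch.randn(32, 1) * s for s in (1.0, 10.0, 100.0)]  # very different scales

# Loss-level scale invariance: log-transform every task loss.
task_losses = [torch.nn.functional.mse_loss(head(shared(x)), t)
               for head, t in zip(heads, targets)]
log_losses = [loss.log() for loss in task_losses]

# SI-G: per-task gradients on the shared parameters, rescaled so that every
# task gradient has the magnitude of the largest gradient norm.
grads = [torch.autograd.grad(l, shared.weight, retain_graph=True)[0]
         for l in log_losses]
norms = [g.norm() for g in grads]
g_max = max(norms)
balanced = [g * (g_max / (n + 1e-12)) for g, n in zip(grads, norms)]
shared.weight.grad = sum(balanced) / len(balanced)

print([round(n.item(), 4) for n in norms], "-> all rescaled to", round(g_max.item(), 4))
```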

Bias-Aware Minimisation: Understanding and Mitigating Estimator Bias in Private SGD

  • paper_url: http://arxiv.org/abs/2308.12018
  • repo_url: None
  • paper_authors: Moritz Knolle, Robert Dorfman, Alexander Ziller, Daniel Rueckert, Georgios Kaissis
  • for: Improving the utility and safety of models trained with differentially private SGD (DP-SGD).
  • methods: Exploits a connection between per-sample gradient norms and the estimation bias of the private gradient oracle used in DP-SGD, and proposes Bias-Aware Minimisation (BAM), which provably reduces the bias of the private gradient estimator.
  • results: Shows, both theoretically and empirically, that BAM not only reduces bias but also substantially improves privacy-utility trade-offs on CIFAR-10, CIFAR-100, and ImageNet-32.
    Abstract Differentially private SGD (DP-SGD) holds the promise of enabling the safe and responsible application of machine learning to sensitive datasets. However, DP-SGD only provides a biased, noisy estimate of a mini-batch gradient. This renders optimisation steps less effective and limits model utility as a result. With this work, we show a connection between per-sample gradient norms and the estimation bias of the private gradient oracle used in DP-SGD. Here, we propose Bias-Aware Minimisation (BAM) that allows for the provable reduction of private gradient estimator bias. We show how to efficiently compute quantities needed for BAM to scale to large neural networks and highlight similarities to closely related methods such as Sharpness-Aware Minimisation. Finally, we provide empirical evidence that BAM not only reduces bias but also substantially improves privacy-utility trade-offs on the CIFAR-10, CIFAR-100, and ImageNet-32 datasets.
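A short demonstration of the phenomenon BAM targets rather than the algorithm itself: per-sample clipping, as used in DP-SGD, biases the averaged gradient whenever per-sample norms exceed the clipping threshold unevenly (the simulated gradients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(g, c):
    n = np.linalg.norm(g)
    return g * min(1.0, c / n)

# Per-sample "gradients" with heterogeneous norms, the regime in which
# the private estimator is biased.
grads = [rng.normal(scale=s, size=5) for s in (0.1, 0.5, 5.0, 10.0)]
C = 1.0
true_mean = np.mean(grads, axis=0)
clipped_mean = np.mean([clip(g, C) for g in grads], axis=0)

print("bias norm (clipped mean vs true mean):",
      round(float(np.linalg.norm(clipped_mean - true_mean)), 3))
# Larger per-sample norms relative to C => more clipping => more bias,
# which is the connection BAM exploits by keeping per-sample norms small.
```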

Graph Neural Stochastic Differential Equations

  • paper_url: http://arxiv.org/abs/2308.12316
  • repo_url: None
  • paper_authors: Richard Bergna, Felix Opolka, Pietro Liò, Jose Miguel Hernandez-Lobato
  • for: Proposing Graph Neural Stochastic Differential Equations (Graph Neural SDEs), a new model for assessing prediction uncertainty.
  • methods: Embeds randomness into the data representation via Brownian motion and highlights the Latent Graph Neural SDE variant, whose effectiveness is demonstrated empirically.
  • results: Experiments show that Latent Graph Neural SDEs outperform conventional models such as graph convolutional networks and Graph Neural ODEs, especially in confidence prediction, and handle out-of-distribution detection better in both static and spatio-temporal settings.
    Abstract We present a novel model Graph Neural Stochastic Differential Equations (Graph Neural SDEs). This technique enhances the Graph Neural Ordinary Differential Equations (Graph Neural ODEs) by embedding randomness into data representation using Brownian motion. This inclusion allows for the assessment of prediction uncertainty, a crucial aspect frequently missed in current models. In our framework, we spotlight the \textit{Latent Graph Neural SDE} variant, demonstrating its effectiveness. Through empirical studies, we find that Latent Graph Neural SDEs surpass conventional models like Graph Convolutional Networks and Graph Neural ODEs, especially in confidence prediction, making them superior in handling out-of-distribution detection across both static and spatio-temporal contexts.
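A minimal Euler-Maruyama sketch of a graph neural SDE, with message passing reduced to one mean-aggregation step and all sizes chosen arbitrarily (this is not the paper's architecture); repeated stochastic roll-outs then give a predictive spread, i.e., an uncertainty estimate:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_nodes, d = 5, 8
A = (torch.rand(n_nodes, n_nodes) < 0.4).float()        # toy adjacency
A = ((A + A.T) > 0).float()
deg = A.sum(1, keepdim=True).clamp(min=1.0)

drift = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh())       # f(h, neighbours)
diffusion = nn.Sequential(nn.Linear(d, d), nn.Softplus())   # g(h) >= 0

def sde_step(h, dt):
    """One Euler-Maruyama step: dh = f dt + g dW, with a single round of
    mean-aggregation message passing feeding the drift."""
    msg = (A @ h) / deg
    f = drift(torch.cat([h, msg], dim=-1))
    g = diffusion(h)
    dW = torch.randn_like(h) * dt ** 0.5                 # Brownian increment
    return h + f * dt + g * dW

h = torch.randn(n_nodes, d)
for _ in range(20):
    h = sde_step(h, dt=0.05)

# Monte-Carlo uncertainty: repeated stochastic roll-outs give a predictive
# spread rather than a single point estimate.
samples = torch.stack([sde_step(torch.randn(n_nodes, d), 0.05) for _ in range(50)])
print("per-node predictive std:", samples.std(dim=0).mean().item())
```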

MKL-$L_{0/1}$-SVM

  • paper_url: http://arxiv.org/abs/2308.12016
  • repo_url: https://github.com/maxis1718/simplemkl
  • paper_authors: Bin Zhu, Yijie Shi
  • for: Proposing a multiple kernel learning formulation of the support vector machine (SVM) with the $(0, 1)$ loss, which leads to a nonconvex and nonsmooth optimization problem.
  • methods: Derives first-order optimality conditions and exploits them to develop a fast ADMM solver for the nonconvex and nonsmooth problem.
  • results: Experiments on synthetic and real datasets show that MKL-$L_{0/1}$-SVM performs comparably to the SimpleMKL algorithm.
    Abstract This paper presents a Multiple Kernel Learning (abbreviated as MKL) framework for the Support Vector Machine (SVM) with the $(0, 1)$ loss function. Some first-order optimality conditions are given and then exploited to develop a fast ADMM solver to deal with the nonconvex and nonsmooth optimization problem. Extensive numerical experiments on synthetic and real datasets show that the performance of our MKL-$L_{0/1}$-SVM is comparable with the one of the leading approaches called SimpleMKL developed by Rakotomamonjy, Bach, Canu, and Grandvalet [Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008].
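For context, the multiple-kernel ingredient combines base kernels with simplex weights; the sketch below builds such a combined kernel, while the paper's actual contribution, the ADMM solver for the $(0, 1)$ loss, is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

def rbf(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def poly(X, degree):
    return (1.0 + X @ X.T) ** degree

# Base kernels; the learning problem picks the simplex weights beta_m
# jointly with the classifier (via the paper's ADMM solver, not shown).
kernels = [rbf(X, 0.1), rbf(X, 1.0), poly(X, 2)]
beta = np.array([0.5, 0.3, 0.2])          # beta >= 0, sum(beta) == 1
K = sum(b * Km for b, Km in zip(beta, kernels))
print("combined kernel PSD:", bool(np.all(np.linalg.eigvalsh(K) > -1e-8)))
```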

Quantum-Noise-driven Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.12013
  • repo_url: None
  • paper_authors: Marco Parigi, Stefano Martina, Filippo Caruso
  • for: Exploring machine-learning-based generative models that infer complex data distributions from a finite number of training samples in order to produce new synthetic data.
  • methods: Builds on diffusion models, an emerging framework that has recently surpassed generative adversarial networks in creating synthetic text and high-quality images.
  • results: Proposes three quantum-noise-driven generative diffusion models that could be experimentally tested on real quantum systems. By harnessing quantum features, in particular the non-trivial interplay among coherence, entanglement, and noise, the models aim to overcome the main computational burdens of classical diffusion models, paving the way for quantum-inspired or quantum-based generative diffusion algorithms with real-world applications ranging from climate forecasting to neuroscience, traffic flow analysis, and financial forecasting.
    Abstract Generative models realized with machine learning techniques are powerful tools to infer complex and unknown data distributions from a finite number of training samples in order to produce new synthetic data. Diffusion models are an emerging framework that have recently overcome the performance of the generative adversarial networks in creating synthetic text and high-quality images. Here, we propose and discuss the quantum generalization of diffusion models, i.e., three quantum-noise-driven generative diffusion models that could be experimentally tested on real quantum systems. The idea is to harness unique quantum features, in particular the non-trivial interplay among coherence, entanglement and noise that the currently available noisy quantum processors do unavoidably suffer from, in order to overcome the main computational burdens of classical diffusion models during inference. Hence, we suggest to exploit quantum noise not as an issue to be detected and solved but instead as a very remarkably beneficial key ingredient to generate much more complex probability distributions that would be difficult or even impossible to express classically, and from which a quantum processor might sample more efficiently than a classical one. Therefore, our results are expected to pave the way for new quantum-inspired or quantum-based generative diffusion algorithms addressing more powerfully classical tasks as data generation/prediction with widespread real-world applications ranging from climate forecasting to neuroscience, from traffic flow analysis to financial forecasting.

Neural oscillators for magnetic hysteresis modeling

  • paper_url: http://arxiv.org/abs/2308.12002
  • repo_url: None
  • paper_authors: Abhishek Chandra, Taniya Kapoor, Bram Daniels, Mitrofan Curti, Koen Tiels, Daniel M. Tartakovsky, Elena A. Lomonova
  • for: Modeling and quantifying hysteresis in magnetic materials.
  • methods: Uses an ordinary-differential-equation-based recurrent neural network (RNN), drawing inspiration from coupled-oscillatory RNNs and phenomenological hysteresis models to update the hidden states.
  • results: HystRNN generalizes to previously untrained regions such as first-order reversal curves and minor loops, an advantage over traditional rate-dependent methods that cannot capture the intrinsic nonlinearity of the material.
    Abstract Hysteresis is a ubiquitous phenomenon in science and engineering; its modeling and identification are crucial for understanding and optimizing the behavior of various systems. We develop an ordinary differential equation-based recurrent neural network (RNN) approach to model and quantify the hysteresis, which manifests itself in sequentiality and history-dependence. Our neural oscillator, HystRNN, draws inspiration from coupled-oscillatory RNN and phenomenological hysteresis models to update the hidden states. The performance of HystRNN is evaluated to predict generalized scenarios, involving first-order reversal curves and minor loops. The findings show the ability of HystRNN to generalize its behavior to previously untrained regions, an essential feature that hysteresis models must have. This research highlights the advantage of neural oscillators over the traditional RNN-based methods in capturing complex hysteresis patterns in magnetic materials, where traditional rate-dependent methods are inadequate to capture intrinsic nonlinearity.
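A sketch of a coupled-oscillatory RNN cell of the kind HystRNN draws on: a damped second-order ODE discretized in time, whose position/velocity state carries history, which suits hysteretic behaviour. The exact HystRNN update differs, and all parameters here are assumptions:

```python
import torch
import torch.nn as nn

class CoupledOscillatorCell(nn.Module):
    """Coupled-oscillatory RNN cell (in the spirit of coRNN, which HystRNN
    builds on): a damped second-order ODE discretized in time."""
    def __init__(self, d_in, d_hidden, dt=0.05, gamma=1.0, eps=0.1):
        super().__init__()
        self.W_y = nn.Linear(d_hidden, d_hidden, bias=False)
        self.W_z = nn.Linear(d_hidden, d_hidden, bias=False)
        self.V = nn.Linear(d_in, d_hidden)
        self.dt, self.gamma, self.eps = dt, gamma, eps

    def forward(self, u, y, z):
        # z is the hidden "velocity", y the hidden "position"; the state
        # therefore carries history, which suits hysteretic behaviour.
        acc = torch.tanh(self.W_y(y) + self.W_z(z) + self.V(u))
        z = z + self.dt * (acc - self.gamma * y - self.eps * z)
        y = y + self.dt * z
        return y, z

cell = CoupledOscillatorCell(d_in=1, d_hidden=16)
T = 100
field = torch.sin(torch.linspace(0, 6.28, T)).unsqueeze(-1)  # input H(t)
y = torch.zeros(16)
z = torch.zeros(16)
outputs = []
for t in range(T):
    y, z = cell(field[t], y, z)
    outputs.append(y)
print(torch.stack(outputs).shape)  # (T, 16): trajectory of hidden states
```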

Trustworthy Representation Learning Across Domains

  • paper_url: http://arxiv.org/abs/2308.12315
  • repo_url: None
  • paper_authors: Ronghang Zhu, Dongliang Guo, Daiqing Qi, Zhixuan Chu, Xiang Yu, Sheng Li
  • For: The paper studies how to make representation learning trustworthy across different domains, to meet the diverse data and uncertainty of real-world applications.
  • Methods: The paper proposes a trustworthy representation learning framework covering four concepts (robustness, privacy, fairness, and explainability) and provides an overview and analysis of existing methods for each concept.
  • Results: The survey concludes with insights and discussions on future research directions.
    Abstract As AI systems have obtained significant performance to be deployed widely in our daily lives and human society, people both enjoy the benefits brought by these technologies and suffer many social issues induced by these systems. To make AI systems good enough and trustworthy, plenty of research has been done to build guidelines for trustworthy AI systems. Machine learning is one of the most important parts of AI systems, and representation learning is the fundamental technology in machine learning. How to make representation learning trustworthy in real-world applications, e.g., cross-domain scenarios, is very valuable and necessary for both the machine learning and AI system fields. Inspired by the concepts in trustworthy AI, we propose the first trustworthy representation learning across domains framework, which includes four concepts, i.e., robustness, privacy, fairness, and explainability, and give a comprehensive literature review on this research direction. Specifically, we first introduce the details of the proposed trustworthy framework for representation learning across domains. Second, we provide basic notions and comprehensively summarize existing methods for the trustworthy framework from the four concepts. Finally, we conclude this survey with insights and discussions on future research directions.

On Uniformly Optimal Algorithms for Best Arm Identification in Two-Armed Bandits with Fixed Budget

  • paper_url: http://arxiv.org/abs/2308.12000
  • repo_url: None
  • paper_authors: Po-An Wang, Kaito Ariu, Alexandre Proutiere
  • for: Studying best-arm identification with a fixed budget in stochastic two-armed bandits with Bernoulli rewards.
  • methods: Introduces the natural class of consistent and stable algorithms, shows that any algorithm performing as well as the uniform sampling algorithm on all instances belongs to this class, and derives a lower bound on the error rate of any such algorithm.
  • results: Shows that no algorithm both matches the uniform sampling algorithm on all instances and strictly outperforms it on at least one instance; in short, no algorithm is better than uniform sampling. This resolves the two open problems presented in \cite{qin2022open}.
    Abstract We study the problem of best-arm identification with fixed budget in stochastic two-arm bandits with Bernoulli rewards. We prove that surprisingly, there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (this algorithm is referred to as the {\it uniform sampling} algorithm) on all instances, and that (ii) strictly outperforms this algorithm on at least one instance. In short, there is no algorithm better than the uniform sampling algorithm. Towards this result, we introduce the natural class of {\it consistent} and {\it stable} algorithms, and show that any algorithm that performs as well as the uniform sampling algorithm on all instances belongs to this class. The proof is completed by deriving a lower bound on the error rate satisfied by any consistent and stable algorithm, and by showing that the uniform sampling algorithm matches this lower bound. Our results provide a solution to the two open problems presented in \cite{qin2022open}.
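The uniform sampling algorithm, which the paper shows cannot be uniformly improved upon, is simple to state: split the budget equally between the two arms and recommend the larger empirical mean. A small simulation (the arm means are hypothetical):

```python
import numpy as np

def uniform_sampling(p, budget, rng):
    """Fixed-budget best-arm identification in a two-armed Bernoulli bandit:
    pull each arm budget/2 times, recommend the larger empirical mean."""
    half = budget // 2
    means = [rng.binomial(1, pi, size=half).mean() for pi in p]
    return int(np.argmax(means))

rng = np.random.default_rng(0)
p = [0.5, 0.6]                    # arm 1 is the best arm
trials = 2000
errors = sum(uniform_sampling(p, budget=200, rng=rng) != 1 for _ in range(trials))
print(f"empirical error rate: {errors / trials:.3f}")
```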

Relational Concept Based Models

  • paper_url: http://arxiv.org/abs/2308.11991
  • repo_url: https://github.com/Aghoreshwar/Awesome-Customer-Analytics
  • paper_authors: Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra
  • for: Proposing an interpretable deep learning model for task prediction in relational domains.
  • methods: Extends concept-based deep learning models to relational settings so that task predictions remain interpretable.
  • results: Experiments ranging from image classification to link prediction in knowledge graphs show that relational concept-based models match the generalization performance of existing relational black-boxes while supporting quantified concept-based explanations, responding effectively to test-time interventions, and withstanding demanding settings such as out-of-distribution scenarios, limited training data, and scarce concept supervision.
    Abstract The design of interpretable deep learning models working in relational domains poses an open challenge: interpretable deep learning methods, such as Concept-Based Models (CBMs), are not designed to solve relational problems, while relational models are not as interpretable as CBMs. To address this problem, we propose Relational Concept-Based Models, a family of relational deep learning methods providing interpretable task predictions. Our experiments, ranging from image classification to link prediction in knowledge graphs, show that relational CBMs (i) match generalization performance of existing relational black-boxes (as opposed to non-relational CBMs), (ii) support the generation of quantified concept-based explanations, (iii) effectively respond to test-time interventions, and (iv) withstand demanding settings including out-of-distribution scenarios, limited training data regimes, and scarce concept supervisions.
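For background, the (non-relational) concept-bottleneck pattern that the paper lifts to relational settings routes the input through interpretable concept activations before the task head, which is also what makes test-time interventions possible; a minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Input -> interpretable concept activations -> task prediction.
    (Non-relational sketch of the pattern the paper lifts to relations.)"""
    def __init__(self, d_in, n_concepts, n_classes):
        super().__init__()
        self.concepts = nn.Sequential(nn.Linear(d_in, n_concepts), nn.Sigmoid())
        self.task = nn.Linear(n_concepts, n_classes)

    def forward(self, x, intervene=None):
        c = self.concepts(x)
        if intervene is not None:           # test-time intervention:
            idx, value = intervene          # clamp a concept to a known value
            c = c.clone()
            c[:, idx] = value
        return c, self.task(c)

model = ConceptBottleneck(d_in=10, n_concepts=4, n_classes=3)
x = torch.randn(2, 10)
c, logits = model(x)
_, logits_fixed = model(x, intervene=(0, 1.0))  # an expert sets concept 0 := true
print(c.detach(), logits.shape, logits_fixed.shape)
```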

Will More Expressive Graph Neural Networks do Better on Generative Tasks?

  • paper_url: http://arxiv.org/abs/2308.11978
  • repo_url: None
  • paper_authors: Xiandong Zou, Xiangyu Zhao, Pietro Liò, Yiren Zhao
  • for: Investigating the expressiveness of graph neural networks (GNNs) for graph generation by evaluating different GNNs on molecular generation tasks.
  • methods: Replaces the underlying GNNs of two generative frameworks (GCPN and GraphAF) with six different GNNs and runs extensive experiments on the ZINC-250k dataset.
  • results: More expressive GNNs can improve the performance of GCPN and GraphAF on molecular generation, although GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, GCPN and GraphAF with advanced GNNs achieve state-of-the-art results against 17 non-GNN-based graph generative approaches (such as variational autoencoders and Bayesian optimization models) on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.
    Abstract Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs in two different generative frameworks (GCPN and GraphAF), on six different molecular generative objectives on the ZINC-250k dataset. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN and GraphAF on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.

Approximating Score-based Explanation Techniques Using Conformal Regression

  • paper_url: http://arxiv.org/abs/2308.11975
  • repo_url: None
  • paper_authors: Amr Alkhatib, Henrik Boström, Sofiane Ennadir, Ulf Johansson
  • for: Proposing a computationally cheaper regression model to approximate the output of score-based explanation techniques, such as SHAP, and thereby improve time efficiency.
  • methods: Uses an inductive conformal prediction framework to provide validity guarantees, together with several non-conformity measures designed to account for the difficulty of approximating the explanations while keeping the computational cost low.
  • results: A large-scale empirical investigation shows that the method significantly improves execution time compared to the fast version of SHAP, TreeSHAP, and produces tight intervals while providing validity guarantees. The approach also allows explanations of different approximation methods to be compared, so that a method can be selected based on how tight its predicted intervals are.
    Abstract Score-based explainable machine-learning techniques are often used to understand the logic behind black-box models. However, such explanation techniques are often computationally expensive, which limits their application in time-critical contexts. Therefore, we propose and investigate the use of computationally less costly regression models for approximating the output of score-based explanation techniques, such as SHAP. Moreover, validity guarantees for the approximated values are provided by the employed inductive conformal prediction framework. We propose several non-conformity measures designed to take the difficulty of approximating the explanations into account while keeping the computational cost low. We present results from a large-scale empirical investigation, in which the approximate explanations generated by our proposed models are evaluated with respect to efficiency (interval size). The results indicate that the proposed method can significantly improve execution time compared to the fast version of SHAP, TreeSHAP. The results also suggest that the proposed method can produce tight intervals, while providing validity guarantees. Moreover, the proposed approach allows for comparing explanations of different approximation methods and selecting a method based on how informative (tight) are the predicted intervals.
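A minimal inductive conformal regression sketch of the proposed recipe: a cheap regressor approximates the expensive explanation scores, and absolute residuals on a held-out calibration set yield intervals with a coverage guarantee. The data and the "expensive explainer" target are simulated, and the paper's non-conformity measures are more refined than plain residuals:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
# Stand-in for an expensive score-based explanation: the "true" attribution
# of feature 0 (in practice this target would come from running SHAP).
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Split: proper training set vs calibration set (inductive CP).
X_tr, y_tr = X[:1000], y[:1000]
X_cal, y_cal = X[1000:1500], y[1000:1500]
X_test, y_test = X[1500:], y[1500:]

approx = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Non-conformity: absolute residual on the calibration set.
alpha = 0.1
scores = np.abs(y_cal - approx.predict(X_cal))
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

pred = approx.predict(X_test)
lo, hi = pred - q, pred + q          # valid 1-alpha prediction intervals
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(f"coverage: {coverage:.3f} (target {1 - alpha}), interval width: {2 * q:.3f}")
```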

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

  • paper_url: http://arxiv.org/abs/2308.11971
  • repo_url: None
  • paper_authors: Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
  • for: This paper aims to introduce an efficient vision-language model that can learn from diverse, multimodal data.
  • methods: The proposed model, called EVE, uses a unified multimodal Transformer network with modality-aware sparse Mixture-of-Experts (MoE) modules to capture modality-specific information. The model is pre-trained using a simple yet effective masked signal modeling task, which accelerates training and enables better downstream performance.
  • results: EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval, despite its simplicity.
    Abstract Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
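A guess at the spirit of the modality-aware sparse MoE component (not EVE's actual design; the separate per-modality expert pools, sizes, and top-1 routing are assumptions): tokens of each modality are routed to a small subset of experts, so modality-specific information is captured while compute stays sparse:

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Sparse MoE block with separate expert pools and routers per modality
    (a hypothetical reading of 'modality-aware')."""
    def __init__(self, d, n_experts=4, top_k=1):
        super().__init__()
        mods = ("image", "text")
        self.experts = nn.ModuleDict({
            m: nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts)) for m in mods
        })
        self.router = nn.ModuleDict({m: nn.Linear(d, n_experts) for m in mods})
        self.top_k = top_k

    def forward(self, tokens, modality):
        logits = self.router[modality](tokens)              # (n_tokens, n_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):                         # route tokens sparsely
            for e, expert in enumerate(self.experts[modality]):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(tokens[mask])
        return out

moe = ModalityAwareMoE(d=32)
img_tokens, txt_tokens = torch.randn(5, 32), torch.randn(7, 32)
print(moe(img_tokens, "image").shape, moe(txt_tokens, "text").shape)
```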

Anisotropic Hybrid Networks for liver tumor segmentation with uncertainty quantification

  • paper_url: http://arxiv.org/abs/2308.11969
  • repo_url: None
  • paper_authors: Benjamin Lambert, Pauline Roca, Florence Forbes, Senan Doyle, Michel Dojat
  • for: Assessing liver tumor burden to guide the treatment strategy for hepatocellular carcinoma (HCC).
  • methods: Segments the liver and tumors on contrast-enhanced magnetic resonance imaging (CE-MRI), comparing two pipelines based on anisotropic models.
  • results: Both pipelines segment the liver and tumors, each with different strengths and weaknesses, and an uncertainty quantification strategy is proposed to identify potential false-positive tumor lesions.
    Abstract Liver tumors are an important disease burden, ranking as the fourth leading cause of cancer mortality. In the case of hepatocellular carcinoma (HCC), the delineation of the liver and tumors on contrast-enhanced magnetic resonance imaging (CE-MRI) is performed to guide the treatment strategy. As this task is time-consuming, requires high expertise, and could be subject to inter-observer variability, there is a strong need for automatic tools. However, challenges arise from the lack of available training data, as well as the high variability in terms of image resolution and MRI sequence. In this work we propose to compare two different pipelines based on anisotropic models to obtain the segmentation of the liver and tumors. The first pipeline corresponds to a baseline multi-class model that performs the simultaneous segmentation of the liver and tumor classes. In the second approach, we train two distinct binary models, one segmenting the liver only and the other the tumors. Our results show that both pipelines exhibit different strengths and weaknesses. Moreover we propose an uncertainty quantification strategy allowing the identification of potential false positive tumor lesions. Both solutions were submitted to the MICCAI 2023 Atlas challenge regarding liver and tumor segmentation.

Maintaining Plasticity via Regenerative Regularization

  • paper_url: http://arxiv.org/abs/2308.11958
  • repo_url: None
  • paper_authors: Saurabh Kumar, Henrik Marklund, Benjamin Van Roy
  • for: Proposing L2 Init, a simple method for maintaining the plasticity of neural networks processing non-stationary data streams.
  • methods: Adds to the loss function an L2 regularization term that regularizes the parameters toward their initial values rather than toward the origin; the method is simple to implement and requires selecting only a single hyper-parameter.
  • results: On simple problems representative of different types of non-stationarity in continual learning, L2 Init consistently mitigates plasticity loss; the regularization term also reduces parameter magnitudes and maintains a high effective feature rank.
    Abstract In continual learning, plasticity refers to the ability of an agent to quickly adapt to new information. Neural networks are known to lose plasticity when processing non-stationary data streams. In this paper, we propose L2 Init, a very simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters. This is very similar to standard L2 regularization (L2), the only difference being that L2 regularizes toward the origin. L2 Init is simple to implement and requires selecting only a single hyper-parameter. The motivation for this method is the same as that of methods that reset neurons or parameter values. Intuitively, when recent losses are insensitive to particular parameters, these parameters drift toward their initial values. This prepares parameters to adapt quickly to new tasks. On simple problems representative of different types of nonstationarity in continual learning, we demonstrate that L2 Init consistently mitigates plasticity loss. We additionally find that our regularization term reduces parameter magnitudes and maintains a high effective feature rank.
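The method itself fits in a few lines; a sketch assuming a toy model and an arbitrary regularization strength:

```python
import torch

model = torch.nn.Linear(10, 1)
# Snapshot the initial parameters once, before training starts.
theta0 = [p.detach().clone() for p in model.parameters()]
lam = 1e-3  # the single hyper-parameter L2 Init requires

def l2_init_penalty(model):
    # Standard L2 would be sum(p.pow(2).sum()); L2 Init instead pulls the
    # parameters back toward their *initial* values.
    return sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), theta0))

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y) + lam * l2_init_penalty(model)
loss.backward()
```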

When MiniBatch SGD Meets SplitFed Learning:Convergence Analysis and Performance Evaluation

  • paper_url: http://arxiv.org/abs/2308.11953
  • repo_url: None
  • paper_authors: Chao Huang, Geng Tian, Ming Tang
  • For: The paper addresses the issue of client drift in split federated learning (SFL) and proposes a new algorithm, MiniBatch-SFL, that incorporates mini-batch SGD into SFL to mitigate it.
  • Methods: SFL splits the model at a cut layer into two parts trained on the client and server sides, respectively; MiniBatch-SFL trains the client-side model in an FL fashion while the server trains the server-side model similarly to mini-batch SGD.
  • Results: MiniBatch-SFL achieves higher accuracy than conventional SFL and FL, with improvements of up to 24.1% and 17.1% on highly non-IID data, respectively. The convergence analysis shows that a bound on the expected loss can be obtained by analyzing the expected server-side and client-side model updates.
    Abstract Federated learning (FL) enables collaborative model training across distributed clients (e.g., edge devices) without sharing raw data. Yet, FL can be computationally expensive as the clients need to train the entire model multiple times. SplitFed learning (SFL) is a recent distributed approach that alleviates computation workload at the client device by splitting the model at a cut layer into two parts, where clients only need to train part of the model. However, SFL still suffers from the \textit{client drift} problem when clients' data are highly non-IID. To address this issue, we propose MiniBatch-SFL. This algorithm incorporates MiniBatch SGD into SFL, where the clients train the client-side model in an FL fashion while the server trains the server-side model similar to MiniBatch SGD. We analyze the convergence of MiniBatch-SFL and show that the bound of the expected loss can be obtained by analyzing the expected server-side and client-side model updates, respectively. The server-side updates do not depend on the non-IID degree of the clients' datasets and can potentially mitigate client drift. However, the client-side model relies on the non-IID degree and can be optimized by properly choosing the cut layer. Perhaps counter-intuitive, our empirical result shows that a latter position of the cut layer leads to a smaller average gradient divergence and a better algorithm performance. Moreover, numerical results show that MiniBatch-SFL achieves higher accuracy than conventional SFL and FL. The accuracy improvement can be up to 24.1\% and 17.1\% with highly non-IID data, respectively.
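A compact sketch of the split-training mechanic itself (the cut position, model sizes, and sequential single-round loop are assumptions; the full protocol also averages per-client models, which is omitted here): the client computes up to the cut layer, the server completes the forward pass, and gradients flow back across the cut:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Cut-layer split: client-side part vs server-side part.
client_part = nn.Sequential(nn.Linear(10, 16), nn.ReLU())   # trained on clients
server_part = nn.Sequential(nn.Linear(16, 1))               # trained on server

opt_c = torch.optim.SGD(client_part.parameters(), lr=0.1)
opt_s = torch.optim.SGD(server_part.parameters(), lr=0.1)

# Two clients with (possibly non-IID) local data.
clients = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(2)]

for x, y in clients:                      # one round, sequential for clarity
    smashed = client_part(x)              # client forward up to the cut layer
    out = server_part(smashed)            # server completes the forward pass
    loss = nn.functional.mse_loss(out, y)
    opt_c.zero_grad()
    opt_s.zero_grad()
    loss.backward()                       # server grads flow back to the client
    opt_s.step()                          # server-side update (mini-batch SGD style)
    opt_c.step()                          # client-side update (FL style)
print(loss.item())
```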

Multi-scale Transformer Pyramid Networks for Multivariate Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2308.11946
  • repo_url: None
  • paper_authors: Yifan Zhang, Rui Wu, Sergiu M. Dascalu, Frederick C. Harris Jr
  • for: Proposing a dimension-invariant embedding technique that captures short-term temporal dependencies and projects multivariate time series (MTS) data into a higher-dimensional space while preserving the dimensions of time steps and variables.
  • methods: Proposes the Multi-scale Transformer Pyramid Network (MTPNet), designed to capture temporal dependencies at multiple unconstrained scales; predictions are inferred from multi-scale latent representations obtained from transformers at various scales.
  • results: Extensive experiments on nine benchmark datasets show that MTPNet outperforms recent state-of-the-art methods.
    Abstract Multivariate Time Series (MTS) forecasting involves modeling temporal dependencies within historical records. Transformers have demonstrated remarkable performance in MTS forecasting due to their capability to capture long-term dependencies. However, prior work has been confined to modeling temporal dependencies at either a fixed scale or multiple scales that exponentially increase (most with base 2). This limitation hinders their effectiveness in capturing diverse seasonalities, such as hourly and daily patterns. In this paper, we introduce a dimension invariant embedding technique that captures short-term temporal dependencies and projects MTS data into a higher-dimensional space, while preserving the dimensions of time steps and variables in MTS data. Furthermore, we present a novel Multi-scale Transformer Pyramid Network (MTPNet), specifically designed to effectively capture temporal dependencies at multiple unconstrained scales. The predictions are inferred from multi-scale latent representations obtained from transformers at various scales. Extensive experiments on nine benchmark datasets demonstrate that the proposed MTPNet outperforms recent state-of-the-art methods.

RamseyRL: A Framework for Intelligent Ramsey Number Counterexample Searching

  • paper_url: http://arxiv.org/abs/2308.11943
  • repo_url: None
  • paper_authors: Steve Vott, Adam M. Lehavi
  • for: The paper explores the application of a best-first search algorithm and reinforcement learning (RL) techniques to find counterexamples to specific Ramsey numbers.
  • methods: It introduces a graph vectorization and a deep neural network (DNN)-based heuristic that gauges the likelihood of a graph being a counterexample, together with algorithmic optimizations that confine the search to polynomial runtime.
  • results: Rather than reporting new counterexamples, the paper introduces and evaluates a framework supporting Ramsey counterexample exploration, with code and methods made available.
    Abstract The Ramsey number $R(s, t)$ is the minimum number of nodes $n$ such that all undirected simple graphs of order $n$ contain a clique of order $s$ or an independent set of order $t$. This paper explores the application of a best-first search algorithm and reinforcement learning (RL) techniques to find counterexamples to specific Ramsey numbers. We incrementally improve over prior search methods such as random search by introducing a graph vectorization and deep neural network (DNN)-based heuristic, which gauges the likelihood of a graph being a counterexample. The paper also proposes algorithmic optimizations to confine the search to polynomial runtime. This paper does not aim to present new counterexamples but rather introduces and evaluates a framework supporting Ramsey counterexample exploration using other heuristics. Code and methods are made available through a PyPI package and GitHub repository.
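
To make the search loop concrete, here is a minimal best-first search over single-edge flips, using the upper-triangular adjacency vector as the graph vectorization and a deterministic random stub in place of the paper's DNN heuristic. The instance R(3,3) on 5 nodes is chosen so that counterexamples (e.g. the 5-cycle) provably exist; this is an illustrative sketch, not the released RamseyRL code.

```python
# Best-first search for a graph on n nodes with no clique of order s and no
# independent set of order t (a counterexample witnessing R(s, t) > n).
import heapq
import itertools
import random

n, s, t = 5, 3, 3
PAIRS = [(i, j) for i in range(n) for j in range(i + 1, n)]

def has_clique_or_independent(adj):
    for grp in itertools.combinations(range(n), s):
        if all(adj[a][b] for a, b in itertools.combinations(grp, 2)):
            return True        # clique of order s
    for grp in itertools.combinations(range(n), t):
        if not any(adj[a][b] for a, b in itertools.combinations(grp, 2)):
            return True        # independent set of order t
    return False

def to_adj(vec):
    adj = [[0] * n for _ in range(n)]
    for k, (i, j) in enumerate(PAIRS):
        adj[i][j] = adj[j][i] = vec[k]
    return adj

def heuristic(vec):            # deterministic stub standing in for the DNN
    return random.Random(hash(vec)).random()

start = (0,) * len(PAIRS)
frontier = [(-heuristic(start), start)]
seen = {start}
while frontier:
    _, vec = heapq.heappop(frontier)
    if not has_clique_or_independent(to_adj(vec)):
        print("counterexample found:", vec)   # e.g. a 5-cycle
        break
    for k in range(len(vec)):                 # neighbours: single edge flips
        child = vec[:k] + (1 - vec[k],) + vec[k + 1:]
        if child not in seen:
            seen.add(child)
            heapq.heappush(frontier, (-heuristic(child), child))
```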

Audio Generation with Multiple Conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11940
  • repo_url: None
  • paper_authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
  • For: The paper aims to enhance the controllability of existing pre-trained text-to-audio models by incorporating additional conditions such as content (timestamp) and style (pitch contour and energy contour) as supplements to the text.
  • Methods: A trainable control condition encoder, enhanced by a large language model, and a trainable Fusion-Net encode and fuse the additional conditions while the weights of the pre-trained text-to-audio model are kept frozen.
  • Results: The model achieves fine-grained control over the temporal order, pitch, and energy of generated audio, and experimental results demonstrate its success in accomplishing controllable audio generation.
    Abstract Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/

Retail Demand Forecasting: A Comparative Study for Multivariate Time Series

  • paper_url: http://arxiv.org/abs/2308.11939
  • repo_url: None
  • paper_authors: Md Sabbirul Haque, Md Shahedul Amin, Jonayet Miah
  • for: Accurate retail demand forecasting, a critical determinant of financial performance and supply chain efficiency.
  • methods: Enriches time series data of customer demand with macroeconomic variables such as the Consumer Price Index (CPI), Index of Consumer Sentiment (ICS), and unemployment rates.
  • results: Develops and compares various regression and machine learning models to predict retail demand accurately.
    Abstract Accurate demand forecasting in the retail industry is a critical determinant of financial performance and supply chain efficiency. As global markets become increasingly interconnected, businesses are turning towards advanced prediction models to gain a competitive edge. However, existing literature mostly focuses on historical sales data and ignores the vital influence of macroeconomic conditions on consumer spending behavior. In this study, we bridge this gap by enriching time series data of customer demand with macroeconomic variables, such as the Consumer Price Index (CPI), Index of Consumer Sentiment (ICS), and unemployment rates. Leveraging this comprehensive dataset, we develop and compare various regression and machine learning models to predict retail demand accurately.
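
A minimal sketch of the modeling recipe follows: the demand series is enriched with macroeconomic covariates and a lag feature, and two regressors are compared on a chronological split. The data is synthetic, and the column names (cpi, ics, unemployment) merely mirror the variables named above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
T = 200
df = pd.DataFrame({
    "cpi": 250 + np.cumsum(rng.normal(0.2, 0.5, T)),
    "ics": 90 + rng.normal(0, 5, T),
    "unemployment": 5 + rng.normal(0, 0.3, T),
})
df["demand"] = (100 - 0.1 * df["cpi"] + 0.5 * df["ics"]
                - 3 * df["unemployment"] + rng.normal(0, 2, T))
df["demand_lag1"] = df["demand"].shift(1)   # autoregressive feature
df = df.dropna()

X = df[["cpi", "ics", "unemployment", "demand_lag1"]]
y = df["demand"]
split = int(0.8 * len(df))                  # chronological split, no shuffle
X_tr, X_te = X.iloc[:split], X.iloc[split:]
y_tr, y_te = y.iloc[:split], y.iloc[split:]

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(type(model).__name__, "MAE:", round(mae, 3))
```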

System Identification for Continuous-time Linear Dynamical Systems

  • paper_url: http://arxiv.org/abs/2308.11933
  • repo_url: https://github.com/Jonas-Nicodemus/phdmd
  • paper_authors: Peter Halmos, Jonathan Pillow, David A. Knowles
  • for: The paper addresses system identification for the Kalman filter without the restrictive and often unrealistic assumption that observations are sampled at equally-spaced time points.
  • methods: An expectation-maximization (EM) procedure learns the parameters of a continuous-time Itô stochastic differential equation (SDE) governing the latent state and covariance dynamics, using a novel two-filter analytical form of the posterior, derived via Bayesian arguments, that yields analytical updates without pre-computing the forward pass.
  • results: The approach naturally incorporates irregularly sampled measurements and intermittent missing values, can extend non-linear system identification methods such as switching LDS (SLDS), and is demonstrated by learning a latent multivariate Fokker-Planck SDE representing a toggle-switch genetic circuit with biologically realistic parameters, comparing favorably with the discrete-time Kalman filter as the step-size irregularity and the spectral radius of the dynamics matrix increase.
    Abstract The problem of system identification for the Kalman filter, relying on the expectation-maximization (EM) procedure to learn the underlying parameters of a dynamical system, has largely been studied assuming that observations are sampled at equally-spaced time points. However, in many applications this is a restrictive and unrealistic assumption. This paper addresses system identification for the continuous-discrete filter, with the aim of generalizing learning for the Kalman filter by relying on a solution to a continuous-time It\^o stochastic differential equation (SDE) for the latent state and covariance dynamics. We introduce a novel two-filter, analytical form for the posterior with a Bayesian derivation, which yields analytical updates which do not require the forward-pass to be pre-computed. Using this analytical and efficient computation of the posterior, we provide an EM procedure which estimates the parameters of the SDE, naturally incorporating irregularly sampled measurements. Generalizing the learning of latent linear dynamical systems (LDS) to continuous-time may extend the use of the hybrid Kalman filter to data which is not regularly sampled or has intermittent missing values, and can extend the power of non-linear system identification methods such as switching LDS (SLDS), which rely on EM for the linear discrete-time Kalman filter as a sub-unit for learning locally linearized behavior of a non-linear system. We apply the method by learning the parameters of a latent, multivariate Fokker-Planck SDE representing a toggle-switch genetic circuit using biologically realistic parameters, and compare the efficacy of learning relative to the discrete-time Kalman filter as the step-size irregularity and spectral-radius of the dynamics-matrix increases.
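
The key departure from the discrete-time filter is that the prediction step must be integrated over each irregular observation gap. The sketch below does this with a matrix exponential for a toy damped oscillator; the drift A, observation matrix C, noise levels, and the crude Q*dt process-noise integral are illustrative simplifications (the paper propagates the covariance dynamics exactly), and the EM parameter updates are omitted.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-1.0, -0.1]])   # latent drift: damped oscillator
C = np.array([[1.0, 0.0]])                 # observe the first coordinate
Q = 0.01 * np.eye(2)                       # process noise intensity
R = np.array([[0.05]])                     # measurement noise variance

rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(0.3, size=50))   # irregular sample times
x, P = np.zeros(2), np.eye(2)
t_prev = 0.0
for t in times:
    dt = t - t_prev
    F = expm(A * dt)                # exact mean transition over the gap
    x = F @ x
    P = F @ P @ F.T + Q * dt        # crude Euler integral of process noise
    y = C @ x + rng.normal(0.0, np.sqrt(R[0, 0]), size=1)  # stand-in datum
    S = C @ P @ C.T + R             # standard Kalman measurement update
    K = P @ C.T @ np.linalg.inv(S)
    x = x + K @ (y - C @ x)
    P = (np.eye(2) - K @ C) @ P
    t_prev = t
```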

Dynamic landslide susceptibility mapping over recent three decades to uncover variations in landslide causes in subtropical urban mountainous areas

  • paper_url: http://arxiv.org/abs/2308.11929
  • repo_url: https://github.com/cli-de/d_lsm
  • paper_authors: Peifeng Ma, Li Chen, Chang Yu, Qing Zhu, Yulin Ding
  • for: The study aims to provide landslide susceptibility assessment (LSA) that accounts for variations in the landslide-inducing environment over different time spans, to better mitigate landslide risks.
  • methods: Multiple predictive models are used for annual LSA; SHAP interprets each model's output and ranks landslide features, and MT-InSAR is applied to enhance and validate the LSA results.
  • results: The primary landslide-triggering factors are terrain slope and extreme rainfall, and the variation in landslide causes is mainly attributed to extreme rainfall events driven by global climate change and to the Landslip Prevention and Mitigation Programme (LPMitP) implemented by the Hong Kong government.
    Abstract Landslide susceptibility assessment (LSA) is of paramount importance in mitigating landslide risks. Recently, there has been a surge in the utilization of data-driven methods for predicting landslide susceptibility due to the growing availability of aerial and satellite data. Nonetheless, the rapid oscillations within the landslide-inducing environment (LIE), primarily due to significant changes in external triggers such as rainfall, pose difficulties for contemporary data-driven LSA methodologies to accommodate LIEs over diverse timespans. This study presents dynamic landslide susceptibility mapping that simply employs multiple predictive models for annual LSA. In practice, this will inevitably encounter small sample problems due to the limited number of landslide samples in certain years. Another concern arises because the majority of existing LSA approaches train black-box models to fit distinct datasets, yet often fail in generalization and in providing comprehensive explanations of the interactions between input features and predictions. Accordingly, we propose to meta-learn representations with fast adaptation ability using a few samples and gradient updates, and to apply SHAP for each model interpretation and landslide feature permutation. Additionally, we applied MT-InSAR for LSA result enhancement and validation. The chosen study area is Lantau Island, Hong Kong, where we conducted a comprehensive dynamic LSA spanning from 1992 to 2019. The model interpretation results demonstrate that the primary factors responsible for triggering landslides in Lantau Island are terrain slope and extreme rainfall. The results also indicate that the variation in landslide causes can be primarily attributed to extreme rainfall events, which result from global climate change, and the implementation of the Landslip Prevention and Mitigation Programme (LPMitP) by the Hong Kong government.

Solving Elliptic Optimal Control Problems using Physics Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2308.11925
  • repo_url: None
  • paper_authors: Bangti Jin, Ramesh Sau, Luowei Yin, Zhi Zhou
  • for: The paper presents and analyzes a numerical solver for optimal control problems, with and without box constraints, governed by linear and semilinear second-order elliptic PDEs.
  • methods: Building on the first-order optimality system of the optimal control problem, physics informed neural networks (PINNs) are used to solve the resulting coupled system.
  • results: The paper provides $L^2(\Omega)$ error bounds on the state, control, and adjoint state in terms of deep neural network parameters (e.g., depth, width, and parameter bounds) and the number of sampling points in the domain and on the boundary, and illustrates the approach on numerical examples in comparison with three existing methods.
    Abstract In this work, we present and analyze a numerical solver for optimal control problems (without / with box constraint) for linear and semilinear second-order elliptic problems. The approach is based on a coupled system derived from the first-order optimality system of the optimal control problem, and applies physics informed neural networks (PINNs) to solve the coupled system. We present an error analysis of the numerical scheme, and provide $L^2(\Omega)$ error bounds on the state, control and adjoint state in terms of deep neural network parameters (e.g., depth, width, and parameter bounds) and the number of sampling points in the domain and on the boundary. The main tools in the analysis include offset Rademacher complexity and boundedness and Lipschitz continuity of neural network functions. We present several numerical examples to illustrate the approach and compare it with three existing approaches.
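
As a worked instance of the coupled-optimality-system idea, consider the model problem min (1/2)||y - y_d||^2 + (alpha/2)||u||^2 subject to -y'' = u on (0,1) with homogeneous Dirichlet conditions. Eliminating the control via u = -p/alpha couples the state and adjoint equations, and two small networks can minimize the residuals jointly. The target y_d, network widths, and loss weights below are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha = 0.01
net_y = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # state
net_p = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # adjoint
opt = torch.optim.Adam(list(net_y.parameters()) + list(net_p.parameters()),
                       lr=1e-3)

def d2(net, x):
    """Second derivative of net(x) w.r.t. x via autograd."""
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    return torch.autograd.grad(du.sum(), x, create_graph=True)[0]

xb = torch.tensor([[0.0], [1.0]])             # boundary points
for it in range(2000):
    x = torch.rand(128, 1)                    # interior collocation points
    yd = torch.sin(torch.pi * x)              # illustrative target state
    res_y = -d2(net_y, x) + net_p(x) / alpha  # state eq.: -y'' = u = -p/alpha
    res_p = -d2(net_p, x) - net_y(x) + yd     # adjoint eq.: -p'' = y - y_d
    loss = ((res_y ** 2).mean() + (res_p ** 2).mean()
            + (net_y(xb) ** 2).mean() + (net_p(xb) ** 2).mean())  # Dirichlet BCs
    opt.zero_grad()
    loss.backward()
    opt.step()

# Recover the optimal control on a grid from the adjoint network.
u = -net_p(torch.linspace(0, 1, 5).unsqueeze(1)) / alpha
```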

Diverse Policies Converge in Reward-free Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.11924
  • repo_url: https://github.com/openrl-lab/diversepolicies
  • paper_authors: Fanqi Lin, Shiyu Huang, Weiwei Tu
  • for: The paper studies diversity reinforcement learning, addressing the fact that existing diversity RL algorithms lack theoretical answers about convergence and efficiency.
  • methods: A unified diversity reinforcement learning framework is proposed, under which the convergence of training diverse policies is investigated, together with a provably efficient diversity RL algorithm.
  • results: Numerical experiments verify the effectiveness and efficiency of the proposed method.
    Abstract Reinforcement learning has achieved great success in many decision-making tasks, and traditional reinforcement learning algorithms are mainly designed for obtaining a single optimal solution. However, recent works show the importance of developing diverse policies, which makes it an emerging research topic. Despite the variety of diversity reinforcement learning algorithms that have emerged, none of them theoretically answer the question of how the algorithm converges and how efficient the algorithm is. In this paper, we provide a unified diversity reinforcement learning framework and investigate the convergence of training diverse policies. Under such a framework, we also propose a provably efficient diversity reinforcement learning algorithm. Finally, we verify the effectiveness of our method through numerical experiments.

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

  • paper_url: http://arxiv.org/abs/2308.11923
  • repo_url: None
  • paper_authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino
  • for: The paper proposes Audio Difference Captioning (ADC), a new extension of audio captioning that describes the semantic differences between input pairs of similar but slightly different audio clips.
  • methods: A cross-attention-concentrated transformer encoder extracts differences by comparing the pair of clips, and a similarity-discrepancy disentanglement emphasizes the difference in the latent space.
  • results: Experiments on the newly built AudioDiffCaps dataset show that the proposed methods solve the ADC task effectively and improve the attention weights for extracting differences, as visualized in the transformer encoder.
    Abstract We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space. To evaluate the proposed methods, we built an AudioDiffCaps dataset consisting of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. The experiment with the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and improve the attention weights to extract the difference by visualizing them in the transformer encoder.

Addressing Selection Bias in Computerized Adaptive Testing: A User-Wise Aggregate Influence Function Approach

  • paper_url: http://arxiv.org/abs/2308.11912
  • repo_url: https://github.com/riiid/useraif
  • paper_authors: Soonwoo Kwon, Sojung Kim, Seunghyun Lee, Jin-Young Kim, Suyeong An, Kyuseok Kim
  • for: The paper proposes generating item profiles from response data collected by computerized adaptive testing (CAT) services, to improve the efficiency and accuracy of CAT.
  • methods: Because CAT introduces selection bias (more proficient students receive harder questions), naively training a diagnostic model on CAT response data yields item profiles that deviate significantly from the ground truth; the proposed user-wise aggregate influence function method filters out users whose response data is heavily biased in aggregate, as judged by how much perturbation their data introduces during parameter estimation.
  • results: Extensive experiments on three public datasets and one real-world CAT response dataset demonstrate the superiority of the proposed method over conventional approaches.
    Abstract Computerized Adaptive Testing (CAT) is a widely used, efficient test mode that adapts to the examinee's proficiency level in the test domain. CAT requires pre-trained item profiles: it iteratively assesses the student in real time based on the registered items' profiles and selects the next item to administer using the candidate items' profiles. However, obtaining such item profiles is a costly process that involves gathering large, dense item-response data and then training a diagnostic model on the collected data. In this paper, we explore the possibility of leveraging response data collected in the CAT service. We first show that this poses a unique challenge due to the inherent selection bias introduced by CAT, i.e., more proficient students will receive harder questions. Indeed, when naively training the diagnostic model using CAT response data, we observe that item profiles deviate significantly from the ground truth. To tackle the selection bias issue, we propose the user-wise aggregate influence function method. Our intuition is to filter out users whose response data is heavily biased in an aggregate manner, as judged by how much perturbation the added data will introduce during parameter estimation. This way, we may enhance the performance of CAT while introducing minimal bias to the item profiles. We provide extensive experiments to demonstrate the superiority of our proposed method on three public datasets and one dataset that contains real-world CAT response data.
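
The sketch below illustrates the filtering intuition with a Rasch (1PL) diagnostic model on synthetic responses: each user's influence is approximated by brute-force leave-one-user-out refitting (a stand-in for the paper's aggregate influence function), and the most influential users are dropped before the final fit. Abilities are assumed known for simplicity; everything here is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 50, 20
theta = rng.normal(0, 1, n_users)        # abilities, assumed known here
beta_true = rng.normal(0, 1, n_items)    # ground-truth item difficulties

p_true = 1 / (1 + np.exp(-(theta[:, None] - beta_true[None, :])))
Y = (rng.random((n_users, n_items)) < p_true).astype(float)

def fit_difficulties(Y, mask, iters=300, lr=1.0):
    """Gradient ascent on the masked Rasch likelihood for item difficulties."""
    b = np.zeros(n_items)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        b += lr * ((p - Y) * mask).sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    return b

full = np.ones_like(Y)
b_full = fit_difficulties(Y, full)

influence = []                            # leave-one-user-out parameter shift
for u in range(n_users):
    m = full.copy()
    m[u] = 0
    influence.append(np.linalg.norm(fit_difficulties(Y, m) - b_full))

keep = np.argsort(influence)[: int(0.9 * n_users)]   # drop top-influence users
mask = np.zeros_like(Y)
mask[keep] = 1
b_filtered = fit_difficulties(Y, mask)
# On real CAT logs the dropped users would be the ones whose adaptively
# selected items bias the item profiles the most.
```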

Diagnosing Infeasible Optimization Problems Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12923
  • repo_url: None
  • paper_authors: Hao Chen, Gonzalo E. Constante-Flores, Can Li
  • for: The paper aims to help practitioners understand and interpret optimization models, especially when the models are infeasible, i.e., when no decision satisfies all the constraints.
  • methods: It presents OptiChat, a natural language-based system with a chatbot GUI for interactive conversations that describe an optimization model, identify potential sources of infeasibility, and offer suggestions for making the model feasible.
  • results: Experiments show that OptiChat helps both expert and non-expert users improve their understanding of optimization models and quickly identify the sources of infeasibility.
    Abstract Decision-making problems can be represented as mathematical optimization models, finding wide applications in fields such as economics, engineering and manufacturing, transportation, and health care. Optimization models are mathematical abstractions of the problem of making the best decision while satisfying a set of requirements or constraints. One of the primary barriers to deploying these models in practice is the challenge of helping practitioners understand and interpret such models, particularly when they are infeasible, meaning no decision satisfies all the constraints. Existing methods for diagnosing infeasible optimization models often rely on expert systems, necessitating significant background knowledge in optimization. In this paper, we introduce OptiChat, a first-of-its-kind natural language-based system equipped with a chatbot GUI for engaging in interactive conversations about infeasible optimization models. OptiChat can provide natural language descriptions of the optimization model itself, identify potential sources of infeasibility, and offer suggestions to make the model feasible. The implementation of OptiChat is built on GPT-4, which interfaces with an optimization solver to identify the minimal subset of constraints that render the entire optimization problem infeasible, also known as the Irreducible Infeasible Subset (IIS). We utilize few-shot learning, expert chain-of-thought, key-retrieve, and sentiment prompts to enhance OptiChat's reliability. Our experiments demonstrate that OptiChat assists both expert and non-expert users in improving their understanding of the optimization models, enabling them to quickly identify the sources of infeasibility.

Utilizing Admissible Bounds for Heuristic Learning

  • paper_url: http://arxiv.org/abs/2308.11905
  • repo_url: None
  • paper_authors: Carlos Núñez-Molina, Masataro Asai
  • for: The paper aims to improve the learning of heuristic functions for forward search algorithms with modern machine learning techniques.
  • methods: Admissible heuristics are used as parameters of Truncated Gaussian distributions, which tightens the hypothesis space compared to ordinary Gaussian distributions and faithfully follows the principle of maximum entropy.
  • results: Experiments show that this model yields more accurate heuristics and converges faster during training.
    Abstract While learning a heuristic function for forward search algorithms with modern machine learning techniques has been gaining interest in recent years, there has been little theoretical understanding of \emph{what} they should learn, \emph{how} to train them, and \emph{why} we do so. This lack of understanding leads to various literature performing an ad-hoc selection of datasets (suboptimal vs optimal costs or admissible vs inadmissible heuristics) and optimization metrics (e.g., squared vs absolute errors). Moreover, due to the lack of admissibility of the resulting trained heuristics, little focus has been put on the role of admissibility \emph{during} learning. This paper articulates the role of admissible heuristics in supervised heuristic learning using them as parameters of Truncated Gaussian distributions, which tightens the hypothesis space compared to ordinary Gaussian distributions. We argue that this mathematical model faithfully follows the principle of maximum entropy and empirically show that, as a result, it yields more accurate heuristics and converges faster during training.
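
The modeling idea admits a compact loss: since an admissible heuristic lower-bounds the true cost-to-go, the likelihood of the observed optimal cost can be taken under a Gaussian truncated from below at that bound. The sketch below implements the corresponding negative log-likelihood (up to an additive constant) on synthetic data; the feature dimension, fake admissible bound, and network are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
std_normal = torch.distributions.Normal(0.0, 1.0)

def truncated_gaussian_nll(mu, sigma, lower, target):
    """NLL (up to a constant) of a Gaussian truncated from below at `lower`."""
    z = (target - mu) / sigma
    a = (lower - mu) / sigma
    # log of the probability mass remaining above the truncation point
    log_mass = torch.log(torch.clamp(1.0 - std_normal.cdf(a), min=1e-6))
    return (0.5 * z ** 2 + torch.log(sigma) + log_mass).mean()

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(1000):
    x = torch.randn(64, 4)               # stand-in state features
    lower = x.abs().sum(dim=1)           # fake admissible lower bound
    target = lower + torch.rand(64)      # true cost never below the bound
    out = net(x)
    mu = out[:, 0]
    sigma = nn.functional.softplus(out[:, 1]) + 1e-3
    opt.zero_grad()
    truncated_gaussian_nll(mu, sigma, lower, target).backward()
    opt.step()
```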

Rethinking Data Perturbation and Model Stabilization for Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.11903
  • repo_url: https://github.com/zhenzhao/dpms
  • paper_authors: Zhen Zhao, Ye Liu, Meng Zhao, Di Yin, Yixuan Yuan, Luping Zhou
  • for: The study focuses on improving the performance of semi-supervised medical image segmentation (SSMIS).
  • methods: A simple yet effective approach, DPMS, adopts a plain teacher-student framework with a standard supervised loss and an unsupervised consistency loss; unlabeled data are strongly augmented to generate sufficient prediction disagreement, and forwarding-twice and momentum-updating strategies for normalization statistics stabilize training on unlabeled data.
  • results: DPMS obtains new state-of-the-art performance on the public 2D ACDC and 3D LA datasets across various semi-supervised settings, e.g. a remarkable 22.62% improvement over the previous SOTA on ACDC with 5% labels.
    Abstract Studies on semi-supervised medical image segmentation (SSMIS) have seen fast progress recently. Due to the limited labelled data, SSMIS methods mainly focus on effectively leveraging unlabeled data to enhance the segmentation performance. However, despite their promising performance, current state-of-the-art methods often prioritize integrating complex techniques and loss terms rather than addressing the core challenges of semi-supervised scenarios directly. We argue that the key to SSMIS lies in generating substantial and appropriate prediction disagreement on unlabeled data. To this end, we emphasize the critical roles of data perturbation and model stabilization in semi-supervised segmentation, and propose a simple yet effective approach to boost SSMIS performance significantly, dubbed DPMS. Specifically, we first revisit SSMIS from three distinct perspectives: the data, the model, and the loss, and conduct a comprehensive study of corresponding strategies to examine their effectiveness. Based on these examinations, we then propose DPMS, which adopts a plain teacher-student framework with a standard supervised loss and an unsupervised consistency loss. To produce appropriate prediction disagreement, DPMS perturbs the unlabeled data via strong augmentations. On the other hand, using an EMA teacher when strong augmentation is applied does not necessarily improve performance. DPMS further utilizes forwarding-twice and momentum-updating strategies for normalization statistics to stabilize the training on unlabeled data effectively. Despite its simplicity, DPMS can obtain new state-of-the-art performance on the public 2D ACDC and 3D LA datasets across various semi-supervised settings, e.g. obtaining a remarkable 22.62% improvement against the previous SOTA on ACDC with 5% labels.
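
A minimal sketch of the DPMS ingredients follows: a supervised loss on labeled data, a consistency loss between teacher pseudo-labels and student predictions on strongly perturbed unlabeled data, and a momentum (EMA) teacher update. Gaussian noise stands in for the paper's strong image augmentations, and all sizes are illustrative.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.SGD(student.parameters(), lr=0.01)

def ema_update(teacher, student, m=0.99):
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)   # momentum-updated teacher

for step in range(200):
    xl, yl = torch.randn(8, 16), torch.randint(0, 2, (8,))  # labeled batch
    xu = torch.randn(32, 16)                                # unlabeled batch
    xu_strong = xu + 0.5 * torch.randn_like(xu)             # strong perturbation
    sup = nn.functional.cross_entropy(student(xl), yl)
    with torch.no_grad():
        pseudo = teacher(xu).argmax(dim=1)                  # teacher pseudo-labels
    cons = nn.functional.cross_entropy(student(xu_strong), pseudo)
    loss = sup + cons
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
```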

Shape-conditioned 3D Molecule Generation via Equivariant Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.11890
  • repo_url: None
  • paper_authors: Ziqi Chen, Bo Peng, Srinivasan Parthasarathy, Xia Ning
  • for: The paper targets computational ligand-based drug design: identifying novel drug candidates whose 3D shapes are similar to known active molecules.
  • methods: ShapeMol combines an equivariant shape encoder, which maps molecular surface shapes into latent embeddings, with an equivariant diffusion model that generates 3D molecule structures conditioned on these embeddings, preserving shape similarity to the given molecule.
  • results: Experiments show that ShapeMol generates novel, diverse, drug-like molecules that retain 3D shapes similar to the given shape condition, demonstrating its potential for designing drug candidates of desired 3D shapes binding to protein target pockets.
    Abstract Ligand-based drug design aims to identify novel drug candidates of similar shapes with known active molecules. In this paper, we formulated an in silico shape-conditioned molecule generation problem to generate 3D molecule structures conditioned on the shape of a given molecule. To address this problem, we developed a translation- and rotation-equivariant shape-guided generative model ShapeMol. ShapeMol consists of an equivariant shape encoder that maps molecular surface shapes into latent embeddings, and an equivariant diffusion model that generates 3D molecules based on these embeddings. Experimental results show that ShapeMol can generate novel, diverse, drug-like molecules that retain 3D molecular shapes similar to the given shape condition. These results demonstrate the potential of ShapeMol in designing drug candidates of desired 3D shapes binding to protein target pockets.

Adversarial Training Using Feedback Loops

  • paper_url: http://arxiv.org/abs/2308.11881
  • repo_url: None
  • paper_authors: Ali Haisam Muhammad Rafid, Adrian Sandu
  • for: Defending against adversarial attacks by building deep neural networks that are robust to perturbations of the data.
  • methods: A control-theoretic Feedback Neural Network architecture, whose controller is itself a neural network trained on regular and adversarial data to stabilize the system outputs, together with the corresponding Feedback Looped Adversarial Training (FLAT) method.
  • results: Numerical experiments on standard test problems show that FLAT is more effective than the state of the art at guarding against adversarial attacks.
    Abstract Deep neural networks (DNN) have found wide applicability in numerous fields due to their ability to accurately learn very complex input-output relations. Despite their accuracy and extensive use, DNNs are highly susceptible to adversarial attacks due to limited generalizability. For future progress in the field, it is essential to build DNNs that are robust to any kind of perturbations to the data points. In the past, many techniques have been proposed to robustify DNNs using first-order derivative information of the network. This paper proposes a new robustification approach based on control theory. A neural network architecture that incorporates feedback control, named Feedback Neural Networks, is proposed. The controller is itself a neural network, which is trained using regular and adversarial data such as to stabilize the system outputs. The novel adversarial training approach based on the feedback control architecture is called Feedback Looped Adversarial Training (FLAT). Numerical results on standard test problems empirically show that our FLAT method is more effective than the state-of-the-art to guard against adversarial attacks.
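
To make the closed-loop idea concrete, the sketch below wires a small controller network that maps the current output back into a hidden-state correction over a few feedback iterations. The dimensions, loop count, and controller design are illustrative guesses at the architecture family rather than the paper's exact model; FLAT would additionally alternate clean and adversarially perturbed batches through this forward pass.

```python
import torch
import torch.nn as nn

class FeedbackNet(nn.Module):
    def __init__(self, dim_in=8, dim_hidden=32, n_classes=4, loops=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.head = nn.Linear(dim_hidden, n_classes)
        # controller: maps the current logits back to a state correction
        self.controller = nn.Sequential(nn.Linear(n_classes, dim_hidden),
                                        nn.Tanh())
        self.loops = loops

    def forward(self, x):
        h = self.encoder(x)
        logits = self.head(h)
        for _ in range(self.loops):          # closed-loop refinement
            h = h + self.controller(logits)  # feedback nudges the state
            logits = self.head(h)
        return logits

net = FeedbackNet()
x = torch.randn(5, 8)
print(net(x).shape)   # torch.Size([5, 4])
```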

SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets

  • paper_url: http://arxiv.org/abs/2308.11880
  • repo_url: https://github.com/csimo005/summit
  • paper_authors: Cody Simons, Dripta S. Raychaudhuri, Sk Miraj Ahmed, Suya You, Konstantinos Karydis, Amit K. Roy-Chowdhury
  • for: The paper addresses scene understanding with multi-modal data, needed in applications such as autonomous navigation, in the setting where the original source dataset is unavailable.
  • methods: A switching framework automatically chooses between two complementary cross-modal pseudo-label fusion methods, agreement filtering and entropy weighting, based on the estimated domain gap, to adapt independently trained uni-modal models to an unlabeled multi-modal target domain.
  • results: Experiments on semantic segmentation across seven challenging adaptation scenarios show results comparable to, and in some cases outperforming, methods that assume access to source data, with mIoU improvements of up to 12% over competing baselines.
    Abstract Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion -- agreement filtering and entropy weighting -- based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT.

Cabrita: closing the gap for foreign languages

  • paper_url: http://arxiv.org/abs/2308.11878
  • repo_url: None
  • paper_authors: Celio Larcher, Marcos Piau, Paulo Finardi, Pedro Gengo, Piero Esposito, Vinicius Caridá
  • for: The study targets improving model performance in a specific language or domain while ensuring effective tokenization.
  • methods: To overcome the cost of training from scratch, the proposed Cabrita methodology addresses the performance and efficient tokenization problems at an affordable cost and can be applied to any transformer-like architecture.
  • results: Continuous pre-training exclusively on Portuguese text of a 3-billion-parameter model (OpenLLaMA) produced openCabrita 3B, whose new tokenizer significantly reduces the number of tokens required to represent text; on few-shot learning tasks it achieves results similar to a traditional continuous pre-training approach and to 7B English pre-trained models.
    Abstract The strategy of training the model from scratch in a specific language or domain serves two essential purposes: i) enhancing performance in the particular linguistic or domain context, and ii) ensuring effective tokenization. The main limitation inherent to this approach lies in the associated cost, which can reach six to seven-digit dollar values, depending on the model size and the number of parameters involved. The main solution to overcome the cost challenge is to rely on available pre-trained models, which, despite recent advancements such as the LLaMA and LLaMA-2 models, still demonstrate inefficiency for certain specific domain problems or prove ineffective in scenarios involving conversational memory resources, given the large number of tokens required to represent text. To overcome this issue, we present a methodology named Cabrita, which, as our research demonstrates, successfully addresses the performance and efficient tokenization problem, all at an affordable cost. We believe that this methodology can be applied to any transformer-like architecture model. To validate the study, we conducted continuous pre-training exclusively using Portuguese text on a 3-billion-parameter model known as OpenLLaMA, resulting in a model named openCabrita 3B. The openCabrita 3B also features a new tokenizer that results in a significant reduction in the number of tokens required to represent the text. In our assessment, for few-shot learning tasks, we achieved similar results with this 3B model compared to a traditional continuous pre-training approach as well as to 7B models English pre-trained models.

Integrating Large Language Models into the Debugging C Compiler for generating contextual error explanations

  • paper_url: http://arxiv.org/abs/2308.11873
  • repo_url: https://github.com/comp1511unsw/dcc
  • paper_authors: Andrew Taylor, Alexandra Vassar, Jake Renzella, Hammond Pearce
  • for: The paper uses large language models (LLMs) to generate enhanced compiler error explanations, in simple language, to improve the learning experience for novice programmers.
  • methods: LLM-generated explanations are integrated into DCC (the Debugging C Compiler) to explain errors at both compile time and run time.
  • results: In an expert evaluation, LLM-generated explanations were conceptually accurate for 90% of compile-time errors and 75% of run-time errors, and the DCC-help tool has been increasingly adopted by students, averaging 1047 unique runs per week.
    Abstract This paper introduces a method for Large Language Models (LLM) to produce enhanced compiler error explanations, in simple language, within our Debugging C Compiler (DCC). It is well documented that compiler error messages have been known to present a barrier for novices learning how to program. Although our initial use of DCC in introductory programming (CS1) has been instrumental in teaching C to novice programmers by providing safeguards to commonly occurring errors and translating the usually cryptic compiler error messages at both compile- and run-time, we proposed that incorporating LLM-generated explanations would further enhance the learning experience for novice programmers. Through an expert evaluation, we observed that LLM-generated explanations for compiler errors were conceptually accurate in 90% of compile-time errors, and 75% of run-time errors. Additionally, the new DCC-help tool has been increasingly adopted by students, with an average of 1047 unique runs per week, demonstrating a promising initial assessment of using LLMs to complement compiler output to enhance programming education for beginners. We release our tool as open-source to the community.

Fast Exact NPN Classification with Influence-aided Canonical Form

  • paper_url: http://arxiv.org/abs/2308.12311
  • repo_url: None
  • paper_authors: Yonghe Zhang, Liwei Ni, Jiaxi Zhang, Guojie Luo, Huawei Li, Shenggen Zheng
  • for: The paper speeds up exact NPN classification, which has many applications in the synthesis and verification of digital circuits, by introducing Boolean influence.
  • methods: A novel canonical form and its computation algorithm are proposed by bringing Boolean influence, a basic concept in the analysis of Boolean functions, into NPN classification; influence is input-negation-independent and input-permutation-dependent, and carries structural information beyond previous signatures.
  • results: Experiments show that influence substantially reduces the transformation enumeration when computing the canonical form, yielding up to a 5.5x speedup over the state-of-the-art algorithm implemented in ABC.
    Abstract NPN classification has many applications in the synthesis and verification of digital circuits. The canonical-form-based method is the most common approach: a canonical form is first designed as the representative of each NPN equivalence class, and the transformation function is then computed according to the canonical form. Most works use variable symmetries and several signatures, mainly based on the cofactor, to simplify the canonical form construction and computation. This paper describes a novel canonical form and its computation algorithm by introducing Boolean influence, a basic concept in the analysis of Boolean functions, to NPN classification. We show that influence is input-negation-independent, input-permutation-dependent, and carries structural information beyond previous signatures for NPN classification. Therefore, it is a significant ingredient in speeding up NPN classification. Experimental results prove that influence plays an important role in reducing the transformation enumeration in computing the canonical form. Compared with the state-of-the-art algorithm implemented in ABC, our influence-aided canonical form for exact NPN classification gains up to a 5.5x speedup.
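
The Boolean influence of a variable is simple to compute exhaustively for small functions, and its stated invariances are visible directly: negating inputs or the output leaves each influence unchanged, while permuting inputs only permutes the influence vector, so the sorted vector serves as a cheap NPN-class signature. A small self-contained illustration:

```python
from itertools import product

def influences(f, n):
    """Influence of each variable: fraction of inputs where flipping it flips f."""
    counts = [0] * n
    for bits in product((0, 1), repeat=n):
        v = f(*bits)
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if f(*flipped) != v:
                counts[i] += 1
    return [c / 2 ** n for c in counts]

f = lambda a, b, c: (a and b) or c
g = lambda a, b, c: (c and a) or b   # an input permutation of f
print(sorted(influences(f, 3)))      # [0.25, 0.25, 0.75]
print(sorted(influences(g, 3)))      # same multiset -> same NPN-class candidate
```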

KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods

  • paper_url: http://arxiv.org/abs/2308.11863
  • repo_url: None
  • paper_authors: Antoine Nzeyimana
  • for: The paper aims to achieve robust speech recognition for Kinyarwanda.
  • methods: Self-supervised pre-training, a simple curriculum schedule during fine-tuning, and semi-supervised learning leverage large unlabelled speech data using public domain data only: a new studio-quality dataset collected from a public website trains a clean baseline model, which then ranks examples from a more diverse and noisy public dataset to define the curriculum, and semi-supervised learning labels and learns from large unlabelled data over four successive generations.
  • results: The final model achieves a 3.2% word error rate (WER) on the new dataset and 15.9% WER on the Mozilla Common Voice benchmark, state-of-the-art to the best of the authors' knowledge; experiments also indicate that syllabic rather than character-based tokenization improves Kinyarwanda speech recognition.
    Abstract Despite recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in four successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.9% WER on Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.

Finding the Perfect Fit: Applying Regression Models to ClimateBench v1.0

  • paper_url: http://arxiv.org/abs/2308.11854
  • repo_url: None
  • paper_authors: Anmol Chaure, Ashok Kumar Behera, Sudip Bhattacharya
  • for: The study evaluates non-linear regression models for emulating climate projections, to help policy makers make informed decisions.
  • methods: Data-driven machine learning emulators serve as surrogates for computationally heavy GCM simulators, evaluated on ClimateBench, a recently curated benchmarking dataset for climate data; by leveraging the kernel trick, regression models can capture complex relationships and improve their predictive capabilities.
  • results: Among the three non-linear regression models compared, the Gaussian Process Regressor shows best-in-class performance on the standard evaluation metrics, though at high space and time complexity; Support Vector and Kernel Ridge models deliver competitive results with certain trade-offs, and composite kernels and variational inference are being investigated to further improve performance on complex non-linear patterns, including phenomena like precipitation.
    Abstract Climate projections using data driven machine learning models acting as emulators is one of the prevailing areas of research to enable policy makers to make informed decisions. Use of machine learning emulators as surrogates for computationally heavy GCM simulators reduces time and carbon footprints. In this direction, ClimateBench [1] is a recently curated benchmarking dataset for evaluating the performance of machine learning emulators designed for climate data. Recent studies have reported that despite being considered fundamental, regression models offer several advantages pertaining to climate emulation. In particular, by leveraging the kernel trick, regression models can capture complex relationships and improve their predictive capabilities. This study focuses on evaluating non-linear regression models using the aforementioned dataset. Specifically, we compare the emulation capabilities of three non-linear regression models. Among them, the Gaussian Process Regressor demonstrates best-in-class performance against the standard evaluation metrics used for climate field emulation studies. However, Gaussian Process Regression is resource-hungry in terms of space and time complexity. Alternatively, Support Vector and Kernel Ridge models also deliver competitive results, but there are certain trade-offs to be addressed. Additionally, we are actively investigating the performance of composite kernels and techniques such as variational inference to further enhance the performance of the regression models and effectively model complex non-linear patterns, including phenomena like precipitation.
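
For reference, the three kernel regressors discussed above can be compared with a few lines of scikit-learn. In the paper each would map ClimateBench forcings to climate fields, whereas this sketch uses a synthetic 1-D function purely to show the interfaces involved; all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

models = {
    "GP": GaussianProcessRegressor(),            # default constant * RBF kernel
    "SVR": SVR(kernel="rbf", C=10.0),
    "KRR": KernelRidge(kernel="rbf", alpha=0.1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(name, "RMSE:", round(rmse, 4))
```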

A deep reinforcement learning approach for real-time demand-responsive railway rescheduling to mitigate station overcrowding using mobile data

  • paper_url: http://arxiv.org/abs/2308.11849
  • repo_url: None
  • paper_authors: Enze Liu, Zhiyuan Lin, Judith Y. T. Wang, Hong Chen
  • for: To provide a railway rescheduling method driven by real-time passenger mobility data for long-lasting disruptions caused by severe emergency events such as natural disasters.
  • methods: A deep reinforcement learning (DRL) framework, combined with real-time passenger mobility inferred from mobile data (MD), reschedules trains on multiple routes passing through a heavy-demand station upstream of the disrupted area, determining the rescheduled timetable, route stops, and rolling stock allocation while satisfying real-time demand and avoiding station overcrowding and train overloading.
  • results: The proposed demand-responsive approach addresses the challenges of arriving and departing passenger dynamics, station overcrowding, rolling stock shortage, open-ended disruption duration, integrated rescheduling on multiple routes, and delays due to detours.
    Abstract Real-time railway rescheduling is a timely and flexible technique to automatically alter the operation schedule in response to time-varying conditions. Current research lacks data-driven approaches that capture real-time passenger mobility during railway disruptions, relying mostly on OD-based data and model-based methods for estimating the demands of trains. Meanwhile, the schedule-updating principles for a long-term disruption overlook the uneven distribution of demand over time. To fill this gap, this paper proposes a demand-responsive approach by inferring real-world passenger mobility from mobile data (MD) to facilitate real-time rescheduling. Unlike network-level approaches, this paper focuses on a heavy-demand station upstream of the disrupted area. The objective is to reschedule all trains on multiple routes passing through this target station, which have been affected by a severe emergency event such as a natural disaster. Particular attention should be given to avoiding the accumulation of overcrowded passengers at this station, to prevent additional accidents arising from overcrowding. This research addresses the challenges associated with this scenario, including the dynamics of arriving and departing passengers, station overcrowding, rolling stock shortage, open-ended disruption duration, integrated rescheduling on multiple routes, and delays due to detours. A deep reinforcement learning (DRL) framework is proposed to determine the optimal rescheduled timetable, route stops, and rolling stock allocation, while considering real-time demand satisfaction, station overcrowding, train capacity utilization, and headway safety.

SEA: Shareable and Explainable Attribution for Query-based Black-box Attacks

  • paper_url: http://arxiv.org/abs/2308.11845
  • repo_url: None
  • paper_authors: Yue Gao, Ilia Shumailov, Kassem Fawaz
  • for: This paper provides a novel security system for Machine Learning (ML) systems that characterizes black-box attacks for forensic purposes and facilitates human-explainable intelligence sharing.
  • methods: The proposed system, SEA, leverages Hidden Markov Models (HMMs) to attribute observed query sequences to known attacks, capturing the attack's progression rather than focusing only on the final adversarial examples.
  • results: SEA is effective at attack attribution, even on an attack's second occurrence, and is robust to adaptive strategies designed to evade forensics analysis; its explanations of attack behavior can fingerprint specific minor implementation bugs in attack libraries, and it achieves 90+% Top-1 and 95+% Top-3 accuracy in recognizing the same attack's second occurrence.
    Abstract Machine Learning (ML) systems are vulnerable to adversarial examples, particularly those from query-based black-box attacks. Despite various efforts to detect and prevent such attacks, there is a need for a more comprehensive approach to logging, analyzing, and sharing evidence of attacks. While classic security benefits from well-established forensics and intelligence sharing, Machine Learning is yet to find a way to profile its attackers and share information about them. In response, this paper introduces SEA, a novel ML security system to characterize black-box attacks on ML systems for forensic purposes and to facilitate human-explainable intelligence sharing. SEA leverages the Hidden Markov Models framework to attribute the observed query sequence to known attacks. It thus understands the attack's progression rather than just focusing on the final adversarial examples. Our evaluations reveal that SEA is effective at attack attribution, even on their second occurrence, and is robust to adaptive strategies designed to evade forensics analysis. Interestingly, SEA's explanations of the attack behavior allow us even to fingerprint specific minor implementation bugs in attack libraries. For example, we discover that the SignOPT and Square attacks implementation in ART v1.14 sends over 50% specific zero difference queries. We thoroughly evaluate SEA on a variety of settings and demonstrate that it can recognize the same attack's second occurrence with 90+% Top-1 and 95+% Top-3 accuracy.
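
The attribution mechanism reduces to scoring an observed (discretized) query sequence under per-attack HMMs and picking the most likely family. The sketch below implements the log-domain forward algorithm in NumPy with two fabricated toy HMMs; the alphabet, parameters, and trace are illustrative assumptions, not SEA's fitted models.

```python
import numpy as np

def log_forward(obs, pi, A, B):
    """Log-likelihood of an observation sequence under an HMM (pi, A, B)."""
    alpha = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        alpha = (np.logaddexp.reduce(alpha[:, None] + np.log(A), axis=0)
                 + np.log(B[:, o]))
    return np.logaddexp.reduce(alpha)

# Two toy "attack family" HMMs over a 3-symbol query-feature alphabet.
attacks = {
    "attack_A": (np.array([0.9, 0.1]),
                 np.array([[0.8, 0.2], [0.3, 0.7]]),
                 np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])),
    "attack_B": (np.array([0.5, 0.5]),
                 np.array([[0.5, 0.5], [0.5, 0.5]]),
                 np.array([[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]])),
}
trace = [0, 0, 1, 0, 2, 1, 0]   # observed, discretized query sequence
scores = {name: log_forward(trace, *hmm) for name, hmm in attacks.items()}
print(max(scores, key=scores.get), scores)
```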

${\rm E}(3)$-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.11842
  • repo_url: https://github.com/dchen48/e3ac
  • paper_authors: Dingyang Chen, Qi Zhang
  • for: 本研究旨在利用合作多智能体强化学习(MARL)问题中固有的欧氏对称性,以提升多智能体在各类应用中的合作行为。
  • methods: 我们采用嵌入对称约束作为归纳偏置的神经网络结构,用于多智能体actor-critic方法。
  • results: 实验结果表明,在多种合作MARL基准测试中,我们的方法能够取得更高的性能和更好的泛化能力,包括零样本学习和迁移学习。
    Abstract Identification and analysis of symmetrical patterns in the natural world have led to significant discoveries across various scientific fields, such as the formulation of gravitational laws in physics and advancements in the study of chemical structures. In this paper, we focus on exploiting Euclidean symmetries inherent in certain cooperative multi-agent reinforcement learning (MARL) problems and prevalent in many applications. We begin by formally characterizing a subclass of Markov games with a general notion of symmetries that admits the existence of symmetric optimal values and policies. Motivated by these properties, we design neural network architectures with symmetric constraints embedded as an inductive bias for multi-agent actor-critic methods. This inductive bias results in superior performance in various cooperative MARL benchmarks and impressive generalization capabilities such as zero-shot learning and transfer learning in unseen scenarios with repeated symmetric patterns. The code is available at: https://github.com/dchen48/E3AC.
    摘要 对自然界中对称模式的识别与分析,曾在多个科学领域带来重要发现,例如物理学中万有引力定律的提出以及化学结构研究的进展。本文着眼于利用某些合作多智能体强化学习(MARL)问题中固有、且在许多应用中普遍存在的欧氏对称性。我们首先形式化刻画了一类具有一般对称性概念的马尔可夫博弈,证明其存在对称的最优值函数与策略。受这些性质启发,我们设计了嵌入对称约束作为归纳偏置的神经网络结构,用于多智能体actor-critic方法。这种归纳偏置在多种合作MARL基准上带来更优的性能,并展现出零样本学习和迁移学习等出色的泛化能力,可推广到具有重复对称模式的未见场景。代码见:https://github.com/dchen48/E3AC。
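    A minimal way to see the benefit of building Euclidean symmetry into the critic is to feed it only E(3)-invariant quantities such as pairwise distances. The sketch below is a simplification (the paper constructs fully equivariant actor-critic architectures, not just an invariant critic); it checks that the value is unchanged under a random rotation/reflection plus translation of the agents:

```python
import torch

def invariant_features(pos):
    # pos: (n_agents, 3) absolute agent coordinates
    dist = (pos[:, None, :] - pos[None, :, :]).norm(dim=-1)
    iu = torch.triu_indices(pos.shape[0], pos.shape[0], offset=1)
    return dist[iu[0], iu[1]]          # pairwise distances: E(3)-invariant

critic = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, 1))

pos = torch.randn(3, 3)                     # 3 agents in 3-D -> 3 pairwise distances
Q, _ = torch.linalg.qr(torch.randn(3, 3))   # random orthogonal transform
v_orig = critic(invariant_features(pos))
v_moved = critic(invariant_features(pos @ Q.T + 1.0))  # rotated/reflected + shifted
print(torch.allclose(v_orig, v_moved, atol=1e-4))      # True: value is invariant
```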

A Survey for Federated Learning Evaluations: Goals and Measures

  • paper_url: http://arxiv.org/abs/2308.11841
  • repo_url: None
  • paper_authors: Di Chai, Leye Wang, Liu Yang, Junxue Zhang, Kai Chen, Qiang Yang
  • for: 评估 Federated Learning(FL)系统的用途,包括评估FL的实用性、效率和安全性。
  • methods: 论文综述了现有研究中采用的主要评估目标,以及对每个目标的评估指标。同时,文章还介绍了FedEval,一个开源的评估平台,可以为FL算法提供标准化和全面的评估框架。
  • results: 文章评估了FL算法的实用性、效率和安全性,并提出了一些评估挑战和未来研究方向。
    Abstract Evaluation is a systematic approach to assessing how well a system achieves its intended purpose. Federated learning (FL) is a novel paradigm for privacy-preserving machine learning that allows multiple parties to collaboratively train models without sharing sensitive data. However, evaluating FL is challenging due to its interdisciplinary nature and diverse goals, such as utility, efficiency, and security. In this survey, we first review the major evaluation goals adopted in the existing studies and then explore the evaluation metrics used for each goal. We also introduce FedEval, an open-source platform that provides a standardized and comprehensive evaluation framework for FL algorithms in terms of their utility, efficiency, and security. Finally, we discuss several challenges and future research directions for FL evaluation.
    摘要 评估是一种系统的方法,用于评估系统是否达到其目标。联邦学习(FL)是一种新的隐私保护机器学习方法,允许多方共同训练模型而无需分享敏感数据。然而,评估FL具有多种目标和特点,如实用性、效率和安全性,这使得评估变得具有挑战性。在这篇文章中,我们首先评审了现有研究中采用的主要评估目标,然后探讨每个目标的评估指标。我们还介绍了FedEval,一个开源平台,提供了对FL算法的标准化和完整的评估框架,包括实用性、效率和安全性。最后,我们讨论了FL评估中的一些挑战和未来研究方向。

A Benchmark Study on Calibration

  • paper_url: http://arxiv.org/abs/2308.11838
  • repo_url: https://github.com/Aryia-Behroziuan/history1
  • paper_authors: Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Chang Xu
  • for: This paper aims to explore calibration properties within Neural Architecture Search (NAS) and answer several longstanding questions in the field, such as whether model calibration can be generalized across different tasks, whether robustness can be used as a calibration measurement, and how calibration interacts with accuracy.
  • methods: The paper leverages the NAS search space to create a model calibration dataset that evaluates 117,702 unique neural networks across 90 bin-based and 12 additional calibration measurements. The authors use this dataset to explore calibration properties and answer the aforementioned questions.
  • results: The paper provides a comprehensive analysis of calibration properties within NAS, including the impact of bin size on calibration measurement and the beneficial architectural designs for calibration. The authors also explore the relationship between calibration and accuracy, and investigate the robustness of calibration metrics.
    Abstract Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through data preprocessing, the use of specific loss functions, and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different tasks? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.
    摘要 深度神经网络在各类机器学习任务中日益普及。然而,随着模型复杂度的增加,即使预测精度有所提升,它们也常常面临校准问题。许多研究尝试通过数据预处理、特定损失函数和训练框架来改善校准表现,但对校准性质本身的研究相对不足。我们的研究利用神经结构搜索(NAS)的搜索空间,提供了详尽的模型结构空间,以便全面探索校准性质。我们专门构建了一个模型校准数据集:在广泛使用的NATS-Bench搜索空间中,对117,702个不同的神经网络评估了90种基于分箱的校准度量和12种附加度量。基于该数据集,我们的分析旨在回答领域内的若干长期问题:(i)模型校准能否在不同任务间泛化?(ii)鲁棒性能否作为校准度量?(iii)校准指标有多可靠?(iv)事后校准方法是否对所有模型产生一致影响?(v)校准与精度如何相互作用?(vi)分箱大小对校准度量有何影响?(vii)哪些结构设计有利于校准?此外,我们的研究填补了在NAS中探索校准的空白。据我们所知,这是首个针对校准性质的大规模研究,也是NAS中校准问题的首项研究。
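    For reference, the bin-based calibration measurements the dataset is built on generalize the standard expected calibration error (ECE). A minimal NumPy version, with synthetic over-confident predictions standing in for a real model, is:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-based ECE: |accuracy - confidence| per bin, weighted by bin mass."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 10_000)
correct = rng.uniform(size=conf.size) < conf ** 1.5   # slightly over-confident model
print(f"ECE = {expected_calibration_error(conf, correct):.4f}")
```

Varying `n_bins` here makes question (vi), the impact of bin size on the measurement, directly observable.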

Characterizing normal perinatal development of the human brain structural connectivity

  • paper_url: http://arxiv.org/abs/2308.11836
  • repo_url: None
  • paper_authors: Yihan Wu, Lana Vasung, Camilo Calixto, Ali Gholipour, Davood Karimi
  • for: 这项研究旨在刻画围产期人脑结构连接组的发育趋势,以及结构连接在脑发育及环境因素影响下的作用。
  • methods: 研究采用基于时空平均的计算框架,为围产期脑结构连接组指标确定规范化基线。
  • results: 研究发现,在孕后33-44周期间,脑结构连接组呈现清晰而强烈的发育趋势:全局与局部效率增加、特征路径长度减少,脑叶内部及跨脑叶、跨半球的连接普遍增强。
    Abstract Early brain development is characterized by the formation of a highly organized structural connectome. The interconnected nature of this connectome underlies the brain's cognitive abilities and influences its response to diseases and environmental factors. Hence, quantitative assessment of structural connectivity in the perinatal stage is useful for studying normal and abnormal neurodevelopment. However, estimation of the connectome from diffusion MRI data involves complex computations. For the perinatal period, these computations are further challenged by the rapid brain development and imaging difficulties. Combined with high inter-subject variability, these factors make it difficult to chart the normal development of the structural connectome. As a result, there is a lack of reliable normative baselines of structural connectivity metrics at this critical stage in brain development. In this study, we developed a computational framework, based on spatio-temporal averaging, for determining such baselines. We used this framework to analyze the structural connectivity between 33 and 44 postmenstrual weeks using data from 166 subjects. Our results unveiled clear and strong trends in the development of structural connectivity in perinatal stage. Connection weighting based on fractional anisotropy and neurite density produced the most consistent results. We observed increases in global and local efficiency, a decrease in characteristic path length, and widespread strengthening of the connections within and across brain lobes and hemispheres. We also observed asymmetry patterns that were consistent between different connection weighting approaches. The new computational method and results are useful for assessing normal and abnormal development of the structural connectome early in life.
    摘要 早期脑发育的特征是形成高度组织化的结构连接组。连接组的互联特性是大脑认知能力的基础,并影响大脑对疾病和环境因素的反应。因此,对围产期结构连接性的定量评估,有助于研究正常与异常的神经发育。然而,从扩散MRI数据估计连接组涉及复杂的计算;在围产期,快速的脑发育和成像困难使这些计算更加困难。再加上较大的个体间差异,这些因素使得难以刻画结构连接组的正常发育轨迹,导致在脑发育的这一关键阶段缺乏可靠的结构连接指标规范化基线。本研究开发了一个基于时空平均的计算框架来确定此类基线,并利用166名受试者的数据分析了孕后33-44周的结构连接。结果揭示了围产期结构连接发育中清晰而强烈的趋势:基于分数各向异性和神经突密度的连接加权得到了最一致的结果;我们观察到全局与局部效率增加、特征路径长度减少,以及脑叶内外和两半球之间连接的广泛增强;不同连接加权方法得到的不对称模式也相互一致。这一新的计算方法与结果有助于评估生命早期结构连接组的正常与异常发育。
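    The reported trends are standard graph-theoretic measures; with NetworkX they can be computed directly on a thresholded connectome matrix. The random matrix below is only a stand-in for a real structural connectivity matrix:

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((20, 20))
W = (W + W.T) / 2                    # toy symmetric "connectome" matrix
A = (W > 0.6).astype(int)            # threshold connection weights
np.fill_diagonal(A, 0)
G = nx.from_numpy_array(A)

print("global efficiency :", nx.global_efficiency(G))
print("local efficiency  :", nx.local_efficiency(G))
if nx.is_connected(G):
    print("char. path length :", nx.average_shortest_path_length(G))
```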

Performance Comparison and Implementation of Bayesian Variants for Network Intrusion Detection

  • paper_url: http://arxiv.org/abs/2308.11834
  • repo_url: None
  • paper_authors: Tosin Ige, Christopher Kiekintveld
  • for: 本研究旨在对朴素贝叶斯分类器的各个变体在网络入侵异常检测中的表现进行比较研究,并考察每个变体的分布假设与其性能之间的关联。
  • methods: 本研究实现并比较了朴素贝叶斯的多个变体,包括Multinomial、Bernoulli和Gaussian。
  • results: 实验结果表明,Bernoulli的测试准确率为69.9%(训练集71%),Multinomial为31.2%(训练集31.2%),Gaussian为81.69%(训练集82.84%)。进一步分析发现,各Naive Bayes变体的性能取决于其假设:Gaussian分类器表现最佳,因为它假设特征服从连续的正态分布;而Multinomial分类器表现不佳,因为它简单地假设离散的多项式分布。
    Abstract Bayesian classifiers perform well when each of the features is completely independent of the others, an assumption that does not always hold in real-world applications. The aim of this study is to implement and compare the performance of each variant of the Bayesian classifier (Multinomial, Bernoulli, and Gaussian) on anomaly detection in network intrusion, and to investigate whether there is any association between each variant's assumption and its performance. Our investigation showed that each variant of the Bayesian algorithm blindly follows its assumption regardless of feature properties, and that the assumption is the single most important factor influencing accuracy. Experimental results show that Bernoulli has a test accuracy of 69.9% (71% train), Multinomial has a test accuracy of 31.2% (31.2% train), while Gaussian has a test accuracy of 81.69% (82.84% train). Going deeper, we found that each Naive Bayes variant's performance is largely due to its assumption: the Gaussian classifier performed best on anomaly detection because it assumes features follow continuous normal distributions, while the Multinomial classifier performed dismally because it simply assumes a discrete multinomial distribution.
    摘要 当每个特征彼此完全独立时,贝叶斯分类器表现良好,但这一假设在实际应用中并不总是成立。本研究的目的是实现并比较贝叶斯分类器各变体(Multinomial、Bernoulli和Gaussian)在网络入侵异常检测中的性能,并考察各变体假设与其性能之间的关联。我们的分析发现,各贝叶斯变体都会盲目遵循自身假设而不考虑特征的性质,且假设是影响准确率的最重要因素。实验结果表明,Bernoulli的测试准确率为69.9%(训练集71%),Multinomial为31.2%(训练集31.2%),Gaussian为81.69%(训练集82.84%)。进一步分析发现,各Naive Bayes变体的性能主要取决于其假设:Gaussian分类器在异常检测中表现最佳,因为它假设特征服从连续的正态分布,而Multinomial分类器表现不佳,因为它简单地假设离散的多项式分布。
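    The comparison can be reproduced in spirit with scikit-learn; the snippet below uses a synthetic imbalanced dataset as a stand-in for a network-intrusion dataset (scaling to [0, 1] is needed because MultinomialNB requires non-negative inputs, and BernoulliNB additionally binarizes its inputs):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic imbalanced data as a stand-in for a network-intrusion dataset.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X = MinMaxScaler().fit_transform(X)   # MultinomialNB requires non-negative input
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (GaussianNB(), BernoulliNB(), MultinomialNB()):
    clf.fit(Xtr, ytr)   # each variant imposes its own distributional assumption
    print(f"{type(clf).__name__:>13}: test accuracy = {clf.score(Xte, yte):.3f}")
```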

Exploring the Effectiveness of GPT Models in Test-Taking: A Case Study of the Driver’s License Knowledge Test

  • paper_url: http://arxiv.org/abs/2308.11827
  • repo_url: None
  • paper_authors: Saba Rahimi, Tucker Balch, Manuela Veloso
  • for: 这项研究的目的是使用不在模型训练数据中包含的信息源来使GPT模型回答问题。
  • methods: 该方法包括上下文信息的预处理、查询与上下文嵌入的构建、基于上下文嵌入整合的提示构造,以及使用GPT模型生成答案。
  • results: 在以加州驾驶手册为信息源的受控测试场景中,GPT-3模型在50道驾驶知识样题上达到96%的通过率,而无上下文时的通过率为82%。然而,即使提供了上下文,模型仍无法正确回答部分问题,显示仍有改进空间。研究还考察了提示长度和上下文格式对模型表现的影响。总体而言,这项研究为GPT模型在问答任务中的局限与改进空间提供了洞见。
    Abstract Large language models such as Open AI's Generative Pre-trained Transformer (GPT) models are proficient at answering questions, but their knowledge is confined to the information present in their training data. This limitation renders them ineffective when confronted with questions about recent developments or non-public documents. Our research proposes a method that enables GPT models to answer questions by employing context from an information source not previously included in their training data. The methodology includes preprocessing of contextual information, the embedding of contexts and queries, constructing prompt through the integration of context embeddings, and generating answers using GPT models. We applied this method in a controlled test scenario using the California Driver's Handbook as the information source. The GPT-3 model achieved a 96% passing score on a set of 50 sample driving knowledge test questions. In contrast, without context, the model's passing score fell to 82%. However, the model still fails to answer some questions correctly even with providing library of context, highlighting room for improvement. The research also examined the impact of prompt length and context format, on the model's performance. Overall, the study provides insights into the limitations and potential improvements for GPT models in question-answering tasks.
    摘要 大型语言模型(如OpenAI的生成式预训练Transformer,GPT)擅长回答问题,但其知识仅限于训练数据中的信息。这一限制使其在面对最新进展或非公开文档的问题时无能为力。我们的研究提出一种方法,使GPT模型能够利用其训练数据之外的信息源中的上下文来回答问题。该方法包括对上下文信息进行预处理、为查询和上下文构建嵌入、基于上下文嵌入整合构造提示,以及使用GPT模型生成答案。我们在一个受控测试场景中应用该方法,以加州驾驶手册作为信息源。GPT-3模型在50道驾驶知识样题上取得96%的通过分数;相比之下,无上下文时通过分数降至82%。然而,即使提供了上下文库,模型仍无法正确回答部分问题,显示仍有改进空间。研究还考察了提示长度和上下文格式对模型表现的影响。总体而言,本研究为GPT模型在问答任务中的局限与改进方向提供了洞见。
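    The pipeline (preprocess context, embed query and context, build a prompt from the retrieved context) can be sketched as follows; TF-IDF similarity stands in here for the paper's embedding model, and the handbook chunks are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

handbook_chunks = [   # stand-in for pre-processed handbook passages
    "You must stop completely at a stop sign before the limit line.",
    "The speed limit in a residential area is 25 mph unless posted.",
    "Use your horn only to avoid collisions, not to show anger.",
]
question = "What is the default speed limit in a residential district?"

vec = TfidfVectorizer().fit(handbook_chunks + [question])
sims = cosine_similarity(vec.transform([question]),
                         vec.transform(handbook_chunks))[0]
context = handbook_chunks[sims.argmax()]   # best-matching passage

prompt = (f"Answer using only the context below.\n"
          f"Context: {context}\nQuestion: {question}\nAnswer:")
print(prompt)   # this prompt would then be sent to the GPT model
```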

Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks

  • paper_url: http://arxiv.org/abs/2308.11825
  • repo_url: https://github.com/xiexi1990/iccad-accel-gnn
  • paper_authors: Xi Xie, Hongwu Peng, Amit Hasan, Shaoyi Huang, Jiahui Zhao, Haowen Fang, Wei Zhang, Tong Geng, Omer Khan, Caiwen Ding
  • for: 提升GCN的计算效率和多级存储效率,以更好地从图数据中提取潜在信息。
  • methods: 使用轻量级的度排序、块级划分策略和合并warp策略,以提升共享内存局部性和负载均衡,并通过利用访存合并与对齐来优化内存带宽。
  • results: 对于18个 benchmark 图,Accel-GCN比cuSPARSE、GNNAdvisor和graph-BLAST高效性提高1.17倍、1.86倍和2.94倍。
    Abstract Graph Convolutional Networks (GCNs) are pivotal in extracting latent information from graph data across various domains, yet their acceleration on mainstream GPUs is challenged by workload imbalance and memory access irregularity. To address these challenges, we present Accel-GCN, a GPU accelerator architecture for GCNs. The design of Accel-GCN encompasses: (i) a lightweight degree sorting stage to group nodes with similar degree; (ii) a block-level partition strategy that dynamically adjusts warp workload sizes, enhancing shared memory locality and workload balance, and reducing metadata overhead compared to designs like GNNAdvisor; (iii) a combined warp strategy that improves memory coalescing and computational parallelism in the column dimension of dense matrices. Utilizing these principles, we formulated a kernel for sparse matrix multiplication (SpMM) in GCNs that employs block-level partitioning and combined warp strategy. This approach augments performance and multi-level memory efficiency and optimizes memory bandwidth by exploiting memory coalescing and alignment. Evaluation of Accel-GCN across 18 benchmark graphs reveals that it outperforms cuSPARSE, GNNAdvisor, and graph-BLAST by factors of 1.17 times, 1.86 times, and 2.94 times respectively. The results underscore Accel-GCN as an effective solution for enhancing GCN computational efficiency.
    摘要 图卷积网络(GCNs)在各领域中对图数据潜在信息的提取起着关键作用,但其在主流GPU上的加速面临负载不均衡和访存不规则的挑战。为解决这些问题,我们提出了Accel-GCN,一种面向GCN的GPU加速器架构。Accel-GCN的设计包括:1. 轻量级的度排序阶段,将度数相近的节点分组;2. 块级划分策略,可动态调整warp工作负载大小,提升共享内存局部性和负载均衡,并相比GNNAdvisor等设计减少元数据开销;3. 合并warp策略,提升稠密矩阵列维度上的访存合并与计算并行性。基于这些原则,我们为GCN中的稀疏矩阵乘法(SpMM)实现了采用块级划分和合并warp策略的kernel。该方法提升了性能与多级内存效率,并通过访存合并与对齐优化内存带宽。在18个基准图上的评估显示,Accel-GCN相比cuSPARSE、GNNAdvisor和graph-BLAST分别提速1.17倍、1.86倍和2.94倍。这些结果表明Accel-GCN是提升GCN计算效率的有效方案。

PatchBackdoor: Backdoor Attack against Deep Neural Networks without Model Modification

  • paper_url: http://arxiv.org/abs/2308.11822
  • repo_url: https://github.com/xaiveryuan/patchbackdoor
  • paper_authors: Yizhen Yuan, Rui Kong, Shenghao Xie, Yuanchun Li, Yunxin Liu
  • for: 展示在安全关键场景下,无需修改模型即可对深度学习系统实施后门攻击。
  • methods: 使用一个特制的贴片(即后门贴片),将其放置在摄像头前方,使其与输入图像一同被送入模型。
  • results: 在实验中,patchbackdoor可以在常见的深度学习模型(VGG、MobileNet、ResNet)上实现攻击成功率为93%到99%。此外,我们在实际应用中也证明了攻击的可行性。
    Abstract Backdoor attack is a major threat to deep learning systems in safety-critical scenarios, which aims to trigger misbehavior of neural network models under attacker-controlled conditions. However, most backdoor attacks have to modify the neural network models through training with poisoned data and/or direct model editing, which leads to a common but false belief that backdoor attack can be easily avoided by properly protecting the model. In this paper, we show that backdoor attacks can be achieved without any model modification. Instead of injecting backdoor logic into the training data or the model, we propose to place a carefully-designed patch (namely backdoor patch) in front of the camera, which is fed into the model together with the input images. The patch can be trained to behave normally at most of the time, while producing wrong prediction when the input image contains an attacker-controlled trigger object. Our main techniques include an effective training method to generate the backdoor patch and a digital-physical transformation modeling method to enhance the feasibility of the patch in real deployments. Extensive experiments show that PatchBackdoor can be applied to common deep learning models (VGG, MobileNet, ResNet) with an attack success rate of 93% to 99% on classification tasks. Moreover, we implement PatchBackdoor in real-world scenarios and show that the attack is still threatening.
    摘要 后门攻击是安全关键场景下深度学习系统面临的主要威胁,其目标是在攻击者控制的条件下触发神经网络模型的异常行为。然而,大多数后门攻击必须通过投毒数据训练和/或直接修改模型来植入后门逻辑,这导致一种常见但错误的看法:只要妥善保护模型就能轻易避免后门攻击。在本文中,我们证明无需修改模型也能实现后门攻击。我们不向训练数据或模型注入后门逻辑,而是提出在摄像头前放置一个精心设计的贴片(即后门贴片),它与输入图像一同被送入模型。该贴片经训练后,在大多数时间内表现正常,而当输入图像中出现攻击者控制的触发物体时便产生错误预测。我们的主要技术包括一种生成后门贴片的有效训练方法,以及一种数字-物理变换建模方法,以增强贴片在真实部署中的可行性。大量实验表明,PatchBackdoor可应用于常见深度学习模型(VGG、MobileNet、ResNet),在分类任务上的攻击成功率为93%到99%。此外,我们在真实场景中实现了PatchBackdoor,表明该攻击仍具威胁性。

Mitigating Health Disparity on Biased Electronic Health Records via Deconfounder

  • paper_url: http://arxiv.org/abs/2308.11819
  • repo_url: None
  • paper_authors: Zheng Liu, Xiaohan Li, Philip Yu
  • for: 这篇论文旨在解决临床数据建模中的公平性问题,特别是在电子健康记录(EHR)上。由于EHR具有复杂的潜在结构和可能的选择偏差,需要在缓解健康差异的同时保持模型的总体准确性。
  • methods: 本论文提出了一个名为“公平纵向医疗去混杂模型”(FLMD)的新模型,旨在在纵向EHR建模中兼顾公平性和总体准确性。借鉴去混杂因子(deconfounder)理论,FLMD采用两阶段训练过程:第一阶段捕捉每次就诊的未观测混杂因子,它们代表观测EHR之外的潜在医疗因素,例如患者基因型和生活习惯;第二阶段将学到的潜在表示与其他相关特征结合进行预测。通过引入合适的公平性准则(如反事实公平),FLMD在保证高预测精度的同时最小化健康差异。
  • results: 实验结果表明,在两个真实世界EHR数据集上,FLMD在公平性和准确性方面均优于基线方法和FLMD变体。此外,FLMD在扰动/不平衡及合成数据集上同样表现出色,显示其在不同设定下的优越性。
    Abstract The fairness issue of clinical data modeling, especially on Electronic Health Records (EHRs), is of utmost importance due to EHR's complex latent structure and potential selection bias. It is frequently necessary to mitigate health disparity while keeping the model's overall accuracy in practice. However, traditional methods often encounter the trade-off between accuracy and fairness, as they fail to capture the underlying factors beyond observed data. To tackle this challenge, we propose a novel model called Fair Longitudinal Medical Deconfounder (FLMD) that aims to achieve both fairness and accuracy in longitudinal Electronic Health Records (EHR) modeling. Drawing inspiration from the deconfounder theory, FLMD employs a two-stage training process. In the first stage, FLMD captures unobserved confounders for each encounter, which effectively represents underlying medical factors beyond observed EHR, such as patient genotypes and lifestyle habits. This unobserved confounder is crucial for addressing the accuracy/fairness dilemma. In the second stage, FLMD combines the learned latent representation with other relevant features to make predictions. By incorporating appropriate fairness criteria, such as counterfactual fairness, FLMD ensures that it maintains high prediction accuracy while simultaneously minimizing health disparities. We conducted comprehensive experiments on two real-world EHR datasets to demonstrate the effectiveness of FLMD. Apart from the comparison of baseline methods and FLMD variants in terms of fairness and accuracy, we assessed the performance of all models on disturbed/imbalanced and synthetic datasets to showcase the superiority of FLMD across different settings and provide valuable insights into its capabilities.
    摘要 临床数据建模中的公平性问题至关重要,尤其是在电子健康记录(EHR)上,因为EHR具有复杂的潜在结构和可能的选择偏差。在实践中,常常需要在缓解健康差异的同时保持模型的总体准确性。然而,传统方法往往陷入准确性与公平性之间的权衡,因为它们无法捕捉观测数据之外的潜在因素。为解决这一挑战,我们提出一种名为公平纵向医疗去混杂模型(FLMD)的新模型,旨在在纵向EHR建模中同时实现公平性与准确性。受去混杂因子理论启发,FLMD采用两阶段训练过程。在第一阶段,FLMD捕捉每次就诊的未观测混杂因子,它们有效地代表观测EHR之外的潜在医疗因素,如患者基因型和生活习惯;这些未观测混杂因子是化解准确性/公平性两难的关键。在第二阶段,FLMD将学到的潜在表示与其他相关特征结合进行预测。通过引入合适的公平性准则(如反事实公平),FLMD在保持高预测精度的同时最小化健康差异。我们在两个真实世界EHR数据集上进行了全面实验以验证FLMD的有效性。除了在公平性与准确性方面对比基线方法和FLMD变体外,我们还评估了所有模型在扰动/不平衡及合成数据集上的表现,以展示FLMD在不同设定下的优越性并提供有价值的洞见。

Incorporating Nonlocal Traffic Flow Model in Physics-informed Neural Networks

  • paper_url: http://arxiv.org/abs/2308.11818
  • repo_url: None
  • paper_authors: Archie J. Huang, Animesh Biswas, Shaurya Agarwal
  • for: 提高交通状况估算精度,提高交通管理策略效果
  • methods: 利用非本地LWR模型,physics-informed深度学习框架,提高交通状况估算精度
  • results: 比基eline方法提高了交通状况估算精度,能够更好地支持交通管理策略
    Abstract This research contributes to the advancement of traffic state estimation methods by leveraging the benefits of the nonlocal LWR model within a physics-informed deep learning framework. The classical LWR model, while useful, falls short of accurately representing real-world traffic flows. The nonlocal LWR model addresses this limitation by considering the speed as a weighted mean of the downstream traffic density. In this paper, we propose a novel PIDL framework that incorporates the nonlocal LWR model. We introduce both fixed-length and variable-length kernels and develop the required mathematics. The proposed PIDL framework undergoes a comprehensive evaluation, including various convolutional kernels and look-ahead windows, using data from the NGSIM and CitySim datasets. The results demonstrate improvements over the baseline PIDL approach using the local LWR model. The findings highlight the potential of the proposed approach to enhance the accuracy and reliability of traffic state estimation, enabling more effective traffic management strategies.
    摘要 本研究通过在物理信息深度学习框架中引入非局部LWR模型的优势,推动了交通状态估计方法的发展。经典LWR模型虽有用,但难以准确刻画真实交通流。非局部LWR模型通过将速度视为下游交通密度的加权平均来弥补这一不足。本文提出一种融合非局部LWR模型的新型PIDL框架,引入定长与变长两类卷积核并推导了所需的数学形式。我们使用NGSIM和CitySim数据集,对多种卷积核与前视窗口进行了全面评估。结果表明,该方法优于基于局部LWR模型的基线PIDL方法。这些发现凸显了所提方法在提升交通状态估计精度与可靠性方面的潜力,有助于实现更有效的交通管理策略。
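    For orientation, the local LWR model and its nonlocal variant can be written as follows (one generic form; the paper's exact kernel parameterization may differ):

```latex
% Local LWR conservation law: density rho, speed-density relation v(rho)
\partial_t \rho(x,t) + \partial_x\!\big(\rho(x,t)\, v(\rho(x,t))\big) = 0 .
% Nonlocal LWR: the speed responds to a weighted mean of the *downstream*
% density, via a kernel w supported on a look-ahead window [0, eta]:
\partial_t \rho + \partial_x\!\Big(\rho\, v\big((\rho \ast w)(x,t)\big)\Big) = 0 ,
\qquad (\rho \ast w)(x,t) = \int_x^{x+\eta} \rho(y,t)\, w(y-x)\, \mathrm{d}y .
```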

Evaluation of Deep Neural Operator Models toward Ocean Forecasting

  • paper_url: http://arxiv.org/abs/2308.11814
  • repo_url: None
  • paper_authors: Ellery Rajagopal, Anantha N. S. Babu, Tony Ryu, Patrick J. Haley Jr., Chris Mirabito, Pierre F. J. Lermusiaux
  • for: 研究深度学习模型在时间序列数据预测中的适用性,特别是在大气和海洋领域,以及更广泛的流体研究社区。
  • methods: 使用数据驱动的深度学习模型来再现和预测经典流体流动以及真实海洋动力学的模拟。
  • results: 训练后的深度神经算子模型能够预测理想化的周期性涡旋脱落;对真实海洋表层环流的初步研究也显示出一定的能力和潜力。
    Abstract Data-driven, deep-learning modeling frameworks have been recently developed for forecasting time series data. Such machine learning models may be useful in multiple domains including the atmospheric and oceanic ones, and in general, the larger fluids community. The present work investigates the possible effectiveness of such deep neural operator models for reproducing and predicting classic fluid flows and simulations of realistic ocean dynamics. We first briefly evaluate the capabilities of such deep neural operator models when trained on a simulated two-dimensional fluid flow past a cylinder. We then investigate their application to forecasting ocean surface circulation in the Middle Atlantic Bight and Massachusetts Bay, learning from high-resolution data-assimilative simulations employed for real sea experiments. We confirm that trained deep neural operator models are capable of predicting idealized periodic eddy shedding. For realistic ocean surface flows and our preliminary study, they can predict several of the features and show some skill, providing potential for future research and applications.
    摘要 近年来,人们开发了数据驱动的深度学习建模框架用于时间序列数据预测。此类机器学习模型可能在多个领域发挥作用,包括大气与海洋领域,以及更广泛的流体研究社区。本文研究此类深度神经算子模型在再现和预测经典流体流动与真实海洋动力学模拟方面的潜在有效性。我们首先简要评估了此类模型在模拟二维圆柱绕流上训练后的能力;随后探索了其在预测Middle Atlantic Bight和Massachusetts Bay海域表层环流中的应用,训练数据来自用于真实海上实验的高分辨率数据同化模拟。我们确认,训练后的深度神经算子模型能够预测理想化的周期性涡旋脱落;对于真实海洋表层流动,在我们的初步研究中,模型能够预测若干特征并表现出一定技巧,为未来研究和应用提供了潜力。

Ceci n’est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings

  • paper_url: http://arxiv.org/abs/2308.11804
  • repo_url: None
  • paper_authors: Eugene Bagdasaryan, Vitaly Shmatikov
  • for: 这篇论文旨在探讨多模态编码器如何受到攻击,以及这些攻击如何影响下游任务。
  • methods: 论文研究将图像、声音、文本、视频等多种模态映射到同一嵌入空间的多模态编码器,并证明这些嵌入可能受到攻击。
  • results: 论文以ImageBind嵌入为例,展示攻击者可以通过使输入具有相近的嵌入,将任意图像与任意文本、任意声音与任意文本等相互对齐;这些攻击可进一步误导图像生成、文本生成和零样本分类等下游任务。
    Abstract Multi-modal encoders map images, sounds, texts, videos, etc. into a single embedding space, aligning representations across modalities (e.g., associate an image of a dog with a barking sound). We show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an input in any modality, an adversary can perturb it so as to make its embedding close to that of an arbitrary, adversary-chosen input in another modality. Illusions thus enable the adversary to align any image with any text, any text with any sound, etc. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks. Using ImageBind embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, and zero-shot classification.
    摘要 多模态编码器将图像、声音、文本、视频等多种模态映射到同一个嵌入空间,使不同模态的表示彼此对齐(例如,将一幅狗的图像与犬吠声相关联)。我们展示了多模态嵌入可能受到一种我们称为“对抗性幻觉”的攻击。给定任意模态的输入,攻击者可以对其施加扰动,使其嵌入逼近攻击者在另一模态中任意选定的输入的嵌入。这种幻觉使攻击者能够将任意图像与任意文本、任意文本与任意声音等相互对齐。对抗性幻觉利用嵌入空间中的邻近性,因而与下游任务无关。基于ImageBind嵌入,我们演示了在不了解具体下游任务的情况下生成的对抗性对齐输入,如何误导图像生成、文本生成和零样本分类。
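    The attack reduces to optimizing a small perturbation so that the embedding of the perturbed input approaches an adversary-chosen target embedding. A generic PGD sketch against a stand-in encoder follows; the tiny linear encoder below replaces a real multi-modal encoder such as ImageBind, and the budget values are illustrative:

```python
import torch
import torch.nn.functional as F

# Stand-in for a frozen multi-modal encoder such as ImageBind (assumption).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
for p in encoder.parameters():
    p.requires_grad_(False)

image = torch.rand(1, 3, 32, 32)            # input the adversary perturbs
target_emb = torch.randn(1, 128)            # embedding of the adversary-chosen input
delta = torch.zeros_like(image, requires_grad=True)
eps, step = 8 / 255, 1 / 255

for _ in range(100):                        # PGD: pull the embeddings together
    adv = (image + delta).clamp(0, 1)
    loss = 1 - F.cosine_similarity(encoder(adv), target_emb).mean()
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()
        delta.clamp_(-eps, eps)             # keep the perturbation small
        delta.grad.zero_()

print(F.cosine_similarity(encoder((image + delta).clamp(0, 1)), target_emb))
```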

Variational Density Propagation Continual Learning

  • paper_url: http://arxiv.org/abs/2308.11801
  • repo_url: None
  • paper_authors: Christopher Angelini, Nidhal Bouaynaya, Ghulam Rasool
  • for: 这篇论文旨在提出一个面向真实世界部署的深度神经网络(DNNs)持续学习框架,以应对分布外数据、各类噪声和不断变化的概念目标。
  • methods: 论文提出了一种利用贝叶斯推断的不确定性量化来缓解灾难性遗忘的持续学习方法。该方法无需像以往方法那样对模型权重进行蒙特卡洛采样来逼近预测分布,而是通过在各网络层间传播分布的前两阶矩(均值和协方差),优化一个闭式的证据下界(ELBO)目标。
  • results: 结果显示,该方法在多个顺序基准数据集、不同任务序列长度下的任务增量学习场景中缓解了灾难性遗忘,并在一系列任务上得到复杂度最小的网络。
    Abstract Deep Neural Networks (DNNs) deployed to the real world are regularly subject to out-of-distribution (OoD) data, various types of noise, and shifting conceptual objectives. This paper proposes a framework for adapting to data distribution drift modeled by benchmark Continual Learning datasets. We develop and evaluate a method of Continual Learning that leverages uncertainty quantification from Bayesian Inference to mitigate catastrophic forgetting. We expand on previous approaches by removing the need for Monte Carlo sampling of the model weights to sample the predictive distribution. We optimize a closed-form Evidence Lower Bound (ELBO) objective approximating the predictive distribution by propagating the first two moments of a distribution, i.e. mean and covariance, through all network layers. Catastrophic forgetting is mitigated by using the closed-form ELBO to approximate the Minimum Description Length (MDL) Principle, inherently penalizing changes in the model likelihood by minimizing the KL Divergence between the variational posterior for the current task and the previous task's variational posterior acting as the prior. Leveraging the approximation of the MDL principle, we aim to initially learn a sparse variational posterior and then minimize additional model complexity learned for subsequent tasks. Our approach is evaluated for the task incremental learning scenario using density propagated versions of fully-connected and convolutional neural networks across multiple sequential benchmark datasets with varying task sequence lengths. Ultimately, this procedure produces a minimally complex network over a series of tasks mitigating catastrophic forgetting.
    摘要 部署到真实世界的深度神经网络(DNNs)经常遇到分布外(OoD)数据、各类噪声和不断变化的概念目标。本文提出一个适应数据分布漂移的框架,该漂移由基准持续学习数据集建模。我们开发并评估了一种持续学习方法,利用贝叶斯推断的不确定性量化来缓解灾难性遗忘。我们在以往方法基础上更进一步,无需对模型权重进行蒙特卡洛采样来采样预测分布,而是通过在所有网络层间传播分布的前两阶矩(均值和协方差),优化一个逼近预测分布的闭式证据下界(ELBO)目标。该闭式ELBO用于逼近最小描述长度(MDL)原则:通过最小化当前任务变分后验与作为先验的上一任务变分后验之间的KL散度,内在地惩罚模型似然的变化,从而缓解灾难性遗忘。借助对MDL原则的逼近,我们力求先学习一个稀疏的变分后验,再最小化后续任务引入的额外模型复杂度。我们在多个顺序基准数据集、不同任务序列长度下,使用全连接与卷积神经网络的密度传播版本,对任务增量学习场景进行了评估。最终,该流程在一系列任务上产生一个复杂度最小的网络,并缓解了灾难性遗忘。
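    The core trick, propagating the first two moments instead of Monte Carlo sampling, is exact for a linear layer. The sketch below verifies the closed-form moments against sampling (nonlinearities require additional approximations not shown here):

```python
import torch

def propagate_linear(mu, Sigma, W, b):
    """Exact first two moments of y = W x + b when x ~ N(mu, Sigma)."""
    return W @ mu + b, W @ Sigma @ W.T

d_in, d_out = 4, 3
W, b = torch.randn(d_out, d_in), torch.randn(d_out)
mu, Sigma = torch.zeros(d_in), torch.eye(d_in)

mu_y, Sigma_y = propagate_linear(mu, Sigma, W, b)

# Monte Carlo check: the sampling that moment propagation avoids at run time
x = torch.distributions.MultivariateNormal(mu, Sigma).sample((100_000,))
y = x @ W.T + b
print(torch.allclose(mu_y, y.mean(0), atol=0.05),
      torch.allclose(Sigma_y, torch.cov(y.T), atol=0.2))
```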

Complex-valued neural networks for voice anti-spoofing

  • paper_url: http://arxiv.org/abs/2308.11800
  • repo_url: None
  • paper_authors: Nicolas M. Müller, Philip Sperl, Konstantin Böttinger
  • for: 防御语音欺骗(voice spoofing)与音频深度伪造(audio deepfake)
  • methods: 使用复数值神经网络处理输入音频的CQT频域表示,保留相位信息,并支持可解释AI方法
  • results: 在“in-the-wild”反欺骗数据集上优于以往方法,并可通过可解释AI方法对结果进行解释;消融实验证实模型学会了利用相位信息来检测语音欺骗。
    Abstract Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or Melspectrograms) or raw audio processed through convolution or sinc-layers. Both methods have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-feature-based models cannot use traditional explainable AI methods. This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the complex-valued, CQT frequency-domain representation of the input audio. This method retains phase information and allows for explainable AI methods. Results show that this approach outperforms previous methods on the "In-the-Wild" anti-spoofing dataset and enables interpretation of the results through explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
    摘要 现有的反欺骗与音频深度伪造检测系统,要么使用基于幅度谱图的特征(如CQT或Mel谱图),要么通过卷积层或sinc层处理原始音频。两种方法各有缺陷:幅度谱图丢弃了相位信息,影响对音频自然度的刻画;基于原始特征的模型则无法使用传统的可解释AI方法。本文提出一种结合两者优点的新方法:使用复数值神经网络处理输入音频的复数值CQT频域表示。该方法保留了相位信息,并允许使用可解释AI方法。结果表明,该方法在“in-the-wild”反欺骗数据集上优于以往方法,且可通过可解释AI解释结果。消融实验证实,模型学会了利用相位信息来检测语音欺骗。
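    A common way to realize a complex-valued convolution is with two real convolutions applied to the real and imaginary parts of the complex CQT; the following sketch illustrates that construction (it is not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution via two real convolutions:
    (a + ib) * (c + id) = (ac - bd) + i(ad + bc)."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.re = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.im = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x_re, x_im):
        return (self.re(x_re) - self.im(x_im),   # real part
                self.re(x_im) + self.im(x_re))   # imaginary part

# Toy complex-valued CQT "spectrogram": (batch, channels, freq_bins, frames)
x_re, x_im = torch.randn(2, 1, 84, 128), torch.randn(2, 1, 84, 128)
y_re, y_im = ComplexConv2d(1, 16, 3)(x_re, x_im)
print(y_re.shape, y_im.shape)      # phase information survives in both paths
```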

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

  • paper_url: http://arxiv.org/abs/2308.11792
  • repo_url: None
  • paper_authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz Thamsen, Jonathan Will, Odej Kao
  • for: 提高大数据分析任务资源配置的自动化方法,以提高效率、降低成本和能效率。
  • methods: Karasu使用了分享数据的方法,通过将多个用户的运行时信息相互整合,以生成轻量级性能模型,并将其组合成ensemble方法来挖掘配置搜索空间中的内在知识。
  • results: Karasu可以帮助提高现有方法的性能、搜索时间和成本,即使只有少量相似的 profiling 运行可用,并且可以同时优化多个目标。
    Abstract Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
    摘要 为大数据分析作业选择合适的资源并不容易,因为机器类型、集群规模等配置选项繁多,而糟糕的选择会对资源效率、成本和能耗产生显著影响,因此自动化方法日益流行。现有方法大多依赖对反复出现的工作负载进行画像,以便随时间找到近似最优解;由于冷启动问题,这往往导致漫长且昂贵的画像阶段。然而,不同用户的大数据分析作业往往具有许多共性:它们通常运行在相似的基础设施上,使用相似框架实现的相似算法。通过共享汇总的画像运行来协同应对冷启动问题的潜力,在很大程度上尚未被挖掘。我们提出Karasu,一种更高效的资源配置画像方法,鼓励使用相似基础设施、框架、算法或数据集的用户之间共享数据。Karasu利用协作者汇总的运行时信息训练轻量级性能模型,并将其组合为集成方法,以挖掘配置搜索空间中的内在知识;此外,Karasu支持同时优化多个目标。我们的评估基于公有云环境中多样工作负载执行的性能数据。结果表明,即使可用的可比画像运行很少、且与目标作业仅有部分共同特征,Karasu仍能在性能、搜索时间和成本方面显著提升现有方法。

HypBO: Expert-Guided Chemist-in-the-Loop Bayesian Search for New Materials

  • paper_url: http://arxiv.org/abs/2308.11787
  • repo_url: None
  • paper_authors: Abdoulatif Cisse, Xenophon Evangelopoulos, Sam Carruthers, Vladimir V. Gusev, Andrew I. Cooper
  • for: 该研究旨在使用人类专家知识来加速 Bayesian 优化算法,以更好地解决复杂多变量科学问题。
  • methods: 该方法使用专家假设来指导 Bayesian 搜索,并使用搜索结果来改进模型数据。
  • results: 实验结果表明,该方法可以在新、未探索的科学任务中更快地找到有价值的答案,并且可以更好地考虑专家的知识和假设。
    Abstract Robotics and automation offer massive accelerations for solving intractable, multivariate scientific problems such as materials discovery, but the available search spaces can be dauntingly large. Bayesian optimization (BO) has emerged as a popular sample-efficient optimization engine, thriving in tasks where no analytic form of the target function/property is known. Here we exploit expert human knowledge in the form of hypotheses to direct Bayesian searches more quickly to promising regions of chemical space. Previous methods have used underlying distributions derived from existing experimental measurements, which is unfeasible for new, unexplored scientific tasks. Also, such distributions cannot capture intricate hypotheses. Our proposed method, which we call HypBO, uses expert human hypotheses to generate an improved seed of samples. Unpromising seeds are automatically discounted, while promising seeds are used to augment the surrogate model data, thus achieving better-informed sampling. This process continues in a global versus local search fashion, organized in a bilevel optimization framework. We validate the performance of our method on a range of synthetic functions and demonstrate its practical utility on a real chemical design task where the use of expert hypotheses accelerates the search performance significantly.
    摘要 机器人与自动化技术为求解材料发现等难解的多变量科学问题带来了巨大加速,但可行的搜索空间可能大得惊人。贝叶斯优化(BO)已成为一种流行的高样本效率优化引擎,在目标函数/性质没有解析形式的任务中表现出色。在本文中,我们以假设的形式利用人类专家知识,引导贝叶斯搜索更快地到达化学空间中有希望的区域。以往的方法使用从既有实验测量推导的底层分布,这对全新、未探索的科学任务并不可行;而且此类分布也无法刻画复杂精细的假设。我们提出的方法称为HypBO,利用专家假设生成更优的样本种子:没有希望的种子会被自动降权,而有希望的种子则用于扩充代理模型数据,从而实现更有依据的采样。该过程以全局与局部搜索相结合的方式进行,并组织为一个双层优化框架。我们在一系列合成函数上验证了该方法的性能,并在一个真实的化学设计任务中展示了其实用价值:专家假设显著加速了搜索。
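    The hypothesis-seeding idea can be sketched with a plain GP-based Bayesian optimization loop in which part of the initial design is drawn from an expert-hypothesized region. HypBO additionally discounts unpromising hypotheses and organizes the search as a bilevel global/local scheme, which this toy loop omits:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
f = lambda x: (x[:, 0] - 0.7) ** 2 + (x[:, 1] - 0.2) ** 2   # unknown objective
hyp_lo, hyp_hi = np.array([0.5, 0.0]), np.array([1.0, 0.5]) # expert hypothesis box

# Seed: half the initial samples inside the hypothesized region, half global.
X = np.vstack([rng.uniform(hyp_lo, hyp_hi, (5, 2)), rng.uniform(0, 1, (5, 2))])
y = f(X)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):
    gp.fit(X, y)
    cand = rng.uniform(0, 1, (2000, 2))
    mu, sd = gp.predict(cand, return_std=True)
    imp = y.min() - mu                                       # expected improvement
    ei = imp * norm.cdf(imp / (sd + 1e-9)) + sd * norm.pdf(imp / (sd + 1e-9))
    x_next = cand[ei.argmax()][None, :]
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next))

print("best point:", X[y.argmin()], "value:", y.min())
```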

Coarse-to-Fine Multi-Scene Pose Regression with Transformers

  • paper_url: http://arxiv.org/abs/2308.11783
  • repo_url: https://github.com/yolish/c2f-ms-transformer
  • paper_authors: Yoli Shavit, Ron Ferens, Yosi Keller
  • for: 估计相机的位置与朝向,并并行学习多个场景中的绝对位姿回归。
  • methods: 使用Transformer:编码器通过自注意力聚合激活图,解码器将潜在特征与场景编码转换为位姿预测。
  • results: 在常用的室内与室外数据集上进行评估,定位精度超过多场景及最先进的单场景绝对位姿回归器。
    Abstract Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmark indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.
    摘要 绝对相机位姿回归器仅凭拍摄到的图像来估计相机的位置与朝向。通常,人们使用图像和位姿标签训练一个卷积骨干网络加多层感知机(MLP)头,一次只嵌入单个参考场景。最近,该方案被扩展为通过用一组全连接层替换MLP头来学习多个场景。在本工作中,我们提出用Transformer学习多场景绝对相机位姿回归:编码器通过自注意力聚合激活图,解码器将潜在特征与场景编码转换为位姿预测。这使我们的模型能够关注对定位有信息量的通用特征,同时并行嵌入多个场景。我们在先前的MS-Transformer方法 \cite{shavit2021learning} 基础上引入了混合分类-回归架构,提升了定位精度。我们在常用的室内与室外基准数据集上评估了该方法,其表现超过了多场景以及最先进的单场景绝对位姿回归器。

Addressing Dynamic and Sparse Qualitative Data: A Hilbert Space Embedding of Categorical Variables

  • paper_url: http://arxiv.org/abs/2308.11781
  • repo_url: None
  • paper_authors: Anirban Mukherjee, Hannah H. Chang
  • for: 本文提出了一种将定性数据纳入量化因果估计模型的新框架。以往方法通常用源自定性数据的分类变量构建量化模型,但当定性信息动态且复杂时,这可能导致类别数据稀疏并产生不一致、不精确的估计。本文借助泛函分析构建了一个更灵活、精细的框架。
  • methods: 本文将观测到的类别嵌入一个潜在的Baire空间,并引入一个从类别空间到再生核希尔伯特空间(RKHS)中表示函数的连续线性映射。通过Riesz表示定理,我们证明因果模型中对分类变量的常规处理可转化为RKHS中的一个可识别结构。
  • results: 通过全面的模拟证据和一个真实应用案例,我们证明了该方法的优越性,尤其是在分类信息复杂精细的场景下表现出色。
    Abstract We propose a novel framework for incorporating qualitative data into quantitative models for causal estimation. Previous methods use categorical variables derived from qualitative data to build quantitative models. However, this approach can lead to data-sparse categories and yield inconsistent (asymptotically biased) and imprecise (finite sample biased) estimates if the qualitative information is dynamic and intricate. We use functional analysis to create a more nuanced and flexible framework. We embed the observed categories into a latent Baire space and introduce a continuous linear map -- a Hilbert space embedding -- from the Baire space of categories to a Reproducing Kernel Hilbert Space (RKHS) of representation functions. Through the Riesz representation theorem, we establish that the canonical treatment of categorical variables in causal models can be transformed into an identified structure in the RKHS. Transfer learning acts as a catalyst to streamline estimation -- embeddings from traditional models are paired with the kernel trick to form the Hilbert space embedding. We validate our model through comprehensive simulation evidence and demonstrate its relevance in a real-world study that contrasts theoretical predictions from economics and psychology in an e-commerce marketplace. The results confirm the superior performance of our model, particularly in scenarios where qualitative information is nuanced and complex.
    摘要 我们提出一种将定性数据纳入量化因果估计模型的新框架。以往方法使用源自定性数据的分类变量构建量化模型,但当定性信息动态且复杂时,这种做法可能导致类别数据稀疏,并产生不一致(渐近有偏)且不精确(有限样本有偏)的估计。我们借助泛函分析构建了一个更精细灵活的框架:将观测到的类别嵌入一个潜在的Baire空间,并引入一个连续线性映射(希尔伯特空间嵌入),将类别的Baire空间映射到表示函数的再生核希尔伯特空间(RKHS)。通过Riesz表示定理,我们证明因果模型中对分类变量的常规处理可以转化为RKHS中的可识别结构。迁移学习充当简化估计的催化剂:将传统模型得到的嵌入与核技巧配对,构成希尔伯特空间嵌入。我们通过全面的模拟证据验证了该模型,并在一个对比经济学与心理学理论预测的电商平台真实研究中展示了其适用性。结果证实了我们模型的优越表现,尤其是在定性信息复杂精细的场景中。

Few-shot Anomaly Detection in Text with Deviation Learning

  • paper_url: http://arxiv.org/abs/2308.11780
  • repo_url: None
  • paper_authors: Anindya Sundar Das, Aravind Ajay, Sriparna Saha, Monowar Bhuyan
  • for: 本研究旨在提出一种基于深度少样本学习的文本异常检测方法,利用有限的异常样本并直接学习异常分数,以提升异常检测的精度和效率。
  • methods: 该方法基于深度少样本学习,采用偏差学习(deviation learning)显式学习异常分数,并结合多头自注意力层与多实例学习方法来学习异常行为。
  • results: 在多个基准数据集上的实验表明,该方法达到了新的最先进水平。
    Abstract Most current methods for detecting anomalies in text concentrate on constructing models solely relying on unlabeled data. These models operate on the presumption that no labeled anomalous examples are available, which prevents them from utilizing prior knowledge of anomalies that are typically present in small numbers in many real-world applications. Furthermore, these models prioritize learning feature embeddings rather than optimizing anomaly scores directly, which could lead to suboptimal anomaly scoring and inefficient use of data during the learning process. In this paper, we introduce FATE, a deep few-shot learning-based framework that leverages limited anomaly examples and learns anomaly scores explicitly in an end-to-end method using deviation learning. In this approach, the anomaly scores of normal examples are adjusted to closely resemble reference scores obtained from a prior distribution. Conversely, anomaly samples are forced to have anomalous scores that considerably deviate from the reference score in the upper tail of the prior. Additionally, our model is optimized to learn the distinct behavior of anomalies by utilizing a multi-head self-attention layer and multiple instance learning approaches. Comprehensive experiments on several benchmark datasets demonstrate that our proposed approach attains a new level of state-of-the-art performance.
    摘要 当前大多数文本异常检测方法只依赖无标注数据构建模型。这些模型假定没有任何带标注的异常样本可用,因而无法利用许多真实应用中通常少量存在的异常先验知识。此外,这些模型侧重学习特征嵌入而非直接优化异常分数,可能导致异常打分次优,并在学习过程中低效利用数据。在本文中,我们提出FATE,一个基于深度少样本学习的框架,它利用有限的异常样本,并通过偏差学习以端到端方式显式学习异常分数:正常样本的异常分数被调整为逼近来自先验分布的参考分数,而异常样本则被强制获得显著偏离参考分数、落在先验上尾的异常分数。此外,我们的模型还利用多头自注意力层与多实例学习方法来学习异常的独特行为。在多个基准数据集上的全面实验表明,我们提出的方法达到了新的最先进水平。
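    Deviation learning can be made concrete with a loss in the style of deviation networks: normal examples are pushed toward reference scores drawn from a prior, while labeled anomalies are pushed into its upper tail. A minimal PyTorch sketch follows (FATE's multi-head self-attention and multiple-instance components are omitted):

```python
import torch

def deviation_loss(scores, labels, n_ref=5000, margin=5.0):
    """Anomaly scores expressed as z-scores against a N(0, 1) prior."""
    ref = torch.randn(n_ref)                      # reference scores from the prior
    dev = (scores - ref.mean()) / ref.std()       # deviation of each example
    inlier = (1 - labels) * dev.abs()             # normals pulled toward the mean
    outlier = labels * torch.relu(margin - dev)   # anomalies pushed to the upper tail
    return (inlier + outlier).mean()

scores = torch.randn(8, requires_grad=True)       # scalar scores from the network
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1.])  # few labeled anomalies
deviation_loss(scores, labels).backward()         # end-to-end trainable
```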

Understanding Hessian Alignment for Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.11778
  • repo_url: https://github.com/huawei-noah/federated-learning
  • paper_authors: Sobhan Hemati, Guojun Zhang, Amir Estiri, Xi Chen
  • for: 本文主要针对深度学习模型的分布外(OOD)泛化问题,具体应用于医疗和自动驾驶等领域。
  • methods: 本文利用最近的OOD可迁移性理论,分析Hessian与梯度对齐在领域泛化中的作用,并基于Hessian-梯度乘积(HGP)和Hutchinson方法提出两种简单而有效的方法,在不直接计算Hessian的情况下匹配分类器头部的Hessian与梯度。
  • results: 本文显示,在可迁移性、严重相关偏移、标签偏移和多样性偏移等多种OOD基准上,Hessian对齐方法均取得了可观的性能。
    Abstract Out-of-distribution (OOD) generalization is a critical ability for deep learning models in many real-world scenarios including healthcare and autonomous vehicles. Recently, different techniques have been proposed to improve OOD generalization. Among these methods, gradient-based regularizers have shown promising performance compared with other competitors. Despite this success, our understanding of the role of Hessian and gradient alignment in domain generalization is still limited. To address this shortcoming, we analyze the role of the classifier's head Hessian matrix and gradient in domain generalization using recent OOD theory of transferability. Theoretically, we show that spectral norm between the classifier's head Hessian matrices across domains is an upper bound of the transfer measure, a notion of distance between target and source domains. Furthermore, we analyze all the attributes that get aligned when we encourage similarity between Hessians and gradients. Our analysis explains the success of many regularizers like CORAL, IRM, V-REx, Fish, IGA, and Fishr as they regularize part of the classifier's head Hessian and/or gradient. Finally, we propose two simple yet effective methods to match the classifier's head Hessians and gradients in an efficient way, based on the Hessian Gradient Product (HGP) and Hutchinson's method (Hutchinson), and without directly calculating Hessians. We validate the OOD generalization ability of proposed methods in different scenarios, including transferability, severe correlation shift, label shift and diversity shift. Our results show that Hessian alignment methods achieve promising performance on various OOD benchmarks. The code is available at \url{https://github.com/huawei-noah/Federated-Learning/tree/main/HessianAlignment}.
    摘要 分布外(OOD)泛化是深度学习模型在医疗、自动驾驶等许多实际场景中的关键能力。近来,人们提出了多种提升OOD泛化的技术,其中基于梯度的正则化方法相比其他方法展现出可观的性能。尽管如此,我们对Hessian与梯度对齐在领域泛化中作用的理解仍然有限。为弥补这一不足,我们利用最近的OOD可迁移性理论,分析分类器头部Hessian矩阵与梯度在领域泛化中的作用。理论上,我们证明跨领域分类器头部Hessian矩阵之间的谱范数是迁移度量(目标域与源域之间的一种距离概念)的上界。此外,我们分析了在鼓励Hessian与梯度相似时得到对齐的全部属性。我们的分析解释了CORAL、IRM、V-REx、Fish、IGA和Fishr等许多正则化方法的成功,因为它们都对分类器头部Hessian和/或梯度的一部分进行了正则化。最后,我们基于Hessian-梯度乘积(HGP)和Hutchinson方法,提出两种简单而有效的方法,在不直接计算Hessian的情况下高效匹配分类器头部的Hessian与梯度。我们在可迁移性、严重相关偏移、标签偏移和多样性偏移等多种场景中验证了所提方法的OOD泛化能力。结果表明,Hessian对齐方法在各类OOD基准上均取得可观的性能。代码见 https://github.com/huawei-noah/Federated-Learning/tree/main/HessianAlignment 。
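    The Hutchinson-style estimator referenced by the paper avoids forming the Hessian by combining random probes with Hessian-vector products; a self-contained sketch of the trace estimate (the paper applies the same machinery to align classifier-head Hessians rather than to a toy scalar loss) is:

```python
import torch

def hutchinson_hessian_trace(loss, params, n_probes=10):
    """tr(H) ~= E[v^T H v] for Rademacher v, using Hessian-vector products only."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(g, 2) * 2.0 - 1.0 for g in grads]  # v in {-1, +1}
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)    # H v
        est += sum((h * v).sum() for h, v in zip(hvs, vs)) / n_probes
    return est

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum() + w.prod()            # Hessian trace is exactly 6
print(hutchinson_hessian_trace(loss, [w]))  # noisy estimate around 6
```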

3ET: Efficient Event-based Eye Tracking using a Change-Based ConvLSTM Network

  • paper_url: http://arxiv.org/abs/2308.11771
  • repo_url: None
  • paper_authors: Qinyu Chen, Zuowen Wang, Shih-Chii Liu, Chang Gao
  • for: 这篇论文提出一种稀疏的基于变化的卷积长短期记忆(CB-ConvLSTM)模型,用于基于事件的眼动追踪,这是AR/VR头显等下一代可穿戴医疗技术的关键。
  • methods: 论文利用类视网膜事件相机相对传统帧式相机的优势,即低延迟响应和稀疏的输出事件流。CB-ConvLSTM架构能从事件流中高效提取瞳孔追踪所需的时空特征,优于传统CNN结构。
  • results: 论文使用delta编码的循环路径增强激活稀疏性,在 \texttt{v2e} 生成的带标注瞳孔事件数据集上测试时,在不损失精度的情况下将算术运算量降低约4.7倍。这种效率提升使其成为资源受限设备上实时眼动追踪的理想选择。
    Abstract This paper presents a sparse Change-Based Convolutional Long Short-Term Memory (CB-ConvLSTM) model for event-based eye tracking, key for next-generation wearable healthcare technology such as AR/VR headsets. We leverage the benefits of retina-inspired event cameras, namely their low-latency response and sparse output event stream, over traditional frame-based cameras. Our CB-ConvLSTM architecture efficiently extracts spatio-temporal features for pupil tracking from the event stream, outperforming conventional CNN structures. Utilizing a delta-encoded recurrent path enhancing activation sparsity, CB-ConvLSTM reduces arithmetic operations by approximately 4.7$\times$ without losing accuracy when tested on a \texttt{v2e}-generated event dataset of labeled pupils. This increase in efficiency makes it ideal for real-time eye tracking in resource-constrained devices. The project code and dataset are openly available at \url{https://github.com/qinche106/cb-convlstm-eyetracking}.
    摘要 本文提出一种稀疏的基于变化的卷积长短期记忆(CB-ConvLSTM)模型,用于基于事件的眼动追踪,这是AR/VR头显等下一代可穿戴医疗技术的关键。我们利用类视网膜事件相机相对传统帧式相机的优势,即低延迟响应和稀疏的输出事件流。CB-ConvLSTM架构能高效地从事件流中提取用于瞳孔追踪的时空特征,优于传统CNN结构。借助delta编码的循环路径增强激活稀疏性,CB-ConvLSTM在 \texttt{v2e} 生成的带标注瞳孔事件数据集上测试时,在不损失精度的情况下将算术运算量降低约4.7倍。这种效率提升使其非常适合资源受限设备上的实时眼动追踪。项目代码与数据集开放于 \url{https://github.com/qinche106/cb-convlstm-eyetracking} 。
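    The change-based idea can be illustrated with a simple thresholded delta encoding of an input frame sequence; this shows only the input-side intuition, not the authors' CB-ConvLSTM gating:

```python
import torch

def delta_encode(frames, threshold=0.05):
    """Keep only per-pixel changes above a threshold; small changes become
    exact zeros, which is what lets sparse kernels skip most of the work."""
    deltas = torch.zeros_like(frames)     # deltas[0] stays zero (keyframe)
    prev = frames[0]
    for t in range(1, frames.shape[0]):
        d = frames[t] - prev
        d = torch.where(d.abs() >= threshold, d, torch.zeros_like(d))
        prev = prev + d                   # reconstructable reference state
        deltas[t] = d
    return deltas

frames = torch.rand(10, 1, 64, 64)        # event-frame tensor (T, C, H, W)
deltas = delta_encode(frames)
print(f"nonzero fraction: {(deltas != 0).float().mean():.3f}")
```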

Patient Clustering via Integrated Profiling of Clinical and Digital Data

  • paper_url: http://arxiv.org/abs/2308.11748
  • repo_url: None
  • paper_authors: Dongjin Choi, Andy Xiang, Ozgur Ozturk, Deep Shrestha, Barry Drake, Hamid Haidarian, Faizan Javed, Haesun Park
  • for: 这篇论文描述了一种面向医疗数据分析、基于画像的患者聚类模型。
  • methods: 该模型使用基于受约束低秩近似的方法,利用患者的临床数据和数字交互数据(包括浏览和搜索)构建患者画像,由此生成非负嵌入向量,作为患者的低维表示。
  • results: 在来自医疗门户网站的真实患者数据上,该模型在聚类一致性和推荐准确性方面均优于其他基线。
    Abstract We introduce a novel profile-based patient clustering model designed for clinical data in healthcare. By utilizing a method grounded on constrained low-rank approximation, our model takes advantage of patients' clinical data and digital interaction data, including browsing and search, to construct patient profiles. As a result of the method, nonnegative embedding vectors are generated, serving as a low-dimensional representation of the patients. Our model was assessed using real-world patient data from a healthcare web portal, with a comprehensive evaluation approach which considered clustering and recommendation capabilities. In comparison to other baselines, our approach demonstrated superior performance in terms of clustering coherence and recommendation accuracy.
    摘要 我们介绍一种面向医疗临床数据、基于画像的新型患者聚类模型。该模型采用基于受约束低秩近似的方法,利用患者的临床数据和数字交互数据(包括浏览和搜索)构建患者画像,从而生成非负嵌入向量,作为患者的低维表示。我们使用来自某医疗门户网站的真实患者数据评估了该模型,评估方案全面考虑了聚类与推荐能力。与其他基线相比,我们的方法在聚类一致性和推荐准确性方面表现更优。
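    A simplified stand-in for the pipeline, a nonnegative low-rank factorization of a patient-by-feature matrix followed by clustering of the resulting embeddings, can be written with scikit-learn (plain NMF here replaces the paper's constrained low-rank approximation, and the Poisson counts are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Rows: patients; columns: clinical codes plus digital events (browse/search counts)
X = rng.poisson(1.0, size=(200, 50)).astype(float)

nmf = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
profiles = nmf.fit_transform(X)          # nonnegative low-dimensional embeddings
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(profiles)
print("cluster sizes:", np.bincount(clusters))
```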

Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape

  • paper_url: http://arxiv.org/abs/2308.11737
  • repo_url: None
  • paper_authors: Jiacong Xu, Yi Zhang, Jiawei Peng, Wufei Ma, Artur Jesslen, Pengliang Ji, Qixin Hu, Jiehua Zhang, Qihao Liu, Jiahao Wang, Wei Ji, Chen Wang, Xiaoding Yuan, Prakhar Kaushik, Guofeng Zhang, Jie Liu, Yushan Xie, Yawen Cui, Alan Yuille, Adam Kortylewski
  • for: introduces a comprehensive dataset for animal 3D pose and shape estimation, aimed at better understanding animal behavior, with potential downstream applications such as wildlife conservation.
  • methods: collects 3379 images of 40 mammal species, annotated with high-quality labels for 26 keypoints and the pose and shape parameters of the SMAL model; all annotations were labeled and checked manually in a multi-stage process to ensure the highest quality.
  • results: experiments show that predicting the 3D pose and shape of animals across species remains a very challenging task, despite significant progress in human pose estimation; they also show that synthetic-to-real pre-training is a viable strategy for boosting model performance.
    Abstract Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dataset for mammal animal 3D pose and shape estimation. Animal3D consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and importantly the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure highest quality results. Based on the Animal3D dataset, we benchmark representative shape and pose estimation models at: (1) supervised learning from only the Animal3D data, (2) synthetic to real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models. Our experimental results demonstrate that predicting the 3D shape and pose of animals across species remains a very challenging task, despite significant advances in human pose estimation. Our results further demonstrate that synthetic pre-training is a viable strategy to boost the model performance. Overall, Animal3D opens new directions for facilitating future research in animal 3D pose and shape estimation, and is publicly available.

Knowledge Graph Prompting for Multi-Document Question Answering

  • paper_url: http://arxiv.org/abs/2308.11730
  • repo_url: None
  • paper_authors: Yu Wang, Nedim Lipka, Ryan A. Rossi, Alexa Siu, Ruiyi Zhang, Tyler Derr
  • for: improving the performance of large language models (LLMs) on multi-document question answering (MD-QA), particularly when many documents with heterogeneous structures are involved.
  • methods: proposes Knowledge Graph Prompting (KGP), comprising a graph construction module, in which nodes represent passages or document structures and edges encode semantic/lexical similarity between passages or intra-document structural relations, and an LM-guided graph traversal module that navigates across documents to gather supporting passages for the LLM.
  • results: experiments show that KGP effectively improves LLM performance on MD-QA, indicating that graphs can enhance prompt design for LLMs across multiple documents with diverse structures.
    Abstract The 'pre-train, prompt, predict' paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in the scenario of multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of different documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or intra-document structural relations. For graph traversal, we design an LM-guided graph traverser that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the LM-guided traverser acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design for LLMs. Our code is at https://github.com/YuWVandy/KG-LLM-MDQA.
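A minimal sketch of the graph construction and traversal idea, with TF-IDF similarity standing in for the paper's semantic/lexical edges and a greedy hop standing in for the LM-guided traverser; passages and the threshold are toy assumptions.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "The 2019 annual report lists revenue of 4.2B.",
    "Revenue grew 12 percent year over year in 2019.",
    "Table 3: headcount by region.",
]

# Nodes are passages; edges connect lexically similar passages.
tfidf = TfidfVectorizer().fit_transform(passages)
sim = cosine_similarity(tfidf)

G = nx.Graph()
G.add_nodes_from(range(len(passages)))
for i in range(len(passages)):
    for j in range(i + 1, len(passages)):
        if sim[i, j] > 0.1:
            G.add_edge(i, j, weight=float(sim[i, j]))

def traverse(G, start, hops=2):
    """Greedily hop to the most similar unvisited neighbor (the paper uses
    an LM-guided traverser instead of raw TF-IDF scores)."""
    path, node = [start], start
    for _ in range(hops):
        nbrs = [(G[node][n]["weight"], n) for n in G.neighbors(node) if n not in path]
        if not nbrs:
            break
        node = max(nbrs)[1]
        path.append(node)
    return path

print(traverse(G, start=0))
```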

When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-making

  • paper_url: http://arxiv.org/abs/2308.11721
  • repo_url: https://github.com/kpdonahue/benefits_harms_joint_decision_making
  • paper_authors: Kate Donahue, Kostas Kollias, Sreenivas Gollapudi
  • for: studies a form of human-algorithm collaboration in which the algorithm presents a subset of size k drawn from n items and the human selects the final item from that subset; this models content recommendation, route planning, and similar labeling tasks.
  • methods: analyzes the optimization problem under several noise models, including the Mallows model and Random Utility models.
  • results: under several noise models, setting k in [2, n-1] strictly increases the probability of ultimately selecting the best item, so collaboration pays off even when the human and the algorithm are equally accurate; however, when the human anchors on the algorithm's presented ordering, the effect reverses and the joint system performs strictly worse.
    Abstract Historically, much of machine learning research has focused on the performance of the algorithm alone, but recently more attention has been focused on optimizing joint human-algorithm performance. Here, we analyze a specific type of human-algorithm collaboration where the algorithm has access to a set of $n$ items, and presents a subset of size $k$ to the human, who selects a final item from among those $k$. This scenario could model content recommendation, route planning, or any type of labeling task. Because both the human and algorithm have imperfect, noisy information about the true ordering of items, the key question is: which value of $k$ maximizes the probability that the best item will be ultimately selected? For $k=1$, performance is optimized by the algorithm acting alone, and for $k=n$ it is optimized by the human acting alone. Surprisingly, we show that for multiple of noise models, it is optimal to set $k \in [2, n-1]$ - that is, there are strict benefits to collaborating, even when the human and algorithm have equal accuracy separately. We demonstrate this theoretically for the Mallows model and experimentally for the Random Utilities models of noisy permutations. However, we show this pattern is reversed when the human is anchored on the algorithm's presented ordering - the joint system always has strictly worse performance. We extend these results to the case where the human and algorithm differ in their accuracy levels, showing that there always exist regimes where a more accurate agent would strictly benefit from collaborating with a less accurate one, but these regimes are asymmetric between the human and the algorithm's accuracy.
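The core trade-off is easy to reproduce with a small Random Utility simulation (all parameters below are illustrative): with equal noise for human and algorithm, intermediate values of k beat both k=1 (algorithm alone) and k=n (human alone).

```python
import numpy as np

rng = np.random.default_rng(0)

def p_best_selected(n=10, k=3, sigma_alg=1.0, sigma_human=1.0, trials=20000):
    """Random-utility simulation: true utilities are 0..n-1; algorithm and
    human each observe them with independent Gaussian noise. The algorithm
    shows its top-k, the human picks their noisy favorite among those k."""
    wins = 0
    true = np.arange(n, dtype=float)           # item n-1 is the true best
    for _ in range(trials):
        alg = true + rng.normal(0, sigma_alg, n)
        shortlist = np.argsort(alg)[-k:]       # algorithm's top-k items
        human = true[shortlist] + rng.normal(0, sigma_human, k)
        wins += shortlist[np.argmax(human)] == n - 1
    return wins / trials

for k in (1, 2, 5, 9, 10):
    print(k, round(p_best_selected(k=k), 3))
```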

Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

  • paper_url: http://arxiv.org/abs/2308.12885
  • repo_url: None
  • paper_authors: Oana Inel, Tim Draws, Lora Aroyo
  • for: proposes a Responsible AI (RAI) methodology for systematically assessing the quality and reliability factors of the data collection process.
  • methods: introduces a comprehensive set of metrics for evaluating a data collection's internal reliability and its external stability over time, and validates the feasibility and effectiveness of these metrics.
  • results: experiments show that the RAI methodology helps assess, and in turn improve, the quality and reliability of data collection; it also surfaces potential fairness and bias issues and offers recommendations for addressing them.
Abstract The rapid entry of machine learning approaches in our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and oftentimes datasets collected for a certain purpose or application are reused for a different problem. Additionally, dataset annotations may not be representative over time, contain ambiguous or erroneous annotations, or be unable to generalize across issues or domains. Recent research has shown these practices might lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed in a responsible manner where the quality of the data is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics for an iterative in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform on the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach impacts the assessment of data robustness used for AI applied in the real world, where diversity of users and content is eminent. Furthermore, it deals with fairness and accountability aspects in data collection by providing systematic and transparent quality analysis for data collections.
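Inter-annotator agreement is one simple example of the internal-reliability style of measurement the paper systematizes; Cohen's kappa below is an illustration of the idea, not the paper's specific metric suite, and the labels are toy data.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same 10 items: chance-corrected agreement is
# one basic internal-reliability signal for a data collection.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))  # ~0.58 for this toy data
```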

SuperCalo: Calorimeter shower super-resolution

  • paper_url: http://arxiv.org/abs/2308.11700
  • repo_url: https://github.com/ian-pang/supercalo
  • paper_authors: Ian Pang, John Andrew Raine, David Shih
  • for: addresses calorimeter shower simulation, a major bottleneck in the Large Hadron Collider computational pipeline.
  • methods: builds on deep generative surrogate models; introduces SuperCalo, a flow-based super-resolution model that quickly upsamples coarse-grained calorimeter showers into high-dimensional fine-grained ones, reducing computational cost, memory requirements, and generation time.
  • results: the upsampled showers retain a high degree of variation, so a large number of high-fidelity fine-grained showers can be produced from far fewer coarse ones, yielding a further reduction in generation time.
    Abstract Calorimeter shower simulation is a major bottleneck in the Large Hadron Collider computational pipeline. There have been recent efforts to employ deep-generative surrogate models to overcome this challenge. However, many of best performing models have training and generation times that do not scale well to high-dimensional calorimeter showers. In this work, we introduce SuperCalo, a flow-based super-resolution model, and demonstrate that high-dimensional fine-grained calorimeter showers can be quickly upsampled from coarse-grained showers. This novel approach presents a way to reduce computational cost, memory requirements and generation time associated with fast calorimeter simulation models. Additionally, we show that the showers upsampled by SuperCalo possess a high degree of variation. This allows a large number of high-dimensional calorimeter showers to be upsampled from much fewer coarse showers with high-fidelity, which results in additional reduction in generation time.

Efficient Benchmarking (of Language Models)

  • paper_url: http://arxiv.org/abs/2308.11696
  • repo_url: https://github.com/sumankrsh/Sentiment-Analysis.ipynb
  • paper_authors: Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen
  • for: poses the problem of "Efficient Benchmarking": intelligently reducing the computation costs of LM evaluation without compromising reliability.
  • methods: uses the HELM benchmark as a test case to study how different benchmark design choices affect the computation-reliability trade-off, and proposes a new measure, Decision Impact on Reliability (DIoR), to assess the reliability of such decisions.
  • results: finds that the current HELM leader can change merely by removing a low-ranked model from the benchmark, and that a handful of examples suffice to obtain the correct ranking, whereas a slightly different choice of HELM scenarios varies rankings widely; based on these findings, the paper offers concrete recommendations for more efficient benchmark design and usage, cutting computation, often by 100x or more, with minimal loss of reliability.
Abstract The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, reaching thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has received little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability tradeoff. We propose to evaluate the reliability of such decisions using a new measure, Decision Impact on Reliability (DIoR for short). We find, for example, that the current leader on HELM may change by merely removing a low-ranked model from the benchmark, and observe that a handful of examples suffice to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios varies rankings widely. Based on our findings, we outline a set of concrete recommendations for more efficient benchmark design and utilization practices, leading to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.
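The reliability question DIoR formalizes can be illustrated with rank correlation under subsampling (this is a sketch of the idea, not the DIoR formula; the scores below are synthetic).

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Toy per-example scores for 6 models on 500 benchmark examples; the small
# offsets give the models a "true" ordering.
scores = rng.random((6, 500)) + np.linspace(0.0, 0.1, 6)[:, None]
full_means = scores.mean(axis=1)

# Rank stability under subsampling: how well does a cheap evaluation on m
# examples agree with the full leaderboard?
for m in (10, 50, 200):
    sub = rng.choice(500, size=m, replace=False)
    tau, _ = kendalltau(full_means, scores[:, sub].mean(axis=1))
    print(m, round(float(tau), 2))
```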

Semantic Multi-Resolution Communications

  • paper_url: http://arxiv.org/abs/2308.11604
  • repo_url: None
  • paper_authors: Matin Mortaheb, Mohammad A. Amir Khojastepour, Srimat T. Chakradhar, Sennur Ulukus
  • for: improving data reconstruction while preserving semantic features
  • methods: a deep-learning multi-resolution JSCC framework combined with multi-task learning
  • results: experiments on the MNIST and CIFAR10 datasets show the proposed method outperforms SSCC, reconstructing data at multiple resolutions and extracting semantic features with increasing confidence across successive layers.
    Abstract Deep learning based joint source-channel coding (JSCC) has demonstrated significant advancements in data reconstruction compared to separate source-channel coding (SSCC). This superiority arises from the suboptimality of SSCC when dealing with finite block-length data. Moreover, SSCC falls short in reconstructing data in a multi-user and/or multi-resolution fashion, as it only tries to satisfy the worst channel and/or the highest quality data. To overcome these limitations, we propose a novel deep learning multi-resolution JSCC framework inspired by the concept of multi-task learning (MTL). This proposed framework excels at encoding data for different resolutions through hierarchical layers and effectively decodes it by leveraging both current and past layers of encoded data. Moreover, this framework holds great potential for semantic communication, where the objective extends beyond data reconstruction to preserving specific semantic attributes throughout the communication process. These semantic features could be crucial elements such as class labels, essential for classification tasks, or other key attributes that require preservation. Within this framework, each level of encoded data can be carefully designed to retain specific data semantics. As a result, the precision of a semantic classifier can be progressively enhanced across successive layers, emphasizing the preservation of targeted semantics throughout the encoding and decoding stages. We conduct experiments on MNIST and CIFAR10 dataset. The experiment with both datasets illustrates that our proposed method is capable of surpassing the SSCC method in reconstructing data with different resolutions, enabling the extraction of semantic features with heightened confidence in successive layers. This capability is particularly advantageous for prioritizing and preserving more crucial semantic features within the datasets.

Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11601
  • repo_url: None
  • paper_authors: Surya Narayanan Hari, Matt Thomson
  • for: This paper aims to address the problem of selecting and optimizing language models for specific downstream tasks and data domains, and to provide a framework for efficient use of the expanding and evolving language model ecosystem.
  • methods: The proposed method, Tryage, leverages a language model router to predict downstream model performance on prompts and makes a routing decision using an objective function that integrates performance predictions with user goals and constraints.
  • results: Tryage surpasses Gorilla and GPT3.5 turbo in dynamic model selection, identifying the optimal model with an accuracy of 50.9%, compared to 23.6% by GPT 3.5 Turbo and 10.8% by Gorilla, across heterogeneous data sets that include code, text, clinical data, and patents.
    Abstract The introduction of the transformer architecture and the self-attention mechanism has led to an explosive production of language models trained on specific downstream tasks and data domains. With over 200, 000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains while addressing computational, security, and recency concerns. There is an urgent need for machine learning frameworks that can eliminate the burden of model selection and customization and unleash the incredible power of the vast emerging model library for end users. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library based on analysis of individual input prompts. Inspired by the thalamic router in the brain, Tryage employs a perceptive router to predict down-stream model performance on prompts and, then, makes a routing decision using an objective function that integrates performance predictions with user goals and constraints that are incorporated through flags (e.g., model size, model recency). Tryage allows users to explore a Pareto front and automatically trade-off between task accuracy and secondary goals including minimization of model size, recency, security, verbosity, and readability. Across heterogeneous data sets that include code, text, clinical data, and patents, the Tryage framework surpasses Gorilla and GPT3.5 turbo in dynamic model selection identifying the optimal model with an accuracy of 50.9% , compared to 23.6% by GPT 3.5 Turbo and 10.8% by Gorilla. Conceptually, Tryage demonstrates how routing models can be applied to program and control the behavior of multi-model LLM systems to maximize efficient use of the expanding and evolving language model ecosystem.
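A hypothetical sketch of the routing objective: combine a predicted prompt-level score with user-flagged penalties and take the argmax. All model names, numbers, and weights below are made up for illustration; the paper's router learns its performance predictions.

```python
# Candidate models with a predicted accuracy on the current prompt and
# flag-relevant attributes (size, staleness). Purely illustrative values.
candidates = {
    "code-model-7b":  {"pred_acc": 0.71, "size_gb": 14,  "age_months": 3},
    "chat-model-70b": {"pred_acc": 0.78, "size_gb": 140, "age_months": 9},
    "tiny-model-1b":  {"pred_acc": 0.55, "size_gb": 2,   "age_months": 1},
}

def route(candidates, w_size=0.001, w_age=0.005):
    """Pick the model maximizing predicted accuracy minus weighted
    penalties; the weights encode user flags and trace a Pareto trade-off."""
    def objective(name):
        c = candidates[name]
        return c["pred_acc"] - w_size * c["size_gb"] - w_age * c["age_months"]
    return max(candidates, key=objective)

print(route(candidates))                 # accuracy-leaning choice
print(route(candidates, w_size=0.02))    # size-constrained choice
```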

Low Tensor Rank Learning of Neural Dynamics

  • paper_url: http://arxiv.org/abs/2308.11567
  • repo_url: None
  • paper_authors: Arthur Pellegrino, N Alex Cayco-Gajic, Angus Chadwick
  • for: studies how the connectivity of recurrently connected neuron populations evolves collectively over learning, and how these changes shape learning.
  • methods: fits RNNs of varying rank to large-scale neural recordings from a motor learning task, and characterizes how the weight matrices change through a low-tensor-rank decomposition of the 3-tensor they form over the course of learning.
  • results: the inferred weights are low-tensor-rank, so they evolve within a fixed low-dimensional subspace throughout learning; mathematical results further bound the matrix and tensor ranks of gradient-descent learning dynamics, constraining how population connectivity can evolve in both biological and artificial networks.
    Abstract Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task by performing a low-tensor-rank decomposition directly on the ground truth weights, and by showing that the method we applied to the data faithfully recovers this low rank structure. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide novel constraints on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent network dynamics from large-scale neural recordings.
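The central object is a (checkpoints x N x N) weight 3-tensor; a CP decomposition (here via tensorly, an assumed dependency) recovers a low tensor rank when learning moves within a fixed low-dimensional subspace. The sketch builds a synthetic low-rank tensor and verifies it can be recovered.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Stack of recurrent weight matrices saved at T checkpoints during training:
# a (T, N, N) 3-tensor, here synthesized with CP rank 3 plus noise.
T, N, rank = 20, 50, 3
rng = np.random.default_rng(0)
factors = [rng.normal(size=(dim, rank)) for dim in (T, N, N)]
weights_tensor = tl.cp_to_tensor((np.ones(rank), factors))
weights_tensor += 0.01 * rng.normal(size=weights_tensor.shape)

cp = parafac(tl.tensor(weights_tensor), rank=rank, n_iter_max=200)
recon = tl.cp_to_tensor(cp)
err = np.linalg.norm(recon - weights_tensor) / np.linalg.norm(weights_tensor)
print(round(float(err), 4))  # small relative error -> low tensor rank
```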

Practical Insights on Incremental Learning of New Human Physical Activity on the Edge

  • paper_url: http://arxiv.org/abs/2308.11691
  • repo_url: None
  • paper_authors: George Arvanitakis, Jingwei Zuo, Mthandazo Ndhlovu, Hakim Hacid
  • for: examines Edge Machine Learning (Edge ML), which shifts computational intelligence to edge devices, highlighting its benefits and challenges.
  • methods: uses the MAGNETO system to learn human activities from mobile sensor data.
  • results: experiments highlight the challenges Edge ML introduces: constrained data storage, limited computational power for training, and a growing number of learning classes.
    Abstract Edge Machine Learning (Edge ML), which shifts computational intelligence from cloud-based systems to edge devices, is attracting significant interest due to its evident benefits including reduced latency, enhanced data privacy, and decreased connectivity reliance. While these advantages are compelling, they introduce unique challenges absent in traditional cloud-based approaches. In this paper, we delve into the intricacies of Edge-based learning, examining the interdependencies among: (i) constrained data storage on Edge devices, (ii) limited computational power for training, and (iii) the number of learning classes. Through experiments conducted using our MAGNETO system, that focused on learning human activities via data collected from mobile sensors, we highlight these challenges and offer valuable perspectives on Edge ML.

Multi-event Video-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.11551
  • repo_url: https://github.com/gengyuanmax/mevtr
  • paper_authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
  • for: This paper targets the problem of video-text retrieval (VTR) in the era of massive video-text data, specifically addressing the gap between previous models and real-world scenarios where videos contain multiple events.
  • methods: The proposed method, Me-Retriever, incorporates key event video representation and a new MeVTR loss for the Multi-event Video-Text Retrieval (MeVTR) task.
  • results: The straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, establishing a robust baseline for the MeVTR task.
    Abstract Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.

eess.IV - 2023-08-23

Tumor-Centered Patching for Enhanced Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12168
  • repo_url: None
  • paper_authors: Mutyyba Asghar, Ahmad Raza Shahid, Akhtar Jamil, Kiran Aftab, Syed Ather Enam
  • for: aims to improve image segmentation accuracy in medical image diagnosis systems, addressing limitations of deep learning methods in practice, such as limited resources, slow convergence, and class imbalance.
  • methods: proposes a novel "tumor-centered patching" method that aligns patches with the tumor's anatomical context, improving feature-extraction accuracy and computational efficiency.
  • results: experiments show the method mitigates class imbalance, with segmentation scores of 0.78, 0.76, and 0.71 for whole, core, and enhancing tumors, respectively.
    Abstract The realm of medical image diagnosis has advanced significantly with the integration of computer-aided diagnosis and surgical systems. However, challenges persist, particularly in achieving precise image segmentation. While deep learning techniques show potential, obstacles like limited resources, slow convergence, and class imbalance impede their effectiveness. Traditional patch-based methods, though common, struggle to capture intricate tumor boundaries and often lead to redundant samples, compromising computational efficiency and feature quality. To tackle these issues, this research introduces an innovative approach centered on the tumor itself for patch-based image analysis. This novel tumor-centered patching method aims to address the class imbalance and boundary deficiencies, enabling focused and accurate tumor segmentation. By aligning patches with the tumor's anatomical context, this technique enhances feature extraction accuracy and reduces computational load. Experimental results demonstrate improved class imbalance, with segmentation scores of 0.78, 0.76, and 0.71 for whole, core, and enhancing tumors, respectively using a lightweight simple U-Net. This approach shows potential for enhancing medical image segmentation and improving computer-aided diagnosis systems.
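A minimal sketch of tumor-centered patch sampling: center patches on the tumor centroid (with small jitter) instead of tiling the whole volume. Patch size, jitter, and data are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy import ndimage

def tumor_centered_patches(volume, mask, patch=32, n_patches=8, seed=0):
    """Sample patches whose centers are jittered around the tumor centroid,
    so every patch contains tumor context."""
    rng = np.random.default_rng(seed)
    cz, cy, cx = (int(c) for c in ndimage.center_of_mass(mask))
    half = patch // 2
    patches = []
    for _ in range(n_patches):
        jz, jy, jx = rng.integers(-half // 2, half // 2 + 1, size=3)
        z = np.clip(cz + jz, half, volume.shape[0] - half)
        y = np.clip(cy + jy, half, volume.shape[1] - half)
        x = np.clip(cx + jx, half, volume.shape[2] - half)
        patches.append(volume[z - half:z + half, y - half:y + half, x - half:x + half])
    return np.stack(patches)

vol = np.random.rand(96, 96, 96)
msk = np.zeros_like(vol)
msk[40:60, 30:50, 50:70] = 1            # toy tumor mask
print(tumor_centered_patches(vol, msk).shape)  # (8, 32, 32, 32)
```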

DISGAN: Wavelet-informed Discriminator Guides GAN to MRI Super-resolution with Noise Cleaning

  • paper_url: http://arxiv.org/abs/2308.12084
  • repo_url: None
  • paper_authors: Qi Wang, Lucas Mahler, Julius Steiglechner, Florian Birk, Klaus Scheffler, Gabriele Lohmann
  • for: proposes a deep learning method that performs MRI super-resolution and denoising simultaneously, without explicitly paired noisy and clean training images.
  • methods: builds on a GAN in which the 3D Discrete Wavelet Transform (DWT) serves as a frequency constraint guiding the discriminator.
  • results: the model achieves high-quality SR with intrinsic denoising, without training separate SR and denoising models.
    Abstract MRI super-resolution (SR) and denoising tasks are fundamental challenges in the field of deep learning, which have traditionally been treated as distinct tasks with separate paired training data. In this paper, we propose an innovative method that addresses both tasks simultaneously using a single deep learning model, eliminating the need for explicitly paired noisy and clean images during training. Our proposed model is primarily trained for SR, but also exhibits remarkable noise-cleaning capabilities in the super-resolved images. Instead of conventional approaches that introduce frequency-related operations into the generative process, our novel approach involves the use of a GAN model guided by a frequency-informed discriminator. To achieve this, we harness the power of the 3D Discrete Wavelet Transform (DWT) operation as a frequency constraint within the GAN framework for the SR task on magnetic resonance imaging (MRI) data. Specifically, our contributions include: 1) a 3D generator based on residual-in-residual connected blocks; 2) the integration of the 3D DWT with $1\times 1$ convolution into a DWT+conv unit within a 3D Unet for the discriminator; 3) the use of the trained model for high-quality image SR, accompanied by an intrinsic denoising process. We dub the model "Denoising Induced Super-resolution GAN (DISGAN)" due to its dual effects of SR image generation and simultaneous denoising. Departing from the traditional approach of training SR and denoising tasks as separate models, our proposed DISGAN is trained only on the SR task, but also achieves exceptional performance in denoising. The model is trained on 3D MRI data from dozens of subjects from the Human Connectome Project (HCP) and further evaluated on previously unseen MRI data from subjects with brain tumours and epilepsy to assess its denoising and SR performance.
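The frequency-informed input to such a discriminator can be sketched with PyWavelets: one level of 3D DWT yields eight subbands that can be stacked as channels (the paper additionally fuses the DWT with 1x1 convolutions inside a 3D U-Net; this shows only the transform, on random data).

```python
import numpy as np
import pywt

# Decompose a 3D MRI-like volume with a single-level 3D DWT and stack the
# eight subbands as channels for a frequency-aware discriminator input.
volume = np.random.rand(64, 64, 64).astype(np.float32)
coeffs = pywt.dwtn(volume, wavelet="haar")   # dict: 'aaa', 'aad', ..., 'ddd'
subbands = np.stack([coeffs[k] for k in sorted(coeffs)], axis=0)
print(subbands.shape)  # (8, 32, 32, 32): low-pass 'aaa' plus 7 detail bands
```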

StofNet: Super-resolution Time of Flight Network

  • paper_url: http://arxiv.org/abs/2308.12009
  • repo_url: https://github.com/hahnec/stofnet
  • paper_authors: Christopher Hahne, Michel Hayoz, Raphael Sznitman
  • for: proposes a Time of Flight (ToF) detection method based on modern super-resolution techniques to improve the reliability and accuracy of ToF sensing.
  • methods: combines super-resolution with an efficient residual contraction block to balance fine signal detail against large-scale contextual information.
  • results: a benchmark against six state-of-the-art methods shows the proposed StofNet is superior in terms of precision, reliability, and model complexity.
    Abstract Time of Flight (ToF) is a prevalent depth sensing technology in the fields of robotics, medical imaging, and non-destructive testing. Yet, ToF sensing faces challenges from complex ambient conditions making an inverse modelling from the sparse temporal information intractable. This paper highlights the potential of modern super-resolution techniques to learn varying surroundings for a reliable and accurate ToF detection. Unlike existing models, we tailor an architecture for sub-sample precise semi-global signal localization by combining super-resolution with an efficient residual contraction block to balance between fine signal details and large scale contextual information. We consolidate research on ToF by conducting a benchmark comparison against six state-of-the-art methods for which we employ two publicly available datasets. This includes the release of our SToF-Chirp dataset captured by an airborne ultrasound transducer. Results showcase the superior performance of our proposed StofNet in terms of precision, reliability and model complexity. Our code is available at https://github.com/hahnec/stofnet.

Comparing Autoencoder to Geometrical Features for Vascular Bifurcations Identification

  • paper_url: http://arxiv.org/abs/2308.12314
  • repo_url: None
  • paper_authors: Ibtissam Essadik, Anass Nouri, Raja Touahni, Florent Autrusseau
  • for: proposes two new methods, based respectively on an autoencoder and on geometrical features, for identifying bifurcations in the cerebrovascular tree.
  • methods: uses an autoencoder and geometrical features for feature extraction and pattern recognition, classifying bifurcations in medical imaging data.
  • results: both the autoencoder-based and the geometry-based approaches identify cerebrovascular bifurcations efficiently, with high accuracy and F1 scores.
Abstract The cerebrovascular tree is a complex anatomical structure that plays a crucial role in the brain irrigation. A precise identification of the bifurcations in the vascular network is essential for understanding various cerebral pathologies. Traditional methods often require manual intervention and are sensitive to variations in data quality. In recent years, deep learning techniques, and particularly autoencoders, have shown promising performances for feature extraction and pattern recognition in a variety of domains. In this paper, we propose two novel approaches for vascular bifurcation identification based respectively on Autoencoder and geometrical features. The performance and effectiveness of each method in terms of classification of vascular bifurcations using medical imaging data are presented. The evaluation was performed on a sample database composed of 91 TOF-MRA, using various evaluation measures, including accuracy, F1 score and confusion matrix.

Recovering a Molecule’s 3D Dynamics from Liquid-phase Electron Microscopy Movies

  • paper_url: http://arxiv.org/abs/2308.11927
  • repo_url: None
  • paper_authors: Enze Ye, Yuhang Wang, Hong Zhang, Yiqin Gao, Huan Wang, He Sun
  • for: studying the dynamics of biomolecules to better understand how they function in living systems.
  • methods: builds on the innovative liquid-phase electron microscopy (liquid-phase EM) technique, which keeps molecules in their native liquid environment and thus offers a unique opportunity to observe their dynamics.
  • results: proposes TEMPOR, a Temporal Electron MicroscoPy Object Reconstruction algorithm that combines an implicit neural representation (INR) with a dynamical variational auto-encoder (DVAE) to recover time series of molecular structures from liquid-phase EM movies; demonstrated on two simulated datasets, 7bcq and Cas9, it is the first attempt to directly recover 3D structural changes from liquid-phase EM movies and offers a promising new approach for studying molecules' 3D dynamics in structural biology.
    Abstract The dynamics of biomolecules are crucial for our understanding of their functioning in living systems. However, current 3D imaging techniques, such as cryogenic electron microscopy (cryo-EM), require freezing the sample, which limits the observation of their conformational changes in real time. The innovative liquid-phase electron microscopy (liquid-phase EM) technique allows molecules to be placed in the native liquid environment, providing a unique opportunity to observe their dynamics. In this paper, we propose TEMPOR, a Temporal Electron MicroscoPy Object Reconstruction algorithm for liquid-phase EM that leverages an implicit neural representation (INR) and a dynamical variational auto-encoder (DVAE) to recover time series of molecular structures. We demonstrate its advantages in recovering different motion dynamics from two simulated datasets, 7bcq and Cas9. To our knowledge, our work is the first attempt to directly recover 3D structures of a temporally-varying particle from liquid-phase EM movies. It provides a promising new approach for studying molecules' 3D dynamics in structural biology.

Studying the Impact of Augmentations on Medical Confidence Calibration

  • paper_url: http://arxiv.org/abs/2308.11902
  • repo_url: None
  • paper_authors: Adrit Rao, Joon-Young Lee, Oliver Aalami
  • for: evaluates how modern augmentation techniques affect the confidence calibration and performance of convolutional neural networks (CNNs) in medical tasks.
  • methods: studies three modern augmentation techniques: CutMix, MixUp, and CutOut.
  • results: CutMix improved calibration the most, whereas CutOut often lowered the level of calibration.
    Abstract The clinical explainability of convolutional neural networks (CNN) heavily relies on the joint interpretation of a model's predicted diagnostic label and associated confidence. A highly certain or uncertain model can significantly impact clinical decision-making. Thus, ensuring that confidence estimates reflect the true correctness likelihood for a prediction is essential. CNNs are often poorly calibrated and prone to overconfidence leading to improper measures of uncertainty. This creates the need for confidence calibration. However, accuracy and performance-based evaluations of CNNs are commonly used as the sole benchmark for medical tasks. Taking into consideration the risks associated with miscalibration is of high importance. In recent years, modern augmentation techniques, which cut, mix, and combine images, have been introduced. Such augmentations have benefited CNNs through regularization, robustness to adversarial samples, and calibration. Standard augmentations based on image scaling, rotating, and zooming, are widely leveraged in the medical domain to combat the scarcity of data. In this paper, we evaluate the effects of three modern augmentation techniques, CutMix, MixUp, and CutOut on the calibration and performance of CNNs for medical tasks. CutMix improved calibration the most while CutOut often lowered the level of calibration.
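For reference, minimal NumPy versions of two ingredients such a study pairs: the CutMix augmentation itself (Yun et al., 2019) and the expected calibration error (ECE) commonly used to score calibration; the paper's exact training and metric setup may differ, and one-hot labels are assumed.

```python
import numpy as np

def cutmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Standard CutMix: paste a random box from image 2 into image 1 and
    mix the one-hot labels by the pasted-area fraction."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    h, w = x1.shape[-2:]
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    top, bot = np.clip([cy - rh // 2, cy + rh // 2], 0, h)
    left, right = np.clip([cx - rw // 2, cx + rw // 2], 0, w)
    out = x1.copy()
    out[..., top:bot, left:right] = x2[..., top:bot, left:right]
    lam = 1.0 - (bot - top) * (right - left) / (h * w)  # exact label weight
    return out, lam * y1 + (1.0 - lam) * y2

def ece(conf, correct, n_bins=10):
    """Expected calibration error: |accuracy - mean confidence| per
    confidence bin, weighted by bin occupancy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            total += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return total
```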

Enhanced Residual SwinV2 Transformer for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2308.11864
  • repo_url: None
  • paper_authors: Yongqiang Wang, Feng Liang, Haisheng Fu, Jie Liang, Haipeng Qin, Junzhe Liang
  • for: improving learned image compression performance while keeping the model simple
  • methods: an enhanced residual SwinV2 transformer with a feature enhancement module of three consecutive convolutional layers; SwinV2 transformer-based attention is used in the coding and hyper-coding steps
  • results: on the Kodak and Tecnick datasets, the method performs comparably to recent learned image compression approaches and outperforms some traditional codecs including VVC, while reducing model complexity by 56%.
Abstract Recently, deep learning technology has been successfully applied in the field of image compression, leading to superior rate-distortion performance. However, a challenge of many learning-based approaches is that they often achieve better performance via sacrificing complexity, which makes practical deployment difficult. To alleviate this issue, in this paper, we propose an effective and efficient learned image compression framework based on an enhanced residual Swinv2 transformer. To enhance the nonlinear representation of images in our framework, we use a feature enhancement module that consists of three consecutive convolutional layers. In the subsequent coding and hyper coding steps, we utilize a SwinV2 transformer-based attention mechanism to process the input image. The SwinV2 model can help to reduce model complexity while maintaining high performance. Experimental results show that the proposed method achieves comparable performance compared to some recent learned image compression methods on Kodak and Tecnick datasets, and outperforms some traditional codecs including VVC. In particular, our method achieves comparable results while reducing model complexity by 56% compared to these recent methods.

Robust RF Data Normalization for Deep Learning

  • paper_url: http://arxiv.org/abs/2308.11833
  • repo_url: None
  • paper_authors: Mostafa Sharifzadeh, Habib Benali, Hassan Rivaz
  • for: training deep neural networks on radio frequency (RF) data for ultrasound image processing
  • methods: individual standardization of each image, improving on conventional dataset-level normalization
  • results: improved generalizability and performance of the trained deep models
    Abstract Radio frequency (RF) data contain richer information compared to other data types, such as envelope or B-mode, and employing RF data for training deep neural networks has attracted growing interest in ultrasound image processing. However, RF data is highly fluctuating and additionally has a high dynamic range. Most previous studies in the literature have relied on conventional data normalization, which has been adopted within the computer vision community. We demonstrate the inadequacy of those techniques for normalizing RF data and propose that individual standardization of each image substantially enhances the performance of deep neural networks by utilizing the data more efficiently. We compare conventional and proposed normalizations in a phase aberration correction task and illustrate how the former enhances the generality of trained models.
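The proposed per-image standardization is simple to state in code; the sketch below contrasts it with a single dataset-level mean/std (shapes and data are synthetic).

```python
import numpy as np

def standardize_per_image(rf_batch, eps=1e-8):
    """Standardize each RF frame with its own mean/std, rather than one
    dataset-level mean/std, to tame the high dynamic range of RF data."""
    mean = rf_batch.mean(axis=(-2, -1), keepdims=True)
    std = rf_batch.std(axis=(-2, -1), keepdims=True)
    return (rf_batch - mean) / (std + eps)

# Four frames whose amplitudes span three orders of magnitude.
batch = np.random.randn(4, 2048, 128) * np.array([1, 10, 100, 1000])[:, None, None]
normed = standardize_per_image(batch)
print(normed.std(axis=(-2, -1)))  # each frame is now on a unit scale
```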

Frequency-Space Prediction Filtering for Phase Aberration Correction in Plane-Wave Ultrasound

  • paper_url: http://arxiv.org/abs/2308.11830
  • repo_url: None
  • paper_authors: Mostafa Sharifzadeh, Habib Benali, Hassan Rivaz
  • for: This paper aims to improve the image quality of ultrasound imaging by addressing the challenge of phase aberration, which is a significant contributing factor to image degradation in focused ultrasound imaging.
  • methods: The paper proposes an adaptive AR model to improve the performance of frequency-space prediction filtering (FXPF) in plane-wave imaging, where the number of contributing signals varies at different depths.
  • results: The proposed adaptive AR model is effective in improving image quality, as demonstrated by the improved contrast and generalized contrast-to-noise ratio metrics compared to using a fixed-order AR model.
    Abstract Ultrasound imaging often suffers from image degradation stemming from phase aberration, which represents a significant contributing factor to the overall image degradation in ultrasound imaging. Frequency-space prediction filtering or FXPF is a technique that has been applied within focused ultrasound imaging to alleviate the phase aberration effect. It presupposes the existence of an autoregressive (AR) model across the signals received at the transducer elements and removes any components that do not conform to the established model. In this study, we illustrate the challenge of applying this technique to plane-wave imaging, where, at shallower depths, signals from more distant elements lose relevance, and a fewer number of elements contribute to image reconstruction. While the number of contributing signals varies, adopting a fixed-order AR model across all depths, results in suboptimal performance. To address this challenge, we propose an AR model with an adaptive order and quantify its effectiveness using contrast and generalized contrast-to-noise ratio metrics.

WS-SfMLearner: Self-supervised Monocular Depth and Ego-motion Estimation on Surgical Videos with Unknown Camera Parameters

  • paper_url: http://arxiv.org/abs/2308.11776
  • repo_url: None
  • paper_authors: Ange Lou, Jack Noble
  • for: building a self-supervised depth and ego-motion estimation system for image-guided surgical video.
  • methods: a cost-volume-based self-supervision scheme that additionally predicts the camera intrinsic parameters.
  • results: experiments show the proposed method improves the accuracy of the estimated camera parameters, ego-motion, and depth.
    Abstract Depth estimation in surgical video plays a crucial role in many image-guided surgery procedures. However, it is difficult and time consuming to create depth map ground truth datasets in surgical videos due in part to inconsistent brightness and noise in the surgical scene. Therefore, building an accurate and robust self-supervised depth and camera ego-motion estimation system is gaining more attention from the computer vision community. Although several self-supervision methods alleviate the need for ground truth depth maps and poses, they still need known camera intrinsic parameters, which are often missing or not recorded. Moreover, the camera intrinsic prediction methods in existing works depend heavily on the quality of datasets. In this work, we aimed to build a self-supervised depth and ego-motion estimation system which can predict not only accurate depth maps and camera pose, but also camera intrinsic parameters. We proposed a cost-volume-based supervision manner to give the system auxiliary supervision for camera parameters prediction. The experimental results showed that the proposed method improved the accuracy of estimated camera parameters, ego-motion, and depth estimation.

EndoNet: model for automatic calculation of H-score on histological slides

  • paper_url: http://arxiv.org/abs/2308.11562
  • repo_url: None
  • paper_authors: Egor Ushakov, Anton Naumov, Vladislav Fomberg, Polina Vishnyakova, Aleksandra Asaturova, Alina Badlaeva, Anna Tregubova, Evgeny Karpulevich, Gennady Sukhikh, Timur Fatkhudinov
  • for: This paper aims to develop a computer-aided method for automatic calculation of H-score on histological slides, which can improve the efficiency and accuracy of pathologists’ workflows.
  • methods: The proposed method, called EndoNet, uses neural networks and consists of two main parts: a detection model that predicts keypoints of centers of nuclei, and a H-score module that calculates the value of the H-score using mean pixel values of predicted keypoints.
  • results: The model was trained and validated on 1780 annotated tiles with a shape of 100x100 $\mu m$ and achieved 0.77 mAP on a test dataset. The model is effective and robust in the analysis of histology slides, which can improve and significantly accelerate the work of pathologists.
Abstract H-score is a semi-quantitative method used to assess the presence and distribution of proteins in tissue samples by combining the intensity of staining and percentage of stained nuclei. It is widely used but time-consuming and can be limited in accuracy and precision. Computer-aided methods may help overcome these limitations and improve the efficiency of pathologists' workflows. In this work, we developed a model EndoNet for automatic calculation of H-score on histological slides. Our proposed method uses neural networks and consists of two main parts. The first is a detection model which predicts keypoints of centers of nuclei. The second is an H-score module which calculates the value of the H-score using mean pixel values of predicted keypoints. Our model was trained and validated on 1780 annotated tiles with a shape of 100x100 $\mu m$ and achieved 0.77 mAP on a test dataset. Moreover, the model can be adjusted to a specific specialist or whole laboratory to reproduce the manner of calculating the H-score. Thus, EndoNet is effective and robust in the analysis of histology slides, which can improve and significantly accelerate the work of pathologists.
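The H-score itself is a fixed formula once per-nucleus staining intensities are available; a minimal sketch follows, with illustrative intensity thresholds (EndoNet derives intensities from mean pixel values at its predicted nucleus keypoints).

```python
import numpy as np

def h_score(intensities, weak=0.25, strong=0.75):
    """Classic H-score from per-nucleus staining intensities in [0, 1]:
    H = 1*(% weak) + 2*(% moderate) + 3*(% strong), range 0-300.
    The two thresholds here are illustrative assumptions."""
    intensities = np.asarray(intensities)
    pct = lambda m: 100.0 * m.mean()
    return (1 * pct((intensities > 0) & (intensities <= weak))
            + 2 * pct((intensities > weak) & (intensities <= strong))
            + 3 * pct(intensities > strong))

print(h_score([0.0, 0.1, 0.3, 0.5, 0.8, 0.9]))  # ~183 for this toy input
```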

Open Set Synthetic Image Source Attribution

  • paper_url: http://arxiv.org/abs/2308.11557
  • repo_url: None
  • paper_authors: Shengbang Fang, Tai D. Nguyen, Matthew C. Stamm
  • for: developing a metric-learning-based open-set source attribution technique that can attribute synthetic images to their source generators even as new generators emerge.
  • methods: learns transferable embeddings that discriminate between generators; an image is first assigned to a candidate generator, then accepted or rejected based on its distance in embedding space from the known generators' learned reference points. Pretraining the embedding network on camera identification improves transferability.
  • results: experiments demonstrate the feasibility and effectiveness of open-set source attribution, attributing synthetic images even to generators unseen during training.
    Abstract AI-generated images have become increasingly realistic and have garnered significant public attention. While synthetic images are intriguing due to their realism, they also pose an important misinformation threat. To address this new threat, researchers have developed multiple algorithms to detect synthetic images and identify their source generators. However, most existing source attribution techniques are designed to operate in a closed-set scenario, i.e. they can only be used to discriminate between known image generators. By contrast, new image-generation techniques are rapidly emerging. To contend with this, there is a great need for open-set source attribution techniques that can identify when synthetic images have originated from new, unseen generators. To address this problem, we propose a new metric learning-based approach. Our technique works by learning transferrable embeddings capable of discriminating between generators, even when they are not seen during training. An image is first assigned to a candidate generator, then is accepted or rejected based on its distance in the embedding space from known generators' learned reference points. Importantly, we identify that initializing our source attribution embedding network by pretraining it on image camera identification can improve our embeddings' transferability. Through a series of experiments, we demonstrate our approach's ability to attribute the source of synthetic images in open-set scenarios.
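The accept/reject step reduces to nearest-reference matching with a distance threshold; a minimal sketch on 2D toy embeddings follows (generator names, vectors, and the threshold are illustrative).

```python
import numpy as np

def attribute_open_set(embedding, references, threshold):
    """Assign an image embedding to the nearest known generator's reference
    point, or reject as 'unknown' if it lies too far from all of them."""
    names = list(references)
    dists = np.array([np.linalg.norm(embedding - references[n]) for n in names])
    i = int(dists.argmin())
    return (names[i], dists[i]) if dists[i] <= threshold else ("unknown", dists[i])

refs = {"gan_a": np.array([0.0, 0.0]), "diffusion_b": np.array([3.0, 3.0])}
print(attribute_open_set(np.array([0.2, -0.1]), refs, threshold=1.0))   # gan_a
print(attribute_open_set(np.array([10.0, -8.0]), refs, threshold=1.0))  # unknown
```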

eess.AS - 2023-08-23

Analysis of XLS-R for Speech Quality Assessment

Abstract:
In online conferencing applications, estimating the perceived quality of an audio signal is crucial to ensure high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments in the form of the mean opinion score (MOS) metric. However, such an approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance for the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R and also for each size of the model (300M, 1B, 2B parameters). Surprisingly, we find two optimal regions for feature extraction: one in the lower-level features and one in the high-level features. Next, we investigate the reason for the two distinct optima. We hypothesize that the lower-level features capture characteristics of noise and room acoustics, whereas the high-level features focus on speech content and intelligibility. To investigate this, we analyze the sensitivity of the MOS predictions with respect to different levels of corruption in each category. Afterwards, we try fusing the two optimal feature depths to determine if they contain complementary information for MOS prediction. Finally, we compare the performance of the proposed models and assess the generalizability of the models on unseen datasets.
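A hedged sketch of the layer-wise feature extraction underlying this analysis, using Hugging Face transformers; the checkpoint name and mean pooling are assumptions, and the paper's probing head for MOS prediction may differ.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Any wav2vec2-style checkpoint exposing hidden states works the same way.
name = "facebook/wav2vec2-xls-r-300m"
fe = AutoFeatureExtractor.from_pretrained(name)
model = AutoModel.from_pretrained(name)

wav = torch.randn(16000)  # 1 s of fake 16 kHz audio
inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One mean-pooled vector per layer; each can feed a small MOS regressor to
# probe which depth is most predictive of perceived quality.
layer_vecs = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
print(len(layer_vecs), layer_vecs[0].shape)
```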


Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning

Abstract:
Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality that may impact humans’ listening mood for the environment, e.g. annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show the proposed HGRL successfully integrates AE with AR for AEC and ARP tasks, while coordinating the relations between cAE and fAE and further aligning the two different grains of AE information with the AR.


CED: Consistent ensemble distillation for audio tagging

Abstract:
Augmentation and knowledge distillation (KD) are well-established techniques employed in the realm of audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, requiring just 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
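A rough sketch of the consistent-teaching step: the augmentation parameters and the teacher ensemble's logits are stored once, then replayed so the student is optimized on exactly the augmented input the teacher scored, with no ground-truth labels. `apply_augmentation` is a hypothetical helper; the sigmoid/BCE formulation reflects the multi-label nature of Audioset tagging.

```python
import torch
import torch.nn.functional as F

def consistent_teaching_step(student, batch_audio, stored_record):
    """One label-free distillation step (hypothetical interfaces).

    stored_record: (aug_params, teacher_logits) read back from disk for
                   this batch; both were produced in a single teacher pass.
    """
    aug_params, teacher_logits = stored_record
    x = apply_augmentation(batch_audio, aug_params)   # replay the *same* augmentation
    student_logits = student(x)
    # Match the teacher's posteriors; no labels are involved.
    return F.binary_cross_entropy_with_logits(student_logits,
                                              torch.sigmoid(teacher_logits))
```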


Audio Generation with Multiple Conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11940
  • repo_url: None
  • paper_authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang

Abstract:
Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
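A schematic sketch of the architecture described above: the pre-trained text-to-audio model stays frozen while a trainable condition encoder and a small fusion layer (a stand-in for Fusion-Net) inject the additional content and style conditions. All interfaces, dimensions, and the assumption that text and condition sequences are length-aligned are simplifications for illustration.

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Hypothetical sketch: fuse extra conditions (timestamps, pitch and
    energy contours) into a frozen pre-trained text-to-audio model."""
    def __init__(self, t2a_model, cond_dim, hidden_dim):
        super().__init__()
        self.t2a = t2a_model
        for p in self.t2a.parameters():
            p.requires_grad_(False)            # keep pre-trained weights frozen
        self.cond_encoder = nn.GRU(cond_dim, hidden_dim, batch_first=True)
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)  # Fusion-Net stand-in

    def forward(self, text_emb, conditions):
        # text_emb: (B, T, hidden_dim); conditions: (B, T, cond_dim)
        cond_emb, _ = self.cond_encoder(conditions)
        fused = self.fusion(torch.cat([text_emb, cond_emb], dim=-1))
        return self.t2a.generate_from_embedding(fused)   # hypothetical frozen API
```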


Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

Abstract:
We propose Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC task addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips, and a similarity-discrepancy disentanglement to emphasize the difference in the latent space. To evaluate the proposed methods, we built an AudioDiffCaps dataset consisting of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively; visualizing the attention weights in the transformer encoder confirmed that they concentrate on the differences between the clips.


KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods

Abstract:
Despite recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in four successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.9% WER on Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.
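A highly simplified sketch of the generational self-training loop: pseudo-label the unlabelled speech with the current model, keep only confident examples, retrain, and repeat. The `transcribe_with_score` and `train_asr` interfaces and the 0.9 confidence gate are hypothetical stand-ins, not the paper's exact procedure.

```python
def generational_self_training(model, labelled, unlabelled, generations=4):
    """Semi-supervised learning over successive generations (sketch)."""
    for _ in range(generations):
        pseudo = []
        for wave in unlabelled:
            text, confidence = model.transcribe_with_score(wave)  # hypothetical API
            if confidence > 0.9:           # keep only confident pseudo-labels
                pseudo.append((wave, text))
        model = train_asr(labelled + pseudo)  # hypothetical trainer
    return model
```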


Example-Based Framework for Perceptually Guided Audio Texture Generation

Abstract:
Generative models for synthesizing audio textures explicitly encode controllability by conditioning the model with labelled data. While datasets for audio textures can be easily recorded in-the-wild, semantically labeling them is expensive, time-consuming, and prone to errors due to human annotator subjectivity. Thus, to control generation, there is a need to automatically infer user-defined perceptual factors of variation in the latent space of a generative model while modelling unlabeled textures. In this paper, we propose an example-based framework to determine vectors to guide texture generation based on user-defined semantic attributes. By synthesizing a few synthetic examples to indicate the presence or absence of a semantic attribute, we can infer the guidance vectors in the latent space of a generative model to control that attribute during generation. Our results show that our method is capable of finding perceptually relevant and deterministic guidance vectors for controllable generation for both discrete as well as continuous textures. Furthermore, we demonstrate the application of this method to other tasks such as selective semantic attribute transfer.
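The guidance-vector construction itself is simple: encode a few synthetic examples with and without the attribute, and take the difference of the mean latents as the direction to move along during generation. A minimal sketch, assuming an `encoder`/`generator` pair for a pre-trained latent generative model.

```python
import numpy as np

def guidance_vector(encoder, examples_with, examples_without):
    """Infer a latent direction for a user-defined semantic attribute from a
    few synthetic examples that contain / lack it (hypothetical encoder)."""
    z_pos = np.stack([encoder(x) for x in examples_with])
    z_neg = np.stack([encoder(x) for x in examples_without])
    return z_pos.mean(axis=0) - z_neg.mean(axis=0)

def generate_with_attribute(generator, z, direction, strength=1.0):
    """Shift a latent along the inferred direction to control the attribute."""
    return generator(z + strength * direction)
```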


Complex-valued neural networks for voice anti-spoofing

Abstract:
Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or Melspectrograms) or raw audio processed through convolution or sinc-layers. Both methods have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-feature-based models cannot use traditional explainable AI methods. This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the complex-valued, CQT frequency-domain representation of the input audio. This method retains phase information and allows for explainable AI methods. Results show that this approach outperforms previous methods on the “In-the-Wild” anti-spoofing dataset and enables interpretation of the results through explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
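A minimal sketch of the basic building block implied here: a complex-valued convolution realized with two real convolutions, following the usual deep-complex-network construction. This illustrates only the layer, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution from two real convolutions:
    (a + bi)(w_r + w_i i) = (a*w_r - b*w_i) + (a*w_i + b*w_r) i."""
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, real, imag):
        # real/imag: parts of a complex CQT spectrogram, each (B, C, F, T)
        out_r = self.conv_r(real) - self.conv_i(imag)
        out_i = self.conv_r(imag) + self.conv_i(real)
        return out_r, out_i   # phase information is preserved end to end
```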


  • paper_url: http://arxiv.org/abs/2308.11773
  • repo_url: None
  • paper_authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium

Abstract:
Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinical studies are expensive, so natural language processing has been employed on social media to predict depression, but limitations remain: a lack of validated labels, biased user samples, and missing context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate that specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.
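The transcription-plus-topic-modelling workflow can be reproduced at small scale with the open-source `openai-whisper` and `bertopic` packages. The sketch below mirrors the pipeline shape only; the Whisper model size and the default BERTopic settings are assumptions.

```python
import whisper            # pip install openai-whisper
from bertopic import BERTopic

def topics_from_recordings(audio_paths):
    """Transcribe speech recordings, then cluster transcripts into topics."""
    asr = whisper.load_model("base")   # model size is an assumption
    transcripts = [asr.transcribe(path)["text"] for path in audio_paths]
    topic_model = BERTopic()
    topic_ids, _ = topic_model.fit_transform(transcripts)
    return topic_model, topic_ids      # topics can then be related to PHQ-8
```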

cs.SD - 2023-08-22

Furnishing Sound Event Detection with Language Model Abilities

  • paper_url: http://arxiv.org/abs/2308.11530
  • repo_url: None
  • paper_authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
  • for: This work explores the generative ability of language models (LMs), so far studied mainly for visual cross-modality, in the sound event detection (SED) domain.
  • methods: Proposes a concise framework that aligns audio features with text features to perform sound event classification and temporal localization; it consists of an acoustic encoder, a contrastive alignment module, and a decoupled language decoder.
  • results: Experiments show the proposed method generates accurate sound event detection sequences.
    Abstract Recently, the ability of language models (LMs) has attracted increasing attention in visual cross-modality. In this paper, we further explore the generation capacity of LMs for sound event detection (SED), beyond the visual domain. Specifically, we propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of text and audio, and a decoupled language decoder that generates temporal and event sequences from the audio characteristics. Compared with conventional works that require complicated processing and make only limited use of audio features, our model is more concise and comprehensive, since the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method achieves accurate sequences of sound event detection.

Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users

  • paper_url: http://arxiv.org/abs/2308.11456
  • repo_url: None
  • paper_authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel
  • For: The paper is written for individuals with hearing loss who use hearing aids, particularly those in noisy environments.
  • Methods: The paper presents a deep learning-based denoising system that runs in real time on a mobile device (iPhone 7 or Samsung Galaxy S10) and is streamed directly to the hearing aid.
  • Results: The denoising system improves audio quality and speech intelligibility for hearing aid users in noisy environments, as measured by subjective ratings and objective speech intelligibility tests. Subjective ratings improve by more than 40%, and speech reception thresholds improve by 1.6 dB SRT.
    Abstract The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning based denoising system that runs in real time on iPhone 7 and Samsung Galaxy S10 (25ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40%, for both the listening test and the live conversation compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system that is implemented on a mobile device, streamed directly to users' hearing aids using only a single channel as audio input while improving user satisfaction on all tested aspects, including speech intelligibility. This includes overall preference of the denoised and streamed signal over the hearing aid, thereby accepting the higher latency for the significant improvement in speech understanding.

Convoifilter: A case study of doing cocktail party speech recognition

  • paper_url: http://arxiv.org/abs/2308.11380
  • repo_url: None
  • paper_authors: Thai-Binh Nguyen, Alexander Waibel
  • for: Improving automatic speech recognition (ASR) accuracy for a target speaker in crowded, noisy environments.
  • methods: A single-channel speech enhancement module that isolates the speaker's voice from background noise, combined with an ASR module.
  • results: The approach reduces the word error rate (WER) from 80% to 26.4%. The two components are usually tuned independently because of differing data requirements, and enhancement artifacts can degrade ASR; a joint fine-tuning strategy further reduces the WER from 26.4% to 14.5%.
    Abstract This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.
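A minimal sketch of the joint fine-tuning idea: the ASR loss is backpropagated through the enhancement front end as well, so the enhancer learns to avoid the artifacts that hurt recognition. The `enhancer` and `asr` interfaces are hypothetical stand-ins for the paper's modules.

```python
def joint_finetune_step(enhancer, asr, noisy_audio, transcript, optimizer):
    """One joint fine-tuning step over both modules (hypothetical interfaces)."""
    enhanced = enhancer(noisy_audio)             # single-channel enhancement
    loss = asr.ctc_loss(enhanced, transcript)    # hypothetical ASR loss API
    optimizer.zero_grad()
    loss.backward()                              # gradients reach the enhancer too
    optimizer.step()
    return loss.item()
```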

Evaluation of the Speech Resynthesis Capabilities of the VoicePrivacy Challenge Baseline B1

  • paper_url: http://arxiv.org/abs/2308.11337
  • repo_url: None
  • paper_authors: Ünal Ege Gaznepoglu, Nils Peters
  • for: Evaluates whether the neural vocoder used in the VoicePrivacy Challenge (VPC) Baseline B1 can resynthesize natural-sounding human speech.
  • methods: Four objective metrics measuring speech quality, waveform similarity, and F0 similarity, applied to the baseline's speech representation and vocoder.
  • results: Both the speech representation and the vocoder introduce artifacts that make the speech sound unnatural; a MUSHRA-like listening test with 18 subjects corroborates these findings, motivating further study of the analysis and synthesis components of VPC Baseline B1.
    Abstract Speaker anonymization systems continue to improve their ability to obfuscate the original speaker characteristics in a speech signal, but often create processing artifacts and unnatural sounding voices as a tradeoff. Many of those systems stem from the VoicePrivacy Challenge (VPC) Baseline B1, using a neural vocoder to synthesize speech from an F0, x-vectors and bottleneck features-based speech representation. Inspired by this, we investigate the reproduction capabilities of the aforementioned baseline, to assess how successful the shared methodology is in synthesizing human-like speech. We use four objective metrics to measure speech quality, waveform similarity, and F0 similarity. Our findings indicate that both the speech representation and the vocoder introduces artifacts, causing an unnatural perception. A MUSHRA-like listening test on 18 subjects corroborate our findings, motivating further research on the analysis and synthesis components of the VPC Baseline B1.

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

  • paper_url: http://arxiv.org/abs/2308.11276
  • repo_url: None
  • paper_authors: Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
  • for: Advancing text-to-music generation (T2M-Gen) by addressing the scarcity of large-scale publicly available music datasets with natural language captions.
  • methods: Proposes Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files, using audio representations from a pretrained MERT model to extract music features; also presents a methodology for generating question-answer pairs from existing audio captioning datasets, yielding the MusicQA dataset.
  • results: Trained on the MusicQA dataset, MU-LLaMA achieves outstanding performance on both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advance for T2M-Gen research.
    Abstract Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.

  • paper_url: http://arxiv.org/abs/2308.12307
  • repo_url: https://gitlab.com/adhooge1/bend-prediction
  • paper_authors: Alexandre D’Hooge, Louis Bigo, Ken Déguernel
  • for: Studies whether the occurrence of bends in guitar tablatures can be predicted from the notes surrounding them.
  • methods: A set of 25 high-level features, computed for each note of the tablature, capturing each note's past and future short-term context.
  • results: Experiments on 932 lead guitar tablatures of popular music show that a decision tree predicts bend occurrences with an F1 score of 0.71 and relatively few false positives, suggesting promising applications for arranging non-guitar music into guitar tablatures.
    Abstract Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-on/pull-off or bends. This paper focuses on bends, which enable the player to progressively shift the pitch of a note, therefore circumventing physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and a limited amount of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.
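With the per-note feature vectors in hand, the prediction step reduces to a standard decision tree; a minimal scikit-learn sketch follows. The `class_weight` setting is an assumption (bends are rare events, so some rebalancing is plausible); the 25 features themselves are the paper's and are not reproduced here.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def train_bend_predictor(X_train, y_train, X_test, y_test):
    """X_*: one row per tablature note with its 25 context features;
    y_*: 1 if the note is played with a bend, 0 otherwise."""
    clf = DecisionTreeClassifier(class_weight="balanced")  # assumed setting
    clf.fit(X_train, y_train)
    print("F1:", f1_score(y_test, clf.predict(X_test)))
    return clf
```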

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

  • paper_url: http://arxiv.org/abs/2308.11084
  • repo_url: None
  • paper_authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
  • for: Proposes a new voice conversion model that improves the naturalness and similarity of converted speech.
  • methods: A voice conversion framework, PMVC, that separates and models the content, timbre, and prosody information in speech without text transcriptions; it introduces a new speech augmentation algorithm for robust prosody extraction and a mask-and-predict mechanism to disentangle prosody from content.
  • results: Experiments on the AIShell-3 corpus show improved naturalness and similarity of the converted speech.
    Abstract Voice conversion, the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, a lot of research has been devoted to better implementations of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from speech without text transcriptions. Specially, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask and predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of the naturalness and similarity of converted speech.

cs.CV - 2023-08-22

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

  • paper_url: http://arxiv.org/abs/2308.11509
  • repo_url: https://github.com/lxq1000/swinface
  • paper_authors: Lixiong Qin, Mei Wang, Chao Deng, Ke Wang, Xi Chen, Jiani Hu, Weihong Deng
  • for: Proposes a multi-purpose algorithm for face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer.
  • methods: A single shared backbone with a subnet for each set of related tasks; a Multi-Level Channel Attention (MLCA) module in each task-specific analysis subnet adaptively selects features from optimal levels and channels to resolve conflicts among tasks and meet their differing demands.
  • results: Excellent performance on all tasks, notably 90.97% accuracy on RAF-DB (facial expression recognition) and 0.22 ε-error on CLAP2015 (age estimation), both state-of-the-art results.
    Abstract In recent years, vision transformers have been introduced into face recognition and analysis and have achieved performance breakthroughs. However, most previous methods generally train a single model or an ensemble of models to perform the desired task, which ignores the synergy among different tasks and fails to achieve improved prediction accuracy, increased data efficiency, and reduced training time. This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer. Our design, the SwinFace, consists of a single shared backbone together with a subnet for each set of related tasks. To address the conflicts among multiple tasks and meet the different demands of tasks, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis subnet, which can adaptively select the features from optimal levels and channels to perform the desired tasks. Extensive experiments show that the proposed model has a better understanding of the face and achieves excellent performance for all tasks. Especially, it achieves 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results on facial expression recognition and age estimation respectively. The code and models will be made publicly available at https://github.com/lxq1000/SwinFace.
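As a rough illustration of the MLCA idea, the sketch below gates features from several backbone levels with learned per-level, per-channel weights, squeeze-and-excitation style, so a task subnet can emphasize the levels and channels it needs. This is a simplified stand-in, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiLevelChannelAttention(nn.Module):
    """Gate multi-level features per level and per channel (simplified)."""
    def __init__(self, num_levels, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(num_levels * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, num_levels * channels),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, L, C, H, W) -- L levels projected to a common shape
        b, l, c, _, _ = feats.shape
        context = feats.mean(dim=(3, 4)).flatten(1)       # (B, L*C) global pool
        weights = self.gate(context).view(b, l, c, 1, 1)  # per-level/channel gates
        return (feats * weights).sum(dim=1)               # fused (B, C, H, W)
```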

LCCo: Lending CLIP to Co-Segmentation

  • paper_url: http://arxiv.org/abs/2308.11506
  • repo_url: None
  • paper_authors: Xin Duan, Yan Yang, Liyuan Pan, Xiabi Liu
  • for: Studies co-segmentation of the common semantic object in a set of images using the contrastive language-image pre-training (CLIP) framework.
  • methods: On top of a backbone segmentation network, three key modules leverage CLIP to refine features in a coarse-to-fine manner: an image set feature correspondence module, a CLIP interaction module, and a CLIP regularization module.
  • results: Experiments on four standard co-segmentation benchmark datasets show the method outperforms state-of-the-art approaches.
    Abstract This paper studies co-segmenting the common semantic object in a set of images. Existing works either rely on carefully engineered networks to mine the implicit semantic information in visual features or require extra data (i.e., classification labels) for training. In this paper, we leverage the contrastive language-image pre-training framework (CLIP) for the task. With a backbone segmentation network that independently processes each image from the set, we introduce semantics from CLIP into the backbone features, refining them in a coarse-to-fine manner with three key modules: i) an image set feature correspondence module, encoding global consistent semantic information of the image set; ii) a CLIP interaction module, using CLIP-mined common semantics of the image set to refine the backbone feature; iii) a CLIP regularization module, drawing CLIP towards this co-segmentation task, identifying the best CLIP semantic and using it to regularize the backbone feature. Experiments on four standard co-segmentation benchmark datasets show that the performance of our method outperforms state-of-the-art methods.

Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition

  • paper_url: http://arxiv.org/abs/2308.11489
  • repo_url: https://github.com/wqtwjt1996/sum-l
  • paper_authors: Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng
  • For: The paper aims to tackle the challenging problem of unpaired multiview video learning, where the model needs to learn comprehensive multiview representations while dealing with variations in cross-view semantic information.
  • Methods: The proposed method, Semantics-based Unpaired Multiview Learning (SUM-L), builds cross-view pseudo-pairs and performs view-invariant alignment by leveraging the semantic information of videos. Additionally, video-text alignment is performed for first-person and third-person videos to improve video representations.
  • Results: The method is evaluated on multiple benchmark datasets and outperforms multiple existing view-alignment methods, demonstrating its effectiveness under a more challenging scenario than typical paired or unpaired multimodal or multiview learning.
    Abstract We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantic information of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, to fully leverage the semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods, under the more challenging scenario than typical paired or unpaired multimodal or multiview learning. Our code is available at https://github.com/wqtwjt1996/SUM-L.

Opening the Vocabulary of Egocentric Actions

  • paper_url: http://arxiv.org/abs/2308.11488
  • repo_url: None
  • paper_authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
  • for: Proposes an open-vocabulary action recognition task that generalizes verbs observed during training to an open vocabulary of actions with both seen and novel interacting objects.
  • methods: Decouples verb and object prediction via an object-agnostic verb encoder and a prompt-based object encoder that leverages CLIP representations to predict an open vocabulary of interacting objects.
  • results: On open-vocabulary benchmarks created from the EPIC-KITCHENS-100 and Assembly101 datasets, closed-action methods fail to generalize while the proposed method is effective; the object encoder also significantly outperforms existing open-vocabulary visual recognition methods on novel interacting objects.
    Abstract Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations - sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.

Free Lunch for Gait Recognition: A Novel Relation Descriptor

  • paper_url: http://arxiv.org/abs/2308.11487
  • repo_url: None
  • paper_authors: Jilong Wang, Saihui Hou, Yan Huang, Chunshui Cao, Xu Liu, Yongzhen Huang, Liang Wang
  • For: This paper focuses on improving gait recognition performance by reconsidering gait representation and emphasizing inter-personal relationships among different subjects' gait features.
  • Methods: The proposed method, called Relationship Descriptor (RD), uses reference-anchored gaits to describe each person's gait and emphasizes meaningful features by normalizing the dot product between gait features and classifier weights. To address the dimensionality challenges, the method proposes a Farthest Anchored gaits Selection algorithm and a dimension reduction method.
  • Results: The proposed method achieves higher recognition performance than directly using extracted features and consistently outperforms the baselines on four popular gait recognition datasets (GREW, Gait3D, CASIA-B, and OU-MVLP), achieving state-of-the-art performances.
    Abstract Gait recognition seeks correct matches for query individuals by their unique walking patterns at a long distance. However, current methods focus solely on individual gait features, disregarding inter-personal relationships. In this paper, we reconsider gait representation, asserting that gait is not just an aggregation of individual features, but also the relationships among different subjects' gait features once reference gaits are established. From this perspective, we redefine classifier weights as reference-anchored gaits, allowing each person's gait to be described by their relationship with these references. In our work, we call this novel descriptor the Relationship Descriptor (RD). This Relationship Descriptor offers two benefits: emphasizing meaningful features and enhancing robustness. To be specific, the normalized dot product between gait features and classifier weights signifies a similarity relation, where each dimension indicates the similarity between the test sample and each training ID's gait prototype, respectively. Despite its potential, the direct use of relationship descriptors poses dimensionality challenges since the dimension of RD depends on the training set's identity count. To address this, we propose a Farthest Anchored gaits Selection algorithm and a dimension reduction method to boost gait recognition performance. Our method can be built on top of off-the-shelf pre-trained classification-based models without extra parameters. We show that RD achieves higher recognition performance than directly using extracted features. We evaluate the effectiveness of our method on the popular GREW, Gait3D, CASIA-B, and OU-MVLP, showing that our method consistently outperforms the baselines and achieves state-of-the-art performances.
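The descriptor itself is compact. In the sketch below, each dimension of the output is the normalized dot product (cosine similarity) between a gait feature and one anchor identity's classifier weight; how the anchors are chosen (the Farthest Anchored selection) and the dimension reduction are not reproduced here.

```python
import torch
import torch.nn.functional as F

def relationship_descriptor(gait_features, anchor_weights):
    """Describe each gait by its similarity to reference-anchored gaits.

    gait_features:  (B, D) features from a pre-trained classification model
    anchor_weights: (K, D) classifier weights of K selected anchor identities
    returns:        (B, K) descriptor; dimension k is the similarity to anchor k
    """
    f = F.normalize(gait_features, dim=1)
    w = F.normalize(anchor_weights, dim=1)
    return f @ w.t()
```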

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

  • paper_url: http://arxiv.org/abs/2308.11485
  • repo_url: https://github.com/ABaldrati/CLIP4Cir
  • paper_authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo
  • for: Composed image retrieval: given a reference image and a relative caption, retrieve images that are visually similar to the reference while integrating the modifications expressed by the caption.
  • methods: Task-oriented fine-tuning of the OpenAI CLIP encoders using the element-wise sum of visual and textual features, followed by training a Combiner network that fuses the image and text features into combined features used for retrieval; contrastive learning is used in both stages.
  • results: Experiments show that the task-oriented fine-tuning and the Combiner network are highly effective, outperforming more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging composed image retrieval datasets.
    Abstract Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
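A minimal sketch of a Combiner-style fusion head over CLIP features; the hidden sizes and the residual fusion are illustrative assumptions rather than the paper's exact design. At retrieval time, the normalized combined query feature is matched to candidate image features by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Fuse CLIP image and text features into one retrieval feature (sketch)."""
    def __init__(self, clip_dim=640, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, img_feat, txt_feat):
        fused = self.mlp(torch.cat([img_feat, txt_feat], dim=-1))
        combined = fused + img_feat + txt_feat   # keep residual paths to both
        return F.normalize(combined, dim=-1)     # ready for cosine retrieval
```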

Pose2Gait: Extracting Gait Features from Monocular Video of Individuals with Dementia

  • paper_url: http://arxiv.org/abs/2308.11484
  • repo_url: https://github.com/taatiteam/pose2gait_public
  • paper_authors: Caroline Malin-Mayor, Vida Adeli, Andrea Sabo, Sergey Noritsyn, Carolina Gorodetsky, Alfonso Fasano, Andrea Iaboni, Babak Taati
  • for: Video-based ambient monitoring of gait in older adults with dementia, to detect declines in health early and help clinicians and caregivers prevent falls or hospitalizations.
  • methods: A computer vision pose-tracking model automatically processes video and extracts joint locations; a deep network then maps the resulting two-dimensional pose sequence to three-dimensional spatiotemporal gait features.
  • results: Velocity and step length extracted from video correlate with the corresponding depth-camera features, with Spearman's correlation coefficients of .83 and .60 respectively, showing that three-dimensional spatiotemporal features can be predicted from monocular video.
    Abstract Video-based ambient monitoring of gait for older adults with dementia has the potential to detect negative changes in health and allow clinicians and caregivers to intervene early to prevent falls or hospitalizations. Computer vision-based pose tracking models can process video data automatically and extract joint locations; however, publicly available models are not optimized for gait analysis on older adults or clinical populations. In this work we train a deep neural network to map from a two dimensional pose sequence, extracted from a video of an individual walking down a hallway toward a wall-mounted camera, to a set of three-dimensional spatiotemporal gait features averaged over the walking sequence. The data of individuals with dementia used in this work was captured at two sites using a wall-mounted system to collect the video and depth information used to train and evaluate our model. Our Pose2Gait model is able to extract velocity and step length values from the video that are correlated with the features from the depth camera, with Spearman's correlation coefficients of .83 and .60 respectively, showing that three dimensional spatiotemporal features can be predicted from monocular video. Future work remains to improve the accuracy of other features, such as step time and step width, and test the utility of the predicted values for detecting meaningful changes in gait during longitudinal ambient monitoring.

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11681
  • repo_url: None
  • paper_authors: Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang
  • for: Proposes a new paradigm for weakly supervised video anomaly detection (WSVAD) that applies the frozen CLIP model directly to the WSVAD task, without any pre-training or fine-tuning process.
  • methods: A dual-branch design: one branch uses visual features for coarse-grained binary classification, while the other fully exploits fine-grained language-image alignment.
  • results: State-of-the-art performance on two widely used benchmarks, surpassing previous methods by a large margin: 84.51% AP on XD-Violence and 88.02% AUC on UCF-Crime.
    Abstract The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves a dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of the dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features will be released to facilitate future VAD research.

Multitemporal analysis in Google Earth Engine for detecting urban changes using optical data and machine learning algorithms

  • paper_url: http://arxiv.org/abs/2308.11468
  • repo_url: None
  • paper_authors: Mariapia Rita Iandolo, Francesca Razzano, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: A multitemporal analysis on the Google Earth Engine (GEE) platform to detect changes in urban areas using optical data and dedicated machine learning (ML) algorithms.
  • methods: Classification and change detection analysis of the region of interest using the GEE platform, optical data, and dedicated ML algorithms.
  • results: The proposed method accurately identifies changed and unchanged urban areas over the selected period; the work also demonstrates the value of GEE as an efficient cloud-based solution for managing large quantities of satellite data.
    Abstract The aim of this work is to perform a multitemporal analysis using the Google Earth Engine (GEE) platform for the detection of changes in urban areas using optical data and specific machine learning (ML) algorithms. As a case study, Cairo City has been identified, in Egypt country, as one of the five most populous megacities of the last decade in the world. Classification and change detection analysis of the region of interest (ROI) have been carried out from July 2013 to July 2021. Results demonstrate the validity of the proposed method in identifying changed and unchanged urban areas over the selected period. Furthermore, this work aims to evidence the growing significance of GEE as an efficient cloud-based solution for managing large quantities of satellite data.

An Analysis of Initial Training Strategies for Exemplar-Free Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.11677
  • repo_url: None
  • paper_authors: Grégoire Petit, Michael Soumm, Eva Feillet, Adrian Popescu, Bertrand Delezoide, David Picard, Céline Hudelot
  • for: Examines class-incremental learning (CIL), in which classification models are built from data streams and examples from past classes cannot be stored.
  • methods: Studies the factors that shape incremental performance: initializing from the first batch of the target dataset versus from self-supervised pre-trained weights, the choice of CIL algorithm, the neural architecture, the nature of the target task, the class distribution in the stream, and the number of examples; a statistical analysis framework quantifies each factor's relative contribution.
  • results: The initial training strategy is the dominant factor influencing average incremental accuracy, but the choice of CIL algorithm matters more for preventing forgetting; based on these findings, practical recommendations for deploying incremental learning are proposed.
    Abstract Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum. The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning. We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting. Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.

Food Image Classification and Segmentation with Attention-based Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2308.11452
  • repo_url: None
  • paper_authors: Valasia Vlachopoulou, Ioannis Sarafis, Alexandros Papadopoulos
  • for: Addressing the food quantification problem for dietary monitoring applications.
  • methods: A weakly supervised methodology that trains food image classification and semantic segmentation models without pixel-level annotations, based on a multiple instance learning approach combined with an attention mechanism that generates semantic heat maps for food class segmentation.
  • results: Experiments on two meta-classes of the FoodSeg103 dataset verify the feasibility of the proposed approach and explore the functioning properties of the attention mechanism.
    Abstract The demand for accurate food quantification has increased in the recent years, driven by the needs of applications in dietary monitoring. At the same time, computer vision approaches have exhibited great potential in automating tasks within the food domain. Traditionally, the development of machine learning models for these problems relies on training data sets with pixel-level class annotations. However, this approach introduces challenges arising from data collection and ground truth generation that quickly become costly and error-prone since they must be performed in multiple settings and for thousands of classes. To overcome these challenges, the paper presents a weakly supervised methodology for training food image classification and semantic segmentation models without relying on pixel-level annotations. The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism. At test time, the models are used for classification and, concurrently, the attention mechanism generates semantic heat maps which are used for food class segmentation. In the paper, we conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach and we explore the functioning properties of the attention mechanism.
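The attention-based MIL pooling at the heart of this approach fits in a few lines (in the spirit of Ilse et al.'s attention MIL): each patch embedding receives a learned score, the softmaxed scores form the attention map that doubles as a segmentation heat map, and their weighted sum gives the bag embedding used for image-level classification. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Score instances, pool them with attention, expose the weights."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, instances):
        # instances: (N, dim) patch embeddings of one image (the "bag")
        a = torch.softmax(self.score(instances), dim=0)   # (N, 1) attention
        bag = (a * instances).sum(dim=0)                  # (dim,) bag embedding
        return bag, a.squeeze(-1)   # weights reshape to a semantic heat map
```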

Towards Discriminative Representations with Contrastive Instances for Real-Time UAV Tracking

  • paper_url: http://arxiv.org/abs/2308.11450
  • repo_url: None
  • paper_authors: Dan Zeng, Mingliang Zou, Xucheng Wang, Shuiwang Li
  • for: Improving both precision and efficiency in UAV tracking, two fundamental challenges imposed by limited computing resources, battery capacity, and the UAV's maximum payload.
  • methods: Discriminative correlation filter (DCF)-based trackers run efficiently on a single CPU but with inferior precision, while lightweight deep learning (DL)-based trackers balance efficiency and precision but are limited by the compression rate, since high compression often yields poor discriminative representations; the paper instead learns more discriminative representations with contrastive instances, requiring no manual annotations and allowing a lightweight model.
  • results: Extensive experiments on four UAV benchmarks (UAV123@10fps, DTB70, UAVDT, and VisDrone2018) show the proposed DRCI tracker significantly outperforms state-of-the-art UAV tracking methods.
    Abstract Maintaining high efficiency and high precision are two fundamental challenges in UAV tracking due to the constraints of computing resources, battery capacity, and UAV maximum load. Discriminative correlation filters (DCF)-based trackers can yield high efficiency on a single CPU but with inferior precision. Lightweight Deep learning (DL)-based trackers can achieve a good balance between efficiency and precision but performance gains are limited by the compression rate. High compression rate often leads to poor discriminative representations. To this end, this paper aims to enhance the discriminative power of feature representations from a new feature-learning perspective. Specifically, we attempt to learn more discriminative representations with contrastive instances for UAV tracking in a simple yet effective manner, which not only requires no manual annotations but also allows for developing and deploying a lightweight model. We are the first to explore contrastive learning for UAV tracking. Extensive experiments on four UAV benchmarks, including UAV123@10fps, DTB70, UAVDT and VisDrone2018, show that the proposed DRCI tracker significantly outperforms state-of-the-art UAV tracking methods.

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

  • paper_url: http://arxiv.org/abs/2308.11448
  • repo_url: None
  • paper_authors: Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler
  • for: 本研究旨在评估自监督学习技术在计算机视觉任务中的有效性,从而避免对大型模型进行微调。
  • methods: 本研究提出了一种基于提示patch的评估协议,用于评估零样本分割能力。此外,还提出了一种简单的SSL方法,称为MMC,该方法结合了掩码图像建模(masked image modelling)、基于动量的自蒸馏和全局对比等技术,以提高SSL ViTs的判别性表示。
  • results: 实验表明,MMC方法在零shot semantic segmentation中达到了顶尖水平,并且在不同的 dataset 上都表现出色。
    Abstract Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches; based on that, a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and the discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
    摘要 自监督预训练(SSP)在机器学习中得到广泛应用,能够在无标注数据的情况下提取有意义的特征表示。在计算机视觉领域,预训练视觉Transformer(ViT)对迁移学习的进展发挥了关键作用。然而,模型规模的快速增长使得微调这些大型模型的成本不断上升。本研究旨在评估纯自监督学习(SSL)技术在计算机视觉任务中的有效性,无需微调,以模拟人类对未见物体的泛化与识别能力。为此,我们提出了一种基于提示patch的零样本分割评估协议:给定目标物体上的一个点作为提示,算法计算所选patch与其他patch之间的相似度图,然后通过简单的阈值处理分割目标。此外,我们还评估物体内与物体间相似度,以衡量SSP ViTs的判别能力。基于提示式零样本分割和判别能力的研究洞见,我们设计了一种简单的SSP方法,称为MMC。该方法结合了鼓励局部特征相似的掩码图像建模、将语义从全局特征迁移到局部特征的基于动量的自蒸馏,以及提升全局特征语义的全局对比,以增强SSP ViTs的判别性表示。因此,我们的方法显著降低了物体内与物体间相似度的重叠,从而促进了图像内有效的物体分割。实验表明,MMC在多个数据集上的零样本语义分割中均取得了顶尖结果。
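
The zero-shot evaluation protocol itself is easy to reproduce: take the patch token at the prompted point, compute cosine similarities to all other patch tokens, and threshold. A minimal sketch, assuming ViT patch embeddings on a regular grid and an arbitrary threshold value:

```python
import torch
import torch.nn.functional as F

def prompt_patch_segment(patch_tokens, grid_hw, prompt_xy, thresh=0.6):
    """Point-prompted zero-shot segmentation: similarity map + simple threshold."""
    h, w = grid_hw
    tokens = F.normalize(patch_tokens, dim=-1)   # (h*w, D), unit-norm tokens
    idx = prompt_xy[1] * w + prompt_xy[0]        # flatten the (x, y) prompt
    sim = tokens @ tokens[idx]                   # (h*w,) cosine similarities
    return (sim >= thresh).reshape(h, w)         # boolean segmentation mask

tokens = torch.randn(14 * 14, 768)               # e.g. ViT-B patch embeddings
mask = prompt_patch_segment(tokens, (14, 14), prompt_xy=(5, 7))
```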

Revisiting and Exploring Efficient Fast Adversarial Training via LAW: Lipschitz Regularization and Auto Weight Averaging

  • paper_url: http://arxiv.org/abs/2308.11443
  • repo_url: None
  • paper_authors: Xiaojun Jia, Yuefeng Chen, Xiaofeng Mao, Ranjie Duan, Jindong Gu, Rong Zhang, Hui Xue, Xiaochun Cao
  • For: The paper aims to improve the robustness of machine learning models against adversarial attacks while reducing the training cost of standard adversarial training.
  • Methods: The paper proposes an effective Lipschitz regularization method for fast adversarial training and explores the effect of data augmentation and weight averaging in fast adversarial training.
  • Results: The proposed method, FGSM-LAW, demonstrates superior robustness performance compared to state-of-the-art fast adversarial training methods and advanced standard adversarial training methods, as shown in experimental evaluations on four benchmark databases.
    Abstract Fast Adversarial Training (FAT) not only improves the model robustness but also reduces the training cost of standard adversarial training. However, fast adversarial training often suffers from Catastrophic Overfitting (CO), which results in poor robustness performance. Catastrophic Overfitting describes the phenomenon of a sudden and significant decrease in robust accuracy during fast adversarial training. Many effective techniques have been developed to prevent Catastrophic Overfitting and improve the model robustness from different perspectives. However, these techniques adopt inconsistent training settings and require different training costs, i.e., training time and memory costs, leading to unfair comparisons. In this paper, we conduct a comprehensive study of over 10 fast adversarial training methods in terms of adversarial robustness and training costs. We revisit the effectiveness and efficiency of fast adversarial training techniques in preventing Catastrophic Overfitting from the perspective of model local nonlinearity and propose an effective Lipschitz regularization method for fast adversarial training. Furthermore, we explore the effect of data augmentation and weight averaging in fast adversarial training and propose a simple yet effective auto weight averaging method to improve robustness further. By assembling these techniques, we propose a FGSM-based fast adversarial training method equipped with Lipschitz regularization and Auto Weight averaging, abbreviated as FGSM-LAW. Experimental evaluations on four benchmark databases demonstrate the superiority of the proposed method over state-of-the-art fast adversarial training methods and the advanced standard adversarial training methods.
    摘要 快速对抗训练(FAT)不仅能提高模型的鲁棒性,还能降低标准对抗训练的训练成本。然而,快速对抗训练经常遭遇灾难性过拟合(Catastrophic Overfitting,CO),导致鲁棒性较差。灾难性过拟合指快速对抗训练过程中鲁棒精度突然且显著下降的现象。已有许多有效技术从不同角度防止灾难性过拟合并提高模型鲁棒性,但这些技术采用不一致的训练设置,且所需训练成本(训练时间和内存开销)不同,导致比较不公平。本文对10余种快速对抗训练方法在对抗鲁棒性和训练成本方面进行了全面研究。我们从模型局部非线性的角度重新审视快速对抗训练技术在防止灾难性过拟合方面的有效性与效率,并提出了一种有效的Lipschitz正则化方法。此外,我们研究了数据增强和权重平均在快速对抗训练中的作用,并提出了一种简单而有效的自动权重平均方法以进一步提升鲁棒性。通过组合这些技术,我们提出了一种配备Lipschitz正则化与自动权重平均的基于FGSM的快速对抗训练方法,简称FGSM-LAW。在四个基准数据库上的实验评估表明,所提方法优于最先进的快速对抗训练方法以及先进的标准对抗训练方法。
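
The sketch below shows the shape of one FGSM training step combined with an EMA weight-average model. It is a simplification under stated assumptions: the paper's Lipschitz regulariser is approximated here by penalising the output change under the perturbation, and the EMA decay, epsilon, and loss weight are placeholder values, not the authors' settings.

```python
import copy
import torch
import torch.nn.functional as F

def fgsm_law_step(model, ema_model, x, y, opt, eps=8/255, decay=0.999, lam=0.1):
    x.requires_grad_(True)
    loss_clean = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss_clean, x)
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()  # single-step FGSM

    out_adv, out_clean = model(x_adv), model(x.detach())
    smooth = (out_adv - out_clean).pow(2).mean()   # crude local-smoothness proxy
    loss = F.cross_entropy(out_adv, y) + lam * smooth
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                          # "auto weight averaging" as EMA
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
ema_model = copy.deepcopy(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = fgsm_law_step(model, ema_model, torch.rand(8, 3, 32, 32),
                     torch.randint(0, 10, (8,)), opt)
```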

SDeMorph: Towards Better Facial De-morphing from Single Morph

  • paper_url: http://arxiv.org/abs/2308.11442
  • repo_url: None
  • paper_authors: Nitish Shukla
  • for: 防范人脸变形攻击(Morph Attack Detection)
  • methods: 基于去噪扩散概率模型(DDPM)和分支UNet的无参考方法
  • results: 可以准确地回归真实的人脸特征,提高了人脸识别系统的安全性
    Abstract Face Recognition Systems (FRS) are vulnerable to morph attacks. A face morph is created by combining multiple identities with the intention of fooling FRS and making it match the morph with multiple identities. Current Morph Attack Detection (MAD) methods can detect the morph but are unable to recover the identities used to create the morph with satisfactory outcomes. Existing work in de-morphing is mostly reference-based, i.e. they require the availability of one identity to recover the other. Sudipta et al. \cite{ref9} proposed a reference-free de-morphing technique but the visual realism of the outputs produced was feeble. In this work, we propose SDeMorph (Stably Diffused De-morpher), a novel de-morphing method that is reference-free and recovers the identities of bona fides. Our method produces feature-rich outputs that are of significantly high quality in terms of definition and facial fidelity. Our method utilizes Denoising Diffusion Probabilistic Models (DDPM) by destroying the input morphed signal and then reconstructing it back using a branched-UNet. Experiments on ASML, FRLL-FaceMorph, FRLL-MorDIFF, and SMDD datasets support the effectiveness of the proposed method.
    摘要 人脸识别系统(FRS)容易受到变形(morph)攻击。人脸变形图像由多个身份的人脸组合而成,旨在欺骗FRS,使其将该变形图像与多个身份相匹配。现有的变形攻击检测(MAD)方法可以检测出变形图像,但无法令人满意地恢复用于生成变形的原始身份。现有的去变形(de-morphing)工作大多基于参考,即需要其中一个身份可用才能恢复另一个身份。Sudipta et al. \cite{ref9} 提出了一种无参考的去变形技术,但其输出的视觉真实感较弱。本文提出SDeMorph(Stably Diffused De-morpher),一种新颖的无参考去变形方法,能够恢复真实(bona fide)人脸的身份。我们的方法生成特征丰富、在清晰度和人脸保真度方面质量显著更高的输出。该方法利用去噪扩散概率模型(DDPM),先破坏输入的变形信号,再通过分支UNet将其重建。在ASML、FRLL-FaceMorph、FRLL-MorDIFF和SMDD数据集上的实验验证了所提方法的有效性。

Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection

  • paper_url: http://arxiv.org/abs/2308.11441
  • repo_url: https://github.com/junshengzhou/levelsetudf
  • paper_authors: Junsheng Zhou, Baorui Ma, Shujuan Li, Yu-Shen Liu, Zhizhong Han
  • for: The paper aims to address the problem of reconstructing shapes with open surfaces using unsigned distance functions (UDFs).
  • methods: The authors proposed to learn UDFs using neural networks and reconstruct surfaces with the gradients around the zero level set of the UDF. However, they found that the differential networks struggled to learn the zero level set, leading to large errors on unsigned distances and gradients. To resolve this, they proposed to learn a more continuous zero level set using level set projections.
  • results: The authors conducted comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and demonstrated non-trivial improvements over the state-of-the-art methods. They also explored the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF.
    Abstract Latest methods represent shapes with open surfaces using unsigned distance functions (UDFs). They train neural networks to learn UDFs and reconstruct surfaces with the gradients around the zero level set of the UDF. However, the differential networks struggle from learning the zero level set where the UDF is not differentiable, which leads to large errors on unsigned distances and gradients around the zero level set, resulting in highly fragmented and discontinuous surfaces. To resolve this problem, we propose to learn a more continuous zero level set in UDFs with level set projections. Our insight is to guide the learning of zero level set using the rest non-zero level sets via a projection procedure. Our idea is inspired from the observations that the non-zero level sets are much smoother and more continuous than the zero level set. We pull the non-zero level sets onto the zero level set with gradient constraints which align gradients over different level sets and correct unsigned distance errors on the zero level set, leading to a smoother and more continuous unsigned distance field. We conduct comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and further explore the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF, which demonstrate our non-trivial improvements over the state-of-the-art methods. Code is available at https://github.com/junshengzhou/LevelSetUDF .
    摘要 最新的方法使用无符号距离函数(UDF)来表示具有开放表面的形状。它们训练神经网络学习UDF,并利用UDF零水平集附近的梯度来重建表面。然而,由于UDF在零水平集处不可微,可微网络难以学习零水平集,导致零水平集附近的无符号距离和梯度存在较大误差,使重建表面高度碎片化且不连续。为解决这一问题,我们提出通过水平集投影来学习更加连续的零水平集。我们的洞见在于,利用其余非零水平集通过投影过程来引导零水平集的学习;这一想法源于非零水平集比零水平集更加平滑、更加连续的观察。我们在梯度约束下将非零水平集拉到零水平集上,使不同水平集间的梯度对齐,并修正零水平集上的无符号距离误差,从而得到更平滑、更连续的无符号距离场。我们在点云、真实扫描和深度图的表面重建上进行了全面实验,并进一步探究了利用所学UDF进行无监督点云上采样和无监督点法线估计的性能,结果表明我们的方法相对最先进方法取得了显著改进。代码可在 https://github.com/junshengzhou/LevelSetUDF 获取。
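
The core geometric operation is easy to state: a point on a non-zero level set is pulled onto the zero level set by marching against the gradient, q ← q − f(q)·∇f(q)/‖∇f(q)‖. Below is a small sketch of that projection with a toy analytic UDF (distance to the unit sphere); the paper's gradient-alignment losses across level sets are omitted.

```python
import torch

def project_to_zero_level_set(udf, points, steps=1):
    """Pull points onto the zero level set of an unsigned distance function."""
    for _ in range(steps):
        points = points.detach().requires_grad_(True)
        d = udf(points)                                 # (N,) unsigned distances
        grad, = torch.autograd.grad(d.sum(), points)    # (N, 3) spatial gradients
        n = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
        points = points - d.unsqueeze(-1) * n           # march by the distance
    return points.detach()

udf = lambda p: (p.norm(dim=-1) - 1.0).abs()            # toy UDF: unit sphere
projected = project_to_zero_level_set(udf, torch.randn(1024, 3) * 2)
print(projected.norm(dim=-1).mean())                    # ~1.0: on the surface
```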

PoseGraphNet++: Enriching 3D Human Pose with Orientation Estimation

  • paper_url: http://arxiv.org/abs/2308.11440
  • repo_url: None
  • paper_authors: Soubarna Banik, Edvard Avagyan, Alejandro Mendoza Gracia, Alois Knoll
  • for: 本研究旨在提出一种基于图卷积网络(Graph Convolutional Network,GCN)的2D-to-3D提升方法,以预测包括关节位置和骨骼朝向在内的人体3D姿态。
  • methods: 我们提出了一种名为PoseGraphNet++的新型2D-to-3D提升网络,该网络通过节点和边卷积来利用关节和骨特征。
  • results: 我们在多个基准数据集上评估了模型,其在位置和旋转指标上的性能与最先进方法相当或更优。此外,我们还通过大量消融实验证明,PoseGraphNet++能够借助关节与骨骼之间的相互关系提升预测性能。
    Abstract Existing kinematic skeleton-based 3D human pose estimation methods only predict joint positions. Although this is sufficient to compute the yaw and pitch of the bone rotations, the roll around the axis of the bones remains unresolved by these methods. In this paper, we propose a novel 2D-to-3D lifting Graph Convolution Network named PoseGraphNet++ to predict the complete human pose including the joint positions and the bone orientations. We employ node and edge convolutions to utilize the joint and bone features. Our model is evaluated on multiple benchmark datasets, and its performance is either on par with or better than the state-of-the-art in terms of both position and rotation metrics. Through extensive ablation studies, we show that PoseGraphNet++ benefits from exploiting the mutual relationship between the joints and the bones.
    摘要 现有的基于运动学骨架的3D人体姿态估计方法只预测关节位置。尽管这足以计算骨骼旋转的偏航角(yaw)和俯仰角(pitch),但绕骨骼轴的滚转角(roll)仍无法由这些方法确定。本文提出一种新的2D-to-3D提升图卷积网络PoseGraphNet++,用于预测包括关节位置和骨骼朝向在内的完整人体姿态。我们利用节点卷积和边卷积来分别利用关节特征和骨骼特征。我们的模型在多个基准数据集上进行评估,其在位置和旋转指标上的性能与最先进方法相当或更优。通过大量消融研究,我们表明PoseGraphNet++得益于利用关节与骨骼之间的相互关系。
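
As a rough illustration of joint node (joint) and edge (bone) convolutions, here is a generic message-passing layer in which bone features modulate the messages exchanged between joints; the layer widths, the toy chain skeleton, and the residual update are assumptions, not the exact PoseGraphNet++ design.

```python
import torch
import torch.nn as nn

class NodeEdgeGraphConv(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.msg = nn.Linear(2 * node_dim + edge_dim, node_dim)
        self.edge_update = nn.Linear(2 * node_dim + edge_dim, edge_dim)

    def forward(self, x, e, edges):
        # x: (J, node_dim) joints; e: (B, edge_dim) bones; edges: (B, 2) joint pairs
        src, dst = edges[:, 0], edges[:, 1]
        pair = torch.cat([x[src], x[dst], e], dim=-1)
        m = torch.relu(self.msg(pair))                       # per-bone messages
        agg = torch.zeros_like(x).index_add_(0, dst, m)      # sum messages at joints
        return x + agg, torch.relu(self.edge_update(pair))   # residual node update

layer = NodeEdgeGraphConv(node_dim=64, edge_dim=32)
x, e = torch.randn(17, 64), torch.randn(16, 32)              # 17 joints, 16 bones
edges = torch.tensor([[i, i + 1] for i in range(16)])        # toy chain skeleton
x, e = layer(x, e, edges)
```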

ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

  • paper_url: http://arxiv.org/abs/2308.11417
  • repo_url: None
  • paper_authors: Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, Angela Dai
  • for: 提供了一个大规模的室内场景数据集,其中每个场景均使用高端激光扫描仪以亚毫米分辨率采集几何与颜色信息,并配有配准的3300万像素DSLR图像和iPhone的RGB-D流。
  • methods: 使用高级激光扫描仪和DSLR相机捕捉场景图像,并使用iPhone捕捉RGB-D流。场景重建还包括了开放词汇的semantics标注,以便实现全面的semantic理解。
  • results: ScanNet++为新视角合成提供了新的真实世界基准,既涵盖高质量RGB采集,也涵盖商品级图像;同时提供了一个新的3D语义场景理解基准,全面涵盖多样且含歧义的语义标注场景。目前,ScanNet++包含460个场景、28万张DSLR图像以及超过370万帧iPhone RGB-D数据。
    Abstract We present ScanNet++, a large-scale dataset that couples together capture of high-quality and commodity-level geometry and color of indoor scenes. Each scene is captured with a high-end laser scanner at sub-millimeter resolution, along with registered 33-megapixel images from a DSLR camera, and RGB-D streams from an iPhone. Scene reconstructions are further annotated with an open vocabulary of semantics, with label-ambiguous scenarios explicitly annotated for comprehensive semantic understanding. ScanNet++ enables a new real-world benchmark for novel view synthesis, both from high-quality RGB capture, and importantly also from commodity-level images, in addition to a new benchmark for 3D semantic scene understanding that comprehensively encapsulates diverse and ambiguous semantic labeling scenarios. Currently, ScanNet++ contains 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames.
    摘要 我们提出ScanNet++数据集,这是一个将室内场景的高质量几何与颜色采集和商品级采集相结合的大规模数据集。每个场景均由高端激光扫描仪以亚毫米分辨率采集,并配有配准的3300万像素DSLR图像以及iPhone的RGB-D流。场景重建还使用开放词汇进行语义标注,并对标签含歧义的情况进行显式标注,以便实现全面的语义理解。ScanNet++为新视角合成提供了新的真实世界基准,既涵盖高质量RGB采集,重要的是也涵盖商品级图像;同时提供了一个新的3D语义场景理解基准,全面涵盖多样且含歧义的语义标注场景。目前,ScanNet++包含460个场景、28万张DSLR图像以及超过370万帧iPhone RGB-D数据。

MatFuse: Controllable Material Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.11408
  • repo_url: None
  • paper_authors: Giuseppe Vecchio, Renato Sortino, Simone Palazzo, Concetto Spampinato
  • for: This paper aims to simplify the creation of SVBRDF maps in computer graphics, using a novel unified approach based on diffusion models.
  • methods: The proposed method integrates multiple sources of conditioning, such as color palettes, sketches, and pictures, to enable fine-grained control and flexibility in material synthesis.
  • results: The proposed method yields performance comparable to state-of-the-art approaches in estimating SVBRDF, both qualitatively and quantitatively, under various conditioning settings.
    Abstract Creating high quality and realistic materials in computer graphics is a challenging and time-consuming task, which requires great expertise. In this paper, we present MatFuse, a novel unified approach that harnesses the generative power of diffusion models (DM) to simplify the creation of SVBRDF maps. Our DM-based pipeline integrates multiple sources of conditioning, such as color palettes, sketches, and pictures, enabling fine-grained control and flexibility in material synthesis. This design allows for the combination of diverse information sources (e.g., sketch + image embedding), enhancing creative possibilities in line with the principle of compositionality. We demonstrate the generative capabilities of the proposed method under various conditioning settings; on the SVBRDF estimation task, we show that our method yields performance comparable to state-of-the-art approaches, both qualitatively and quantitatively.
    摘要 在计算机图形学中创建高质量且逼真的材质是一项极具挑战性、耗时且需要深厚专业知识的工作。本文提出MatFuse,一种利用扩散模型(DM)生成能力来简化SVBRDF贴图创建的新颖统一方法。我们基于DM的流程融合了多种条件信息源,如调色板、草图和照片,从而在材质合成中实现细粒度控制与灵活性。这一设计允许组合多样的信息源(例如草图+图像嵌入),遵循组合性原则拓展创作可能。我们在多种条件设置下展示了所提方法的生成能力;在SVBRDF估计任务上,我们的方法在定性和定量两方面均取得了与最先进方法相当的性能。

Non-Redundant Combination of Hand-Crafted and Deep Learning Radiomics: Application to the Early Detection of Pancreatic Cancer

  • paper_url: http://arxiv.org/abs/2308.11389
  • repo_url: None
  • paper_authors: Rebeca Vétil, Clément Abi-Nader, Alexandre Bône, Marie-Pierre Vullierme, Marc-Michel Rohé, Pietro Gori, Isabelle Bloch
  • for: 这篇论文旨在解决深度学习医学影像特征(DLR)和手工设计医学影像特征(HCR)之间的重复性问题。
  • methods: 作者使用了一种简单的Variational Autoencoder(VAE)来提取DLR特征,并且通过降低这两种特征之间的相互信息来确保它们之间的独立性。
  • results: 所提取的DLR特征可以与手工设计的特征结合,并通过分类器预测胰腺癌的早期标志。实验结果显示,结合非冗余的DLR和HCR特征可以提升预测性能,优于基线方法。
    Abstract We address the problem of learning Deep Learning Radiomics (DLR) that are not redundant with Hand-Crafted Radiomics (HCR). To do so, we extract DLR features using a VAE while enforcing their independence with HCR features by minimizing their mutual information. The resulting DLR features can be combined with hand-crafted ones and leveraged by a classifier to predict early markers of cancer. We illustrate our method on four early markers of pancreatic cancer and validate it on a large independent test set. Our results highlight the value of combining non-redundant DLR and HCR features, as evidenced by an improvement in the Area Under the Curve compared to baseline methods that do not address redundancy or solely rely on HCR features.
    摘要 我们研究如何学习与手工影像组学特征(HCR)不冗余的深度学习影像组学特征(DLR)。为此,我们使用VAE提取DLR特征,并通过最小化DLR与HCR特征之间的互信息来强制二者独立。所得DLR特征可与手工特征结合,由分类器用于预测癌症早期标志。我们在四种胰腺癌早期标志上演示了该方法,并在一个大型独立测试集上进行了验证。结果表明,结合非冗余的DLR与HCR特征具有明显价值:与未处理冗余性或仅依赖HCR特征的基线方法相比,ROC曲线下面积(AUC)有所提升。
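
Estimating mutual information between feature sets is itself non-trivial; as a simplified stand-in for the paper's MI term, the sketch below penalises the squared cross-correlation between standardised deep and hand-crafted features, which drives them towards (linear) non-redundancy. This correlation proxy is an assumption, not the authors' estimator.

```python
import torch

def redundancy_penalty(dlr, hcr):
    """Squared cross-correlation between standardised DLR and HCR features."""
    dlr = (dlr - dlr.mean(0)) / (dlr.std(0) + 1e-8)   # (N, Dd)
    hcr = (hcr - hcr.mean(0)) / (hcr.std(0) + 1e-8)   # (N, Dh)
    corr = dlr.t() @ hcr / dlr.shape[0]               # (Dd, Dh) correlations
    return corr.pow(2).mean()

# VAE objective sketch: loss = recon + kl + lam * redundancy_penalty(z, hcr_feats)
penalty = redundancy_penalty(torch.randn(32, 16), torch.randn(32, 8))
```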

Targeted Data Augmentation for bias mitigation

  • paper_url: http://arxiv.org/abs/2308.11386
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Maria Ferlin, Michał Grochowski
  • for: This paper aims to address the issue of bias in AI systems by introducing a novel approach called Targeted Data Augmentation (TDA).
  • methods: The TDA method leverages classical data augmentation techniques to insert biases into the training data, which helps to mitigate biases in the models.
  • results: The paper shows that the TDA method can significantly decrease bias measures while maintaining a negligible increase in the error rate, using two diverse datasets of clinical skin lesions and male and female faces.
  • for: 这篇论文目的是解决人工智能系统中的偏见问题,通过引入一种新的方法 called Targeted Data Augmentation (TDA)。
  • methods: TDA方法利用经典的数据增强技术,插入偏见到训练数据中,以减少模型中的偏见。
  • results: 论文显示,TDA方法可以Significantly减少偏见度量,同时保持误差率的增长在较低水平,使用了两个多样化的数据集:皮肤病变和男女面孔数据集。
    Abstract The development of fair and ethical AI systems requires careful consideration of bias mitigation, an area often overlooked or ignored. In this study, we introduce a novel and efficient approach for addressing biases called Targeted Data Augmentation (TDA), which leverages classical data augmentation techniques to tackle the pressing issue of bias in data and models. Unlike the laborious task of removing biases, our method proposes to insert biases instead, resulting in improved performance. To identify biases, we annotated two diverse datasets: a dataset of clinical skin lesions and a dataset of male and female faces. These bias annotations are published for the first time in this study, providing a valuable resource for future research. Through Counterfactual Bias Insertion, we discovered that biases associated with the frame, ruler, and glasses had a significant impact on models. By randomly introducing biases during training, we mitigated these biases and achieved a substantial decrease in bias measures, ranging from two-fold to more than 50-fold, while maintaining a negligible increase in the error rate.
    摘要 开发公平且合乎伦理的AI系统需要仔细考虑偏见缓解,而这一领域常常被忽视。本研究提出了一种新颖且高效的偏见处理方法,即目标数据增强(Targeted Data Augmentation,TDA),它利用经典数据增强技术来应对数据和模型中的偏见这一紧迫问题。与费力地去除偏见不同,我们的方法提出反其道而行之地插入偏见,从而提升性能。为了识别偏见,我们对两个多样化的数据集进行了标注:一个临床皮肤病变数据集和一个男女人脸数据集。这些偏见标注在本研究中首次公开,为后续研究提供了宝贵资源。通过反事实偏见插入(Counterfactual Bias Insertion),我们发现与画面边框、尺子和眼镜相关的偏见对模型有显著影响。通过在训练中随机引入这些偏见,我们缓解了偏见,使偏见度量大幅下降(从2倍到50倍以上),同时错误率的增加可忽略不计。
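
A minimal sketch of the bias-insertion idea: with some probability, stamp one of the identified bias artefacts (here, a black frame) onto a training image so the model learns to ignore it. The frame colour, thickness, and probability are illustrative assumptions; analogous functions could insert a ruler or glasses template.

```python
import random
import numpy as np

def insert_frame_bias(img, p=0.5, thickness=8):
    """Randomly paint a black border frame onto an (H, W, C) uint8 image."""
    if random.random() > p:
        return img
    out = img.copy()
    t = thickness
    out[:t, :], out[-t:, :] = 0, 0     # top / bottom borders
    out[:, :t], out[:, -t:] = 0, 0     # left / right borders
    return out

augmented = insert_frame_bias(np.full((224, 224, 3), 128, dtype=np.uint8))
```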

DALNet: A Rail Detection Network Based on Dynamic Anchor Line

  • paper_url: http://arxiv.org/abs/2308.11381
  • repo_url: https://github.com/yzichen/mmlanedet
  • paper_authors: Zichen Yu, Quanli Liu, Wei Wang, Liyong Zhang, Xiaoguang Zhao
  • for: 提高智能列车的轨道检测精度
  • methods: 基于动态锚点线的轨道检测网络DALNet,包括动态锚点生成器和轨道检测模块
  • results: DALNet在我们提供的DL-Rail轨道检测数据集和知名的Tusimple和LLAMAS车道检测标准 benchmark中达到了状态之精度表现。
    Abstract Rail detection is one of the key factors for intelligent train. In the paper, motivated by the anchor line-based lane detection methods, we propose a rail detection network called DALNet based on dynamic anchor line. Aiming to solve the problem that the predefined anchor line is image agnostic, we design a novel dynamic anchor line mechanism. It utilizes a dynamic anchor line generator to dynamically generate an appropriate anchor line for each rail instance based on the position and shape of the rails in the input image. These dynamically generated anchor lines can be considered as better position references to accurately localize the rails than the predefined anchor lines. In addition, we present a challenging urban rail detection dataset DL-Rail with high-quality annotations and scenario diversity. DL-Rail contains 7000 pairs of images and annotations along with scene tags, and it is expected to encourage the development of rail detection. We extensively compare DALNet with many competitive lane methods. The results show that our DALNet achieves state-of-the-art performance on our DL-Rail rail detection dataset and the popular Tusimple and LLAMAS lane detection benchmarks. The code will be released at https://github.com/Yzichen/mmLaneDet.
    摘要 轨道检测是智能列车的关键因素之一。本文受基于锚线的车道检测方法启发,提出一种基于动态锚线的轨道检测网络DALNet。针对预定义锚线与图像无关的问题,我们设计了一种新颖的动态锚线机制:利用动态锚线生成器,根据输入图像中轨道的位置和形状,为每个轨道实例动态生成合适的锚线。相比预定义锚线,这些动态生成的锚线可作为更好的位置参考,从而准确定位轨道。此外,我们提出了一个具有高质量标注和场景多样性的、具有挑战性的城市轨道检测数据集DL-Rail,该数据集包含7000对图像与标注以及场景标签,有望推动轨道检测的发展。我们将DALNet与多种有竞争力的车道检测方法进行了广泛比较,结果表明DALNet在我们的DL-Rail轨道检测数据集以及流行的Tusimple和LLAMAS车道检测基准上均达到了最先进性能。代码将发布于 https://github.com/Yzichen/mmLaneDet 。

Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images

  • paper_url: http://arxiv.org/abs/2308.11376
  • repo_url: None
  • paper_authors: Weixi Yi, Vasilis Stavrinides, Zachary M. C. Baum, Qianye Yang, Dean C. Barratt, Matthew J. Clarkson, Yipeng Hu, Shaheer U. Saeed
  • for: 本研究旨在提出一种仅使用图像块级标签训练的弱监督分割方法,并将分割视为边界检测问题,而非以往工作中的像素级分类。
  • methods: 该方法使用强化学习训练一个控制函数来定位感兴趣区域(ROI)的边界,其奖励来自一个预训练的边界存在分类器。
  • results: 在前列腺分割这一具有临床意义的任务上进行评估,与使用相同标签的其他弱监督方法(如多示例学习)相比,取得了更好的表现。
    Abstract We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision the segmentation as a boundary detection problem, rather than a pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios such as where noise artefacts may be present within the region-of-interest (ROI) boundaries, where traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Particularly of interest, ultrasound images, where intensity values represent acoustic impedance differences between boundaries, may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, which are the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also prevents possible false-positive or -negative cases by minimising number of patches passed to the boundary-presence classifier. We evaluate our proposed approach for a clinically relevant task of prostate gland segmentation on trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods, using the same labels e.g., multiple instance learning.
    摘要 我们提出Boundary-RL,一种仅使用图像块级标签进行训练的新颖弱监督分割方法。我们将分割视为边界检测问题,而不是以往工作中的像素级分类。这种视角有望在具有挑战性的场景下进行边界勾画,例如感兴趣区域(ROI)边界内存在噪声伪影时,传统基于像素级分类的弱监督方法可能难以有效分割ROI。特别值得关注的是超声图像,其强度值反映边界处的声阻抗差异,因而也可能受益于这种边界勾画方法。我们的方法使用强化学习训练一个控制函数来定位ROI的边界,奖励来自一个预训练的边界存在分类器。该分类器在控制器于序贯马尔可夫决策过程中移动图像块位置时,指示块内是否遇到物体边界。分类器本身仅使用物体是否存在的二值块级标签进行训练(这也是整个边界勾画框架训练中使用的唯一标签),作为弱信号指导边界勾画。控制函数的使用避免了在整幅图像上滑窗,也通过减少送入边界存在分类器的图像块数量来降低可能的假阳性或假阴性。我们在经直肠超声图像的前列腺分割这一具有临床意义的任务上评估了所提方法,结果表明,与使用相同标签的其他弱监督方法(如多示例学习)相比,我们的方法取得了更好的性能。

Enhancing Interpretable Object Abstraction via Clustering-based Slot Initialization

  • paper_url: http://arxiv.org/abs/2308.11369
  • repo_url: None
  • paper_authors: Ning Gao, Bernard Hohmann, Gerhard Neumann
  • For: The paper is focused on improving object-centric representations using slots for efficient, flexible, and interpretable abstraction from low-level perceptual features in a compositional scene.
  • Methods: The paper proposes using clustering algorithms conditioned on perceptual input features to initialize the slot representations, and designs permutation invariant and permutation equivariant versions of this layer to enable exchangeable slot representations after clustering.
  • Results: The paper shows that its method outperforms prior works consistently, especially for complex scenes, through experiments on object discovery and novel view synthesis tasks with various datasets.
    Abstract Object-centric representations using slots have shown the advances towards efficient, flexible and interpretable abstraction from low-level perceptual features in a compositional scene. Current approaches randomize the initial state of slots followed by an iterative refinement. As we show in this paper, the random slot initialization significantly affects the accuracy of the final slot prediction. Moreover, current approaches require a predetermined number of slots from prior knowledge of the data, which limits the applicability in the real world. In our work, we initialize the slot representations with clustering algorithms conditioned on the perceptual input features. This requires an additional layer in the architecture to initialize the slots given the identified clusters. We design permutation invariant and permutation equivariant versions of this layer to enable the exchangeable slot representations after clustering. Additionally, we employ mean-shift clustering to automatically identify the number of slots for a given scene. We evaluate our method on object discovery and novel view synthesis tasks with various datasets. The results show that our method outperforms prior works consistently, especially for complex scenes.
    摘要 使用槽(slot)的以物体为中心的表示,在从组合场景的低层感知特征中进行高效、灵活且可解释的抽象方面取得了进展。现有方法随机初始化槽的初始状态,再进行迭代优化。正如本文所示,随机的槽初始化会显著影响最终槽预测的准确性。此外,现有方法需要根据数据的先验知识预先确定槽的数量,限制了其在现实世界中的适用性。在我们的工作中,我们使用以感知输入特征为条件的聚类算法来初始化槽表示。这需要在架构中增加一个额外的层,根据识别出的聚类来初始化槽。我们设计了该层的置换不变和置换等变版本,以便在聚类后得到可交换的槽表示。此外,我们采用均值漂移(mean-shift)聚类来自动确定给定场景的槽数量。我们在多个数据集上的物体发现和新视角合成任务中评估了所提方法,结果表明其持续优于已有工作,尤其是在复杂场景中。
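
The mean-shift initialisation is straightforward to sketch: cluster the per-location features, let the number of modes found determine the number of slots, and use the cluster centres as initial slot vectors. The snippet below uses scikit-learn's MeanShift on toy features; the paper's permutation-(in/equi)variant conditioning layer is not reproduced here.

```python
import numpy as np
from sklearn.cluster import MeanShift

def init_slots(features, bandwidth=None):
    """features: (num_locations, feat_dim) -> (num_slots, feat_dim) slot inits."""
    ms = MeanShift(bandwidth=bandwidth).fit(features)  # bandwidth=None: estimated
    return ms.cluster_centers_

feats = np.concatenate([np.random.randn(50, 8) + 5, np.random.randn(50, 8) - 5])
slots = init_slots(feats)
print(slots.shape)   # the slot count is discovered, not fixed in advance
```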

Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images

  • paper_url: http://arxiv.org/abs/2308.11365
  • repo_url: None
  • paper_authors: Alperen Kalay, Bahri Batuhan Bilecen, Mustafa Ayazoglu
  • for: 这个研究旨在解决 SR 网络中训练后量化 (PTQ) 阶段的一个重要问题,即代表性数据集 (RD)。
  • methods: 我们提出了一种新的无裁剪量化管线(CFQP),并辅以大量实验论证,巧妙地仅利用FP32模型的输出来增强RD图像。该方法可以消除不必要的裁剪激活层,从而提升整体稳定性、缩短推理时间(在部分SR模型上最多缩短54%)、获得优于INT8裁剪模型的视觉质量,甚至在一些SR模型上超越未量化的FP32模型。
  • results: 我们的方法可以在部分SR模型上提升视觉质量并缩短推理时间,且无需使用裁剪激活重新训练;在某些情况下,甚至能在运行时间和视觉质量两方面超越未量化的FP32模型。
    Abstract Super-resolution (SR) networks have been investigated for a while, with their mobile and lightweight versions gaining noticeable popularity recently. Quantization, the procedure of decreasing the precision of network parameters (mostly FP32 to INT8), is also utilized in SR networks for establishing mobile compatibility. This study focuses on a very important but mostly overlooked post-training quantization (PTQ) step: representative dataset (RD), which adjusts the quantization range for PTQ. We propose a novel pipeline (clip-free quantization pipeline, CFQP) backed up with extensive experimental justifications to cleverly augment RD images by only using outputs of the FP32 model. Using the proposed pipeline for RD, we can successfully eliminate unwanted clipped activation layers, which nearly all mobile SR methods utilize to make the model more robust to PTQ in return for a large overhead in runtime. Removing clipped activations with our method significantly benefits overall increased stability, decreased inference runtime up to 54% on some SR models, better visual quality results compared to INT8 clipped models - and outperforms even some FP32 non-quantized models, both in runtime and visual quality, without the need for retraining with clipped activation.
    摘要 超分辨率(SR)网络已被研究多年,其移动端轻量级版本近来备受关注。量化(将网络参数精度降低,主要是FP32到INT8)也被用于SR网络以实现移动端兼容。本研究关注一个非常重要但常被忽视的训练后量化(PTQ)环节:用于调整PTQ量化范围的代表性数据集(RD)。我们提出了一种新的无裁剪量化管线(clip-free quantization pipeline,CFQP),并辅以大量实验论证,巧妙地仅利用FP32模型的输出来增强RD图像。借助所提的RD构建管线,我们可以成功去除不必要的裁剪激活层;几乎所有移动端SR方法都使用裁剪激活层来增强模型对PTQ的鲁棒性,但代价是巨大的运行时开销。用我们的方法去除裁剪激活显著提升了整体稳定性,在部分SR模型上使推理时间最多缩短54%,视觉质量优于INT8裁剪模型,甚至在运行时间和视觉质量上超越部分未量化的FP32模型,且无需使用裁剪激活重新训练。
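
For orientation, here is what the calibration half of a post-training-quantization pipeline looks like in eager-mode PyTorch: observers are inserted, a representative dataset is pushed through the model to set quantization ranges, and the model is then converted to INT8. The toy model, the use of the FP32 model's own outputs as RD images, and all sizes are assumptions sketching the CFQP idea rather than its exact recipe.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub,
                                   get_default_qconfig, prepare, convert)

class TinySR(nn.Module):
    """Toy stand-in for a lightweight SR model."""
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):
        return self.dequant(self.body(self.quant(x)))

fp32 = TinySR().eval()
fp32.qconfig = get_default_qconfig('fbgemm')
prepared = prepare(fp32)                         # insert range observers

with torch.no_grad():                            # calibration pass over the RD
    for _ in range(8):
        rd_image = fp32.body(torch.rand(1, 3, 32, 32))  # FP32-output-derived RD
        prepared(rd_image.clamp(0, 1))

int8 = convert(prepared)                         # INT8 model, no clipped activations
```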

Exemplar-Free Continual Transformer with Convolutions

  • paper_url: http://arxiv.org/abs/2308.11357
  • repo_url: None
  • paper_authors: Anurag Roy, Vinay Kumar Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, Abir Das
  • for: 这 paper 的目的是提出一种新的无例子(exemplar-free)的类/任务逐步学习方法,不需要在测试时显式提供任务标识符(task identifier),并且不需要保留之前训练集。
  • methods: 该方法使用 transformer 架构,并通过重新权重 multi-head self-attention 层中的键、查询和值权重来实现类/任务逐步学习。具体来说,通过 convolution 来重新权重这些权重,以便在每个任务上保持低的参数数量。此外,使用图像增强技术来预测任务,而无需在测试时显式提供任务标识符。
  • results: 实验结果表明,该方法可以在四个 benchmark 数据集上超越许多竞争方法,而且需要更少的参数。
    Abstract Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transformers, there is a need to explore their potential for CL. Although there have been some recent CL approaches for vision transformers, they either store training instances of previous tasks or require a task identifier during test time, which can be limiting. This paper proposes a new exemplar-free approach for class/task incremental learning called ConTraCon, which does not require task-id to be explicitly present during inference and avoids the need for storing previous training instances. The proposed approach leverages the transformer architecture and involves re-weighting the key, query, and value weights of the multi-head self-attention layers of a transformer trained on a similar task. The re-weighting is done using convolution, which enables the approach to maintain low parameter requirements per task. Additionally, an image augmentation-based entropic task identification approach is used to predict tasks without requiring task-ids during inference. Experiments on four benchmark datasets demonstrate that the proposed approach outperforms several competitive approaches while requiring fewer parameters.
    摘要 持续学习(CL)指以序列方式训练机器学习模型,在不依赖以往训练数据的情况下学习新知识并保持已学任务的性能。尽管CL受到了广泛关注,但近期计算机视觉领域的CL方法大多只针对卷积架构。随着视觉Transformer的成功,有必要探索其在CL中的潜力。现有少数针对视觉Transformer的CL方法要么需要存储以往任务的训练样本,要么在测试时需要任务标识符,这都带来了限制。本文提出一种新的无样本(exemplar-free)类/任务增量学习方法ConTraCon,推理时无需显式提供任务标识符,也无需存储以往训练样本。该方法利用Transformer架构,通过卷积对在相似任务上训练的Transformer的多头自注意力层的键、查询和值权重进行重加权,使每个任务只需少量参数。此外,采用基于图像增强的熵任务识别方法,在推理时无需任务标识符即可预测任务。在四个基准数据集上的实验表明,该方法以更少的参数超越了多种有竞争力的方法。

Integration of Sentinel-1 and Sentinel-2 data for Earth surface classification using Machine Learning algorithms implemented on Google Earth Engine

  • paper_url: http://arxiv.org/abs/2308.11340
  • repo_url: None
  • paper_authors: Francesca Razzano, Mariapia Rita Iandolo, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: 本研究使用Synthetic Aperture Radar (SAR)和光学数据进行地面分类。
  • methods: 通过在Google Earth Engine (GEE)平台上实施监督式机器学习(ML)算法,将Sentinel-1 (S-1)和Sentinel-2 (S-2)数据集成起来,用于地面覆盖分类。
  • results: 研究结果表明,在这种情况下,雷达和光学远程探测提供了补偿信息,有利地面覆盖分类,通常导致映射精度的提高。此外,本研究也证明了GEE在处理大量卫星数据方面的emerging角色。
    Abstract In this study, Synthetic Aperture Radar (SAR) and optical data are both considered for Earth surface classification. Specifically, the integration of Sentinel-1 (S-1) and Sentinel-2 (S-2) data is carried out through supervised Machine Learning (ML) algorithms implemented on the Google Earth Engine (GEE) platform for the classification of a particular region of interest. Achieved results demonstrate how in this case radar and optical remote detection provide complementary information, benefiting surface cover classification and generally leading to increased mapping accuracy. In addition, this paper works in the direction of proving the emerging role of GEE as an effective cloud-based tool for handling large amounts of satellite data.
    摘要 本研究同时考虑合成孔径雷达(SAR)与光学数据用于地表分类。具体而言,通过在Google Earth Engine(GEE)平台上实现监督式机器学习(ML)算法,将Sentinel-1(S-1)与Sentinel-2(S-2)数据集成,对特定感兴趣区域进行分类。结果表明,在该场景下雷达与光学遥感提供了互补信息,有利于地表覆盖分类,并通常带来制图精度的提升。此外,本文还进一步证明了GEE作为处理海量卫星数据的高效云端工具的新兴作用。
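
A typical GEE workflow of this kind fits in a few lines of the Earth Engine Python API: build S-1 and S-2 composites, stack the bands, sample labelled points, and train a classifier. The region, dates, band choices, tree count, and the training-points asset path are all placeholder assumptions.

```python
import ee
ee.Initialize()

region = ee.Geometry.Rectangle([14.0, 40.7, 14.4, 41.0])   # placeholder ROI

s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')              # SAR backscatter
      .filterBounds(region).filterDate('2022-01-01', '2022-12-31')
      .select(['VV', 'VH']).median())
s2 = (ee.ImageCollection('COPERNICUS/S2_SR')               # optical reflectance
      .filterBounds(region).filterDate('2022-01-01', '2022-12-31')
      .select(['B2', 'B3', 'B4', 'B8']).median())
stack = s2.addBands(s1)                                    # integrated S-1 + S-2

labels = ee.FeatureCollection('users/example/training_points')  # assumed asset
samples = stack.sampleRegions(collection=labels, properties=['class'], scale=10)
classifier = ee.Classifier.smileRandomForest(100).train(
    features=samples, classProperty='class', inputProperties=stack.bandNames())
classified = stack.classify(classifier)                    # land-cover map
```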

Object Detection Difficulty: Suppressing Over-aggregation for Faster and Better Video Object Detection

  • paper_url: http://arxiv.org/abs/2308.11327
  • repo_url: https://github.com/bingqingzhang/odd-vod
  • paper_authors: Bingqing Zhang, Sen Wang, Yifan Liu, Brano Kusy, Xue Li, Jiajun Liu
  • for: 提高视频对象检测(VOD)系统的实用性
  • methods: 提出一种图像级对象检测难度(ODD)指标,用于衡量检测图像中对象的难度,并在VOD过程中应用该指标来减少过度聚合。
  • results: 对8种VOD模型进行的广泛实验表明:用于选择全局参考帧时,ODD-VOD能够一致地提升基于全局帧的VOD模型的准确率;用于加速时,ODD-VOD平均提升73.3%的帧率(FPS),且不降低准确率。
    Abstract Current video object detection (VOD) models often encounter issues with over-aggregation due to redundant aggregation strategies, which perform feature aggregation on every frame. This results in suboptimal performance and increased computational complexity. In this work, we propose an image-level Object Detection Difficulty (ODD) metric to quantify the difficulty of detecting objects in a given image. The derived ODD scores can be used in the VOD process to mitigate over-aggregation. Specifically, we train an ODD predictor as an auxiliary head of a still-image object detector to compute the ODD score for each image based on the discrepancies between detection results and ground-truth bounding boxes. The ODD score enhances the VOD system in two ways: 1) it enables the VOD system to select superior global reference frames, thereby improving overall accuracy; and 2) it serves as an indicator in the newly designed ODD Scheduler to eliminate the aggregation of frames that are easy to detect, thus accelerating the VOD process. Comprehensive experiments demonstrate that, when utilized for selecting global reference frames, ODD-VOD consistently enhances the accuracy of Global-frame-based VOD models. When employed for acceleration, ODD-VOD consistently improves the frames per second (FPS) by an average of 73.3% across 8 different VOD models without sacrificing accuracy. When combined, ODD-VOD attains state-of-the-art performance when competing with many VOD methods in both accuracy and speed. Our work represents a significant advancement towards making VOD more practical for real-world applications.
    摘要 当前的视频目标检测(VOD)模型经常受到过度聚合的困扰:冗余的聚合策略在每一帧上都进行特征聚合,导致性能欠佳且计算复杂度增加。为此,我们提出了一个图像级目标检测难度(ODD)度量,用于量化在给定图像中检测目标的难度。所得ODD分数可在VOD过程中用于缓解过度聚合。具体而言,我们将ODD预测器训练为静态图像目标检测器的辅助头,基于检测结果与真值边界框之间的差异为每幅图像计算ODD分数。ODD分数从两方面增强VOD系统:1)使VOD系统能够选择更优的全局参考帧,从而提升整体准确率;2)作为新设计的ODD调度器中的指标,去除对易检测帧的聚合,从而加速VOD过程。综合实验表明,用于选择全局参考帧时,ODD-VOD能够一致地提升基于全局帧的VOD模型的准确率;用于加速时,ODD-VOD在8种不同的VOD模型上平均提升73.3%的FPS,且不牺牲准确率。两者结合时,ODD-VOD在准确率与速度上均可与众多VOD方法竞争并达到最先进性能。我们的工作是使VOD更贴近实际应用的重要一步。
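
The scheduling side of the idea reduces to a threshold rule: frames predicted to be easy are handled by the cheap still-image detector alone, and only difficult frames pay for feature aggregation. A minimal sketch with an assumed threshold:

```python
def schedule_aggregation(odd_scores, tau=0.5):
    """Split frame indices by predicted detection difficulty."""
    easy, hard = [], []
    for idx, score in enumerate(odd_scores):
        (hard if score >= tau else easy).append(idx)
    return easy, hard   # easy: skip aggregation; hard: aggregate with references

easy, hard = schedule_aggregation([0.1, 0.8, 0.3, 0.9, 0.2])
print(easy, hard)       # [0, 2, 4] [1, 3]
```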

CiteTracker: Correlating Image and Text for Visual Tracking

  • paper_url: http://arxiv.org/abs/2308.11322
  • repo_url: https://github.com/NorahGreen/CiteTracker
  • paper_authors: Xin Li, Yuqing Huang, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang
  • for: 提高视觉跟踪中目标模型和推理的精度,使得在目标呈现大幅变化时仍能准确跟踪。
  • methods: 提出了一种基于图像和文本的目标跟踪方法,通过图像转文本模块将目标图像区域转换为描述性文本,并通过动态描述模块适应目标变化以提高跟踪精度。
  • results: 经过了五种不同的数据集测试,并与现有方法进行比较,研究发现提出的跟踪方法在跟踪目标呈现大幅变化时表现出了优于现有方法的性能。
    Abstract Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking. However, a single image patch cannot provide a complete and precise concept of the target object as images are limited in their ability to abstract and can be ambiguous, which makes it difficult to track targets with drastic variations. In this paper, we propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text. Specifically, we develop a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target. In addition, a dynamic description module is designed to adapt to target variations for more effective target representation. We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference. Extensive experiments on five diverse datasets are conducted to evaluate the proposed algorithm and the favorable performance against the state-of-the-art methods demonstrates the effectiveness of the proposed tracking method.
    摘要 现有的视觉跟踪方法通常使用图像块作为目标的参考点进行跟踪。然而,单个图像块无法提供完整和准确的目标对象概念,因为图像有限制,容易受到歧义和变化的影响,这使得跟踪目标变化具有挑战性。在这篇论文中,我们提出了CiteTracker,用于增强视觉跟踪中目标模型化和推理的方法。具体来说,我们开发了一个文本生成模块,将目标图像块转换为包含类和特征信息的详细文本描述,为目标提供了全面的参考点。此外,我们还设计了一个动态描述模块,以适应目标变化,以更有效地表示目标。然后,我们使用关注机制来关联目标描述和搜索图像,生成相关特征,用于目标状态参考。我们对五种不同的数据集进行了广泛的实验,以评估提出的算法效果。结果表明,与现有方法相比,我们的跟踪方法具有优秀的效果。

Using and Abusing Equivariance

  • paper_url: http://arxiv.org/abs/2308.11316
  • repo_url: None
  • paper_authors: Tom Edixhoven, Attila Lengyel, Jan van Gemert
  • for: 研究群等变卷积神经网络如何学习打破对其对称性的等变性
  • methods: 使用抽样来学习破坏对称性,对2D旋转和反射进行研究
  • results: 发现输入维度仅改变一个像素,就足以使常用网络变为近似等变而非精确等变;近似等变网络对未见过的对称性的泛化能力显著差于精确等变网络;但当训练数据中的对称性与网络的对称性不一致时,近似等变网络能够放宽自身的等变约束,在常用基准数据上匹配甚至超越精确等变网络。
    Abstract In this paper we show how Group Equivariant Convolutional Neural Networks use subsampling to learn to break equivariance to their symmetries. We focus on 2D rotations and reflections and investigate the impact of broken equivariance on network performance. We show that a change in the input dimension of a network as small as a single pixel can be enough for commonly used architectures to become approximately equivariant, rather than exactly. We investigate the impact of networks not being exactly equivariant and find that approximately equivariant networks generalise significantly worse to unseen symmetries compared to their exactly equivariant counterparts. However, when the symmetries in the training data are not identical to the symmetries of the network, we find that approximately equivariant networks are able to relax their own equivariant constraints, causing them to match or outperform exactly equivariant networks on common benchmark datasets.
    摘要 本文展示了群等变卷积神经网络如何通过下采样学习打破对其对称性的等变性。我们聚焦于2D旋转和反射,并研究等变性被打破对网络性能的影响。我们发现,网络输入维度仅改变一个像素,就足以使常用架构变为近似等变而非精确等变。我们进一步研究了网络并非精确等变的影响,发现近似等变网络对未见过的对称性的泛化能力显著差于精确等变网络。然而,当训练数据中的对称性与网络的对称性不一致时,我们发现近似等变网络能够放宽自身的等变约束,从而在常用基准数据集上匹配甚至超越精确等变网络。
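
The effect of subsampling on equivariance can be demonstrated in a few lines. The sketch below uses a 3x3 box filter, which is symmetric under 90-degree rotation, so the layer itself is exactly rotation-equivariant; changing only the stride breaks the identity f(rot(x)) = rot(f(x)). This is a toy demonstration of the mechanism, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

def rot_equivariance_error(layer, x):
    """Relative error || f(rot90 x) - rot90 f(x) || / || rot90 f(x) ||."""
    with torch.no_grad():
        a = layer(torch.rot90(x, 1, dims=(2, 3)))
        b = torch.rot90(layer(x), 1, dims=(2, 3))
    return ((a - b).norm() / b.norm()).item()

def iso_conv(stride):
    """3x3 box filter: rotation-symmetric kernel, so only stride can break things."""
    conv = nn.Conv2d(1, 1, 3, stride=stride, padding=1, bias=False)
    with torch.no_grad():
        conv.weight.fill_(1.0 / 9.0)
    return conv

x = torch.randn(1, 1, 32, 32)
print(rot_equivariance_error(iso_conv(stride=1), x))  # ~0: exactly equivariant
print(rot_equivariance_error(iso_conv(stride=2), x))  # > 0: subsampling breaks it
```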

Approaching human 3D shape perception with neurally mappable models

  • paper_url: http://arxiv.org/abs/2308.11300
  • repo_url: None
  • paper_authors: Thomas P. O’Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Josh B. Tenenbaum, Vincent Sitzmann, Nancy Kanwisher
  • for: 这研究旨在理解人类如何自然地推测三维形状,以及这种能力是如何被计算机模型重建的?
  • methods: 研究使用了一种新的计算模型,即3D神经场(3D-LFN),该模型基于深度神经网络(DNN),并通过多视角训练和多视角学习目标来实现人类水平的性能。
  • results: 研究发现,3D-LFN支持人类水平的三维匹配判断,并在针对标准DNN模型的攻击性定义比较中表现出色。此外,研究还发现,通过多视角训练和多视角学习目标,even conventional DNN architectures可以更接近人类行为。但是,这些模型在处理新的物体类别时仍有所限制。
    Abstract Humans effortlessly infer the 3D shape of objects. What computations underlie this ability? Although various computational models have been proposed, none of them capture the human ability to match object shape across viewpoints. Here, we ask whether and how this gap might be closed. We begin with a relatively novel class of computational models, 3D neural fields, which encapsulate the basic principles of classic analysis-by-synthesis in a deep neural network (DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D matching judgments well aligned to humans for within-category comparisons, adversarially-defined comparisons that accentuate the 3D failure cases of standard DNN models, and adversarially-defined comparisons for algorithmically generated shapes with no category structure. We then investigate the source of the 3D-LFN's ability to achieve human-aligned performance through a series of computational experiments. Exposure to multiple viewpoints of objects during training and a multi-view learning objective are the primary factors behind model-human alignment; even conventional DNN architectures come much closer to human behavior when trained with multi-view objectives. Finally, we find that while the models trained with multi-view learning objectives are able to partially generalize to new object categories, they fall short of human alignment. This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures and highlights important questions for future work.
    摘要 人类能够毫不费力地推断物体的3D形状。支撑这一能力的计算是什么?尽管已有多种计算模型被提出,但没有一种能够刻画人类跨视角匹配物体形状的能力。在本文中,我们探讨这一差距能否以及如何被弥合。我们从一类相对较新的计算模型——3D神经场出发,它将经典“分析-综合”的基本原理封装于深度神经网络(DNN)中。首先,我们发现3D光场网络(3D-LFN)支持与人类高度一致的3D匹配判断,涵盖类别内比较、专门凸显标准DNN模型3D失败案例的对抗式比较,以及对无类别结构的算法生成形状的对抗式比较。随后,我们通过一系列计算实验探究3D-LFN取得与人类一致表现的来源:训练中接触物体的多个视角以及多视角学习目标是模型与人类对齐的主要因素;即使是常规DNN架构,在以多视角目标训练后也会更接近人类行为。最后,我们发现以多视角学习目标训练的模型虽能部分泛化到新的物体类别,但仍未达到与人类的一致。本工作为在可神经映射的计算架构中理解人类形状推断奠定了基础,并指出了未来工作的重要问题。

BHSD: A 3D Multi-Class Brain Hemorrhage Segmentation Dataset

  • paper_url: http://arxiv.org/abs/2308.11298
  • repo_url: None
  • paper_authors: Biao Wu, Yutong Xie, Zeyu Zhang, Jinchao Ge, Kaspar Yaxley, Suzan Bahadir, Qi Wu, Yifan Liu, Minh-Son To
  • for: 本研究旨在提供一个3D多类血肿 segmentation dataset(BHSD),以便为血肿 segmentation任务提供支持。
  • methods: 本研究利用深度学习技术进行医学图像分割,并将其应用于脑出血分割任务。
  • results: 本研究提供了一个包含192个Volume的多类血肿数据集,以及2200个slice-level标注的数据集。通过对这些数据集进行supervised和semi-supervisedsegmentation任务的实验,我们证明了数据集的实用性。
    Abstract Intracranial hemorrhage (ICH) is a pathological condition characterized by bleeding inside the skull or brain, which can be attributed to various factors. Identifying, localizing and quantifying ICH has important clinical implications, in a bleed-dependent manner. While deep learning techniques are widely used in medical image segmentation and have been applied to the ICH segmentation task, existing public ICH datasets do not support the multi-class segmentation problem. To address this, we develop the Brain Hemorrhage Segmentation Dataset (BHSD), which provides a 3D multi-class ICH dataset containing 192 volumes with pixel-level annotations and 2200 volumes with slice-level annotations across five categories of ICH. To demonstrate the utility of the dataset, we formulate a series of supervised and semi-supervised ICH segmentation tasks. We provide experimental results with state-of-the-art models as reference benchmarks for further model developments and evaluations on this dataset.
    摘要 颅内出血(ICH)是一种以颅骨内或脑内出血为特征的病理状态,可由多种因素引起。识别、定位并量化ICH具有重要的临床意义,且与出血情况相关。尽管深度学习技术已广泛用于医学图像分割并被应用于ICH分割任务,但现有公开的ICH数据集不支持多类分割问题。为此,我们构建了脑出血分割数据集(BHSD),提供一个3D多类ICH数据集,包含192个带像素级标注的卷以及2200个带切片级标注的卷,涵盖五类ICH。为展示该数据集的实用价值,我们设计了一系列有监督和半监督的ICH分割任务,并提供了最先进模型的实验结果,作为后续模型开发与评估的参考基准。

PCMC-T1: Free-breathing myocardial T1 mapping with Physically-Constrained Motion Correction

  • paper_url: http://arxiv.org/abs/2308.11281
  • repo_url: None
  • paper_authors: Eyal Hanania, Ilya Volovik, Lilach Barkat, Israel Cohen, Moti Freiman
  • for: The paper is focused on developing a deep-learning-based method for motion correction in free-breathing T1 mapping, which can improve the accuracy and accessibility of diffuse myocardial disease diagnosis.
  • methods: The proposed method, called PCMC-T1, incorporates a physically-constrained signal decay model into a deep-learning network to correct for motion artifacts in free-breathing T1 mapping.
  • results: PCMC-T1 was compared to baseline methods using a 5-fold experimental setup on a public dataset of 210 patients and demonstrated superior model fitting quality and clinical impact, with anatomical alignment results that were comparable to the baseline methods.
    Abstract T1 mapping is a quantitative magnetic resonance imaging (qMRI) technique that has emerged as a valuable tool in the diagnosis of diffuse myocardial diseases. However, prevailing approaches have relied heavily on breath-hold sequences to eliminate respiratory motion artifacts. This limitation hinders accessibility and effectiveness for patients who cannot tolerate breath-holding. Image registration can be used to enable free-breathing T1 mapping. Yet, inherent intensity differences between the different time points make the registration task challenging. We introduce PCMC-T1, a physically-constrained deep-learning model for motion correction in free-breathing T1 mapping. We incorporate the signal decay model into the network architecture to encourage physically-plausible deformations along the longitudinal relaxation axis. We compared PCMC-T1 to baseline deep-learning-based image registration approaches using a 5-fold experimental setup on a publicly available dataset of 210 patients. PCMC-T1 demonstrated superior model fitting quality (R2: 0.955) and achieved the highest clinical impact (clinical score: 3.93) compared to baseline methods (0.941, 0.946 and 3.34, 3.62 respectively). Anatomical alignment results were comparable (Dice score: 0.9835 vs. 0.984, 0.988). Our code and trained models are available at https://github.com/eyalhana/PCMC-T1.
    摘要 T1映射是一种定量磁共振成像(qMRI)技术,已成为诊断弥漫性心肌疾病的宝贵工具。然而,现有方法严重依赖屏气序列来消除呼吸运动伪影,这限制了无法耐受屏气的患者的可及性和有效性。图像配准可以实现自由呼吸T1映射,但不同时间点之间固有的强度差异使配准任务具有挑战性。我们提出PCMC-T1,一种用于自由呼吸T1映射运动校正的物理约束深度学习模型。我们将信号衰减模型融入网络架构,以鼓励沿纵向弛豫轴的物理合理形变。我们在一个包含210名患者的公开数据集上,采用5折实验设置,将PCMC-T1与基于深度学习的基线图像配准方法进行了比较。PCMC-T1展现了更优的模型拟合质量(R2:0.955),并取得了最高的临床影响评分(临床评分:3.93),优于基线方法(分别为0.941、0.946和3.34、3.62)。解剖对齐结果相当(Dice分数:0.9835 vs. 0.984、0.988)。我们的代码和训练模型可在 https://github.com/eyalhana/PCMC-T1 获取。
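
For context, the signal decay model usually embedded in MOLLI-style T1 mapping is the Look-Locker inversion-recovery curve S(t) = A − B·exp(−t/T1*), with the apparent T1* corrected as T1 = T1*·(B/A − 1). The per-voxel fit below is a generic sketch of that model (inversion times and noise level are assumed values), not the paper's network-embedded formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def t1_signal(t, A, B, T1_star):
    """Look-Locker inversion-recovery model: S(t) = A - B * exp(-t / T1*)."""
    return A - B * np.exp(-t / T1_star)

t = np.array([100, 180, 1000, 1180, 2000, 2180, 3000, 4000], float)  # ms, assumed
signal = t1_signal(t, A=1.0, B=1.9, T1_star=800.0) \
         + np.random.normal(0, 0.01, t.size)

(A, B, T1_star), _ = curve_fit(t1_signal, t, signal, p0=[1.0, 2.0, 1000.0])
T1 = T1_star * (B / A - 1.0)        # standard Look-Locker correction
print(f"T1* = {T1_star:.0f} ms, corrected T1 = {T1:.0f} ms")
```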

HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

  • paper_url: http://arxiv.org/abs/2308.11261
  • repo_url: None
  • paper_authors: Sadegh Aliakbarian, Fatemeh Saleh, David Collier, Pashmina Cameron, Darren Cosker
  • for: 这个论文的目的是提出一种能够生成正确和可信的全身动作,即使只有部分手部可见,以提高混合现实场景中的吸引力。
  • methods: 该论文提出了名为HMD-NeMo的轻量级神经网络,以在线、实时的方式预测全身动作,并引入新颖的时间自适应掩码token,以在手部不可见时鼓励生成合理的动作。
  • results: 经过广泛的分析和评估,该论文在AMASS数据集上取得了新的最优结果。
    Abstract Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.
    摘要 在混合现实场景中,生成既合理又准确的全身虚拟形象动作是沉浸式体验质量的关键。头戴式设备(HMD)通常只提供少量输入信号,例如头部和双手的6自由度位姿。最近,多种方法在仅凭头部和双手信号生成全身动作方面取得了令人瞩目的表现。然而,据我们所知,所有现有方法都依赖于双手完全可见。使用运动控制器时双手确实可见,但相当一部分混合现实体验并不使用运动控制器,而是依赖第一人称(egocentric)手部追踪。由于HMD视场受限,这就带来了手部仅部分可见的挑战。在本文中,我们提出首个统一方法HMD-NeMo,即使手部仅部分可见,也能生成合理且准确的全身动作。HMD-NeMo是一个轻量级神经网络,以在线、实时的方式预测全身动作。其核心是带有新颖时间自适应掩码token的时空编码器,在缺乏手部观测时鼓励生成合理的动作。我们对HMD-NeMo各组成部分的影响进行了广泛分析,并通过评估在AMASS数据集上取得了新的最优结果。

Video BagNet: short temporal receptive fields increase robustness in long-term action recognition

  • paper_url: http://arxiv.org/abs/2308.11249
  • repo_url: https://github.com/ombretta/videobagnet
  • paper_authors: Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert
  • for: 提高视频动作识别模型的 robustness,使其能够更好地承受视频中的子动作顺序变化。
  • methods: 我们设计了Video BagNet,一种3D ResNet-50的变体,将时间感受野大小限制为1、9、17或33帧。
  • results: 我们在合成和真实视频数据集上分析了Video BagNet,发现较短的时间感受野对子动作顺序变化具有鲁棒性,而较大的时间感受野对子动作顺序较为敏感。
    Abstract Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when testing videos have a different sub-action order. In this work, we investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order.
    摘要 以往的长时程视频动作识别工作依赖具有较大时间感受野(RF)的深度3D卷积模型。我们认为这类模型并非总是视频时间建模的最佳选择。较大的时间感受野使模型能够编码视频中子动作的精确顺序,当测试视频的子动作顺序不同时,性能便会下降。在本工作中,我们研究能否通过缩小动作识别模型的时间感受野来提升其对子动作顺序的鲁棒性。为此,我们设计了Video BagNet,一种将时间感受野大小限制为1、9、17或33帧的3D ResNet-50变体。我们在合成和真实视频数据集上分析了Video BagNet,并对不同时间感受野的模型进行了实验比较。结果表明,较短的时间感受野对子动作顺序变化具有鲁棒性,而较大的时间感受野对子动作顺序较为敏感。

Are current long-term video understanding datasets long-term?

  • paper_url: http://arxiv.org/abs/2308.11244
  • repo_url: https://github.com/ombretta/longterm_datasets
  • paper_authors: Ombretta Strafforello, Klamer Schutte, Jan van Gemert
  • for: This paper aims to evaluate the suitability of video datasets for long-term action recognition.
  • methods: The proposed method defines long-term actions as those that cannot be recognized using solely short-term information, and tests this definition on three popular real-world datasets.
  • results: The study finds that the existing datasets can be effectively solved using shortcuts based on short-term information, and encourages researchers to use datasets that require long-term information to be solved.
  • for: 这篇论文目的是评估视频数据集是否适用于长期动作识别。
  • methods: 该方法定义长期动作为不能通过短期信息alone来识别的动作,并对三个实际世界数据集进行测试。
  • results: 研究发现现有数据集可以使用短期信息 shortcut 进行解决,并促使研究人员使用需要长期信息来解决的数据集。
    Abstract Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate if such models actually learn and reason over long-term information. In this work, we propose a method to evaluate how suitable a video dataset is to evaluate models for long-term action recognition. To this end, we define a long-term action as excluding all the videos that can be correctly recognized using solely short-term information. We test this definition on existing long-term classification tasks on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to determine if these datasets are truly evaluating long-term recognition. Our study reveals that these datasets can be effectively solved using shortcuts based on short-term information. Following this finding, we encourage long-term action recognition researchers to make use of datasets that need long-term information to be solved.
    摘要 从体育分析到监控,许多现实世界应用都受益于自动长时程动作识别。在当前自动动作识别的深度学习范式下,模型必须在能够检验其是否真正学习并利用长时程信息的数据集和任务上进行训练与测试。在本工作中,我们提出一种方法来评估视频数据集是否适合评估长时程动作识别模型。为此,我们将长时程动作定义为排除所有仅凭短时信息即可被正确识别的视频。我们在三个流行的真实世界数据集(Breakfast、CrossTask和LVU)的现有长时程分类任务上检验该定义,以确定这些数据集是否真正在评估长时程识别。我们的研究表明,这些数据集可以利用基于短时信息的捷径被有效求解。基于这一发现,我们鼓励长时程动作识别研究者使用必须依赖长时程信息才能求解的数据集。

LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training

  • paper_url: http://arxiv.org/abs/2308.11239
  • repo_url: None
  • paper_authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Balaji Krishnamurthy
  • for: 本研究旨在无需人工监督下完成图像和视频数据集中的对象分割问题。
  • methods: 我们提出了一种自监督对象发现方法,利用运动和外观信息生成高质量的对象分割掩码。我们在传统图割中加入运动信息,与外观信息线性组合来生成边权重。
  • results: 我们的方法在多个标准视频对象分割、图像吸引力检测和对象分割 benchmark 上达到了与现状对照的性能。我们还通过自我训练来进一步提高性能。在审查实验中,我们的方法在未知领域中的转移性也得到了证明。
    Abstract Learning object segmentation in image and video datasets without human supervision is a challenging problem. Humans easily identify moving salient objects in videos using the gestalt principle of common fate, which suggests that what moves together belongs together. Building upon this idea, we propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. Specifically, we redesign the traditional graph cut on images to include motion information in a linear combination with appearance information to produce edge weights. Remarkably, this step produces object segmentation masks comparable to the current state-of-the-art on multiple benchmarks. To further improve performance, we bootstrap a segmentation network trained on these preliminary masks as pseudo-ground truths to learn from its own outputs via self-training. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks, achieving results on par with and, in many cases surpassing state-of-the-art methods. We also demonstrate the transferability of our approach to novel domains through a qualitative study on in-the-wild images. Additionally, we present extensive ablation analysis to support our design choices and highlight the contribution of each component of our proposed method.
    摘要 在没有人工监督的情况下学习图像和视频数据集中的对象分割是一个具有挑战性的问题。人类可以借助格式塔"共同命运"原则轻松识别视频中移动的显著对象:一起运动的部分属于同一对象。基于这一想法,我们提出了一种自监督对象发现方法,利用运动和外观信息生成高质量的对象分割掩码。具体来说,我们重新设计了传统的图割方法,将运动信息与外观信息线性组合来生成边权重。仅这一步生成的对象分割掩码就已可与当前最先进方法在多个基准上相媲美。为了进一步提高性能,我们以这些初步掩码作为伪真值自举训练一个分割网络,并通过自训练从其自身输出中学习。我们将该方法命名为LOCATE,并在多个标准视频对象分割、图像显著性检测和对象分割基准上取得了与最先进方法相当、甚至在许多情况下更优的结果。我们还通过对真实场景图像的定性研究,证明了该方法向新领域的可迁移性。此外,我们提供了大量消融分析,以支持我们的设计选择,并突出所提方法各组成部分的贡献。
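
The abstract's key step, edge weights formed as a linear combination of motion and appearance affinity for a graph cut, can be sketched as follows. The Gaussian kernels, bandwidths, and the mixing weight `alpha` are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: per-edge affinity from color and optical-flow similarity.
import numpy as np

def edge_weight(rgb_a, rgb_b, flow_a, flow_b, alpha=0.5, sigma_c=30.0, sigma_f=3.0):
    """Affinity between two neighboring pixels for a flow-guided graph cut."""
    w_app = np.exp(-np.sum((np.asarray(rgb_a, float) - rgb_b) ** 2) / (2 * sigma_c ** 2))
    w_mot = np.exp(-np.sum((np.asarray(flow_a, float) - flow_b) ** 2) / (2 * sigma_f ** 2))
    return alpha * w_mot + (1 - alpha) * w_app  # what moves together binds together

# Two pixels with different colors but nearly identical motion still get a
# strong edge, reflecting the common-fate principle.
print(edge_weight([200, 30, 30], [40, 40, 200], [2.0, 0.1], [2.1, 0.0]))
```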

Affordance segmentation of hand-occluded containers from exocentric images

  • paper_url: http://arxiv.org/abs/2308.11233
  • repo_url: None
  • paper_authors: Tommaso Apicella, Alessio Xompero, Edoardo Ragusa, Riccardo Berta, Andrea Cavallaro, Paolo Gastaldo
  • for: 本研究旨在解决人手持物体造成的遮挡问题,改进物体可供性(affordance)分割。
  • methods: 提议的模型使用辅助分支分别处理物体和手部区域,学习手部遮挡下的可供性特征。
  • results: 实验表明,我们的模型在真实图像和混合现实图像上的可供性分割与泛化能力均优于现有模型。
    Abstract Visual affordance segmentation identifies the surfaces of an object an agent can interact with. Common challenges for the identification of affordances are the variety of the geometry and physical properties of these surfaces as well as occlusions. In this paper, we focus on occlusions of an object that is hand-held by a person manipulating it. To address this challenge, we propose an affordance segmentation model that uses auxiliary branches to process the object and hand regions separately. The proposed model learns affordance features under hand-occlusion by weighting the feature map through hand and object segmentation. To train the model, we annotated the visual affordances of an existing dataset with mixed-reality images of hand-held containers in third-person (exocentric) images. Experiments on both real and mixed-reality images show that our model achieves better affordance segmentation and generalisation than existing models.
    摘要 视觉可供性分割用于识别智能体能够与之交互的物体表面。识别可供性的常见挑战包括这些表面在几何与物理特性上的多样性以及遮挡。本文关注人手持并操作物体时造成的遮挡问题。为解决这一挑战,我们提出一种可供性分割模型,利用辅助分支分别处理物体和手部区域。该模型通过手部和物体分割对特征图加权,从而在手部遮挡情况下学习可供性特征。为训练该模型,我们针对一个现有数据集中第三人称(外视角)混合现实图像里的手持容器,标注了其视觉可供性。在真实图像和混合现实图像上的实验表明,我们的模型在可供性分割和泛化方面均优于现有模型。

LDP-Feat: Image Features with Local Differential Privacy

  • paper_url: http://arxiv.org/abs/2308.11223
  • repo_url: None
  • paper_authors: Francesco Pittaluga, Bingbing Zhuang
  • for: 保护隐私,防止恶意攻击者通过图像特征恢复原始图像
  • methods: 针对将图像特征嵌入仿射子空间(包含原始特征与对抗特征样本)的私有化方案,提出两种新的反演攻击以揭示其隐私风险
  • results: 提出了首个基于本地差分隐私的图像特征私有化方法,无论攻击强度如何均可提供隐私泄露的保证上界,同时在下游视觉定位任务中保持强劲性能
    Abstract Modern computer vision services often require users to share raw feature descriptors with an untrusted server. This presents an inherent privacy risk, as raw descriptors may be used to recover the source images from which they were extracted. To address this issue, researchers recently proposed privatizing image features by embedding them within an affine subspace containing the original feature as well as adversarial feature samples. In this paper, we propose two novel inversion attacks to show that it is possible to (approximately) recover the original image features from these embeddings, allowing us to recover privacy-critical image content. In light of such successes and the lack of theoretical privacy guarantees afforded by existing visual privacy methods, we further propose the first method to privatize image features via local differential privacy, which, unlike prior approaches, provides a guaranteed bound for privacy leakage regardless of the strength of the attacks. In addition, our method yields strong performance in visual localization as a downstream task while enjoying the privacy guarantee.
    摘要 现代计算机视觉服务经常要求用户将原始特征描述子上传到不可信的服务器,这带来了固有的隐私风险,因为原始描述子可能被用来恢复其来源图像。为解决这一问题,研究人员最近提出将图像特征嵌入到一个包含原始特征及对抗特征样本的仿射子空间中,以实现特征私有化。在本文中,我们提出两种新的反演攻击,表明可以从这些嵌入中(近似地)恢复原始图像特征,从而恢复涉及隐私的图像内容。鉴于此类攻击的成功以及现有视觉隐私方法缺乏理论隐私保证,我们进一步提出了首个基于本地差分隐私的图像特征私有化方法:与先前方法不同,无论攻击强度如何,它都能提供隐私泄露的保证上界。此外,我们的方法在下游视觉定位任务中表现出色,同时享有隐私保证。
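
For intuition on what a local differential privacy guarantee over features looks like, here is a minimal Laplace-mechanism sketch for a bounded feature vector. It illustrates the kind of guaranteed leakage bound the abstract refers to; the paper's actual mechanism for image features is more involved.

```python
# Minimal sketch: epsilon-LDP release of a clipped feature vector via the
# Laplace mechanism (conservative L1 sensitivity = 2 * clip * dimension).
import numpy as np

def ldp_release(feature: np.ndarray, epsilon: float, clip: float = 1.0) -> np.ndarray:
    f = np.clip(feature, -clip, clip)          # bound each coordinate
    sensitivity = 2.0 * clip * f.size          # worst-case L1 change between inputs
    return f + np.random.laplace(scale=sensitivity / epsilon, size=f.shape)

descriptor = np.random.randn(128)
private_descriptor = ldp_release(descriptor, epsilon=8.0)
```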

DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment

  • paper_url: http://arxiv.org/abs/2308.11206
  • repo_url: None
  • paper_authors: Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
  • for: 这个论文旨在提高模式融合和修改的方式,帮助时尚设计师更加方便地生成和修改他们的设计。
  • methods: 提出了DiffCloth,一种基于扩散模型的跨模态服装合成与编辑流水线,通过结构化对齐跨模态语义来增强扩散模型的组合能力。
  • results: experiments on CM-Fashion benchmark demonstrate that DiffCloth can both yield state-of-the-art garment synthesis results and support flexible manipulation with region consistency.
    Abstract Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces.Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate the part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (AP) and the visual garment parts which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention to preserve the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows for manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
    摘要 跨模态服装合成与编辑将显著改进时尚设计师生成服装和通过灵活语言接口修改设计的方式。目前的方法遵循通用的文本到图像范式,仅通过简单的跨模态注意力模块挖掘跨模态关系,忽略了时尚设计领域中视觉与文本表示之间的结构对应。在这项工作中,我们提出了DiffCloth,一种基于扩散的跨模态服装合成与编辑流水线,通过结构化对齐跨模态语义,赋予扩散模型在时尚领域灵活的组合能力。具体来说,我们将部件级跨模态对齐表述为语言属性短语(AP)与视觉服装部件之间的二分匹配问题,二者分别通过成分句法分析和语义分割获得。为缓解属性混淆问题,我们进一步提出语义捆绑的跨注意力,以保持每个AP中属性形容词与部件名词注意力图之间的空间结构相似性。此外,DiffCloth支持通过简单替换文本提示中的AP来编辑生成结果:与编辑无关的区域由AP捆绑注意力图得到的混合掩码识别并保持不变。在CM-Fashion基准上的大量实验表明,DiffCloth既能利用内在结构信息取得最先进的服装合成结果,又能支持具有区域一致性的灵活编辑。

Masked Cross-image Encoding for Few-shot Segmentation

  • paper_url: http://arxiv.org/abs/2308.11201
  • repo_url: None
  • paper_authors: Wenbo Xu, Huaxi Huang, Ming Cheng, Litao Yu, Qiang Wu, Jian Zhang
  • for: 这篇论文研究小样本分割(FSS):仅利用少量标注的支持图像,推断未见类别的像素级标签。
  • methods: 该方法使用Masked Cross-Image Encoding(MCE)来捕捉对象细节的共同视觉特征,以及图像之间的相互依赖关系。
  • results: 实验表明,该方法在PASCAL-$5^i$和COCO-$20^i$中表现出色,可以快速学习新类别,并且对于描述对象细节的任务有进一步的改进。
    Abstract Few-shot segmentation (FSS) is a dense prediction task that aims to infer the pixel-wise labels of unseen classes using only a limited number of annotated images. The key challenge in FSS is to classify the labels of query pixels using class prototypes learned from the few labeled support exemplars. Prior approaches to FSS have typically focused on learning class-wise descriptors independently from support images, thereby ignoring the rich contextual information and mutual dependencies among support-query features. To address this limitation, we propose a joint learning method termed Masked Cross-Image Encoding (MCE), which is designed to capture common visual properties that describe object details and to learn bidirectional inter-image dependencies that enhance feature interaction. MCE is more than a visual representation enrichment module; it also considers cross-image mutual dependencies and implicit guidance. Experiments on FSS benchmarks PASCAL-$5^i$ and COCO-$20^i$ demonstrate the advanced meta-learning ability of the proposed method.
    摘要 几个例图分类(FSS)是一种密集预测任务,旨在只使用有限数量的标注图像来预测未经看过的类别。关键挑战在FSS中是将查询像素的标签分类用支持图像中学习的类 prototype。现有的FSS方法通常是独立地从支持图像中学习类Descriptor,从而忽略了支持图像和查询图像之间的丰富Contextual information和相互依赖关系。为了解决这一限制,我们提出了一种联合学习方法,称为Masked Cross-Image Encoding(MCE),旨在捕捉对象细节中的共同视觉特性,以及在支持图像和查询图像之间的双向依赖关系。MCE不仅是一种视觉表示增强模块,还考虑了图像之间的相互依赖关系和隐藏导航。在PASCAL-$5^i$和COCO-$20^i$的FSS标准测试集上,我们的提议方法表现出了更高级的元学习能力。

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views

  • paper_url: http://arxiv.org/abs/2308.11198
  • repo_url: None
  • paper_authors: Wentian Qu, Zhaopeng Cui, Yinda Zhang, Chenyu Meng, Cuixia Ma, Xiaoming Deng, Hongan Wang
  • for: 这个论文主要是关于手对象交互的理解和生成三维手对象交互的方法。
  • methods: 该论文提出了一种基于神经网络的渲染和姿态估计系统,用于从稀疏视图中理解手对象交互。该系统还可以实现3D手对象交互编辑。
  • results: 实验表明,该方法性能优于现有最先进方法。
    Abstract Hand-object interaction understanding and the barely addressed novel view synthesis are highly desired in the immersive communication, whereas it is challenging due to the high deformation of hand and heavy occlusions between hand and object. In this paper, we propose a neural rendering and pose estimation system for hand-object interaction from sparse views, which can also enable 3D hand-object interaction editing. We share the inspiration from recent scene understanding work that shows a scene specific model built beforehand can significantly improve and unblock vision tasks especially when inputs are sparse, and extend it to the dynamic hand-object interaction scenario and propose to solve the problem in two stages. We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation at the offline stage. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction with the pre-built hand and object models as well as interaction priors, which thereby overcomes penetration and separation issues between hand and object and also enables novel view synthesis. In order to get stable contact during the hand-object interaction process in a sequence, we propose a stable contact loss to make the contact region to be consistent. Experiments demonstrate that our method outperforms the state-of-the-art methods. Code and dataset are available in project webpage https://iscas3dv.github.io/HO-NeRF.
    摘要 理解手-物交互以及鲜有研究的新视角合成在沉浸式通信中有着强烈需求,但由于手部的高度形变以及手与物体之间的严重遮挡,这一问题极具挑战性。本文提出了一种基于稀疏视角的手-物交互神经渲染与姿态估计系统,并可支持三维手-物交互编辑。我们受近期场景理解工作的启发:预先构建的场景特定模型能够在输入稀疏时显著改善视觉任务。我们将这一思想扩展到动态手-物交互场景,并分两阶段求解:离线阶段,利用神经表示分别学习手和物体的形状与外观先验;在线阶段,设计基于渲染的联合模型拟合框架,借助预先构建的手、物模型及交互先验理解动态手-物交互,从而克服手与物体之间的穿插与分离问题,并实现新视角合成。为了在交互序列中获得稳定接触,我们提出了稳定接触损失,使接触区域保持一致。实验表明,我们的方法优于最先进方法。代码与数据集见项目网页 https://iscas3dv.github.io/HO-NeRF 。

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.11186
  • repo_url: None
  • paper_authors: Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng
  • for: 这篇论文提出一种融入外部知识的提示调优方法,以提高视觉语言模型在图像分类中的泛化能力。
  • methods: 该方法使用两种互补的知识感知提示:离散提示从物体类别描述中提取关键信息,可学习的连续提示捕捉整体上下文;此外还设计了一个适配头来聚合显著的注意力视觉线索。
  • results: 对11个广泛使用的基准数据集的大量实验表明,KAPT方法在小样本图像分类、尤其是向新类别泛化方面有效;与最先进方法CoCoOp相比,在新类别上取得3.22%的绝对增益,调和平均值提升2.57%。
    Abstract Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning. Recently, learnable prompts achieve state-of-the-art performance, which however are prone to overfit to seen classes, failing to generalize to unseen classes. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts the key information from descriptions of an object category, and the learned continuous prompt captures overall contexts. We further design an adaptation head for the visual encoder to aggregate salient attentive visual cues, which establishes discriminative and task-aware visual representations. We conduct extensive experiments on 11 widely-used benchmark datasets and the results verify the effectiveness in few-shot image classification, especially in generalizing to unseen categories. Compared with the state-of-the-art CoCoOp method, KAPT exhibits favorable performance and achieves an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean.
    摘要 预训练的视觉语言模型(如CLIP)配合手动设计的提示符已展现出很强的迁移学习能力。最近,可学习提示符取得了最先进的性能,但它们容易过拟合已见类别,难以泛化到未见类别。在本文中,我们提出了面向视觉语言模型的知识感知提示调优(KAPT)框架。我们的方法启发自人类智能:人在识别新的物体类别时通常会借助外部知识。具体来说,我们为文本编码器设计了两种互补的知识感知提示,以利用类别相关外部知识的独特特性:离散提示从物体类别的描述中提取关键信息,而学习得到的连续提示捕捉整体上下文。我们进一步为视觉编码器设计了适配头,以聚合显著的注意力视觉线索,从而建立具有判别性和任务感知的视觉表示。我们在11个广泛使用的基准数据集上进行了大量实验,结果验证了该方法在小样本图像分类、尤其是向未见类别泛化方面的有效性。与最先进的CoCoOp方法相比,KAPT表现更优,在新类别上取得3.22%的绝对增益,调和平均值提升2.57%。

MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation

  • paper_url: http://arxiv.org/abs/2308.11185
  • repo_url: None
  • paper_authors: Najmeh Sadoughi, Xinyu Li, Avijit Vajpayee, David Fan, Bing Shuai, Hector Santos-Villalobos, Vimal Bhat, Rohith MV
  • for: 本研究旨在提升长视频(>60分钟)的场景分割与叙事幕(act)分割效果。
  • methods: 提出多模态对齐聚合与蒸馏方法(MEGA):通过对齐位置编码对长度不一、模态各异的输入进行粗略对齐,并引入基于时间对齐的增强瓶颈融合层,在降低计算量的同时保持时间同步;另外利用新颖的对比损失在模态间同步并迁移标签。
  • results: 实验结果表明,MEGA在MovieNet数据集的场景分割任务上将平均准确率提升+1.19%,在TRIPOD数据集的幕分割任务上将总体一致率提升+5.51%。
    Abstract Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19%) and on TRIPOD dataset for act segmentation (with a Total Agreement improvement of +5.51%)
    摘要 以往的研究已经探讨了将电影视频分割为场景和叙事幕的任务,然而这些研究忽略了对长视频(>60分钟)进行高效处理所必需的多模态对齐与融合。在本文中,我们提出了用于电影长视频分割的多模态对齐聚合与蒸馏方法(MEGA)。MEGA利用多种媒体模态应对这一挑战:该方法通过对齐位置编码对长度不一、模态各异的输入进行粗略对齐;为在降低计算量的同时保持时间同步,我们进一步引入了基于时间对齐的增强瓶颈融合层;此外,MEGA采用一种新颖的对比损失在模态间同步并迁移标签,从而能够利用带标注的剧情梗概句子对视频镜头进行幕分割。实验结果表明,MEGA在MovieNet数据集的场景分割任务上将平均准确率提高+1.19%,在TRIPOD数据集的幕分割任务上将总体一致率提高+5.51%。

ReFit: Recurrent Fitting Network for 3D Human Recovery

  • paper_url: http://arxiv.org/abs/2308.11184
  • repo_url: https://github.com/yufu-wang/ReFit
  • paper_authors: Yufu Wang, Kostas Daniilidis
  • for: 本研究旨在提出一种基于神经网络的单影像参数3D人体重建方法,以解决人体重建问题。
  • methods: 本方法使用反馈更新循环,通过在每个迭代步骤中将人体模型中的关键点投影到特征图中,并使用回归型更新器来调整模型以更好地适应图像。
  • results: 本研究表明,使用反馈更新循环可以更快速地训练神经网络模型,同时提高了标准benchmark测试数据集上的性能。此外,本方法还可以应用于多视图适应和单视图形状适应等其他优化设定。
    Abstract We present Recurrent Fitting (ReFit), a neural network architecture for single-image, parametric 3D human reconstruction. ReFit learns a feedback-update loop that mirrors the strategy of solving an inverse problem through optimization. At each iterative step, it reprojects keypoints from the human model to feature maps to query feedback, and uses a recurrent-based updater to adjust the model to fit the image better. Because ReFit encodes strong knowledge of the inverse problem, it is faster to train than previous regression models. At the same time, ReFit improves state-of-the-art performance on standard benchmarks. Moreover, ReFit applies to other optimization settings, such as multi-view fitting and single-view shape fitting. Project website: https://yufu-wang.github.io/refit_humans/
    摘要 我们提出了递归拟合(ReFit),一种用于单图像参数化三维人体重建的神经网络架构。ReFit学习一个反馈-更新循环,模拟通过优化求解逆问题的策略:在每个迭代步骤中,它将人体模型的关键点重投影到特征图以查询反馈,并使用基于循环网络的更新器调整模型,使其更好地拟合图像。由于ReFit编码了关于该逆问题的强先验知识,其训练速度快于以往的回归模型;同时,ReFit在标准基准上提升了最先进性能。此外,ReFit还适用于其他优化设定,例如多视角拟合和单视角形状拟合。项目网站:https://yufu-wang.github.io/refit_humans/
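
The feedback-update loop described in the abstract can be pictured as a recurrent refinement of pose parameters. The GRU updater, dimensions, and the dummy feedback function below are illustrative assumptions; only the loop structure follows the paper's description.

```python
# Hedged sketch, assuming PyTorch. `feedback_fn` stands in for "project the
# body model's keypoints into the feature map and sample features there".
import torch
import torch.nn as nn

class RecurrentUpdater(nn.Module):
    """GRU cell mapping keypoint feedback to a parameter update."""
    def __init__(self, feedback_dim: int, param_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(feedback_dim, hidden_dim)
        self.to_delta = nn.Linear(hidden_dim, param_dim)

    def forward(self, feedback, hidden):
        hidden = self.gru(feedback, hidden)
        return self.to_delta(hidden), hidden

def refit_loop(params, feedback_fn, updater, hidden, n_iters=5):
    for _ in range(n_iters):
        delta, hidden = updater(feedback_fn(params), hidden)
        params = params + delta   # nudge the model to better fit the image
    return params

updater = RecurrentUpdater(feedback_dim=64, param_dim=24)
params = refit_loop(torch.zeros(1, 24), lambda p: torch.randn(1, 64),
                    updater, torch.zeros(1, 128))
```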

A three in one bottom-up framework for simultaneous semantic segmentation, instance segmentation and classification of multi-organ nuclei in digital cancer histology

  • paper_url: http://arxiv.org/abs/2308.11179
  • repo_url: None
  • paper_authors: Ibtihaj Ahmad, Syed Muhammad Israr, Zain Ul Islam
  • for: 本论文研究数字组织病理学中细胞核的同时分割与分类,这在计算机辅助癌症诊断中起着关键作用,但仍是一个具有挑战性的问题。
  • methods: 该方法采用多头解码器结构,每个解码头配有独立的加权损失函数,分别生成语义分割、边缘提议和分类图;再对这三路输出进行后处理,得到最终的细胞核分割与分类结果。
  • results: 该方法取得了高性能的细胞核分割与分类:语义分割Dice得分0.841、实例分割bPQ得分0.713、细胞核分类mPQ得分0.633,并在19种组织类型上展现出良好的泛化能力。
    Abstract Simultaneous segmentation and classification of nuclei in digital histology play an essential role in computer-assisted cancer diagnosis; however, it remains challenging. The highest achieved binary and multi-class Panoptic Quality (PQ) remains as low as 0.68 bPQ and 0.49 mPQ, respectively. It is due to the higher staining variability, variability across the tissue, rough clinical conditions, overlapping nuclei, and nuclear class imbalance. The generic deep-learning methods usually rely on end-to-end models, which fail to address these problems associated explicitly with digital histology. In our previous work, DAN-NucNet, we resolved these issues for semantic segmentation with an end-to-end model. This work extends our previous model to simultaneous instance segmentation and classification. We introduce additional decoder heads with independent weighted losses, which produce semantic segmentation, edge proposals, and classification maps. We use the outputs from the three-head model to apply post-processing to produce the final segmentation and classification. Our multi-stage approach utilizes edge proposals and semantic segmentations compared to direct segmentation and classification strategies followed by most state-of-the-art methods. Due to this, we demonstrate a significant performance improvement in producing high-quality instance segmentation and nuclei classification. We have achieved a 0.841 Dice score for semantic segmentation, 0.713 bPQ scores for instance segmentation, and 0.633 mPQ for nuclei classification. Our proposed framework is generalized across 19 types of tissues. Furthermore, the framework is less complex compared to the state-of-the-art.
    摘要 数字组织病理学中细胞核的同时分割与分类在计算机辅助癌症诊断中起着至关重要的作用,但仍然是一个挑战:目前最高的二分类与多分类全景质量(PQ)仅分别为0.68 bPQ和0.49 mPQ。其原因在于较高的染色变异、组织间差异、恶劣的临床条件、细胞核重叠以及核类别不平衡。通用的深度学习方法通常依赖端到端模型,无法显式解决数字组织病理学中的这些问题。在我们之前的工作DAN-NucNet中,我们用端到端模型解决了语义分割中的这些问题;本文将其扩展到同时的实例分割与分类。我们引入带有独立加权损失的额外解码头,分别生成语义分割、边缘提议和分类图,并利用三头模型的输出进行后处理,得到最终的分割与分类结果。与多数最先进方法采用的直接分割与分类策略相比,我们的多阶段方法利用了边缘提议和语义分割,因而在高质量实例分割与细胞核分类上取得了显著的性能提升:语义分割Dice得分0.841、实例分割bPQ得分0.713、细胞核分类mPQ得分0.633。所提框架可泛化到19种组织类型,且比现有最先进方法更为简洁。

Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration

  • paper_url: http://arxiv.org/abs/2308.11165
  • repo_url: None
  • paper_authors: Di Wang, Jinyuan Liu, Long Ma, Risheng Liu, Xin Fan
  • for: 解决多模态图像融合中图像未对齐带来的挑战
  • methods: 提出跨模态多尺度渐进密集配准(C-MPDR)方案,仅用单阶段优化完成由粗到细的配准
  • results: 提升了未对齐多模态图像的融合性能
    Abstract Misalignments between multi-modality images pose challenges in image fusion, manifesting as structural distortions and edge ghosts. Existing efforts commonly resort to registering first and fusing later, typically employing two cascaded stages for registration,i.e., coarse registration and fine registration. Both stages directly estimate the respective target deformation fields. In this paper, we argue that the separated two-stage registration is not compact, and the direct estimation of the target deformation fields is not accurate enough. To address these challenges, we propose a Cross-modality Multi-scale Progressive Dense Registration (C-MPDR) scheme, which accomplishes the coarse-to-fine registration exclusively using a one-stage optimization, thus improving the fusion performance of misaligned multi-modality images. Specifically, two pivotal components are involved, a dense Deformation Field Fusion (DFF) module and a Progressive Feature Fine (PFF) module. The DFF aggregates the predicted multi-scale deformation sub-fields at the current scale, while the PFF progressively refines the remaining misaligned features. Both work together to accurately estimate the final deformation fields. In addition, we develop a Transformer-Conv-based Fusion (TCF) subnetwork that considers local and long-range feature dependencies, allowing us to capture more informative features from the registered infrared and visible images for the generation of high-quality fused images. Extensive experimental analysis demonstrates the superiority of the proposed method in the fusion of misaligned cross-modality images.
    摘要 多模态图像之间的配准误差给图像融合带来挑战,表现为结构扭曲和边缘鬼影。现有方法通常采用先配准、后融合的策略,一般使用级联的两阶段配准(即粗配准与精配准),两阶段均直接估计各自的目标形变场。本文认为,分离的两阶段配准不够紧凑,而直接估计目标形变场也不够精确。为解决这些挑战,我们提出了跨模态多尺度渐进密集配准(C-MPDR)方案,仅通过单阶段优化即可完成由粗到细的配准,从而提升未对齐多模态图像的融合性能。具体而言,该方案包含两个关键组件:密集形变场融合(DFF)模块和渐进特征精化(PFF)模块。DFF在当前尺度聚合预测得到的多尺度形变子场,PFF则渐进地精化剩余的未对齐特征,二者协同工作以精确估计最终形变场。此外,我们还设计了基于Transformer-Conv的融合(TCF)子网络,兼顾局部与长程特征依赖,从配准后的红外与可见光图像中捕获更具信息量的特征,以生成高质量融合图像。大量实验分析表明,所提方法在未对齐跨模态图像融合中具有显著优势。
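
One way to picture the dense deformation-field fusion (DFF) step is as upsampling and aggregating multi-scale flow sub-fields into a full-resolution deformation estimate. Summation as the aggregation rule and the magnitude rescaling below are illustrative assumptions, not the paper's exact module.

```python
# Hedged sketch, assuming PyTorch: fuse multi-scale deformation sub-fields.
import torch
import torch.nn.functional as F

def fuse_deformation(subfields, out_hw):
    # subfields: list of (B, 2, h_i, w_i) displacement fields at various scales
    total = torch.zeros(subfields[0].shape[0], 2, *out_hw)
    for f in subfields:
        scale = out_hw[1] / f.shape[-1]   # rescale displacement magnitudes too
        total = total + scale * F.interpolate(f, size=out_hw, mode="bilinear",
                                              align_corners=False)
    return total

fused = fuse_deformation([torch.randn(1, 2, 16, 16), torch.randn(1, 2, 32, 32)],
                         (64, 64))
print(fused.shape)  # torch.Size([1, 2, 64, 64])
```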

Decoupled Contrastive Multi-view Clustering with High-order Random Walks

  • paper_url: http://arxiv.org/abs/2308.11164
  • repo_url: None
  • paper_authors: Yiding Lu, Yijie Lin, Mouxing Yang, Dezhong Peng, Peng Hu, Xi Peng
  • for: 提升多视图聚类(MvC)的鲁棒性,缓解假负例与假正例问题。
  • methods: 提出一种新的多视图聚类方法(DIVIDE),利用随机游走以全局方式渐进识别数据对,从而正确识别邻域内的负例和邻域外的正例;同时采用新的MvC架构,在不同嵌入空间进行视图间与视图内对比学习。
  • results: 在四个基准数据集上的大量实验表明,DIVIDE在完整与缺失视图的MvC设定下均有更优的性能,并在不同视图缺失率下保持稳健。
    Abstract In recent, some robust contrastive multi-view clustering (MvC) methods have been proposed, which construct data pairs from neighborhoods to alleviate the false negative issue, i.e., some intra-cluster samples are wrongly treated as negative pairs. Although promising performance has been achieved by these methods, the false negative issue is still far from addressed and the false positive issue emerges because all in- and out-of-neighborhood samples are simply treated as positive and negative, respectively. To address the issues, we propose a novel robust method, dubbed decoupled contrastive multi-view clustering with high-order random walks (DIVIDE). In brief, DIVIDE leverages random walks to progressively identify data pairs in a global instead of local manner. As a result, DIVIDE could identify in-neighborhood negatives and out-of-neighborhood positives. Moreover, DIVIDE embraces a novel MvC architecture to perform inter- and intra-view contrastive learning in different embedding spaces, thus boosting clustering performance and embracing the robustness against missing views. To verify the efficacy of DIVIDE, we carry out extensive experiments on four benchmark datasets comparing with nine state-of-the-art MvC methods in both complete and incomplete MvC settings.
    摘要 近年来,一些鲁棒的对比多视图聚类(MvC)方法被提出,它们从邻域中构建数据对,以缓解部分簇内样本被错误视为负样本对的假负例问题。尽管这些方法取得了可观的性能,假负例问题仍远未解决,并且由于简单地将邻域内样本一律视为正例、邻域外样本一律视为负例,又引入了假正例问题。为解决这些问题,我们提出了一种新颖的鲁棒方法,即基于高阶随机游走的解耦对比多视图聚类(DIVIDE)。简言之,DIVIDE利用随机游走以全局而非局部的方式渐进地识别数据对,从而能够识别邻域内的负例和邻域外的正例。此外,DIVIDE采用一种新的MvC架构,在不同的嵌入空间中分别进行视图间与视图内的对比学习,从而提升聚类性能,并对缺失视图具有鲁棒性。为验证DIVIDE的有效性,我们在四个基准数据集上,与九种最先进的MvC方法在完整与不完整两种MvC设定下进行了大量对比实验。

A Preliminary Investigation into Search and Matching for Tumour Discrimination in WHO Breast Taxonomy Using Deep Networks

  • paper_url: http://arxiv.org/abs/2308.11162
  • repo_url: None
  • paper_authors: Abubakr Shafique, Ricardo Gonzalez, Liron Pantanowitz, Puay Hoon Tan, Alberto Machado, Ian A Cree, Hamid R. Tizhoosh
  • for: 这项研究旨在构建基于深度学习的可检索数字图谱,帮助病理学家在已诊断的档案病例中对罕见乳腺病变进行检索和匹配。
  • methods: 研究使用在TCGA库中数百万张诊断组织病理图像上预训练的最先进深度学习模型,对WHO乳腺分类(第5版)中的35种肿瘤类型进行索引、深度特征提取与可视化。
  • results: 该研究发现,使用深度学习模型对WHO乳腺分类系统数据进行索引和分析,可以达到88%的准确率,并且使用top-n肿瘤类型进行验证可以达到91%的准确率。这些结果表明,使用索引数字Archive可以investigate complex relationships among common and rare breast lesions。
    Abstract Breast cancer is one of the most common cancers affecting women worldwide. They include a group of malignant neoplasms with a variety of biological, clinical, and histopathological characteristics. There are more than 35 different histological forms of breast lesions that can be classified and diagnosed histologically according to cell morphology, growth, and architecture patterns. Recently, deep learning, in the field of artificial intelligence, has drawn a lot of attention for the computerized representation of medical images. Searchable digital atlases can provide pathologists with patch matching tools allowing them to search among evidently diagnosed and treated archival cases, a technology that may be regarded as computational second opinion. In this study, we indexed and analyzed the WHO breast taxonomy (Classification of Tumours 5th Ed.) spanning 35 tumour types. We visualized all tumour types using deep features extracted from a state-of-the-art deep learning model, pre-trained on millions of diagnostic histopathology images from the TCGA repository. Furthermore, we test the concept of a digital "atlas" as a reference for search and matching with rare test cases. The patch similarity search within the WHO breast taxonomy data reached over 88% accuracy when validating through "majority vote" and more than 91% accuracy when validating using top-n tumour types. These results show for the first time that complex relationships among common and rare breast lesions can be investigated using an indexed digital archive.
    摘要 乳腺癌是全球女性最常见的癌症之一,包括一组在生物学、临床和组织病理学特征上各异的恶性肿瘤。乳腺病变有超过35种不同的组织学形态,可根据细胞形态、生长方式和组织结构模式进行组织学分类与诊断。近年来,人工智能领域的深度学习在医学图像的计算机化表示方面受到了广泛关注。可检索的数字图谱可为病理学家提供图块匹配工具,使其能够在已明确诊断和治疗的档案病例中检索,这一技术可视为计算机化的"第二诊断意见"。在本研究中,我们对WHO乳腺肿瘤分类(《肿瘤分类》第5版)中的35种肿瘤类型进行了索引和分析。我们使用一个在TCGA库中数百万张诊断组织病理图像上预训练的最先进深度学习模型提取深度特征,并对所有肿瘤类型进行可视化。此外,我们测试了将数字"图谱"作为参考、用于罕见测试病例检索与匹配的概念。在WHO乳腺分类数据中的图块相似性检索,经"多数投票"验证的准确率超过88%,经top-n肿瘤类型验证的准确率超过91%。这些结果首次表明,利用带索引的数字档案库可以研究常见与罕见乳腺病变之间的复杂关系。

SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection

  • paper_url: http://arxiv.org/abs/2308.11159
  • repo_url: None
  • paper_authors: Dalong Zheng, Zebin Wu, Jia Liu, Zhihui Wei
  • for: 本研究提出一种端到端复合密集网络(SwinV2DNet),综合利用Transformer和CNN的优点,克服现有网络在特征学习中的缺陷。
  • methods: 该网络由Swin V2与VGG16两部分组成:通过密集连接的Swin V2主干捕捉变化关系特征,通过CNN分支提供变化前后的低级特征;并提出混合特征金字塔(MFP),提供层间交互信息与层内多尺度信息,实现完整的特征学习。
  • results: 在四个常用的公开遥感数据集上,该网络取得了最先进的变化检测分数和细粒度变化图;此外,自监督策略缓解了CNN分支的训练问题。
    Abstract Among the current mainstream change detection networks, transformer is deficient in the ability to capture accurate low-level details, while convolutional neural network (CNN) is wanting in the capacity to understand global information and establish remote spatial relationships. Meanwhile, both of the widely used early fusion and late fusion frameworks are not able to well learn complete change features. Therefore, based on swin transformer V2 (Swin V2) and VGG16, we propose an end-to-end compounded dense network SwinV2DNet to inherit the advantages of both transformer and CNN and overcome the shortcomings of existing networks in feature learning. Firstly, it captures the change relationship features through the densely connected Swin V2 backbone, and provides the low-level pre-changed and post-changed features through a CNN branch. Based on these three change features, we accomplish accurate change detection results. Secondly, combined with transformer and CNN, we propose mixed feature pyramid (MFP) which provides inter-layer interaction information and intra-layer multi-scale information for complete feature learning. MFP is a plug and play module which is experimentally proven to be also effective in other change detection networks. Further more, we impose a self-supervision strategy to guide a new CNN branch, which solves the untrainable problem of the CNN branch and provides the semantic change information for the features of encoder. The state-of-the-art (SOTA) change detection scores and fine-grained change maps were obtained compared with other advanced methods on four commonly used public remote sensing datasets. The code is available at https://github.com/DalongZ/SwinV2DNet.
    摘要 当前主流的变化检测网络中,Transformer缺乏捕捉准确低级细节的能力,而卷积神经网络(CNN)缺乏理解全局信息和建立远程空间关系的能力。同时,广泛使用的早期融合和晚期融合框架都难以充分学习完整的变化特征。因此,基于Swin Transformer V2(Swin V2)和VGG16,我们提出了一种端到端复合密集网络SwinV2DNet,继承Transformer与CNN的优点,并克服现有网络在特征学习方面的缺陷。首先,它通过密集连接的Swin V2主干捕捉变化关系特征,并通过CNN分支提供变化前后的低级特征;基于这三类变化特征,我们实现了准确的变化检测。其次,结合Transformer与CNN,我们提出了混合特征金字塔(MFP),为完整的特征学习提供层间交互信息和层内多尺度信息。MFP是一个即插即用模块,实验证明其在其他变化检测网络中同样有效。此外,我们采用自监督策略引导新的CNN分支,解决了CNN分支难以训练的问题,并为编码器特征提供语义变化信息。与其他先进方法相比,我们在四个常用的公开遥感数据集上取得了最先进的变化检测分数和细粒度变化图。代码可在 https://github.com/DalongZ/SwinV2DNet 获取。

Domain Generalization via Rationale Invariance

  • paper_url: http://arxiv.org/abs/2308.11158
  • repo_url: https://github.com/liangchen527/ridg
  • paper_authors: Liang Chen, Yong Zhang, Yibing Song, Anton van den Hengel, Lingqiao Liu
  • for: 提升领域泛化的稳健性,使模型在未见环境中仍能保持良好的结果。
  • methods: 聚焦最终分类器层的决策过程来应对领域泛化挑战。具体来说,我们提议将每个样本元素级别的贡献视为决策的理由,并将每个样本的理由表示为一个矩阵。为确保模型具有良好的泛化能力,我们要求同一类别样本的理由矩阵彼此相似,表明模型依赖领域不变的线索进行决策。
  • results: 我们的实验表明,通过引入 rationale invariance loss 来实现这一思路,尽管实现简单,仍可在多个数据集上取得有竞争力的结果。代码可以在 \url{https://github.com/liangchen527/RIDG} 上找到。
    Abstract This paper offers a new perspective to ease the challenge of domain generalization, which involves maintaining robust results even in unseen environments. Our design focuses on the decision-making process in the final classifier layer. Specifically, we propose treating the element-wise contributions to the final results as the rationale for making a decision and representing the rationale for each sample as a matrix. For a well-generalized model, we suggest the rationale matrices for samples belonging to the same category should be similar, indicating the model relies on domain-invariant clues to make decisions, thereby ensuring robust results. To implement this idea, we introduce a rationale invariance loss as a simple regularization technique, requiring only a few lines of code. Our experiments demonstrate that the proposed approach achieves competitive results across various datasets, despite its simplicity. Code is available at \url{https://github.com/liangchen527/RIDG}.
    摘要 本文为缓解领域泛化挑战(即在未见环境中仍保持稳健的结果)提供了一个新视角。我们的设计聚焦于最终分类器层的决策过程:我们提出将各元素对最终结果的贡献视为决策的理由,并将每个样本的理由表示为一个矩阵。对于泛化良好的模型,我们认为同一类别样本的理由矩阵应当彼此相似,表明模型依赖领域不变的线索进行决策,从而保证结果的稳健性。为实现这一思想,我们引入理由不变损失作为一种简单的正则化手段,仅需几行代码即可实现。实验表明,尽管方法简单,该方法仍在多个数据集上取得了有竞争力的结果。代码见 \url{https://github.com/liangchen527/RIDG}。
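
For a linear final classifier, the "rationale" described above reduces to the element-wise product of a sample's features with the weight row of its class, and the regularizer pulls each rationale toward a per-class mean. The sketch below follows that reading; the running-mean update and momentum value are assumptions, not necessarily the paper's exact scheme.

```python
# Hedged sketch, assuming PyTorch: rationale matrices and an invariance loss.
import torch
import torch.nn.functional as F

def rationale_loss(features, weight, labels, class_means, momentum=0.9):
    # features: (B, D); weight: (C, D); per-sample element-wise contributions:
    rationale = features * weight[labels]                 # (B, D)
    with torch.no_grad():                                 # update running means
        for c in labels.unique():
            m = rationale[labels == c].mean(dim=0)
            class_means[c] = momentum * class_means[c] + (1 - momentum) * m
    return F.mse_loss(rationale, class_means[labels])

feats = torch.randn(8, 16)
W = torch.randn(5, 16, requires_grad=True)   # final-layer weights
labels = torch.randint(0, 5, (8,))
means = torch.zeros(5, 16)
loss = rationale_loss(feats, W, labels, means)
```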

TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes

  • paper_url: http://arxiv.org/abs/2308.11157
  • repo_url: https://github.com/holmescao/TOPICTrack
  • paper_authors: Xiaoyan Cao, Yiyao Zheng, Yao Yao, Huapeng Qin, Xiaoyu Cao, Shihui Guo
  • for: 提出一个新的多目标跟踪(MOT)数据集BEE23,以解决现有数据集忽视复杂运动模式的问题。
  • methods: 提出并行关联范式,设计了同时利用运动与外观特征的两轮并行匹配机制(TOPIC),以及基于注意力的外观重建模块(AARM)来提升跟踪效果。
  • results: 该方法在四个公开数据集和新提出的BEE23数据集上取得领先性能;与单特征关联范式相比,将假阴性减少12%到51%。
    Abstract Video data and algorithms have been driving advances in multi-object tracking (MOT). While existing MOT datasets focus on occlusion and appearance similarity, complex motion patterns are widespread yet overlooked. To address this issue, we introduce a new dataset called BEE23 to highlight complex motions. Identity association algorithms have long been the focus of MOT research. Existing trackers can be categorized into two association paradigms: single-feature paradigm (based on either motion or appearance feature) and serial paradigm (one feature serves as secondary while the other is primary). However, these paradigms are incapable of fully utilizing different features. In this paper, we propose a parallel paradigm and present the Two rOund Parallel matchIng meChanism (TOPIC) to implement it. The TOPIC leverages both motion and appearance features and can adaptively select the preferable one as the assignment metric based on motion level. Moreover, we provide an Attention-based Appearance Reconstruct Module (AARM) to reconstruct appearance feature embeddings, thus enhancing the representation of appearance features. Comprehensive experiments show that our approach achieves state-of-the-art performance on four public datasets and BEE23. Notably, our proposed parallel paradigm surpasses the performance of existing association paradigms by a large margin, e.g., reducing false negatives by 12% to 51% compared to the single-feature association paradigm. The introduced dataset and association paradigm in this work offers a fresh perspective for advancing the MOT field. The source code and dataset are available at https://github.com/holmescao/TOPICTrack.
    摘要 视频数据和算法推动了多目标跟踪(MOT)领域的进步。现有的MOT数据集主要关注遮挡和外观相似性,而复杂的运动模式虽然普遍存在却被忽视。为了解决这个问题,我们提出了一个新的数据集BEE23,以突出复杂运动。身份关联算法一直是MOT研究的核心。现有的跟踪器可以分为两种关联范式:单特征范式(仅基于运动或外观特征)和串行范式(一个特征为主、另一个为辅)。然而,这些范式无法充分利用不同的特征。在本文中,我们提出了并行范式,并设计了两轮并行匹配机制(TOPIC)来实现它。TOPIC同时利用运动和外观特征,并能根据运动水平自适应地选择更可取的特征作为分配度量。此外,我们还提供了基于注意力的外观重建模块(AARM)来重建外观特征嵌入,从而增强外观特征的表示。全面的实验表明,我们的方法在四个公开数据集和BEE23上达到了最先进的性能。值得注意的是,我们提出的并行范式大幅超越了现有关联范式的性能,例如与单特征关联范式相比,将假阴性减少了12%到51%。本文提出的数据集和关联范式为推进MOT领域提供了新的视角。源代码和数据集可在 https://github.com/holmescao/TOPICTrack 获取。
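
The "adaptively select the preferable assignment metric based on motion level" idea can be pictured as choosing between an IoU-based and an appearance-based cost matrix. The threshold and the binary switch below are simplifying assumptions; the actual TOPIC mechanism runs two matching rounds in parallel rather than a hard switch.

```python
# Hedged sketch: pick the association cost by estimated motion level.
import numpy as np

def association_cost(iou, app_sim, motion_level, thresh=0.5):
    # iou, app_sim: (num_tracks, num_dets) matrices with entries in [0, 1]
    cost_motion = 1.0 - iou      # reliable when inter-frame motion is small
    cost_app = 1.0 - app_sim     # reliable when fast motion breaks IoU overlap
    return cost_app if motion_level > thresh else cost_motion

cost = association_cost(np.random.rand(3, 4), np.random.rand(3, 4),
                        motion_level=0.8)
```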

High Dynamic Range Imaging of Dynamic Scenes with Saturation Compensation but without Explicit Motion Compensation

  • paper_url: http://arxiv.org/abs/2308.11140
  • repo_url: https://github.com/haesoochung/hdri-saturation-compensation
  • paper_authors: Haesoo Chung, Nam Ik Cho
  • for: 提高高动态范围(HDR)图像的获得和修复,解决由相机传感器的限制导致的信息损失问题。
  • methods: 将运动对齐问题重新表述为简单的亮度调整问题,并采用由粗到细的融合策略与显式饱和补偿,利用自适应上下文注意力修复过曝区域。
  • results: 与最先进方法相比,在定性与定量评估上均取得更优结果。
    Abstract High dynamic range (HDR) imaging is a highly challenging task since a large amount of information is lost due to the limitations of camera sensors. For HDR imaging, some methods capture multiple low dynamic range (LDR) images with altering exposures to aggregate more information. However, these approaches introduce ghosting artifacts when significant inter-frame motions are present. Moreover, although multi-exposure images are given, we have little information in severely over-exposed areas. Most existing methods focus on motion compensation, i.e., alignment of multiple LDR shots to reduce the ghosting artifacts, but they still produce unsatisfying results. These methods also rather overlook the need to restore the saturated areas. In this paper, we generate well-aligned multi-exposure features by reformulating a motion alignment problem into a simple brightness adjustment problem. In addition, we propose a coarse-to-fine merging strategy with explicit saturation compensation. The saturated areas are reconstructed with similar well-exposed content using adaptive contextual attention. We demonstrate that our method outperforms the state-of-the-art methods regarding qualitative and quantitative evaluations.
    摘要 高动态范围(HDR)成像是一项极具挑战性的任务,因为相机传感器的限制会导致大量信息丢失。为实现HDR成像,一些方法以不同曝光拍摄多张低动态范围(LDR)图像来汇聚更多信息;然而,当存在显著的帧间运动时,这些方法会产生鬼影伪影。此外,即便给定多曝光图像,我们在严重过曝区域中掌握的信息也很少。多数现有方法专注于运动补偿,即对齐多张LDR图像以减少鬼影伪影,但其结果仍不尽如人意,而且往往忽视了恢复饱和区域的需求。在本文中,我们将运动对齐问题重新表述为简单的亮度调整问题,从而生成对齐良好的多曝光特征。此外,我们提出了带有显式饱和补偿的由粗到细融合策略:饱和区域利用自适应上下文注意力,以相似的良好曝光内容进行重建。实验表明,我们的方法在定性和定量评估上均优于最先进方法。

Efficient View Synthesis with Neural Radiance Distribution Field

  • paper_url: http://arxiv.org/abs/2308.11130
  • repo_url: https://github.com/yushuang-wu/NeRDF
  • paper_authors: Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu
  • for: 高品质视角合成
  • methods: 使用小型网络,以频率基函数对每条光线上的辐射分布建模,每个像素仅需一次网络前向传播即可计算像素值
  • results: 在速度、质量与网络规模之间取得了更好的权衡:与NeRF相比,在网络规模相近的情况下实现约254倍加速,质量仅有轻微下降
    Abstract Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only one single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient view synthesis in real-time. Specifically, we use a small network similar to NeRF while preserving the rendering speed with a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with frequency basis and predict frequency weights using the network. Pixel values are then computed via volume rendering on radiance distributions. Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a ~254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF.
    摘要 最近关于神经辐射场(NeRF)的研究在高质量视图合成方面取得了重要进展。然而,NeRF的一个主要限制是渲染效率低:渲染单个像素需要多次网络前向传播。现有改进NeRF的方法要么减少所需采样数,要么优化实现以加速网络前向传播;尽管如此,由于辐射场的内在表示方式,多次采样的问题依然存在。相比之下,神经光场(NeLF)每个像素只需一次网络前向传播,从而降低了NeRF的计算成本;但为了达到与NeRF接近的视觉质量,现有NeLF方法需要明显更大的网络容量,这在实践中限制了其渲染效率。在这项工作中,我们提出了一种新的表示,称为神经辐射分布场(NeRDF),旨在实现实时高效的视图合成。具体来说,我们使用与NeRF相近的小型网络,同时像NeLF一样保持每个像素一次网络前向传播的渲染速度。其关键在于用频率基函数对每条光线上的辐射分布建模,并由网络预测频率权重;随后通过对辐射分布进行体渲染来计算像素值。实验表明,相比现有方法,我们的方法在速度、质量和网络规模之间取得了更好的权衡:在网络规模相近的情况下,我们实现了相对NeRF约254倍的加速,性能仅有轻微下降。项目页面:yushuang-wu.github.io/NeRDF。
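
A toy version of the core NeRDF idea, where one network forwarding predicts frequency-basis weights and the density along the ray is reconstructed from the basis and composited with standard volume rendering, might look like the following. The cosine basis, sample count, and the grayscale output are assumptions for illustration only.

```python
# Hedged sketch, assuming PyTorch: render one ray from frequency weights.
import torch

def render_ray(freq_weights, n_samples=64, near=0.0, far=1.0):
    t = torch.linspace(near, far, n_samples)
    k = torch.arange(freq_weights.shape[0], dtype=torch.float32)
    basis = torch.cos(torch.pi * k[:, None] * t[None, :])   # (K, S) cosine basis
    sigma = torch.relu(freq_weights @ basis)                # density along the ray
    alpha = 1.0 - torch.exp(-sigma * (far - near) / n_samples)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], 0)
    return (trans * alpha).sum()   # accumulated opacity (stand-in for color)

print(render_ray(torch.rand(8)))   # one forwarding's weights -> one pixel value
```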

Hey That’s Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs

  • paper_url: http://arxiv.org/abs/2308.11123
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: 保护内容在线分享
  • methods: 使用隐形水印技术训练生成模型,并测试其能够在生成图像中检测水印。
  • results: 通过统计测试,确定了模型是否训练使用水印数据,以及水印数据中的特征与生成图像之间的相关性。
    Abstract Generative models have seen an explosion in popularity with the release of huge generative Diffusion models like Midjourney and Stable Diffusion to the public. Because of this new ease of access, questions surrounding the automated collection of data and issues regarding content ownership have started to build. In this paper we present new work which aims to provide ways of protecting content when shared to the public. We show that a generative Diffusion model trained on data that has been imperceptibly watermarked will generate new images with these watermarks present. We further show that if a given watermark is correlated with a certain feature of the training data, the generated images will also have this correlation. Using statistical tests we show that we are able to determine whether a model has been trained on marked data, and what data was marked. As a result our system offers a solution to protect intellectual property when sharing content online.
    摘要 随着Midjourney和Stable Diffusion等大型生成扩散模型向公众发布,生成模型的热度急剧上升。由于这种新的易用性,围绕数据自动收集和内容所有权的问题开始浮现。本文提出了在向公众分享内容时加以保护的新方法。我们证明,在带有不可察觉水印的数据上训练的生成扩散模型,其生成的新图像中同样会带有这些水印;并且如果某一水印与训练数据的某一特征相关,生成图像也会保留这种相关性。通过统计检验,我们能够判断一个模型是否在带水印数据上训练过,以及哪些数据被打上了水印。因此,我们的系统为在线分享内容时保护知识产权提供了一种解决方案。
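
The "statistical tests" claim can be illustrated with a simple detector: correlate generated images against a known zero-mean watermark pattern and run a one-sided binomial test. The decoder and the null model P(corr > 0) = 0.5 are simplifying assumptions, not the paper's exact test.

```python
# Hedged sketch: are generated images watermark-positive?
import numpy as np
from scipy.stats import binomtest

def watermark_pvalue(images, pattern):
    # images: iterable of (H, W) arrays; pattern: (H, W) zero-mean watermark
    corrs = [float((img * pattern).mean()) for img in images]
    hits = int(sum(c > 0 for c in corrs))
    # Under H0 (model not trained on marked data), a positive correlation
    # is assumed to be a coin flip.
    return binomtest(hits, len(corrs), 0.5, alternative="greater").pvalue

rng = np.random.default_rng(0)
pattern = rng.standard_normal((32, 32))
marked = [rng.standard_normal((32, 32)) + 0.1 * pattern for _ in range(50)]
print(watermark_pvalue(marked, pattern))  # small p-value -> watermark detected
```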

Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11119
  • repo_url: None
  • paper_authors: Masato Tamura
  • for: 这个研究是为了开发一个 zero-shot anomaly detection 方法,利用 CLIP 的视觉语言模型来提供数据源。
  • methods: 这个方法使用 CLIP 的 prompt-guided classification 技术,将每个图像分成多个部分,并将每个部分作为 input 进行类别。此外,还使用了一些随机生成的词语,以增加训练数据的多样性。
  • results: 实验结果显示,这个方法可以在 zero-shot 设定下 achieves state-of-the-art 性能,不需要耗费很多时间进行训练。
    Abstract This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection. Tremendous efforts have been put towards developing anomaly detectors due to their potential industrial applications. Considering the difficulty in acquiring various anomalous samples for training, most existing methods train models with only normal samples and measure discrepancies from the distribution of normal samples during inference, which requires training a model for each object category. The problem of this inefficient training requirement has been tackled by designing a CLIP-based anomaly detector that applies prompt-guided classification to each part of an image in a sliding window manner. However, the method still suffers from the labor of careful prompt ensembling with known object categories. To overcome the issues above, we propose leveraging CLIP as a data source for training. Our method generates text embeddings with the text encoder in CLIP with typical prompts that include words of normal and anomaly. In addition to these words, we insert several randomly generated words into prompts, which enables the encoder to generate a diverse set of normal and anomalous samples. Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomaly from CLIP's embeddings, and as a result, a category-agnostic anomaly detector can be obtained without any training images. Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
    摘要 本文提出了一种利用视觉-语言模型CLIP作为数据源的零样本异常检测新方法。由于各类异常样本难以获取,现有方法大多仅使用正常样本训练模型,并在推理时度量与正常样本分布的偏差,这要求为每个物体类别分别训练模型。已有工作通过基于CLIP的异常检测器,以滑动窗口方式对图像各部分进行提示引导分类,缓解了这一低效的训练需求,但仍需针对已知物体类别费力地设计提示组合。为克服上述问题,我们提出利用CLIP作为训练数据源:使用CLIP的文本编码器,以包含正常与异常词汇的典型提示生成文本嵌入,并在提示中插入若干随机生成的词汇,使编码器能够生成多样的正常与异常样本。以生成的嵌入作为训练数据,一个前馈神经网络学习从CLIP嵌入中提取正常与异常特征,从而无需任何训练图像即可获得与类别无关的异常检测器。实验结果表明,该方法在零样本设定下无需费力的提示组合即可达到最先进的性能。
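
The data-generation step described above, prompts containing normal/anomaly words padded with random words and then encoded by CLIP's text encoder, can be sketched as below. This assumes the OpenAI `clip` package; the prompt template and state words are illustrative, not the paper's exact wording.

```python
# Hedged sketch: synthesize training embeddings from randomized prompts.
import random
import string
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")

def random_word(n: int = 6) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=n))

def make_embeddings(state: str, n_prompts: int = 32):
    prompts = [f"a photo of a {state} {random_word()} {random_word()}"
               for _ in range(n_prompts)]
    with torch.no_grad():
        return model.encode_text(clip.tokenize(prompts))  # (n_prompts, 512)

normal_feats = make_embeddings("flawless")   # stand-in for "normal" wording
anomaly_feats = make_embeddings("damaged")   # stand-in for "anomaly" wording
```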

LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction

  • paper_url: http://arxiv.org/abs/2308.11116
  • repo_url: https://github.com/haesoochung/lan-hdr
  • paper_authors: Haesoo Chung, Nam Ik Cho
  • for: 提高高清晰度和动态范围(HDR)影像技术,以满足用户对高质量视频的需求。
  • methods: 基于特征空间的灵活抽象网络(LAN-HDR),包括对适应模块和梦想模块。对模块使用灵活抽象来减少流量估计错误。
  • results: 比较现有方法表现更好或相当,在多个标准测试数据集上进行了广泛的实验。
    Abstract As demands for high-quality videos continue to rise, high-resolution and high-dynamic range (HDR) imaging techniques are drawing attention. To generate an HDR video from low dynamic range (LDR) images, one of the critical steps is the motion compensation between LDR frames, for which most existing works employed the optical flow algorithm. However, these methods suffer from flow estimation errors when saturation or complicated motions exist. In this paper, we propose an end-to-end HDR video composition framework, which aligns LDR frames in the feature space and then merges aligned features into an HDR frame, without relying on pixel-domain optical flow. Specifically, we propose a luminance-based alignment network for HDR (LAN-HDR) consisting of an alignment module and a hallucination module. The alignment module aligns a frame to the adjacent reference by evaluating luminance-based attention, excluding color information. The hallucination module generates sharp details, especially for washed-out areas due to saturation. The aligned and hallucinated features are then blended adaptively to complement each other. Finally, we merge the features to generate a final HDR frame. In training, we adopt a temporal loss, in addition to frame reconstruction losses, to enhance temporal consistency and thus reduce flickering. Extensive experiments demonstrate that our method performs better or comparable to state-of-the-art methods on several benchmarks.
    摘要 随着对高质量视频需求的不断增长,高分辨率与高动态范围(HDR)成像技术受到越来越多的关注。要从低动态范围(LDR)图像生成HDR视频,关键步骤之一是LDR帧之间的运动补偿,现有工作大多采用光流算法,但在存在饱和或复杂运动时会产生光流估计误差。本文提出一种端到端的HDR视频合成框架,在特征空间中对齐LDR帧,再将对齐后的特征融合为HDR帧,而无需依赖像素域光流。具体而言,我们提出了基于亮度的HDR对齐网络(LAN-HDR),由对齐模块和幻构模块组成:对齐模块通过评估基于亮度的注意力(不含颜色信息)将当前帧与相邻参考帧对齐;幻构模块生成清晰的细节,尤其针对因饱和而过曝的区域。随后,将对齐与幻构得到的特征自适应地融合以相互补充,最终融合各特征生成HDR帧。训练时,除帧重建损失外,我们还引入时间损失以增强时间一致性、减少闪烁。大量实验表明,该方法在多个基准上优于或可比于最先进方法。

Development of a Novel Quantum Pre-processing Filter to Improve Image Classification Accuracy of Neural Network Models

  • paper_url: http://arxiv.org/abs/2308.11112
  • repo_url: https://github.com/hajimesuzuki999/qpf
  • paper_authors: Farina Riaz, Shahab Abdulla, Hajime Suzuki, Srinjoy Ganguly, Ravinesh C. Deo, Susan Hopkins
  • for: 提高图像分类准确率
  • methods: 使用量子预处理筛选器(QPF),应用于图像分类神经网络模型中
  • results: 在MNIST和EMNIST数据集上,图像分类准确率提高至95.4%和75.9%,分别提高了2.9%和7.1%,无需添加额外参数或优化机器学习过程。
    Abstract This paper proposes a novel quantum pre-processing filter (QPF) to improve the image classification accuracy of neural network (NN) models. A simple four qubit quantum circuit that uses Y rotation gates for encoding and two controlled NOT gates for creating correlation among the qubits is applied as a feature extraction filter prior to passing data into the fully connected NN architecture. By applying the QPF approach, the results show that the image classification accuracy based on the MNIST (handwritten 10 digits) and the EMNIST (handwritten 47 class digits and letters) datasets can be improved, from 92.5% to 95.4% and from 68.9% to 75.9%, respectively. These improvements were obtained without introducing extra model parameters or optimizations in the machine learning process. However, tests performed on the developed QPF approach against a relatively complex GTSRB dataset with 43 distinct class real-life traffic sign images showed a degradation in the classification accuracy. Considering this result, further research into the understanding and the design of a more suitable quantum circuit approach for image classification neural networks could be explored utilizing the baseline method proposed in this paper.
    摘要 本文提出一种新颖的量子预处理滤波器(QPF),用于提高神经网络(NN)模型的图像分类准确率。该方法在数据送入全连接NN架构之前,采用一个简单的四量子比特量子电路作为特征提取滤波器:使用Y旋转门进行编码,并使用两个受控非门在量子比特间建立关联。实验结果表明,应用QPF方法后,基于MNIST(手写10个数字)和EMNIST(手写47类数字与字母)数据集的图像分类准确率分别从92.5%提升至95.4%、从68.9%提升至75.9%,且无需引入额外的模型参数或机器学习过程的优化。然而,在包含43类真实交通标志图像、相对复杂的GTSRB数据集上的测试显示分类准确率有所下降。鉴于这一结果,后续研究可以在本文提出的基线方法之上,进一步探索更适合图像分类神经网络的量子电路设计。
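
The four-qubit filter described above, Y-rotation encoding followed by two CNOTs, can be written down directly in Qiskit. The exact gate placement is an assumption; the paper's repository has the authoritative circuit.

```python
# Hedged sketch, assuming Qiskit: a QPF-style four-qubit pre-processing circuit.
import numpy as np
from qiskit import QuantumCircuit

def qpf_circuit(pixels):
    """pixels: four intensities in [0, 1], one per qubit."""
    qc = QuantumCircuit(4)
    for q, p in enumerate(pixels):
        qc.ry(np.pi * p, q)   # angle-encode each pixel with an RY gate
    qc.cx(0, 1)               # two CNOTs create correlation among the qubits
    qc.cx(2, 3)
    qc.measure_all()
    return qc

print(qpf_circuit([0.1, 0.5, 0.7, 0.9]))
```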

Classification of the lunar surface pattern by AI architectures: Does AI see a rabbit in the Moon?

  • paper_url: http://arxiv.org/abs/2308.11107
  • repo_url: None
  • paper_authors: Daigo Shoji
  • for: 这篇论文的目的是研究月面的颜色模式是否类似于兔子。
  • methods: 这篇论文使用了七种人工智能架构来评估月面颜色模式与兔子之间的相似性。
  • results: 测试结果显示,在某些地区,月面颜色模式更容易被识别为兔子,而不是人脸。此外,使用ImageNet权重时,ConvNeXt和CLIP occasionally可以归类月面颜色模式为兔子。
    Abstract In Asian countries, there is a tradition that a rabbit (the Moon rabbit) lives on the Moon. As the origin of this tradition, usually, two reasons are mentioned. One reason is that the color pattern of the lunar surface is similar to the shape of a rabbit. The other reason is that both the Moon and rabbit are symbols of fertility because the Moon appears and disappears (i.e., waxing and waning) cyclically, and rabbits bear children frequently. Considering the latter reason, is the lunar surface color pattern not similar to a rabbit? Here, the similarity between rabbit and the lunar surface pattern was evaluated using seven AI architectures. In the test by CLIP, assuming that people look at the Moon in the early evening frequently, the lunar surface is more similar to a rabbit than a face at low latitude regions, while it can be classified as face as latitude increases, which is consistent with that the oldest literature about the Moon rabbit was written in India and that there is a culture of human's face in the Moon in Europe. Tested with ImageNet weights, ConvNeXt and CLIP sometimes classified the lunar surface pattern into rabbit with relatively high probabilities. Cultures are generated by our attitude to the environment. Both dynamic and static similarities may be required to induce our imagination.
    摘要 在亚洲国家,有一个传统认为月亮上住着一只兔子(月兔)。关于这一传统的起源,通常有两种说法:一是月面的颜色图案与兔子的形状相似;二是月亮与兔子都是生育的象征,因为月亮周期性地盈亏,而兔子经常产崽。考虑到后一种说法,月面的颜色图案难道不像兔子吗?本文使用七种AI架构评估了兔子与月面图案之间的相似性。在CLIP的测试中,假设人们常在傍晚观看月亮,则在低纬度地区月面更像兔子而非人脸,而随着纬度升高,月面会被归类为人脸;这与关于月兔的最古老文献出自印度、而欧洲存在"月中人脸"文化的事实相一致。在使用ImageNet权重的测试中,ConvNeXt和CLIP有时会以相对较高的概率将月面图案归类为兔子。文化源于我们对环境的态度,而激发我们的想象力也许既需要动态相似性,也需要静态相似性。

Recursive Video Lane Detection

  • paper_url: http://arxiv.org/abs/2308.11106
  • repo_url: https://github.com/dongkwonjin/rvld
  • paper_authors: Dongkwon Jin, Dahyun Kim, Chang-Su Kim
  • for: 这篇论文提出了一种在视频中检测车道线的新算法,即递归视频车道检测器(RVLD),它将当前帧的状态递归地传播到下一帧。
  • methods: 该算法包括帧内车道检测器(ILD)和预测车道检测器(PLD)。首先设计ILD以在单帧中定位车道线;然后开发PLD,利用上一帧的信息更可靠地检测当前帧的车道线:估计运动场并将上一帧的输出扭转到当前帧,再利用扭转后的信息精化当前帧的特征图。
  • results: 实验结果表明,RVLD在视频车道数据集上的性能明显优于现有检测器。代码可在 https://github.com/dongkwonjin/RVLD 下载。
    Abstract A novel algorithm to detect road lanes in videos, called recursive video lane detector (RVLD), is proposed in this paper, which propagates the state of a current frame recursively to the next frame. RVLD consists of an intra-frame lane detector (ILD) and a predictive lane detector (PLD). First, we design ILD to localize lanes in a still frame. Second, we develop PLD to exploit the information of the previous frame for lane detection in a current frame. To this end, we estimate a motion field and warp the previous output to the current frame. Using the warped information, we refine the feature map of the current frame to detect lanes more reliably. Experimental results show that RVLD outperforms existing detectors on video lane datasets. Our codes are available at https://github.com/dongkwonjin/RVLD.
    摘要 本文提出了一种在视频中检测车道线的新算法,即递归视频车道检测器(RVLD),它将当前帧的状态递归地传播到下一帧。RVLD由帧内车道检测器(ILD)和预测车道检测器(PLD)组成。首先,我们设计ILD以在单帧中定位车道线;其次,我们开发PLD,利用上一帧的信息改进当前帧的车道检测。为此,我们估计运动场,并将上一帧的输出扭转(warp)到当前帧;利用扭转后的信息精化当前帧的特征图,从而更可靠地检测车道线。实验结果表明,RVLD在视频车道数据集上的性能优于现有检测器。代码见 https://github.com/dongkwonjin/RVLD 。
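
The warping step, resampling the previous frame's output into the current frame along an estimated motion field, is a standard operation and can be sketched with bilinear grid sampling. The zero motion field in the toy call is a placeholder for the network's estimate.

```python
# Hedged sketch, assuming PyTorch: warp the previous lane map to the current frame.
import torch
import torch.nn.functional as F

def warp(prev_map, flow):
    # prev_map: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (dx, dy)
    B, _, H, W = prev_map.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = (xs[None] + flow[:, 0]) / (W - 1) * 2 - 1   # normalize to [-1, 1]
    gy = (ys[None] + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)             # (B, H, W, 2)
    return F.grid_sample(prev_map, grid, align_corners=True)

warped = warp(torch.rand(1, 1, 64, 64), torch.zeros(1, 2, 64, 64))
```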

MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers

  • paper_url: http://arxiv.org/abs/2308.11096
  • repo_url: None
  • paper_authors: Daniel Silver, Tirthak Patel, William Cutler, Aditya Ranjan, Harshitta Gandhi, Devesh Tiwari
  • for: 研究量子机器学习和视觉技术,尤其是量子图像生成技术,以提高图像质量和可靠性。
  • methods: 我们提出了名为MosaiQ的高质量量子图像生成GAN框架,可在当今的含噪中等规模量子(NISQ)计算机上执行。
  • results: MosaiQ可以生成高质量的图像,并且可以在不同的图像生成任务中实现高度的可靠性和稳定性。
    Abstract Quantum machine learning and vision have come to the fore recently, with hardware advances enabling rapid advancement in the capabilities of quantum machines. Recently, quantum image generation has been explored with many potential advantages over non-quantum techniques; however, previous techniques have suffered from poor quality and robustness. To address these problems, we introduce, MosaiQ, a high-quality quantum image generation GAN framework that can be executed on today's Near-term Intermediate Scale Quantum (NISQ) computers.
    摘要 量子机器学习和视觉在最近几年来得到了更多的关注,各种硬件进步使得量子机器的能力得到了快速提升。近期,量子图像生成得到了广泛研究,但以前的技术受到了低质量和稳定性的限制。为了解决这些问题,我们介绍了 MosaiQ,一个高质量量子图像生成GAN框架,可以在当今的中等规模量子计算机上执行。

Addressing Fairness and Explainability in Image Classification Using Optimal Transport

  • paper_url: http://arxiv.org/abs/2308.11090
  • repo_url: None
  • paper_authors: Philipp Ratz, François Hu, Arthur Charpentier
  • for: 本研究旨在提升人工智能系统的可信度与公平性,使其在医疗和警务等领域建立信任与问责。
  • methods: 本研究使用最优传输理论来揭示图像中偏差区域的成因及其影响,该方法可轻松推广到表格数据。
  • results: 研究发现,借助Wasserstein重心可以获得与敏感变量无关、但保持边缘排序的分数,从而在保持预测准确性的同时定位偏差的来源。这些发现对于构建可信、公平的人工智能系统具有重要意义。
    Abstract Algorithmic Fairness and the explainability of potentially unfair outcomes are crucial for establishing trust and accountability of Artificial Intelligence systems in domains such as healthcare and policing. Though significant advances have been made in each of the fields separately, achieving explainability in fairness applications remains challenging, particularly so in domains where deep neural networks are used. At the same time, ethical data-mining has become ever more relevant, as it has been shown countless times that fairness-unaware algorithms result in biased outcomes. Current approaches focus on mitigating biases in the outcomes of the model, but few attempts have been made to try to explain \emph{why} a model is biased. To bridge this gap, we propose a comprehensive approach that leverages optimal transport theory to uncover the causes and implications of biased regions in images, which easily extends to tabular data as well. Through the use of Wasserstein barycenters, we obtain scores that are independent of a sensitive variable but keep their marginal orderings. This step ensures predictive accuracy but also helps us to recover the regions most associated with the generation of the biases. Our findings hold significant implications for the development of trustworthy and unbiased AI systems, fostering transparency, accountability, and fairness in critical decision-making scenarios across diverse domains.
    摘要 算法公平性以及对潜在不公平结果的可解释性,是在医疗和警务等领域建立人工智能系统信任与问责的关键。尽管这两个领域各自都取得了显著进展,但在公平性应用中实现可解释性仍然具有挑战性,尤其是在使用深度神经网络的领域。与此同时,伦理数据挖掘变得愈发重要,因为已有无数事实表明,不考虑公平性的算法会导致有偏的结果。现有方法主要关注缓解模型结果中的偏差,但很少有工作尝试解释模型为何有偏。为弥补这一差距,我们提出了一种综合方法,利用最优传输理论揭示图像中偏差区域的成因及其影响,该方法也可轻松推广到表格数据。通过使用Wasserstein重心,我们获得了与敏感变量无关、但保持边缘排序的分数。这一步骤既保证了预测精度,也帮助我们找回与偏差产生最相关的区域。我们的发现对于开发可信、无偏的人工智能系统具有重要意义,有助于在各领域的关键决策场景中促进透明、问责与公平。
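For one-dimensional scores, the Wasserstein-barycenter repair mentioned in the abstract has a simple closed form: map each score to its within-group quantile, then evaluate the group-weighted average of the groups' quantile functions. Below is a minimal numpy sketch of that construction (the quantile grid size and details are illustrative, not the paper's implementation).

```python
import numpy as np

def barycenter_repair(scores, groups):
    """Map per-group score distributions onto their 1-D Wasserstein barycenter.

    The barycenter's quantile function is the group-frequency-weighted average
    of the groups' quantile functions, so within-group orderings are preserved
    while the repaired scores no longer depend on the sensitive attribute.
    """
    scores, groups = np.asarray(scores, float), np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()
    qs = np.linspace(0, 1, 101)
    # Barycenter quantile function evaluated on a fixed grid.
    bary_q = sum(w * np.quantile(scores[groups == g], qs)
                 for g, w in zip(uniq, weights))
    repaired = np.empty_like(scores)
    for g in uniq:
        s = scores[groups == g]
        ranks = s.argsort().argsort() / max(len(s) - 1, 1)  # within-group quantiles
        repaired[groups == g] = np.interp(ranks, qs, bary_q)
    return repaired
```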

Long-Term Prediction of Natural Video Sequences with Robust Video Predictors

  • paper_url: http://arxiv.org/abs/2308.11079
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: 预测高维视频序列是一个非常困难的问题,因为可能的未来场景数量会随时间呈指数增长。特别是在从对世界的有限快照预测自然视频场景时,内在的不确定性会迅速累积,使长期预测变得非常困难。
  • methods: 我们在这篇论文中对现有工作引入了若干改进,以构建鲁棒视频预测器(RoViPs)。我们结合深度感知损失与基于不确定性的重建损失来获得高质量的短期预测,并使用基于注意力的跳跃连接实现输入特征的长距离空间移动,进一步提升性能。
  • results: 我们表明,通过迭代单步预测任务,可以生成非常长且自然的视频序列。
    Abstract Predicting high dimensional video sequences is a curiously difficult problem. The number of possible futures for a given video sequence grows exponentially over time due to uncertainty. This is especially evident when trying to predict complicated natural video scenes from a limited snapshot of the world. The inherent uncertainty accumulates the further into the future you predict making long-term prediction very difficult. In this work we introduce a number of improvements to existing work that aid in creating Robust Video Predictors (RoViPs). We show that with a combination of deep Perceptual and uncertainty-based reconstruction losses we are able to create high quality short-term predictions. Attention-based skip connections are utilised to allow for long range spatial movement of input features to further improve performance. Finally, we show that by simply making the predictor robust to its own prediction errors, it is possible to produce very long, realistic natural video sequences using an iterated single-step prediction task.
    摘要 预测高维视频序列是一个极具挑战性的问题:由于不确定性,给定视频序列的可能未来数量随时间呈指数增长。在试图从对世界的有限快照预测复杂的自然视频场景时,这一点尤为明显;内在的不确定性随着预测时域的延长不断累积,使长期预测变得非常困难。在这项工作中,我们对现有方法引入了若干改进,以构建鲁棒视频预测器(RoViPs)。我们证明,结合深度感知损失与基于不确定性的重建损失,可以得到高质量的短期预测;利用基于注意力的跳跃连接,实现输入特征的长距离空间移动,进一步提升性能。最后,我们表明,只需让预测器对自身的预测误差具有鲁棒性,便可以通过迭代的单步预测任务生成非常长且逼真的自然视频序列。
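Long rollouts from an iterated single-step predictor are easy to sketch; one common way to make the predictor robust to its own errors, assumed here for illustration rather than taken from the paper, is to sometimes feed it its own outputs during training.

```python
import torch

def rollout(model, frame, steps):
    """Generate a long sequence by iterating a single-step predictor."""
    frames = [frame]
    with torch.no_grad():
        for _ in range(steps):
            frames.append(model(frames[-1]))
    return torch.stack(frames, dim=1)  # (N, T+1, C, H, W)

def robust_train_step(model, clip, opt, loss_fn, p_self=0.5):
    """One step that sometimes feeds the model its own prediction, exposing it
    to (and teaching it to correct) its own error distribution.
    clip: (N, T, C, H, W); the 50/50 schedule is an illustrative choice."""
    opt.zero_grad()
    x, loss = clip[:, 0], 0.0
    for t in range(1, clip.shape[1]):
        pred = model(x)
        loss = loss + loss_fn(pred, clip[:, t])
        x = pred.detach() if torch.rand(()) < p_self else clip[:, t]
    loss.backward()
    opt.step()
    return loss.item()
```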

Audio-Visual Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.11073
  • repo_url: https://github.com/weiguopian/av-cil_iccv2023
  • paper_authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
  • for: 这篇论文提出了视听类增量学习问题,即面向视听视频识别的类增量学习场景。
  • methods: 论文提出了名为AV-CIL的方法,通过双重视听相似性约束(D-AVSC)保持音频与视觉模态之间的语义相似性,并通过视觉注意力蒸馏(VAD)在类增量学习过程中保留先前学到的音频引导的视觉注意能力。
  • results: 实验结果表明,AV-CIL方法在视听类增量学习中显著优于现有的类增量学习方法。
    Abstract In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as incremental step grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between audio-visual modalities and Visual Attention Distillation (VAD) to retain previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
    摘要 在本文中,我们提出了视听类增量学习,即面向视听视频识别的类增量学习场景。我们展示了联合视听建模可以提升类增量学习,但现有方法随着增量步骤的增加,无法保持音频与视觉特征之间的语义相似性。此外,我们观察到先前任务中学到的视听相关性会随着增量步骤的推进而被遗忘,导致性能下降。为应对这些挑战,我们提出了AV-CIL,其中包含双重视听相似性约束(D-AVSC),以同时保持实例级与类别级的跨模态语义相似性;以及视觉注意力蒸馏(VAD),以保留先前学到的音频引导的视觉注意能力。我们基于AVE、Kinetics-Sounds和VGGSound数据集构建了三个视听类增量数据集:AVE-CI、K-S-CI和VS100-CI。在这三个数据集上的实验表明,AV-CIL在视听类增量学习中显著优于现有的类增量学习方法。代码和数据见:https://github.com/weiguoPian/AV-CIL_ICCV2023。

TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11072
  • repo_url: https://github.com/UCF-CRCV/TeD-SPAD
  • paper_authors: Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
  • for: 这项研究旨在提出一种保护隐私的视频异常检测方法,以应对人工监控成本高昂与隐私泄露的问题。
  • methods: 该方法使用自监督的时间区分三元组损失来增强时间特征,并结合匿名化模型来破坏视觉隐私信息、防止隐私泄露。
  • results: 在三个弱监督视频异常检测数据集(UCF-Crime、XD-Violence和ShanghaiTech)上,该方法在隐私保护与异常检测性能之间取得了良好的平衡。
    Abstract Video anomaly detection (VAD) without human monitoring is a complex computer vision task that can have a positive impact on society if implemented successfully. While recent advances have made significant progress in solving this task, most existing approaches overlook a critical real-world concern: privacy. With the increasing popularity of artificial intelligence technologies, it becomes crucial to implement proper AI ethics into their development. Privacy leakage in VAD allows models to pick up and amplify unnecessary biases related to people's personal information, which may lead to undesirable decision making. In this paper, we propose TeD-SPAD, a privacy-aware video anomaly detection framework that destroys visual private information in a self-supervised manner. In particular, we propose the use of a temporally-distinct triplet loss to promote temporally discriminative features, which complements current weakly-supervised VAD methods. Using TeD-SPAD, we achieve a positive trade-off between privacy protection and utility anomaly detection performance on three popular weakly supervised VAD datasets: UCF-Crime, XD-Violence, and ShanghaiTech. Our proposed anonymization model reduces private attribute prediction by 32.25% while only reducing frame-level ROC AUC on the UCF-Crime anomaly detection dataset by 3.69%. Project Page: https://joefioresi718.github.io/TeD-SPAD_webpage/
    摘要 无需人工监控的视频异常检测(VAD)是一项复杂的计算机视觉任务,若能成功落地,将对社会产生积极影响。尽管近期进展显著,但大多数现有方法忽视了一个关键的现实问题:隐私。随着人工智能技术的普及,在其研发中落实恰当的AI伦理变得至关重要。VAD中的隐私泄露会使模型捕捉并放大与个人信息相关的不必要偏差,进而导致不良决策。本文提出了TeD-SPAD,一种以自监督方式破坏视觉隐私信息的隐私感知视频异常检测框架。具体而言,我们提出使用时间区分的三元组损失来促进具有时间判别力的特征,这与当前的弱监督VAD方法形成互补。借助TeD-SPAD,我们在三个流行的弱监督VAD数据集(UCF-Crime、XD-Violence和ShanghaiTech)上实现了隐私保护与异常检测性能之间的良好权衡:我们提出的匿名化模型将隐私属性预测降低了32.25%,而在UCF-Crime异常检测数据集上的帧级ROC AUC仅下降3.69%。项目页面:https://joefioresi718.github.io/TeD-SPAD_webpage/
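A temporally-distinct triplet loss can be sketched as pulling together two augmentations of the same timestep while pushing away a temporally distant clip. The sampling scheme below is an assumption for illustration; TeD-SPAD's exact formulation is in the linked repo.

```python
import torch
import torch.nn.functional as F

def temporal_triplet_loss(encoder, clip_t_a, clip_t_b, clip_far, margin=1.0):
    """clip_t_a / clip_t_b: two augmentations of the same timestep;
    clip_far: a clip from a temporally distant part of the same video.
    Encourages features that change over time (temporal distinctiveness)."""
    anchor = encoder(clip_t_a)
    positive = encoder(clip_t_b)
    negative = encoder(clip_far)
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```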

MetaGCD: Learning to Continually Learn in Generalized Category Discovery

  • paper_url: http://arxiv.org/abs/2308.11063
  • repo_url: None
  • paper_authors: Yanan Wu, Zhixiang Chi, Yang Wang, Songhe Feng
  • for: 本研究旨在解决一种现实场景:模型在训练后持续遇到同时包含已知类和新类的未标注数据,目标是在持续发现新类的同时保持已知类上的性能。我们称这一设定为持续广义类别发现(C-GCD)。
  • methods: 我们提出了MetaGCD方法,以在C-GCD设定下不遗忘已知类地持续发现新类。我们采用元学习框架,利用离线标注数据模拟测试阶段的增量学习过程,并定义了一个围绕两个相互冲突的学习目标的元目标,以实现不遗忘的新类发现。此外,我们还提出了一种基于软邻域的对比网络,在吸引相关图像的同时区分不相关图像。
  • results: 我们在三个广泛使用的标准测试benchmark上建立了强大的基准,并进行了广泛的实验。我们的方法在 C-GCD Setting 中表现出色,可以不断发现新类而不忘记已知类。
    Abstract In this paper, we consider a real-world scenario where a model that is trained on pre-defined classes continually encounters unlabeled data that contains both known and novel classes. The goal is to continually discover novel classes while maintaining the performance in known classes. We name the setting Continual Generalized Category Discovery (C-GCD). Existing methods for novel class discovery cannot directly handle the C-GCD setting due to some unrealistic assumptions, such as the unlabeled data only containing novel classes. Furthermore, they fail to discover novel classes in a continual fashion. In this work, we lift all these assumptions and propose an approach, called MetaGCD, to learn how to incrementally discover with less forgetting. Our proposed method uses a meta-learning framework and leverages the offline labeled data to simulate the testing incremental learning process. A meta-objective is defined to revolve around two conflicting learning objectives to achieve novel class discovery without forgetting. Furthermore, a soft neighborhood-based contrastive network is proposed to discriminate uncorrelated images while attracting correlated images. We build strong baselines and conduct extensive experiments on three widely used benchmarks to demonstrate the superiority of our method.
    摘要 在本文中,我们考虑一个现实场景:一个在预定义类别上训练过的模型,会持续遇到同时包含已知类和新类的未标注数据。目标是在持续发现新类的同时,保持在已知类上的性能。我们将这一设定称为持续广义类别发现(C-GCD)。现有的新类发现方法无法直接处理C-GCD设定,因为它们依赖一些不切实际的假设,例如未标注数据只包含新类;此外,它们也无法以持续的方式发现新类。在这项工作中,我们摒弃了所有这些假设,并提出了一种名为MetaGCD的方法,学习如何在减少遗忘的前提下进行增量发现。我们的方法采用元学习框架,并利用离线的已标注数据来模拟测试阶段的增量学习过程。我们定义了一个围绕两个相互冲突的学习目标的元目标,以实现在不遗忘已知类的情况下发现新类。此外,我们还提出了一种基于软邻域的对比网络,在吸引相关图像的同时区分不相关图像。我们构建了强有力的基线,并在三个广泛使用的基准上进行了大量实验,证明了我们方法的优越性。

UnLoc: A Unified Framework for Video Localization Tasks

  • paper_url: http://arxiv.org/abs/2308.11062
  • repo_url: https://github.com/google-research/scenic
  • paper_authors: Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
  • for: 这项研究的目标是提出一种统一的视频定位方法,用于在未剪辑视频中进行时序定位。
  • methods: 该方法使用预训练的图像塔和文本塔,并将token送入视频-文本融合模型。融合模块的输出用于构建特征金字塔,其中每一层都连接一个预测头,用于预测每帧的相关性分数以及起止时间偏移。
  • results: 该方法用单一模型即可完成时刻检索(Moment Retrieval)、时序定位和动作分割三项任务,并在全部三项任务上取得了最先进的结果。
    Abstract While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code will be available at: \url{https://github.com/google-research/scenic}.
    摘要 虽然像CLIP这样的大规模图文预训练模型已被用于剪辑后视频的多种视频级任务,但将其用于未剪辑视频中的时序定位仍是一个相对未被探索的问题。为此,我们设计了一种名为UnLoc的新方法:它使用预训练的图像塔和文本塔,并将token送入视频-文本融合模型。融合模块的输出用于构建特征金字塔,其中每一层都连接一个预测头,用于预测每帧的相关性分数以及起止时间偏移。与以往工作不同,我们的架构用单阶段模型即可完成时刻检索、时序定位和动作分割,无需动作提议、基于运动的预训练特征或表示掩码。与各类专用模型不同,我们以统一的方法在三种不同的定位任务上均取得了最先进的结果。代码将发布于:https://github.com/google-research/scenic。

Harmonization Across Imaging Locations (HAIL): One-Shot Learning for Brain MRI

  • paper_url: http://arxiv.org/abs/2308.11047
  • repo_url: None
  • paper_authors: Abhijeet Parida, Zhifan Jiang, Syed Muhammad Anwar, Nicholas Foreman, Nicholas Stence, Michael J. Fisher, Roger J. Packer, Robert A. Avery, Marius George Linguraru
  • for: 这篇论文旨在为罕见疾病(如儿童脑肿瘤)的基于机器学习的医学影像诊断与预后预测,解决多中心影像数据协调(harmonization)的问题。
  • methods: 深度学习驱动的影像协调通常依赖生成对抗网络(GANs);本文则提出一种单样本(one-shot)学习方法,利用神经风格迁移来统一医学影像的强度尺度。
  • results: 实验结果表明,该方法能够在保持患者解剖结构的同时,将影像强度调整到新的临床站点;而且该通用协调模型可直接应用于来自新站点的未见数据,因而对实际医疗应用和临床试验具有价值。
    Abstract For machine learning-based prognosis and diagnosis of rare diseases, such as pediatric brain tumors, it is necessary to gather medical imaging data from multiple clinical sites that may use different devices and protocols. Deep learning-driven harmonization of radiologic images relies on generative adversarial networks (GANs). However, GANs notoriously generate pseudo structures that do not exist in the original training data, a phenomenon known as "hallucination". To prevent hallucination in medical imaging, such as magnetic resonance images (MRI) of the brain, we propose a one-shot learning method where we utilize neural style transfer for harmonization. At test time, the method uses one image from a clinical site to generate an image that matches the intensity scale of the collaborating sites. Our approach combines learning a feature extractor, neural style transfer, and adaptive instance normalization. We further propose a novel strategy to evaluate the effectiveness of image harmonization approaches with evaluation metrics that both measure image style harmonization and assess the preservation of anatomical structures. Experimental results demonstrate the effectiveness of our method in preserving patient anatomy while adjusting the image intensities to a new clinical site. Our general harmonization model can be used on unseen data from new sites, making it a valuable tool for real-world medical applications and clinical trials.
    摘要 对于罕见疾病(如儿童脑肿瘤)的基于机器学习的预后与诊断,需要从多个临床站点收集医学影像数据,而这些站点可能使用不同的设备和协议。深度学习驱动的放射影像协调依赖生成对抗网络(GANs),但GANs以生成训练数据中并不存在的伪结构而著称,这一现象被称为"幻觉"。为防止医学影像(如脑部磁共振影像,MRI)中出现幻觉,我们提出了一种利用神经风格迁移进行协调的单样本(one-shot)学习方法。在测试时,该方法仅使用来自某一临床站点的一幅影像,即可生成与协作站点强度尺度相匹配的影像。我们的方法结合了特征提取器的学习、神经风格迁移与自适应实例归一化。我们还提出了一种新的评估策略,其指标既衡量影像风格协调的程度,也评估解剖结构的保持情况。实验结果表明,该方法在将影像强度调整到新临床站点的同时保持了患者的解剖结构。我们的通用协调模型可用于来自新站点的未见数据,是实际医疗应用和临床试验中的有价值工具。
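Adaptive instance normalization, one of the named ingredients, has a compact definition: re-normalize content features to the channel-wise statistics of a style (here, target-site) feature map. A minimal sketch:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    """AdaIN: align the per-channel mean/std of `content` (N, C, H, W)
    to those of `style`, transferring intensity style while keeping
    the content's spatial structure (e.g. patient anatomy)."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```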

Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction

  • paper_url: http://arxiv.org/abs/2308.11025
  • repo_url: https://github.com/machineperceptionlab/cq-nir
  • paper_authors: Sijia Jiang, Jing Hua, Zhizhong Han
  • for: 用于从多视角图像中学习神经隐式表示,以进行3D重建
  • methods: 将量化后的坐标作为神经网络的输入,使用离散坐标及其位置编码,通过体渲染来学习隐式函数
  • results: 在不增加计算负担的情况下,强化了多视角一致性约束,并在最新方法之上展现出超越最先进水平的优势
    摘要 近年来,从多视角图像学习神经隐式表示以进行3D重建取得了巨大进展。作为坐标之外的补充输入,使用正弦函数作为位置编码,对基于坐标的神经网络揭示高频细节起到了关键作用。然而,高频位置编码会使优化变得不稳定,导致重建结果带有噪声,并在空白区域产生伪影。为从根本上解决这一问题,我们提出使用量化坐标来学习神经隐式表示,以降低优化过程中场的不确定性与歧义。我们不使用连续坐标,而是在极高分辨率下离散化场,并通过最近邻插值将连续坐标离散化为量化坐标。随后,我们使用离散坐标及其位置编码,通过体渲染学习隐式函数。这显著减少了采样空间中的变动,并在来自不同视角的光线交点上引入了更多的多视角一致性约束,从而能够更有效地推断隐式函数。我们的量化坐标不带来任何计算负担,并可无缝用于最新的方法之上。在广泛使用的基准上的评估表明,我们的方法优于当前最先进方法。代码见 https://github.com/MachinePerceptionLab/CQ-NIR。
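The coordinate quantization itself reduces to snapping each sample onto a very fine grid before positional encoding. A sketch with an illustrative grid resolution and the standard sinusoidal encoding (the paper's exact resolution and encoding frequencies may differ):

```python
import torch

def quantize_coords(x: torch.Tensor, resolution: int = 2 ** 16) -> torch.Tensor:
    """Nearest-neighbor quantization of coordinates in [-1, 1] onto a
    `resolution`-cell grid, removing sub-cell ambiguity during optimization."""
    half = resolution / 2
    return torch.round(x * half) / half

def positional_encoding(x: torch.Tensor, n_freqs: int = 10) -> torch.Tensor:
    """Standard sinusoidal encoding applied to the quantized coordinates."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi
    angles = x[..., None] * freqs               # (..., dim, n_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                      # (..., dim * 2 * n_freqs)

pts = torch.rand(1024, 3) * 2 - 1               # continuous sample coordinates
features = positional_encoding(quantize_coords(pts))
```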

Multi-Task Hypergraphs for Semi-supervised Learning using Earth Observations

  • paper_url: http://arxiv.org/abs/2308.11021
  • repo_url: None
  • paper_authors: Mihai Pirvu, Alina Marcu, Alexandra Dobrescu, Nabil Belbachir, Marius Leordeanu
  • for: 这篇论文旨在解决多任务学习中的数据缺失问题,特别是在地球观测领域——该领域经常缺少真值标注数据。
  • methods: 该论文提出了一种多任务超图半监督学习方法:超图中每个节点是一个任务,通往给定任务的不同超图路径成为无监督教师,它们组成集成来学习为该任务生成可靠的伪标签。
  • results: 在跨越22年的NASA NEO数据集上的大量实验表明,该多任务半监督方法相对强基线和近期工作均有一致提升;此外,超图能够无监督地适应渐变的数据分布漂移,并通过其多任务自监督过程可靠地恢复多个观测层长达七年的缺失数据。
    Abstract There are many ways of interpreting the world and they are highly interdependent. We exploit such complex dependencies and introduce a powerful multi-task hypergraph, in which every node is a task and different paths through the hypergraph reaching a given task become unsupervised teachers, by forming ensembles that learn to generate reliable pseudolabels for that task. Each hyperedge is part of an ensemble teacher for a given task and it is also a student of the self-supervised hypergraph system. We apply our model to one of the most important problems of our times, that of Earth Observation, which is highly multi-task and it often suffers from missing ground-truth data. By performing extensive experiments on the NASA NEO Dataset, spanning a period of 22 years, we demonstrate the value of our multi-task semi-supervised approach, by consistent improvements over strong baselines and recent work. We also show that the hypergraph can adapt unsupervised to gradual data distribution shifts and reliably recover, through its multi-task self-supervision process, the missing data for several observational layers for up to seven years.
    摘要 解释世界的方式有很多,而且它们之间高度相互依赖。我们利用这种复杂的依赖关系,提出了一个强大的多任务超图:其中每个节点都是一个任务,通往给定任务的不同超图路径成为无监督教师,它们组成集成,学习为该任务生成可靠的伪标签。每条超边既是某个任务的集成教师的一部分,也是这个自监督超图系统的学生。我们将该模型应用于当今最重要的问题之一——地球观测,它具有高度多任务的特点,且经常缺少真值数据。通过在跨越22年的NASA NEO数据集上进行大量实验,我们证明了这种多任务半监督方法的价值:相对强基线和近期工作均取得一致提升。我们还表明,超图能够无监督地适应渐变的数据分布漂移,并通过其多任务自监督过程,可靠地恢复多个观测层长达七年的缺失数据。

Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images

  • paper_url: http://arxiv.org/abs/2308.11015
  • repo_url: https://github.com/eldentse/Spectral-Graphormer
  • paper_authors: Tze Ho Elden Tse, Franziska Mueller, Zhengyang Shen, Danhang Tang, Thabo Beeler, Mingsong Dou, Yinda Zhang, Sasa Petrovic, Hyung Jin Chang, Jonathan Taylor, Bardia Doosti
  • for: 从多视角RGB图像中重建两只高保真度的手
  • methods: 基于Transformer的框架,配合谱图卷积解码器以及基于优化的细化阶段
  • results: 生成具有物理合理性的逼真双手重建,可泛化到真实数据,并支持实时AR/VR应用
    Abstract We propose a novel transformer-based framework that reconstructs two high fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, where one typically trains a deep network to regress hand model parameters from single RGB image, we consider a more challenging problem setting where we directly regress the absolute root poses of two-hands with extended forearm at high resolution from egocentric view. As existing datasets are either infeasible for egocentric viewpoints or lack background variations, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset from multi-calibrated camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder to smoothen the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetrations. Through extensive quantitative and qualitative evaluations, we show that our framework is able to produce realistic two-hand reconstructions and demonstrate the generalisation of synthetic-trained models to real data, as well as real-time AR/VR applications.
    摘要 我们提出了一种基于Transformer的新框架,可从多视角RGB图像中重建两只高保真度的手。与现有的手部姿态估计方法(通常训练深度网络从单张RGB图像回归手部模型参数)不同,我们考虑一个更具挑战性的问题设定:从第一人称视角的高分辨率图像中,直接回归两只带前臂的手的绝对根部位姿。由于现有数据集要么不适用于第一人称视角,要么缺乏背景变化,我们构建了一个包含多样场景的大规模合成数据集,并通过多相机标定系统采集了真实数据集,以验证我们提出的多视角图像特征融合策略。为使重建在物理上合理,我们提出了两种策略:(i)由粗到细的谱图卷积解码器,在上采样过程中平滑网格;(ii)推理时基于优化的细化阶段,以防止自穿插。通过大量的定量和定性评估,我们表明该框架能够生成逼真的双手重建,并展示了合成数据训练的模型对真实数据的泛化能力,以及实时AR/VR应用。

Autonomous Detection of Methane Emissions in Multispectral Satellite Data Using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.11003
  • repo_url: None
  • paper_authors: Bertrand Rouet-Leduc, Thomas Kerdreux, Alexandre Tuel, Claudia Hulbert
  • for: 甲烷是强效温室气体,对其排放进行监测是快速遏制全球变暖的重要手段
  • methods: 使用深度学习方法在多光谱卫星数据中自动检测甲烷泄漏
  • results: 相比最先进的多光谱甲烷数据产品显著降低了假阳性率,且无需关于潜在泄漏点位置的先验知识
    Abstract Methane is one of the most potent greenhouse gases, and its short atmospheric half-life makes it a prime target to rapidly curb global warming. However, current methane emission monitoring techniques primarily rely on approximate emission factors or self-reporting, which have been shown to often dramatically underestimate emissions. Although initially designed to monitor surface properties, satellite multispectral data has recently emerged as a powerful method to analyze atmospheric content. However, the spectral resolution of multispectral instruments is poor, and methane measurements are typically very noisy. Methane data products are also sensitive to absorption by the surface and other atmospheric gases (water vapor in particular) and therefore provide noisy maps of potential methane plumes, that typically require extensive human analysis. Here, we show that the image recognition capabilities of deep learning methods can be leveraged to automatize the detection of methane leaks in Sentinel-2 satellite multispectral data, with dramatically reduced false positive rates compared with state-of-the-art multispectral methane data products, and without the need for a priori knowledge of potential leak sites. Our proposed approach paves the way for the automated, high-definition and high-frequency monitoring of point-source methane emissions across the world.
    摘要 甲烷是最强效的温室气体之一,其大气半衰期短,因而是快速遏制全球变暖的首要目标。然而,当前的甲烷排放监测技术主要依赖近似的排放因子或自主申报,而这些方式已被证明经常严重低估排放量。卫星多光谱数据最初是为监测地表特性而设计的,但近来已成为分析大气成分的有力手段。然而,多光谱仪器的光谱分辨率较低,甲烷测量通常噪声很大;甲烷数据产品还对地表及其他大气气体(尤其是水汽)的吸收敏感,因此给出的潜在甲烷羽流图噪声较大,通常需要大量人工分析。在本文中,我们展示了可以利用深度学习方法的图像识别能力,在Sentinel-2卫星多光谱数据中自动检测甲烷泄漏:相比最先进的多光谱甲烷数据产品,假阳性率大幅降低,且无需关于潜在泄漏点位置的先验知识。我们提出的方法为全球点源甲烷排放的自动化、高清晰度、高频率监测铺平了道路。

Switched auxiliary loss for robust training of transformer models for histopathological image segmentation

  • paper_url: http://arxiv.org/abs/2308.10994
  • repo_url: None
  • paper_authors: Mustaffa Hussain, Saharsh Barve
  • for: 本研究旨在提供一种模型,用于分割多个器官的功能组织单元(FTUs,即器官内执行其主要功能的细胞群邻域),以帮助病理学家在细胞层面理解疾病对器官的影响。
  • methods: 该模型使用 HuBMAP + HPA - Hacking the Human Body 竞赛数据集进行训练,并提出了切换式辅助损失(switched auxiliary loss),以缓解阻碍深度模型最优训练的梯度消失问题。
  • results: 该模型在公共数据集上取得了0.793的Dice分数,在私有数据集上取得了0.778的Dice分数,相比传统方法提升约1%;这些结果也印证了Transformer模型在医学图像分析的密集预测任务中的出色表现。
    Abstract Functional tissue Units (FTUs) are cell population neighborhoods local to a particular organ performing its main function. The FTUs provide crucial information to the pathologist in understanding the disease affecting a particular organ by providing information at the cellular level. In our research, we have developed a model to segment multi-organ FTUs across 5 organs namely: the kidney, large intestine, lung, prostate and spleen by utilizing the HuBMAP + HPA - Hacking the Human Body competition dataset. We propose adding shifted auxiliary loss for training models like the transformers to overcome the diminishing gradient problem which poses a challenge towards optimal training of deep models. Overall, our model achieved a dice score of 0.793 on the public dataset and 0.778 on the private dataset and shows a 1% improvement with the use of the proposed method. The findings also bolster the use of transformers models for dense prediction tasks in the field of medical image analysis. The study assists in understanding the relationships between cell and tissue organization thereby providing a useful medium to look at the impact of cellular functions on human health.
    摘要 功能组织单元(FTU)是器官内执行其主要功能的局部细胞群邻域,它在细胞层面为病理学家理解影响特定器官的疾病提供了关键信息。在本研究中,我们利用 HuBMAP + HPA - Hacking the Human Body 竞赛数据集,开发了一个可在肾、大肠、肺、前列腺和脾这5种器官上分割多器官FTU的模型。我们提出在训练Transformer等模型时加入切换式辅助损失(switched auxiliary loss),以克服阻碍深度模型最优训练的梯度消失问题。总体而言,我们的模型在公共数据集上取得了0.793的Dice分数,在私有数据集上取得了0.778的Dice分数,使用该方法带来了1%的提升。这些发现也支持了在医学图像分析领域将Transformer模型用于密集预测任务。本研究有助于理解细胞与组织结构之间的关系,从而为考察细胞功能对人体健康的影响提供了有用的途径。

Debiasing Counterfactuals In the Presence of Spurious Correlations

  • paper_url: http://arxiv.org/abs/2308.10984
  • repo_url: None
  • paper_authors: Amar Kumar, Nima Fathi, Raghav Mehta, Brennan Nichyporuk, Jean-Pierre R. Falet, Sotirios Tsaftaris, Tal Arbel
  • for: 这篇论文面向医学影像分类任务,重点解决深度学习模型依赖训练数据中虚假相关(混杂因素)的问题。
  • methods: 论文提出了一种端到端训练框架,将流行的去偏分类器(如分布鲁棒优化,DRO)与反事实图像生成相结合,以揭示与任务真正相关、可泛化的影像标志;并提出了一个新指标——虚假相关依附分数(Spurious Correlation Latching Score, SCLS),用于量化分类器对虚假相关的依赖程度。
  • results: 通过在两个公共数据集(含模拟与真实视觉伪影)上的全面实验,论文证明该去偏方法:(i)学到了在整个人群中可泛化的标志;(ii)成功忽略虚假相关,专注于潜在的疾病病理。
    Abstract Deep learning models can perform well in complex medical imaging classification tasks, even when basing their conclusions on spurious correlations (i.e. confounders), should they be prevalent in the training dataset, rather than on the causal image markers of interest. This would thereby limit their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g. distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with the simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
    摘要 深度学习模型在复杂的医学影像分类任务中,即便其结论依赖于训练数据中普遍存在的虚假相关(即混杂因素)、而非真正具有因果意义的影像标志,也能取得良好表现;但这会限制其在整个人群中的泛化能力。基于反事实图像生成的可解释性方法可以暴露这些混杂因素,却没有提供缓解偏差的策略。在本工作中,我们提出了首个端到端训练框架,同时整合:(i)流行的去偏分类器(如分布鲁棒优化,DRO),以避免模型依附于虚假相关;(ii)反事实图像生成,以揭示与任务相关、可泛化的影像标志。此外,我们提出了一个新指标——虚假相关依附分数(SCLS),用于量化分类器对反事实图像所暴露的虚假相关的依赖程度。通过在两个公共数据集(含模拟与真实视觉伪影)上的全面实验,我们证明该去偏方法:(i)学到了在整个人群中可泛化的标志;(ii)成功忽略虚假相关,专注于潜在的疾病病理。
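One of the debiasing classifiers named above, (group) distributionally robust optimization, centers on exponentiated re-weighting of per-group losses so the worst-off group dominates the update. A sketch following the standard Group DRO recipe of Sagawa et al. (the step size eta is an illustrative choice, not the paper's setting):

```python
import torch

class GroupDRO:
    """Maintain one weight per group; upweight groups with high loss."""

    def __init__(self, n_groups: int, eta: float = 0.01):
        self.q = torch.ones(n_groups) / n_groups
        self.eta = eta

    def loss(self, per_sample_loss: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
        # Average loss per group (zero for groups absent from the batch).
        group_loss = torch.stack([
            per_sample_loss[group == g].mean() if (group == g).any()
            else per_sample_loss.new_zeros(())
            for g in range(len(self.q))
        ])
        # Exponentiated-gradient update of the group weights.
        with torch.no_grad():
            self.q = self.q * torch.exp(self.eta * group_loss)
            self.q = self.q / self.q.sum()
        # Weighted loss: gradients flow through group_loss only.
        return (self.q * group_loss).sum()
```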

VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

  • paper_url: http://arxiv.org/abs/2308.11662
  • repo_url: https://github.com/ccychongyanchen/vqatherapycrowdsourcing
  • paper_authors: Chongyan Chen, Samreen Anjum, Danna Gurari
  • for: 这篇论文是关于视觉问答任务的研究,旨在更好地理解不同人对同一张图片的问题提出不同的答案的原因。
  • methods: 该论文引入了第一个可视地将每个答案与每个视觉问题相关联的数据集,称为VQAAnswerTherapy。然后,该论文提出了两个新的问题:一是判断视觉问题是否有唯一的答案基础,二是找到所有答案基础的地方。
  • results: 该论文使用现代算法对这两个新问题进行了评估,以示其在这些问题上的成功和缺点。数据集和评估服务器可以在https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/上公开获取。
    Abstract Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
    摘要 视觉问答是预测关于图像的问题之答案的任务。鉴于不同的人可能对同一个视觉问题给出不同的答案,我们希望通过答案定位(answer grounding)来更好地理解其原因。我们提出了首个将每个视觉问题的每个不同答案进行视觉定位的数据集,称为VQAAnswerTherapy。随后,我们提出了两个新问题:判断一个视觉问题是否只有单一的答案定位,以及定位所有的答案定位。我们对现代算法在这两个新问题上进行了基准测试,以展示它们的长处与不足。数据集和评估服务器公开于 https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/。

SupEuclid: Extremely Simple, High Quality OoD Detection with Supervised Contrastive Learning and Euclidean Distance

  • paper_url: http://arxiv.org/abs/2308.10973
  • repo_url: None
  • paper_authors: Jarrod Haas
  • for: 本研究旨在提出一种简单而有效的Out-of-Distribution(OoD)检测方法,可以在标准benchmark上达到state-of-the-art的结果。
  • methods: 本研究使用监督对比学习(SCL)训练ResNet18,并以欧氏距离作为评分规则进行评估。
  • results: 研究发现,使用SCL训练的ResNet18可以在近和远OoD检测benchmark上达到state-of-the-art的结果,无需使用更复杂的方法或更大的模型。
    Abstract Out-of-Distribution (OoD) detection has developed substantially in the past few years, with available methods approaching, and in a few cases achieving, perfect data separation on standard benchmarks. These results generally involve large or complex models, pretraining, exposure to OoD examples or extra hyperparameter tuning. Remarkably, it is possible to achieve results that can exceed many of these state-of-the-art methods with a very simple method. We demonstrate that ResNet18 trained with Supervised Contrastive Learning (SCL) produces state-of-the-art results out-of-the-box on near and far OoD detection benchmarks using only Euclidean distance as a scoring rule. This may obviate the need in some cases for more sophisticated methods or larger models, and at the very least provides a very strong, easy to use baseline for further experimentation and analysis.
    摘要 近年来,分布外(OoD)检测取得了长足进展,现有方法在标准基准上已接近、个别情况下甚至达到了完美的数据分离。这些结果通常依赖大型或复杂的模型、预训练、接触OoD样本或额外的超参数调优。值得注意的是,一种极其简单的方法就能取得超越许多最先进方法的结果。我们证明,仅以欧氏距离作为评分规则,使用监督对比学习(SCL)训练的ResNet18即可在近、远OoD检测基准上取得开箱即用的最先进结果。这在某些情况下可能免除对更复杂方法或更大模型的需求,至少也为后续实验与分析提供了一个非常强大、易于使用的基线。
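The scoring rule really is this simple: fit per-class means on in-distribution features from the SCL-trained encoder, then score a test point by its Euclidean distance to the nearest mean. A numpy sketch (thresholding and the exact feature layer are left open; the function names are illustrative):

```python
import numpy as np

def fit_class_means(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """features: (N, D) penultimate-layer embeddings from the SCL-trained
    ResNet18; labels: (N,) in-distribution class ids."""
    return np.stack([features[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def ood_score(x: np.ndarray, means: np.ndarray) -> np.ndarray:
    """Lower = more in-distribution: distance to the nearest class mean."""
    dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=-1)  # (M, C)
    return dists.min(axis=1)
```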

MRI Field-transfer Reconstruction with Limited Data: Regularization by Neural Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10968
  • repo_url: None
  • paper_authors: Guoyao Shen, Yancheng Zhu, Hernan Jara, Sean B. Andersson, Chad W. Farris, Stephan Anderson, Xin Zhang
  • for: 高品质的MRI重建方法
  • methods: 以神经风格迁移与去噪引擎作为先验的正则化重建方法(RNST)
  • results: 在1.5T和3T临床MRI扫描数据上验证,能够显著提升图像质量
    Abstract Recent works have demonstrated success in MRI reconstruction using deep learning-based models. However, most reported approaches require training on a task-specific, large-scale dataset. Regularization by denoising (RED) is a general pipeline which embeds a denoiser as a prior for image reconstruction. The potential of RED has been demonstrated for multiple image-related tasks such as denoising, deblurring and super-resolution. In this work, we propose a regularization by neural style transfer (RNST) method to further leverage the priors from the neural transfer and denoising engine. This enables RNST to reconstruct a high-quality image from a noisy low-quality image with different image styles and limited data. We validate RNST with clinical MRI scans from 1.5T and 3T and show that RNST can significantly boost image quality. Our results highlight the capability of the RNST framework for MRI reconstruction and the potential for reconstruction tasks with limited data.
    摘要 近期研究表明,基于深度学习的模型可以成功进行MRI重建。然而,大多数已报道的方法需要在特定任务的大规模数据集上训练。去噪正则化(RED)是一种通用流程,它将去噪器嵌入为图像重建的先验;RED的潜力已在去噪、去模糊和超分辨率等多种图像任务中得到验证。在这项工作中,我们提出了基于神经风格迁移的正则化方法(RNST),以进一步利用神经风格迁移与去噪引擎中的先验。这使得RNST能够在图像风格各异、数据有限的情况下,从低质量的含噪图像中重建高质量图像。我们使用1.5T和3T的临床MRI扫描数据验证了RNST,结果显示RNST能够显著提升图像质量。我们的结果突显了RNST框架在MRI重建中的能力,以及其在数据有限的重建任务中的潜力。
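RED's appeal is that, under mild conditions on the denoiser f, the regularizer's gradient is simply x − f(x), so one reconstruction iteration is a plain gradient step. A sketch for undersampled MRI with a k-space sampling mask, where f would be RNST's style-transfer-based denoising engine (step sizes and the measurement model below are illustrative assumptions):

```python
import torch

def red_step(x, y, mask, denoiser, mu=1.0, lam=0.1):
    """One RED iteration for undersampled MRI:
       x <- x - mu * ( A^H (A x - y) + lam * (x - f(x)) ),
    with A = mask * FFT2 and f the plug-in denoiser.
    x: real image estimate; y: masked k-space measurements (complex);
    mask: binary sampling mask. mu/lam are illustrative step sizes."""
    k = mask * torch.fft.fft2(x)
    data_grad = torch.fft.ifft2(mask * (k - y)).real  # gradient of data term
    reg_grad = x - denoiser(x)                        # RED regularizer gradient
    return x - mu * (data_grad + lam * reg_grad)
```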

BundleSeg: A versatile, reliable and reproducible approach to white matter bundle segmentation

  • paper_url: http://arxiv.org/abs/2308.10958
  • repo_url: None
  • paper_authors: Etienne St-Onge, Kurt G Schilling, Francois Rheault
  • for: 提供一种可靠、可复现且快速的白质纤维束提取方法
  • methods: 将迭代配准过程与新开发的精确流线搜索算法相结合,无需纤维束图(tractogram)聚类或简化假设即可高效地分割流线
  • results: 相比最先进的分割方法,具有更好的可重复性与可复现性,且速度显著提升;更高的精度和更小的变异为神经信息学研究提供了有价值的工具,提高了基于纤维追踪的白质通路研究的敏感性与特异性
    Abstract This work presents BundleSeg, a reliable, reproducible, and fast method for extracting white matter pathways. The proposed method combines an iterative registration procedure with a recently developed precise streamline search algorithm that enables efficient segmentation of streamlines without the need for tractogram clustering or simplifying assumptions. We show that BundleSeg achieves improved repeatability and reproducibility than state-of-the-art segmentation methods, with significant speed improvements. The enhanced precision and reduced variability in extracting white matter connections offer a valuable tool for neuroinformatic studies, increasing the sensitivity and specificity of tractography-based studies of white matter pathways.
    摘要 本工作提出了BundleSeg,一种可靠、可复现且快速的白质纤维束提取方法。该方法将迭代配准过程与新开发的精确流线搜索算法相结合,无需纤维束图聚类或简化假设即可高效地分割流线。我们表明,BundleSeg相比最先进的分割方法具有更好的可重复性与可复现性,且速度显著提升。在提取白质连接时更高的精度与更小的变异,为神经信息学研究提供了有价值的工具,提高了基于纤维追踪的白质通路研究的敏感性与特异性。

CamP: Camera Preconditioning for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.10902
  • repo_url: None
  • paper_authors: Keunhong Park, Philipp Henzler, Ben Mildenhall, Jonathan T. Barron, Ricardo Martin-Brualla
  • for: 获取物体及大规模场景的高保真3D重建
  • methods: 通过一个代理问题计算白化变换,消除相机参数之间的相关性并归一化其影响,并在联合优化过程中将该变换用作相机参数的预条件子
  • results: 在Mip-NeRF 360数据集的场景上显著提升重建质量:相比不优化相机的最先进NeRF方法(如Zip-NeRF),误差(RMSE)降低67%;相比采用SCNeRF相机参数化的最先进联合优化方法,误差降低29%
    Abstract Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes. However, NeRFs require accurate camera parameters as input -- inaccurate camera parameters result in blurry renderings. Extrinsic and intrinsic camera parameters are usually estimated using Structure-from-Motion (SfM) methods as a pre-processing step to NeRF, but these techniques rarely yield perfect estimates. Thus, prior works have proposed jointly optimizing camera parameters alongside a NeRF, but these methods are prone to local minima in challenging settings. In this work, we analyze how different camera parameterizations affect this joint optimization problem, and observe that standard parameterizations exhibit large differences in magnitude with respect to small perturbations, which can lead to an ill-conditioned optimization problem. We propose using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and we propose to use this transform as a preconditioner for the camera parameters during joint optimization. Our preconditioned camera optimization significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF. Our approach is easy to implement, does not significantly increase runtime, can be applied to a wide variety of camera parameterizations, and can straightforwardly be incorporated into other NeRF-like models.
    摘要 神经辐射场(NeRF)经过优化后可以获得物体及大规模场景的高保真3D重建。然而,NeRF需要准确的相机参数作为输入——不准确的相机参数会导致渲染模糊。相机的内参和外参通常在NeRF之前通过运动恢复结构(SfM)方法估计,但这些技术很少能给出完美的估计。因此,已有工作提出将相机参数与NeRF联合优化,但这些方法在困难场景下容易陷入局部极小值。在本工作中,我们分析了不同的相机参数化方式对该联合优化问题的影响,并观察到标准参数化在面对微小扰动时表现出巨大的量级差异,这可能导致病态的优化问题。我们提出通过一个代理问题计算白化变换,以消除相机参数之间的相关性并归一化其影响,并在联合优化过程中将该变换用作相机参数的预条件子。我们的预条件相机优化显著提升了Mip-NeRF 360数据集场景的重建质量:相比不优化相机的最先进NeRF方法(如Zip-NeRF),误差(RMSE)降低了67%;相比采用SCNeRF相机参数化的最先进联合优化方法,误差降低了29%。我们的方法易于实现,不会显著增加运行时间,可应用于多种相机参数化方式,并可直接整合到其他类NeRF模型中。
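The preconditioning idea can be sketched schematically: probe how small camera-parameter perturbations move a proxy quantity (such as projected points), form the resulting covariance, and whiten with its inverse square root so unit steps have comparable, decorrelated effects. This is a schematic reading of the abstract, not CamP's exact construction; proxy_fn and the finite-difference probing are illustrative.

```python
import numpy as np

def whitening_preconditioner(proxy_fn, cam: np.ndarray, eps: float = 1e-5):
    """Build P so that optimizing z (with cam = P @ z) makes unit steps in z
    have comparable, decorrelated effects on proxy_fn (e.g. projected points).
    proxy_fn: R^p -> R^m, differentiated here by finite differences."""
    p = cam.size
    base = proxy_fn(cam)
    J = np.stack([(proxy_fn(cam + eps * np.eye(p)[i]) - base) / eps
                  for i in range(p)], axis=1)       # (m, p) Jacobian
    cov = J.T @ J                                   # parameter "effect" covariance
    w, V = np.linalg.eigh(cov + 1e-12 * np.eye(p))
    P = V @ np.diag(1.0 / np.sqrt(w)) @ V.T         # cov^{-1/2}
    return P  # in z-space, the effect Jacobian J @ P satisfies (JP)^T(JP) = I
```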

Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation

  • paper_url: http://arxiv.org/abs/2308.10898
  • repo_url: None
  • paper_authors: Xueyi Liu, Bin Wang, He Wang, Li Yi
  • for: 本研究旨在解决少样本条件下物理感知的铰接网格生成问题:在仅观察到少量样本的铰接物体数据集的情况下,学习一个能够生成多样、视觉保真且物理有效的铰接网格的模型。
  • methods: 我们提出了两项关键创新:1)基于分而治之思想的层次化网格变形生成模型,通过从大规模刚性网格中借用可迁移的变形模式来缓解少样本难题;2)物理感知的变形校正方案,以促进物理上合理的生成。
  • results: 我们在6个铰接类别上进行了大量实验,证明该方法在少样本设定下生成的铰接网格在多样性、视觉保真度和物理有效性方面均优于以往方法;消融实验进一步验证了两项创新各自的贡献。项目页面及代码见 https://meowuu7.github.io/few-arti-obj-gen。
    Abstract We study the problem of few-shot physically-aware articulated mesh generation. By observing an articulated object dataset containing only a few examples, we wish to learn a model that can generate diverse meshes with high visual fidelity and physical validity. Previous mesh generative models either have difficulties in depicting a diverse data space from only a few examples or fail to ensure physical validity of their samples. Regarding the above challenges, we propose two key innovations, including 1) a hierarchical mesh deformation-based generative model based upon the divide-and-conquer philosophy to alleviate the few-shot challenge by borrowing transferrable deformation patterns from large scale rigid meshes and 2) a physics-aware deformation correction scheme to encourage physically plausible generations. We conduct extensive experiments on 6 articulated categories to demonstrate the superiority of our method in generating articulated meshes with better diversity, higher visual fidelity, and better physical validity over previous methods in the few-shot setting. Further, we validate solid contributions of our two innovations in the ablation study. Project page with code is available at https://meowuu7.github.io/few-arti-obj-gen.
    摘要 我们研究少样本条件下物理感知的铰接网格生成问题。通过观察仅含少量样本的铰接物体数据集,我们希望学习一个模型,使其能够生成兼具高视觉保真度与物理有效性的多样化网格。以往的网格生成模型要么难以仅凭少量样本刻画多样的数据空间,要么无法保证其样本的物理有效性。针对上述挑战,我们提出了两项关键创新:1)基于分而治之思想的层次化网格变形生成模型,通过从大规模刚性网格中借用可迁移的变形模式来缓解少样本难题;2)物理感知的变形校正方案,以促进物理上合理的生成。我们在6个铰接类别上进行了大量实验,证明该方法在少样本设定下生成的铰接网格在多样性、视觉保真度和物理有效性方面均优于以往方法。此外,消融实验验证了两项创新各自的扎实贡献。项目页面及代码见 https://meowuu7.github.io/few-arti-obj-gen。

Can Language Models Learn to Listen?

  • paper_url: http://arxiv.org/abs/2308.10897
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar
  • for: 本文面向二人社交互动场景,根据说话人的话语为听者生成恰当的面部反应。
  • methods: 方法采用自回归模型预测听者的反应,即一段经VQ-VAE量化的听者面部姿态序列;模型将量化后的原子动作元素视为额外的语言token,输入基于Transformer的大语言模型。
  • results: 定量指标与定性用户研究表明,生成的听者动作流畅且反映语言语义;模型展示了利用口语文本时间与语义信息的能力。
    Abstract We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
    摘要 我们提出了一个框架,用于在二人社交互动中根据说话人的话语生成听者恰当的面部反应。给定带时间戳的说话人语句转写,我们的方法自回归地预测听者的反应:一段经VQ-VAE量化的听者面部姿态序列。由于姿态是语言的组成部分,我们提出将量化后的原子动作元素视为额外的语言token,输入基于Transformer的大语言模型。以仅在文本上预训练的语言模型权重初始化Transformer,得到的听者反应质量显著高于从头训练的Transformer。我们通过定量指标和定性用户研究表明,生成的听者动作流畅且反映了语言语义。在评估中,我们分析了模型利用口语文本的时间与语义信息的能力。项目页面:https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

Differentiable Shadow Mapping for Efficient Inverse Graphics

  • paper_url: http://arxiv.org/abs/2308.10896
  • repo_url: https://github.com/mworchel/differentiable-shadow-mapping
  • paper_authors: Markus Worchel, Marc Alexa
  • for: 该论文研究如何在三角网格的可微渲染中高效地生成阴影。
  • methods: 论文提出将预滤波阴影贴图(一种从光源视角渲染来近似阴影的技术)与现有的可微光栅化器相结合,从而获得可微的可见性信息。
  • results: 在多个逆向图形学问题上,可微阴影贴图在精度相近的情况下比可微光传输模拟快若干个数量级,而不带阴影的可微光栅化则常常无法收敛。
    Abstract We show how shadows can be efficiently generated in differentiable rendering of triangle meshes. Our central observation is that pre-filtered shadow mapping, a technique for approximating shadows based on rendering from the perspective of a light, can be combined with existing differentiable rasterizers to yield differentiable visibility information. We demonstrate at several inverse graphics problems that differentiable shadow maps are orders of magnitude faster than differentiable light transport simulation with similar accuracy -- while differentiable rasterization without shadows often fails to converge.
    摘要 我们展示了如何在三角网格的可微渲染中高效地生成阴影。我们的核心观察是:预滤波阴影贴图——一种通过从光源视角渲染来近似阴影的技术——可以与现有的可微光栅化器相结合,从而得到可微的可见性信息。我们在多个逆向图形学问题上证明,可微阴影贴图在精度相近的情况下,比可微光传输模拟快若干个数量级;而不带阴影的可微光栅化则常常无法收敛。
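Pre-filtered shadow mapping replaces the hard depth comparison with a test on filtered shadow-map statistics, which is what makes visibility smooth and hence differentiable. Variance shadow maps are one classic instance (the paper's exact filtering choice may differ); a numpy sketch of the Chebyshev test:

```python
import numpy as np

def soft_visibility(mean_d, mean_d2, frag_d, min_var=1e-6):
    """Chebyshev upper bound from a pre-filtered (e.g. mipmapped/blurred)
    shadow map storing E[d] and E[d^2] per texel, queried at fragment depth
    frag_d (all arrays share a shape). The result is smooth in its inputs,
    so gradients can flow through the shadow test."""
    var = np.maximum(mean_d2 - mean_d ** 2, min_var)
    p_max = var / (var + (frag_d - mean_d) ** 2)
    # Fully lit when the fragment is no deeper than the stored mean depth.
    return np.where(frag_d <= mean_d, 1.0, p_max)
```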

Unlocking Accuracy and Fairness in Differentially Private Image Classification

  • paper_url: http://arxiv.org/abs/2308.10888
  • repo_url: None
  • paper_authors: Leonard Berrada, Soham De, Judy Hanwen Shen, Jamie Hayes, Robert Stanforth, David Stutz, Pushmeet Kohli, Samuel L. Smith, Borja Balle
  • For: 这个研究的目的是让机器学习模型在保护敏感资料的情况下训练,以确保对敏感资料的训练不会泄露敏感信息。* Methods: 这个研究使用了差异调教(Differential Privacy)的金标准框架,以提供正式的隐私保证。* Results: 研究发现,使用预先训练的基础模型,并在这些模型上实现差异调教,可以实现与非隐私模型相似的准确性水平,甚至在资料分布shift的情况下仍能保持高度的准确性。
    Abstract Privacy-preserving machine learning aims to train models on private data without leaking sensitive information. Differential privacy (DP) is considered the gold standard framework for privacy-preserving training, as it provides formal privacy guarantees. However, compared to their non-private counterparts, models trained with DP often have significantly reduced accuracy. Private classifiers are also believed to exhibit larger performance disparities across subpopulations, raising fairness concerns. The poor performance of classifiers trained with DP has prevented the widespread adoption of privacy preserving machine learning in industry. Here we show that pre-trained foundation models fine-tuned with DP can achieve similar accuracy to non-private classifiers, even in the presence of significant distribution shifts between pre-training data and downstream tasks. We achieve private accuracies within a few percent of the non-private state of the art across four datasets, including two medical imaging benchmarks. Furthermore, our private medical classifiers do not exhibit larger performance disparities across demographic groups than non-private models. This milestone to make DP training a practical and reliable technology has the potential to widely enable machine learning practitioners to train safely on sensitive datasets while protecting individuals' privacy.
    摘要 隐私保护机器学习的目标是在私有数据上训练模型而不泄露敏感信息。差分隐私(DP)被视为隐私保护训练的金标准框架,因为它能提供形式化的隐私保证。然而,与非隐私模型相比,使用DP训练的模型准确率通常显著下降;私有分类器还被认为在不同子群体之间存在更大的性能差异,引发公平性方面的担忧。DP训练模型的较差表现阻碍了隐私保护机器学习在业界的广泛采用。在本文中,我们表明:对预训练的基础模型进行DP微调,即使在预训练数据与下游任务之间存在显著分布偏移的情况下,也能达到与非隐私分类器相近的准确率。我们在四个数据集(包括两个医学影像基准)上取得了与非隐私最先进水平相差仅几个百分点的私有准确率。此外,我们的私有医学分类器在不同人口群体之间并未表现出比非隐私模型更大的性能差异。这一使DP训练成为实用、可靠技术的里程碑,有望让机器学习从业者在保护个人隐私的同时,安全地在敏感数据集上进行训练。
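DP fine-tuning follows the usual DP-SGD recipe: clip each example's gradient to a fixed norm, add calibrated Gaussian noise, then average. A per-step sketch (the clip norm and noise multiplier are illustrative; a real run would use a library such as Opacus plus a privacy accountant to track the (epsilon, delta) budget):

```python
import torch

def dp_sgd_step(model, per_sample_grads, lr=0.1, clip=1.0, noise_mult=1.1):
    """per_sample_grads: list over examples, each a list of tensors matching
    model.parameters(). Clip each example's total gradient norm to `clip`,
    add N(0, (noise_mult * clip)^2) noise to the sum, average, and step."""
    params = list(model.parameters())
    summed = [torch.zeros_like(p) for p in params]
    for grads in per_sample_grads:
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip / (norm + 1e-6)).clamp(max=1.0)  # per-example clipping
        for s, g in zip(summed, grads):
            s += g * scale
    n = len(per_sample_grads)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = (s + noise_mult * clip * torch.randn_like(s)) / n
            p -= lr * noisy
```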

Vision Transformer Pruning Via Matrix Decomposition

  • paper_url: http://arxiv.org/abs/2308.10839
  • repo_url: None
  • paper_authors: Tianyi Sun
  • for: 降低存储、运行时内存和计算需求
  • methods: 实现并比较多种矩阵分解方法(奇异值分解、四种QR分解变体及LU分解),在保留重要特征的同时降低线性投影的维度与复杂度
  • results: 通过将各矩阵分解方法的准确率与原始Github仓库中的准确率进行对比,最终选定奇异值分解作为实现目标的方法
    Abstract This is a further development of Vision Transformer Pruning via matrix decomposition. The purpose of the Vision Transformer Pruning is to prune the dimension of the linear projection of the dataset by learning their associated importance score in order to reduce the storage, run-time memory, and computational demands. In this paper we further reduce dimension and complexity of the linear projection by implementing and comparing several matrix decomposition methods while preserving the generated important features. We end up selected the Singular Value Decomposition as the method to achieve our goal by comparing the original accuracy scores in the original Github repository and the accuracy scores of using those matrix decomposition methods, including Singular Value Decomposition, four versions of QR Decomposition, and LU factorization.
    摘要 这是基于矩阵分解对视觉Transformer剪枝(Vision Transformer Pruning)的进一步发展。视觉Transformer剪枝的目的,是通过学习数据集线性投影各维度的重要性分数来对其进行裁剪,从而降低存储、运行时内存与计算需求。在本文中,我们通过实现并比较多种矩阵分解方法,在保留所生成的重要特征的同时,进一步降低线性投影的维度与复杂度。通过将原始Github仓库中的准确率与使用各矩阵分解方法(奇异值分解、四种QR分解变体及LU分解)得到的准确率进行比较,我们最终选定奇异值分解来实现目标。
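SVD-based pruning amounts to replacing one d_in×d_out projection with two thin factors, which saves parameters and FLOPs whenever the kept rank k is well below min(d_in, d_out). A sketch of factorizing a single linear layer (the layer sizes and k are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, k: int) -> nn.Sequential:
    """Replace `layer` (weight: out x in) with rank-k factors:
    W ~= (U_k S_k) V_k^T, i.e. Linear(in, k) followed by Linear(k, out)."""
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    first = nn.Linear(layer.in_features, k, bias=False)
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:k]                     # (k, in)
    second.weight.data = U[:, :k] * S[:k]          # (out, k), columns scaled by S
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

proj = nn.Linear(768, 768)
approx = factorize_linear(proj, k=128)  # 768*768 -> 2*768*128 parameters
```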

EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2308.10832
  • repo_url: https://github.com/gmberton/eigenplaces
  • paper_authors: Gabriele Berton, Gabriele Trivigno, Barbara Caputo, Carlo Masone
  • for: 这篇论文面向视觉地点识别任务,旨在提升模型对不同视角的鲁棒性。
  • methods: 提出的方法EigenPlaces在来自不同视角的图像上训练神经网络,将视角鲁棒性嵌入所学的全局描述符中;该方法通过对训练数据聚类,显式地向模型呈现同一兴趣点的不同视角,且无需额外监督。
  • results: 论文在文献中最全面的数据集上进行了实验,结果表明EigenPlaces在大多数数据集上超越了此前的最先进方法,同时训练所需GPU显存减少60%,描述符缩小50%。
    Abstract Visual Place Recognition is a task that aims to predict the place of an image (called query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, to train our neural network on images from different point of views, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. The selection of this points of interest is done without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in literature, finding that EigenPlaces is able to outperform previous state of the art on the majority of datasets, while requiring 60\% less GPU memory for training and using 50\% smaller descriptors. The code and trained models for EigenPlaces are available at {\small{\url{https://github.com/gmberton/EigenPlaces}}, while results with any other baseline can be computed with the codebase at {\small{\url{https://github.com/gmberton/auto_VPR}}.
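The retrieval step the abstract describes, matching a query descriptor against a database of geotagged photos, can be sketched in a few lines of numpy; the descriptors below are random stand-ins for the ones a trained EigenPlaces model would produce.

```python
import numpy as np

def retrieve(query_desc, db_descs, top_k=5):
    """Return indices of the database images most similar to the query.

    Descriptors are assumed L2-normalized, so the dot product equals the
    cosine similarity used for ranking."""
    sims = db_descs @ query_desc            # (N,) similarity scores
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 512))
db /= np.linalg.norm(db, axis=1, keepdims=True)   # normalize database descriptors
q = rng.normal(size=512)
q /= np.linalg.norm(q)

print(retrieve(q, db))   # indices of the 5 most similar geotagged images
```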

Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10820
  • repo_url: None
  • paper_authors: Miaoyu Li, Ying Fu, Ji Liu, Yulun Zhang
  • for: This paper proposes a hyperspectral image (HSI) reconstruction method based on the deep unfolding framework, addressing the insufficient match between existing methods and HSI data.
  • methods: A data module with a pixel-adaptive descent step is used, and a Non-local Spectral Transformer (NST) is introduced to emphasize the 3D characteristics of HSI. In addition, the Fast Fourier Transform (FFT) improves feature expression across different stages and depths, addressing the stage-interaction problem.
  • results: Experiments show that, compared with existing HSI reconstruction methods, the proposed method achieves superior reconstruction performance in both simulated and real scenes. The code is available at https://github.com/MyuLi/PADUT.
    Abstract Hyperspectral Image (HSI) reconstruction has made gratifying progress with the deep unfolding framework by formulating the problem into a data module and a prior module. Nevertheless, existing methods still face the problem of insufficient matching with HSI data. The issues lie in three aspects: 1) fixed gradient descent step in the data module while the degradation of HSI is agnostic in the pixel-level. 2) inadequate prior module for 3D HSI cube. 3) stage interaction ignoring the differences in features at different stages. To address these issues, in this work, we propose a Pixel Adaptive Deep Unfolding Transformer (PADUT) for HSI reconstruction. In the data module, a pixel adaptive descent step is employed to focus on pixel-level agnostic degradation. In the prior module, we introduce the Non-local Spectral Transformer (NST) to emphasize the 3D characteristics of HSI for recovering. Moreover, inspired by the diverse expression of features in different stages and depths, the stage interaction is improved by the Fast Fourier Transform (FFT). Experimental results on both simulated and real scenes exhibit the superior performance of our method compared to state-of-the-art HSI reconstruction methods. The code is released at: https://github.com/MyuLi/PADUT.
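The data module's unfolded gradient step is easy to illustrate. Below is a hedged numpy sketch of the general idea: the step size is a per-pixel map rather than a scalar. In PADUT that map would be predicted by a network, and the masking degradation operator here is only a toy stand-in for the real sensing model.

```python
import numpy as np

def data_step(x, y, forward, adjoint, alpha):
    """One unfolded gradient step x <- x - alpha * A^T(Ax - y).

    `alpha` is a per-pixel step-size map (same shape as x) rather than a
    single scalar, mirroring the pixel-adaptive idea; here it is fixed,
    whereas in PADUT it would be learned."""
    return x - alpha * adjoint(forward(x) - y)

rng = np.random.default_rng(0)
mask = (rng.random((32, 32)) > 0.5).astype(float)   # toy degradation: random masking
forward = lambda x: mask * x
adjoint = lambda r: mask * r                        # masking is self-adjoint

x_true = rng.random((32, 32))
y = forward(x_true)
x = np.zeros_like(x_true)
alpha = np.full_like(x, 0.9)                        # stand-in for a learned map

for _ in range(50):
    x = data_step(x, y, forward, adjoint, alpha)

print(np.abs((x - x_true) * mask).max())            # observed pixels are recovered
```

In the full method, a prior module (the NST) would be interleaved between such data steps to regularize the unobserved content.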

Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.10814
  • repo_url: None
  • paper_authors: Natalia Frumkin, Dibakar Gope, Diana Marculescu
  • for: Improving the accuracy and efficiency of quantized neural networks.
  • methods: Evolutionary search combined with an infoNCE loss to traverse the highly non-smooth test loss landscape.
  • results: Improves the top-1 accuracy of a fully quantized ViT-Base at 3-bit, 4-bit, and 8-bit weight quantization levels, and remains robust in extreme quantization scenarios.
    Abstract Quantization scale and bit-width are the most important parameters when considering how to quantize a neural network. Prior work focuses on optimizing quantization scales in a global manner through gradient methods (gradient descent \& Hessian analysis). Yet, when applying perturbations to quantization scales, we observe a very jagged, highly non-smooth test loss landscape. In fact, small perturbations in quantization scale can greatly affect accuracy, yielding a $0.5-0.8\%$ accuracy boost in 4-bit quantized vision transformers (ViTs). In this regime, gradient methods break down, since they cannot reliably reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to effectively traverse the non-smooth landscape. Additionally, we propose using an infoNCE loss, which not only helps combat overfitting on the small calibration dataset ($1,000$ images) but also makes traversing such a highly non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully quantized ViT-Base by $10.30\%$, $0.78\%$, and $0.15\%$ for $3$-bit, $4$-bit, and $8$-bit weight quantization levels. Extensive experiments on a variety of CNN and ViT architectures further demonstrate its robustness in extreme quantization scenarios. Our code is available at https://github.com/enyac-group/evol-q
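A minimal sketch of the core idea, with a toy MSE calibration objective standing in for the paper's infoNCE loss on a small calibration set: gradient-free evolutionary search can optimize a quantization scale even on a jagged loss surface where gradient methods break down.

```python
import numpy as np

def quantize(w, scale, bits=4):
    """Uniform symmetric quantization of weights w with a given scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def calib_loss(w, scale):
    """Stand-in calibration objective (Evol-Q uses an infoNCE loss on
    roughly 1,000 calibration images; here we just use quantization MSE)."""
    return np.mean((w - quantize(w, scale)) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

# (1+lambda) evolutionary search over the quantization scale: mutate the
# parent, keep the best candidate, repeat. No gradients are needed.
parent = np.abs(w).max() / 7.0
for _ in range(100):
    children = np.abs(parent * (1 + 0.1 * rng.normal(size=8)))
    candidates = np.concatenate(([parent], children))
    parent = min(candidates, key=lambda s: calib_loss(w, s))

print(parent, calib_loss(w, parent))
```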

cs.AI - 2023-08-22

Furnishing Sound Event Detection with Language Model Abilities

  • paper_url: http://arxiv.org/abs/2308.11530
  • repo_url: None
  • paper_authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
  • for: This work explores the generation capacity of language models (LMs) beyond the visual domain, specifically for sound event detection (SED).
  • methods: A concise framework that aligns audio features and text features to perform sound event classification and temporal localization, consisting of an acoustic encoder, a contrastive alignment module, and a decoupled language decoder.
  • results: The model generates accurate sound event detection sequences. Compared with conventional approaches it is more concise and comprehensive, since the language model directly leverages its semantic capabilities to generate the sequences; different decoupling modules are also studied to demonstrate their effectiveness for timestamp capture and event classification.
    Abstract Recently, the ability of language models (LMs) has attracted increasing attention in visual cross-modality. In this paper, we further explore the generation capacity of LMs for sound event detection (SED), beyond the visual domain. Specifically, we propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder that generates temporal and event sequences from the audio characteristics. Compared with conventional works that require complicated processing and barely utilize the limited audio features, our model is more concise and comprehensive since the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method achieves accurate sequences of sound event detection.
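The contrastive module's objective can be sketched as a symmetric InfoNCE loss over paired audio and text embeddings. This is a generic numpy illustration, not the paper's exact formulation, and the embeddings are random placeholders for encoder outputs.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.

    Row i of each matrix is assumed to describe the same event, so the
    diagonal of the similarity matrix holds the positive pairs."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                 # (B, B) similarity matrix
    lp_a2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_t2a = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(np.mean(np.diag(lp_a2t)) + np.mean(np.diag(lp_t2a))) / 2

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(16, 256)), rng.normal(size=(16, 256))))
```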

TrackFlow: Multi-Object Tracking with Normalizing Flows

  • paper_url: http://arxiv.org/abs/2308.11513
  • repo_url: None
  • paper_authors: Gianluca Mancusi, Aniello Panariello, Angelo Porrello, Matteo Fabbri, Simone Calderara, Rita Cucchiara
  • for: Improving multi-object tracking performance, especially in multi-modal settings.
  • methods: A deep probabilistic model computes the likelihood of candidate associations, improving tracking-by-detection algorithms.
  • results: Experiments on both simulated and real benchmarks show that the approach consistently enhances the performance of several tracking-by-detection algorithms.
    Abstract The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they imply these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
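A hedged sketch of the costing idea: fit a density model on features of known-correct associations, score candidate pairs by their negative log-likelihood, and solve the resulting assignment problem. A Gaussian mixture stands in for the paper's normalizing flow, and the four pairwise features named in the comment are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

# Fit a density model of the features of *correct* associations (a Gaussian
# mixture as a simple stand-in for a normalizing flow).
rng = np.random.default_rng(0)
correct_feats = rng.normal(scale=0.3, size=(5_000, 4))  # e.g. IoU, 2D motion, 3D gap, appearance
density = GaussianMixture(n_components=3, random_state=0).fit(correct_feats)

# At test time: cost of matching track i to detection j is the negative
# log-likelihood of their pairwise features under that model.
n_tracks, n_dets = 6, 6
pair_feats = rng.normal(scale=0.5, size=(n_tracks, n_dets, 4))
costs = -density.score_samples(pair_feats.reshape(-1, 4)).reshape(n_tracks, n_dets)

rows, cols = linear_sum_assignment(costs)   # Hungarian matching on the costs
print(list(zip(rows, cols)))
```

Because the model captures the joint distribution of the cues, no hand-tuned weights are needed to balance the heterogeneous costs.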

User Identity Linkage in Social Media Using Linguistic and Social Interaction Features

  • paper_url: http://arxiv.org/abs/2308.11684
  • repo_url: None
  • paper_authors: Despoina Chatzakou, Juan Soler-Company, Theodora Tsikrika, Leo Wanner, Stefanos Vrochidis, Ioannis Kompatsiaris
  • for: Preventing the spread of abusive/illegal content on social media by linking the multiple accounts through which users retain their online identity.
  • methods: Machine-learning-based detection using multiple attributes of users' online activity to determine whether two or more virtual identities belong to the same natural person.
  • results: The model's efficacy is demonstrated on abusive and terrorism-related Twitter content.
    Abstract Social media users often hold several accounts in their effort to multiply the spread of their thoughts, ideas, and viewpoints. In the particular case of objectionable content, users tend to create multiple accounts to bypass the combating measures enforced by social media platforms and thus retain their online identity even if some of their accounts are suspended. User identity linkage aims to reveal social media accounts likely to belong to the same natural person so as to prevent the spread of abusive/illegal activities. To this end, this work proposes a machine learning-based detection model, which uses multiple attributes of users' online activity in order to identify whether two or more virtual identities belong to the same real natural person. The models efficacy is demonstrated on two cases on abusive and terrorism-related Twitter content.
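The pairwise formulation can be sketched as follows: represent each account by an activity feature vector, describe a pair by its absolute feature differences, and train a classifier to predict whether the pair belongs to one person. The features and model below are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def account_pairs(n, same_person):
    """Toy pairs: each account is a 20-dim vector of linguistic/interaction
    features (e.g. punctuation rates, posting-hour histogram); a pair is
    summarized by the absolute differences of the two vectors."""
    base = rng.normal(size=(n, 20))
    other = base + 0.1 * rng.normal(size=(n, 20)) if same_person \
        else rng.normal(size=(n, 20))
    return np.abs(base - other)

X = np.vstack([account_pairs(500, True), account_pairs(500, False)])
y = np.array([1] * 500 + [0] * 500)          # 1 = same natural person

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))                        # accuracy on the toy data
```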

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

  • paper_url: http://arxiv.org/abs/2308.11483
  • repo_url: None
  • paper_authors: Pouya Pezeshkpour, Estevam Hruschka
  • for: This paper investigates the robustness of Large Language Models (LLMs) on multiple-choice questions, a task commonly used to study their reasoning and fact-retrieval capabilities.
  • methods: The authors probe the sensitivity of LLMs to the order of answer options, reordering options across benchmarks and testing few-shot demonstrations.
  • results: When options are reordered, LLM performance varies by roughly 13% to 75% across benchmarks. Detailed analysis suggests the sensitivity arises when the model is uncertain between the top-2/3 choices, with positional bias favoring certain placements; two calibration approaches recover up to 8 percentage points.
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs robustness on the task of multiple-choice questions -- commonly adopted task to study reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 75% in LLMs on different benchmarks, when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and specific options placements may favor certain prediction between those top choices depending on the question caused by positional bias. We also identify patterns in top-2 choices that amplify or mitigate the model's bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs' predictions, leading to up to 8 percentage points improvement across different models and benchmarks.
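A sketch of how such sensitivity can be measured: evaluate the same question under every ordering of its options and compare the per-ordering accuracies. The `ask_llm` function below is a purely hypothetical mock that simulates a positional bias; in practice it would wrap a real model call.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def ask_llm(question, options):
    """Hypothetical stand-in for an LLM call (a mock). It answers correctly
    60% of the time and otherwise falls back to the first listed option,
    simulating the positional bias the paper describes."""
    return "Paris" if rng.random() < 0.6 else options[0]

def per_order_accuracy(question, options, gold, trials=200):
    """Accuracy under every ordering of the options; the spread between
    the best and worst ordering is the sensitivity being measured."""
    accs = {}
    for perm in itertools.permutations(options):
        hits = sum(ask_llm(question, list(perm)) == gold for _ in range(trials))
        accs[perm] = hits / trials
    return accs

accs = per_order_accuracy("Capital of France?",
                          ["Paris", "London", "Rome", "Berlin"], gold="Paris")
print(max(accs.values()) - min(accs.values()))  # gap induced purely by order
```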

Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2308.11480
  • repo_url: https://github.com/servicenow/broad-openood
  • paper_authors: Charles Guille-Escuret, Pierre-André Noël, Ioannis Mitliagkas, David Vazquez, Joao Monteiro
  • for: Improving the reliability of deployed machine learning systems by developing methods to detect out-of-distribution (OOD) inputs.
  • methods: Existing OOD detection methods are evaluated across five categorized types of distribution shift; since they mainly detect unknown classes and behave inconsistently on other shifts, the authors propose an ensemble based on a generative model (a Gaussian mixture) of existing detection scores.
  • results: Existing methods perform inconsistently across shift types, whereas the proposed ensemble offers a more consistent and comprehensive solution. The benchmark is released as BROAD (Benchmarking Resilience Over Anomaly Diversity).
    Abstract Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.
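The proposed ensemble can be sketched directly from the abstract: fit a Gaussian mixture to the vectors of scores that the individual detectors assign to in-distribution data, then use the negative log-likelihood under that model as the ensemble OOD score. The detector count and score values below are toy assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Each in-distribution sample is described by the vector of scores that the
# individual OOD detectors assign to it (3 detectors here, toy values).
rng = np.random.default_rng(0)
id_scores = rng.normal(loc=[0.1, 0.2, 0.0], scale=0.1, size=(5_000, 3))

# Generative model of in-distribution detector scores; at test time the
# negative log-likelihood under this model is the ensemble OOD score.
gmm = GaussianMixture(n_components=4, random_state=0).fit(id_scores)

test = np.array([[0.1, 0.2, 0.0],    # looks in-distribution
                 [0.9, -0.5, 0.7]])  # unusual detector profile, likely OOD
print(-gmm.score_samples(test))      # higher = more anomalous
```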

Revisiting column-generation-based matheuristic for learning classification trees

  • paper_url: http://arxiv.org/abs/2308.11477
  • repo_url: https://github.com/krooonal/col_gen_estimator
  • paper_authors: Krunal Kishor Patel, Guy Desaulniers, Andrea Lodi
  • for: Improving methods for solving classification problems in machine learning with decision tree models.
  • methods: A column-generation-based matheuristic, with a modified subproblem model, data-dependent master-problem constraints used as cutting planes, and a separation model that generates data points whose constraints are violated by the linear programming relaxation.
  • results: For multiclass classification instances the modifications significantly reduce the number of subproblems, and computational results show that they improve scalability.
    Abstract Decision trees are highly interpretable models for solving classification problems in machine learning (ML). The standard ML algorithms for training decision trees are fast but generate suboptimal trees in terms of accuracy. Other discrete optimization models in the literature address the optimality problem but only work well on relatively small datasets. Firat et al. (2020) proposed a column-generation-based heuristic approach for learning decision trees. This approach improves scalability and can work with large datasets. In this paper, we describe improvements to this column generation approach. First, we modify the subproblem model to significantly reduce the number of subproblems in multiclass classification instances. Next, we show that the data-dependent constraints in the master problem are implied, and use them as cutting planes. Furthermore, we describe a separation model to generate data points for which the linear programming relaxation solution violates their corresponding constraints. We conclude by presenting computational results that show that these modifications result in better scalability.

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

  • paper_url: http://arxiv.org/abs/2308.11473
  • repo_url: https://github.com/buaacyw/it3d-text-to-3d
  • paper_authors: Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin
  • for: Improving text-to-3D generation by distilling knowledge from large text-to-image diffusion models (LDMs).
  • methods: An image-to-image pipeline, powered by LDMs, generates posed high-quality multi-view images from renderings of coarse 3D models; a discriminator and a Diffusion-GAN dual training strategy then guide the training of the 3D model.
  • results: Experiments show the method outperforms baseline approaches, mitigating problems such as over-saturation, inadequate detailing, and unrealistic outputs in text-to-3D generation.
    Abstract Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.
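The Diffusion-GAN training signal can be sketched with plain binary cross-entropy: the discriminator treats LDM-synthesized multi-view images as real and renders of the current 3D model as fake. The probabilities below are random placeholders for actual discriminator outputs.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on discriminator probabilities."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Toy discriminator outputs: probability that an image is "real". In IT3D,
# "real" = LDM-synthesized multi-view images and "fake" = renders of the 3D
# model being optimized, so the 3D model is pushed toward the appearance of
# the synthesized views.
rng = np.random.default_rng(0)
d_on_synth = rng.uniform(0.6, 1.0, size=64)   # D(synthesized multi-view images)
d_on_render = rng.uniform(0.0, 0.4, size=64)  # D(renders of current 3D model)

d_loss = bce(d_on_synth, np.ones(64)) + bce(d_on_render, np.zeros(64))
g_loss = bce(d_on_render, np.ones(64))        # 3D model tries to fool D
print(d_loss, g_loss)
```

Using the discriminator rather than matching the synthesized views pixel by pixel is what makes the view inconsistency of diffusion outputs tolerable.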

Dynamic Open Vocabulary Enhanced Safe-landing with Intelligence (DOVESEI)

  • paper_url: http://arxiv.org/abs/2308.11471
  • repo_url: https://github.com/mistlab/dovesei
  • paper_authors: Haechan Mark Bong, Rongge Zhang, Ricardo de Azambuja, Giovanni Beltrame
  • for: Safe landing for urban airborne robots (UAVs).
  • methods: Visual servoing built on open-vocabulary image segmentation, which adapts to various scenarios with minimal adjustments and without extensive data accumulation for refining internal models.
  • results: Experiments show the system can successfully execute landing maneuvers starting from an altitude of 100 meters, and the introduced dynamic focus mechanism improves the landing success rate almost tenfold compared to global segmentation.
    Abstract This work targets what we consider to be the foundational step for urban airborne robots, a safe landing. Our attention is directed toward what we deem the most crucial aspect of the safe landing perception stack: segmentation. We present a streamlined reactive UAV system that employs visual servoing by harnessing the capabilities of open vocabulary image segmentation. This approach can adapt to various scenarios with minimal adjustments, bypassing the necessity for extensive data accumulation for refining internal models, thanks to its open vocabulary methodology. Given the limitations imposed by local authorities, our primary focus centers on operations originating from altitudes of 100 meters. This choice is deliberate, as numerous preceding works have dealt with altitudes up to 30 meters, aligning with the capabilities of small stereo cameras. Consequently, we leave the remaining 20 meters to be navigated using conventional 3D path planning methods. Utilizing monocular cameras and image segmentation, our findings demonstrate the system's capability to successfully execute landing maneuvers at altitudes as low as 20 meters. However, this approach is vulnerable to intermittent and occasionally abrupt fluctuations in the segmentation between frames in a video stream. To address this challenge, we enhance the image segmentation output by introducing what we call a dynamic focus: a masking mechanism that self-adjusts according to the current landing stage. This dynamic focus guides the control system to avoid regions beyond the drone's safety radius projected onto the ground, thus mitigating the problems caused by these fluctuations. Through the implementation of this supplementary layer, our experiments achieved an almost tenfold improvement in landing success rate compared to global segmentation. All the source code is open source and available online (github.com/MISTLab/DOVESEI).
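The dynamic focus can be sketched as an altitude-dependent circular mask over the segmentation output; the pinhole-style ground-sample-distance computation below is a crude assumption standing in for the system's actual camera model.

```python
import numpy as np

def dynamic_focus(seg_mask, altitude_m, safety_radius_m, m_per_px_at_1m):
    """Zero out segmentation outside the drone's projected safety radius.

    The radius in pixels shrinks as the drone descends (ground-sample
    distance grows with altitude for a fixed camera), so the focus region
    self-adjusts with the landing stage. This is a crude pinhole
    approximation; the real system would derive it from calibration."""
    h, w = seg_mask.shape
    m_per_px = m_per_px_at_1m * altitude_m             # ground-sample distance
    radius_px = safety_radius_m / m_per_px
    yy, xx = np.mgrid[:h, :w]
    inside = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius_px ** 2
    return seg_mask * inside                           # mask out far regions

seg = np.ones((240, 320))                              # "safe to land" everywhere
focused = dynamic_focus(seg, altitude_m=20.0,
                        safety_radius_m=1.5, m_per_px_at_1m=0.002)
print(int(focused.sum()), "pixels remain inside the safety radius")
```

Because only pixels inside the safety radius can influence the controller, frame-to-frame flicker in the far field no longer destabilizes the landing.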

Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.11464
  • repo_url: None
  • paper_authors: Yun-Hin Chan, Rui Zhou, Running Zhao, Zhihan Jiang, Edith C. -H. Ngai
  • for: Extending model-homogeneous federated learning (FL) methods so they can cope with system heterogeneity.
  • methods: Internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, augment the similarity of deep-layer gradients without requiring additional communication between clients.
  • results: Extensive experiments validate the effectiveness of InCo Aggregation, showing internal cross-layer gradients to be a promising avenue for enhancing performance in heterogeneous FL.
    Abstract Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) the smoother gradients distributions indicate the higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverags internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogenous FL.
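A hedged sketch of mixing a shallow-layer gradient into a deep layer's update on the server. The truncation-based shape projection below is an assumption made purely for illustration, not the paper's mechanism.

```python
import numpy as np

def inco_mix(shallow_grad, deep_grad, gamma=0.5):
    """Blend a shallow-layer gradient into a deep layer's update.

    Shapes rarely match across layers, so the shallow gradient is first
    mapped into the deep layer's shape; truncation/zero-padding here is a
    crude stand-in for whatever projection an implementation would use."""
    flat = shallow_grad.ravel()
    n = deep_grad.size
    proj = np.zeros(n)
    proj[:min(n, flat.size)] = flat[:n]
    proj = proj.reshape(deep_grad.shape)
    return (1 - gamma) * deep_grad + gamma * proj

rng = np.random.default_rng(0)
g_shallow = rng.normal(size=(64, 64))    # smoother, more similar across clients
g_deep = rng.normal(size=(32, 32))       # noisier, less similar across clients
print(inco_mix(g_shallow, g_deep).shape) # (32, 32): deep layer keeps its shape
```

The blend exploits the paper's observation that shallow-layer gradients are more similar across clients, transferring some of that similarity to the deep layers.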

A Survey on Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.11455
  • repo_url: https://github.com/microsoft/esvit
  • paper_authors: Tobias Uelwer, Jan Robine, Stefan Sylvius Wagner, Marc Höftmann, Eric Upschulte, Sebastian Konietzny, Maike Behrendt, Stefan Harmeling
  • for: This paper provides a comprehensive review of methods for learning image representations without supervision; these representations can then be used in downstream tasks such as classification or object detection.
  • methods: The survey systematizes self-supervised representation learning methods in a unified notation, points out their similarities and differences, and proposes a taxonomy that sets the methods in relation to each other.
  • results: A meta-study of recent experimental results shows that the quality of these representations is close to that of supervised learning, while requiring no labeled images.
    Abstract Learning meaningful representations is at the heart of many tasks in the field of modern machine learning. Recently, a lot of methods were introduced that allow learning of image representations without supervision. These representations can then be used in downstream tasks like classification or object detection. The quality of these representations is close to supervised learning, while no labeled images are needed. This survey paper provides a comprehensive review of these methods in a unified notation, points out similarities and differences of these methods, and proposes a taxonomy which sets these methods in relation to each other. Furthermore, our survey summarizes the most-recent experimental results reported in the literature in form of a meta-study. Our survey is intended as a starting point for researchers and practitioners who want to dive into the field of representation learning.

Convergence guarantee for consistency models

  • paper_url: http://arxiv.org/abs/2308.11449
  • repo_url: None
  • paper_authors: Junlong Lyu, Zhitang Chen, Shoubo Feng
  • for: Providing the first convergence guarantees for Consistency Models (CMs), one-step generative models that can generate samples comparable to those of Diffusion Models.
  • methods: Under basic assumptions on the score-matching error, the consistency error, and the smoothness of the data distribution, the analysis shows that CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error.
  • results: The guarantees (1) hold under $L^2$-accurate score and consistency assumptions, (2) require no strong assumptions on the data distribution such as a log-Sobolev inequality, and (3) scale polynomially in all parameters; the error can be further reduced with the Multistep Consistency Sampling procedure.
    Abstract We provide the first convergence guarantees for the Consistency Models (CMs), a newly emerging type of one-step generative models that can generate samples comparable to those generated by Diffusion Models. Our main result is that, under the basic assumptions on score-matching errors, consistency errors and smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. Our results (1) hold for $L^2$-accurate score and consistency assumptions (rather than $L^\infty$-accurate); (2) do not require strong assumptions on the data distribution such as a log-Sobolev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). We also show that the Multistep Consistency Sampling procedure can further reduce the error compared to one-step sampling, supporting the original statement of "Consistency Models, Yang Song 2023". Our results further imply a TV error guarantee when applying some Langevin-based modifications to the output distributions.
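Multistep consistency sampling is simple to sketch: one-step generation from the largest noise level, then alternating re-noising and re-applying the consistency function at smaller noise levels. The `consistency_fn` below is a dummy placeholder for a trained model, included only so the loop runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_fn(x, t):
    """Hypothetical trained consistency model f(x, t): maps a noisy sample
    at noise level t directly to the data manifold. This is a dummy
    shrinkage map standing in for a real network."""
    return x / (1.0 + t)

def multistep_sample(shape, ts=(80.0, 20.0, 5.0), eps=0.002):
    """Multistep Consistency Sampling: one-step generation from t_max,
    then alternate re-noising to a smaller t and re-applying f. This is
    the procedure the paper shows further reduces the one-step error."""
    x = consistency_fn(rng.normal(size=shape) * ts[0], ts[0])
    for t in ts[1:]:
        x_noisy = x + np.sqrt(t ** 2 - eps ** 2) * rng.normal(size=shape)
        x = consistency_fn(x_noisy, t)
    return x

print(multistep_sample((4, 2)))
```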

Aspect-oriented Opinion Alignment Network for Aspect-Based Sentiment Classification

  • paper_url: http://arxiv.org/abs/2308.11447
  • repo_url: https://github.com/aone-nlp/absa-aoan
  • paper_authors: Xueyi Liu, Rui Hou, Yanglei Gan, Da Luo, Changlin Li, Xiaojun Shi, Qiao Liu
  • for: Addressing the semantic mismatch problem in multi-aspect sentences to improve the accuracy of aspect-based (fine-grained) sentiment classification.
  • methods: A novel Aspect-oriented Opinion Alignment Network (AOAN) with a neighboring span enhanced module and a multi-perspective attention mechanism that emphasizes the contextual association between opinion words and the corresponding aspect.
  • results: The model achieves state-of-the-art results on three benchmark datasets.
    Abstract Aspect-based sentiment classification is a crucial problem in fine-grained sentiment analysis, which aims to predict the sentiment polarity of the given aspect according to its context. Previous works have made remarkable progress in leveraging attention mechanism to extract opinion words for different aspects. However, a persistent challenge is the effective management of semantic mismatches, which stem from attention mechanisms that fall short in adequately aligning opinions words with their corresponding aspect in multi-aspect sentences. To address this issue, we propose a novel Aspect-oriented Opinion Alignment Network (AOAN) to capture the contextual association between opinion words and the corresponding aspect. Specifically, we first introduce a neighboring span enhanced module which highlights various compositions of neighboring words and given aspects. In addition, we design a multi-perspective attention mechanism that align relevant opinion information with respect to the given aspect. Extensive experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art results. The source code is available at https://github.com/AONE-NLP/ABSA-AOAN.

Exploration of Rashomon Set Assists Explanations for Medical Data

  • paper_url: http://arxiv.org/abs/2308.11446
  • repo_url: None
  • paper_authors: Katarzyna Kobylińska, Mateusz Krzyziński, Rafał Machowicz, Mariusz Adamek, Przemysław Biecek
  • For: This paper addresses the problem of relying solely on performance metrics in machine learning modeling, particularly in medical and healthcare studies, by introducing a novel process to explore Rashomon set models.
  • Methods: The proposed approach uses the Rashomon_DETECT algorithm to identify the most different models within the Rashomon set, and the Profile Disparity Index (PDI) to quantify differences in variable effects among models.
  • Results: The approach is demonstrated on a foundational case study of predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients, as well as on other medical datasets, showing its effectiveness and versatility in various contexts.
    Abstract The machine learning modeling process conventionally culminates in selecting a single model that maximizes a selected performance metric. However, this approach leads to abandoning a more profound analysis of slightly inferior models. Particularly in medical and healthcare studies, where the objective extends beyond predictions to valuable insight generation, relying solely on performance metrics can result in misleading or incomplete conclusions. This problem is particularly pertinent when dealing with a set of models with performance close to maximum one, known as $\textit{Rashomon set}$. Such a set can be numerous and may contain models describing the data in a different way, which calls for comprehensive analysis. This paper introduces a novel process to explore Rashomon set models, extending the conventional modeling approach. The cornerstone is the identification of the most different models within the Rashomon set, facilitated by the introduced $\texttt{Rashomon_DETECT}$ algorithm. This algorithm compares profiles illustrating prediction dependencies on variable values generated by eXplainable Artificial Intelligence (XAI) techniques. To quantify differences in variable effects among models, we introduce the Profile Disparity Index (PDI) based on measures from functional data analysis. To illustrate the effectiveness of our approach, we showcase its application in predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients - a foundational case study. Additionally, we benchmark our approach on other medical data sets, demonstrating its versatility and utility in various contexts.
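A sketch of the workflow under toy assumptions: fit several models, keep those within epsilon of the best score (the Rashomon set), and compare their prediction profiles along one variable. The simple L2 distance between profiles below is only a crude stand-in for the paper's PDI, which builds on functional data analysis measures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [m.fit(X_tr, y_tr) for m in (LogisticRegression(max_iter=1000),
                                      RandomForestClassifier(random_state=0),
                                      DecisionTreeClassifier(max_depth=4))]
scores = np.array([m.score(X_te, y_te) for m in models])

# Rashomon set: every model whose performance is within eps of the best one.
eps = 0.05
rashomon = [m for m, s in zip(models, scores) if s >= scores.max() - eps]
print(len(rashomon), "of", len(models), "models are in the Rashomon set")

def profile(model, X, feature, grid):
    """Partial-dependence-style profile: mean prediction as one feature varies."""
    vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        vals.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(vals)

# Crude stand-in for the Profile Disparity Index: distance between the
# profiles of the two best models.
grid = np.linspace(-2, 2, 20)
a, b = (profile(models[i], X_te, feature=1, grid=grid)
        for i in np.argsort(-scores)[:2])
print(np.linalg.norm(a - b) / np.sqrt(len(grid)))
```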

Inferring gender from name: a large scale performance evaluation study

  • paper_url: http://arxiv.org/abs/2308.12381
  • repo_url: None
  • paper_authors: Kriste Krstovski, Yao Lu, Ye Xu
  • for: A large-scale performance evaluation of algorithms and software products for inferring gender from names, together with two new hybrid approaches designed to achieve better performance.
  • methods: Analyses are performed using a variety of large annotated name datasets, and two new hybrid approaches are proposed.
  • results: No single existing approach performs best in all settings, whereas both newly proposed hybrid approaches outperform any single existing approach.
    Abstract A person's gender is a crucial piece of information when performing research across a wide range of scientific disciplines, such as medicine, sociology, political science, and economics, to name a few. However, in increasing instances, especially given the proliferation of big data, gender information is not readily available. In such cases researchers need to infer gender from readily available information, primarily from persons' names. While inferring gender from name may raise some ethical questions, the lack of viable alternatives means that researchers have to resort to such approaches when the goal justifies the means - in the majority of such studies the goal is to examine patterns and determinants of gender disparities. The necessity of name-to-gender inference has generated an ever-growing domain of algorithmic approaches and software products. These approaches have been used throughout the world in academia, industry, governmental and non-governmental organizations. Nevertheless, the existing approaches have yet to be systematically evaluated and compared, making it challenging to determine the optimal approach for future research. In this work, we conducted a large scale performance evaluation of existing approaches for name-to-gender inference. Analysis are performed using a variety of large annotated datasets of names. We further propose two new hybrid approaches that achieve better performance than any single existing approach.

A Survey on Large Language Model based Autonomous Agents

  • paper_url: http://arxiv.org/abs/2308.11432
  • repo_url: https://github.com/paitesanshi/llm-agent-survey
  • paper_authors: Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
  • for: A comprehensive survey of LLM-based autonomous agents, covering agent construction, application domains, and evaluation strategies.
  • methods: The survey builds on large language models (LLMs) trained on vast amounts of web knowledge and proposes a unified framework for constructing LLM-based agents that encompasses the majority of previous work.
  • results: The survey summarizes applications of LLM-based AI agents in social science, natural science, and engineering, discusses common evaluation strategies, and presents challenges and future directions; related references are maintained at https://github.com/Paitesanshi/LLM-Agent-Survey.
    Abstract Autonomous agents have long been a prominent research topic in the academic community. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from the human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating autonomous agents based on LLMs. To harness the full potential of LLMs, researchers have devised diverse agent architectures tailored to different applications. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of autonomous agents from a holistic perspective. More specifically, our focus lies in the construction of LLM-based agents, for which we propose a unified framework that encompasses a majority of the previous work. Additionally, we provide a summary of the various applications of LLM-based AI agents in the domains of social science, natural science, and engineering. Lastly, we discuss the commonly employed evaluation strategies for LLM-based AI agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository for the related references at https://github.com/Paitesanshi/LLM-Agent-Survey.

A Study on the Impact of Non-confounding Covariates on the Inferential Performance of Methods based on the Potential Outcome Framework

  • paper_url: http://arxiv.org/abs/2308.11676
  • repo_url: None
  • paper_authors: Yonghe Zhao, Shuai Fu, Huiyan Sun
  • for: The paper is written to provide a unified graphical framework for causal inference models based on the Potential Outcome Framework (POF), and to analyze the influence of non-confounding covariates on the inference performance of these models.
  • methods: The paper uses a graphical framework to present the underlying principles of causal inference models based on the POF, and conducts extensive experiments on synthetic datasets to validate the theoretical conclusions.
  • results: The paper finds that the optimal scenario for eliminating confounding bias is for the covariates to exclusively encompass confounders, and that adjustment variables contribute to more accurate inferences in the task of inferring counterfactual outcomes.
    Abstract The Potential Outcome Framework (POF) plays a prominent role in the field of causal inference. Most causal inference models based on the POF (CIMs-B-POF) are designed for eliminating confounding bias and default to an underlying assumption of Confounding Covariates. This assumption posits that the covariates consist solely of confounders. However, the assumption of Confounding Covariates is challenging to maintain in practice, particularly when dealing with high-dimensional covariates. While certain methods have been proposed to differentiate the distinct components of covariates prior to conducting causal inference, the consequences of treating non-confounding covariates as confounders remain unclear. This ambiguity poses a potential risk when applying the CIMs-B-POF in practical scenarios. In this paper, we present a unified graphical framework for the CIMs-B-POF, which greatly enhances the comprehension of these models' underlying principles. Using this graphical framework, we quantitatively analyze the extent to which the inference performance of CIMs-B-POF is influenced when incorporating various types of non-confounding covariates, such as instrumental variables, mediators, colliders, and adjustment variables. The key findings are: in the task of eliminating confounding bias, the optimal scenario is for the covariates to exclusively encompass confounders; in the subsequent task of inferring counterfactual outcomes, the adjustment variables contribute to more accurate inferences. Furthermore, extensive experiments conducted on synthetic datasets consistently validate these theoretical conclusions.
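The paper's central point can be reproduced in a toy linear simulation: adjusting for a genuine confounder removes bias, while adjusting for an instrumental variable amplifies it. All distributions and coefficients below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 100_000, 2.0                       # true treatment effect

Z = rng.normal(size=n)                      # instrument: affects T only
U = rng.normal(size=n)                      # confounder: affects T and Y
T = Z + U + rng.normal(size=n)              # treatment
Y = tau * T + U + rng.normal(size=n)        # outcome

def ols_effect(covariates):
    """OLS coefficient on T when regressing Y on [T, covariates..., 1]."""
    Xmat = np.column_stack([T] + covariates + [np.ones(n)])
    beta, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
    return beta[0]

print("no adjustment:        ", ols_effect([]))    # confounded, above tau
print("adjust for confounder:", ols_effect([U]))   # close to tau, unbiased
print("adjust for instrument:", ols_effect([Z]))   # bias amplified
```

Running this shows the estimate drifting further from tau when the instrument is treated as a confounder, the risk the paper warns about for CIMs-B-POF.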

AIxArtist: A First-Person Tale of Interacting with Artificial Intelligence to Escape Creative Block

  • paper_url: http://arxiv.org/abs/2308.11424
  • repo_url: None
  • paper_authors: Makayla Lewis
  • for: Exploring how artificial intelligence (AI) can support artists' creativity, and what it means for AI to be explainable in this context.
  • methods: First-person research in which an HCI researcher interacts with ChatGPT and Midjourney while trying to escape a creative block.
  • results: The interactions surface questions that require further discussion and exploration in the XAIxArts community: transparency of attribution, the creation process, the ethics of asking, and inspiration versus copying.
    Abstract The future of the arts and artificial intelligence (AI) is promising as technology advances. As the use of AI in design becomes more widespread, art practice may not be a human-only art form and could instead become a digitally integrated experience. With enhanced creativity and collaboration, arts and AI could work together towards creating artistic outputs that are visually appealing and meet the needs of the artist and viewer. While it is uncertain how far the integration will go, arts and AI will likely influence one another. This workshop pictorial puts forward first-person research that shares interactions between an HCI researcher and AI as they try to escape the creative block. The pictorial paper explores two questions: How can AI support artists' creativity, and what does it mean to be explainable in this context? HIs, ChatGPT and Midjourney were engaged; the result was a series of reflections that require further discussion and explorations in the XAIxArts community: Transparency of attribution, the creation process, ethics of asking, and inspiration vs copying.

TurboViT: Generating Fast Vision Transformers via Generative Architecture Search

  • paper_url: http://arxiv.org/abs/2308.11421
  • repo_url: None
  • paper_authors: Alexander Wong, Saad Abbasi, Saeejith Nair
  • for: Achieving vision transformer architecture designs with high throughput and low computational complexity.
  • methods: Generative architecture search (GAS) is used to generate efficient hierarchical vision transformer designs, centered on mask unit attention and Q-pooling design patterns.
  • results: The resulting TurboViT design matches the accuracy of 10 other state-of-the-art efficient vision transformer designs on ImageNet-1K at significantly lower architectural and computational complexity, with >3.21x lower latency and >3.18x higher throughput than FasterViT-0 in low-latency and batch-processing scenarios.
    Abstract Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures have made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural computational complexity (>2.47$\times$ smaller than FasterViT-0 while achieving same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.

Tensor Regression

  • paper_url: http://arxiv.org/abs/2308.11419
  • repo_url: https://github.com/tensorly/torch
  • paper_authors: Jiani Liu, Ce Zhu, Zhen Long, Yipeng Liu
  • for: This paper is written for students, researchers, and practitioners who work with high dimensional data and are interested in tensor-based regression analysis.
  • methods: The paper provides a systematic study and analysis of tensor-based regression models and their applications, including a comprehensive review of existing methods, their core ideas, and theoretical characteristics.
  • results: The paper covers the basics of tensor-based regression, provides examples of how to use existing methods to solve specific regression tasks with multiway data, and discusses available datasets and software resources for efficient implementation.
    Abstract Regression analysis is a key area of interest in the field of data analysis and machine learning which is devoted to exploring the dependencies between variables, often using vectors. The emergence of high dimensional data in technologies such as neuroimaging, computer vision, climatology and social networks, has brought challenges to traditional data representation methods. Tensors, as high dimensional extensions of vectors, are considered as natural representations of high dimensional data. In this book, the authors provide a systematic study and analysis of tensor-based regression models and their applications in recent years. It groups and illustrates the existing tensor-based regression methods and covers the basics, core ideas, and theoretical characteristics of most tensor-based regression methods. In addition, readers can learn how to use existing tensor-based regression methods to solve specific regression tasks with multiway data, what datasets can be selected, and what software packages are available to start related work as soon as possible. Tensor Regression is the first thorough overview of the fundamentals, motivations, popular algorithms, strategies for efficient implementation, related applications, available datasets, and software resources for tensor-based regression analysis. It is essential reading for all students, researchers and practitioners of working on high dimensional data.
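As a worked example of the simplest case in this family, here is a rank-1 regression on matrix-valued covariates fit by alternating least squares, in plain numpy. It is a generic illustration of the model class the book covers, not code from the book, and all sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 8, 6, 500
u_true, v_true = rng.normal(size=p), rng.normal(size=q)
X = rng.normal(size=(n, p, q))                       # matrix-valued covariates
y = np.einsum('nij,i,j->n', X, u_true, v_true) + 0.01 * rng.normal(size=n)

# Rank-1 tensor (here: matrix) regression y ~ <u v^T, X>, fit by alternating
# least squares: with one factor fixed, each subproblem is an ordinary
# linear regression.
u, v = rng.normal(size=p), rng.normal(size=q)
for _ in range(20):
    Fu = np.einsum('nij,j->ni', X, v)                # features for the u-step
    u, *_ = np.linalg.lstsq(Fu, y, rcond=None)
    Fv = np.einsum('nij,i->nj', X, u)                # features for the v-step
    v, *_ = np.linalg.lstsq(Fv, y, rcond=None)

W_hat = np.outer(u, v)
print(np.linalg.norm(W_hat - np.outer(u_true, v_true)))  # close to 0
```

The low-rank coefficient uses p + q parameters instead of p * q, which is the scalability argument for tensor regression with high-dimensional covariates.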

Interpretable Distribution-Invariant Fairness Measures for Continuous Scores

  • paper_url: http://arxiv.org/abs/2308.11375
  • repo_url: None
  • paper_authors: Ann-Kristin Becker, Oana Dumitrasc, Klaus Broelemann
  • for: Extending algorithmic fairness measures from binary decisions to continuous scores.
  • methods: A distributionally invariant family of fairness measures based on the Wasserstein distance, which is easily computable and well suited to comparing biases across models, datasets, or time points.
  • results: The proposed measures quantify and interpret the strength of group disparities better than ROC-based fairness measures, capturing significant biases that ROC-based measures miss, as demonstrated on the most commonly used fairness benchmark datasets.
    Abstract Measures of algorithmic fairness are usually discussed in the context of binary decisions. We extend the approach to continuous scores. So far, ROC-based measures have mainly been suggested for this purpose. Other existing methods depend heavily on the distribution of scores, are unsuitable for ranking tasks, or their effect sizes are not interpretable. Here, we propose a distributionally invariant version of fairness measures for continuous scores with a reasonable interpretation based on the Wasserstein distance. Our measures are easily computable and well suited for quantifying and interpreting the strength of group disparities as well as for comparing biases across different models, datasets, or time points. We derive a link between the different families of existing fairness measures for scores and show that the proposed distributionally invariant fairness measures outperform ROC-based fairness measures because they are more explicit and can quantify significant biases that ROC-based fairness measures miss. Finally, we demonstrate their effectiveness through experiments on the most commonly used fairness benchmark datasets.
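The core quantity is easy to sketch with scipy: the Wasserstein distance between the score distributions of two groups, expressed in score units. The beta-distributed scores below are toy data, and the paper's actual measures are interpretable, distribution-invariant variants built on top of this idea.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Continuous model scores for two protected groups (toy data).
scores_a = rng.beta(2, 5, size=5_000)        # e.g. group A credit scores
scores_b = rng.beta(2.5, 4.5, size=5_000)    # slightly shifted distribution

# Wasserstein-1 distance between the group score distributions; 0 means
# the groups are scored identically, and the value is in score units,
# which makes it directly interpretable and comparable across models.
print(wasserstein_distance(scores_a, scores_b))
```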

How Much Temporal Long-Term Context is Needed for Action Segmentation?

  • paper_url: http://arxiv.org/abs/2308.11358
  • repo_url: https://github.com/ltcontext/ltcontext
  • paper_authors: Emad Bahrami, Gianpiero Francesca, Juergen Gall
  • for: Temporal action segmentation.
  • methods: A transformer-based model with sparse attention that captures the full context of a video.
  • results: Modeling the full context of a video is necessary to obtain the best temporal action segmentation performance, as shown on 50Salads, Breakfast, and Assembly101.
    Abstract Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.

Semantic RGB-D Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.11356
  • repo_url: None
  • paper_authors: Shijie Li, Rong Li, Juergen Gall
  • for: Increasing the diversity of training data for RGB-D semantic image segmentation.
  • methods: Semantic RGB-D image synthesis: a generator for multi-modal data that separates the modal-independent information of the semantic layout from the modal-dependent information needed to generate an RGB and a depth image, paired with a discriminator enforcing semantic consistency and perceptual similarity.
  • results: The method outperforms previous uni-modal methods by a large margin, and mixing real and generated images during training significantly improves the accuracy of RGB-D semantic segmentation.
    Abstract Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas like homes, the collection is often limited to a small set of locations. As a consequence, the annotated images lack diversity in appearance and approaches for RGB-D semantic image segmentation tend to overfit the training data. In this paper, we thus introduce semantic RGB-D image synthesis to address this problem. It requires synthesising a realistic-looking RGB-D image for a given semantic label map. Current approaches, however, are uni-modal and cannot cope with multi-modal data. Indeed, we show that extending uni-modal approaches to multi-modal data does not perform well. In this paper, we therefore propose a generator for multi-modal data that separates modal-independent information of the semantic layout from the modal-dependent information that is needed to generate an RGB and a depth image, respectively. Furthermore, we propose a discriminator that ensures semantic consistency between the label maps and the generated images and perceptual similarity between the real and generated images. Our comprehensive experiments demonstrate that the proposed method outperforms previous uni-modal methods by a large margin and that the accuracy of an approach for RGB-D semantic segmentation can be significantly improved by mixing real and generated images during training.
    摘要 为RGB-D语义图像分割收集多样化的训练图像并不总是可行。特别是当机器人需要在住宅等隐私敏感环境中运行时,数据收集往往局限于少数几个地点。因此,标注图像在外观上缺乏多样性,RGB-D语义分割方法容易在训练数据上过拟合。为此,本文引入语义RGB-D图像合成来解决这一问题:给定语义标签图,生成一幅逼真的RGB-D图像。然而,现有方法是单模态的,无法处理多模态数据;我们的实验也表明,将单模态方法直接扩展到多模态数据效果不佳。因此,我们提出一种多模态数据生成器,将语义布局中与模态无关的信息和分别生成RGB图像与深度图像所需的模态相关信息分离开来。此外,我们还提出一种判别器,确保标签图与生成图像之间的语义一致性,以及真实图像与生成图像之间的感知相似性。全面的实验表明,所提方法大幅超越以往的单模态方法,并且在训练中混合真实图像与生成图像可以显著提高RGB-D语义分割的精度。

ProAgent: Building Proactive Cooperative AI with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11339
  • repo_url: https://github.com/PKU-Alignment/ProAgent
  • paper_authors: Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, Feng Yin, Yitao Liang, Yaodong Yang
  • for: 这篇论文旨在开发一种在人机合作中表现突出的前瞻性智能体(ProAgent),以实现更高效的协同作业。
  • methods: 论文利用大型语言模型(LLM)赋予智能体更高级别的智能行为,包括预测合作伙伴的下一步决策,并据此为自身制定更优的计划。
  • results: 实验表明,ProAgent在与其他AI智能体及人类代理模型合作时均表现出显著的性能优势,优于自博弈和基于种群训练等方法;此外,ProAgent具有高度的模块化与可解释性,可以轻松整合到各种协作场景中。
    Abstract Building AIs with adaptive behaviors in human-AI cooperation stands as a pivotal focus in AGI research. Current methods for developing cooperative agents predominantly rely on learning-based methods, where policy generalization heavily hinges on past interactions with specific teammates. These approaches constrain the agent's capacity to recalibrate its strategy when confronted with novel teammates. We propose \textbf{ProAgent}, a novel framework that harnesses large language models (LLMs) to fashion a \textit{pro}active \textit{agent} empowered with the ability to anticipate teammates' forthcoming decisions and formulate enhanced plans for itself. ProAgent excels at cooperative reasoning with the capacity to dynamically adapt its behavior to enhance collaborative efforts with teammates. Moreover, the ProAgent framework exhibits a high degree of modularity and interpretability, facilitating seamless integration to address a wide array of coordination scenarios. Experimental evaluations conducted within the framework of \textit{Overcook-AI} unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training in cooperation with AI agents. Further, when cooperating with human proxy models, its performance exhibits an average improvement exceeding 10\% compared to the current state-of-the-art, COLE. The advancement was consistently observed across diverse scenarios involving interactions with both AI agents of varying characteristics and human counterparts. These findings inspire future research for human-robot collaborations. For a hands-on demonstration, please visit \url{https://pku-proagent.github.io}.
    摘要 在人机合作中构建具有适应性行为的AI是AGI研究的核心焦点。目前开发合作智能体的方法主要依赖基于学习的方法,其策略泛化能力严重依赖与特定队友的既往交互,这限制了智能体在面对新队友时重新调整策略的能力。我们提出了ProAgent框架,利用大型语言模型(LLM)打造一个具备前瞻性的智能体,使其能够预测队友即将做出的决策并为自身制定更优的计划。ProAgent擅长合作推理,能够动态调整自身行为以增强与队友的协作;同时,该框架具有高度的模块化和可解释性,便于无缝集成以应对各种协调场景。在Overcook-AI框架下进行的实验评估显示,ProAgent在与AI智能体合作时显著优于五种基于自博弈和基于种群训练的方法;在与人类代理模型合作时,其性能相比当前最优方法COLE平均提升超过10%。这一优势在与不同特性的AI智能体及人类对应方交互的多种场景中均得到一致验证。这些发现将启发未来人机协作方面的研究。如需实际演示,请访问 \url{https://pku-proagent.github.io}。

On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems

  • paper_url: http://arxiv.org/abs/2308.11336
  • repo_url: None
  • paper_authors: Xiaocong Chen, Siyu Wang, Julian McAuley, Dietmar Jannach, Lina Yao
  • for: 本研究旨在探讨离线强化学习在推荐系统中的应用,特别是如何从离线数据集中学习用户偏好并将所学策略部署到在线环境。
  • methods: 本研究对推荐系统中的离线强化学习文献进行了系统性综述,梳理了用于学习用户偏好和行为的各类方法。
  • results: 研究指出,离线强化学习能够利用推荐系统中大量的离线数据,缓解在线交互带来的数据效率低下问题,并总结了该领域当前面临的挑战、机遇与未来研究方向。
    Abstract Reinforcement learning serves as a potent tool for modeling dynamic user interests within recommender systems, garnering increasing research attention of late. However, a significant drawback persists: its poor data efficiency, stemming from its interactive nature. The training of reinforcement learning-based recommender systems demands expensive online interactions to amass adequate trajectories, essential for agents to learn user preferences. This inefficiency renders reinforcement learning-based recommender systems a formidable undertaking, necessitating the exploration of potential solutions. Recent strides in offline reinforcement learning present a new perspective. Offline reinforcement learning empowers agents to glean insights from offline datasets and deploy learned policies in online settings. Given that recommender systems possess extensive offline datasets, the framework of offline reinforcement learning aligns seamlessly. Despite being a burgeoning field, works centered on recommender systems utilizing offline reinforcement learning remain limited. This survey aims to introduce and delve into offline reinforcement learning within recommender systems, offering an inclusive review of existing literature in this domain. Furthermore, we strive to underscore prevalent challenges, opportunities, and future pathways, poised to propel research in this evolving field.
    摘要 强化学习是在推荐系统中建模动态用户兴趣的有力工具,近来受到越来越多的研究关注。然而,它存在一个显著缺陷:由于其交互式的本质,数据效率很低。训练基于强化学习的推荐系统需要昂贵的在线交互来积累足够的轨迹,以便智能体学习用户偏好,这使得此类系统的构建异常困难,亟需探索潜在的解决方案。离线强化学习的最新进展提供了新的视角:它使智能体能够从离线数据集中汲取知识,并将所学策略部署到在线环境中。鉴于推荐系统拥有大量离线数据,离线强化学习框架与之天然契合。尽管这是一个新兴领域,目前利用离线强化学习研究推荐系统的工作仍然有限。本综述旨在介绍并深入探讨推荐系统中的离线强化学习,对该领域现有文献进行全面回顾,并着重指出普遍存在的挑战、机遇与未来方向,以推动这一不断发展领域的研究。

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.11331
  • repo_url: None
  • paper_authors: Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang
  • for: 本文提出一种数据驱动的自动模型增长算法,用于对比语言-图像预训练,能够处理持续增长的在线数据。
  • methods: 本文采用动态增长空间,在每个增长步骤中寻找最优架构以适应在线学习场景,并提出共享编码器以增强跨模态融合程度。
  • results: 相比现有方法,GrowCLIP在9个下游任务的零样本图像分类上平均top-1精度提高2.3%;在Flickr30K数据集的零样本图像检索上,top-1图像到文本召回率提高1.2%。
    Abstract Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specially, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. And the shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves 2.3% average top-1 accuracy on zero-shot image classification of 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP can improve 1.2% for top-1 image-to-text recall on Flickr30K dataset.
    摘要 跨模态预训练得益于从互联网收集的海量图文对,在各种下游任务中表现出色。在实践中,在线数据不断增长,这凸显了预训练模型从持续增长的数据中学习的能力的重要性。现有的跨模态预训练工作主要集中在固定架构的网络训练上;然而,考虑到现实应用中预训练数据的持续增长,限制模型容量并不现实。另一方面,利用当前模型中的知识来实现高效训练和更好的性能也十分重要。为解决上述问题,本文提出GrowCLIP,一种以持续增长的图文对为输入、面向对比语言-图像预训练的数据驱动自动模型增长算法。具体而言,我们采用动态增长空间,在每个增长步骤中寻找最优架构,以适应在线学习场景;并在增长空间中提出共享编码器,以增强跨模态融合程度。此外,我们还探究了不同维度上增长的效果,为未来跨模态模型架构设计提供参考。最后,我们采用带动量的参数继承(PIM)来保留已有知识并缓解局部极小困境。与现有方法相比,GrowCLIP在9个下游任务的零样本图像分类上平均top-1精度提高2.3%;在Flickr30K数据集的零样本图像检索上,top-1图像到文本召回率提高1.2%。

From Mundane to Meaningful: AI’s Influence on Work Dynamics – evidence from ChatGPT and Stack Overflow

  • paper_url: http://arxiv.org/abs/2308.11302
  • repo_url: None
  • paper_authors: Quentin Gallea
  • for: 这篇论文探讨了生成式AI如何带来编程生产力的大幅提升,以及这些强大新技术对工作和知识共享方式的影响。
  • methods: 论文采用准实验方法(双重差分),利用最大的在线编程社区Stack Overflow的使用情况,评估ChatGPT发布对编程问题的影响。
  • results: 研究发现,ChatGPT发布后编程问题数量显著减少,问题的文档质量有所提高,且剩余的问题更加复杂。这些结果表明,生成式AI不仅带来生产力提升,还引发工作方式的深刻变革:常规问题交由AI解决,使人类得以专注于更复杂的任务。
    Abstract This paper illustrates how generative AI could give opportunities for big productivity gains but also opens up questions about the impact of these new powerful technologies on the way we work and share knowledge. More specifically, we explore how ChatGPT changed a fundamental aspect of coding: problem-solving. To do so, we exploit the effect of the sudden release of ChatGPT on the 30th of November 2022 on the usage of the largest online community for coders: Stack Overflow. Using quasi-experimental methods (Difference-in-Difference), we find a significant drop in the number of questions. In addition, the questions are better documented after the release of ChatGPT. Finally, we find evidence that the remaining questions are more complex. These findings suggest not only productivity gains but also a fundamental change in the way we work where routine inquiries are solved by AI allowing humans to focus on more complex tasks.
    摘要 这篇论文描述了生成式AI如何带来巨大的生产力提升,同时也提出了这些强大新技术对我们工作和知识共享方式的影响。具体而言,我们研究了ChatGPT如何改变编程中的一个基本环节:问题求解。为此,我们利用ChatGPT于2022年11月30日突然发布这一事件,考察其对最大在线编程社区Stack Overflow使用情况的影响。通过准实验方法(双重差分),我们发现问题数量显著下降;此外,ChatGPT发布后的问题文档质量更高,而剩余的问题也更加复杂。这些发现不仅表明生产力提升,还表明工作方式的根本变化:常规问题由AI解决,让人类可以专注于更复杂的任务。
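
For readers unfamiliar with the quasi-experimental method used here, the sketch below estimates a Difference-in-Differences effect as the interaction coefficient in OLS on synthetic data; all variable names and numbers are illustrative assumptions, not the paper's actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),   # e.g. question types affected by ChatGPT
    "post": rng.integers(0, 2, n),      # observed after 30 Nov 2022 or not
})
# Outcome with a true DiD effect of -5 on the treated group after release
df["questions"] = (
    50 - 5 * df["treated"] * df["post"]
    + 2 * df["treated"] + 3 * df["post"]
    + rng.normal(0, 1, n)
)
# The coefficient on treated:post is the Difference-in-Differences estimate
model = smf.ols("questions ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # ≈ -5
```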

Improving Knot Prediction in Wood Logs with Longitudinal Feature Propagation

  • paper_url: http://arxiv.org/abs/2308.11291
  • repo_url: https://github.com/jeremyfix/icvs2023
  • paper_authors: Salim Khazem, Jeremy Fix, Cédric Pradalier
  • for: 本研究旨在预测木材内部缺陷的位置,以提高木材质量评估的准确性和效率。
  • methods: 本研究使用卷积循环神经网络,基于木材外形轮廓完成内部缺陷的二值分割任务。
  • results: 研究表明,神经网络训练完成后,仅凭借便宜的外形测量设备(如激光轮廓仪)获取的外形数据即可推断木材内部缺陷的位置;该方法在冷杉和云杉两个树种上均得到了效果验证。
    Abstract The quality of a wood log in the wood industry depends heavily on the presence of both outer and inner defects, including inner knots that are a result of the growth of tree branches. Today, locating the inner knots require the use of expensive equipment such as X-ray scanners. In this paper, we address the task of predicting the location of inner defects from the outer shape of the logs. The dataset is built by extracting both the contours and the knots with X-ray measurements. We propose to solve this binary segmentation task by leveraging convolutional recurrent neural networks. Once the neural network is trained, inference can be performed from the outer shape measured with cheap devices such as laser profilers. We demonstrate the effectiveness of our approach on fir and spruce tree species and perform ablation on the recurrence to demonstrate its importance.
    摘要 木材行业中原木的质量在很大程度上取决于外部和内部缺陷,其中包括由树枝生长形成的内部节疤。目前,定位内部节疤需要使用X射线扫描仪等昂贵设备。本文解决了根据原木外形预测内部缺陷位置的任务。数据集通过X射线测量同时提取外形轮廓与节疤构建。我们提出使用卷积循环神经网络来求解这一二值分割任务。神经网络训练完成后,即可利用激光轮廓仪等廉价设备测得的外形进行推理。我们在冷杉和云杉树种上验证了该方法的有效性,并通过对循环结构的消融实验证明了其重要性。

ShadowNet for Data-Centric Quantum System Learning

  • paper_url: http://arxiv.org/abs/2308.11290
  • repo_url: None
  • paper_authors: Yuxuan Du, Yibo Yang, Tongliang Liu, Zhouchen Lin, Bernard Ghanem, Dacheng Tao
  • for: 本研究旨在探讨大规模量子系统动力学的学习问题,以缓解维度灾难的影响。
  • methods: 本研究提出一种数据为中心的学习范式,结合神经网络方案与经典阴影的优势,以支持多样化的量子系统学习任务。
  • results: 研究表明,该范式能够给出准确可靠的预测,并且仅需少量状态副本即可在推理阶段对未见过的量子系统实现快速高效的预测。
    Abstract Understanding the dynamics of large quantum systems is hindered by the curse of dimensionality. Statistical learning offers new possibilities in this regime by neural-network protocols and classical shadows, while both methods have limitations: the former is plagued by the predictive uncertainty and the latter lacks the generalization ability. Here we propose a data-centric learning paradigm combining the strength of these two approaches to facilitate diverse quantum system learning (QSL) tasks. Particularly, our paradigm utilizes classical shadows along with other easily obtainable information of quantum systems to create the training dataset, which is then learnt by neural networks to unveil the underlying mapping rule of the explored QSL problem. Capitalizing on the generalization power of neural networks, this paradigm can be trained offline and excel at predicting previously unseen systems at the inference stage, even with few state copies. Besides, it inherits the characteristic of classical shadows, enabling memory-efficient storage and faithful prediction. These features underscore the immense potential of the proposed data-centric approach in discovering novel and large-scale quantum systems. For concreteness, we present the instantiation of our paradigm in quantum state tomography and direct fidelity estimation tasks and conduct numerical analysis up to 60 qubits. Our work showcases the profound prospects of data-centric artificial intelligence to advance QSL in a faithful and generalizable manner.
    摘要 大量量子系统的动力学是由维度瓶颈所困难。统计学学习提供了新的可能性,通过神经网络协议和类型热影,然而两者都有局限性:前者受到预测不确定性的困扰,而后者缺乏泛化能力。我们提议一种数据驱动学习思想,结合这两种方法,以便实现多样化量子系统学习(QSL)任务。具体来说,我们的思想利用类型热影并与量子系统其他易获得信息一起创建训练集,然后通过神经网络学习探索QSL问题下的底层映射规则。通过神经网络的泛化能力,这种思想可以在训练阶段在线上培养,并在探索阶段预测未经见过的系统,即使只有几个状态的复制。此外,它继承类型热影的特点,即储存和预测的具有快速储存和准确预测的特点。这些特点强调了我们提议的数据驱动方法在发现新的大规模量子系统方面的极大潜力。为了证明这一点,我们在量子状态探测和直接准确率估计任务中实现了实例,并进行了数值分析至60个量子比特。我们的工作展示了数据驱动人工智能在忠实和泛化的方式下提高量子系统学习的可能性。

Recording of 50 Business Assignments

  • paper_url: http://arxiv.org/abs/2308.12211
  • repo_url: https://github.com/microsoft/50BusinessAssignmentsLog
  • paper_authors: Michal Sroka, Mohammadreza Fani Sani
  • for: 本研究用于发现和分析用户如何执行业务任务,为流程效率与优化提供有价值的洞察。
  • methods: 本文提供了50个真实企业流程的数据集,该数据集具有广泛的研究应用潜力,包括任务挖掘和流程自动化。
  • results: 本研究提供了一个有价值的数据集,有助于研究人员和实践者开展流程效率与优化方面的研究。
    Abstract One of the main use cases of process mining is to discover and analyze how users follow business assignments, providing valuable insights into process efficiency and optimization. In this paper, we present a comprehensive dataset consisting of 50 real business processes. The dataset holds significant potential for research in various applications, including task mining and process automation which is a valuable resource for researchers and practitioners.
    摘要 流程挖掘的一个主要用例是发现并分析用户如何执行业务任务,从而为流程效率和优化提供有价值的洞察。本文提供了一个包含50个真实业务流程的完整数据集。该数据集在任务挖掘和流程自动化等多种应用中具有巨大的研究潜力,是研究人员和实践者的宝贵资源。

CNN based Cuneiform Sign Detection Learned from Annotated 3D Renderings and Mapped Photographs with Illumination Augmentation

  • paper_url: http://arxiv.org/abs/2308.11277
  • repo_url: None
  • paper_authors: Ernst Stötzner, Timo Homburg, Hubert Mara
  • for: 针对数字古代近东研究(DANES)社区面临的挑战,我们开发了处理楔形文字的数字工具。楔形文字是一种刻写在泥板上的三维文字,其使用跨越三千多年,涵盖至少八种主要语言。
  • methods: 我们构建并使用了HeiCuBeDa和MaiCuBeDa数据集,包含约500块标注泥板,并针对混合图像数据提出了一种类OCR的新方法。字符定位使用RepPoints检测器,以边界框形式预测字符位置。我们使用GigaMesh基于MSII(曲率)的渲染图、Phong着色的3D模型和照片作为图像数据,并辅以光照增强。
  • results: 使用渲染的3D图像进行字符检测优于使用照片;该方法在混合数据集上表现良好,且Phong渲染、尤其是MSII渲染能够提升在照片上的检测结果。
    Abstract Motivated by the challenges of the Digital Ancient Near Eastern Studies (DANES) community, we develop digital tools for processing cuneiform script being a 3D script imprinted into clay tablets used for more than three millennia and at least eight major languages. It consists of thousands of characters that have changed over time and space. Photographs are the most common representations usable for machine learning, while ink drawings are prone to interpretation. Best suited 3D datasets that are becoming available. We created and used the HeiCuBeDa and MaiCuBeDa datasets, which consist of around 500 annotated tablets. For our novel OCR-like approach to mixed image data, we provide an additional mapping tool for transferring annotations between 3D renderings and photographs. Our sign localization uses a RepPoints detector to predict the locations of characters as bounding boxes. We use image data from GigaMesh's MSII (curvature, see https://gigamesh.eu) based rendering, Phong-shaded 3D models, and photographs as well as illumination augmentation. The results show that using rendered 3D images for sign detection performs better than other work on photographs. In addition, our approach gives reasonably good results for photographs only, while it is best used for mixed datasets. More importantly, the Phong renderings, and especially the MSII renderings, improve the results on photographs, which is the largest dataset on a global scale.
    摘要 受数字古代近东研究(DANES)社区挑战的启发,我们开发了处理楔形文字的数字工具。楔形文字是一种压印在泥板上的三维文字,其使用超过三千年,涵盖至少八种主要语言,包含数千个随时间和地域演变的字符。照片是最常用、可用于机器学习的表示形式,而墨线描图则容易带入主观解释;最适合的3D数据集正逐渐可用。我们创建并使用了HeiCuBeDa和MaiCuBeDa数据集,包含约500块标注泥板。针对混合图像数据的类OCR新方法,我们还提供了一个在3D渲染图与照片之间传递标注的映射工具。字符定位使用RepPoints检测器,以边界框形式预测字符位置。我们使用GigaMesh的MSII(曲率,见 https://gigamesh.eu)渲染图、Phong着色3D模型和照片作为图像数据,并进行光照增强。结果表明,使用渲染的3D图像进行字符检测优于其他基于照片的工作;此外,我们的方法在仅有照片时也能取得相当不错的结果,而在混合数据集上效果最佳。更重要的是,Phong渲染、尤其是MSII渲染能够提升照片(全球范围内规模最大的数据类型)上的检测结果。

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

  • paper_url: http://arxiv.org/abs/2308.11276
  • repo_url: None
  • paper_authors: Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
  • for: solves the problem of text-to-music generation (T2M-Gen) faced due to the scarcity of large-scale publicly available music datasets with natural language captions.
  • methods: utilizes audio representations from a pretrained MERT model to extract music features, and introduces a methodology for generating question-answer pairs from existing audio captioning datasets, as well as the MusicQA Dataset designed for answering open-ended music-related questions.
  • results: achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
    Abstract Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
    摘要 文本到音乐生成(T2M-Gen)面临的一大障碍是缺乏带有自然语言描述的大规模公开音乐数据集。为此,我们提出Music Understanding LLaMA(MU-LLaMA),能够回答音乐相关问题并为音乐文件生成描述。我们的模型利用预训练MERT模型的音频表示来提取音乐特征。然而,获取适合训练MU-LLaMA的数据集仍具挑战,因为现有公开的音频问答数据集缺乏开放式音乐问答所需的深度。为填补这一空白,我们提出了一种从现有音频描述数据集生成问答对的方法,并构建了面向开放式音乐问答的MusicQA数据集。实验表明,在我们设计的MusicQA数据集上训练的MU-LLaMA模型,在音乐问答和音乐描述生成两方面的多种指标上均表现出色,超越两个领域当前的最先进(SOTA)模型,为T2M-Gen研究领域带来可观的进展。

Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.11267
  • repo_url: None
  • paper_authors: David M. Bossens
  • for: 本文介绍鲁棒约束马尔可夫决策过程(RCMDP)这一强化学习任务建模框架,它同时考虑行为约束与模型不确定性,并对转移动力学模型的误差提供鲁棒性。
  • methods: 本文使用的方法包括:1)基于值估计的最坏情况动力学;2)基于拉格朗日函数的最坏情况动力学;3)面向RCMDP的对抗策略梯度算法。
  • results: 在注入扰动的库存管理和安全导航任务上的实验表明,所提出的两种算法(RCPG with Robust Lagrangian 和 Adversarial RCPG)优于传统RCPG变体以及非鲁棒、非约束的消融版本,其中Adversarial RCPG在所有测试中均位列前二。
    Abstract The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
    摘要 鲁棒约束马尔可夫决策过程(RCMDP)是一种近期提出的强化学习任务建模框架,它同时考虑行为约束,并通过不确定性集为转移动力学模型中的误差提供鲁棒性。模拟RCMDP需要基于每个状态的值估计计算最坏情况动力学,鲁棒约束策略梯度(RCPG)此前即采用这一方式。针对RCPG未对完整约束目标进行鲁棒化、且缺乏增量学习等潜在不足,本文提出两种算法:RCPG with Robust Lagrangian 和 Adversarial RCPG。前者将最坏情况动力学的计算依据从值函数或约束改为拉格朗日函数;后者同样基于拉格朗日函数构造最坏情况动力学,但通过梯度下降以对抗策略的形式直接、增量地学习,而非通过在排序值列表上的约束优化间接、突变地求解。理论分析首先推导了两种算法策略优化所需的拉格朗日策略梯度,随后推导了Adversarial RCPG中用于学习对手的对抗策略梯度。在注入扰动的库存管理与安全导航任务上的实验表明,两种算法相比传统RCPG变体以及非鲁棒、非约束的消融版本均具有竞争力,其中Adversarial RCPG在所有测试中均位列前二。
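
As a hedged sketch of the objective behind "worst-case dynamics based on the Lagrangian", the display below writes out a robust constrained Lagrangian; the ordering of the min operators and the sign conventions here are assumptions for illustration, not the paper's precise formulation.

```latex
% V_r: return, V_c: constraint cost, d: budget, U: uncertainty set over dynamics P.
% RCPG with Robust Lagrangian takes the worst case over P with respect to the
% Lagrangian itself, rather than the value or the constraint alone (sketch):
\max_{\pi}\ \min_{\lambda \ge 0}\ \min_{P \in \mathcal{U}}
\Big[\, V_r^{\pi}(P) \;-\; \lambda \big( V_c^{\pi}(P) - d \big) \Big]
```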

Efficient Last-iterate Convergence Algorithms in Solving Games

  • paper_url: http://arxiv.org/abs/2308.11256
  • repo_url: None
  • paper_authors: Linjian Meng, Zhenxing Ge, Wenbin Li, Bo An, Yang Gao
  • for: The paper addresses learning Nash equilibria (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs) with no-regret algorithms.
  • methods: The paper uses the Reward Transformation (RT) framework, which turns learning NE in the original game into a series of strongly convex-concave optimization problems (SCCPs), and proposes a novel transformation so that Regret Matching+ (RM+) can solve the SCCPs.
  • results: The proposed algorithm, Reward Transformation RM+ (RTRM+), enjoys last-iterate convergence under the discrete-time feedback setting and significantly outperforms existing last-iterate convergence algorithms and RM+ (CFR+) in experiments.
    Abstract No-regret algorithms are popular for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs). Many recent works consider the last-iterate convergence no-regret algorithms. Among them, the two most famous algorithms are Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA has high per-iteration complexity. OMWU exhibits a lower per-iteration complexity but poorer empirical performance, and its convergence holds only when NE is unique. Recent works propose a Reward Transformation (RT) framework for MWU, which removes the uniqueness condition and achieves competitive performance with OMWU. Unfortunately, RT-based algorithms perform worse than OGDA under the same number of iterations, and their convergence guarantee is based on the continuous-time feedback assumption, which does not hold in most scenarios. To address these issues, we provide a closer analysis of the RT framework, which holds for both continuous and discrete-time feedback. We demonstrate that the essence of the RT framework is to transform the problem of learning NE in the original game into a series of strongly convex-concave optimization problems (SCCPs). We show that the bottleneck of RT-based algorithms is the speed of solving SCCPs. To improve the their empirical performance, we design a novel transformation method to enable the SCCPs can be solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys last-iterate convergence under the discrete-time feedback setting. Using the counterfactual regret decomposition framework, we propose Reward Transformation CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our algorithms significantly outperform existing last-iterate convergence algorithms and RM+ (CFR+).
    摘要 无悔算法常用于学习两人零和标准式博弈(NFG)与扩展式博弈(EFG)中的纳什均衡(NE)。近期许多工作关注具有最后迭代收敛性的无悔算法,其中最著名的两个是乐观梯度下降上升(OGDA)和乐观乘性权重更新(OMWU)。然而,OGDA每次迭代的复杂度较高;OMWU虽然迭代复杂度较低,但实证表现较差,且其收敛性仅在NE唯一时成立。近期工作为MWU提出了奖励变换(RT)框架,去除了唯一性条件,并取得了与OMWU相当的性能。遗憾的是,在相同迭代次数下,基于RT的算法表现不如OGDA,且其收敛保证建立在连续时间反馈假设之上,而该假设在多数场景中并不成立。为解决这些问题,我们对RT框架进行了更深入的分析,其结论对连续和离散时间反馈均成立。我们证明RT框架的本质是将原博弈中学习NE的问题转化为一系列强凸-强凹优化问题(SCCP),并指出基于RT的算法的瓶颈在于求解SCCP的速度。为提升其实证表现,我们设计了一种新的变换方法,使SCCP可由实证表现更好的无悔算法Regret Matching+(RM+)求解,从而得到Reward Transformation RM+(RTRM+)。RTRM+在离散时间反馈设定下具有最后迭代收敛性。借助反事实遗憾分解框架,我们进一步提出Reward Transformation CFR+(RTCFR+),将RTRM+扩展到EFG。实验结果表明,我们的算法显著优于现有的最后迭代收敛算法以及RM+(CFR+)。
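
As background for the RM+ subroutine the paper builds on, here is a minimal self-play Regret Matching+ sketch on rock-paper-scissors; the game and iteration count are illustrative assumptions. Positive cumulative regrets drive the strategy, and the *average* strategies converge to the Nash equilibrium.

```python
import numpy as np

# Rock-Paper-Scissors payoff for player 0 (zero-sum); player 1 receives -A.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

def strategy(Q):
    s = Q.sum()
    return Q / s if s > 0 else np.full_like(Q, 1 / len(Q))

Q0 = np.zeros(3); Q1 = np.zeros(3)
avg0 = np.zeros(3); avg1 = np.zeros(3)
for t in range(20000):
    s0, s1 = strategy(Q0), strategy(Q1)
    avg0 += s0; avg1 += s1
    u0 = A @ s1            # expected utility of each action for player 0
    u1 = -A.T @ s0         # and for player 1
    # RM+ update: accumulate instantaneous regrets, clip at zero
    Q0 = np.maximum(Q0 + u0 - s0 @ u0, 0)
    Q1 = np.maximum(Q1 + u1 - s1 @ u1, 0)

print(avg0 / avg0.sum(), avg1 / avg1.sum())  # both ≈ [1/3, 1/3, 1/3]
```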

A survey on bias in machine learning research

  • paper_url: http://arxiv.org/abs/2308.11254
  • repo_url: https://github.com/Aastha2104/Parkinson-Disease-Prediction
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Michał Grochowski
  • for: 本研究旨在帮助理解机器学习中的偏见源泉和错误,以提高机器学习模型的公平、透明和准确性。
  • methods: 本文系统梳理了机器学习流程中四十余种潜在的偏见与错误来源,并为每一种提供了清晰的示例。
  • results: 本文通过对机器学习管道中的偏见和错误的分析,帮助开发者更好地检测和缓解偏见,从而提高机器学习模型的公平性和准确性。
    Abstract Current research on bias in machine learning often focuses on fairness, while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap between past literature on bias in research by providing taxonomy for potential sources of bias and errors in data and models. The paper focus on bias in machine learning pipelines. Survey analyses over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for its detecting and mitigating, leading to fairer, more transparent, and more accurate ML models.
    摘要 当前关于机器学习偏见的研究往往聚焦于公平性,而忽视了偏见的根源或成因。然而,偏见最初被定义为一种“系统性误差”,通常由研究过程中不同阶段的人为因素造成。本文旨在弥合以往偏见研究文献之间的鸿沟,为数据和模型中潜在的偏见与错误来源提供一套分类体系。本文聚焦于机器学习流程中的偏见,系统梳理了机器学习(ML)流程中四十余种潜在的偏见来源,并为每一种提供了清晰的示例。通过理解机器学习偏见的来源及其后果,可以开发出更好的偏见检测与缓解方法,从而得到更公平、更透明、更准确的ML模型。

Modeling Bends in Popular Music Guitar Tablatures

  • paper_url: http://arxiv.org/abs/2308.12307
  • repo_url: https://gitlab.com/adhooge1/bend-prediction
  • paper_authors: Alexandre D’Hooge, Louis Bigo, Ken Déguernel
  • for: 这篇论文主要用于研究 guitar 乐谱中的弯曲技巧,以及如何使用数据分析方法来预测弯曲的发生。
  • methods: 这篇论文使用了一些数据分析技术,包括决策树等,来研究弯曲的预测。
  • results: 实验结果表明,使用这些数据分析技术可以准确预测弯曲的发生,并且具有一定的可靠性和精度。
    Abstract Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-on/pull-off or bends.This paper focuses on bends, which enable to progressively shift the pitch of a note, therefore circumventing physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 anda limited amount of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.
    摘要 六线谱(tablature)在流行音乐中被广泛用于记录和分享吉他演奏内容。作为五线谱的补充,六线谱记录了演奏手势信息,包括手指位置以及滑音、击勾弦、推弦等多种吉他特有的演奏技巧。本文关注推弦:它可以渐进地改变音符的音高,从而绕过离散品丝指板的物理限制。本文为六线谱中的每个音符计算了25个高层特征,研究如何根据前后短时上下文预测推弦的出现。实验在932份流行音乐主音吉他六线谱语料上进行,结果表明决策树能以0.71的F1分数成功预测推弦的出现,且假阳性预测数量有限,展现了辅助将非吉他音乐改编为吉他六线谱的应用前景。
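
As a rough illustration of the setup — per-note features fed to a decision tree that predicts bend occurrences — the sketch below uses synthetic data and hypothetical feature names, since the actual 25 features live in the linked repository.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 5000
# Hypothetical per-note features standing in for the paper's 25 descriptors:
X = np.column_stack([
    rng.integers(0, 22, n),        # fret
    rng.integers(1, 7, n),         # string
    rng.uniform(0.1, 2.0, n),      # duration (beats)
    rng.integers(-12, 13, n),      # interval to next note (semitones)
])
# Toy label rule: bends more likely on high frets with small upward intervals
y = ((X[:, 0] > 12) & (X[:, 3] > 0) & (X[:, 3] <= 2)).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=6, class_weight="balanced").fit(Xtr, ytr)
print(f1_score(yte, clf.predict(Xte)))
```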

Multi-Source Domain Adaptation for Cross-Domain Fault Diagnosis of Chemical Processes

  • paper_url: http://arxiv.org/abs/2308.11247
  • repo_url: None
  • paper_authors: Eduardo Fernandes Montesuma, Michela Mulas, Fred Ngolè Mboula, Francesco Corona, Antoine Souloumiac
  • for: 这项研究旨在提高过程监控中故障诊断的精度,具体而言是利用机器学习算法根据传感器读数预测故障类型。
  • methods: 该研究对单源无监督域自适应(SSDA)与多源无监督域自适应(MSDA)算法进行故障诊断,并在田纳西-伊斯曼过程上进行了广泛比较。
  • results: 研究显示,在多源场景下利用多个源域可以提高故障诊断精度:MSDA基线的平均分类精度比SSDA基线高23%;即使不进行自适应,多源场景也能将无自适应设置的平均分类精度提升8.4%。
    Abstract Fault diagnosis is an essential component in process supervision. Indeed, it determines which kind of fault has occurred, given that it has been previously detected, allowing for appropriate intervention. Automatic fault diagnosis systems use machine learning for predicting the fault type from sensor readings. Nonetheless, these models are sensible to changes in the data distributions, which may be caused by changes in the monitored process, such as changes in the mode of operation. This scenario is known as Cross-Domain Fault Diagnosis (CDFD). We provide an extensive comparison of single and multi-source unsupervised domain adaptation (SSDA and MSDA respectively) algorithms for CDFD. We study these methods in the context of the Tennessee-Eastmann Process, a widely used benchmark in the chemical industry. We show that using multiple domains during training has a positive effect, even when no adaptation is employed. As such, the MSDA baseline improves over the SSDA baseline classification accuracy by 23% on average. In addition, under the multiple-sources scenario, we improve classification accuracy of the no adaptation setting by 8.4% on average.
    摘要 故障诊断是过程监控中的重要环节:在故障被检测到之后,它负责判定发生了哪一类故障,从而采取恰当的干预措施。自动故障诊断系统利用机器学习根据传感器读数预测故障类型。然而,这些模型对数据分布的变化十分敏感,而此类变化可能源于被监控过程本身的变化,例如运行模式的改变。这种情形被称为跨域故障诊断(CDFD)。我们对单源与多源无监督域自适应(分别为SSDA与MSDA)算法在CDFD上进行了广泛比较,并以化工行业广泛使用的基准——田纳西-伊斯曼过程为背景开展研究。结果表明,即使不进行任何自适应,在训练中使用多个源域也有正面效果:MSDA基线的平均分类精度比SSDA基线高23%;此外,在多源场景下,我们将无自适应设置的平均分类精度提升了8.4%。
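
The finding that pooling several source domains helps even without adaptation can be illustrated with a toy sketch; the data, domain shifts, and classifier below are assumptions, not the Tennessee-Eastman setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_domain(shift, n=500):
    """Toy fault-diagnosis domain: 2 fault classes, domain-specific mean shift."""
    X0 = rng.normal([0 + shift, 0], 1.0, size=(n, 2))
    X1 = rng.normal([2 + shift, 2], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

# Three source domains (e.g. operating modes) and one unseen target mode
sources = [make_domain(s) for s in (-1.0, 0.0, 1.0)]
Xt, yt = make_domain(0.5)

# Single-source baseline: train on one mode only
Xs, ys = sources[0]
print("single source:", accuracy_score(yt, LogisticRegression().fit(Xs, ys).predict(Xt)))

# Multi-source baseline: simply pool all source modes, no adaptation at all
Xp = np.vstack([X for X, _ in sources])
yp = np.concatenate([y for _, y in sources])
print("pooled sources:", accuracy_score(yt, LogisticRegression().fit(Xp, yp).predict(Xt)))
```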

Faster Optimization in S-Graphs Exploiting Hierarchy

  • paper_url: http://arxiv.org/abs/2308.11242
  • repo_url: None
  • paper_authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Javier Civera, Holger Voos
  • for: This paper aims to improve the scalability of Situational Graphs (S-Graphs) for SLAM in large environments by marginalizing redundant robot poses and their connections to observations.
  • methods: The proposed method generates and optimizes room-local graphs encompassing all graph entities within a room-like structure, compresses the S-Graphs, and performs windowed local optimization of the compressed graph at regular time-distance intervals; global optimization is performed every time a loop closure is detected.
  • results: The proposed method achieves similar accuracy to the baseline while reducing the computation time by 39.81%.
    Abstract 3D scene graphs hierarchically represent the environment appropriately organizing different environmental entities in various layers. Our previous work on situational graphs extends the concept of 3D scene graph to SLAM by tightly coupling the robot poses with the scene graph entities, achieving state-of-the-art results. Though, one of the limitations of S-Graphs is scalability in really large environments due to the increased graph size over time, increasing the computational complexity. To overcome this limitation in this work we present an initial research of an improved version of S-Graphs exploiting the hierarchy to reduce the graph size by marginalizing redundant robot poses and their connections to the observations of the same structural entities. Firstly, we propose the generation and optimization of room-local graphs encompassing all graph entities within a room-like structure. These room-local graphs are used to compress the S-Graphs marginalizing the redundant robot keyframes within the given room. We then perform windowed local optimization of the compressed graph at regular time-distance intervals. A global optimization of the compressed graph is performed every time a loop closure is detected. We show similar accuracy compared to the baseline while showing a 39.81% reduction in the computation time with respect to the baseline.
    摘要 三维场景图以层次方式表示环境,将不同的环境实体恰当地组织在不同层级中。我们此前关于情境图(S-Graphs)的工作将三维场景图的概念扩展到SLAM,通过将机器人位姿与场景图实体紧密耦合,取得了最先进的结果。然而,S-Graphs的局限之一是在超大环境中的可扩展性:图的规模随时间增长,计算复杂度随之上升。为克服这一局限,本文对S-Graphs的改进版本展开初步研究,利用层次结构将冗余的机器人位姿及其与同一结构实体观测之间的连接边缘化,从而缩减图的规模。首先,我们提出生成并优化涵盖房间式结构内所有图实体的房间局部图,并利用这些局部图压缩S-Graphs,将给定房间内冗余的机器人关键帧边缘化。随后,我们以固定的时间-距离间隔对压缩图进行滑窗局部优化,并在每次检测到回环时对压缩图进行全局优化。结果表明,该方法在保持与基线相近精度的同时,计算时间比基线减少了39.81%。

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

  • paper_url: http://arxiv.org/abs/2308.11241
  • repo_url: https://github.com/harunorikawano/speaker-identification-with-tgp
  • paper_authors: Harunori Kawano, Sota Shimizu
  • for: 本研究旨在利用Transformer架构和自监督学习,构建高精度的说话人识别模型。
  • methods: 本研究采用基于Transformer的上下文模型,并进一步提出学习能力强大的时间门控池化(Temporal Gate Pooling)方法。
  • results: 在VoxCeleb1说话人识别任务上,该方法以28.5M参数取得85.9%的准确率,精度与拥有317.7M参数的wav2vec2相当,而参数量大幅减少。
    Abstract Wav2vec2 has achieved success in applying Transformer architecture and self-supervised learning to speech recognition. Recently, these have come to be used not only for speech recognition but also for the entire speech processing. This paper introduces an effective end-to-end speaker identification model applied Transformer-based contextual model. We explored the relationship between the parameters and the performance in order to discern the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied Conformer as encoder and BEST-RQ for pre-training and conducted an evaluation utilizing the speaker identification of VoxCeleb1. The proposed method has achieved an accuracy of 85.9% with 28.5M parameters, demonstrating comparable precision to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
    摘要 Wav2vec2将Transformer架构与自监督学习成功应用于语音识别;近来,这类方法不仅用于语音识别,还被用于整个语音处理领域。本文提出一种有效的端到端说话人识别模型,采用基于Transformer的上下文模型。我们研究了参数与性能之间的关系,以辨析高效模型的结构;并进一步提出一种学习能力强大的池化方法——时间门控池化(Temporal Gate Pooling)。我们以Conformer作为编码器、BEST-RQ进行预训练,并在VoxCeleb1说话人识别任务上进行评估。所提方法以28.5M参数取得85.9%的准确率,精度与拥有317.7M参数的wav2vec2相当。代码见 https://github.com/HarunoriKawano/speaker-identification-with-tgp。
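
The sketch below shows one plausible reading of a temporal gate pooling layer — a learned gate scoring every frame before a weighted average; the authors' actual implementation is in the linked repository, so treat this purely as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class TemporalGatePooling(nn.Module):
    """Sketch: a learned gate weights each frame, and the utterance
    embedding is the gate-weighted average of frame features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        w = torch.sigmoid(self.gate(x))                  # (batch, time, 1)
        return (w * x).sum(dim=1) / w.sum(dim=1).clamp_min(1e-6)

feats = torch.randn(4, 200, 768)        # e.g. Conformer frame outputs
emb = TemporalGatePooling(768)(feats)   # (4, 768) utterance-level embeddings
print(emb.shape)
```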

ROSGPT_Vision: Commanding Robots Using Only Language Models’ Prompts

  • paper_url: http://arxiv.org/abs/2308.11236
  • repo_url: https://github.com/bilel-bj/rosgpt_vision
  • paper_authors: Bilel Benjdira, Anis Koubaa, Anas M. Ali
  • for: 这篇论文提出一种仅用语言模型提示即可指挥机器人的新方法。
  • methods: 该方法将语言模型与视觉模型结合,通过自动化提示背后的机制来执行机器人任务。
  • results: 该方法可以显著降低机器人应用的开发成本,并通过优化提示策略提高应用质量。
    Abstract In this paper, we argue that the next generation of robots can be commanded using only Language Models' prompts. Every prompt interrogates separately a specific Robotic Modality via its Modality Language Model (MLM). A central Task Modality mediates the whole communication to execute the robotic mission via a Large Language Model (LLM). This paper gives this new robotic design pattern the name of: Prompting Robotic Modalities (PRM). Moreover, this paper applies this PRM design pattern in building a new robotic framework named ROSGPT_Vision. ROSGPT_Vision allows the execution of a robotic task using only two prompts: a Visual and an LLM prompt. The Visual Prompt extracts, in natural language, the visual semantic features related to the task under consideration (Visual Robotic Modality). Meanwhile, the LLM Prompt regulates the robotic reaction to the visual description (Task Modality). The framework automates all the mechanisms behind these two prompts. The framework enables the robot to address complex real-world scenarios by processing visual data, making informed decisions, and carrying out actions automatically. The framework comprises one generic vision module and two independent ROS nodes. As a test application, we used ROSGPT_Vision to develop CarMate, which monitors the driver's distraction on the roads and makes real-time vocal notifications to the driver. We showed how ROSGPT_Vision significantly reduced the development cost compared to traditional methods. We demonstrated how to improve the quality of the application by optimizing the prompting strategies, without delving into technical details. ROSGPT_Vision is shared with the community (link: https://github.com/bilel-bj/ROSGPT_Vision) to advance robotic research in this direction and to build more robotic frameworks that implement the PRM design pattern and enables controlling robots using only prompts.
    摘要 本文主张,下一代机器人可以仅通过语言模型提示来指挥。每条提示通过相应的模态语言模型(MLM)分别询问特定的机器人模态;一个核心的任务模态借助大型语言模型(LLM)协调整个通信过程,以执行机器人任务。本文将这一新的机器人设计模式命名为:提示机器人模态(Prompting Robotic Modalities,PRM)。此外,本文运用PRM设计模式构建了名为ROSGPT_Vision的新机器人框架。ROSGPT_Vision仅需两条提示即可执行机器人任务:一条视觉提示和一条LLM提示。视觉提示以自然语言提取与当前任务相关的视觉语义特征(视觉机器人模态),而LLM提示则调控机器人对视觉描述的反应(任务模态);框架将这两条提示背后的全部机制自动化,使机器人能够处理视觉数据、做出明智决策并自动执行动作,从而应对复杂的现实场景。该框架由一个通用视觉模块和两个独立的ROS节点组成。作为测试应用,我们使用ROSGPT_Vision开发了CarMate,用于监测道路上驾驶员的分心状态并向驾驶员发出实时语音提醒。我们展示了ROSGPT_Vision相比传统方法显著降低了开发成本,并说明了如何在不深入技术细节的情况下,通过优化提示策略来提升应用质量。ROSGPT_Vision已向社区开源(链接:https://github.com/bilel-bj/ROSGPT_Vision),以推动该方向的机器人研究,并促进更多实现PRM设计模式、支持仅用提示控制机器人的框架的构建。
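
To make the two-prompt pattern concrete, here is a hedged sketch of a CarMate-style loop; `describe_image`, `ask_llm`, and `speak` are hypothetical stand-ins (stubbed below) for the framework's vision-language, LLM, and notification nodes, not the real ROSGPT_Vision API.

```python
# Minimal sketch of the two-prompt pattern, NOT the actual ROSGPT_Vision API.

def describe_image(frame_path: str, prompt: str) -> str:
    # stub for the Visual Robotic Modality (a vision-language model)
    return "driver looking at a phone, right hand off the wheel"

def ask_llm(prompt: str) -> str:
    # stub for the Task Modality (a large language model)
    return "ALERT" if "phone" in prompt else "OK"

def speak(text: str) -> None:
    # stub for the vocal notification node
    print("[TTS]", text)

VISUAL_PROMPT = ("Describe the driver's gaze direction, hand position, "
                 "and any object they are holding.")
LLM_PROMPT = ("You monitor driver distraction. Given the scene description, "
              "answer with exactly one word: ALERT if the driver looks "
              "distracted, OK otherwise.\nScene: {scene}")

def monitor_frame(frame_path: str) -> str:
    scene = describe_image(frame_path, VISUAL_PROMPT)   # visual prompt
    decision = ask_llm(LLM_PROMPT.format(scene=scene))  # LLM prompt
    if decision.strip().upper() == "ALERT":
        speak("Please keep your eyes on the road.")
    return decision

print(monitor_frame("frame_0001.png"))  # -> ALERT (with these stubs)
```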

Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.11235
  • repo_url: None
  • paper_authors: Zhenzhe Gao, Zhaoxia Yin, Hongjian Zhan, Heng Yin, Yue Lu
  • for: 检测和防止人工智能模型中的意外或恶意篡改。
  • methods: 使用脆弱水印(fragile watermarking)技术检测和定位模型篡改,并提出在保持模型精度的同时最大化信息容量的自适应嵌入方法。
  • results: 测试结果表明,当篡改率低于20%时,我们的方法可以达到很好的恢复性能;对于水印显著影响精度的模型,我们利用自适应比特技术恢复了超过15%的精度损失。
    Abstract Artificial Intelligence (AI) has found wide application, but also poses risks due to unintentional or malicious tampering during deployment. Regular checks are therefore necessary to detect and prevent such risks. Fragile watermarking is a technique used to identify tampering in AI models. However, previous methods have faced challenges including risks of omission, additional information transmission, and inability to locate tampering precisely. In this paper, we propose a method for detecting tampered parameters and bits, which can be used to detect, locate, and restore parameters that have been tampered with. We also propose an adaptive embedding method that maximizes information capacity while maintaining model accuracy. Our approach was tested on multiple neural networks subjected to attacks that modified weight parameters, and our results demonstrate that our method achieved great recovery performance when the modification rate was below 20%. Furthermore, for models where watermarking significantly affected accuracy, we utilized an adaptive bit technique to recover more than 15% of the accuracy loss of the model.
    摘要 人工智能(AI)应用广泛,但在部署过程中可能遭受无意或恶意的篡改,因而存在风险,需要定期检查以检测并防范。脆弱水印是一种用于识别AI模型篡改的技术,但以往方法面临漏检风险、需要传输额外信息、无法精确定位篡改等挑战。本文提出一种检测被篡改参数和比特的方法,可用于检测、定位并恢复被篡改的参数;同时提出一种自适应嵌入方法,在保持模型精度的前提下最大化信息容量。我们在多个遭受权重参数篡改攻击的神经网络上进行了测试,结果表明当篡改率低于20%时,该方法可取得出色的恢复性能;对于水印显著影响精度的模型,我们利用自适应比特技术恢复了超过15%的精度损失。
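
The paper's exact self-mutual check scheme is not reproduced here; as a simplified stand-in, the sketch below detects and locates tampered parameter blocks with per-block fingerprints, which conveys only the generic detect-and-locate idea under stated assumptions.

```python
import hashlib
import numpy as np

def block_fingerprints(weights: np.ndarray, block: int = 256) -> list[str]:
    """Hash each block of quantised weights; tampering changes the hash of
    exactly the affected blocks, which both detects and locates it."""
    q = np.round(weights * 1e4).astype(np.int64)   # quantise so hashing is stable
    return [hashlib.sha256(q[i:i + block].tobytes()).hexdigest()
            for i in range(0, len(q), block)]

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
ref = block_fingerprints(w)

w_tampered = w.copy()
w_tampered[1000] += 0.5                            # simulated malicious edit
now = block_fingerprints(w_tampered)

bad = [i for i, (a, b) in enumerate(zip(ref, now)) if a != b]
print("tampered blocks:", bad)                     # -> [3] (indices 768..1023)
```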

Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding

  • paper_url: http://arxiv.org/abs/2308.11234
  • repo_url: None
  • paper_authors: Zhe Chen, Daniel Harabor, Jiaoyang Li, Peter J. Stuckey
  • for: 解决多 Agent 路径规划问题,即 robotics 中多 Agent 需要计算共享地图上免撞的路径。
  • methods: 提出一种新方法,引导智能体沿规避拥堵的路径抵达目的地。
  • results: 在一次性MAPF与终身MAPF两类大规模场景中进行评估:对一次性MAPF,解的质量显著提升;对终身MAPF,总体吞吐量大幅提高。
    Abstract Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue we propose a new approach for MAPF where agents are guided to their destination by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new tasks. For one-shot MAPF we show that our approach substantially improves solution quality. For Lifelong MAPF we report large improvements in overall throughput.
    摘要 多智能体路径规划(MAPF)是机器人领域的一个基本问题,要求为在共享地图上移动的一组智能体计算无碰撞路径。尽管该课题已有大量研究,现有算法在智能体数量增多时都难以应对,主要原因在于现有方法通常规划自由流最优路径,从而造成拥堵。为解决这一问题,我们提出一种新的MAPF方法:引导智能体沿规避拥堵的路径抵达目的地。我们在两类大规模设定中评估了这一思路:一次性MAPF(每个智能体只有一个目的地)与终身MAPF(智能体不断被分配新任务)。对一次性MAPF,我们的方法显著提升了解的质量;对终身MAPF,我们报告了总体吞吐量的大幅提升。
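
As a generic illustration of congestion-avoiding path planning (not the paper's specific guide-path method), the sketch below runs Dijkstra on a grid where cell costs grow with congestion contributed by previously planned agents.

```python
import heapq

def congestion_aware_path(grid_w, grid_h, start, goal, congestion, beta=2.0):
    """Dijkstra on a 4-connected grid where each cell costs
    1 + beta * congestion[cell]; later agents are steered around
    cells that earlier agents' paths already crowd."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, (x, y) = heapq.heappop(pq)
        if (x, y) == goal:
            break
        if d > dist[(x, y)]:
            continue
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < grid_w and 0 <= ny < grid_h:
                nd = d + 1.0 + beta * congestion.get((nx, ny), 0.0)
                if nd < dist.get((nx, ny), float("inf")):
                    dist[(nx, ny)] = nd
                    prev[(nx, ny)] = (x, y)
                    heapq.heappush(pq, (nd, (nx, ny)))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

# One earlier agent's path raises congestion on a corridor at x = 2:
congestion = {(2, y): 1.0 for y in range(5)}
print(congestion_aware_path(5, 5, (0, 0), (4, 4), congestion))
```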

On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report

  • paper_url: http://arxiv.org/abs/2308.11225
  • repo_url: None
  • paper_authors: Anes Bendimerad, Youcef Remil, Romain Mathonat, Mehdi Kaytoue
  • for: 本研究旨在探讨在企业内部实施基于开源工具的AIOps解决方案,以提高软件维护和监测。
  • methods: 本研究使用开源工具构建了一套完整的AIOps基础设施,并提供了不同选择的原则和策略。
  • results: 研究结果表明,使用开源工具构建AIOps基础设施可以减少成本和提高软件维护效率,同时也可以满足公司的数据管理和安全需求。
    Abstract Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.
    摘要 信息技术已成为各行各业的关键组成部分,软件维护与监控因此受到更多关注。面对现代软件系统的复杂性,传统维护方法已显不足。AIOps概念应运而生,借助大数据与机器学习能力增强预测性维护。然而,落地AIOps需要应对数据复杂性和事件管理等多项挑战。市面上虽有商业解决方案,但由于成本高昂、数据治理问题以及对私有软件覆盖的局限,并不一定适合某些公司。本文探讨了借助开源工具在企业内部(on-premise)实施AIOps解决方案的可行性。我们介绍了已在本公司成功部署的一套完整AIOps基础设施,并阐述了构建其各个组件时不同选择背后的考量,特别是数据管理系统的选型标准及其集成方式。我们的经验可为希望以现代AIOps方法在内部管理软件维护流程的公司提供借鉴。

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.11224
  • repo_url: https://github.com/ayame1006/llmtograph
  • paper_authors: Chang Liu, Bo Wu
  • for: 这个研究旨在评估四种大语言模型(LLMs)在处理图数据上的应用能力。
  • methods: 该研究使用了四种评估指标:理解力、正确性、忠实度和修正能力。
  • results: 研究发现:LLM能够以自然语言理解图数据并就图拓扑进行推理;GPT模型能生成逻辑连贯的结果,在正确性上优于其他模型;但所有受测LLM在结构推理方面均面临挑战,零样本思维链与少样本提示等技术的效果有限。GPT模型在多答案任务中常给出错误答案,令人担忧其忠实度;同时GPT模型对输出表现出较高的自信度,可能妨碍其自我修正。值得注意的是,GPT-4能够修正GPT-3.5-turbo以及其自身先前迭代的答案。研究代码见:https://github.com/Ayame1006/LLMtoGraph。
    Abstract Large Language Models (LLMs) have garnered considerable interest within both academic and industrial. Yet, the application of LLMs to graph data remains under-explored. In this study, we evaluate the capabilities of four LLMs in addressing several analytical problems with graph data. We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification. Our results show that: 1) LLMs effectively comprehend graph data in natural language and reason with graph topology. 2) GPT models can generate logical and coherent results, outperforming alternatives in correctness. 3) All examined LLMs face challenges in structural reasoning, with techniques like zero-shot chain-of-thought and few-shot prompting showing diminished efficacy. 4) GPT models often produce erroneous answers in multi-answer tasks, raising concerns in fidelity. 5) GPT models exhibit elevated confidence in their outputs, potentially hindering their rectification capacities. Notably, GPT-4 has demonstrated the capacity to rectify responses from GPT-3.5-turbo and its own previous iterations. The code is available at: https://github.com/Ayame1006/LLMtoGraph.
    摘要 大型语言模型(LLM)在学术界和工业界都受到了广泛关注,然而LLM在图数据上的应用仍有待充分探索。本研究评估了四种LLM解决图数据多个分析问题的能力,采用四种评估指标:理解力、正确性、忠实度和修正能力。结果显示:1)LLM能够以自然语言有效理解图数据,并就图拓扑进行推理;2)GPT模型能够生成逻辑连贯的结果,在正确性上优于其他模型;3)所有受测LLM在结构推理上均面临挑战,零样本思维链和少样本提示等技术的效果有所下降;4)GPT模型在多答案任务中常产生错误答案,引发对其忠实度的担忧;5)GPT模型对自身输出表现出较高的自信度,可能妨碍其自我修正能力。值得注意的是,GPT-4已展现出修正GPT-3.5-turbo及其自身先前迭代答案的能力。代码见:https://github.com/Ayame1006/LLMtoGraph。
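
A minimal sketch of how such an evaluation can be set up: serialize a graph into a natural-language prompt and compare the model's answer against a classical algorithm; the phrasing below is an assumption, not the paper's exact template.

```python
import networkx as nx

def graph_to_prompt(g: nx.Graph, question: str) -> str:
    """Serialise a graph into natural language, the input format the
    evaluated LLMs receive (exact phrasing here is an assumption)."""
    edges = ", ".join(f"{u}-{v}" for u, v in g.edges())
    return (f"Consider an undirected graph with nodes "
            f"{sorted(g.nodes())} and edges {edges}. {question}")

g = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0)])
prompt = graph_to_prompt(g, "Does this graph contain a cycle? Answer yes or no.")
print(prompt)
# Correctness check against a classical algorithm:
print("ground truth:", "yes" if not nx.is_forest(g) else "no")
```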

Federated Learning on Patient Data for Privacy-Protecting Polycystic Ovary Syndrome Treatment

  • paper_url: http://arxiv.org/abs/2308.11220
  • repo_url: https://github.com/toriqiu/fl-pcos
  • paper_authors: Lucia Morris, Tori Qiu, Nikhil Raghuraman
  • for: 这篇论文探讨联邦学习(FL)在为多囊卵巢综合征(PCOS)患者预测最优药物方案中的应用;PCOS是一种影响全球数百万女性的严重内分泌疾病。
  • methods: 论文在人工合成的PCOS患者数据集上测试了多种联邦学习方法。
  • results: 研究发现,多种联邦学习方法在合成PCOS患者数据集上均取得成功,能够在提供隐私保证的同时识别最有效的治疗方案。
    Abstract The field of women's endocrinology has trailed behind data-driven medical solutions, largely due to concerns over the privacy of patient data. Valuable datapoints about hormone levels or menstrual cycling could expose patients who suffer from comorbidities or terminate a pregnancy, violating their privacy. We explore the application of Federated Learning (FL) to predict the optimal drug for patients with polycystic ovary syndrome (PCOS). PCOS is a serious hormonal disorder impacting millions of women worldwide, yet it's poorly understood and its research is stunted by a lack of patient data. We demonstrate that a variety of FL approaches succeed on a synthetic PCOS patient dataset. Our proposed FL models are a tool to access massive quantities of diverse data and identify the most effective treatment option while providing PCOS patients with privacy guarantees.
    摘要 妇女内分泌学领域在数据驱动的医疗解决方案上一直落后,这在很大程度上源于对患者数据隐私的顾虑:有关激素水平或月经周期的宝贵数据点可能暴露患有并发症或曾终止妊娠的患者,侵犯其隐私。我们探讨了联邦学习(FL)在为多囊卵巢综合征(PCOS)患者预测最优药物方案中的应用。PCOS是一种影响全球数百万女性的严重内分泌疾病,但人们对其了解甚少,其研究也因缺乏患者数据而停滞。我们展示了多种FL方法在合成PCOS患者数据集上均取得成功。所提出的FL模型能够访问海量多样化数据并识别最有效的治疗方案,同时为PCOS患者提供隐私保证。
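
For readers unfamiliar with the mechanics, here is a minimal FedAvg-style sketch in which three clinics train locally and only parameters are averaged; the model, data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """A few epochs of logistic-regression gradient descent on one
    clinic's private data; raw data never leaves the clinic."""
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
clinics = []
for _ in range(3):                       # three hospitals with private data
    X = rng.normal(size=(200, 4))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, 200) > 0).astype(float)
    clinics.append((X, y))

w = np.zeros(4)
for rnd in range(20):                    # FedAvg rounds
    updates = [local_sgd(w.copy(), X, y) for X, y in clinics]
    w = np.mean(updates, axis=0)         # server averages parameters only
print(w)                                 # ≈ direction of the true coefficients
```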

Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models

  • paper_url: http://arxiv.org/abs/2308.11217
  • repo_url: None
  • paper_authors: Zengxiang Li, Zhaoxiang Hou, Hui Liu, Ying Wang, Tongzhi Li, Longfei Xie, Chao Shi, Chengyi Yang, Weishan Zhang, Zelei Liu, Liang Xu
  • for: 这篇论文提出一种多模态联邦学习框架,帮助多家企业利用私有领域数据协同训练垂直领域大模型,实现跨场景的智能服务。
  • methods: 论文深入讨论了大模型时代联邦学习在智能基座与目标方面的战略转型,以及在异构数据、模型聚合、性能与成本权衡、数据隐私和激励机制等方面面临的新挑战。
  • results: 论文介绍了一个城市安全运营管理案例研究:多家头部企业贡献多模态数据与专家知识,实现了联邦学习平台的分布式部署与高效协同,并基于大模型能力在数据质量提升与高效联合微调方法上进行了技术创新。初步实验表明,企业可以通过多模态模型联邦学习增强并积累智能能力,共同打造覆盖能源基础设施安全、居民社区安全和城市运营管理等领域的高质量智慧城市模型。
    Abstract Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that enables multiple enterprises to utilize private domain data to collaboratively train large models for vertical domains, achieving intelligent services across scenarios. The authors discuss in-depth the strategic transformation of federated learning in terms of intelligence foundation and objectives in the era of big model, as well as the new challenges faced in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanism. The paper elaborates a case study of leading enterprises contributing multimodal data and expert knowledge to city safety operation management , including distributed deployment and efficient coordination of the federated learning platform, technical innovations on data quality improvement based on large model capabilities and efficient joint fine-tuning approaches. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal model federated learning, thereby jointly creating an smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The established federated learning cooperation ecosystem is expected to further aggregate industry, academia, and research resources, realize large models in multiple vertical domains, and promote the large-scale industrial application of artificial intelligence and cutting-edge research on multimodal federated learning.
    摘要 多模态数据能够全面感知和认知物理世界,已成为迈向通用人工智能的关键路径。然而,在公共数据集上训练的多模态大模型在特定行业领域往往表现不佳。本文提出一种多模态联邦学习框架,使多家企业能够利用私有领域数据协同训练垂直领域大模型,实现跨场景的智能服务。作者深入讨论了大模型时代联邦学习在智能基座与目标方面的战略转型,以及在异构数据、模型聚合、性能与成本权衡、数据隐私和激励机制等方面面临的新挑战。论文详细阐述了一个头部企业贡献多模态数据与专家知识共建城市安全运营管理的案例研究,包括联邦学习平台的分布式部署与高效协同、基于大模型能力的数据质量提升技术创新以及高效的联合微调方法。初步实验表明,企业可以通过多模态模型联邦学习增强并积累智能能力,共同打造提供高质量智能服务的智慧城市模型,覆盖能源基础设施安全、居民社区安全和城市运营管理等领域。所建立的联邦学习合作生态有望进一步聚合产学研资源,在多个垂直领域落地大模型,推动人工智能的大规模产业应用以及多模态联邦学习的前沿研究。

ConcatPlexer: Additional Dim1 Batching for Faster ViTs

  • paper_url: http://arxiv.org/abs/2308.11199
  • repo_url: None
  • paper_authors: Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak
  • for: 这项研究旨在提升视觉识别的效率,在保持精度的同时提高模型的推理吞吐量。
  • methods: 本研究借鉴语言模型的降本方法数据多路复用(DataMUX),将其引入视觉模型,提出朴素适配版本Image Multiplexer,并设计新组件克服其弱点,得到最终模型ConcatPlexer。
  • results: ConcatPlexer在ImageNet1K和CIFAR100上训练,其GFLOPs比ViT-B/16低23.5%,验证精度分别达到69.5%与83.4%。
    Abstract Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) that greatly improves the throughput with little compromise in the accuracy. We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. The ConcatPlexer was trained on ImageNet1K and CIFAR100 dataset and it achieved 23.5% less GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
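
To illustrate the dim1-batching idea, the sketch below concatenates the patch tokens of several images along the sequence axis and encodes them in a single transformer pass. This is a toy illustration under assumed dimensions, not the authors' ConcatPlexer: the patch embedding, the mean pooling, and the omission of the paper's additional components are all simplifications.

```python
import torch
import torch.nn as nn

class Dim1Batcher(nn.Module):
    """Minimal sketch of dim1 batching: the patch tokens of M images are
    concatenated along the sequence axis and encoded in one forward pass."""
    def __init__(self, dim=192, depth=4, heads=3, patch=16, img=224):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.tokens_per_image = (img // patch) ** 2

    def forward(self, images):                      # images: (B, M, 3, H, W)
        b, m = images.shape[:2]
        x = self.embed(images.flatten(0, 1))        # (B*M, D, h, w)
        x = x.flatten(2).transpose(1, 2)            # (B*M, N, D) patch tokens
        x = x.reshape(b, m * self.tokens_per_image, -1)  # concatenate on dim1
        x = self.encoder(x)                         # one shared forward pass
        # Pool each image's token span separately to recover M predictions.
        x = x.reshape(b, m, self.tokens_per_image, -1).mean(dim=2)
        return x                                    # (B, M, D) per-image features
```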

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

  • paper_url: http://arxiv.org/abs/2308.11194
  • repo_url: https://github.com/stanfordmimi/villa
  • paper_authors: Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, Curtis Langlotz
  • for: Evaluates how well current vision-language models (VLMs) perform on multimodal data with high pairwise complexity, and how VLMs can be improved to capture fine-grained relationships between image regions and textual attributes.
  • methods: Proposes ViLLA, which combines a lightweight, self-supervised mapping model that decomposes image-text samples into region-attribute pairs with a contrastive VLM that learns representations from the generated pairs.
  • results: Across four domains (synthetic, product, medical, and natural images), ViLLA outperforms comparable VLMs on fine-grained reasoning tasks such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).
    Abstract Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).
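
The second component, contrastive learning over generated region-attribute pairs, can be illustrated with a generic symmetric InfoNCE loss. This is a sketch assuming paired region and attribute embeddings are already available from the mapping model; it is not ViLLA's exact objective.

```python
import torch
import torch.nn.functional as F

def region_attribute_contrastive_loss(region_embs, attr_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (region, attribute) pairs.
    region_embs, attr_embs: (N, D) tensors; row i of each forms a positive pair."""
    region_embs = F.normalize(region_embs, dim=-1)
    attr_embs = F.normalize(attr_embs, dim=-1)
    logits = region_embs @ attr_embs.t() / temperature   # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Pull matched pairs together and push mismatched pairs apart, both ways.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```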

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

  • paper_url: http://arxiv.org/abs/2308.11189
  • repo_url: https://github.com/lab-v2/diversity_measures
  • paper_authors: Noel Ngu, Nathaniel Lee, Paulo Shakarian
  • for: Provides domain-independent methods for predicting failures of large language models, so that model reliability can be assessed without application-specific information.
  • methods: Quantifies the diversity of a model's responses to a given prompt with three measures: entropy, Gini impurity, and centroid distance. Being independent of the underlying application, the measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
  • results: Experiments across multiple datasets and temperature settings show that all three measures correlate strongly with the probability of failure.
    Abstract Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
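
The three measures are easy to state concretely. The sketch below uses the standard definitions of entropy and Gini impurity over the empirical distribution of sampled answers, and a mean distance-to-centroid over response embeddings; the paper's exact formulations may differ in detail.

```python
import numpy as np
from collections import Counter

def entropy(responses):
    """Shannon entropy of the empirical distribution over distinct answers."""
    p = np.array(list(Counter(responses).values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def gini_impurity(responses):
    """1 - sum(p_i^2): the probability that two sampled answers disagree."""
    p = np.array(list(Counter(responses).values()), dtype=float)
    p /= p.sum()
    return float(1.0 - (p ** 2).sum())

def centroid_distance(embeddings):
    """Mean Euclidean distance of response embeddings to their centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    centroid = embeddings.mean(axis=0)
    return float(np.linalg.norm(embeddings - centroid, axis=1).mean())

# Example: five sampled answers to one prompt; high diversity hints at failure.
answers = ["42", "42", "41", "42", "7"]
print(entropy(answers), gini_impurity(answers))
```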

MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

  • paper_url: http://arxiv.org/abs/2308.11175
  • repo_url: None
  • paper_authors: Jinpeng Wang, Ziyun Zeng, Yunxiao Wang, Yuting Wang, Xingyu Lu, Tianxiang Li, Jun Yuan, Rui Zhang, Hai-Tao Zheng, Shu-Tao Xia
  • for: Addresses the shortcomings of ID-based features, which underperform with sparse IDs, and the cold-start problem in practical recommendation scenarios.
  • methods: Proposes MISSRec, a multi-modal pre-training and transfer learning framework with a Transformer-based encoder-decoder on the user side and a dynamic fusion module that produces user-adaptive item representations on the candidate side.
  • results: Experiments show that MISSRec delivers effective and efficient recommendation in practical scenarios.
    Abstract The goal of sequential recommendation (SR) is to predict a user's potential interested items based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal synergy while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representation, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising an practical solution for real-world recommendation scenarios.

Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.11166
  • repo_url: https://github.com/SmiletoE/HPAL
  • paper_authors: Zongyi Xu, Bo Yuan, Shanshan Zhao, Qianni Zhang, Xinbo Gao
  • for: Addresses the issue of limited annotations in 3D point cloud semantic segmentation using active learning.
  • methods: A hierarchical minimum margin uncertainty module measures per-point uncertainty using contextual information at multiple levels, and a feature-distance suppression strategy selects important, representative points for manual labelling; a semi-supervised segmentation framework is built on top of this active strategy.
  • results: The proposed method reaches 96.5% and 100% of the fully-supervised baseline's performance using only 0.07% and 0.1% of the training data, respectively, outperforming state-of-the-art weakly-supervised and active learning methods. Code will be released at https://github.com/SmiletoE/HPAL.
    Abstract Impressive performance on point cloud semantic segmentation has been achieved by fully-supervised methods with large amounts of labelled data. As it is labour-intensive to acquire large-scale point cloud data with point-wise labels, many attempts have been made to explore learning 3D point cloud segmentation with limited annotations. Active learning is one of the effective strategies to achieve this purpose but is still under-explored. The most recent methods of this kind measure the uncertainty of each pre-divided region for manual labelling but they suffer from redundant information and require additional efforts for region division. This paper aims at addressing this issue by developing a hierarchical point-based active learning strategy. Specifically, we measure the uncertainty for each point by a hierarchical minimum margin uncertainty module which considers the contextual information at multiple levels. Then, a feature-distance suppression strategy is designed to select important and representative points for manual labelling. Besides, to better exploit the unlabelled data, we build a semi-supervised segmentation framework based on our active strategy. Extensive experiments on the S3DIS and ScanNetV2 datasets demonstrate that the proposed framework achieves 96.5% and 100% performance of fully-supervised baseline with only 0.07% and 0.1% training data, respectively, outperforming the state-of-the-art weakly-supervised and active learning methods. The code will be available at https://github.com/SmiletoE/HPAL.
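
The selection criterion builds on minimum margin uncertainty: points whose top-2 class probabilities are close are the most informative to label. A minimal sketch follows, with a uniform-weight average across hierarchy levels standing in for the paper's hierarchical aggregation (the weights are an assumption).

```python
import torch

def minimum_margin_uncertainty(logits):
    """logits: (num_points, num_classes). Returns per-point uncertainty in
    [0, 1]: one minus the margin between the top-2 class probabilities."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values
    return 1.0 - (top2[..., 0] - top2[..., 1])

def hierarchical_uncertainty(logits_per_level, weights=None):
    """Average margin uncertainty over predictions made at several context
    levels (e.g. point, segment, region); uniform weights assumed here."""
    us = torch.stack([minimum_margin_uncertainty(l) for l in logits_per_level])
    if weights is None:
        weights = torch.full((len(logits_per_level),), 1.0 / len(logits_per_level))
    return (weights[:, None] * us).sum(dim=0)   # (num_points,) scores
```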

xxMD: Benchmarking Neural Force Fields Using Extended Dynamics beyond Equilibrium

  • paper_url: http://arxiv.org/abs/2308.11155
  • repo_url: https://github.com/zpengmei/xxmd
  • paper_authors: Zihan Pengmei, Junyu Liu, Yinan Shu
  • For: The paper aims to address the limitations of current neural force field (NFF) models in representing chemical reactions by introducing a new dataset called xxMD, which includes energies and forces computed from both multireference wave function theory and density functional theory.
  • Methods: The paper uses the constrained distribution of internal coordinates and energies in the MD17 datasets to demonstrate their inadequacy for representing systems undergoing chemical reactions, then introduces the xxMD dataset, whose nuclear configuration spaces authentically depict chemical reactions, making it a more chemically relevant dataset.
  • Results: The authors re-assess equivariant models on the xxMD datasets and find notably higher mean absolute errors than those reported for MD17 and its variants, highlighting the challenges faced in crafting a generalizable NFF model with extrapolation capability. The two new datasets, xxMD-CASSCF and xxMD-DFT, are available online.
    Abstract Neural force fields (NFFs) have gained prominence in computational chemistry as surrogate models, superseding quantum-chemistry calculations in ab initio molecular dynamics. The prevalent benchmark for NFFs has been the MD17 dataset and its subsequent extension. These datasets predominantly comprise geometries from the equilibrium region of the ground electronic state potential energy surface, sampling from direct adiabatic dynamics. However, many chemical reactions entail significant molecular deformations, notably bond breaking. We demonstrate the constrained distribution of internal coordinates and energies in the MD17 datasets, underscoring their inadequacy for representing systems undergoing chemical reactions. Addressing this sampling limitation, we introduce the xxMD (Extended Excited-state Molecular Dynamics) dataset, derived from non-adiabatic dynamics. This dataset encompasses energies and forces ascertained from both multireference wave function theory and density functional theory. Furthermore, its nuclear configuration spaces authentically depict chemical reactions, making xxMD a more chemically relevant dataset. Our re-assessment of equivariant models on the xxMD datasets reveals notably higher mean absolute errors than those reported for MD17 and its variants. This observation underscores the challenges faced in crafting a generalizable NFF model with extrapolation capability. Our proposed xxMD-CASSCF and xxMD-DFT datasets are available at \url{https://github.com/zpengmei/xxMD}.

Exploring Unsupervised Cell Recognition with Prior Self-activation Maps

  • paper_url: http://arxiv.org/abs/2308.11144
  • repo_url: https://github.com/cpystan/psm
  • paper_authors: Pingyi Chen, Chenglu Zhu, Zhongyi Shui, Jiatong Cai, Sunyi Zheng, Shichuan Zhang, Lin Yang
  • for: Aims to reduce annotation costs and improve recognition performance for cell recognition in biomedical images.
  • methods: Proposes prior self-activation maps (PSMs): an activation network trained with self-supervised learning yields activation maps from which pseudo labels are generated, and a semantic clustering module then transforms them into pixel-level semantic pseudo masks for downstream tasks.
  • results: Evaluated on two histological datasets, the method is competitive with fully- and weakly-supervised approaches without any manual annotation; the simple yet effective framework also supports multi-class cell detection, which existing unsupervised methods cannot handle.
    Abstract The success of supervised deep learning models on cell recognition tasks relies on detailed annotations. Many previous works have managed to reduce the dependency on labels. However, considering the large number of cells contained in a patch, costly and inefficient labeling is still inevitable. To this end, we explored label-free methods for cell recognition. Prior self-activation maps (PSM) are proposed to generate pseudo masks as training targets. To be specific, an activation network is trained with self-supervised learning. The gradient information in the shallow layers of the network is aggregated to generate prior self-activation maps. Afterward, a semantic clustering module is then introduced as a pipeline to transform PSMs to pixel-level semantic pseudo masks for downstream tasks. We evaluated our method on two histological datasets: MoNuSeg (cell segmentation) and BCData (multi-class cell detection). Compared with other fully-supervised and weakly-supervised methods, our method can achieve competitive performance without any manual annotations. Our simple but effective framework can also achieve multi-class cell detection which can not be done by existing unsupervised methods. The results show the potential of PSMs that might inspire other research to deal with the hunger for labels in medical area.

Is There Any Social Principle for LLM-Based Agents?

  • paper_url: http://arxiv.org/abs/2308.11136
  • repo_url: None
  • paper_authors: Jitao Bai, Simiao Zhang, Zhonghao Chen
  • for: This paper concerns agents built on large language models (LLMs).
  • methods: Argues that work on LLM-based agents should involve more than "human-centered" alignment or application, with more attention paid to the agent itself.
  • results: Discusses the potential of the social sciences for studying LLM-based agents.
    Abstract Focus on Large Language Model based agents should involve more than "human-centered" alignment or application. We argue that more attention should be paid to the agent itself and discuss the potential of social sciences for agents.

ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation

  • paper_url: http://arxiv.org/abs/2308.11131
  • repo_url: None
  • paper_authors: Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, Weinan Zhang
  • for: Targets zero-shot and few-shot settings in recommendation tasks, improving the performance of large language models (LLMs).
  • methods: Proposes Retrieval-enhanced Large Language models (ReLLa), a novel framework addressing the problems LLMs face in the recommendation domain.
  • results: Extensive experiments show that ReLLa outperforms existing baselines and mitigates the LLMs' difficulty in comprehending long sequences of user behavior.
    Abstract With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and have been actively explored currently. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequence, even if the length of context is far from reaching the context limitation of LLMs. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on a real-world public dataset (i.e., MovieLens-1M) to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension.
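
Semantic user behavior retrieval (SUBR) can be sketched as top-k retrieval over the behavior history by similarity to the target item. The sketch assumes item embeddings (e.g., from a text encoder) are available and keeps the selected behaviors in chronological order; it is an illustration, not the paper's implementation.

```python
import numpy as np

def semantic_user_behavior_retrieval(target_emb, behavior_embs, k=10):
    """Pick the k historical behaviors most similar to the target item
    (cosine similarity) and keep them in their original temporal order."""
    target = target_emb / (np.linalg.norm(target_emb) + 1e-8)
    hist = behavior_embs / (np.linalg.norm(behavior_embs, axis=1, keepdims=True) + 1e-8)
    sims = hist @ target                      # (num_behaviors,)
    k = min(k, len(sims))
    topk = np.argpartition(-sims, k - 1)[:k]
    return np.sort(topk)                      # chronological indices to keep

# The selected behaviors are then rendered into the LLM prompt in place of
# the raw (often overly long) behavior sequence.
```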

Transformers for Capturing Multi-level Graph Structure using Hierarchical Distances

  • paper_url: http://arxiv.org/abs/2308.11129
  • repo_url: None
  • paper_authors: Yuankai Luo
  • for: Proposes a graph transformer with hierarchy-aware structural encoding to improve performance across diverse graph types.
  • methods: Introduces hierarchy-distance structural encoding (HDSE), which models hierarchical distances between nodes in a graph to capture its multi-level, hierarchical nature, and can be flexibly integrated with existing graph transformers alongside other positional representations.
  • results: Extensive experiments on 12 real-world datasets show that HDSE successfully enhances various baseline transformers, achieving state-of-the-art empirical performance on 10 benchmark datasets.
    Abstract Graph transformers need strong inductive biases to derive meaningful attention scores. Yet, current proposals rarely address methods capturing longer ranges, hierarchical structures, or community structures, as they appear in various graphs such as molecules, social networks, and citation networks. In this paper, we propose a hierarchy-distance structural encoding (HDSE), which models a hierarchical distance between the nodes in a graph focusing on its multi-level, hierarchical nature. In particular, this yields a framework which can be flexibly integrated with existing graph transformers, allowing for simultaneous application with other positional representations. Through extensive experiments on 12 real-world datasets, we demonstrate that our HDSE method successfully enhances various types of baseline transformers, achieving state-of-the-art empirical performances on 10 benchmark datasets.

CAME: Contrastive Automated Model Evaluation

  • paper_url: http://arxiv.org/abs/2308.11111
  • repo_url: https://github.com/pengr/contrastive_autoeval
  • paper_authors: Ru Peng, Qiuyang Duan, Haobo Wang, Jiachen Ma, Yanbo Jiang, Yongjun Tu, Xiu Jiang, Junbo Zhao
  • for: Proposes a new automated model evaluation (AutoEval) framework for assessing trained machine learning models without a labeled test set.
  • methods: Builds on a novel contrastive loss: a theoretical analysis ties model performance to the contrastive loss, which can be estimated on the unlabeled test set without involving the training set.
  • results: Experiments show that CAME establishes new state-of-the-art AutoEval results, significantly surpassing prior work.
    Abstract The Automated Model Evaluation (AutoEval) framework entertains the possibility of evaluating a trained machine learning model without resorting to a labeled testing set. Despite the promise and some decent results, existing AutoEval methods rely heavily on computing distribution shifts between the unlabelled testing set and the training set. We believe this reliance on the training set becomes another obstacle in shipping this technology to real-world ML development. In this work, we propose Contrastive Automatic Model Evaluation (CAME), a novel AutoEval framework that removes the training set from the loop. The core idea of CAME rests on a theoretical analysis that ties model performance to a contrastive loss. Further, with extensive empirical validation, we establish a predictable relationship between the two simply by inference on the unlabeled/unseen testing set. The resulting framework establishes new state-of-the-art results for AutoEval, surpassing prior work significantly.
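
In practice, the "predictable relationship" suggests a two-stage recipe: fit the loss-accuracy relationship on meta test sets with known labels, then predict accuracy for a new unlabeled set from its contrastive loss alone. The sketch below is hypothetical: the linear regression form and all numbers are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stage 1: on meta-sets with known labels, record (contrastive loss, accuracy).
meta_losses = np.array([[0.42], [0.55], [0.61], [0.70]])   # hypothetical values
meta_accuracies = np.array([0.91, 0.84, 0.80, 0.72])
reg = LinearRegression().fit(meta_losses, meta_accuracies)

# Stage 2: for a new unlabeled test set, compute the same contrastive loss
# (no labels needed) and read off the predicted accuracy.
new_loss = np.array([[0.58]])
print("estimated accuracy:", reg.predict(new_loss)[0])
```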

Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11103
  • repo_url: https://github.com/skatinger/anonymity-at-risk-assessing-re-identification-capabilities-of-large-language-models
  • paper_authors: Alex Nyffenegger, Matthias Stürmer, Joel Niklaus
  • for: The paper explores the potential of large language models (LLMs) to re-identify individuals in court rulings, with a focus on privacy protection in the European Union and Switzerland.
  • methods: The authors construct a proof-of-concept using actual legal data from the Swiss federal supreme court and create an anonymized Wikipedia dataset for more rigorous testing. They introduce new metrics to measure performance and systematically analyze the factors that influence successful re-identifications.
  • results: Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions due to a lack of test datasets, the need for substantial training resources, and data sparsity in the information used for re-identification. The study concludes that re-identification using LLMs may not be feasible for now, but it could become possible in the future.
    Abstract Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the Swiss federal supreme court. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. With the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. The complexity is attributed to the lack of test datasets, the necessity for substantial training resources, and data sparsity in the information used for re-identification. In conclusion, this study demonstrates that re-identification using LLMs may not be feasible for now, but as the proof-of-concept on Wikipedia showed, it might become possible in the future. We hope that our system can help enhance the confidence in the security of anonymized decisions, thus leading to the courts being more confident to publish decisions.

Using Early Exits for Fast Inference in Automatic Modulation Classification

  • paper_url: http://arxiv.org/abs/2308.11100
  • repo_url: None
  • paper_authors: Elsayed Mohammed, Omar Mashaal, Hatem Abou-Zeid
  • for: Aims to make automatic modulation classification (AMC) in wireless communications more efficient, where deep learning (DL) is used to extract wireless signal features.
  • methods: Proposes applying early exiting (EE) to accelerate inference of DL models for AMC, presenting and analyzing four early-exit architectures and a customized multi-branch training algorithm.
  • results: Extensive experiments show that signals with moderate to high signal-to-noise ratios (SNRs) are easier to classify, so EE can significantly reduce the inference latency of deep neural networks without sacrificing classification accuracy; the trade-off between accuracy and inference time is also analyzed in depth. To the authors' knowledge, this is the first application of early exiting to AMC.
    Abstract Automatic modulation classification (AMC) plays a critical role in wireless communications by autonomously classifying signals transmitted over the radio spectrum. Deep learning (DL) techniques are increasingly being used for AMC due to their ability to extract complex wireless signal features. However, DL models are computationally intensive and incur high inference latencies. This paper proposes the application of early exiting (EE) techniques for DL models used for AMC to accelerate inference. We present and analyze four early exiting architectures and a customized multi-branch training algorithm for this problem. Through extensive experimentation, we show that signals with moderate to high signal-to-noise ratios (SNRs) are easier to classify, do not require deep architectures, and can therefore leverage the proposed EE architectures. Our experimental results demonstrate that EE techniques can significantly reduce the inference speed of deep neural networks without sacrificing classification accuracy. We also thoroughly study the trade-off between classification accuracy and inference time when using these architectures. To the best of our knowledge, this work represents the first attempt to apply early exiting methods to AMC, providing a foundation for future research in this area.
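
A minimal early-exit classifier makes the mechanism concrete: an intermediate head emits a prediction, and inference stops there whenever its softmax confidence clears a threshold. This generic sketch is not one of the paper's four architectures; the layer sizes, the I/Q input format, and the batch-level exit rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitAMC(nn.Module):
    def __init__(self, num_classes=11, in_ch=2):   # e.g. I/Q signal channels
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(in_ch, 32, 7, padding=3),
                                    nn.ReLU(), nn.MaxPool1d(2))
        self.block2 = nn.Sequential(nn.Conv1d(32, 64, 7, padding=3),
                                    nn.ReLU(), nn.MaxPool1d(2))
        self.exit1 = nn.Linear(32, num_classes)     # cheap early head
        self.exit2 = nn.Linear(64, num_classes)     # final head

    def forward(self, x, threshold=0.9):            # x: (B, in_ch, length)
        h = self.block1(x)
        logits1 = self.exit1(h.mean(dim=-1))         # global average pooling
        if not self.training:
            conf = F.softmax(logits1, dim=-1).max(dim=-1).values
            # Easy (typically high-SNR) inputs stop here; a per-sample router
            # would exit each input individually instead of the whole batch.
            if bool((conf >= threshold).all()):
                return logits1
        h = self.block2(h)
        logits2 = self.exit2(h.mean(dim=-1))
        # During multi-branch training, both heads contribute to the loss.
        return (logits1, logits2) if self.training else logits2
```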

Video OWL-ViT: Temporally-consistent open-world localization in video

  • paper_url: http://arxiv.org/abs/2308.11093
  • repo_url: None
  • paper_authors: Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf
  • for: Adapts pre-trained open-world image models to localization in videos.
  • methods: Builds on the OWL-ViT open-vocabulary detection model, adding a transformer decoder that propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next.
  • results: The model performs strongly on the challenging TAO-OW benchmark, demonstrating that open-world capabilities learned from large-scale image-text pre-training transfer successfully to open-world localization in video.
    Abstract We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
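
The temporal mechanism is a decoder recurrence: the output tokens for one frame become the object queries for the next, so each query slot can follow the same object through time. A minimal sketch of that loop follows; the decoder configuration and query initialization are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalQueryPropagation(nn.Module):
    def __init__(self, d_model=256, num_queries=100, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.init_queries = nn.Parameter(torch.randn(num_queries, d_model))

    def forward(self, frame_features):
        """frame_features: list of (B, num_tokens, d_model) image encodings,
        one per frame. Returns per-frame object tokens with consistent slots."""
        b = frame_features[0].shape[0]
        queries = self.init_queries.unsqueeze(0).expand(b, -1, -1)
        outputs = []
        for feats in frame_features:
            # Output tokens of one frame are the object queries of the next,
            # which ties each query slot to "its" object through time.
            queries = self.decoder(tgt=queries, memory=feats)
            outputs.append(queries)
        return outputs  # downstream heads map tokens to boxes and labels
```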

Collaborative Route Planning of UAVs, Workers and Cars for Crowdsensing in Disaster Response

  • paper_url: http://arxiv.org/abs/2308.11088
  • repo_url: None
  • paper_authors: Lei Han, Chunyu Tu, Zhiwen Yu, Zhiyong Yu, Weihua Shan, Liang Wang, Bin Guo
  • for: Aims to improve the efficiency of data collection by collaborating heterogeneous agents (UAVs, workers, and cars) in disaster-stricken areas.
  • methods: Proposes MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a model structure tailored to heterogeneous multi-agent systems.
  • results: Compared with the baseline algorithms Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP achieves a significant improvement in task completion rate.
    Abstract Efficiently obtaining the up-to-date information in the disaster-stricken area is the key to successful disaster response. Unmanned aerial vehicles (UAVs), workers and cars can collaborate to accomplish sensing tasks, such as data collection, in disaster-stricken areas. In this paper, we explicitly address the route planning for a group of agents, including UAVs, workers, and cars, with the goal of maximizing the task completion rate. We propose MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a tailored model structure for heterogeneous multi-agent systems. Global-local dual information processing encompasses the extraction and dissemination of spatial features from global information, as well as the partitioning and filtering of local information from individual agents. Regarding the construction of the model structure for heterogeneous multi-agent, we perform the following work. We design the same data structure to represent the states of different agents, prove the Markovian property of the decision-making process of agents to simplify the model structure, and also design a reasonable reward function to train the model. Finally, we conducted detailed experiments based on the rich simulation data. In comparison to the baseline algorithms, namely Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP has exhibited a significant improvement in terms of task completion rate.

Neural Amortized Inference for Nested Multi-agent Reasoning

  • paper_url: http://arxiv.org/abs/2308.11071
  • repo_url: None
  • paper_authors: Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu
  • for: Aims to improve complex social inference in multi-agent interactions, where agents must understand how others reason about them.
  • methods: Uses neural networks to amortize high-order social inference, reducing the computational cost of nested multi-agent reasoning, which otherwise escalates exponentially with each level.
  • results: Experiments in two challenging multi-agent interaction domains show the method is computationally efficient with minimal degradation in accuracy.
    Abstract Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.

Temporal-Distributed Backdoor Attack Against Video Based Action Recognition

  • paper_url: http://arxiv.org/abs/2308.11070
  • repo_url: None
  • paper_authors: Xi Li, Songhe Wang, Ruiquan Huang, Mahanth Gowda, George Kesidis
  • for: Investigates backdoor (Trojan) attacks on video-based action recognition and the resilience of existing models to them.
  • methods: Proposes a simple yet effective attack that adds perturbations in a transformed domain, planting an imperceptible, temporally distributed trigger across the video frames while the model retains high accuracy on attack-free instances.
  • results: Extensive experiments with well-known models on two video recognition benchmarks (UCF101 and HMDB51) and the Greek Sign Language (GSL) dataset demonstrate the attack's imperceptibility and resilience to existing defenses; the study also identifies a phenomenon termed "collateral damage", in which the attack can cause misclassification of data from non-target classes.
    Abstract Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). The backdoor-compromised model will mis-classify to the target class chosen by the attacker when a test instance (from a non-target class) is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems under backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are \textbf{independently} embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a \textit{simple} yet \textit{effective} backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an \textbf{imperceptible, temporally distributed} trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies.

Topological Graph Signal Compression

  • paper_url: http://arxiv.org/abs/2308.11068
  • repo_url: None
  • paper_authors: Guillermo Bernárdez, Lev Telyatnikov, Eduard Alarcón, Albert Cabellos-Aparicio, Pere Barlet-Ros, Pietro Liò
  • for: Proposes a Topological Deep Learning (TDL) based method for compressing signals over graph structures.
  • methods: First clusters the N datapoints of the original signal into K << N collections, inferring disjoint sets of higher-order structures; then a topology-inspired message passing produces a compressed representation of the signal within those multi-element sets.
  • results: On datasets from two real-world Internet Service Provider networks, the method improves on standard GNN and feed-forward architectures when compressing temporal link-based signals, with 30% to 90% better reconstruction errors across all evaluation scenarios, suggesting it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.
    Abstract Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNN) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper we propose a novel TDL-based method for compressing signals over graphs, consisting in two main steps: first, disjoint sets of higher-order structures are inferred based on the original signal --by clustering $N$ datapoints into $K\ll N$ collections; then, a topological-inspired message passing gets a compressed representation of the signal within those multi-element sets. Our results show that our framework improves both standard GNN and feed-forward architectures in compressing temporal link-based signals from two real-word Internet Service Provider Networks' datasets --from $30\%$ up to $90\%$ better reconstruction errors across all evaluation scenarios--, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.
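
The first step, clustering the N signal-carrying nodes into K << N collections and summarizing the signal within each, can be sketched with off-the-shelf clustering. The mean aggregation below is only a crude stand-in for the paper's topological message passing, so treat it as a baseline illustration rather than the method itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_graph_signal(node_signals, k):
    """node_signals: (N, T) array, one time series per node.
    Clusters nodes by signal similarity into k collections and keeps one
    summary series per collection, a simplification of the paper's
    message passing over higher-order structures."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(node_signals)
    compressed = np.stack([node_signals[labels == c].mean(axis=0)
                           for c in range(k)])
    return labels, compressed            # (N,), (k, T)

def reconstruct(labels, compressed):
    """Naive decoding: every node is read back as its collection's summary."""
    return compressed[labels]
```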

CSM-H-R: An Automatic Context Reasoning Framework for Interoperable Intelligent Systems and Privacy Protection

  • paper_url: http://arxiv.org/abs/2308.11066
  • repo_url: https://github.com/songhui01/csm-h-r
  • paper_authors: Songhui Yue, Xiaoyan Hong, Randy K. Smith
  • for: Proposes an automatic high-level context (HLC) reasoning framework to enable automated integration of intelligent systems at scale.
  • methods: Programmatically combines ontologies and states at runtime and in the model-storage phase to recognize meaningful HLC; the resulting data representation can be applied to different reasoning techniques.
  • results: Experiments translating HLC reasoning into vector and matrix computing show the framework handles the dynamic aspects of context and suggest that advanced mathematical and probabilistic models can bring the next level of automation; privacy protection is supported through anonymization via label embedding and reduced information correlation.
    Abstract Automation of High-Level Context (HLC) reasoning for intelligent systems at scale is imperative due to the unceasing accumulation of contextual data in the IoT era, the trend of the fusion of data from multi-sources, and the intrinsic complexity and dynamism of the context-based decision-making process. To mitigate this issue, we propose an automatic context reasoning framework CSM-H-R, which programmatically combines ontologies and states at runtime and the model-storage phase for attaining the ability to recognize meaningful HLC, and the resulting data representation can be applied to different reasoning techniques. Case studies are developed based on an intelligent elevator system in a smart campus setting. An implementation of the framework - a CSM Engine, and the experiments of translating the HLC reasoning into vector and matrix computing especially take care of the dynamic aspects of context and present the potentiality of using advanced mathematical and probabilistic models to achieve the next level of automation in integrating intelligent systems; meanwhile, privacy protection support is achieved by anonymization through label embedding and reducing information correlation. The code of this study is available at: https://github.com/songhui01/CSM-H-R.

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12305
  • repo_url: None
  • paper_authors: Haokun Chen, Yao Zhang, Denis Krompass, Jindong Gu, Volker Tresp
  • for: Aims to improve foundation-model finetuning for multi-modal learning while avoiding centralized collection of training data.
  • methods: Proposes Federated Dual-Adapter Teacher (FedDAT), which regularizes client-local updates and applies Mutual Knowledge Distillation (MKD) to address data heterogeneity across clients.
  • results: Experiments show that FedDAT substantially outperforms existing centralized PEFT methods adapted for FL on multi-modal vision-language tasks.
    Abstract Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. Hereby, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Aadapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for an efficient knowledge transfer. FedDAT is the first approach that enables an efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.
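
The Mutual Knowledge Distillation step can be sketched as symmetric distillation terms in which each adapter branch is simultaneously teacher and student of the other. The temperature and equal weighting are assumptions, and the actual method couples this loss with the dual-adapter regularization of client-local updates.

```python
import torch
import torch.nn.functional as F

def mutual_kd_loss(logits_a, logits_b, temperature=2.0):
    """Symmetric distillation between two branches' predictions.
    Gradients are stopped through whichever branch acts as the teacher."""
    def kd(student, teacher):
        return F.kl_div(F.log_softmax(student / temperature, dim=-1),
                        F.softmax(teacher.detach() / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2
    return 0.5 * (kd(logits_a, logits_b) + kd(logits_b, logits_a))
```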

Beyond Discriminative Regions: Saliency Maps as Alternatives to CAMs for Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.11052
  • repo_url: None
  • paper_authors: M. Maruf, Arka Daw, Amartya Dutta, Jie Bu, Anuj Karpatne
  • for: Compares saliency maps and class activation maps (CAMs) for weakly supervised semantic segmentation (WS3) and proposes new evaluation metrics for a comprehensive assessment of both.
  • methods: Uses both CAMs and saliency maps to generate pseudo-ground truths, and compares their similarities and dissimilarities from multiple perspectives.
  • results: Finds that saliency maps better address the non-discriminative region (NDR) limitation of CAMs in WS3, and that random cropping, used as a stochastic aggregation technique, further improves saliency performance.
    Abstract In recent years, several Weakly Supervised Semantic Segmentation (WS3) methods have been proposed that use class activation maps (CAMs) generated by a classifier to produce pseudo-ground truths for training segmentation models. While CAMs are good at highlighting discriminative regions (DR) of an image, they are known to disregard regions of the object that do not contribute to the classifier's prediction, termed non-discriminative regions (NDR). In contrast, attribution methods such as saliency maps provide an alternative approach for assigning a score to every pixel based on its contribution to the classification prediction. This paper provides a comprehensive comparison between saliencies and CAMs for WS3. Our study includes multiple perspectives on understanding their similarities and dissimilarities. Moreover, we provide new evaluation metrics that perform a comprehensive assessment of WS3 performance of alternative methods w.r.t. CAMs. We demonstrate the effectiveness of saliencies in addressing the limitation of CAMs through our empirical studies on benchmark datasets. Furthermore, we propose random cropping as a stochastic aggregation technique that improves the performance of saliency, making it a strong alternative to CAM for WS3.

Personalized Event Prediction for Electronic Health Records

  • paper_url: http://arxiv.org/abs/2308.11013
  • repo_url: None
  • paper_authors: Jeong Min Lee, Milos Hauskrecht
  • For: The paper aims to develop accurate predictive models of clinical event sequences to support patient care, specifically by addressing the challenge of patient-specific variability in clinical conditions.
  • Methods: The paper proposes and investigates multiple new event sequence prediction models and methods, including refinement of population-wide models to subpopulations, self-adaptation, and meta-level model switching.
  • Results: The paper analyzes and tests the performance of these models on clinical event sequences of patients in the MIMIC-III database.
    Abstract Clinical event sequences consist of hundreds of clinical events that represent records of patient care over time. Developing accurate predictive models of such sequences is of great importance for supporting a variety of models for interpreting/classifying the current patient condition, or predicting adverse clinical events and outcomes, all aimed at improving patient care. One important challenge of learning predictive models of clinical sequences is their patient-specific variability. Based on underlying clinical conditions, each patient's sequence may consist of different sets of clinical events (observations, lab results, medications, procedures). Hence, simple population-wide models learned from event sequences for many different patients may not accurately predict patient-specific dynamics of event sequences and their differences. To address the problem, we propose and investigate multiple new event sequence prediction models and methods that let us better adjust the prediction for individual patients and their specific conditions. The methods developed in this work pursue refinement of population-wide models to subpopulations, self-adaptation, and a meta-level model switching that is able to adaptively select the model with the best chance to support the immediate prediction. We analyze and test the performance of these models on clinical event sequences of patients in the MIMIC-III database.

“Guinea Pig Trials” Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion

  • paper_url: http://arxiv.org/abs/2308.10974
  • repo_url: None
  • paper_authors: Xu Han, Zengqing Wu, Chuan Xiao
  • For: The paper is written to study firm competition and collusion using a novel framework called Smart Agent-Based Modeling (SABM), which employs GPT-4 technologies to represent firms and their interactions.
  • Methods: The study uses a controlled experiment with smart agents to examine firm price competition and collusion behaviors under various conditions, comparing the results to those obtained through experiments with human subjects.
  • Results: The paper finds that smart agents consistently reach tacit collusion in the absence of communication, leading to prices converging at levels higher than the Bertrand equilibrium price but lower than monopoly or cartel prices. With communication allowed, smart agents achieve a higher-level collusion with prices close to cartel prices, and collusion forms more quickly with communication. These results highlight the importance of communication in enhancing trust between firms and facilitating collusion.
    Abstract Firm competition and collusion involve complex dynamics, particularly when considering communication among firms. Such issues can be modeled as problems of complex systems, traditionally approached through experiments involving human subjects or agent-based modeling methods. We propose an innovative framework called Smart Agent-Based Modeling (SABM), wherein smart agents, supported by GPT-4 technologies, represent firms, and interact with one another. We conducted a controlled experiment to study firm price competition and collusion behaviors under various conditions. SABM is more cost-effective and flexible compared to conducting experiments with human subjects. Smart agents possess an extensive knowledge base for decision-making and exhibit human-like strategic abilities, surpassing traditional ABM agents. Furthermore, smart agents can simulate human conversation and be personalized, making them ideal for studying complex situations involving communication. Our results demonstrate that, in the absence of communication, smart agents consistently reach tacit collusion, leading to prices converging at levels higher than the Bertrand equilibrium price but lower than monopoly or cartel prices. When communication is allowed, smart agents achieve a higher-level collusion with prices close to cartel prices. Collusion forms more quickly with communication, while price convergence is smoother without it. These results indicate that communication enhances trust between firms, encouraging frequent small price deviations to explore opportunities for a higher-level win-win situation and reducing the likelihood of triggering a price war. We also assigned different personas to firms to analyze behavioral differences and tested variant models under diverse market structures. The findings showcase the effectiveness and robustness of SABM and provide intriguing insights into competition and collusion.

DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering

  • paper_url: http://arxiv.org/abs/2308.10959
  • repo_url: None
  • paper_authors: Sijin Wu, Dan Zhang, Teng Hu, Shikun Feng
  • for: Presents Docprompt, a document question answering model with strong zero-shot and few-shot performance.
  • methods: Proposes a novel weakly supervised data generation method, a multi-stage training method, and an ensemble of an understanding model and a generation model.
  • results: After continued pretraining, Docprompt significantly outperforms existing strong baselines on document question answering tasks, greatly improving the delivery efficiency and model performance of customer projects while reducing annotation and labor costs.
    Abstract In this paper, we propose Docprompt for document question answering tasks with powerful zero-shot and few-shot performance. We propose a novel weakly supervised data generation method, a novel multi-stage training method, and a novel understanding model & generation model ensemble method. Experiment results show that the Docprompt model, after continued pretraining, significantly outperforms existing strong baseline models on document question answering tasks. This method greatly improves the delivery efficiency and model performance of document question answering customer projects, reducing annotation and labor costs. Our demo can be found at https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout.

Structured World Models from Human Videos

  • paper_url: http://arxiv.org/abs/2308.10901
  • repo_url: None
  • paper_authors: Russell Mendonca, Shikhar Bahl, Deepak Pathak
  • For: The paper aims to enable robots to learn complex manipulation skills directly in the real world using a small amount of interaction data.
  • Methods: The approach uses human video data to build a structured, human-centric action space grounded in visual affordances, and trains a world model on human videos before fine-tuning on a small amount of robot interaction data without task supervision.
  • Results: The approach allows robots to learn various manipulation skills in complex settings in under 30 minutes of interaction.
    Abstract We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in the fields of computer vision and natural language, our belief is that in order to efficiently learn, a robot must be able to leverage internet-scale, human video data. Humans interact with the world in many interesting ways, which can allow a robot to not only build an understanding of useful actions and affordances but also how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings, in under 30 minutes of interaction. Videos can be found at https://human-world-model.github.io

TADA! Text to Animatable Digital Avatars

  • paper_url: http://arxiv.org/abs/2308.10899
  • repo_url: https://github.com/TingtingLiao/TADA
  • paper_authors: Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxaing Tang, Yangyi Huang, Justus Thies, Michael J. Black
  • For: The paper aims to generate high-quality 3D avatars from textual descriptions, with realistic animations and detailed geometry.
  • Methods: The approach uses a 2D diffusion model and an animatable parametric body model, along with hierarchical rendering and score distillation sampling (SDS), to create detailed 3D avatars from text.
  • Results: The paper demonstrates that TADA significantly surpasses existing approaches on both qualitative and quantitative measures, enabling the creation of large-scale digital character assets that are ready for animation and rendering, and are easily editable through natural language.
    Abstract We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.

Giraffe: Adventures in Expanding Context Lengths in LLMs

  • paper_url: http://arxiv.org/abs/2308.10882
  • repo_url: https://github.com/abacusai/long-context
  • paper_authors: Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu
  • for: This paper examines how modern large language models (LLMs) handle input sequences longer than their train-time context length at evaluation time.
  • methods: The paper surveys existing context length extrapolation methods, most of which modify the positional encoding system used in the attention mechanism to indicate where tokens or activations are located in the input sequence, and introduces some new designs, including a truncation strategy for modifying the basis of the position encoding.
  • results: The methods are tested on three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity. Linear scaling proves to be the best method for extending context length, further gains can be achieved by using longer scales at evaluation time, and the truncated basis also shows promising extrapolation capabilities. Based on these results, the authors release three new 13B-parameter long-context models, called Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B, along with code to reproduce the results.
    Abstract Modern large language models (LLMs) that rely on attention mechanisms are typically trained with fixed context lengths which enforce upper limits on the length of input sequences that they can handle at evaluation time. To use these models on sequences longer than the train-time context length, one might employ techniques from the growing family of context length extrapolation methods -- most of which focus on modifying the system of positional encodings used in the attention mechanism to indicate where tokens or activations are located in the input sequence. We conduct a wide survey of existing methods of context length extrapolation on a base LLaMA or LLaMA 2 model, and introduce some of our own design as well -- in particular, a new truncation strategy for modifying the basis for the position encoding. We test these methods using three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity, which we find to be less fine-grained as a measure of long context performance of LLMs. We release the three tasks publicly as datasets on HuggingFace. We discover that linear scaling is the best method for extending context length, and show that further gains can be achieved by using longer scales at evaluation time. We also discover promising extrapolation capabilities in the truncated basis. To support further research in this area, we release three new 13B parameter long-context models which we call Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.
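
    To make the "linear scaling" finding concrete, here is a minimal sketch of linear positional interpolation for rotary position embeddings (RoPE): position indices are divided by a scale factor so that a model trained at one context length can attend over a longer one at evaluation time. The base, dimension, and 4x factor are illustrative assumptions, not Giraffe's exact configuration.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """RoPE rotation angles; scale > 1 compresses positions so a model
    trained on length L can be evaluated on length scale * L."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    scaled_pos = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(scaled_pos, inv_freq)                     # (len, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: evaluate a model trained with 4k context at 16k via 4x scaling.
q = np.random.randn(16384, 64)
q_rot = apply_rope(q, rope_angles(np.arange(16384), 64, scale=4.0))
```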

Analyzing Transformer Dynamics as Movement through Embedding Space

  • paper_url: http://arxiv.org/abs/2308.10874
  • repo_url: None
  • paper_authors: Sumeet S. Singh
  • for: This paper explores the underlying mechanics of Transformer language models and how they give rise to intelligent behaviors.
  • methods: The authors use a systems approach to analyze Transformers and develop a mathematical framework that views the models as movement through embedding space.
  • results: The paper reveals important insights into the emergence of intelligence in Transformers, including the idea that the models are essentially “Embedding Space walkers” that compose context into a single vector, and that attention plays a key role in associating vectors and influencing the organization of the embedding space. Additionally, the authors find some evidence for their semantic space theory, which posits that embedding vectors represent semantic concepts.
    Abstract Transformer language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting and using tools. This paper explores how their underlying mechanics give rise to intelligent behaviors. We adopt a systems approach to analyze Transformers in detail and develop a mathematical framework that frames their dynamics as movement through embedding space. This novel perspective provides a principled way of thinking about the problem and reveals important insights related to the emergence of intelligence: 1. At its core the Transformer is a Embedding Space walker, mapping intelligent behavior to trajectories in this vector space. 2. At each step of the walk, it composes context into a single composite vector whose location in Embedding Space defines the next step. 3. No learning actually occurs during decoding; in-context learning and generalization are simply the result of different contexts composing into different vectors. 4. Ultimately the knowledge, intelligence and skills exhibited by the model are embodied in the organization of vectors in Embedding Space rather than in specific neurons or layers. These abilities are properties of this organization. 5. Attention's contribution boils down to the association-bias it lends to vector composition and which influences the aforementioned organization. However, more investigation is needed to ascertain its significance. 6. The entire model is composed from two principal operations: data independent filtering and data dependent aggregation. This generalization unifies Transformers with other sequence models and across modalities. Building upon this foundation we formalize and test a semantic space theory which posits that embedding vectors represent semantic concepts and find some evidence of its validity.

Real World Time Series Benchmark Datasets with Distribution Shifts: Global Crude Oil Price and Volatility

  • paper_url: http://arxiv.org/abs/2308.10846
  • repo_url: https://github.com/oilpricebenchmarks/COB
  • paper_authors: Pranay Pasula
  • for: The study provides task-labeled time-series benchmark datasets to drive progress in continual learning in the financial domain.
  • methods: Asset price data are transformed into volatility proxies, models are fitted with the expectation-maximization (EM) algorithm, and contextual task (regime) labels aligned with real-world events are generated.
  • results: Including the task labels universally improves the performance of four continual learning algorithms over multiple forecasting horizons.
    Abstract The scarcity of task-labeled time-series benchmarks in the financial domain hinders progress in continual learning. Addressing this deficit would foster innovation in this area. Therefore, we present COB, Crude Oil Benchmark datasets. COB includes 30 years of asset prices that exhibit significant distribution shifts and optimally generates corresponding task (i.e., regime) labels based on these distribution shifts for the three most important crude oils in the world. Our contributions include creating real-world benchmark datasets by transforming asset price data into volatility proxies, fitting models using expectation-maximization (EM), generating contextual task labels that align with real-world events, and providing these labels as well as the general algorithm to the public. We show that the inclusion of these task labels universally improves performance on four continual learning algorithms, some state-of-the-art, over multiple forecasting horizons. We hope these benchmarks accelerate research in handling distribution shifts in real-world data, especially due to the global importance of the assets considered. We've made the (1) raw price data, (2) task labels generated by our approach, (3) and code for our algorithm available at https://oilpricebenchmarks.github.io.
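
    As a rough illustration of the price-to-volatility transformation described above, the sketch below computes a rolling realized-volatility proxy from log returns; the 20-day window and annualization constant are illustrative assumptions, not necessarily the benchmark's exact construction.

```python
import numpy as np
import pandas as pd

def volatility_proxy(prices: pd.Series, window: int = 20) -> pd.Series:
    """Rolling standard deviation of log returns, annualized assuming
    252 trading days; a common realized-volatility proxy."""
    log_returns = np.log(prices).diff()
    return log_returns.rolling(window).std() * np.sqrt(252)

# Usage: prices indexed by date, e.g. front-month WTI settlement prices.
# vol = volatility_proxy(wti_prices)
```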

Neural Networks Optimizations Against Concept and Data Drift in Malware Detection

  • paper_url: http://arxiv.org/abs/2308.10821
  • repo_url: None
  • paper_authors: William Maillet, Benjamin Marais
  • for: Improving a baseline neural network's ability to handle the concept drift problem in malware detection.
  • methods: Feature reduction and training with the most recent validation set available, together with a proposed loss function named Drift-Resilient Binary Cross-Entropy.
  • results: Evaluated on recent malicious files collected between 2020 and 2023, the improved model detects 15.2% more malware than the baseline model.
    Abstract Despite the promising results of machine learning models in malware detection, they face the problem of concept drift due to the constant evolution of malware. This leads to a decline in performance over time, as the data distribution of the new files differs from the training one, requiring regular model updates. In this work, we propose a model-agnostic protocol to improve a baseline neural network to handle the drift problem. We show the importance of feature reduction and training with the most recent validation set possible, and propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement to the classical Binary Cross-Entropy that is more effective against drift. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of recent malicious files, collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.

cs.CL - 2023-08-22

Empowering Refugee Claimants and their Lawyers: Using Machine Learning to Examine Decision-Making in Refugee Law

  • paper_url: http://arxiv.org/abs/2308.11531
  • repo_url: None
  • paper_authors: Claire Barale
  • For: This paper aims to help stakeholders in refugee status adjudications, such as lawyers, judges, governing bodies, and claimants, make better decisions through data-driven intelligence and increase the understanding and transparency of the refugee application process for all involved parties.
  • Methods: The paper presents a completed experiment on retrieving past cases and ongoing efforts related to analyzing legal decision-making processes on a dataset of Canadian cases, using NLP-based solutions.
  • Results: The paper introduces a novel benchmark for future NLP research in refugee law and expects to achieve benefits such as reduced time-to-decision, fairer and more transparent outcomes, and improved decision quality.
    Abstract Our project aims at helping and supporting stakeholders in refugee status adjudications, such as lawyers, judges, governing bodies, and claimants, in order to make better decisions through data-driven intelligence and increase the understanding and transparency of the refugee application process for all involved parties. This PhD project has two primary objectives: (1) to retrieve past cases, and (2) to analyze legal decision-making processes on a dataset of Canadian cases. In this paper, we present the current state of our work, which includes a completed experiment on part (1) and ongoing efforts related to part (2). We believe that NLP-based solutions are well-suited to address these challenges, and we investigate the feasibility of automating all steps involved. In addition, we introduce a novel benchmark for future NLP research in refugee law. Our methodology aims to be inclusive to all end-users and stakeholders, with expected benefits including reduced time-to-decision, fairer and more transparent outcomes, and improved decision quality.

Unsupervised Prototype Adapter for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.11507
  • repo_url: None
  • paper_authors: Yi Zhang, Ce Zhang, Xueting Hu, Zhihai He
  • for: The study aims to improve the adaptability of vision-language models without requiring large numbers of annotated samples.
  • methods: An unsupervised fine-tuning approach leverages CLIP's text-image aligning capability to automatically select the most confident samples for each class, and uses them to generate class prototypes that initialize a learnable prototype model.
  • results: Experiments show that the method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and the state-of-the-art UPL method by large margins on image recognition and domain generalization tasks.
    Abstract Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class. Utilizing these selected samples, we generate class prototypes, which serve as the initialization for the learnable prototype model. After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks. Our extensive experimental results on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and also the state-of-the-art UPL method by large margins.
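
    A minimal sketch of the prototype-initialization step, assuming precomputed, L2-normalized CLIP image features and zero-shot logits; the selection size k and the function name are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def build_prototypes(img_feats, zeroshot_logits, k=8):
    """img_feats: (N, D) L2-normalized image features; zeroshot_logits: (N, C).
    For each class, average the k most confident pseudo-labeled samples
    into a prototype used to initialize the learnable prototype model."""
    n_classes = zeroshot_logits.shape[1]
    conf = zeroshot_logits.max(axis=1)       # confidence per sample
    pseudo = zeroshot_logits.argmax(axis=1)  # CLIP pseudo-labels
    protos = np.zeros((n_classes, img_feats.shape[1]))
    for c in range(n_classes):
        idx = np.where(pseudo == c)[0]
        if idx.size:
            top = idx[np.argsort(conf[idx])[-k:]]  # k most confident samples
            protos[c] = img_feats[top].mean(axis=0)
    norms = np.linalg.norm(protos, axis=1, keepdims=True)
    return protos / np.clip(norms, 1e-8, None)
```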

Can Authorship Representation Learning Capture Stylistic Features?

  • paper_url: http://arxiv.org/abs/2308.11490
  • repo_url: https://github.com/llnl/luar
  • paper_authors: Andrew Wang, Cristina Aggazzotti, Rebecca Kotula, Rafael Rivera Soto, Marcus Bishop, Nicholas Andrews
  • for: The paper examines authorship representations learned in a purely data-driven manner for authorship attribution.
  • methods: Representations are learned from large text corpora furnished with author labels and are systematically probed through a series of targeted experiments.
  • results: The experiments suggest that the learned representations are indeed sensitive to writing style and can therefore be expected to be robust to certain kinds of data shift, such as topic drift over time.
    Abstract Automatically disentangling an author's style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer.

Learning to generate and corr- uh I mean repair language in real-time

  • paper_url: http://arxiv.org/abs/2308.11683
  • repo_url: https://bitbucket.org/dylandialoguesystem/dsttr
  • paper_authors: Arash Eshghi, Arash Ashrafzadeh
  • for: The goal is to develop conversational AI that can generate and repair language naturally and controllably in real time.
  • methods: Using a previously learned Dynamic Syntax grammar and the CHILDES corpus, the authors develop, train, and evaluate a probabilistic model for incremental generation whose input is a purely semantic generation goal concept in Type Theory with Records (TTR).
  • results: The model's output exactly matches the gold candidate in 78% of cases, with a ROUGE-l score of 0.86. When the generation goal changes mid-utterance, the model generates self-repairs correctly in 85% of cases, and a small human evaluation confirms that the generated self-repairs are natural and grammatical.
    Abstract In conversation, speakers produce language incrementally, word by word, while continuously monitoring the appropriateness of their own contribution in the dynamically unfolding context of the conversation; and this often leads them to repair their own utterance on the fly. This real-time language processing capacity is furthermore crucial to the development of fluent and natural conversational AI. In this paper, we use a previously learned Dynamic Syntax grammar and the CHILDES corpus to develop, train and evaluate a probabilistic model for incremental generation where input to the model is a purely semantic generation goal concept in Type Theory with Records (TTR). We show that the model's output exactly matches the gold candidate in 78% of cases with a ROUGE-l score of 0.86. We further do a zero-shot evaluation of the ability of the same model to generate self-repairs when the generation goal changes mid-utterance. Automatic evaluation shows that the model can generate self-repairs correctly in 85% of cases. A small human evaluation confirms the naturalness and grammaticality of the generated self-repairs. Overall, these results further highlight the generalisation power of grammar-based models and lay the foundations for more controllable, and naturally interactive conversational AI systems.

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

  • paper_url: http://arxiv.org/abs/2308.11466
  • repo_url: https://github.com/facebookresearch/sonar
  • paper_authors: Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
  • for: The paper introduces SONAR, a new multilingual and multimodal fixed-size sentence embedding space.
  • methods: A single text encoder covering 200 languages is paired with language-specific speech encoders trained in a teacher-student setting on speech transcription data, along with a text decoder that enables text-to-text and speech-to-text machine translation.
  • results: The approach substantially outperforms existing embeddings such as LASER3 and LabSE on multilingual similarity search tasks; the speech encoders outperform existing speech encoders on similarity search, and the model supports zero-shot translation across language and modality combinations.
    Abstract We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB~1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
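
    A minimal sketch of the xsim-style multilingual similarity search that a fixed-size, language-agnostic embedding space enables; the encoder calls in the usage comment are hypothetical placeholders for the SONAR encoders.

```python
import numpy as np

def nearest_targets(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """For each source sentence embedding, return the index of the
    target-language sentence with the highest cosine similarity."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (src @ tgt.T).argmax(axis=1)

# Usage with hypothetical encoders sharing one embedding space:
# idx = nearest_targets(encode_text(en_sents), encode_speech(fr_audio))
```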

Extracting Relational Triples Based on Graph Recursive Neural Network via Dynamic Feedback Forest Algorithm

  • paper_url: http://arxiv.org/abs/2308.11411
  • repo_url: None
  • paper_authors: Hongyin Zhu
  • for: Transforming unstructured text data into structured knowledge.
  • methods: The triple extraction task is converted into a graph labeling problem, exploiting the structural information of dependency parsing and graph recursive neural networks (GRNNs).
  • results: A dynamic feedback forest algorithm is proposed that integrates the subtasks by connecting their representations through inference operations during model training.
    Abstract Extracting relational triples (subject, predicate, object) from text enables the transformation of unstructured text data into structured knowledge. The named entity recognition (NER) and the relation extraction (RE) are two foundational subtasks in this knowledge generation pipeline. The integration of subtasks poses a considerable challenge due to their disparate nature. This paper presents a novel approach that converts the triple extraction task into a graph labeling problem, capitalizing on the structural information of dependency parsing and graph recursive neural networks (GRNNs). To integrate subtasks, this paper proposes a dynamic feedback forest algorithm that connects the representations of subtasks by inference operations during model training. Experimental results demonstrate the effectiveness of the proposed method.

Convoifilter: A case study of doing cocktail party speech recognition

  • paper_url: http://arxiv.org/abs/2308.11380
  • repo_url: None
  • paper_authors: Thai-Binh Nguyen, Alexander Waibel
  • for: Improving automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment.
  • methods: A single-channel speech enhancement module isolates the speaker's voice from background noise and is combined with an ASR module, decreasing the word error rate (WER) from 80% to 26.4%.
  • results: With a joint fine-tuning strategy, the model further reduces the WER from 26.4% under separate tuning to 14.5%.
    Abstract This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.

M3PS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization in E-commerce

  • paper_url: http://arxiv.org/abs/2308.11351
  • repo_url: None
  • paper_authors: Tao Chen, Ze Lin, Hui Li, Jiayi Ji, Yiyi Zhou, Guanbin Li, Rongrong Ji
  • for: The paper proposes a method for generating high-quality product summaries that attract customers' interest and increase their desire to purchase.
  • methods: An end-to-end multi-grained multi-modal attribute-aware model jointly models product attributes across text and image modalities while generating product summaries.
  • results: Experiments on a large-scale Chinese e-commerce dataset show that the method significantly outperforms existing product summarization methods on several summarization metrics.
    Abstract Given the long textual product information and the product image, Multi-Modal Product Summarization (MMPS) aims to attract customers' interest and increase their desire to purchase by highlighting product characteristics with a short textual summary. Existing MMPS methods have achieved promising performance. Nevertheless, there still exist several problems: 1) lack end-to-end product summarization, 2) lack multi-grained multi-modal modeling, and 3) lack multi-modal attribute modeling. To address these issues, we propose an end-to-end multi-grained multi-modal attribute-aware product summarization method (M3PS) for generating high-quality product summaries in e-commerce. M3PS jointly models product attributes and generates product summaries. Meanwhile, we design several multi-grained multi-modal tasks to better guide the multi-modal learning of M3PS. Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries. Extensive experiments on a real large-scale Chinese e-commerce dataset demonstrate that our model outperforms state-of-the-art product summarization methods w.r.t. several summarization metrics.

LEAP: Efficient and Automated Test Method for NLP Software

  • paper_url: http://arxiv.org/abs/2308.11284
  • repo_url: https://github.com/lumos-xiao/leap
  • paper_authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji, Pengcheng Zhang
  • for: Improving the robustness of DNN-based NLP software by automatically generating adversarial test cases.
  • methods: Levy-flight-based adaptive particle swarm optimization integrated with textual features: Levy flight is adopted for population initialization to increase the diversity of generated test cases, an inertial weight adaptive update operator improves the efficiency of global optimization over high-dimensional text examples, and a greedy-strategy mutation operator reduces search time.
  • results: A series of experiments shows that LEAP generates adversarial test cases with a high success rate while significantly reducing time overhead compared to other heuristic-based methods, and the generated test cases are more transferable.
    Abstract The widespread adoption of DNNs in NLP software has highlighted the need for robustness. Researchers proposed various automatic testing techniques for adversarial test cases. However, existing methods suffer from two limitations: weak error-discovering capabilities, with success rates ranging from 0% to 24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to 205.28s per test case, making them challenging for time-constrained scenarios. To address these issues, this paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases. Specifically, we adopt Levy flight for population initialization to increase the diversity of generated test cases. We also design an inertial weight adaptive update operator to improve the efficiency of LEAP's global optimization of high-dimensional text examples and a mutation operator based on the greedy strategy to reduce the search time. We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%, which is 6.1% higher than the next best approach (PSOattack). While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other heuristic-based methods. Additionally, the experimental results demonstrate that LEAP can generate more transferable test cases and significantly enhance the robustness of DNN-based systems.
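
    As a concrete sketch of the Levy-flight initialization idea, the snippet below draws heavy-tailed Levy steps with Mantegna's algorithm, a standard construction; beta = 1.5 is a common illustrative choice, not necessarily LEAP's setting.

```python
import math
import numpy as np

def levy_steps(size, beta=1.5, seed=0):
    """Heavy-tailed Levy-flight steps via Mantegna's algorithm; the
    occasional long jumps help diversify an initial particle population."""
    rng = np.random.default_rng(seed)
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

# Usage: spread an initial swarm of 40 particles in a 100-dim search space.
# population = base_position + 0.01 * levy_steps((40, 100))
```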

HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

  • paper_url: http://arxiv.org/abs/2308.11257
  • repo_url: None
  • paper_authors: Yingyao Wang, Yongwei Zhou, Chaoqun Duan, Junwei Bao, Tiejun Zhao
  • for: The study aims to improve the multi-hop question answering ability of semantic parsing methods, especially over heterogeneous knowledge.
  • methods: A self-iterative framework (HopPG) leverages previous-hop execution results to retrieve supporting facts and generate subsequent programs iteratively, addressing the shortcomings of traditional semantic parsing methods for multi-hop question answering.
  • results: Experiments show that HopPG outperforms existing semantic-parsing-based baselines on MMQA-T^2, especially on multi-hop questions.
    Abstract The semantic parsing-based method is an important research branch for knowledge-based question answering. It usually generates executable programs based on the question and then executes them to reason over a knowledge base for answers. Benefiting from this inherent mechanism, it has advantages in performance and interpretability. However, traditional semantic parsing methods usually generate a complete program before executing it, which struggles with multi-hop question answering over heterogeneous knowledge. Firstly, a complete multi-hop program relies on multiple heterogeneous supporting facts, and it is difficult for models to receive these facts simultaneously. Secondly, these methods ignore the interaction information between the previous-hop execution result and the current-hop program generation. To alleviate these challenges, we propose a self-iterative framework for multi-hop program generation (HopPG) over heterogeneous knowledge, which leverages the previous-hop execution results to retrieve supporting facts and generate subsequent programs iteratively. We evaluate our model on MMQA-T^2. The experimental results show that HopPG outperforms existing semantic-parsing-based baselines, especially on the multi-hop questions.

ViCo: Engaging Video Comment Generation with Human Preference Rewards

  • paper_url: http://arxiv.org/abs/2308.11171
  • repo_url: None
  • paper_authors: Yuchong Sun, Bei Liu, Xu Chen, Ruihua Song, Jianlong Fu
  • for: The study aims to generate engaging video comments, enhancing interaction and engagement on video social media.
  • methods: Three novel designs are proposed: using the number of likes a comment receives, after an appropriate debiasing procedure, as a proxy for engagement; training a reward model to automatically evaluate comment engagement; and training an initial generator on readily available but noisy data, then optimizing it with feedback from the reward model.
  • results: Experiments show that the method generates high-quality video comments, particularly when engagement is considered.
    Abstract Engaging video comments play an important role in video social media, as they are the carrier of feelings, thoughts, or humor of the audience. Preliminary works have made initial exploration for video comment generation by adopting caption-style encoder-decoder models. However, comment generation presents some unique challenges distinct from caption generation, which makes these methods somewhat less effective at generating engaging comments. In contrast to the objective and descriptive nature of captions, comments tend to be inherently subjective, making it hard to quantify and evaluate the engagement of comments. Furthermore, the scarcity of truly engaging comments brings difficulty to collecting enough high-quality training examples. In this paper, we propose ViCo with three novel designs to tackle the above challenges for generating engaging Video Comments. Firstly, to quantify the engagement of comments, we utilize the number of "likes" each comment receives as a proxy of human preference after an appropriate debiasing procedure. Secondly, to automatically evaluate the engagement of comments, we train a reward model to align its judgment to the above proxy. Our user studies indicate that this reward model effectively aligns with human judgments. Lastly, to alleviate the scarcity of high-quality comments, an initial generator is trained on readily available but noisy data to generate comments. Then the reward model is employed to offer feedback on the generated comments, thus optimizing the initial generator. To facilitate the research of video commenting, we collect a large video comment-dataset (ViCo-20k) with rich metadata from a popular video website. Experiments on ViCo-20k show that the comments generated by our ViCo model exhibit the best performance in terms of both quantitative and qualitative results, particularly when engagement is considered.

LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (Practical Experience Report)

  • paper_url: http://arxiv.org/abs/2308.11148
  • repo_url: None
  • paper_authors: Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, Chun Zuo
  • for: automating code review activities
  • methods: utilizes parameter-efficient fine-tuning (PEFT) methods and LLaMA, a popular large language model
  • results: equals the performance of existing code-review-focused models with a small model size and limited tuning epochs
    Abstract The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.
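
    A minimal sketch of the kind of parameter-efficient fine-tuning the paper applies, shown here with LoRA through the Hugging Face peft library; the checkpoint name and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; the paper fine-tunes LLaMA models.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% trainable
```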

NLP-based detection of systematic anomalies among the narratives of consumer complaints

  • paper_url: http://arxiv.org/abs/2308.11138
  • repo_url: None
  • paper_authors: Peiheng Gao, Ning Sun, Xuefeng Wang, Chen Yang, Ričardas Zitikis
  • for: The study aims to detect systematic anomalies among consumer complaint narratives.
  • methods: NLP techniques convert the complaint narratives into quantitative data, which are then analyzed with an algorithm for detecting systematic anomalies.
  • results: The procedure is illustrated on the consumer complaint database of the Consumer Financial Protection Bureau.
    Abstract We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.

Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

  • paper_url: http://arxiv.org/abs/2308.11020
  • repo_url: None
  • paper_authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze
  • for: Evaluating the human-likeness of socially situated conversational robots.
  • methods: An annotated dataset of human-likeness scores is built from multimodal user behaviors in an attentive listening dialogue corpus, and the correlation between user behaviors and human-likeness scores is analyzed to assess the robot's human-likeness.
  • results: The analysis shows that human-likeness can be assessed indirectly from observable user behaviors, enhancing the objectivity and reproducibility of the evaluation.
    Abstract This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evaluate the robot's human-likeness based on observable user behaviors indirectly, thus enhancing objectivity and reproducibility. To begin, we created an annotated dataset of human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then conducted an analysis to determine the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method.

Using language models in the implicit automated assessment of mathematical short answer items

  • paper_url: http://arxiv.org/abs/2308.11006
  • repo_url: None
  • paper_authors: Christopher Ormerod
  • for: The study proposes a new way to assess short constructed responses to mathematics items more accurately.
  • methods: A pipeline extracts the key values specified in a student's response using two fine-tuned language models: the first determines whether a key value is implicit in the response, and the second identifies where in the response the key value is specified.
  • results: The value identification pipeline is more accurate and informative than traditional rubric-based scoring and can provide more targeted feedback to help students improve their understanding of mathematics.
    Abstract We propose a new way to assess certain short constructed responses to mathematics items. Our approach uses a pipeline that identifies the key values specified by the student in their response. This allows us to determine the correctness of the response, as well as identify any misconceptions. The information from the value identification pipeline can then be used to provide feedback to the teacher and student. The value identification pipeline consists of two fine-tuned language models. The first model determines if a value is implicit in the student response. The second model identifies where in the response the key value is specified. We consider both a generic model that can be used for any prompt and value, as well as models that are specific to each prompt and value. The value identification pipeline is a more accurate and informative way to assess short constructed responses than traditional rubric-based scoring. It can be used to provide more targeted feedback to students, which can help them improve their understanding of mathematics.

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

  • paper_url: http://arxiv.org/abs/2308.10855
  • repo_url: https://github.com/thukelab/lateval
  • paper_authors: Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo Huang, Wuhe Zou, Weidong Zhang, Hai-Tao Zheng
  • for: Evaluating the lateral thinking ability of large language models.
  • methods: A benchmark based on lateral thinking puzzles assesses the quality of the questions the model poses and its capability to integrate information for problem solving within an interactive framework.
  • results: Nearly all LLMs struggle to employ lateral thinking during interactions; even GPT-4, the most advanced model, shows only some advantage and maintains a noticeable gap compared to humans.
    Abstract With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs with 2 aspects: the quality of questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle with employing lateral thinking during interactions. For example, even the most advanced model, GPT-4, exhibits the advantage to some extent, yet still maintain a noticeable gap when compared to human. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents

  • paper_url: http://arxiv.org/abs/2308.10848
  • repo_url: https://github.com/openbmb/agentverse
  • paper_authors: Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou
  • for: The paper proposes a multi-agent framework based on large language models to enhance the efficiency and effectiveness of task accomplishment.
  • methods: The framework collaboratively and dynamically adjusts its composition as a greater-than-the-sum-of-its-parts system.
  • results: Experiments demonstrate that the framework can effectively deploy multi-agent groups that outperform a single agent, and the paper explores the emergence of social behaviors among individual agents within a group during collaborative task accomplishment.
    Abstract Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework, AgentVerse, that can collaboratively and dynamically adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that the AgentVerse framework can effectively deploy multi-agent groups that outperform a single agent. Furthermore, we delve into the emergence of social behaviors among individual agents within a group during collaborative task accomplishment. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups. Our codes for AgentVerse will soon be released at https://github.com/OpenBMB/AgentVerse.

cs.LG - 2023-08-22

A free from local minima algorithm for training regressive MLP neural networks

  • paper_url: http://arxiv.org/abs/2308.11532
  • repo_url: None
  • paper_authors: Augusto Montisci
  • for: The article presents an innovative method for training regressive MLP networks that is not subject to local minima.
  • methods: The method avoids the problem of local minima by basing training on the properties of the distribution of the training set, or rather on its image internal to the neural network.
  • results: The performance of the algorithm is demonstrated on a well-known benchmark dataset.
    Abstract In this article an innovative method for training regressive MLP networks is presented, which is not subject to local minima. The Error-Back-Propagation algorithm, proposed by William-Hinton-Rummelhart, has had the merit of favouring the development of machine learning techniques, which have permeated every branch of research and technology since the mid-1980s. This extraordinary success is largely due to the black-box approach, but this same factor was also seen as a limitation as soon as more challenging problems were approached. One of the most critical aspects of the training algorithms was that of local minima of the loss function, typically the mean squared error of the output on the training set. In fact, as the most popular training algorithms are driven by the derivatives of the loss function, there is no possibility to evaluate if a reached minimum is local or global. The algorithm presented in this paper avoids the problem of local minima, as the training is based on the properties of the distribution of the training set, or better on its image internal to the neural network. The performance of the algorithm is shown for a well-known benchmark.

ReLiCADA – Reservoir Computing using Linear Cellular Automata Design Algorithm

  • paper_url: http://arxiv.org/abs/2308.11522
  • repo_url: None
  • paper_authors: Jonas Kantic, Fabian C. Legl, Walter Stechele, Jakob Hermann
  • for: Optimizing the design of reservoir computing using cellular automata models for time series applications.
  • methods: An algorithm selects the models' hyperparameters and, in particular, solves the open problem of linear cellular automaton rule selection, pre-selecting only a few promising candidate rules out of an exponentially growing rule space.
  • results: On relevant benchmark datasets, the selected rules achieve low errors, with the best rules ranking among the top 5% of the overall rule space; compared with other state-of-the-art time series models, the approach has lower computational complexity and training time while achieving lower errors.
    Abstract In this paper, we present a novel algorithm to optimize the design of Reservoir Computing using Cellular Automata models for time series applications. Besides selecting the models' hyperparameters, the proposed algorithm particularly solves the open problem of linear Cellular Automaton rule selection. The selection method pre-selects only a few promising candidate rules out of an exponentially growing rule space. When applied to relevant benchmark datasets, the selected rules achieve low errors, with the best rules being among the top 5% of the overall rule space. The algorithm was developed based on mathematical analysis of linear Cellular Automaton properties and is backed by almost one million experiments, adding up to a computational runtime of nearly one year. Comparisons to other state-of-the-art time series models show that the proposed Reservoir Computing using Cellular Automata models have lower computational complexity, at the same time, achieve lower errors. Hence, our approach reduces the time needed for training and hyperparameter optimization by up to several orders of magnitude.

EM for Mixture of Linear Regression with Clustered Data

  • paper_url: http://arxiv.org/abs/2308.11518
  • repo_url: None
  • paper_authors: Amirhossein Reisizadeh, Khashayar Gatmiry, Asuman Ozdaglar
  • for: The question addressed is how the underlying clustered structure in distributed data can be exploited to improve learning schemes.
  • methods: The Expectation-Maximization (EM) method is used to estimate the parameters of a two-component mixture of linear regressions in which each of $m$ nodes generates $n$ samples governed by a shared latent variable.
  • results: If initialized properly, and as long as $m$ grows as $e^{o(n)}$, EM on the structured data requires only $O(1)$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)})$; the analysis establishes novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples.
    Abstract Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from $m$ batches of dependent samples each containing $n$ measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)})$. In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as $m$ grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.
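
    A minimal sketch of EM that exploits the clustered structure, under simplifying assumptions (symmetric components ±β, unit noise, equal mixing weights) that are not necessarily the paper's exact setting: because all n samples of a node share one latent label, the E-step pools each node's samples into a single responsibility.

```python
import numpy as np

def em_clustered(X, y, beta, iters=10):
    """X: (m, n, d) covariates, y: (m, n) responses, beta: (d,) initializer."""
    m, n, d = X.shape
    Xf = X.reshape(m * n, d)
    gram = Xf.T @ Xf + 1e-8 * np.eye(d)
    for _ in range(iters):
        # E-step: one responsibility per node, pooling its n dependent samples.
        r_plus = y - np.einsum('mnd,d->mn', X, beta)   # residuals under +beta
        r_minus = y + np.einsum('mnd,d->mn', X, beta)  # residuals under -beta
        llr = 0.5 * (r_minus ** 2 - r_plus ** 2).sum(axis=1)
        w = 1.0 / (1.0 + np.exp(-np.clip(llr, -30, 30)))  # P(label=+1 | node)
        # M-step: weighted least squares with per-node signed weights.
        s = (2.0 * w - 1.0)[:, None]
        beta = np.linalg.solve(gram, Xf.T @ (s * y).ravel())
    return beta
```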

TrackFlow: Multi-Object Tracking with Normalizing Flows

  • paper_url: http://arxiv.org/abs/2308.11513
  • repo_url: None
  • paper_authors: Gianluca Mancusi, Aniello Panariello, Angelo Porrello, Matteo Fabbri, Simone Calderara, Rita Cucchiara
  • for: Improving multi-object tracking within the tracking-by-detection paradigm, particularly in multi-modal settings.
  • methods: A deep density estimator models the conditional joint probability distribution of correct associations; the resulting probabilistic formulation avoids tailored hyperparameters and the unrealistic assumption that the individual costs are independent.
  • results: Experiments on both simulated and real benchmarks show that the approach consistently enhances the performance of several tracking-by-detection algorithms.
    Abstract The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they imply these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
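
As a rough illustration of the probabilistic formulation, the sketch below scores a candidate association by the negative log-likelihood of its heterogeneous cost features under a density fitted on correct associations. A kernel density estimator stands in for the paper's deep density estimator, and the feature set (IoU, appearance distance, rough 3D distance) is an assumption.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Features of known-correct associations: [IoU, appearance dist, 3D dist].
correct = np.column_stack([
    rng.beta(8, 2, 500),                 # correct pairs overlap strongly,
    rng.gamma(2.0, 0.05, 500),           # look alike,
    rng.gamma(2.0, 0.10, 500),           # and are close in rough 3D
])
density = KernelDensity(bandwidth=0.05).fit(correct)

def association_cost(features):
    """Negative log-likelihood: low for plausible pairs, high otherwise."""
    return -density.score_samples(np.atleast_2d(features))

good = [0.85, 0.05, 0.10]
bad = [0.10, 0.60, 1.50]
print(association_cost(good), association_cost(bad))
# The resulting costs feed a standard assignment step (e.g. Hungarian
# matching) in place of hand-weighted sums of per-cue costs.
```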

Mode Combinability: Exploring Convex Combinations of Permutation Aligned Models

  • paper_url: http://arxiv.org/abs/2308.11511
  • repo_url: None
  • paper_authors: Adrián Csiszárik, Melinda F. Kiss, Péter Kőrösi-Szabó, Márton Muntag, Gergely Papp, Dániel Varga
  • for: Exploring element-wise convex combinations of two permutation-aligned neural network parameter vectors $\Theta_A$ and $\Theta_B$ of size $d$.
  • methods: Extensive experiments examine model combinations parametrized by elements of the hypercube $[0,1]^{d}$ and its vicinity.
  • results: Broad regions of the hypercube form low-loss surfaces, a phenomenon the authors call mode combinability; the combinations satisfy a transitivity property (two models re-based to a common third model are linear mode connected) and a robustness property (significantly perturbed neuron matchings still yield working models), and the combined models remain functionally distinct from one another.
    Abstract We explore element-wise convex combinations of two permutation-aligned neural network parameter vectors $\Theta_A$ and $\Theta_B$ of size $d$. We conduct extensive experiments by examining various distributions of such model combinations parametrized by elements of the hypercube $[0,1]^{d}$ and its vicinity. Our findings reveal that broad regions of the hypercube form surfaces of low loss values, indicating that the notion of linear mode connectivity extends to a more general phenomenon which we call mode combinability. We also make several novel observations regarding linear mode connectivity and model re-basin. We demonstrate a transitivity property: two models re-based to a common third model are also linear mode connected, and a robustness property: even with significant perturbations of the neuron matchings the resulting combinations continue to form a working model. Moreover, we analyze the functional and weight similarity of model combinations and show that such combinations are non-vacuous in the sense that there are significant functional differences between the resulting models.
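
The core operation is easy to state in code: draw a per-element $\lambda \in [0,1]^{d}$ and combine the two parameter vectors element-wise. The tiny MLP and the choice of $\lambda$ distributions below are illustrative; in the paper, $\Theta_A$ and $\Theta_B$ come from trained, permutation-aligned networks.

```python
import torch
import torch.nn as nn

def combine(model_a, model_b, lam_fn):
    """Return a model whose every parameter is lam*a + (1-lam)*b,
    with a per-element lam drawn from lam_fn (values in [0, 1])."""
    combined = type(model_a)()
    state = {}
    sa, sb = model_a.state_dict(), model_b.state_dict()
    for name in sa:
        lam = lam_fn(sa[name].shape)          # one element of [0,1]^d per weight
        state[name] = lam * sa[name] + (1 - lam) * sb[name]
    combined.load_state_dict(state)
    return combined

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    def forward(self, x):
        return self.net(x)

a, b = TinyMLP(), TinyMLP()
# Interior point of the hypercube (plain linear interpolation) ...
midpoint = combine(a, b, lambda shape: torch.full(shape, 0.5))
# ... versus a random vertex, mixing the two parents coordinate-wise.
vertex = combine(a, b, lambda shape: torch.bernoulli(torch.full(shape, 0.5)))
print(midpoint(torch.randn(1, 8)), vertex(torch.randn(1, 8)))
```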

Can Authorship Representation Learning Capture Stylistic Features?

  • paper_url: http://arxiv.org/abs/2308.11490
  • repo_url: https://github.com/llnl/luar
  • paper_authors: Andrew Wang, Cristina Aggazzotti, Rebecca Kotula, Rafael Rivera Soto, Marcus Bishop, Nicholas Andrews
  • for: Determining whether authorship representations learned purely from data capture writing style rather than content.
  • methods: Authorship representations are learned from large text corpora furnished with author labels, then systematically probed through a series of targeted experiments.
  • results: The representations are indeed sensitive to writing style, and can therefore be expected to be robust to certain kinds of data shift, such as topic drift over time.
    Abstract Automatically disentangling an author's style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
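
A sketch of the probing idea, under synthetic stand-ins for the learned embeddings and the style attribute: if a simple classifier trained on frozen authorship representations can predict a stylistic property well above chance, the representations encode style.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
style = rng.integers(0, 2, n)                  # e.g. heavy vs. light punctuation
embeddings = rng.standard_normal((n, d))
embeddings[:, 0] += 1.5 * style                # plant a style signal in one axis

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, style, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # well above chance -> style is encoded
```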

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

  • paper_url: http://arxiv.org/abs/2308.11483
  • repo_url: None
  • paper_authors: Pouya Pezeshkpour, Estevam Hruschka
  • for: The robustness of Large Language Models (LLMs) on multiple-choice questions, specifically their sensitivity to the order of answer options.
  • methods: Answer options are reordered across several benchmarks, including few-shot settings, and two calibration techniques are evaluated.
  • results: Reordering answer options causes performance gaps of approximately 13% to 75%; specific placements of the top-2 choices amplify or mitigate positional bias, and calibrating predictions yields up to 8 percentage points of improvement across models and benchmarks.
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs robustness on the task of multiple-choice questions -- commonly adopted task to study reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 75% in LLMs on different benchmarks, when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and specific options placements may favor certain prediction between those top choices depending on the question caused by positional bias. We also identify patterns in top-2 choices that amplify or mitigate the model's bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs' predictions, leading to up to 8 percentage points improvement across different models and benchmarks.
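
A minimal harness for measuring option-order sensitivity looks like the following; `score_options` is a hypothetical placeholder for an LLM call that returns one score per option, here given a deliberate positional bias for illustration.

```python
from itertools import permutations

def score_options(question, options):
    # Hypothetical stand-in for an LLM scorer, with a deliberate positional
    # bias: the first-listed option gets a small bonus.
    return [len(opt) + (0.5 if i == 0 else 0.0) for i, opt in enumerate(options)]

def accuracy_over_orderings(question, options, correct):
    perms = list(permutations(range(len(options))))
    hits = 0
    for perm in perms:
        shuffled = [options[i] for i in perm]
        scores = score_options(question, shuffled)
        best = max(range(len(scores)), key=lambda j: scores[j])
        hits += options[perm[best]] == correct
    return hits / len(perms)          # 1.0 would mean order-invariant predictions

q = "Which planet is the largest?"
opts = ["Mars", "Jupiter", "Venus", "Mercury"]
print("accuracy across all orderings:", accuracy_over_orderings(q, opts, "Jupiter"))
```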

Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2308.11480
  • repo_url: https://github.com/servicenow/broad-openood
  • paper_authors: Charles Guille-Escuret, Pierre-André Noël, Ioannis Mitliagkas, David Vazquez, Joao Monteiro
  • for: This paper aims to improve the reliability of deployed machine learning systems by developing methods to detect out-of-distribution (OOD) inputs, addressing the limitation of existing research that focuses only on samples from classes absent from the training set.
  • methods: The paper evaluates the performance of recent OOD detection methods on five distinct types of distribution shifts and publicly releases the benchmark as BROAD (Benchmarking Resilience Over Anomaly Diversity). The authors also propose an ensemble approach that leverages a generative model of existing detection scores to achieve superior performance in broad OOD detection.
  • results: Existing OOD detection methods excel at detecting unknown classes but are inconsistent in detecting other types of distribution shifts; the proposed ensemble approach demonstrates superior performance compared to existing methods, offering a more consistent and comprehensive solution for broad OOD detection.
    Abstract Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.
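
The ensemble idea can be sketched directly: treat each sample's vector of detector scores as data, fit a Gaussian mixture on in-distribution score vectors, and flag inputs whose score vector is unlikely. The three toy "detectors" below are synthetic stand-ins.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Rows: per-sample scores from 3 hypothetical detectors (e.g. MSP, energy, kNN).
in_dist_scores = rng.normal(loc=[0.9, -5.0, 0.2], scale=[0.05, 0.8, 0.05],
                            size=(1000, 3))
gmm = GaussianMixture(n_components=4, random_state=0).fit(in_dist_scores)

def broad_ood_score(score_vectors):
    """Higher = more anomalous under the mixture of in-distribution scores."""
    return -gmm.score_samples(np.atleast_2d(score_vectors))

shifted = rng.normal(loc=[0.6, -2.0, 0.6], scale=[0.15, 1.5, 0.15], size=(5, 3))
print("ID :", broad_ood_score(in_dist_scores[:5]).round(1))
print("OOD:", broad_ood_score(shifted).round(1))
```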

Revisiting column-generation-based matheuristic for learning classification trees

  • paper_url: http://arxiv.org/abs/2308.11477
  • repo_url: https://github.com/krooonal/col_gen_estimator
  • paper_authors: Krunal Kishor Patel, Guy Desaulniers, Andrea Lodi
  • for: Improving the scalability of learning optimal classification trees.
  • methods: A column-generation-based matheuristic is improved by modifying the subproblem to sharply reduce the number of subproblems in multiclass instances, using the implied data-dependent master constraints as cutting planes, and adding a separation model that generates data points whose constraints are violated by the linear programming relaxation.
  • results: The modified approach scales better to large datasets and achieves higher accuracy.
    Abstract Decision trees are highly interpretable models for solving classification problems in machine learning (ML). The standard ML algorithms for training decision trees are fast but generate suboptimal trees in terms of accuracy. Other discrete optimization models in the literature address the optimality problem but only work well on relatively small datasets. \cite{firat2020column} proposed a column-generation-based heuristic approach for learning decision trees. This approach improves scalability and can work with large datasets. In this paper, we describe improvements to this column generation approach. First, we modify the subproblem model to significantly reduce the number of subproblems in multiclass classification instances. Next, we show that the data-dependent constraints in the master problem are implied, and use them as cutting planes. Furthermore, we describe a separation model to generate data points for which the linear programming relaxation solution violates their corresponding constraints. We conclude by presenting computational results that show that these modifications result in better scalability.

Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.11464
  • repo_url: None
  • paper_authors: Yun-Hin Chan, Rui Zhou, Running Zhao, Zhihan Jiang, Edith C. -H. Ngai
  • for: Extending model-homogeneous federated learning (FL) methods to cope with system heterogeneity.
  • methods: A training scheme based on internal cross-layer gradients (a mixture of gradients from shallow and deep layers within a server model) augments deep-layer similarity without requiring additional communication between clients.
  • results: Experiments validate that InCo Aggregation improves performance in heterogeneous FL, and that model-homogeneous methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON can be extended to handle heterogeneous systems.
    Abstract Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) smoother gradient distributions indicate higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogeneous FL.

An Analysis of Initial Training Strategies for Exemplar-Free Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.11677
  • repo_url: None
  • paper_authors: Grégoire Petit, Michael Soumm, Eva Feillet, Adrian Popescu, Bertrand Delezoide, David Picard, Céline Hudelot
  • for: Studying how the initial training strategy affects exemplar-free class-incremental learning (CIL), where classification models are built from data streams.
  • methods: Two initial learning strategies are compared, using only the first batch of the target dataset versus also using weights pre-trained in a self-supervised way on an auxiliary dataset, together with a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance.
  • results: The initial training strategy is the dominant factor influencing average incremental accuracy, while the choice of CIL algorithm is more important in preventing forgetting; practical recommendations for choosing the initial training strategy are derived.
    Abstract Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum. The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning. We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting. Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.

A Survey on Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.11455
  • repo_url: https://github.com/microsoft/esvit
  • paper_authors: Tobias Uelwer, Jan Robine, Stefan Sylvius Wagner, Marc Höftmann, Eric Upschulte, Sebastian Konietzny, Maike Behrendt, Stefan Harmeling
  • for: A comprehensive review of recent methods for learning image representations without supervision.
  • methods: The methods are reviewed in a unified notation, their similarities and differences are pointed out, and a taxonomy relating them to each other is proposed.
  • results: A meta-study of the most recent experimental results reported in the literature shows that the quality of these representations is close to supervised learning, with no labeled images needed.
    Abstract Learning meaningful representations is at the heart of many tasks in the field of modern machine learning. Recently, a lot of methods were introduced that allow learning of image representations without supervision. These representations can then be used in downstream tasks like classification or object detection. The quality of these representations is close to supervised learning, while no labeled images are needed. This survey paper provides a comprehensive review of these methods in a unified notation, points out similarities and differences of these methods, and proposes a taxonomy which sets these methods in relation to each other. Furthermore, our survey summarizes the most-recent experimental results reported in the literature in form of a meta-study. Our survey is intended as a starting point for researchers and practitioners who want to dive into the field of representation learning.

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

  • paper_url: http://arxiv.org/abs/2308.11448
  • repo_url: None
  • paper_authors: Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler
  • for: Advancing self-supervised learning and transfer learning in computer vision, with an emphasis on human-like generalisation and recognition of unseen objects.
  • methods: The paper proposes a new evaluation protocol for zero-shot segmentation based on a prompting patch, as well as a simple self-supervised learning approach called MMC that combines Masked image modelling, Momentum based self-distillation, and global Contrast to enhance discriminative representations of SSP ViTs.
  • results: The paper reports top-tier results in zero-shot semantic segmentation across various datasets, demonstrating the effectiveness of the proposed MMC approach in object segmentation tasks.
    Abstract Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches, upon that, a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approaches combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
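
The prompting-patch protocol reduces to a few lines: compare the embedding of the prompted patch against all patch embeddings and threshold the similarity map. The random embeddings and the 0.5 threshold below are illustrative; in the paper, the embeddings come from a frozen self-supervised ViT.

```python
import torch
import torch.nn.functional as F

H = W = 14                                   # 14x14 patch grid (e.g. ViT-B/16 @ 224)
D = 768
patch_emb = torch.randn(H * W, D)            # stand-in for SSP ViT patch tokens
patch_emb = F.normalize(patch_emb, dim=-1)   # unit norm -> dot product = cosine

def prompt_segment(embeddings, prompt_rc, threshold=0.5):
    """Binary mask of patches similar to the patch at (row, col) prompt_rc."""
    prompt_idx = prompt_rc[0] * W + prompt_rc[1]
    sim = embeddings @ embeddings[prompt_idx]        # cosine similarity map
    return (sim >= threshold).reshape(H, W)

mask = prompt_segment(patch_emb, prompt_rc=(7, 7))
print("segmented patches:", int(mask.sum()))
```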

Exploration of Rashomon Set Assists Explanations for Medical Data

  • paper_url: http://arxiv.org/abs/2308.11446
  • repo_url: None
  • paper_authors: Katarzyna Kobylińska, Mateusz Krzyziński, Rafał Machowicz, Mariusz Adamek, Przemysław Biecek
  • for: This paper aims to address the issue of relying solely on performance metrics in machine learning modeling, particularly in medical and healthcare studies, where valuable insights beyond predictions are desired.
  • methods: The proposed approach utilizes the $\texttt{Rashomon_DETECT}$ algorithm, which compares prediction dependencies on variable values generated by XAI techniques, to identify the most different models within a Rashomon set. The Profile Disparity Index (PDI) is introduced to quantify differences in variable effects among models.
  • results: The approach is demonstrated on a foundational case study of predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients, as well as other medical data sets, showcasing its versatility and utility in various contexts. The results suggest that the proposed approach can provide comprehensive analysis of Rashomon set models, offering valuable insights beyond maximum performance metrics.
    Abstract The machine learning modeling process conventionally culminates in selecting a single model that maximizes a selected performance metric. However, this approach leads to abandoning a more profound analysis of slightly inferior models. Particularly in medical and healthcare studies, where the objective extends beyond predictions to valuable insight generation, relying solely on performance metrics can result in misleading or incomplete conclusions. This problem is particularly pertinent when dealing with a set of models with performance close to maximum one, known as $\textit{Rashomon set}$. Such a set can be numerous and may contain models describing the data in a different way, which calls for comprehensive analysis. This paper introduces a novel process to explore Rashomon set models, extending the conventional modeling approach. The cornerstone is the identification of the most different models within the Rashomon set, facilitated by the introduced $\texttt{Rashomon_DETECT}$ algorithm. This algorithm compares profiles illustrating prediction dependencies on variable values generated by eXplainable Artificial Intelligence (XAI) techniques. To quantify differences in variable effects among models, we introduce the Profile Disparity Index (PDI) based on measures from functional data analysis. To illustrate the effectiveness of our approach, we showcase its application in predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients - a foundational case study. Additionally, we benchmark our approach on other medical data sets, demonstrating its versatility and utility in various contexts.
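
A hedged sketch of the profile-comparison step: compute a partial-dependence profile per model for one variable and measure how differently the models respond. The L2 distance between profile derivatives below is a simple stand-in for the paper's Profile Disparity Index, which uses functional-data-analysis measures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
models = [
    RandomForestClassifier(random_state=0).fit(X, y),
    GradientBoostingClassifier(random_state=0).fit(X, y),
]

def profile(model, feature=0):
    """Partial-dependence curve of the model's prediction on one feature."""
    pd = partial_dependence(model, X, features=[feature], grid_resolution=30)
    return pd["average"][0]

p0, p1 = profile(models[0]), profile(models[1])
disparity = np.linalg.norm(np.diff(p0) - np.diff(p1))   # derivative-based distance
print("profile disparity (feature 0):", round(float(disparity), 4))
```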

Inferring gender from name: a large scale performance evaluation study

  • paper_url: http://arxiv.org/abs/2308.12381
  • repo_url: None
  • paper_authors: Kriste Krstovski, Yao Lu, Ye Xu
  • for: Evaluating the performance of existing name-to-gender inference methods and proposing two new hybrid approaches.
  • methods: A large-scale performance evaluation of existing approaches, using a variety of large annotated name datasets.
  • results: The performance of existing approaches is inconsistent; the two proposed hybrid approaches achieve better performance than any single existing approach.
    Abstract A person's gender is a crucial piece of information when performing research across a wide range of scientific disciplines, such as medicine, sociology, political science, and economics, to name a few. However, in increasing instances, especially given the proliferation of big data, gender information is not readily available. In such cases researchers need to infer gender from readily available information, primarily from persons' names. While inferring gender from name may raise some ethical questions, the lack of viable alternatives means that researchers have to resort to such approaches when the goal justifies the means - in the majority of such studies the goal is to examine patterns and determinants of gender disparities. The necessity of name-to-gender inference has generated an ever-growing domain of algorithmic approaches and software products. These approaches have been used throughout the world in academia, industry, governmental and non-governmental organizations. Nevertheless, the existing approaches have yet to be systematically evaluated and compared, making it challenging to determine the optimal approach for future research. In this work, we conducted a large-scale performance evaluation of existing approaches for name-to-gender inference. Analyses are performed using a variety of large annotated datasets of names. We further propose two new hybrid approaches that achieve better performance than any single existing approach.

A Study on the Impact of Non-confounding Covariates on the Inferential Performance of Methods based on the Potential Outcome Framework

  • paper_url: http://arxiv.org/abs/2308.11676
  • repo_url: None
  • paper_authors: Yonghe Zhao, Shuai Fu, Huiyan Sun
  • for: This paper focuses on the Potential Outcome Framework (POF) and its application in causal inference, specifically addressing the challenges of dealing with high-dimensional covariates.
  • methods: The paper presents a unified graphical framework for the Causal Inference Models based on the POF (CIMs-B-POF) and analyzes the influence of various types of non-confounding covariates on the inference performance.
  • results: The key findings are that the optimal scenario for eliminating confounding bias is for the covariates to exclusively encompass confounders, while the adjustment variables contribute to more accurate inferences in the task of inferring counterfactual outcomes. The theoretical conclusions are validated through extensive experiments conducted on synthetic datasets.
    Abstract The Potential Outcome Framework (POF) plays a prominent role in the field of causal inference. Most causal inference models based on the POF (CIMs-B-POF) are designed for eliminating confounding bias and default to an underlying assumption of Confounding Covariates. This assumption posits that the covariates consist solely of confounders. However, the assumption of Confounding Covariates is challenging to maintain in practice, particularly when dealing with high-dimensional covariates. While certain methods have been proposed to differentiate the distinct components of covariates prior to conducting causal inference, the consequences of treating non-confounding covariates as confounders remain unclear. This ambiguity poses a potential risk when applying the CIMs-B-POF in practical scenarios. In this paper, we present a unified graphical framework for the CIMs-B-POF, which greatly enhances the comprehension of these models' underlying principles. Using this graphical framework, we quantitatively analyze the extent to which the inference performance of CIMs-B-POF is influenced when incorporating various types of non-confounding covariates, such as instrumental variables, mediators, colliders, and adjustment variables. The key findings are: in the task of eliminating confounding bias, the optimal scenario is for the covariates to exclusively encompass confounders; in the subsequent task of inferring counterfactual outcomes, the adjustment variables contribute to more accurate inferences. Furthermore, extensive experiments conducted on synthetic datasets consistently validate these theoretical conclusions.
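
A toy simulation of the second finding (adjustment variables sharpen counterfactual inference): with the confounder controlled, additionally adjusting for an outcome-only covariate leaves the treatment-effect estimate unbiased but reduces its variance. All data-generating coefficients below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_run(n=500, true_effect=2.0):
    c = rng.standard_normal(n)                        # confounder
    a = rng.standard_normal(n)                        # adjustment variable (outcome-only)
    t = (0.8 * c + rng.standard_normal(n) > 0) * 1.0  # treatment depends on confounder
    y = true_effect * t + 1.5 * c + 2.0 * a + rng.standard_normal(n)

    def ate_ols(covs):
        """OLS coefficient of treatment, controlling for the given covariates."""
        X = np.column_stack([np.ones(n), t] + covs)
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    return ate_ols([c]), ate_ols([c, a])              # confounder only vs. + adjustment

est = np.array([one_run() for _ in range(500)])
print("bias      :", est.mean(axis=0) - 2.0)          # both approximately unbiased
print("std. error:", est.std(axis=0))                 # smaller with the adjustment var.
```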

TurboViT: Generating Fast Vision Transformers via Generative Architecture Search

  • paper_url: http://arxiv.org/abs/2308.11421
  • repo_url: None
  • paper_authors: Alexander Wong, Saad Abbasi, Saeejith Nair
  • for: Designing efficient vision transformer architectures for real-world applications with high-throughput, low-memory requirements.
  • methods: Generative Architecture Search (GAS) is used to create TurboViT, a highly efficient hierarchical vision transformer architecture generated around mask unit attention and Q-pooling design patterns.
  • results: On ImageNet-1K, TurboViT matches the accuracy of state-of-the-art efficient vision transformers while being >2.47$\times$ smaller in architectural complexity than FasterViT-0, and uses >3.4$\times$ fewer FLOPs with 0.9% higher accuracy than MobileViT2-2.0; it also achieves >3.21$\times$ lower latency and >3.18$\times$ higher throughput than FasterViT-0 in the low-latency scenario.
    Abstract Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures have made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural computational complexity (>2.47$\times$ smaller than FasterViT-0 while achieving same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.

Designing an attack-defense game: how to increase robustness of financial transaction models via a competition

  • paper_url: http://arxiv.org/abs/2308.11406
  • repo_url: None
  • paper_authors: Alexey Zaytsev, Alex Natekin, Evgeni Vorsin, Valerii Smirnov, Georgii Smirnov, Oleg Sidorshin, Alexander Senin, Alexander Dudin, Dmitry Berestnev
  • for: The paper investigates the current state and dynamics of adversarial attacks and defenses for neural network models that use sequential financial data as the input, with a focus on the practicality of the attacks and defenses in real-world scenarios.
  • methods: The paper uses a competition-based approach to examine the problems in modern financial transaction data, where participants compete directly against each other to test the robustness of their models. The authors also conduct a meta-study on the used approaches, including numerical experiments and ablation studies, to evaluate their effectiveness.
  • results: The developed attacks and defenses outperform existing alternatives from the literature while being practical in terms of execution, demonstrating the validity of the competition as a tool for uncovering vulnerabilities of machine learning models and mitigating them in various domains.
    Abstract Given the escalating risks of malicious attacks in the finance sector and the consequential severe damage, a thorough understanding of adversarial strategies and robust defense mechanisms for machine learning models is critical. The threat becomes even more severe with the increased adoption in banks more accurate, but potentially fragile neural networks. We aim to investigate the current state and dynamics of adversarial attacks and defenses for neural network models that use sequential financial data as the input. To achieve this goal, we have designed a competition that allows realistic and detailed investigation of problems in modern financial transaction data. The participants compete directly against each other, so possible attacks and defenses are examined in close-to-real-life conditions. Our main contributions are the analysis of the competition dynamics that answers the questions on how important it is to conceal a model from malicious users, how long does it take to break it, and what techniques one should use to make it more robust, and introduction additional way to attack models or increase their robustness. Our analysis continues with a meta-study on the used approaches with their power, numerical experiments, and accompanied ablations studies. We show that the developed attacks and defenses outperform existing alternatives from the literature while being practical in terms of execution, proving the validity of the competition as a tool for uncovering vulnerabilities of machine learning models and mitigating them in various domains.

Non-Redundant Combination of Hand-Crafted and Deep Learning Radiomics: Application to the Early Detection of Pancreatic Cancer

  • paper_url: http://arxiv.org/abs/2308.11389
  • repo_url: None
  • paper_authors: Rebeca Vétil, Clément Abi-Nader, Alexandre Bône, Marie-Pierre Vullierme, Marc-Michel Rohé, Pietro Gori, Isabelle Bloch
  • for: Learning Deep Learning Radiomics (DLR) features that are not redundant with Hand-Crafted Radiomics (HCR).
  • methods: DLR features are extracted with a VAE while their independence from HCR features is enforced by minimizing the mutual information between them.
  • results: Combining non-redundant DLR and HCR features improves the Area Under the Curve for four early markers of pancreatic cancer on a large independent test set, compared to baselines that do not address redundancy or rely solely on HCR features.
    Abstract We address the problem of learning Deep Learning Radiomics (DLR) that are not redundant with Hand-Crafted Radiomics (HCR). To do so, we extract DLR features using a VAE while enforcing their independence with HCR features by minimizing their mutual information. The resulting DLR features can be combined with hand-crafted ones and leveraged by a classifier to predict early markers of cancer. We illustrate our method on four early markers of pancreatic cancer and validate it on a large independent test set. Our results highlight the value of combining non-redundant DLR and HCR features, as evidenced by an improvement in the Area Under the Curve compared to baseline methods that do not address redundancy or solely rely on HCR features.
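
The shape of the training objective can be sketched as a VAE loss plus a penalty discouraging dependence between the DLR latents and the fixed HCR features. The batch cross-covariance penalty below is a simple stand-in for the paper's mutual-information minimization; the sizes and weights are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_cov_penalty(z, hcr):
    """Frobenius norm of the batch cross-covariance between z and hcr."""
    zc = z - z.mean(dim=0)
    hc = hcr - hcr.mean(dim=0)
    return (zc.T @ hc / z.shape[0]).pow(2).sum()

def vae_independence_loss(x, x_recon, mu, logvar, hcr, beta=1.0, gamma=10.0):
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
    return recon + beta * kl + gamma * cross_cov_penalty(z, hcr)

# Shape check with dummy tensors (batch of 16, 8 DLR latents, 12 HCR features).
x = torch.randn(16, 1, 32, 32)
mu, logvar = torch.randn(16, 8), torch.randn(16, 8)
print(vae_independence_loss(x, x + 0.1 * torch.randn_like(x), mu, logvar,
                            torch.randn(16, 12)))
```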

Targeted Data Augmentation for bias mitigation

  • paper_url: http://arxiv.org/abs/2308.11386
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Maria Ferlin, Michał Grochowski
  • for: This paper aims to address the issue of bias in AI systems by introducing a novel approach called Targeted Data Augmentation (TDA).
  • methods: The TDA method leverages classical data augmentation techniques to insert biases into the training data, rather than removing them. This approach is designed to improve the performance of AI models by mitigating biases.
  • results: The authors found that their TDA method significantly decreased bias measures in two diverse datasets (clinical skin lesions and male and female faces) while maintaining a negligible increase in the error rate. The results show that the method can effectively mitigate biases associated with the frame, ruler, and glasses.
    Abstract The development of fair and ethical AI systems requires careful consideration of bias mitigation, an area often overlooked or ignored. In this study, we introduce a novel and efficient approach for addressing biases called Targeted Data Augmentation (TDA), which leverages classical data augmentation techniques to tackle the pressing issue of bias in data and models. Unlike the laborious task of removing biases, our method proposes to insert biases instead, resulting in improved performance. To identify biases, we annotated two diverse datasets: a dataset of clinical skin lesions and a dataset of male and female faces. These bias annotations are published for the first time in this study, providing a valuable resource for future research. Through Counterfactual Bias Insertion, we discovered that biases associated with the frame, ruler, and glasses had a significant impact on models. By randomly introducing biases during training, we mitigated these biases and achieved a substantial decrease in bias measures, ranging from two-fold to more than 50-fold, while maintaining a negligible increase in the error rate.
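
The augmentation itself is simple to sketch: with some probability, insert a known bias artifact into a training image so the model learns to ignore it. The black frame (one of the biases the paper annotates), its width, and the probability are illustrative choices.

```python
import numpy as np

def insert_frame_bias(image, p=0.5, width=4, rng=np.random.default_rng()):
    """Randomly draw a black frame around an HxWxC uint8 image."""
    if rng.random() >= p:
        return image
    out = image.copy()
    out[:width], out[-width:] = 0, 0          # top and bottom bands
    out[:, :width], out[:, -width:] = 0, 0    # left and right bands
    return out

img = np.full((64, 64, 3), 200, dtype=np.uint8)
augmented = insert_frame_bias(img, p=1.0)
print("border mean:", augmented[:4].mean(),
      "center mean:", augmented[16:48, 16:48].mean())
# Counterfactual bias insertion at evaluation time applies the same operation
# to measure how much a prediction flips when the artifact appears.
```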

Interpretable Distribution-Invariant Fairness Measures for Continuous Scores

  • paper_url: http://arxiv.org/abs/2308.11375
  • repo_url: None
  • paper_authors: Ann-Kristin Becker, Oana Dumitrasc, Klaus Broelemann
  • for: The paper focuses on developing measures of algorithmic fairness for continuous scores, which can be applied to ranking tasks and are not heavily dependent on the distribution of scores.
  • methods: The authors propose a distributionally invariant version of fairness measures for continuous scores based on the Wasserstein distance, which is easily computable and can quantify significant biases that ROC-based fairness measures miss.
  • results: The proposed fairness measures outperform ROC-based fairness measures in quantifying and interpreting the strength of group disparities and comparing biases across different models, datasets, or time points, as demonstrated through experiments on the most commonly used fairness benchmark datasets.
    Abstract Measures of algorithmic fairness are usually discussed in the context of binary decisions. We extend the approach to continuous scores. So far, ROC-based measures have mainly been suggested for this purpose. Other existing methods depend heavily on the distribution of scores, are unsuitable for ranking tasks, or their effect sizes are not interpretable. Here, we propose a distributionally invariant version of fairness measures for continuous scores with a reasonable interpretation based on the Wasserstein distance. Our measures are easily computable and well suited for quantifying and interpreting the strength of group disparities as well as for comparing biases across different models, datasets, or time points. We derive a link between the different families of existing fairness measures for scores and show that the proposed distributionally invariant fairness measures outperform ROC-based fairness measures because they are more explicit and can quantify significant biases that ROC-based fairness measures miss. Finally, we demonstrate their effectiveness through experiments on the most commonly used fairness benchmark datasets.
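
A minimal sketch of a distribution-invariant disparity measure: compute the Wasserstein-1 distance between the two groups' score distributions after a rank transform, so the measure is unchanged by monotone rescaling of the scores. The synthetic scores and the rank transform are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.stats import rankdata, wasserstein_distance

rng = np.random.default_rng(0)
scores_a = rng.beta(2, 5, 1000)           # group A credit scores, say
scores_b = rng.beta(2.5, 4.5, 1000)       # group B, slightly shifted

def wasserstein_fairness(s_a, s_b):
    """W1 distance between per-group distributions of pooled rank scores."""
    pooled = np.concatenate([s_a, s_b])
    ranks = rankdata(pooled) / len(pooled)            # invariant to monotone rescaling
    return wasserstein_distance(ranks[: len(s_a)], ranks[len(s_a):])

print("group disparity:", round(wasserstein_fairness(scores_a, scores_b), 4))
# 0 means identical score distributions across groups; larger values indicate
# stronger group disparities, comparable across models or time points.
```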

How Much Temporal Long-Term Context is Needed for Action Segmentation?

  • paper_url: http://arxiv.org/abs/2308.11358
  • repo_url: https://github.com/ltcontext/ltcontext
  • paper_authors: Emad Bahrami, Gianpiero Francesca, Juergen Gall
  • for: Answering how much long-term temporal context is needed for optimal temporal action segmentation.
  • methods: A transformer-based model uses sparse attention to capture the context of the entire video.
  • results: Experiments on 50Salads, Breakfast, and Assembly101 show that modeling the full context of a video is necessary to obtain the best temporal action segmentation performance.
    Abstract Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
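
To make "sparse attention over the full video" concrete, the sketch below lets every frame attend to a local window plus a strided subset of all frames; the mask-based implementation and all sizes are illustrative, not the paper's exact attention pattern.

```python
import torch

T, D = 512, 64                                  # frames, feature dim
q = k = v = torch.randn(1, T, D)                # frame features

idx = torch.arange(T)
local = (idx[:, None] - idx[None, :]).abs() <= 16      # +/-16-frame window
strided = (idx[None, :] % 64) == 0                     # every 64th frame, globally
mask = local | strided                                 # (T, T) boolean pattern

attn = (q @ k.transpose(-2, -1)) / D**0.5
attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
out = attn @ v                                          # (1, T, D) contextualised frames
print("attended positions per frame (avg):", mask.float().sum(-1).mean().item())
# A real sparse implementation gathers only the allowed keys instead of
# masking a dense T x T matrix, which is where the savings come from.
```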

Machine learning assisted exploration for affine Deligne-Lusztig varieties

  • paper_url: http://arxiv.org/abs/2308.11355
  • repo_url: https://github.com/jinpf314/ml4adlv
  • paper_authors: Bin Dong, Xuhua He, Pengfei Jin, Felix Schremmer, Qingchao Yu
  • for: Exploring the geometry of affine Deligne-Lusztig varieties (ADLV) with a machine-learning-assisted framework, targeting the nonemptiness pattern, dimension, and enumeration of irreducible components.
  • methods: A recursive pipeline of data generation, model training, pattern analysis, and human examination, with careful selection of meaningful data subsets and feature sets.
  • results: The framework accelerates pure mathematical research: the virtual dimension formula is rediscovered, a full proof is given for a newly identified lower bound on dimension, and the source code for computing ADLV and the ML models is released for further exploration.
    Abstract This paper presents a novel, interdisciplinary study that leverages a Machine Learning (ML) assisted framework to explore the geometry of affine Deligne-Lusztig varieties (ADLV). The primary objective is to investigate the nonemptiness pattern, dimension and enumeration of irreducible components of ADLV. Our proposed framework demonstrates a recursive pipeline of data generation, model training, pattern analysis, and human examination, presenting an intricate interplay between ML and pure mathematical research. Notably, our data-generation process is nuanced, emphasizing the selection of meaningful subsets and appropriate feature sets. We demonstrate that this framework has a potential to accelerate pure mathematical research, leading to the discovery of new conjectures and promising research directions that could otherwise take significant time to uncover. We rediscover the virtual dimension formula and provide a full mathematical proof of a newly identified problem concerning a certain lower bound of dimension. Furthermore, we extend an open invitation to the readers by providing the source code for computing ADLV and the ML models, promoting further explorations. This paper concludes by sharing valuable experiences and highlighting lessons learned from this collaboration.

WEARS: Wearable Emotion AI with Real-time Sensor data

  • paper_url: http://arxiv.org/abs/2308.11673
  • repo_url: None
  • paper_authors: Dhruv Limbani, Daketi Yatin, Nitish Chaturvedi, Vaishnavi Moorthy, Pushpalatha M, Harichandana BSS, Sumit Kumar
  • for: Predicting user emotion in daily life from smartwatch sensors.
  • methods: Ground truth is collected in real time using a mix of English and regional-language videos to invoke emotions; heart rate, accelerometer, and gyroscope data are gathered, the problem is modeled as binary classification due to the limited dataset size, and multiple machine-learning models are compared alongside a feature ablation study.
  • results: A Multi-Layer Perceptron achieves a maximum accuracy of 93.75 percent for pleasant-unpleasant (high/low valence) mood classification.
    Abstract Emotion prediction is the field of study to understand human emotions. Existing methods focus on modalities like text, audio, facial expressions, etc., which could be private to the user. Emotion can be derived from the subject's psychological data as well. Various approaches that employ combinations of physiological sensors for emotion recognition have been proposed. Yet, not all sensors are simple to use and handy for individuals in their daily lives. Thus, we propose a system to predict user emotion using smartwatch sensors. We design a framework to collect ground truth in real-time utilizing a mix of English and regional language-based videos to invoke emotions in participants and collect the data. Further, we modeled the problem as binary classification due to the limited dataset size and experimented with multiple machine-learning models. We also did an ablation study to understand the impact of features including Heart Rate, Accelerometer, and Gyroscope sensor data on mood. From the experimental results, Multi-Layer Perceptron has shown a maximum accuracy of 93.75 percent for pleasant-unpleasant (high/low valence classification) moods.
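
A minimal sketch of the modeling step: a Multi-Layer Perceptron classifying pleasant vs. unpleasant mood from windowed smartwatch features. The synthetic heart-rate, accelerometer, and gyroscope features below stand in for the collected data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
mood = rng.integers(0, 2, n)                              # 1 = pleasant
hr = 70 + 8 * mood + rng.normal(0, 5, n)                  # heart rate (bpm)
acc = 1.0 + 0.3 * mood + rng.normal(0, 0.2, n)            # accel magnitude (g)
gyro = rng.normal(0.5, 0.2, n)                            # gyro magnitude (rad/s)
X = StandardScaler().fit_transform(np.column_stack([hr, acc, gyro]))

X_tr, X_te, y_tr, y_te = train_test_split(X, mood, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
clf.fit(X_tr, y_tr)
print("valence accuracy:", clf.score(X_te, y_te))
# A feature-ablation pass would retrain while dropping hr / acc / gyro in turn.
```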

Careful at Estimation and Bold at Exploration

  • paper_url: http://arxiv.org/abs/2308.11348
  • repo_url: None
  • paper_authors: Xing Chen, Yijun Liu, Zhaogeng Liu, Hechang Chen, Hengshuai Yao, Yi Chang
  • for: Exploration strategies for continuous action spaces in deterministic policy reinforcement learning (DPRL).
  • methods: A novel exploration strategy, separate from the policy gradient, built on the double-Q-function framework: a greedy Q softmax schema for the Q value update, combined with action exploration via a learned surrogate policy.
  • results: On the Mujoco benchmark, the method outperforms previous state-of-the-art approaches across various environments, particularly the most complex Humanoid environment.
    Abstract Exploration strategies in continuous action space are often heuristic due to the infinite actions, and these kinds of methods cannot derive a general conclusion. In prior work, it has been shown that policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues: aimless exploration and policy divergence, and the policy gradient for exploration is only sometimes helpful due to inaccurate estimation. Based on the double-Q function framework, we introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived by weighted summing of the conservative Q value over actions, where the weights are given by the corresponding greedy Q values. Greedy Q takes the maximum value of the two Q functions, and conservative Q takes the minimum value of the two different Q functions. For practicality, this theoretical basis is then extended to allow us to combine action exploration with the Q value update, under the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct such an exploration policy with a few sampled actions, and to meet the premise, we learn such a surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed by the conservative Q. We evaluate our method on the Mujoco benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
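
The update described above lends itself to a compact sketch. The snippet below computes the expected Q value as a conservative-Q average weighted by a softmax over greedy Q values; the softmax weighting, temperature, and action-sampling details are assumptions for illustration, since the abstract does not fix them.

```python
# A minimal sketch of the greedy Q softmax target, assuming softmax weights
# over greedy Q values of a handful of sampled actions (illustrative choices).
import torch

def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """q1, q2: [batch, n_sampled_actions] Q-values from the two critics."""
    greedy_q = torch.maximum(q1, q2)        # per-action optimistic estimate
    conservative_q = torch.minimum(q1, q2)  # per-action pessimistic estimate
    weights = torch.softmax(greedy_q / temperature, dim=-1)
    # Expected Q: conservative values weighted by the greedy softmax.
    return (weights * conservative_q).sum(dim=-1)

q1, q2 = torch.randn(32, 4), torch.randn(32, 4)  # 4 sampled actions per state
target = greedy_q_softmax_value(q1, q2)          # shape [32]
```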

ProAgent: Building Proactive Cooperative AI with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11339
  • repo_url: https://github.com/PKU-Alignment/ProAgent
  • paper_authors: Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, Feng Yin, Yitao Liang, Yaodong Yang
  • for: This paper develops ProAgent, a large language model based agent for proactive cooperative decision-making and behavior adaptation.
  • methods: ProAgent uses large language models (LLMs) to anticipate teammates' forthcoming decisions and formulate enhanced plans for itself, improving cooperative reasoning.
  • results: Experiments show ProAgent excels at cooperation, clearly outperforming five self-play and population-based training baselines when paired with AI agents, and improving on the current state of the art when paired with human proxy agents.
    Abstract Building AIs with adaptive behaviors in human-AI cooperation stands as a pivotal focus in AGI research. Current methods for developing cooperative agents predominantly rely on learning-based methods, where policy generalization heavily hinges on past interactions with specific teammates. These approaches constrain the agent's capacity to recalibrate its strategy when confronted with novel teammates. We propose \textbf{ProAgent}, a novel framework that harnesses large language models (LLMs) to fashion a \textit{pro}active \textit{agent} empowered with the ability to anticipate teammates' forthcoming decisions and formulate enhanced plans for itself. ProAgent excels at cooperative reasoning with the capacity to dynamically adapt its behavior to enhance collaborative efforts with teammates. Moreover, the ProAgent framework exhibits a high degree of modularity and interpretability, facilitating seamless integration to address a wide array of coordination scenarios. Experimental evaluations conducted within the framework of \textit{Overcook-AI} unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training in cooperation with AI agents. Further, when cooperating with human proxy models, its performance exhibits an average improvement exceeding 10\% compared to the current state-of-the-art, COLE. The advancement was consistently observed across diverse scenarios involving interactions with both AI agents of varying characteristics and human counterparts. These findings inspire future research for human-robot collaborations. For a hands-on demonstration, please visit \url{https://pku-proagent.github.io}.

Generalising sequence models for epigenome predictions with tissue and assay embeddings

  • paper_url: http://arxiv.org/abs/2308.11671
  • repo_url: None
  • paper_authors: Jacob Deasy, Ron Schwessinger, Ferran Gonzalez, Stephen Young, Kim Branson
  • for: Sequence modelling approaches for epigenetic profile prediction have recently expanded in sequence length, model size, and profile diversity.
  • methods: We integrate tissue and assay embeddings into the input space of a Contextualised Genomic Network (CGN), enhancing long-range sequence embeddings with contextual information.
  • results: Our approach shows strong correlation across a wide range of epigenetic profiles, provides the first insights into the effect of genetic variants on epigenetic sequence model training, and exceeds the state of the art in multiple settings.
    Abstract Sequence modelling approaches for epigenetic profile prediction have recently expanded in terms of sequence length, model size, and profile diversity. However, current models cannot infer on many experimentally feasible tissue and assay pairs due to poor usage of contextual information, limiting $\textit{in silico}$ understanding of regulatory genomics. We demonstrate that strong correlation can be achieved across a large range of experimental conditions by integrating tissue and assay embeddings into a Contextualised Genomic Network (CGN). In contrast to previous approaches, we enhance long-range sequence embeddings with contextual information in the input space, rather than expanding the output space. We exhibit the efficacy of our approach across a broad set of epigenetic profiles and provide the first insights into the effect of genetic variants on epigenetic sequence model training. Our general approach to context integration exceeds state of the art in multiple settings while employing a more rigorous validation procedure.
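
The core design choice, injecting context into the input space instead of widening the output space, can be sketched as follows; the fusion-by-addition, layer sizes, and single-profile head are illustrative assumptions rather than the paper's exact architecture.

```python
# A minimal sketch of input-space context integration in the spirit of CGN:
# learned tissue and assay embeddings are added to the sequence representation
# before the long-range encoder (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class ContextualisedEncoder(nn.Module):
    def __init__(self, n_tissues, n_assays, d_model=256):
        super().__init__()
        self.tissue_emb = nn.Embedding(n_tissues, d_model)
        self.assay_emb = nn.Embedding(n_assays, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # per-position epigenetic signal

    def forward(self, seq_emb, tissue_id, assay_id):
        """seq_emb: [batch, length, d_model] DNA sequence embeddings."""
        ctx = self.tissue_emb(tissue_id) + self.assay_emb(assay_id)
        h = self.encoder(seq_emb + ctx.unsqueeze(1))  # broadcast over positions
        return self.head(h).squeeze(-1)

model = ContextualisedEncoder(n_tissues=50, n_assays=30)
y = model(torch.randn(2, 196, 256), torch.tensor([3, 7]), torch.tensor([1, 4]))
```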

Protect Federated Learning Against Backdoor Attacks via Data-Free Trigger Generation

  • paper_url: http://arxiv.org/abs/2308.11333
  • repo_url: None
  • paper_authors: Yanxin Yang, Ming Hu, Yue Cao, Jun Xia, Yihao Huang, Yang Liu, Mingsong Chen
  • For: The paper aims to address the vulnerability of Federated Learning (FL) to poisoning attacks, specifically backdoor attacks, by proposing a novel data-free trigger-generation-based defense approach.
  • Methods: The proposed approach uses two characteristics of backdoor attacks to generate trigger images that can eliminate poisoned models and ensure the updated global model is benign. These methods include identifying the differences between the old and new global models, and evaluating the effect of the generated images.
  • Results: The approach is shown to defend against almost all existing types of backdoor attacks and outperform seven state-of-the-art defense methods with both IID and non-IID scenarios. Additionally, the approach can successfully defend against backdoor attacks even when 80% of the clients are malicious.
    Abstract As a distributed machine learning paradigm, Federated Learning (FL) enables large-scale clients to collaboratively train a model without sharing their raw data. However, due to the lack of data auditing for untrusted clients, FL is vulnerable to poisoning attacks, especially backdoor attacks. By using poisoned data for local training or directly changing the model parameters, attackers can easily inject backdoors into the model, which can trigger the model to make misclassification of targeted patterns in images. To address these issues, we propose a novel data-free trigger-generation-based defense approach based on the two characteristics of backdoor attacks: i) triggers are learned faster than normal knowledge, and ii) trigger patterns have a greater effect on image classification than normal class patterns. Our approach generates the images with newly learned knowledge by identifying the differences between the old and new global models, and filters trigger images by evaluating the effect of these generated images. By using these trigger images, our approach eliminates poisoned models to ensure the updated global model is benign. Comprehensive experiments demonstrate that our approach can defend against almost all the existing types of backdoor attacks and outperform all the seven state-of-the-art defense methods with both IID and non-IID scenarios. Especially, our approach can successfully defend against the backdoor attack even when 80\% of the clients are malicious.

Machine Learning-based Positioning using Multivariate Time Series Classification for Factory Environments

  • paper_url: http://arxiv.org/abs/2308.11670
  • repo_url: None
  • paper_authors: Nisal Hemadasa Manikku Badu, Marcus Venzke, Volker Turau, Yanqiu Huang
  • for: This paper is written for indoor positioning systems (IPS) in privacy-concerned factory environments, where external infrastructures are infeasible or expensive to deploy.
  • methods: The paper uses machine learning (ML) models, specifically a multivariate time series classification (MTSC) approach, to localize moving entities in these environments using motion and ambient sensors.
  • results: The paper presents a comparative analysis of different ML models for indoor positioning, including CNN-1D, MLP, and DT. The results show that all models can achieve accuracies above 80%, with DT having the lowest memory footprint and inference latency, making it a promising choice for real-world deployments.
    Abstract Indoor Positioning Systems (IPS) gained importance in many industrial applications. State-of-the-art solutions heavily rely on external infrastructures and are subject to potential privacy compromises, external information requirements, and assumptions that make them unfavorable for environments demanding privacy and prolonged functionality. In certain environments deploying supplementary infrastructures for indoor positioning could be infeasible and expensive. Recent developments in machine learning (ML) offer solutions to address these limitations relying only on the data from onboard sensors of IoT devices. However, it is unclear which model fits best considering the resource constraints of IoT devices. This paper presents a machine learning-based indoor positioning system, using motion and ambient sensors, to localize a moving entity in privacy concerned factory environments. The problem is formulated as a multivariate time series classification (MTSC) and a comparative analysis of different machine learning models is conducted in order to address it. We introduce a novel time series dataset emulating the assembly lines of a factory. This dataset is utilized to assess and compare the selected models in terms of accuracy, memory footprint and inference speed. The results illustrate that all evaluated models can achieve accuracies above 80%. CNN-1D shows the most balanced performance, followed by MLP. DT was found to have the lowest memory footprint and inference latency, indicating its potential for a deployment in real-world scenarios.
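
As a concrete reading of the MTSC formulation, the sketch below classifies fixed-length windows of multivariate sensor streams with a small CNN-1D, the most balanced model in the comparison; the channel count, window length, and layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal CNN-1D for multivariate time series classification: channels are
# motion/ambient sensor streams, classes are candidate locations.
import torch
import torch.nn as nn

class Cnn1dLocaliser(nn.Module):
    def __init__(self, n_channels=9, n_locations=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(64, n_locations)

    def forward(self, x):  # x: [batch, n_channels, window_length]
        return self.classifier(self.features(x).squeeze(-1))

logits = Cnn1dLocaliser()(torch.randn(8, 9, 128))  # 8 windows, 128 samples each
```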

Class Label-aware Graph Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11669
  • repo_url: https://github.com/jhkim611/clad
  • paper_authors: Junghoon Kim, Yeonjun In, Kanghoon Yoon, Junmo Lee, Chanyoung Park
  • for: This paper proposes a class label-aware graph anomaly detection method that improves the performance of unsupervised graph anomaly detection.
  • methods: The method uses a limited number of labeled nodes to enhance unsupervised graph anomaly detection.
  • results: Experiments on ten datasets show that CLAD significantly outperforms existing unsupervised graph anomaly detection methods, even without ground-truth class label information. The source code is available at \url{https://github.com/jhkim611/CLAD}.
    Abstract Unsupervised GAD methods assume the lack of anomaly labels, i.e., whether a node is anomalous or not. One common observation we made from previous unsupervised methods is that they not only assume the absence of such anomaly labels, but also the absence of class labels (the class a node belongs to, as used in a general node classification task). In this work, we study the utility of class labels for unsupervised GAD; in particular, how they enhance the detection of structural anomalies. To this end, we propose a Class Label-aware Graph Anomaly Detection framework (CLAD) that utilizes a limited amount of labeled nodes to enhance the performance of unsupervised GAD. Extensive experiments on ten datasets demonstrate the superior performance of CLAD in comparison to existing unsupervised GAD methods, even in the absence of ground-truth class label information. The source code for CLAD is available at \url{https://github.com/jhkim611/CLAD}.

Uncertainty Estimation of Transformers’ Predictions via Topological Analysis of the Attention Matrices

  • paper_url: http://arxiv.org/abs/2308.11295
  • repo_url: None
  • paper_authors: Elizaveta Kostenok, Daniil Cherniavskii, Alexey Zaytsev
  • for: This work addresses estimating the confidence of deep learning model predictions, an open problem in natural language processing.
  • methods: It uses Transformer-based neural networks and Topological Data Analysis to explore the relationships formed between internal representations through the attention mechanism.
  • results: The proposed attention-based uncertainty estimation surpasses classical methods in quality and opens a new application area for the attention mechanism, though it requires the selection of topological features.
    Abstract Determining the degree of confidence of a deep learning model in its prediction is an open problem in the field of natural language processing. Most of the classical methods for uncertainty estimation are quite weak for text classification models. We set the task of obtaining an uncertainty estimate for neural networks based on the Transformer architecture. A key feature of such models is the attention mechanism, which supports the information flow between the hidden representations of tokens in the neural network. We explore the formed relationships between internal representations using Topological Data Analysis methods and utilize them to predict the model's confidence. In this paper, we propose a method for uncertainty estimation based on the topological properties of the attention mechanism and compare it with classical methods. As a result, the proposed algorithm surpasses the existing methods in quality and opens up a new area of application of the attention mechanism, but requires the selection of topological features.

Network Momentum across Asset Classes

  • paper_url: http://arxiv.org/abs/2308.11294
  • repo_url: None
  • paper_authors: Xingyue Pu, Stephen Roberts, Xiaowen Dong, Stefan Zohren
  • for: This paper investigates network momentum, the spillover of momentum risk premium between assets, across multiple asset classes.
  • methods: It uses a linear and interpretable graph learning model with minimal assumptions to reveal momentum spillover networks across asset classes.
  • results: Momentum spillover networks are found across assets, and a network momentum strategy built on them achieves a Sharpe ratio of 1.5 and an annual return of 22% from 2000 to 2022.
    Abstract We investigate the concept of network momentum, a novel trading signal derived from momentum spillover across assets. Initially observed within the confines of pairwise economic and fundamental ties, such as the stock-bond connection of the same company and stocks linked through supply-demand chains, momentum spillover implies a propagation of momentum risk premium from one asset to another. The similarity of momentum risk premium, exemplified by co-movement patterns, has been spotted across multiple asset classes including commodities, equities, bonds and currencies. However, studying the network effect of momentum spillover across these classes has been challenging due to a lack of readily available common characteristics or economic ties beyond the company level. In this paper, we explore the interconnections of momentum features across a diverse range of 64 continuous future contracts spanning these four classes. We utilise a linear and interpretable graph learning model with minimal assumptions to reveal the intricacies of the momentum spillover network. By leveraging the learned networks, we construct a network momentum strategy that exhibits a Sharpe ratio of 1.5 and an annual return of 22%, after volatility scaling, from 2000 to 2022. This paper pioneers the examination of momentum spillover across multiple asset classes using only pricing data, presents a multi-asset investment strategy based on network momentum, and underscores the effectiveness of this strategy through robust empirical analysis.
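
The spillover idea reduces to propagating momentum features over a learned asset graph, as in the hedged sketch below; the row-normalised adjacency, random stand-in data, and sign rule are illustrative assumptions.

```python
# A minimal sketch of network momentum: each asset's signal blends the
# momentum of the assets it is connected to in a learned graph.
import numpy as np

def network_momentum(adjacency, momentum):
    """adjacency: [n_assets, n_assets] learned graph; momentum: [n_assets]."""
    row_sums = adjacency.sum(axis=1, keepdims=True)
    propagation = adjacency / np.clip(row_sums, 1e-8, None)
    return propagation @ momentum  # aggregate neighbours' momentum features

rng = np.random.default_rng(0)
A = np.abs(rng.random((64, 64)))               # stand-in for the learned graph
mom = rng.standard_normal(64)                  # stand-in momentum features
positions = np.sign(network_momentum(A, mom))  # long/short per contract
```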

Improving Knot Prediction in Wood Logs with Longitudinal Feature Propagation

  • paper_url: http://arxiv.org/abs/2308.11291
  • repo_url: https://github.com/jeremyfix/icvs2023
  • paper_authors: Salim Khazem, Jeremy Fix, Cédric Pradalier
  • for: This work predicts the location of internal defects (knots) in wood logs to make quality assessment more accurate and efficient.
  • methods: It uses convolutional recurrent neural networks for the binary segmentation task of inferring inner knots from the outer shape of the logs.
  • results: Inner defect locations can be predicted from outer shape features measurable with cheap laser profilers; the method is evaluated on fir and spruce species, with an ablation on the recurrence.
    Abstract The quality of a wood log in the wood industry depends heavily on the presence of both outer and inner defects, including inner knots that are a result of the growth of tree branches. Today, locating the inner knots require the use of expensive equipment such as X-ray scanners. In this paper, we address the task of predicting the location of inner defects from the outer shape of the logs. The dataset is built by extracting both the contours and the knots with X-ray measurements. We propose to solve this binary segmentation task by leveraging convolutional recurrent neural networks. Once the neural network is trained, inference can be performed from the outer shape measured with cheap devices such as laser profilers. We demonstrate the effectiveness of our approach on fir and spruce tree species and perform ablation on the recurrence to demonstrate its importance.

ShadowNet for Data-Centric Quantum System Learning

  • paper_url: http://arxiv.org/abs/2308.11290
  • repo_url: None
  • paper_authors: Yuxuan Du, Yibo Yang, Tongliang Liu, Zhouchen Lin, Bernard Ghanem, Dacheng Tao
  • for: This paper explores how statistical learning can address problems in large quantum systems, the setting of quantum system learning (QSL).
  • methods: It proposes a data-centric paradigm combining classical shadows with neural networks to tackle diverse QSL tasks; the model is trained offline and, at inference, predicts previously unseen systems from few state copies.
  • results: The paradigm is instantiated on quantum state tomography and direct fidelity estimation, with numerical analysis up to 60 qubits, showing the promise of data-centric AI for discovering and understanding large-scale quantum systems.
    Abstract Understanding the dynamics of large quantum systems is hindered by the curse of dimensionality. Statistical learning offers new possibilities in this regime by neural-network protocols and classical shadows, while both methods have limitations: the former is plagued by the predictive uncertainty and the latter lacks the generalization ability. Here we propose a data-centric learning paradigm combining the strength of these two approaches to facilitate diverse quantum system learning (QSL) tasks. Particularly, our paradigm utilizes classical shadows along with other easily obtainable information of quantum systems to create the training dataset, which is then learnt by neural networks to unveil the underlying mapping rule of the explored QSL problem. Capitalizing on the generalization power of neural networks, this paradigm can be trained offline and excel at predicting previously unseen systems at the inference stage, even with few state copies. Besides, it inherits the characteristic of classical shadows, enabling memory-efficient storage and faithful prediction. These features underscore the immense potential of the proposed data-centric approach in discovering novel and large-scale quantum systems. For concreteness, we present the instantiation of our paradigm in quantum state tomography and direct fidelity estimation tasks and conduct numerical analysis up to 60 qubits. Our work showcases the profound prospects of data-centric artificial intelligence to advance QSL in a faithful and generalizable manner.

Test Time Embedding Normalization for Popularity Bias Mitigation

  • paper_url: http://arxiv.org/abs/2308.11288
  • repo_url: https://github.com/ml-postech/tten
  • paper_authors: Dain Kim, Jinhyeok Park, Dongwoo Kim
  • for: This study aims to mitigate popularity bias in recommender systems, where popular items tend to dominate recommendation results.
  • methods: We propose 'Test Time Embedding Normalization', a simple yet effective strategy that uses the normalized item embedding during the inference stage to control the influence of embedding magnitude, which is highly correlated with item popularity.
  • results: Extensive experiments show that our method reduces popularity bias more effectively than previous mitigation approaches. We also find that the angular similarity between user and item embeddings distinguishes preferable from non-preferable items regardless of their popularity.
    Abstract Popularity bias is a widespread problem in the field of recommender systems, where popular items tend to dominate recommendation results. In this work, we propose 'Test Time Embedding Normalization' as a simple yet effective strategy for mitigating popularity bias, which surpasses the performance of the previous mitigation approaches by a significant margin. Our approach utilizes the normalized item embedding during the inference stage to control the influence of embedding magnitude, which is highly correlated with item popularity. Through extensive experiments, we show that our method combined with the sampled softmax loss effectively reduces popularity bias compare to previous approaches for bias mitigation. We further investigate the relationship between user and item embeddings and find that the angular similarity between embeddings distinguishes preferable and non-preferable items regardless of their popularity. The analysis explains the mechanism behind the success of our approach in eliminating the impact of popularity bias. Our code is available at https://github.com/ml-postech/TTEN.
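
The method itself is a one-line change at inference time. A minimal sketch, assuming a plain dot-product scorer over trained embeddings:

```python
# Test Time Embedding Normalization sketch: L2-normalise item embeddings at
# inference so scores depend on direction, not on popularity-correlated
# magnitude. The dot-product recommender is an illustrative assumption.
import torch
import torch.nn.functional as F

def recommend(user_emb, item_embs, k=10):
    """user_emb: [d]; item_embs: [n_items, d] trained item embeddings."""
    items_norm = F.normalize(item_embs, dim=-1)  # applied only at test time
    scores = items_norm @ user_emb               # magnitude influence removed
    return torch.topk(scores, k).indices

top_items = recommend(torch.randn(64), torch.randn(1000, 64))
```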

CNN based Cuneiform Sign Detection Learned from Annotated 3D Renderings and Mapped Photographs with Illumination Augmentation

  • paper_url: http://arxiv.org/abs/2308.11277
  • repo_url: None
  • paper_authors: Ernst Stötzner, Timo Homburg, Hubert Mara
  • for: Motivated by the Digital Ancient Near Eastern Studies community, this work develops digital tools for processing cuneiform script from ancient Mesopotamia.
  • methods: It combines annotated 3D renderings, photographs, and illumination augmentation, and provides a tool for transferring annotations between 3D renderings and photographs.
  • results: Sign detection works better on rendered 3D images than prior work on photographs, mixed datasets also give good results, and MSII renderings improve results on photographs.
    Abstract Motivated by the challenges of the Digital Ancient Near Eastern Studies (DANES) community, we develop digital tools for processing cuneiform script, a 3D script imprinted into clay tablets used for more than three millennia and at least eight major languages. It consists of thousands of characters that have changed over time and space. Photographs are the most common representations usable for machine learning, while ink drawings are prone to interpretation. Best suited are 3D datasets, which are becoming available. We created and used the HeiCuBeDa and MaiCuBeDa datasets, which consist of around 500 annotated tablets. For our novel OCR-like approach to mixed image data, we provide an additional mapping tool for transferring annotations between 3D renderings and photographs. Our sign localization uses a RepPoints detector to predict the locations of characters as bounding boxes. We use image data from GigaMesh's MSII (curvature, see https://gigamesh.eu) based rendering, Phong-shaded 3D models, and photographs as well as illumination augmentation. The results show that using rendered 3D images for sign detection performs better than other work on photographs. In addition, our approach gives reasonably good results for photographs only, while it is best used for mixed datasets. More importantly, the Phong renderings, and especially the MSII renderings, improve the results on photographs, which is the largest dataset on a global scale.

FoX: Formation-aware exploration in multi-agent reinforcement learning

  • paper_url: http://arxiv.org/abs/2308.11272
  • repo_url: None
  • paper_authors: Yonghyeon Jo, Sunwoo Lee, Junghyuk Yum, Seungyul Han
  • for: This work improves exploration in multi-agent reinforcement learning (MARL), where partial observability and an exploration space that grows exponentially with the number of agents pose challenges.
  • methods: It proposes a formation-aware exploration (FoX) framework that encourages partially observable agents to visit informative states in diverse formations, guiding them to be aware of their current formation from their own observations alone.
  • results: The proposed FoX framework significantly outperforms state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse StarCraft II multi-agent challenge (SMAC) tasks.
    Abstract Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration still remains a challenging problem in MARL due to the partial observability of the agents and the exploration space that can grow exponentially as the number of agents increases. Firstly, in order to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. Then, we propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit the states in diverse formations by guiding them to be well aware of their current formation solely based on their own observations. Numerical results show that the proposed FoX framework significantly outperforms the state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse Starcraft II multi-agent challenge (SMAC) tasks.

Quantum-Inspired Machine Learning: a Survey

  • paper_url: http://arxiv.org/abs/2308.11269
  • repo_url: None
  • paper_authors: Larry Huynh, Jin Hong, Ajmal Mian, Hajime Suzuki, Yanqiu Wu, Seyit Camtepe
  • for: This survey provides an integrated and comprehensive examination of quantum-inspired machine learning (QiML) for researchers, covering its research domains such as tensor network simulations and dequantized algorithms, along with recent advances, practical applications, and future research directions.
  • methods: It systematically reviews QiML's diverse research domains and establishes a concrete definition of QiML by analyzing prior interpretations of the term and their inherent ambiguities.
  • results: The survey maps recent progress and practical applications across QiML and anticipates future developments drawing from quantum mechanics, quantum computing, and classical machine learning.
    Abstract Quantum-inspired Machine Learning (QiML) is a burgeoning field, receiving global attention from researchers for its potential to leverage principles of quantum mechanics within classical computational frameworks. However, current review literature often presents a superficial exploration of QiML, focusing instead on the broader Quantum Machine Learning (QML) field. In response to this gap, this survey provides an integrated and comprehensive examination of QiML, exploring QiML's diverse research domains including tensor network simulations, dequantized algorithms, and others, showcasing recent advancements, practical applications, and illuminating potential future research avenues. Further, a concrete definition of QiML is established by analyzing various prior interpretations of the term and their inherent ambiguities. As QiML continues to evolve, we anticipate a wealth of future developments drawing from quantum mechanics, quantum computing, and classical machine learning, enriching the field further. This survey serves as a guide for researchers and practitioners alike, providing a holistic understanding of QiML's current landscape and future directions.

Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

  • paper_url: http://arxiv.org/abs/2308.11267
  • repo_url: None
  • paper_authors: David M. Bossens
  • for: This work studies robust constrained Markov decision processes (RCMDPs), a reinforcement learning framework that combines behavioral constraints with robustness to errors in the transition dynamics model.
  • methods: It derives worst-case dynamics from the Lagrangian rather than from the value or the constraint, yielding two algorithms: RCPG with Robust Lagrangian and Adversarial RCPG.
  • results: Experiments injecting perturbations into inventory management and safe navigation tasks show both algorithms are competitive with traditional RCPG variants and with non-robust, non-constrained ablations; Adversarial RCPG ranks among the top two performers on all tests.
    Abstract The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.

Efficient Last-iterate Convergence Algorithms in Solving Games

  • paper_url: http://arxiv.org/abs/2308.11256
  • repo_url: None
  • paper_authors: Linjian Meng, Zhenxing Ge, Wenbin Li, Bo An, Yang Gao
  • for: Learning Nash equilibria (NE) in two-player zero-sum games.
  • methods: The Reward Transformation (RT) framework converts NE learning into a series of strongly convex-concave optimization problems (SCCPs), which are solved with Regret Matching+ (RM+).
  • results: The proposed Reward Transformation RM+ (RTRM+) achieves last-iterate convergence and, together with its extensive-form extension RTCFR+, significantly outperforms existing last-iterate convergence algorithms and RM+ (CFR+) in experiments.
    Abstract No-regret algorithms are popular for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs). Many recent works consider last-iterate convergence no-regret algorithms. Among them, the two most famous algorithms are Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA has high per-iteration complexity. OMWU exhibits a lower per-iteration complexity but poorer empirical performance, and its convergence holds only when NE is unique. Recent works propose a Reward Transformation (RT) framework for MWU, which removes the uniqueness condition and achieves competitive performance with OMWU. Unfortunately, RT-based algorithms perform worse than OGDA under the same number of iterations, and their convergence guarantee is based on the continuous-time feedback assumption, which does not hold in most scenarios. To address these issues, we provide a closer analysis of the RT framework, which holds for both continuous and discrete-time feedback. We demonstrate that the essence of the RT framework is to transform the problem of learning NE in the original game into a series of strongly convex-concave optimization problems (SCCPs). We show that the bottleneck of RT-based algorithms is the speed of solving SCCPs. To improve their empirical performance, we design a novel transformation method so that the SCCPs can be solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys last-iterate convergence under the discrete-time feedback setting. Using the counterfactual regret decomposition framework, we propose Reward Transformation CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our algorithms significantly outperform existing last-iterate convergence algorithms and RM+ (CFR+).
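
Since RM+ is the workhorse used to solve each transformed subproblem, a minimal sketch of its update may help; the random payoff vector below is a stand-in for actual game feedback.

```python
# Regret Matching+ (RM+): cumulative regrets are clipped at zero and the
# strategy is their normalisation.
import numpy as np

def rm_plus_step(regrets, utilities, strategy):
    """regrets: clipped cumulative regrets; utilities: per-action payoffs."""
    expected = utilities @ strategy
    regrets = np.maximum(regrets + (utilities - expected), 0.0)  # the "+" in RM+
    total = regrets.sum()
    if total > 0:
        strategy = regrets / total
    else:
        strategy = np.full_like(strategy, 1.0 / len(strategy))  # uniform fallback
    return regrets, strategy

n_actions = 3
regrets = np.zeros(n_actions)
strategy = np.full(n_actions, 1.0 / n_actions)
for _ in range(100):
    utilities = np.random.randn(n_actions)  # stand-in for game feedback
    regrets, strategy = rm_plus_step(regrets, utilities, strategy)
```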

A survey on bias in machine learning research

  • paper_url: http://arxiv.org/abs/2308.11254
  • repo_url: https://github.com/Aastha2104/Parkinson-Disease-Prediction
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Michał Grochowski
  • for: This article examines bias in machine learning, focusing on the roots and causes of bias rather than fairness alone.
  • methods: It provides a taxonomy of more than forty potential sources of bias and errors across the machine learning pipeline, with concrete examples for each.
  • results: Understanding the sources and consequences of bias in machine learning enables better methods for detecting and mitigating it, leading to fairer, more transparent, and more accurate models.
    Abstract Current research on bias in machine learning often focuses on fairness, while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap with past literature on bias in research by providing a taxonomy for potential sources of bias and errors in data and models. The paper focuses on bias in machine learning pipelines. The survey analyses over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for detecting and mitigating it, leading to fairer, more transparent, and more accurate ML models.

Multi-Source Domain Adaptation for Cross-Domain Fault Diagnosis of Chemical Processes

  • paper_url: http://arxiv.org/abs/2308.11247
  • repo_url: None
  • paper_authors: Eduardo Fernandes Montesuma, Michela Mulas, Fred Ngolè Mboula, Francesco Corona, Antoine Souloumiac
  • For: This paper focuses on Cross-Domain Fault Diagnosis (CDFD) in process supervision, using machine learning to predict fault types from sensor readings.
  • Methods: The authors compare single and multi-source unsupervised domain adaptation (SSDA and MSDA respectively) algorithms for CDFD, using the Tennessee-Eastmann Process as a benchmark.
  • Results: The MSDA baseline improves classification accuracy by 23% on average compared to the SSDA baseline, and using multiple sources during training improves accuracy by 8.4% on average even without adaptation.
    Abstract Fault diagnosis is an essential component in process supervision. Indeed, it determines which kind of fault has occurred, given that it has been previously detected, allowing for appropriate intervention. Automatic fault diagnosis systems use machine learning for predicting the fault type from sensor readings. Nonetheless, these models are sensitive to changes in the data distributions, which may be caused by changes in the monitored process, such as changes in the mode of operation. This scenario is known as Cross-Domain Fault Diagnosis (CDFD). We provide an extensive comparison of single and multi-source unsupervised domain adaptation (SSDA and MSDA respectively) algorithms for CDFD. We study these methods in the context of the Tennessee-Eastmann Process, a widely used benchmark in the chemical industry. We show that using multiple domains during training has a positive effect, even when no adaptation is employed. As such, the MSDA baseline improves over the SSDA baseline classification accuracy by 23% on average. In addition, under the multiple-sources scenario, we improve classification accuracy of the no adaptation setting by 8.4% on average.

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification

  • paper_url: http://arxiv.org/abs/2308.11241
  • repo_url: https://github.com/harunorikawano/speaker-identification-with-tgp
  • paper_authors: Harunori Kawano, Sota Shimizu
  • for: This paper builds an effective end-to-end speaker identification model using a Transformer-based contextual model and self-supervised learning.
  • methods: It explores the relationship between parameters and performance to discern the structure of an effective model, and proposes Temporal Gate Pooling, a pooling method with powerful learning ability; Conformer is used as the encoder and BEST-RQ for pre-training.
  • results: Evaluated on VoxCeleb1 speaker identification, the method achieves 85.9% accuracy with 28.5M parameters, comparable to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
    Abstract Wav2vec2 has achieved success in applying Transformer architecture and self-supervised learning to speech recognition. Recently, these have come to be used not only for speech recognition but also for speech processing as a whole. This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model. We explored the relationship between the parameters and the performance in order to discern the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied Conformer as the encoder and BEST-RQ for pre-training and conducted an evaluation utilizing the speaker identification of VoxCeleb1. The proposed method has achieved an accuracy of 85.9% with 28.5M parameters, demonstrating comparable precision to wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
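
The abstract names Temporal Gate Pooling without spelling out its formula, so the sketch below is only one plausible reading: a learned per-frame gate that weights encoder frames before averaging them into an utterance-level speaker embedding.

```python
# Hypothetical gated temporal pooling; not necessarily the paper's exact
# definition of Temporal Gate Pooling.
import torch
import torch.nn as nn

class TemporalGatePooling(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)  # scalar gate per frame

    def forward(self, frames):  # frames: [batch, time, d_model]
        weights = torch.sigmoid(self.gate(frames))          # [batch, time, 1]
        pooled = (weights * frames).sum(dim=1) / weights.sum(dim=1).clamp_min(1e-8)
        return pooled  # [batch, d_model] utterance-level embedding

emb = TemporalGatePooling(256)(torch.randn(4, 200, 256))
```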

Minwise-Independent Permutations with Insertion and Deletion of Features

  • paper_url: http://arxiv.org/abs/2308.11240
  • repo_url: None
  • paper_authors: Rameshwar Pratap, Raghav Kulkarni
  • for: This work studies the $\mathrm{minHash}$ algorithm for high-dimensional binary data when features are dynamically inserted into and deleted from the dataset, analyzing performance and proposing algorithms that adapt to such updates.
  • methods: It introduces algorithms that make $\mathrm{minHash}$ sketches adaptable to dynamic feature insertions and deletions, avoiding the expensive naive approach of recomputing the sketch with fresh random permutations.
  • results: The proposed approach gives a significant speed-up in running time while offering performance comparable to recomputing $\mathrm{minHash}$ from scratch.
    Abstract In their seminal work, Broder \textit{et al.}~\citep{BroderCFM98} introduce the $\mathrm{minHash}$ algorithm that computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, $\mathrm{minHash}$ has been commonly used by practitioners in various big data applications. Further, the data is dynamic in many real-life scenarios, and its feature sets evolve over time. We consider the case when features are dynamically inserted and deleted in the dataset. We note that a naive solution to this problem is to repeatedly recompute $\mathrm{minHash}$ with respect to the updated dimension. However, this is an expensive task as it requires generating fresh random permutations. To the best of our knowledge, no systematic study of $\mathrm{minHash}$ is recorded in the context of dynamic insertion and deletion of features. In this work, we initiate this study and suggest algorithms that make the $\mathrm{minHash}$ sketches adaptable to the dynamic insertion and deletion of features. We show a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets. Empirically we observe a significant speed-up in the running time while simultaneously offering comparable performance with respect to running $\mathrm{minHash}$ from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
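
The asymmetry motivating the paper is easy to see in code: inserting a feature only shrinks each coordinate of the sketch, while deleting the feature that attains a minimum naively forces recomputation. The universal hash family below stands in for random permutations and is an illustrative assumption.

```python
# minHash sketch with cheap insertion; deletion is the hard case the paper's
# algorithms address.
import random

K, PRIME = 128, (1 << 61) - 1
random.seed(0)
hashes = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(K)]

def h(i, x):
    a, b = hashes[i]
    return (a * x + b) % PRIME

def signature(features):
    return [min(h(i, x) for x in features) for i in range(K)]

def insert_feature(sig, x):
    return [min(sig[i], h(i, x)) for i in range(K)]  # O(K): min only decreases

sig = signature({3, 17, 42})
sig = insert_feature(sig, 99)
# If h(i, deleted_feature) == sig[i], the naive fix recomputes coordinate i
# over all remaining features; avoiding that cost is what the paper proposes.
```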

Federated Learning on Patient Data for Privacy-Protecting Polycystic Ovary Syndrome Treatment

  • paper_url: http://arxiv.org/abs/2308.11220
  • repo_url: https://github.com/toriqiu/fl-pcos
  • paper_authors: Lucia Morris, Tori Qiu, Nikhil Raghuraman
  • for: This study applies Federated Learning (FL) to predict the optimal drug for patients with polycystic ovary syndrome (PCOS).
  • methods: A variety of FL approaches are validated on a synthetic PCOS patient dataset.
  • results: FL can leverage massive quantities of diverse data to identify the most effective treatment option while providing PCOS patients with privacy guarantees.
    Abstract The field of women's endocrinology has trailed behind data-driven medical solutions, largely due to concerns over the privacy of patient data. Valuable datapoints about hormone levels or menstrual cycling could expose patients who suffer from comorbidities or terminate a pregnancy, violating their privacy. We explore the application of Federated Learning (FL) to predict the optimal drug for patients with polycystic ovary syndrome (PCOS). PCOS is a serious hormonal disorder impacting millions of women worldwide, yet it's poorly understood and its research is stunted by a lack of patient data. We demonstrate that a variety of FL approaches succeed on a synthetic PCOS patient dataset. Our proposed FL models are a tool to access massive quantities of diverse data and identify the most effective treatment option while providing PCOS patients with privacy guarantees.

Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models

  • paper_url: http://arxiv.org/abs/2308.11217
  • repo_url: None
  • paper_authors: Zengxiang Li, Zhaoxiang Hou, Hui Liu, Ying Wang, Tongzhi Li, Longfei Xie, Chao Shi, Chengyi Yang, Weishan Zhang, Zelei Liu, Liang Xu
  • for: This paper proposes a multimodal federated learning framework that enables multiple enterprises to use private domain data to collaboratively train large models for vertical domains, delivering intelligent services across scenarios.
  • methods: It discusses the strategic transformation of federated learning in the big model era, covering intelligence foundations and objectives, and the new challenges of heterogeneous data, model aggregation, performance-cost trade-offs, data privacy, and incentive mechanisms.
  • results: Experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal federated learning, jointly creating a smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management.
    Abstract Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that enables multiple enterprises to utilize private domain data to collaboratively train large models for vertical domains, achieving intelligent services across scenarios. The authors discuss in depth the strategic transformation of federated learning in terms of intelligence foundation and objectives in the era of big model, as well as the new challenges faced in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanism. The paper elaborates a case study of leading enterprises contributing multimodal data and expert knowledge to city safety operation management, including distributed deployment and efficient coordination of the federated learning platform, technical innovations on data quality improvement based on large model capabilities and efficient joint fine-tuning approaches. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal model federated learning, thereby jointly creating a smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The established federated learning cooperation ecosystem is expected to further aggregate industry, academia, and research resources, realize large models in multiple vertical domains, and promote the large-scale industrial application of artificial intelligence and cutting-edge research on multimodal federated learning.

Hamiltonian GAN

  • paper_url: http://arxiv.org/abs/2308.11216
  • repo_url: https://github.com/koritsky/hamiltonian_learning
  • paper_authors: Christine Allen-Blanchette
  • for: This work studies physics-inspired video generation through the Hamiltonian formalism, learning a representation of the configuration space from data.
  • methods: It uses a GAN-based video generation pipeline with a learned configuration space map and a Hamiltonian neural network motion model.
  • results: A physics-inspired cyclic-coordinate loss encourages a minimal representation of the configuration space and improves interpretability; efficacy is demonstrated on the Hamiltonian Dynamics Suite Toy Physics dataset.
    Abstract A growing body of work leverages the Hamiltonian formalism as an inductive bias for physically plausible neural network based video generation. The structure of the Hamiltonian ensures conservation of a learned quantity (e.g., energy) and imposes a phase-space interpretation on the low-dimensional manifold underlying the input video. While this interpretation has the potential to facilitate the integration of learned representations in downstream tasks, existing methods are limited in their applicability as they require a structural prior for the configuration space at design time. In this work, we present a GAN-based video generation pipeline with a learned configuration space map and Hamiltonian neural network motion model, to learn a representation of the configuration space from data. We train our model with a physics-inspired cyclic-coordinate loss function which encourages a minimal representation of the configuration space and improves interpretability. We demonstrate the efficacy and advantages of our approach on the Hamiltonian Dynamics Suite Toy Physics dataset.
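
The motion model's role can be sketched as a standard Hamiltonian neural network: a scalar H(q, p) is learned and trajectories follow Hamilton's equations, which conserve H. The explicit Euler integrator and layer sizes below are illustrative simplifications, not the paper's exact setup.

```python
# Hamiltonian neural network rollout sketch: dq/dt = dH/dp, dp/dt = -dH/dq.
import torch
import torch.nn as nn

class HamiltonianNet(nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(2 * dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def time_derivatives(self, q, p):
        q = q.detach().requires_grad_(True)  # detach: inference-time rollout
        p = p.detach().requires_grad_(True)
        H = self.h(torch.cat([q, p], dim=-1)).sum()
        dHdq, dHdp = torch.autograd.grad(H, (q, p))
        return dHdp, -dHdq  # Hamilton's equations give (dq/dt, dp/dt)

def rollout(model, q, p, dt=0.05, steps=10):
    states = [(q, p)]
    for _ in range(steps):
        dq, dp = model.time_derivatives(q, p)
        q, p = q + dt * dq, p + dt * dp  # explicit Euler for brevity
        states.append((q, p))
    return states

traj = rollout(HamiltonianNet(), torch.randn(1, 2), torch.randn(1, 2))
```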

A Simple Framework for Multi-mode Spatial-Temporal Data Modeling

  • paper_url: http://arxiv.org/abs/2308.11204
  • repo_url: https://github.com/lzhmarkk/simmst
  • paper_authors: Zihang Liu, Le Yu, Tongyu Zhu, Leiei Sun
  • for: This paper proposes a simple framework for multi-mode spatial-temporal data modeling that captures relationships across spatial modes as well as temporal dependencies.
  • methods: A general cross-mode spatial relationships learning module adaptively establishes connections between multiple modes and propagates information along the learned connections; multi-layer perceptrons capture temporal dependencies and channel correlations.
  • results: Experiments on three real-world datasets show the model consistently outperforms baselines with lower space and time complexity, and the generalizability of the cross-mode spatial relationships learning module is validated.
    Abstract Spatial-temporal data modeling aims to mine the underlying spatial relationships and temporal dependencies of objects in a system. However, most existing methods focus on the modeling of spatial-temporal data in a single mode, lacking the understanding of multiple modes. Though very few methods have been presented to learn the multi-mode relationships recently, they are built on complicated components with higher model complexities. In this paper, we propose a simple framework for multi-mode spatial-temporal data modeling to bring both effectiveness and efficiency together. Specifically, we design a general cross-mode spatial relationships learning component to adaptively establish connections between multiple modes and propagate information along the learned connections. Moreover, we employ multi-layer perceptrons to capture the temporal dependencies and channel correlations, which are conceptually and technically succinct. Experiments on three real-world datasets show that our model can consistently outperform the baselines with lower space and time complexity, opening up a promising direction for modeling spatial-temporal data. The generalizability of the cross-mode spatial relationships learning module is also validated.

SegRNN: Segment Recurrent Neural Network for Long-Term Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2308.11200
  • repo_url: None
  • paper_authors: Shengsheng Lin, Weiwei Lin, Wentai Wu, Feiyu Zhao, Ruichao Mo, Haotong Zhang
  • for: This paper targets the Long-term Time Series Forecasting (LTSF) domain, addressing the challenges RNN-based methods face with excessively long look-back windows and forecast horizons.
  • methods: It proposes two novel strategies to reduce the number of recurrent iterations in RNNs for LTSF tasks: Segment-wise Iterations and Parallel Multi-step Forecasting (PMF).
  • results: SegRNN achieves notable improvements on LTSF tasks, outperforming SOTA Transformer-based models while reducing runtime and memory usage by more than 78%.
    Abstract RNN-based methods have faced challenges in the Long-term Time Series Forecasting (LTSF) domain when dealing with excessively long look-back windows and forecast horizons. Consequently, the dominance in this domain has shifted towards Transformer, MLP, and CNN approaches. The substantial number of recurrent iterations is the fundamental reason behind the limitations of RNNs in LTSF. To address these issues, we propose two novel strategies to reduce the number of iterations in RNNs for LTSF tasks: Segment-wise Iterations and Parallel Multi-step Forecasting (PMF). RNNs that combine these strategies, namely SegRNN, significantly reduce the required recurrent iterations for LTSF, resulting in notable improvements in forecast accuracy and inference speed. Extensive experiments demonstrate that SegRNN not only outperforms SOTA Transformer-based models but also reduces runtime and memory usage by more than 78%. These achievements provide strong evidence that RNNs continue to excel in LTSF tasks and encourage further exploration of this domain with more RNN-based approaches. The source code is coming soon.
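
Both named strategies fit in a few lines, as sketched below: the RNN steps over segment embeddings rather than individual points (Segment-wise Iterations), and every horizon step is decoded at once from the final hidden state (a simplified stand-in for PMF). Layer choices and sizes are illustrative assumptions.

```python
# SegRNN-style sketch for one univariate channel.
import torch
import torch.nn as nn

class SegRnnSketch(nn.Module):
    def __init__(self, seg_len=48, horizon=720, d_model=128):
        super().__init__()
        self.seg_len = seg_len
        self.seg_proj = nn.Linear(seg_len, d_model)  # one embedding per segment
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.Linear(d_model, horizon)   # all steps in parallel

    def forward(self, x):  # x: [batch, look_back]
        segs = x.unfold(1, self.seg_len, self.seg_len)  # [batch, n_segs, seg_len]
        _, h = self.rnn(self.seg_proj(segs))            # n_segs iterations only
        return self.decoder(h[-1])                      # [batch, horizon]

y = SegRnnSketch()(torch.randn(16, 720))  # 720-step look-back, 720-step forecast
```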

ConcatPlexer: Additional Dim1 Batching for Faster ViTs

  • paper_url: http://arxiv.org/abs/2308.11199
  • repo_url: None
  • paper_authors: Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak
  • for: This work improves the efficiency of visual recognition with Vision Transformers while preserving accuracy.
  • methods: It employs additional dim1 batching (concatenation of multiple inputs), adapting Data Multiplexing (DataMUX) to vision as an Image Multiplexer and adding novel components to overcome its weaknesses.
  • results: Trained on ImageNet1K and CIFAR100, ConcatPlexer uses 23.5% fewer GFLOPs than ViT-B/16 while reaching 69.5% and 83.4% validation accuracy, respectively.
    Abstract Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) that greatly improves the throughput with little compromise in the accuracy. We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. The ConcatPlexer was trained on ImageNet1K and CIFAR100 dataset and it achieved 23.5% less GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
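A rough sketch of dim1 batching under stated assumptions (the token counts, model width, and mean-pool demultiplexing step are all hypothetical, not the paper's components): patch tokens from k images share a single transformer forward pass by concatenation along the sequence dimension.

```python
import torch
import torch.nn as nn

class ConcatBatchSketch(nn.Module):
    """Illustrative dim-1 batching: patch tokens from k images are
    concatenated along the sequence dimension so one transformer
    pass serves k inputs."""
    def __init__(self, k=4, n_tokens=196, d=384, n_classes=100):
        super().__init__()
        self.k, self.n = k, n_tokens
        enc = nn.TransformerEncoderLayer(d, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, tokens):            # tokens: (batch, k, n_tokens, d)
        b = tokens.size(0)
        x = tokens.reshape(b, self.k * self.n, -1)   # concat along dim 1
        x = self.encoder(x)
        # demultiplex: mean-pool each image's own token span
        x = x.view(b, self.k, self.n, -1).mean(dim=2)
        return self.head(x)               # (batch, k, n_classes)

tok = torch.randn(2, 4, 196, 384)         # 4 images' patch tokens per sample
print(ConcatBatchSketch()(tok).shape)     # torch.Size([2, 4, 100])
```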

Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Power Analysis and Sample Size Estimation

  • paper_url: http://arxiv.org/abs/2308.11197
  • repo_url: None
  • paper_authors: Hamzeh Ghasemzadeh, Robert E. Hillman, Daryush D. Mehta
  • for: Provide quantitative evidence that incentivizes researchers to use the more robust method of nested cross-validation in machine-learning-based analysis, and present methods and MATLAB codes for power analysis during the design of a study.
  • methods: Monte Carlo simulations compare four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) in terms of statistical power and statistical confidence.
  • results: Nested 10-fold cross-validation yields the highest statistical confidence and the highest statistical power while providing an unbiased estimate of the accuracy; the required sample size with a single holdout is 50% higher than with nested cross-validation, and confidence in the nested model is as much as four times higher than in the single-holdout model.
    Abstract This study's first purpose is to provide quantitative evidence that would incentivize researchers to instead use the more robust method of nested cross-validation. The second purpose is to present methods and MATLAB codes for doing power analysis for ML-based analysis during the design of a study. Monte Carlo simulations were used to quantify the interactions between the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, and the dimensionality of the model. Four different cross-validations (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and statistical confidence of the ML models. Distributions of the null and alternative hypotheses were used to determine the minimum required sample size for obtaining a statistically significant outcome (α = 0.05, 1 − β = 0.8). Statistical confidence of the model was defined as the probability of correct features being selected and hence being included in the final model. Our analysis showed that the model generated based on the single holdout method had very low statistical power and statistical confidence and that it significantly overestimated the accuracy. Conversely, the nested 10-fold cross-validation resulted in the highest statistical confidence and the highest statistical power, while providing an unbiased estimate of the accuracy. The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used. Confidence in the model based on nested cross-validation was as much as four times higher than the confidence in the single holdout-based model. A computational model, MATLAB codes, and lookup tables are provided to assist researchers with estimating the sample size during the design of their future studies.
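In Python (the paper ships MATLAB code), a minimal nested 10-fold cross-validation looks like the following; the data, model, and hyperparameter grid are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the paper's analysis uses Monte Carlo draws.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner = KFold(n_splits=10, shuffle=True, random_state=1)   # model selection
outer = KFold(n_splits=10, shuffle=True, random_state=2)   # error estimation
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)

# Outer folds never touch hyperparameter tuning, so the accuracy
# estimate is approximately unbiased, unlike a single holdout.
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested 10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```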

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

  • paper_url: http://arxiv.org/abs/2308.11192
  • repo_url: None
  • paper_authors: Srinjoy Das, Lawrence Rauchwerger
  • for: Improving training and inference performance of machine learning (ML) and deep learning (DL) models, particularly for batch size 1 on CPUs and power-constrained edge devices.
  • methods: A critical-path-based linear clustering approach that exploits inherent parallel paths in ML dataflow graphs, with graph optimization via cloning and pruning via constant propagation and dead-code elimination.
  • results: Achieves up to 1.9x speedup over serial execution on several ML graphs and outperforms some current mechanisms in both compile and run times, while remaining lightweight and fast enough for power- and resource-constrained devices.
    Abstract Several methods exist today to accelerate Machine Learning (ML) or Deep-Learning (DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search space optimizations which are costly in terms of power and hardware usage. Especially in the case of inference, when the batch size is 1 and execution is on CPUs or for power-constrained edge devices, current techniques can become costly, complicated or inapplicable. To ameliorate this, we present a Critical-Path-based Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs. Our task parallelization approach further optimizes the structure of graphs via cloning and prunes them via constant propagation and dead-code elimination. Contrary to other work, we generate readable and executable parallel Pytorch+Python code from input ML models in ONNX format via a new tool that we have built called Ramiel. This allows us to benefit from other downstream acceleration techniques like intra-op parallelism and potentially pipeline parallelism. Our preliminary results on several ML graphs demonstrate up to 1.9× speedup over serial execution and outperform some of the current mechanisms in both compile and runtimes. Lastly, our methods are lightweight and fast enough so that they can be used effectively for power and resource-constrained devices, while still enabling downstream optimizations.
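One plausible reading of critical-path-based linear clustering, sketched with networkx under the assumption of unit node costs (the Ramiel tool and its actual cost model are not reproduced here): repeatedly peel the longest path off the operator DAG, so each peeled path becomes one sequential cluster and distinct clusters can run in parallel.

```python
import networkx as nx

def critical_path_clusters(g):
    """Sketch: peel off the critical (longest) path of a DAG until no
    nodes remain; each path is one sequential cluster. Unit node costs
    are assumed for brevity."""
    g = g.copy()
    clusters = []
    while g.number_of_nodes():
        path = nx.dag_longest_path(g)     # critical path under unit weights
        clusters.append(path)
        g.remove_nodes_from(path)
    return clusters

dag = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")])
print(critical_path_clusters(dag))        # e.g. [['a', 'b', 'd', 'e'], ['c']]
```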

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

  • paper_url: http://arxiv.org/abs/2308.11189
  • repo_url: https://github.com/lab-v2/diversity_measures
  • paper_authors: Noel Ngu, Nathaniel Lee, Paulo Shakarian
  • for: Provides failure-prediction measures for large language models based on the diversity of responses to a given prompt, rather than on domain-specific information.
  • methods: Three diversity measures as proxies for failure probability: entropy, Gini impurity, and centroid distance.
  • results: Experiments across multiple datasets and temperature settings show that these measures correlate strongly with the probability of failure and can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
    Abstract Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
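The three measures are straightforward to compute over a set of sampled responses; a small sketch follows (the response strings and embeddings are synthetic, and the exact estimators may differ from the paper's).

```python
import numpy as np
from collections import Counter

def entropy_diversity(answers):
    """Shannon entropy of the empirical answer distribution."""
    p = np.array(list(Counter(answers).values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def gini_impurity(answers):
    """Gini impurity of the empirical answer distribution."""
    p = np.array(list(Counter(answers).values()), dtype=float)
    p /= p.sum()
    return float(1.0 - (p ** 2).sum())

def centroid_distance(embeddings):
    """Mean distance of response embeddings to their centroid."""
    e = np.asarray(embeddings, dtype=float)
    return float(np.linalg.norm(e - e.mean(axis=0), axis=1).mean())

samples = ["Paris", "Paris", "Lyon", "Paris", "Marseille"]  # 5 sampled answers
print(entropy_diversity(samples), gini_impurity(samples))
print(centroid_distance(np.random.randn(5, 384)))           # synthetic embeddings
```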

A three in one bottom-up framework for simultaneous semantic segmentation, instance segmentation and classification of multi-organ nuclei in digital cancer histology

  • paper_url: http://arxiv.org/abs/2308.11179
  • repo_url: None
  • paper_authors: Ibtihaj Ahmad, Syed Muhammad Israr, Zain Ul Islam
  • for: Developing a deep-learning framework for simultaneous semantic segmentation, instance segmentation, and classification of nuclei in digital histology images.
  • methods: A multi-stage bottom-up approach with additional decoder heads and independently weighted losses produces semantic segmentation, edge proposals, and classification maps; post-processing then yields the final segmentation and classification.
  • results: Achieves a 0.841 Dice score for semantic segmentation, 0.713 bPQ for instance segmentation, and 0.633 mPQ for nuclei classification; generalizes across 19 tissue types and is less complex than state-of-the-art methods.
    Abstract Simultaneous segmentation and classification of nuclei in digital histology play an essential role in computer-assisted cancer diagnosis; however, it remains challenging. The highest achieved binary and multi-class Panoptic Quality (PQ) remains as low as 0.68 bPQ and 0.49 mPQ, respectively. It is due to the higher staining variability, variability across the tissue, rough clinical conditions, overlapping nuclei, and nuclear class imbalance. The generic deep-learning methods usually rely on end-to-end models, which fail to address these problems associated explicitly with digital histology. In our previous work, DAN-NucNet, we resolved these issues for semantic segmentation with an end-to-end model. This work extends our previous model to simultaneous instance segmentation and classification. We introduce additional decoder heads with independent weighted losses, which produce semantic segmentation, edge proposals, and classification maps. We use the outputs from the three-head model to apply post-processing to produce the final segmentation and classification. Our multi-stage approach utilizes edge proposals and semantic segmentations compared to direct segmentation and classification strategies followed by most state-of-the-art methods. Due to this, we demonstrate a significant performance improvement in producing high-quality instance segmentation and nuclei classification. We have achieved a 0.841 Dice score for semantic segmentation, 0.713 bPQ scores for instance segmentation, and 0.633 mPQ for nuclei classification. Our proposed framework is generalized across 19 types of tissues. Furthermore, the framework is less complex compared to the state-of-the-art.
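A sketch of the independently weighted three-head loss described above; the loss types and weights are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class ThreeHeadLoss(nn.Module):
    """Independently weighted losses for the three decoder heads
    (semantic mask, edge proposals, class map). Weights and loss
    choices are illustrative."""
    def __init__(self, w_sem=1.0, w_edge=0.5, w_cls=1.0):
        super().__init__()
        self.w = (w_sem, w_edge, w_cls)
        self.bce = nn.BCEWithLogitsLoss()
        self.ce = nn.CrossEntropyLoss()

    def forward(self, sem, edge, cls, sem_t, edge_t, cls_t):
        return (self.w[0] * self.bce(sem, sem_t)
                + self.w[1] * self.bce(edge, edge_t)
                + self.w[2] * self.ce(cls, cls_t))

loss_fn = ThreeHeadLoss()
sem, edge = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
cls = torch.randn(2, 5, 64, 64)                    # 5 nucleus classes
loss = loss_fn(sem, edge, cls,
               torch.rand(2, 1, 64, 64).round(),   # binary mask target
               torch.rand(2, 1, 64, 64).round(),   # binary edge target
               torch.randint(0, 5, (2, 64, 64)))   # per-pixel class target
print(loss.item())
```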

A Preliminary Investigation into Search and Matching for Tumour Discrimination in WHO Breast Taxonomy Using Deep Networks

  • paper_url: http://arxiv.org/abs/2308.11162
  • repo_url: None
  • paper_authors: Abubakr Shafique, Ricardo Gonzalez, Liron Pantanowitz, Puay Hoon Tan, Alberto Machado, Ian A Cree, Hamid R. Tizhoosh
  • for: Developing a deep-learning-indexed, searchable digital atlas to help pathologists diagnose breast cancer, a form of computational second opinion.
  • methods: Deep features are extracted with a state-of-the-art model pre-trained on millions of diagnostic histopathology images from the TCGA repository, and all 35 tumour types of the WHO breast taxonomy (Classification of Tumours, 5th Ed.) are indexed and visualized to enable patch matching against evidently diagnosed archival cases.
  • results: Patch similarity search within the WHO breast taxonomy data reached over 88% accuracy under majority-vote validation and over 91% under top-n validation, suggesting that an indexed digital archive can help investigate complex relationships among common and rare breast lesions.
    Abstract Breast cancer is one of the most common cancers affecting women worldwide. They include a group of malignant neoplasms with a variety of biological, clinical, and histopathological characteristics. There are more than 35 different histological forms of breast lesions that can be classified and diagnosed histologically according to cell morphology, growth, and architecture patterns. Recently, deep learning, in the field of artificial intelligence, has drawn a lot of attention for the computerized representation of medical images. Searchable digital atlases can provide pathologists with patch matching tools allowing them to search among evidently diagnosed and treated archival cases, a technology that may be regarded as computational second opinion. In this study, we indexed and analyzed the WHO breast taxonomy (Classification of Tumours 5th Ed.) spanning 35 tumour types. We visualized all tumour types using deep features extracted from a state-of-the-art deep learning model, pre-trained on millions of diagnostic histopathology images from the TCGA repository. Furthermore, we test the concept of a digital "atlas" as a reference for search and matching with rare test cases. The patch similarity search within the WHO breast taxonomy data reached over 88% accuracy when validating through "majority vote" and more than 91% accuracy when validating using top-n tumour types. These results show for the first time that complex relationships among common and rare breast lesions can be investigated using an indexed digital archive.
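A minimal sketch of the patch-matching step with majority-vote validation, using cosine similarity over deep features; the features and labels below are synthetic, and the retrieval pipeline is a simplification.

```python
import numpy as np

def majority_vote_search(query, index_feats, index_labels, top_n=5):
    """Retrieve the top-n archived patches by cosine similarity and
    return the majority diagnosis among them (illustrative)."""
    q = query / np.linalg.norm(query)
    f = index_feats / np.linalg.norm(index_feats, axis=1, keepdims=True)
    sims = f @ q                           # cosine similarity to every patch
    top = np.argsort(-sims)[:top_n]
    labels = index_labels[top]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)], labels   # majority vote + raw top-n

feats = np.random.randn(1000, 1024)            # archived deep features
labels = np.random.randint(0, 35, 1000)        # 35 WHO tumour types
pred, topn = majority_vote_search(np.random.randn(1024), feats, labels)
print(pred, topn)
```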

xxMD: Benchmarking Neural Force Fields Using Extended Dynamics beyond Equilibrium

  • paper_url: http://arxiv.org/abs/2308.11155
  • repo_url: https://github.com/zpengmei/xxmd
  • paper_authors: Zihan Pengmei, Junyu Liu, Yinan Shu
  • for: Benchmarking neural force fields (NFFs), which serve in computational chemistry as surrogates for quantum-chemistry calculations in ab initio molecular dynamics.
  • methods: The new xxMD (Extended Excited-state Molecular Dynamics) dataset is generated from non-adiabatic dynamics, with energies and forces determined by both multireference wave function theory and density functional theory.
  • results: The internal-coordinate and energy distributions of the MD17 datasets are shown to be constrained to near-equilibrium geometries, making them inadequate for representing chemical reactions; xxMD depicts reactions more faithfully, and equivariant models re-assessed on it show notably higher mean absolute errors, exposing the challenge of building generalizable NFFs with extrapolation capability.
    Abstract Neural force fields (NFFs) have gained prominence in computational chemistry as surrogate models, superseding quantum-chemistry calculations in ab initio molecular dynamics. The prevalent benchmark for NFFs has been the MD17 dataset and its subsequent extension. These datasets predominantly comprise geometries from the equilibrium region of the ground electronic state potential energy surface, sampling from direct adiabatic dynamics. However, many chemical reactions entail significant molecular deformations, notably bond breaking. We demonstrate the constrained distribution of internal coordinates and energies in the MD17 datasets, underscoring their inadequacy for representing systems undergoing chemical reactions. Addressing this sampling limitation, we introduce the xxMD (Extended Excited-state Molecular Dynamics) dataset, derived from non-adiabatic dynamics. This dataset encompasses energies and forces ascertained from both multireference wave function theory and density functional theory. Furthermore, its nuclear configuration spaces authentically depict chemical reactions, making xxMD a more chemically relevant dataset. Our re-assessment of equivariant models on the xxMD datasets reveals notably higher mean absolute errors than those reported for MD17 and its variants. This observation underscores the challenges faced in crafting a generalizable NFF model with extrapolation capability. Our proposed xxMD-CASSCF and xxMD-DFT datasets are available at https://github.com/zpengmei/xxMD.

Mobility-Aware Computation Offloading for Swarm Robotics using Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.11154
  • repo_url: None
  • paper_authors: Xiucheng Wang, Hongzhi Guo
  • for: automate dirty, dangerous, and dull tasks with swarm robotics
  • methods: leverage mobile edge computing and mobility-aware deep reinforcement learning model
  • results: meet delay requirements and guarantee computation precision with minimum robot energy
    Abstract Swarm robotics is envisioned to automate a large number of dirty, dangerous, and dull tasks. Robots have limited energy, computation capability, and communication resources. Therefore, current swarm robotics have a small number of robots, which can only provide limited spatio-temporal information. In this paper, we propose to leverage the mobile edge computing to alleviate the computation burden. We develop an effective solution based on a mobility-aware deep reinforcement learning model at the edge server side for computing scheduling and resource. Our results show that the proposed approach can meet delay requirements and guarantee computation precision by using minimum robot energy.

Energy-Efficient On-Board Radio Resource Management for Satellite Communications via Neuromorphic Computing

  • paper_url: http://arxiv.org/abs/2308.11152
  • repo_url: None
  • paper_authors: Flor Ortiz, Nicolas Skatchkovsky, Eva Lagunas, Wallace A. Martins, Geoffrey Eappen, Saed Daoud, Osvaldo Simeone, Bipin Rajendran, Symeon Chatzinotas
  • for: Exploring energy-efficient, brain-inspired machine learning for on-board radio resource management in satellite communication (SatCom) systems, targeting efficiency and sustainability.
  • methods: Spiking neural network models evaluated both in software simulation and in hardware experiments on the recently released Intel Loihi 2 chip, benchmarked against conventional CNNs implemented on a Xilinx Versal VCK5000.
  • results: For relevant workloads, SNNs on Loihi 2 yield higher accuracy while reducing power consumption by more than 100x compared to the CNN-based reference platform, pointing to the potential of neuromorphic computing for future SatCom systems.
    Abstract The latest satellite communication (SatCom) missions are characterized by a fully reconfigurable on-board software-defined payload, capable of adapting radio resources to the temporal and spatial variations of the system traffic. As pure optimization-based solutions have shown to be computationally tedious and to lack flexibility, machine learning (ML)-based methods have emerged as promising alternatives. We investigate the application of energy-efficient brain-inspired ML models for on-board radio resource management. Apart from software simulation, we report extensive experimental results leveraging the recently released Intel Loihi 2 chip. To benchmark the performance of the proposed model, we implement conventional convolutional neural networks (CNN) on a Xilinx Versal VCK5000, and provide a detailed comparison of accuracy, precision, recall, and energy efficiency for different traffic demands. Most notably, for relevant workloads, spiking neural networks (SNNs) implemented on Loihi 2 yield higher accuracy, while reducing power consumption by more than 100× as compared to the CNN-based reference platform. Our findings point to the significant potential of neuromorphic computing and SNNs in supporting on-board SatCom operations, paving the way for enhanced efficiency and sustainability in future SatCom systems.

LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (Practical Experience Report)

  • paper_url: http://arxiv.org/abs/2308.11148
  • repo_url: None
  • paper_authors: Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, Chun Zuo
  • for: automation of code review activities
  • methods: Large Language Models (LLMs) combined with parameter-efficient fine-tuning (PEFT) methods
  • results: Performance on par with existing code-review-focused models while using fewer than 1% of trainable parameters, even with the smallest 6.7B-parameter LLaMA base model and a limited number of tuning epochs; the code and all PEFT-weight plugins are open-sourced.
    Abstract The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.
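As one example of a PEFT setup consistent with the abstract (the checkpoint path, LoRA rank, and target modules below are assumptions, and LLaMA weights require separate access), the Hugging Face peft library keeps the trainable fraction small:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical local checkpoint path; not a downloadable model name.
base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")

# LoRA adapters on the attention projections; rank and alpha are
# illustrative, not the paper's reported configuration.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% trainable
```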

Exploring Unsupervised Cell Recognition with Prior Self-activation Maps

  • paper_url: http://arxiv.org/abs/2308.11144
  • repo_url: https://github.com/cpystan/psm
  • paper_authors: Pingyi Chen, Chenglu Zhu, Zhongyi Shui, Jiatong Cai, Sunyi Zheng, Shichuan Zhang, Lin Yang
  • for: This paper aims to reduce the dependency on manual annotations for cell recognition tasks in the medical field.
  • methods: The proposed method uses prior self-activation maps (PSMs) to generate pseudo masks as training targets. An activation network is trained with self-supervised learning, and the gradient information in the shallow layers of the network is aggregated to generate PSMs. A semantic clustering module then transforms PSMs into pixel-level semantic pseudo masks for downstream tasks.
  • results: The proposed method achieves competitive performance on two histological datasets (MoNuSeg and BCData) without any manual annotations. It also demonstrates the ability to perform multi-class cell detection, which is not possible with existing unsupervised methods. The results show the potential of PSMs to address the hunger for labels in the medical field.
    Abstract The success of supervised deep learning models on cell recognition tasks relies on detailed annotations. Many previous works have managed to reduce the dependency on labels. However, considering the large number of cells contained in a patch, costly and inefficient labeling is still inevitable. To this end, we explored label-free methods for cell recognition. Prior self-activation maps (PSM) are proposed to generate pseudo masks as training targets. To be specific, an activation network is trained with self-supervised learning. The gradient information in the shallow layers of the network is aggregated to generate prior self-activation maps. Afterward, a semantic clustering module is then introduced as a pipeline to transform PSMs to pixel-level semantic pseudo masks for downstream tasks. We evaluated our method on two histological datasets: MoNuSeg (cell segmentation) and BCData (multi-class cell detection). Compared with other fully-supervised and weakly-supervised methods, our method can achieve competitive performance without any manual annotations. Our simple but effective framework can also achieve multi-class cell detection which can not be done by existing unsupervised methods. The results show the potential of PSMs that might inspire other research to deal with the hunger for labels in medical area.

Graph Encoding and Neural Network Approaches for Volleyball Analytics: From Game Outcome to Individual Play Predictions

  • paper_url: http://arxiv.org/abs/2308.11142
  • repo_url: None
  • paper_authors: Rhys Tracy, Haotian Xia, Alex Rasla, Yuan-Fang Wang, Ambuj Singh
  • for: Improving the accuracy of complex volleyball predictions and providing more meaningful insights to coaches and players.
  • methods: A specialized graph encoding technique that adds contact-by-contact volleyball context to an existing dataset without any additional data gathering, combined with graph neural networks (GNNs).
  • results: GNNs on the enriched data noticeably improve results across three prediction tasks (rally outcome, set location, and hit type) compared to baseline models; simple adjustments such as removing blocked hits further improve the baselines, and the choice of model architecture matters for extracting task-relevant information.
    Abstract This research aims to improve the accuracy of complex volleyball predictions and provide more meaningful insights to coaches and players. We introduce a specialized graph encoding technique to add additional contact-by-contact volleyball context to an already available volleyball dataset without any additional data gathering. We demonstrate the potential benefits of using graph neural networks (GNNs) on this enriched dataset for three different volleyball prediction tasks: rally outcome prediction, set location prediction, and hit type prediction. We compare the performance of our graph-based models to baseline models and analyze the results to better understand the underlying relationships in a volleyball rally. Our results show that the use of GNNs with our graph encoding yields a much more advanced analysis of the data, which noticeably improves prediction results overall. We also show that these baseline tasks can be significantly improved with simple adjustments, such as removing blocked hits. Lastly, we demonstrate the importance of choosing a model architecture that will better extract the important information for a certain task. Overall, our study showcases the potential strengths and weaknesses of using graph encodings in sports data analytics and hopefully will inspire future improvements in machine learning strategies across sports and applications by using graphbased encodings.

Towards Validating Long-Term User Feedbacks in Interactive Recommendation Systems

  • paper_url: http://arxiv.org/abs/2308.11137
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Hojoon Lee, Dongyoon Hwang, Kyushik Min, Jaegul Choo
  • for: Evaluating reinforcement learning (RL)-based interactive recommender systems (IRSs) and the suitability of public review datasets for measuring their long-term effects.
  • methods: RL-based models are compared on publicly available review datasets against a simple reward model that greedily recommends the item with the highest one-step reward.
  • results: The simple greedy reward model consistently outperforms RL-based models in maximizing cumulative rewards; applying higher weighting to long-term rewards degrades recommendation performance; and user feedback has only minor long-term effects on the benchmark datasets.
    Abstract Interactive Recommender Systems (IRSs) have attracted a lot of attention, due to their ability to model interactive processes between users and recommender systems. Numerous approaches have adopted Reinforcement Learning (RL) algorithms, as these can directly maximize users' cumulative rewards. In IRS, researchers commonly utilize publicly available review datasets to compare and evaluate algorithms. However, user feedback provided in public datasets merely includes instant responses (e.g., a rating), with no inclusion of delayed responses (e.g., the dwell time and the lifetime value). Thus, the question remains whether these review datasets are an appropriate choice to evaluate the long-term effects of the IRS. In this work, we revisited experiments on IRS with review datasets and compared RL-based models with a simple reward model that greedily recommends the item with the highest one-step reward. Following extensive analysis, we can reveal three main findings: First, a simple greedy reward model consistently outperforms RL-based models in maximizing cumulative rewards. Second, applying higher weighting to long-term rewards leads to a degradation of recommendation performance. Third, user feedbacks have mere long-term effects on the benchmark datasets. Based on our findings, we conclude that a dataset has to be carefully verified and that a simple greedy baseline should be included for a proper evaluation of RL-based IRS approaches.
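The greedy baseline is simple to state: score every unseen item with a one-step reward model and recommend the argmax. A sketch with a dot-product reward (the embeddings below are synthetic placeholders):

```python
import numpy as np

def greedy_recommend(user_emb, item_embs, history, top_k=1):
    """Greedy baseline: score every unseen item with a one-step
    reward model (here, a dot product) and pick the argmax; no
    long-horizon planning is involved."""
    scores = item_embs @ user_emb
    scores[list(history)] = -np.inf        # mask already-consumed items
    return np.argsort(-scores)[:top_k]

items = np.random.randn(500, 32)           # learned item embeddings
user = np.random.randn(32)                 # learned user embedding
print(greedy_recommend(user, items, history={3, 42}))
```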

Transformers for Capturing Multi-level Graph Structure using Hierarchical Distances

  • paper_url: http://arxiv.org/abs/2308.11129
  • repo_url: None
  • paper_authors: Yuankai Luo
  • for: Improving the expressiveness of graph transformers by modeling the multi-level, hierarchical structure of graphs such as molecules, social networks, and citation networks.
  • methods: A hierarchy-distance structural encoding (HDSE) that models a hierarchical distance between nodes and integrates flexibly with existing graph transformers, allowing simultaneous application with other positional representations.
  • results: Extensive experiments on 12 real-world datasets show that HDSE successfully enhances various baseline transformers, achieving state-of-the-art empirical performance on 10 benchmark datasets.
    Abstract Graph transformers need strong inductive biases to derive meaningful attention scores. Yet, current proposals rarely address methods capturing longer ranges, hierarchical structures, or community structures, as they appear in various graphs such as molecules, social networks, and citation networks. In this paper, we propose a hierarchy-distance structural encoding (HDSE), which models a hierarchical distance between the nodes in a graph focusing on its multi-level, hierarchical nature. In particular, this yields a framework which can be flexibly integrated with existing graph transformers, allowing for simultaneous application with other positional representations. Through extensive experiments on 12 real-world datasets, we demonstrate that our HDSE method successfully enhances various types of baseline transformers, achieving state-of-the-art empirical performances on 10 benchmark datasets.

How Expressive are Graph Neural Networks in Recommendation?

  • paper_url: http://arxiv.org/abs/2308.11127
  • repo_url: https://github.com/hkuds/gte
  • paper_authors: Xuheng Cai, Lianghao Xia, Xubin Ren, Chao Huang
  • for: A comprehensive theoretical analysis of the expressiveness of graph neural networks (GNNs) in recommendation tasks.
  • methods: Three levels of expressiveness metrics are considered: graph isomorphism (graph-level), node automorphism (node-level), and a proposed topological closeness metric (link-level); a learning-less GNN algorithm that is optimal on the new metric is introduced for validation.
  • results: Topological closeness aligns closely with the recommendation objective of distinguishing nodes of different closeness, and extensive comparisons against state-of-the-art GNN models explore the explainability of the new metric in the recommendation task.
    Abstract Graph Neural Networks (GNNs) have demonstrated superior performance on various graph learning tasks, including recommendation, where they leverage user-item collaborative filtering signals in graphs. However, theoretical formulations of their capability are scarce, despite their empirical effectiveness in state-of-the-art recommender models. Recently, research has explored the expressiveness of GNNs in general, demonstrating that message passing GNNs are at most as powerful as the Weisfeiler-Lehman test, and that GNNs combined with random node initialization are universal. Nevertheless, the concept of "expressiveness" for GNNs remains vaguely defined. Most existing works adopt the graph isomorphism test as the metric of expressiveness, but this graph-level task may not effectively assess a model's ability in recommendation, where the objective is to distinguish nodes of different closeness. In this paper, we provide a comprehensive theoretical analysis of the expressiveness of GNNs in recommendation, considering three levels of expressiveness metrics: graph isomorphism (graph-level), node automorphism (node-level), and topological closeness (link-level). We propose the topological closeness metric to evaluate GNNs' ability to capture the structural distance between nodes, which aligns closely with the objective of recommendation. To validate the effectiveness of this new metric in evaluating recommendation performance, we introduce a learning-less GNN algorithm that is optimal on the new metric and can be optimal on the node-level metric with suitable modification. We conduct extensive experiments comparing the proposed algorithm against various types of state-of-the-art GNN models to explore the explainability of the new metric in the recommendation task. For reproducibility, implementation codes are available at https://github.com/HKUDS/GTE.

Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11119
  • repo_url: None
  • paper_authors: Masato Tamura
  • for: Developing a CLIP-based zero-shot anomaly detection method that requires no training images.
  • methods: CLIP's text encoder generates embeddings from prompts containing normal and anomaly words plus randomly generated words; a feed-forward network trained on these embeddings learns to extract normal and anomalous features.
  • results: Achieves state-of-the-art zero-shot performance without laborious prompt ensembling.
    Abstract This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection. Tremendous efforts have been put towards developing anomaly detectors due to their potential industrial applications. Considering the difficulty in acquiring various anomalous samples for training, most existing methods train models with only normal samples and measure discrepancies from the distribution of normal samples during inference, which requires training a model for each object category. The problem of this inefficient training requirement has been tackled by designing a CLIP-based anomaly detector that applies prompt-guided classification to each part of an image in a sliding window manner. However, the method still suffers from the labor of careful prompt ensembling with known object categories. To overcome the issues above, we propose leveraging CLIP as a data source for training. Our method generates text embeddings with the text encoder in CLIP with typical prompts that include words of normal and anomaly. In addition to these words, we insert several randomly generated words into prompts, which enables the encoder to generate a diverse set of normal and anomalous samples. Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomaly from CLIP's embeddings, and as a result, a category-agnostic anomaly detector can be obtained without any training images. Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
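A sketch of the prompt-construction idea with Hugging Face's CLIP text encoder; the prompt template and random-word generator are assumptions, not the paper's exact recipe.

```python
import random
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def random_word(n=6):
    """Random filler word to diversify the synthetic text samples."""
    return "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=n))

# State words (normal / anomaly) plus random fillers per prompt.
prompts = [f"a photo of a {random_word()} with {state}"
           for state in ("normal", "anomaly") for _ in range(4)]
tokens = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = text_encoder(**tokens).pooler_output   # (8, 512) text features
print(emb.shape)   # these embeddings would then train a feed-forward net
```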

Development of a Novel Quantum Pre-processing Filter to Improve Image Classification Accuracy of Neural Network Models

  • paper_url: http://arxiv.org/abs/2308.11112
  • repo_url: https://github.com/hajimesuzuki999/qpf
  • paper_authors: Farina Riaz, Shahab Abdulla, Hajime Suzuki, Srinjoy Ganguly, Ravinesh C. Deo, Susan Hopkins
  • for: Improving the image classification accuracy of neural network models.
  • methods: A quantum pre-processing filter (QPF): a simple four-qubit circuit that uses Y rotation gates for encoding and two controlled-NOT gates to create correlation among the qubits, applied as a feature extraction filter before a fully connected network.
  • results: Classification accuracy improves from 92.5% to 95.4% on MNIST and from 68.9% to 75.9% on EMNIST without extra model parameters or optimization; however, accuracy degrades on the more complex 43-class GTSRB traffic-sign dataset, motivating further research into more suitable quantum circuit designs.
    Abstract This paper proposes a novel quantum pre-processing filter (QPF) to improve the image classification accuracy of neural network (NN) models. A simple four qubit quantum circuit that uses Y rotation gates for encoding and two controlled NOT gates for creating correlation among the qubits is applied as a feature extraction filter prior to passing data into the fully connected NN architecture. By applying the QPF approach, the results show that the image classification accuracy based on the MNIST (handwritten 10 digits) and the EMNIST (handwritten 47 class digits and letters) datasets can be improved, from 92.5% to 95.4% and from 68.9% to 75.9%, respectively. These improvements were obtained without introducing extra model parameters or optimizations in the machine learning process. However, tests performed on the developed QPF approach against a relatively complex GTSRB dataset with 43 distinct class real-life traffic sign images showed a degradation in the classification accuracy. Considering this result, further research into the understanding and the design of a more suitable quantum circuit approach for image classification neural networks could be explored utilizing the baseline method proposed in this paper.
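The described four-qubit circuit is easy to express in Qiskit; the exact wiring and angle scaling below are assumptions consistent with the abstract (RY encoding plus two CNOTs), not a copy of the authors' circuit.

```python
import numpy as np
from qiskit import QuantumCircuit

def qpf_circuit(pixels):
    """Four-qubit sketch of the QPF: RY rotations angle-encode four
    normalized pixel values, and two CNOTs entangle neighbouring
    qubits before measurement."""
    qc = QuantumCircuit(4)
    for q, p in enumerate(pixels):
        qc.ry(np.pi * p, q)       # angle-encode a pixel in [0, 1]
    qc.cx(0, 1)                   # create correlation between qubits
    qc.cx(2, 3)
    qc.measure_all()
    return qc

print(qpf_circuit([0.1, 0.5, 0.9, 0.3]))
```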

CAME: Contrastive Automated Model Evaluation

  • paper_url: http://arxiv.org/abs/2308.11111
  • repo_url: https://github.com/pengr/contrastive_autoeval
  • paper_authors: Ru Peng, Qiuyang Duan, Haobo Wang, Jiachen Ma, Yanbo Jiang, Yongjun Tu, Xiu Jiang, Junbo Zhao
  • for: A new automated model evaluation (AutoEval) framework that removes the dependency on the training set.
  • methods: A contrastive loss whose value is theoretically bonded to model performance, evaluated directly on the unlabeled testing set without involving the training set in the loop.
  • results: CAME establishes new state-of-the-art AutoEval results, surpassing prior work significantly.
    Abstract The Automated Model Evaluation (AutoEval) framework entertains the possibility of evaluating a trained machine learning model without resorting to a labeled testing set. Despite the promise and some decent results, the existing AutoEval methods heavily rely on computing distribution shifts between the unlabelled testing set and the training set. We believe this reliance on the training set becomes another obstacle in shipping this technology to real-world ML development. In this work, we propose Contrastive Automatic Model Evaluation (CAME), a novel AutoEval framework that is rid of involving training set in the loop. The core idea of CAME bases on a theoretical analysis which bonds the model performance with a contrastive loss. Further, with extensive empirical validation, we manage to set up a predictable relationship between the two, simply by deducing on the unlabeled/unseen testing set. The resulting framework CAME establishes a new SOTA results for AutoEval by surpassing prior work significantly.

Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11103
  • repo_url: https://github.com/skatinger/anonymity-at-risk-assessing-re-identification-capabilities-of-large-language-models
  • paper_authors: Alex Nyffenegger, Matthias Stürmer, Joel Niklaus
  • for: Investigating whether LLMs can re-identify individuals anonymized in court rulings, a core privacy concern in the European Union and Switzerland.
  • methods: A proof of concept built on actual legal data from the Swiss Federal Supreme Court, followed by a more rigorous anonymized Wikipedia test bed, with new metrics introduced for the re-identification task.
  • results: Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions; the difficulty is attributed to the lack of test datasets, the substantial training resources required, and data sparsity in the information used for re-identification.
    Abstract Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the Swiss federal supreme court. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. With the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. The complexity is attributed to the lack of test datasets, the necessity for substantial training resources, and data sparsity in the information used for re-identification. In conclusion, this study demonstrates that re-identification using LLMs may not be feasible for now, but as the proof-of-concept on Wikipedia showed, it might become possible in the future. We hope that our system can help enhance the confidence in the security of anonymized decisions, thus leading to the courts being more confident to publish decisions.

Explicability and Inexplicability in the Interpretation of Quantum Neural Networks

  • paper_url: http://arxiv.org/abs/2308.11098
  • repo_url: https://github.com/lirandepira/interpret-qnn
  • paper_authors: Lirandë Pira, Chris Ferrie
  • for: Exploring the interpretability of AI methods, particularly deep neural networks, which is critical given the widespread use of AI-backed systems whose behavior is often unexplainable.
  • methods: Local model-agnostic interpretability measures applied to both quantum and classical neural networks.
  • results: Introduces the concept of the band of inexplicability, the region in which data samples have no explanation, likely victims of inherently random quantum measurements, as a step toward responsible and accountable quantum AI models.
    Abstract Interpretability of artificial intelligence (AI) methods, particularly deep neural networks, is of great interest due to the widespread use of AI-backed systems, which often have unexplainable behavior. The interpretability of such models is a crucial component of building trusted systems. Many methods exist to approach this problem, but they do not obviously generalize to the quantum setting. Here we explore the interpretability of quantum neural networks using local model-agnostic interpretability measures of quantum and classical neural networks. We introduce the concept of the band of inexplicability, representing the interpretable region in which data samples have no explanation, likely victims of inherently random quantum measurements. We see this as a step toward understanding how to build responsible and accountable quantum AI models.

Video OWL-ViT: Temporally-consistent open-world localization in video

  • paper_url: http://arxiv.org/abs/2308.11093
  • repo_url: None
  • paper_authors: Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf
  • for: Adapting pre-trained open-world image models to temporally consistent localization in video.
  • methods: Builds on the OWL-ViT open-vocabulary detection model and adds a transformer decoder that propagates object representations through time by using one frame's output tokens as the object queries for the next frame.
  • results: End-to-end trainable on video data, the model transfers successfully to the challenging TAO-OW benchmark, retaining the backbone detector's open-world capabilities while improving temporal consistency over tracking-by-detection baselines.
    Abstract We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.

Addressing Fairness and Explainability in Image Classification Using Optimal Transport

  • paper_url: http://arxiv.org/abs/2308.11090
  • repo_url: None
  • paper_authors: Philipp Ratz, François Hu, Arthur Charpentier
  • for: Improving the trustworthiness and accountability of AI systems in domains such as healthcare and policing by explaining why unfair outcomes arise.
  • methods: Optimal transport theory is used to uncover the causes and implications of biased regions in images, extending easily to tabular data; Wasserstein barycenters yield scores that are independent of a sensitive variable while keeping their marginal orderings.
  • results: The approach preserves predictive accuracy while recovering the regions most associated with bias generation, with significant implications for transparent, accountable, and fair decision-making across diverse domains.
    Abstract Algorithmic Fairness and the explainability of potentially unfair outcomes are crucial for establishing trust and accountability of Artificial Intelligence systems in domains such as healthcare and policing. Though significant advances have been made in each of the fields separately, achieving explainability in fairness applications remains challenging, particularly so in domains where deep neural networks are used. At the same time, ethical data-mining has become ever more relevant, as it has been shown countless times that fairness-unaware algorithms result in biased outcomes. Current approaches focus on mitigating biases in the outcomes of the model, but few attempts have been made to try to explain why a model is biased. To bridge this gap, we propose a comprehensive approach that leverages optimal transport theory to uncover the causes and implications of biased regions in images, which easily extends to tabular data as well. Through the use of Wasserstein barycenters, we obtain scores that are independent of a sensitive variable but keep their marginal orderings. This step ensures predictive accuracy but also helps us to recover the regions most associated with the generation of the biases. Our findings hold significant implications for the development of trustworthy and unbiased AI systems, fostering transparency, accountability, and fairness in critical decision-making scenarios across diverse domains.
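In one dimension, the Wasserstein-2 barycenter of two empirical distributions reduces to quantile averaging, which gives a compact sketch of the score-repair idea (the group score distributions below are synthetic, and this is not the paper's pipeline):

```python
import numpy as np

def wasserstein_barycenter_1d(samples_a, samples_b, weights=(0.5, 0.5)):
    """1D Wasserstein-2 barycenter of two empirical score distributions
    via quantile averaging; scores mapped to it are independent of the
    binary sensitive attribute while marginal orderings are preserved."""
    qs = np.linspace(0, 1, 101)
    qa = np.quantile(samples_a, qs)
    qb = np.quantile(samples_b, qs)
    return weights[0] * qa + weights[1] * qb     # barycenter quantiles

def repair_score(score, own_group_samples, barycenter_q):
    """Map a score to the barycenter through its within-group rank."""
    rank = (own_group_samples < score).mean()
    return np.interp(rank, np.linspace(0, 1, 101), barycenter_q)

g0 = np.random.beta(2, 5, 1000)    # scores of group 0 (synthetic)
g1 = np.random.beta(5, 2, 1000)    # scores of group 1 (synthetic)
bq = wasserstein_barycenter_1d(g0, g1)
print(repair_score(0.4, g0, bq))
```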

Stress representations for tensor basis neural networks: alternative formulations to Finger-Rivlin-Ericksen

  • paper_url: http://arxiv.org/abs/2308.11080
  • repo_url: None
  • paper_authors: Jan N. Fuhg, Nikolaos Bouklas, Reese E. Jones
  • for: Studying data-driven constitutive modeling frameworks that combine neural networks with classical representation theorems.
  • methods: A survey of tensor basis neural network models for hyperelastic materials at finite deformation, including so-far-unexplored formulations with invariants and generators theoretically equivalent to Finger-Rivlin-Ericksen, comparing potential-based and coefficient-based approaches as well as different calibration techniques.
  • results: Nine model variants are tested against noisy and noiseless datasets for three different materials, yielding theoretical and practical insights into the performance of each formulation.
    Abstract Data-driven constitutive modeling frameworks based on neural networks and classical representation theorems have recently gained considerable attention due to their ability to easily incorporate constitutive constraints and their excellent generalization performance. In these models, the stress prediction follows from a linear combination of invariant-dependent coefficient functions and known tensor basis generators. However, thus far the formulations have been limited to stress representations based on the classical Rivlin and Ericksen form, while the performance of alternative representations has yet to be investigated. In this work, we survey a variety of tensor basis neural network models for modeling hyperelastic materials in a finite deformation context, including a number of so far unexplored formulations which use theoretically equivalent invariants and generators to Finger-Rivlin-Ericksen. Furthermore, we compare potential-based and coefficient-based approaches, as well as different calibration techniques. Nine variants are tested against both noisy and noiseless datasets for three different materials. Theoretical and practical insights into the performance of each formulation are given.
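A coefficient-based tensor basis network is compact to sketch: an MLP maps strain invariants to coefficients, and the stress is a linear combination of known generators. The sketch below assumes the classical Finger-Rivlin-Ericksen basis (I, B, B²) and hypothetical layer sizes; the paper's alternative formulations swap in other theoretically equivalent invariants and generators.

```python
import torch
import torch.nn as nn

class TBNNSketch(nn.Module):
    """Coefficient-based tensor basis network: an MLP maps the
    invariants (I1, I2, I3) of B to coefficients c_i, and
    stress = sum_i c_i * G_i for basis generators G_i = I, B, B^2."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.Softplus(),
                                 nn.Linear(64, 3))

    def forward(self, B):                  # B: (batch, 3, 3) left Cauchy-Green
        I1 = B.diagonal(dim1=1, dim2=2).sum(-1)
        I2 = 0.5 * (I1 ** 2 - (B @ B).diagonal(dim1=1, dim2=2).sum(-1))
        I3 = torch.linalg.det(B)
        c = self.mlp(torch.stack([I1, I2, I3], dim=-1))    # (batch, 3)
        eye = torch.eye(3).expand_as(B)
        gens = torch.stack([eye, B, B @ B], dim=1)         # (batch, 3, 3, 3)
        return (c[:, :, None, None] * gens).sum(dim=1)     # stress tensor

F = torch.eye(3)[None] + 0.05 * torch.randn(1, 3, 3)       # deformation gradient
print(TBNNSketch()(F @ F.transpose(1, 2)).shape)           # torch.Size([1, 3, 3])
```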
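To make the coefficient-based form concrete, the sketch below shows a minimal tensor basis network in PyTorch: an MLP maps invariants to coefficients, and the stress is their linear combination with tensor generators. The specific invariants (tr b, tr b^2) and generators (I, b, b^2) are illustrative choices, not the particular formulations the paper surveys.

```python
import torch
import torch.nn as nn

class TensorBasisNN(nn.Module):
    """Coefficient-based tensor basis network: stress is a linear
    combination of known tensor generators, with invariant-dependent
    coefficients produced by an MLP, in the spirit of the
    Finger-Rivlin-Ericksen form the abstract describes."""
    def __init__(self, n_invariants=2, n_generators=3, hidden=32):
        super().__init__()
        self.coeff_net = nn.Sequential(
            nn.Linear(n_invariants, hidden), nn.Softplus(),
            nn.Linear(hidden, n_generators))

    def forward(self, b):  # b: (batch, 3, 3) left Cauchy-Green tensor
        I = torch.eye(3, device=b.device).expand_as(b)
        i1 = torch.einsum('bii->b', b)            # first invariant tr(b)
        i2 = torch.einsum('bii->b', b @ b)        # illustrative: tr(b^2)
        coeffs = self.coeff_net(torch.stack([i1, i2], dim=-1))  # (batch, 3)
        basis = torch.stack([I, b, b @ b], dim=1)    # (batch, 3, 3, 3)
        return torch.einsum('bk,bkij->bij', coeffs, basis)  # stress tensor
```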

Long-Term Prediction of Natural Video Sequences with Robust Video Predictors

  • paper_url: http://arxiv.org/abs/2308.11079
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: Predicting high-dimensional video sequences, a problem dominated by accumulating uncertainty.
  • methods: The paper combines deep perceptual and uncertainty-based reconstruction losses with attention-based skip connections to improve short-term prediction quality.
  • results: The paper achieves high-quality short-term predictions and, through an iterated single-step prediction task, generates very long natural video sequences.
    Abstract Predicting high dimensional video sequences is a curiously difficult problem. The number of possible futures for a given video sequence grows exponentially over time due to uncertainty. This is especially evident when trying to predict complicated natural video scenes from a limited snapshot of the world. The inherent uncertainty accumulates the further into the future you predict making long-term prediction very difficult. In this work we introduce a number of improvements to existing work that aid in creating Robust Video Predictors (RoViPs). We show that with a combination of deep Perceptual and uncertainty-based reconstruction losses we are able to create high quality short-term predictions. Attention-based skip connections are utilised to allow for long range spatial movement of input features to further improve performance. Finally, we show that by simply making the predictor robust to its own prediction errors, it is possible to produce very long, realistic natural video sequences using an iterated single-step prediction task.
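The long-horizon generation the abstract mentions reduces to iterated single-step prediction: each predicted frame is fed back as input. A minimal sketch of that rollout loop follows; the predictor interface (a window of frames in, one frame out) is an assumption, not the paper's exact API.

```python
import torch

def rollout(predictor, context, n_steps):
    """Iterated single-step prediction: feed each predicted frame back
    as input. The abstract's point is that a predictor made robust to
    its own errors can sustain very long rollouts; `context` is a list
    of (C, H, W) seed frames."""
    frames = list(context)
    window = len(context)
    with torch.no_grad():
        for _ in range(n_steps):
            x = torch.stack(frames[-window:], dim=0).unsqueeze(0)  # (1, T, C, H, W)
            frames.append(predictor(x).squeeze(0))                 # next frame
    return torch.stack(frames[window:], dim=0)                     # predicted frames only
```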

A Deep Dive into the Connections Between the Renormalization Group and Deep Learning in the Ising Model

  • paper_url: http://arxiv.org/abs/2308.11075
  • repo_url: None
  • paper_authors: Kelsie Taylor
  • for: This paper investigates the relationship between deep learning and the renormalization group (RG), and whether deep learning can implement RG flow.
  • methods: The paper uses Restricted Boltzmann Machines (RBMs) for the deep learning side and develops a suite of renormalization techniques for the 1D and 2D Ising models to provide a baseline for comparison.
  • results: For the 1D Ising model, learning the group flow with the Adam optimizer and a correlation-length loss function yields results consistent with the analytical model. For the 2D Ising model, samples are generated with the Wolff algorithm and the group flow is performed with a quasi-deterministic method, validated by computing the critical exponent $\nu$. Layer-by-layer RBM learning exhibits a blocking structure qualitatively similar to RG, but a direct comparison of per-layer weights with Ising spin renormalization reveals quantitative inconsistencies even for the simplest nearest-neighbor Ising model.
    Abstract The renormalization group (RG) is an essential technique in statistical physics and quantum field theory, which considers scale-invariant properties of physical theories and how these theories' parameters change with scaling. Deep learning is a powerful computational technique that uses multi-layered neural networks to solve a myriad of complicated problems. Previous research suggests the possibility that unsupervised deep learning may be a form of RG flow, by being a layer-by-layer coarse graining of the original data. We examined this connection on a more rigorous basis for the simple example of Kadanoff block renormalization of the 2D nearest-neighbor Ising model, with our deep learning accomplished via Restricted Boltzmann Machines (RBMs). We developed extensive renormalization techniques for the 1D and 2D Ising model to provide a baseline for comparison. For the 1D Ising model, we successfully used Adam optimization on a correlation length loss function to learn the group flow, yielding results consistent with the analytical model for infinite N. For the 2D Ising model, we successfully generated Ising model samples using the Wolff algorithm, and performed the group flow using a quasi-deterministic method, validating these results by calculating the critical exponent $\nu$. We then examined RBM learning of the Ising model layer by layer, finding a blocking structure in the learning that is qualitatively similar to RG. Lastly, we directly compared the weights of each layer from the learning to Ising spin renormalization, but found quantitative inconsistencies for the simple case of nearest-neighbor Ising models.
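As a reference point for what the RBM layers are compared against, the sketch below implements one majority-rule Kadanoff blocking step on a 2D Ising configuration; the block size and lattice size are illustrative.

```python
import numpy as np

def kadanoff_block(spins, b=3):
    """One majority-rule Kadanoff blocking step on a 2D Ising
    configuration: each b x b block of +/-1 spins is replaced by the
    sign of its sum (odd b avoids ties). This is the classical coarse
    graining the paper compares RBM layers against."""
    L = spins.shape[0]
    assert L % b == 0, "lattice size must be divisible by block size"
    blocks = spins.reshape(L // b, b, L // b, b).sum(axis=(1, 3))
    return np.sign(blocks).astype(int)

# Example: block a random 9x9 configuration down to 3x3.
rng = np.random.default_rng(0)
config = rng.choice([-1, 1], size=(9, 9))
coarse = kadanoff_block(config, b=3)
```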

Neural Amortized Inference for Nested Multi-agent Reasoning

  • paper_url: http://arxiv.org/abs/2308.11071
  • repo_url: None
  • paper_authors: Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu
  • for: This work aims to improve nested social reasoning in multi-agent interactions, enabling machines to better understand how other agents reason about them.
  • methods: The work proposes a novel approach that uses neural networks to amortize high-order social inference, reducing the computational complexity of nested multi-agent reasoning.
  • results: Experiments show the method reduces computational cost while maintaining accuracy.
    Abstract Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.

Topological Graph Signal Compression

  • paper_url: http://arxiv.org/abs/2308.11068
  • repo_url: None
  • paper_authors: Guillermo Bernárdez, Lev Telyatnikov, Eduard Alarcón, Albert Cabellos-Aparicio, Pere Barlet-Ros, Pietro Liò
  • for: This paper proposes a Topological Deep Learning (TDL) method for compressing signals over graph structures.
  • methods: The method has two main steps: first, the $N$ datapoints are clustered into $K\ll N$ higher-order collections based on the original signal; then, a topology-inspired message passing scheme produces a compressed representation of the signal within those multi-element sets.
  • results: On two real-world Internet Service Provider Network datasets, the framework improves the reconstruction errors of standard GNN and feed-forward architectures by $30\%$ to $90\%$ across all evaluation scenarios, indicating that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.
    Abstract Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNN) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper we propose a novel TDL-based method for compressing signals over graphs, consisting in two main steps: first, disjoint sets of higher-order structures are inferred based on the original signal --by clustering $N$ datapoints into $K\ll N$ collections; then, a topological-inspired message passing gets a compressed representation of the signal within those multi-element sets. Our results show that our framework improves both standard GNN and feed-forward architectures in compressing temporal link-based signals from two real-word Internet Service Provider Networks' datasets --from $30\%$ up to $90\%$ better reconstruction errors across all evaluation scenarios--, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.

UnLoc: A Unified Framework for Video Localization Tasks

  • paper_url: http://arxiv.org/abs/2308.11062
  • repo_url: https://github.com/google-research/scenic
  • paper_authors: Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
  • for: This paper proposes a new approach to temporal localization in untrimmed videos.
  • methods: The method uses pretrained image and text towers and feeds tokens to a video-text fusion model. The fusion module's outputs are used to construct a feature pyramid in which each level connects to a head that predicts a per-frame relevancy score and start/end time displacements.
  • results: Unlike previous approaches, the architecture handles Moment Retrieval, Temporal Localization, and Action Segmentation with a single-stage model, without action proposals, motion-based pretrained features, or representation masking, and achieves state-of-the-art results on all three localization tasks.
    Abstract While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code will be available at: \url{https://github.com/google-research/scenic}.

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12305
  • repo_url: None
  • paper_authors: Haokun Chen, Yao Zhang, Denis Krompass, Jindong Gu, Volker Tresp
  • for: This work aims to improve the performance of foundation models in multi-modal learning while respecting the data-privacy constraints of different sectors.
  • methods: The work combines Federated Learning with a Dual-Adapter Teacher (DAT) to handle the heterogeneity of multi-modal data across clients.
  • results: The results show that the method substantially improves foundation-model performance in multi-modal federated learning and is more efficient than conventional centralized training.
    Abstract Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. Hereby, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Aadapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for an efficient knowledge transfer. FedDAT is the first approach that enables an efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.

Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

  • paper_url: http://arxiv.org/abs/2308.11053
  • repo_url: None
  • paper_authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng
  • for: Improving echo cancellation and noise suppression in full-duplex communication, where existing neural networks have high computational cost and little flexibility in tuning model complexity.
  • methods: The paper proposes time-frequency dual-path compression to cover a wide range of compression ratios on computational cost. For frequency compression, trainable filters replace manually designed filters for dimension reduction. For time compression, frame-skipped prediction alone causes large performance degradation, which is alleviated by a post-processing network with full-sequence modeling.
  • results: Dual-path compression combining the time and frequency methods gives further performance improvement, covering compression ratios from 4x to 32x with little change in model size. The proposed models are competitive with fast FullSubNet and DeepFilterNet.
    Abstract Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to replace manually designed filters for dimension reduction. For time compression, only using frame skipped prediction causes large performance degradation, which can be alleviated by a post-processing network with full sequence modeling. We have found that under fixed compression ratios, dual-path compression combining both the time and frequency methods will give further performance improvement, covering compression ratios from 4x to 32x with little model size change. Moreover, the proposed models show competitive performance compared with fast FullSubNet and DeepFilterNet. A demo page can be found at hangtingchen.github.io/ultra_dual_path_compression.github.io/.
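The frequency-compression idea, trainable filters replacing hand-designed ones, can be illustrated with a learned filterbank applied to the magnitude spectrum. The sketch below is a minimal PyTorch version; the shapes and the softplus non-negativity constraint are our assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LearnedFreqCompression(nn.Module):
    """Frequency-path compression as the abstract sketches it: a
    trainable filterbank (here a plain linear map on the magnitude
    spectrum) replaces hand-designed filters such as mel banks."""
    def __init__(self, n_freq=257, n_bands=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_bands, n_freq) * 0.01)

    def forward(self, spec):  # spec: (batch, time, n_freq) magnitudes
        bank = torch.nn.functional.softplus(self.weight)  # non-negative filters
        return spec @ bank.t()                            # (batch, time, n_bands)
```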

Harmonization Across Imaging Locations(HAIL): One-Shot Learning for Brain MRI

  • paper_url: http://arxiv.org/abs/2308.11047
  • repo_url: None
  • paper_authors: Abhijeet Parida, Zhifan Jiang, Syed Muhammad Anwar, Nicholas Foreman, Nicholas Stence, Michael J. Fisher, Roger J. Packer, Robert A. Avery, Marius George Linguraru
  • for: Machine-learning-based diagnosis and prognosis of rare diseases, such as pediatric brain tumors, which requires imaging data gathered from multiple clinical sites.
  • methods: Instead of GAN-based harmonization, which is prone to hallucination, the paper proposes a one-shot learning approach combining a learned feature extractor, neural style transfer, and adaptive instance normalization.
  • results: Experiments show the method preserves patient anatomy while adjusting image intensities to match a new clinical site. The general harmonization model can be applied to unseen data from new sites, making it a valuable tool for real-world medical applications and clinical trials.
    Abstract For machine learning-based prognosis and diagnosis of rare diseases, such as pediatric brain tumors, it is necessary to gather medical imaging data from multiple clinical sites that may use different devices and protocols. Deep learning-driven harmonization of radiologic images relies on generative adversarial networks (GANs). However, GANs notoriously generate pseudo structures that do not exist in the original training data, a phenomenon known as "hallucination". To prevent hallucination in medical imaging, such as magnetic resonance images (MRI) of the brain, we propose a one-shot learning method where we utilize neural style transfer for harmonization. At test time, the method uses one image from a clinical site to generate an image that matches the intensity scale of the collaborating sites. Our approach combines learning a feature extractor, neural style transfer, and adaptive instance normalization. We further propose a novel strategy to evaluate the effectiveness of image harmonization approaches with evaluation metrics that both measure image style harmonization and assess the preservation of anatomical structures. Experimental results demonstrate the effectiveness of our method in preserving patient anatomy while adjusting the image intensities to a new clinical site. Our general harmonization model can be used on unseen data from new sites, making it a valuable tool for real-world medical applications and clinical trials.
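Of the three ingredients listed, adaptive instance normalization is the simplest to state: re-normalize each channel of the content features to the style features' statistics. A minimal sketch, assuming feature maps of shape (batch, channels, H, W):

```python
import torch

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: shift and scale each channel
    of the content features to the style features' per-channel mean
    and standard deviation."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```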

Spurious Correlations and Where to Find Them

  • paper_url: http://arxiv.org/abs/2308.11043
  • repo_url: None
  • paper_authors: Gautam Sreekumar, Vishnu Naresh Boddeti
  • for: This paper studies spurious correlations in data-driven learning and investigates them through causal graphs.
  • methods: The paper collects commonly studied hypotheses behind the occurrence of spurious correlations and experimentally investigates their influence on standard ERM baselines using synthetic datasets generated from causal graphs.
  • results: The study finds that the presence of spurious correlations degrades model performance and reveals patterns connecting these hypotheses to model design choices.
    Abstract Spurious correlations occur when a model learns unreliable features from the data and are a well-known drawback of data-driven learning. Although there are several algorithms proposed to mitigate it, we are yet to jointly derive the indicators of spurious correlations. As a result, the solutions built upon standalone hypotheses fail to beat simple ERM baselines. We collect some of the commonly studied hypotheses behind the occurrence of spurious correlations and investigate their influence on standard ERM baselines using synthetic datasets generated from causal graphs. Subsequently, we observe patterns connecting these hypotheses and model design choices.

Split Learning for Distributed Collaborative Training of Deep Learning Models in Health Informatics

  • paper_url: http://arxiv.org/abs/2308.11027
  • repo_url: None
  • paper_authors: Zhuohang Li, Chao Yan, Xinmeng Zhang, Gharib Gharibi, Zhijun Yin, Xiaoqian Jiang, Bradley A. Malin
  • for: This paper explores how split learning enables collaborative cross-institutional training of deep learning models in healthcare while keeping the original records and model parameters private.
  • methods: The paper introduces a new privacy-preserving distributed learning framework that splits data and model parameters across institutions, offering a higher level of privacy than conventional federated learning.
  • results: Using several biomedical imaging and electronic health record (EHR) datasets, the authors show that deep learning models trained via split learning achieve performance highly similar to their centralized and federated counterparts while greatly improving computational efficiency and reducing privacy risks.
    Abstract Deep learning continues to rapidly evolve and is now demonstrating remarkable potential for numerous medical prediction tasks. However, realizing deep learning models that generalize across healthcare organizations is challenging. This is due, in part, to the inherent siloed nature of these organizations and patient privacy requirements. To address this problem, we illustrate how split learning can enable collaborative training of deep learning models across disparate and privately maintained health datasets, while keeping the original records and model parameters private. We introduce a new privacy-preserving distributed learning framework that offers a higher level of privacy compared to conventional federated learning. We use several biomedical imaging and electronic health record (EHR) datasets to show that deep learning models trained via split learning can achieve highly similar performance to their centralized and federated counterparts while greatly improving computational efficiency and reducing privacy risks.
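A minimal sketch of the split-learning mechanic the abstract relies on is given below: the model is cut into a client part and a server part, and only the cut-layer activations and their gradients cross the boundary, never raw records or the other party's parameters. The toy layer sizes and SGD settings are illustrative, not the paper's framework.

```python
import torch
import torch.nn as nn

client_net = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # holds raw data
server_net = nn.Sequential(nn.Linear(32, 2))              # holds labels/loss
c_opt = torch.optim.SGD(client_net.parameters(), lr=0.1)
s_opt = torch.optim.SGD(server_net.parameters(), lr=0.1)

def split_step(x, y):
    smashed = client_net(x)                        # client forward pass
    detached = smashed.detach().requires_grad_()   # "sent" to the server
    loss = nn.functional.cross_entropy(server_net(detached), y)
    s_opt.zero_grad(); loss.backward(); s_opt.step()  # server update
    c_opt.zero_grad()
    smashed.backward(detached.grad)                # gradient "returned" to client
    c_opt.step()
    return loss.item()
```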

Extreme Multilabel Classification for Specialist Doctor Recommendation with Implicit Feedback and Limited Patient Metadata

  • paper_url: http://arxiv.org/abs/2308.11022
  • repo_url: None
  • paper_authors: Filipa Valdeira, Stevo Racković, Valeria Danalachi, Qiwei Han, Cláudia Soares
  • for: This work aims to build a more effective specialist-referral system that provides personalized recommendations both for new patients and for those with a consultation history.
  • methods: The work uses Extreme Multilabel Classification (XML), encoding the available features as multiple labels to predict physician recommendations across specialties.
  • results: Compared with state-of-the-art recommender systems, the approach improves recommendation accuracy by approximately $10\%$ for patients with a consultation history, and better exploits the available features for new patients, with particular gains in recall.
    Abstract Recommendation Systems (RS) are often used to address the issue of medical doctor referrals. However, these systems require access to patient feedback and medical records, which may not always be available in real-world scenarios. Our research focuses on medical referrals and aims to predict recommendations in different specialties of physicians for both new patients and those with a consultation history. We use Extreme Multilabel Classification (XML), commonly employed in text-based classification tasks, to encode available features and explore different scenarios. While its potential for recommendation tasks has often been suggested, this has not been thoroughly explored in the literature. Motivated by the doctor referral case, we show how to recast a traditional recommender setting into a multilabel classification problem that current XML methods can solve. Further, we propose a unified model leveraging patient history across different specialties. Compared to state-of-the-art RS using the same features, our approach consistently improves standard recommendation metrics up to approximately $10\%$ for patients with a previous consultation history. For new patients, XML proves better at exploiting available features, outperforming the benchmark in favorable scenarios, with particular emphasis on recall metrics. Thus, our approach brings us one step closer to creating more effective and personalized doctor referral systems. Additionally, it highlights XML as a promising alternative to current hybrid or content-based RS, while identifying key aspects to take into account when using XML for recommendation tasks.

Multi-Task Hypergraphs for Semi-supervised Learning using Earth Observations

  • paper_url: http://arxiv.org/abs/2308.11021
  • repo_url: None
  • paper_authors: Mihai Pirvu, Alina Marcu, Alexandra Dobrescu, Nabil Belbachir, Marius Leordeanu
  • for: This paper addresses Earth Observation, a highly multi-task problem that often lacks ground-truth data.
  • methods: The paper builds a multi-task hypergraph in which every node is a task; different paths through the hypergraph reaching a given task become unsupervised teachers, forming ensembles that generate reliable pseudolabels for that task.
  • results: Extensive experiments on the NASA NEO dataset, spanning 22 years, demonstrate the value of the multi-task semi-supervised approach through consistent improvements over strong baselines and recent work. The hypergraph also adapts unsupervised to gradual data-distribution shifts and reliably recovers missing data, across several observational layers, for up to seven years.
    Abstract There are many ways of interpreting the world and they are highly interdependent. We exploit such complex dependencies and introduce a powerful multi-task hypergraph, in which every node is a task and different paths through the hypergraph reaching a given task become unsupervised teachers, by forming ensembles that learn to generate reliable pseudolabels for that task. Each hyperedge is part of an ensemble teacher for a given task and it is also a student of the self-supervised hypergraph system. We apply our model to one of the most important problems of our times, that of Earth Observation, which is highly multi-task and it often suffers from missing ground-truth data. By performing extensive experiments on the NASA NEO Dataset, spanning a period of 22 years, we demonstrate the value of our multi-task semi-supervised approach, by consistent improvements over strong baselines and recent work. We also show that the hypergraph can adapt unsupervised to gradual data distribution shifts and reliably recover, through its multi-task self-supervision process, the missing data for several observational layers for up to seven years.

Instance-based Learning with Prototype Reduction for Real-Time Proportional Myocontrol: A Randomized User Study Demonstrating Accuracy-preserving Data Reduction for Prosthetic Embedded Systems

  • paper_url: http://arxiv.org/abs/2308.11019
  • repo_url: None
  • paper_authors: Tim Sziburis, Markus Nowak, Davide Brunelli
  • for: This study designs, implements, and validates kNN-based gesture detection for prosthetic control.
  • methods: To cope with the high computational demands of instance-based prediction, the study evaluates dataset-reduction methods, including Decision Surface Mapping (DSM), under real-time determinism constraints, using an eight-channel sEMG armband.
  • results: The kNN-based methods achieve significantly higher online success rates than the regression baselines in real-time validation, and the DSM reduction preserves accuracy while further improving flexibility and timing behavior.
    Abstract This work presents the design, implementation and validation of learning techniques based on the kNN scheme for gesture detection in prosthetic control. To cope with high computational demands in instance-based prediction, methods of dataset reduction are evaluated considering real-time determinism to allow for the reliable integration into battery-powered portable devices. The influence of parameterization and varying proportionality schemes is analyzed, utilizing an eight-channel-sEMG armband. Besides offline cross-validation accuracy, success rates in real-time pilot experiments (online target achievement tests) are determined. Based on the assessment of specific dataset reduction techniques' adequacy for embedded control applications regarding accuracy and timing behaviour, Decision Surface Mapping (DSM) proves itself promising when applying kNN on the reduced set. A randomized, double-blind user study was conducted to evaluate the respective methods (kNN and kNN with DSM-reduction) against Ridge Regression (RR) and RR with Random Fourier Features (RR-RFF). The kNN-based methods performed significantly better (p<0.0005) than the regression techniques. Between DSM-kNN and kNN, there was no statistically significant difference (significance level 0.05). This is remarkable in consideration of only one sample per class in the reduced set, thus yielding a reduction rate of over 99% while preserving success rate. The same behaviour could be confirmed in an extended user study. With k=1, which turned out to be an excellent choice, the runtime complexity of both kNN (in every prediction step) as well as DSM-kNN (in the training phase) becomes linear concerning the number of original samples, favouring dependable wearable prosthesis applications.
    摘要 The results show that kNN-based methods significantly outperform regression techniques (p<0.0005), with no statistically significant difference between DSM-kNN and kNN. The kNN method with a single nearest neighbor (k=1) achieves linear runtime complexity, making it suitable for real-time wearable prosthesis applications. The study demonstrates the potential of kNN-based gesture detection for reliable and efficient prosthetic control.Here is the Simplified Chinese translation of the text:这项研究提出了基于k-最近邻居(kNN)的手势检测方法,以适应轻量级的 prosthetic 控制。为了评估不同的数据减少技术的效果,这项研究比较了kNN与和DSM-kNN(Decision Surface Mapping)与ridge regression(RR)和RR的Random Fourier Features(RR-RFF)。结果显示,kNN基本方法significantly exceeds regression 方法(p<0.0005),并且DSM-kNN和kNN之间没有 statistically significiant 差异(significance 水平0.05)。kNN方法使用单个最近邻居(k=1) achieves linear 时间复杂度,使其适用于实时穿戴式 prosthesis 应用。这项研究 validate 了 kNN 基本方法的可靠和高效的手势检测。

Personalized Event Prediction for Electronic Health Records

  • paper_url: http://arxiv.org/abs/2308.11013
  • repo_url: None
  • paper_authors: Jeong Min Lee, Milos Hauskrecht
  • for: Predicting clinical event sequences to improve the quality of patient care.
  • methods: Personalized prediction based on individual patient characteristics, using self-adaptation and meta-level model selection.
  • results: Multiple prediction models are tested and analyzed on the MIMIC-III database, improving prediction accuracy.
    Abstract Clinical event sequences consist of hundreds of clinical events that represent records of patient care in time. Developing accurate predictive models of such sequences is of a great importance for supporting a variety of models for interpreting/classifying the current patient condition, or predicting adverse clinical events and outcomes, all aimed to improve patient care. One important challenge of learning predictive models of clinical sequences is their patient-specific variability. Based on underlying clinical conditions, each patient's sequence may consist of different sets of clinical events (observations, lab results, medications, procedures). Hence, simple population-wide models learned from event sequences for many different patients may not accurately predict patient-specific dynamics of event sequences and their differences. To address the problem, we propose and investigate multiple new event sequence prediction models and methods that let us better adjust the prediction for individual patients and their specific conditions. The methods developed in this work pursue refinement of population-wide models to subpopulations, self-adaptation, and a meta-level model switching that is able to adaptively select the model with the best chance to support the immediate prediction. We analyze and test the performance of these models on clinical event sequences of patients in MIMIC-III database.

Using language models in the implicit automated assessment of mathematical short answer items

  • paper_url: http://arxiv.org/abs/2308.11006
  • repo_url: None
  • paper_authors: Christopher Ormerod
  • for: This paper proposes a new way to assess short constructed responses to mathematics items, using a value identification pipeline to determine the correctness of the response and identify any misconceptions.
  • methods: The paper uses a pipeline consisting of two fine-tuned language models to identify the key values specified in the student response, with the first model determining if a value is implicit in the response and the second model identifying where the key value is specified.
  • results: The value identification pipeline is shown to be a more accurate and informative way to assess short constructed responses than traditional rubric-based scoring, and can provide more targeted feedback to students to help them improve their understanding of mathematics.
    Abstract We propose a new way to assess certain short constructed responses to mathematics items. Our approach uses a pipeline that identifies the key values specified by the student in their response. This allows us to determine the correctness of the response, as well as identify any misconceptions. The information from the value identification pipeline can then be used to provide feedback to the teacher and student. The value identification pipeline consists of two fine-tuned language models. The first model determines if a value is implicit in the student response. The second model identifies where in the response the key value is specified. We consider both a generic model that can be used for any prompt and value, as well as models that are specific to each prompt and value. The value identification pipeline is a more accurate and informative way to assess short constructed responses than traditional rubric-based scoring. It can be used to provide more targeted feedback to students, which can help them improve their understanding of mathematics.

Autonomous Detection of Methane Emissions in Multispectral Satellite Data Using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.11003
  • repo_url: None
  • paper_authors: Bertrand Rouet-Leduc, Thomas Kerdreux, Alexandre Tuel, Claudia Hulbert
  • for: Rapidly curbing global warming requires cutting methane emissions, but current monitoring techniques rely on estimated emission factors or self-reporting, which often dramatically underestimate emissions.
  • methods: The paper uses deep learning to automatically detect methane leaks in multispectral satellite data, with a much lower false-positive rate than existing multispectral methane data products and without prior knowledge of leak locations.
  • results: The proposed method enables accurate, high-frequency, automated monitoring of point-source methane emissions while greatly reducing the need for human analysis.
    Abstract Methane is one of the most potent greenhouse gases, and its short atmospheric half-life makes it a prime target to rapidly curb global warming. However, current methane emission monitoring techniques primarily rely on approximate emission factors or self-reporting, which have been shown to often dramatically underestimate emissions. Although initially designed to monitor surface properties, satellite multispectral data has recently emerged as a powerful method to analyze atmospheric content. However, the spectral resolution of multispectral instruments is poor, and methane measurements are typically very noisy. Methane data products are also sensitive to absorption by the surface and other atmospheric gases (water vapor in particular) and therefore provide noisy maps of potential methane plumes, that typically require extensive human analysis. Here, we show that the image recognition capabilities of deep learning methods can be leveraged to automatize the detection of methane leaks in Sentinel-2 satellite multispectral data, with dramatically reduced false positive rates compared with state-of-the-art multispectral methane data products, and without the need for a priori knowledge of potential leak sites. Our proposed approach paves the way for the automated, high-definition and high-frequency monitoring of point-source methane emissions across the world.

SupEuclid: Extremely Simple, High Quality OoD Detection with Supervised Contrastive Learning and Euclidean Distance

  • paper_url: http://arxiv.org/abs/2308.10973
  • repo_url: None
  • paper_authors: Jarrod Haas
  • for: This work proposes an extremely simple yet powerful Out-of-Distribution (OoD) detection method and evaluates its performance on near and far OoD benchmarks.
  • methods: The work trains a ResNet18 with Supervised Contrastive Learning (SCL) and uses Euclidean distance as the scoring rule.
  • results: The results show that this setup achieves state-of-the-art results out-of-the-box on near and far OoD detection benchmarks, without additional models or hyperparameter tuning.
    Abstract Out-of-Distribution (OoD) detection has developed substantially in the past few years, with available methods approaching, and in a few cases achieving, perfect data separation on standard benchmarks. These results generally involve large or complex models, pretraining, exposure to OoD examples or extra hyperparameter tuning. Remarkably, it is possible to achieve results that can exceed many of these state-of-the-art methods with a very simple method. We demonstrate that ResNet18 trained with Supervised Contrastive Learning (SCL) produces state-of-the-art results out-of-the-box on near and far OoD detection benchmarks using only Euclidean distance as a scoring rule. This may obviate the need in some cases for more sophisticated methods or larger models, and at the very least provides a very strong, easy to use baseline for further experimentation and analysis.
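A minimal sketch of the scoring rule as we read it, distance to the nearest class mean in the learned embedding space, assuming features have already been extracted by the SCL-trained ResNet18 (the paper's exact rule may differ):

```python
import numpy as np

def euclidean_ood_score(feats_train, labels_train, feats_test):
    """Score test points by Euclidean distance to the nearest class
    mean in the embedding space; larger scores indicate more OoD."""
    means = np.stack([feats_train[labels_train == c].mean(axis=0)
                      for c in np.unique(labels_train)])
    d = np.linalg.norm(feats_test[:, None, :] - means[None, :, :], axis=-1)
    return d.min(axis=1)
```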

Fat Shattering, Joint Measurability, and PAC Learnability of POVM Hypothesis Classes

  • paper_url: http://arxiv.org/abs/2308.12304
  • repo_url: None
  • paper_authors: Abram Magner, Arun Padakandla
  • for: This paper studies the learnability of quantum measurement classes, establishing matching necessary and sufficient conditions for PAC learnability together with sample-complexity bounds.
  • methods: The paper analyzes learning quantum measurement classes with empirical risk minimization and with denoised ERM, and proves the universality of the latter.
  • results: The paper shows which POVM hypothesis classes are PAC learnable, with concrete sample-complexity bounds, and ties learnability to finite fat-shattering dimension and approximate partitionability into approximately jointly measurable subsets. It further proves that every measurement class on a finite-dimensional Hilbert space is PAC learnable.
    Abstract We characterize learnability for quantum measurement classes by establishing matching necessary and sufficient conditions for their PAC learnability, along with corresponding sample complexity bounds, in the setting where the learner is given access only to prepared quantum states. We first probe the results from previous works on this setting. We show that the empirical risk defined in previous works and matching the definition in the classical theory fails to satisfy the uniform convergence property enjoyed in the classical setting for some learnable classes. Moreover, we show that VC dimension generalization upper bounds in previous work are frequently infinite, even for finite-dimensional POVM classes. To surmount the failure of the standard ERM to satisfy uniform convergence, we define a new learning rule -- denoised ERM. We show this to be a universal learning rule for POVM and probabilistically observed concept classes, and the condition for it to satisfy uniform convergence is finite fat shattering dimension of the class. We give quantitative sample complexity upper and lower bounds for learnability in terms of finite fat-shattering dimension and a notion of approximate finite partitionability into approximately jointly measurable subsets, which allow for sample reuse. We then show that finite fat shattering dimension implies finite coverability by approximately jointly measurable subsets, leading to our matching conditions. We also show that every measurement class defined on a finite-dimensional Hilbert space is PAC learnable. We illustrate our results on several example POVM classes.

MRI Field-transfer Reconstruction with Limited Data: Regularization by Neural Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10968
  • repo_url: None
  • paper_authors: Guoyao Shen, Yancheng Zhu, Hernan Jara, Sean B. Andersson, Chad W. Farris, Stephan Anderson, Xin Zhang
  • for: The goal of this study is to improve the quality of MRI reconstruction using deep learning models that fully exploit image priors.
  • methods: The study embeds a neural style transfer engine as a reconstruction prior, extending regularization by denoising to regularization by neural style transfer (RNST).
  • results: The study finds that RNST can reconstruct high-quality images from noisy, low-quality inputs across different image styles, and that it works even in limited-data settings.
    Abstract Recent works have demonstrated success in MRI reconstruction using deep learning-based models. However, most reported approaches require training on a task-specific, large-scale dataset. Regularization by denoising (RED) is a general pipeline which embeds a denoiser as a prior for image reconstruction. The potential of RED has been demonstrated for multiple image-related tasks such as denoising, deblurring and super-resolution. In this work, we propose a regularization by neural style transfer (RNST) method to further leverage the priors from the neural transfer and denoising engine. This enables RNST to reconstruct a high-quality image from a noisy low-quality image with different image styles and limited data. We validate RNST with clinical MRI scans from 1.5T and 3T and show that RNST can significantly boost image quality. Our results highlight the capability of the RNST framework for MRI reconstruction and the potential for reconstruction tasks with limited data.
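RNST builds on the RED pipeline, whose gradient-descent form is easy to state: the data-fidelity gradient is combined with a prior term that pulls the iterate toward the denoiser's output. A minimal sketch follows; the operator and denoiser interfaces are assumptions, and in RNST a style-transfer/denoising engine would take the denoiser's place.

```python
import numpy as np

def red_reconstruct(y, forward, adjoint, denoiser, lam=0.1, mu=0.05, iters=100):
    """Gradient-descent form of Regularization by Denoising (RED):
        x <- x - mu * (A^T (A x - y) + lam * (x - D(x))).
    `forward`/`adjoint` are the acquisition operator and its adjoint,
    `denoiser` the plugged-in prior engine."""
    x = adjoint(y)                      # simple zero-filled initialization
    for _ in range(iters):
        grad_fidelity = adjoint(forward(x) - y)
        grad_prior = lam * (x - denoiser(x))
        x = x - mu * (grad_fidelity + grad_prior)
    return x
```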

Structured World Models from Human Videos

  • paper_url: http://arxiv.org/abs/2308.10901
  • repo_url: None
  • paper_authors: Russell Mendonca, Shikhar Bahl, Deepak Pathak
  • for: The goal of this paper is to let robots learn complex, general behaviors directly in the real world.
  • methods: The method leverages internet-scale human video data to help robots learn manipulation skills quickly. The paper builds a human-centric action space grounded in visual affordances learned from human videos, and trains a world model in this space, fine-tuned on a small amount of robot interaction data.
  • results: Experiments show that this approach lets different robots learn a variety of manipulation skills in complex settings with only 30 minutes of interaction data. Videos are available at https://human-world-model.github.io.
    Abstract We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in the fields of computer vision and natural language, our belief is that in order to efficiently learn, a robot must be able to leverage internet-scale, human video data. Humans interact with the world in many interesting ways, which can allow a robot to not only build an understanding of useful actions and affordances but also how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings, in under 30 minutes of interaction. Videos can be found at https://human-world-model.github.io

Unlocking Accuracy and Fairness in Differentially Private Image Classification

  • paper_url: http://arxiv.org/abs/2308.10888
  • repo_url: None
  • paper_authors: Leonard Berrada, Soham De, Judy Hanwen Shen, Jamie Hayes, Robert Stanforth, David Stutz, Pushmeet Kohli, Samuel L. Smith, Borja Balle
  • for: The goal of this paper is to train machine learning models on private data while protecting individual privacy and preventing leakage of sensitive information.
  • methods: The paper fine-tunes pre-trained foundation models under the differential privacy (DP) framework, which provides formal privacy guarantees.
  • results: The study finds that privately trained classifiers can match the accuracy of non-private ones, even under large distribution shifts. The private classifiers reach accuracies on par with the non-private state of the art on four datasets and do not exhibit larger performance disparities across demographic groups. This breakthrough lets machine learning practitioners train safely on sensitive data while protecting individual privacy.
    Abstract Privacy-preserving machine learning aims to train models on private data without leaking sensitive information. Differential privacy (DP) is considered the gold standard framework for privacy-preserving training, as it provides formal privacy guarantees. However, compared to their non-private counterparts, models trained with DP often have significantly reduced accuracy. Private classifiers are also believed to exhibit larger performance disparities across subpopulations, raising fairness concerns. The poor performance of classifiers trained with DP has prevented the widespread adoption of privacy preserving machine learning in industry. Here we show that pre-trained foundation models fine-tuned with DP can achieve similar accuracy to non-private classifiers, even in the presence of significant distribution shifts between pre-training data and downstream tasks. We achieve private accuracies within a few percent of the non-private state of the art across four datasets, including two medical imaging benchmarks. Furthermore, our private medical classifiers do not exhibit larger performance disparities across demographic groups than non-private models. This milestone to make DP training a practical and reliable technology has the potential to widely enable machine learning practitioners to train safely on sensitive datasets while protecting individuals' privacy.
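The standard mechanism behind DP training of this kind is DP-SGD: clip each per-sample gradient, average, and add calibrated Gaussian noise. A minimal sketch with illustrative hyperparameters (not the paper's training recipe):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1, rng=None):
    """One DP-SGD update: clip each per-sample gradient to bound
    sensitivity, average, and add Gaussian noise scaled to the clip
    norm. `params` and each gradient are flat arrays of equal shape."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(per_sample_grads)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    noisy_mean = (sum(clipped)
                  + rng.normal(0, noise_multiplier * clip_norm,
                               size=params.shape)) / n
    return params - lr * noisy_mean
```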

Analyzing Transformer Dynamics as Movement through Embedding Space

  • paper_url: http://arxiv.org/abs/2308.10874
  • repo_url: None
  • paper_authors: Sumeet S. Singh
  • for: This paper explores how Transformer language models give rise to intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, and using tools.
  • methods: The authors take a systems approach and develop a mathematical framework that frames Transformer dynamics as movement through embedding space, providing a principled way of thinking about the problem and revealing how intelligent behavior emerges.
  • results: At its core the Transformer is an embedding-space walker that maps intelligent behavior to trajectories in this vector space; at each step it composes context into a single composite vector whose location defines the next step. The study further finds that in-context learning and generalization arise from different contexts composing into different vectors, and that the model's knowledge and skills are embodied in the organization of vectors in embedding space rather than in specific neurons or layers, with attention contributing an association bias to vector composition.
    Abstract Transformer language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting and using tools. This paper explores how their underlying mechanics give rise to intelligent behaviors. We adopt a systems approach to analyze Transformers in detail and develop a mathematical framework that frames their dynamics as movement through embedding space. This novel perspective provides a principled way of thinking about the problem and reveals important insights related to the emergence of intelligence: 1. At its core the Transformer is a Embedding Space walker, mapping intelligent behavior to trajectories in this vector space. 2. At each step of the walk, it composes context into a single composite vector whose location in Embedding Space defines the next step. 3. No learning actually occurs during decoding; in-context learning and generalization are simply the result of different contexts composing into different vectors. 4. Ultimately the knowledge, intelligence and skills exhibited by the model are embodied in the organization of vectors in Embedding Space rather than in specific neurons or layers. These abilities are properties of this organization. 5. Attention's contribution boils down to the association-bias it lends to vector composition and which influences the aforementioned organization. However, more investigation is needed to ascertain its significance. 6. The entire model is composed from two principal operations: data independent filtering and data dependent aggregation. This generalization unifies Transformers with other sequence models and across modalities. Building upon this foundation we formalize and test a semantic space theory which posits that embedding vectors represent semantic concepts and find some evidence of its validity.

Majorana Demonstrator Data Release for AI/ML Applications

  • paper_url: http://arxiv.org/abs/2308.10856
  • repo_url: None
  • paper_authors: I. J. Arnquist, F. T. Avignone III, A. S. Barabash, C. J. Barton, K. H. Bhimani, E. Blalock, B. Bos, M. Busch, M. Buuck, T. S. Caldwell, Y. -D. Chan, C. D. Christofferson, P. -H. Chu, M. L. Clark, C. Cuesta, J. A. Detwiler, Yu. Efremenko, H. Ejiri, S. R. Elliott, N. Fuad, G. K. Giovanetti, M. P. Green, J. Gruszko, I. S. Guinn, V. E. Guiseppe, C. R. Haufe, R. Henning, D. Hervas Aguilar, E. W. Hoppe, A. Hostiuc, M. F. Kidd, I. Kim, R. T. Kouzes, T. E. Lannen V, A. Li, J. M. Lopez-Castano, R. D. Martin, R. Massarczyk, S. J. Meijer, S. Mertens, T. K. Oli, L. S. Paudel, W. Pettus, A. W. P. Poon, B. Quenallata, D. C. Radford, A. L. Reine, K. Rielage, N. W. Ruof, D. C. Schaper, S. J. Schleich, D. Tedeschi, R. L. Varner, S. Vasilyev, S. L. Watkins, J. F. Wilkerson, C. Wiseman, W. Xu, C. -H. Yu, B. X. Zhu
  • for: The purpose of this release is to support the training and testing of Artificial Intelligence (AI) and Machine Learning (ML) algorithms on data from the Majorana Demonstrator experiment.
  • methods: The release draws on the Majorana Demonstrator's data analysis, acquisition, and processing pipeline.
  • results: The release provides a curated subset of Majorana calibration data, including raw Germanium detector waveforms, pulse shape discrimination cuts, and calibrated final energies, shared in HDF5 format with relevant metadata.
    Abstract The enclosed data release consists of a subset of the calibration data from the Majorana Demonstrator experiment. Each Majorana event is accompanied by raw Germanium detector waveforms, pulse shape discrimination cuts, and calibrated final energies, all shared in an HDF5 file format along with relevant metadata. This release is specifically designed to support the training and testing of Artificial Intelligence (AI) and Machine Learning (ML) algorithms upon our data. This document is structured as follows. Section I provides an overview of the dataset's content and format; Section II outlines the location of this dataset and the method for accessing it; Section III presents the NPML Machine Learning Challenge associated with this dataset; Section IV contains a disclaimer from the Majorana collaboration regarding the use of this dataset; Appendix A contains technical details of this data release. Please direct questions about the material provided within this release to liaobo77@ucsd.edu (A. Li).
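Since the release ships as HDF5, loading it takes only a few lines of h5py; the dataset names in the sketch below are hypothetical, so consult the release's own documentation (Appendix A of the data release) for the actual layout.

```python
import h5py

# Minimal sketch of inspecting an HDF5 release like this one; the
# dataset names ("waveform", "energy") are hypothetical placeholders.
with h5py.File("majorana_calibration.h5", "r") as f:
    f.visit(print)                       # list the groups/datasets present
    # waveforms = f["waveform"][:1000]   # e.g., first 1000 detector traces
    # energies = f["energy"][:1000]      # calibrated final energies
```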

Evaluating quantum generative models via imbalanced data classification benchmarks

  • paper_url: http://arxiv.org/abs/2308.10847
  • repo_url: None
  • paper_authors: Graham R. Enos, Matthew J. Reagor, Eric Hulburd
  • for: This paper applies explainable AI techniques to analyze whether quantum machine learning models behave differently from conventional models, studied systematically on real-world datasets of varying complexity and class imbalance.
  • methods: The paper generates synthetic data from a hybrid quantum-classical neural network adapted from twenty real-world datasets, including solar flares, cardiac arrhythmia, and speech data, each exhibiting a different degree of complexity and class imbalance.
  • results: Benchmarking the quantum-generated data against state-of-the-art class-imbalance mitigation methods shows that some problems benefit from hybrid quantum-classical generative models while others are better served by conventional methods, clarifying which problem characteristics make a hybrid quantum-classical generative model a good fit.
    Abstract A limited set of tools exist for assessing whether the behavior of quantum machine learning models diverges from conventional models, outside of abstract or theoretical settings. We present a systematic application of explainable artificial intelligence techniques to analyze synthetic data generated from a hybrid quantum-classical neural network adapted from twenty different real-world data sets, including solar flares, cardiac arrhythmia, and speech data. Each of these data sets exhibits varying degrees of complexity and class imbalance. We benchmark the quantum-generated data relative to state-of-the-art methods for mitigating class imbalance for associated classification tasks. We leverage this approach to elucidate the qualities of a problem that make it more or less likely to be amenable to a hybrid quantum-classical generative model.
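To make the benchmarking protocol concrete, the sketch below compares a baseline classifier against one trained with minority-class oversampling, using SMOTE as a classical stand-in for the quantum-generated samples the paper evaluates. The dataset and classifier are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Build an imbalanced toy problem (stand-in for e.g. the solar-flare data).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train directly on the imbalanced data.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Classical mitigation: oversample the minority class with SMOTE. In the
# paper's protocol, minority samples drawn from the hybrid quantum-classical
# generator would take the place of these synthetic points.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
aug = RandomForestClassifier(random_state=0).fit(X_res, y_res)

print("baseline F1:", f1_score(y_te, base.predict(X_te)))
print("SMOTE F1:   ", f1_score(y_te, aug.predict(X_te)))
```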

Real World Time Series Benchmark Datasets with Distribution Shifts: Global Crude Oil Price and Volatility

  • paper_url: http://arxiv.org/abs/2308.10846
  • repo_url: https://github.com/oilpricebenchmarks/COB
  • paper_authors: Pranay Pasula
  • for: COB datasets are created to address the scarcity of task-labeled time-series benchmarks in the financial domain, specifically for crude oil benchmarks.
  • methods: The paper transforms asset price data into volatility proxies, fits models via expectation-maximization (EM), and generates contextual task labels aligned with real-world events to create the COB datasets.
  • results: Including the task labels universally improves the performance of four continual learning algorithms over multiple forecasting horizons, demonstrating the effectiveness of the COB datasets for handling distribution shifts in real-world data.
    Abstract The scarcity of task-labeled time-series benchmarks in the financial domain hinders progress in continual learning. Addressing this deficit would foster innovation in this area. Therefore, we present COB, Crude Oil Benchmark datasets. COB includes 30 years of asset prices that exhibit significant distribution shifts and optimally generates corresponding task (i.e., regime) labels based on these distribution shifts for the three most important crude oils in the world. Our contributions include creating real-world benchmark datasets by transforming asset price data into volatility proxies, fitting models using expectation-maximization (EM), generating contextual task labels that align with real-world events, and providing these labels as well as the general algorithm to the public. We show that the inclusion of these task labels universally improves performance on four continual learning algorithms, some state-of-the-art, over multiple forecasting horizons. We hope these benchmarks accelerate research in handling distribution shifts in real-world data, especially due to the global importance of the assets considered. We've made the (1) raw price data, (2) task labels generated by our approach, (3) and code for our algorithm available at https://oilpricebenchmarks.github.io.
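A minimal sketch of the labeling recipe the abstract describes — prices to a volatility proxy, then EM-fitted regime labels — using a Gaussian mixture as the EM model. The synthetic price series, rolling window, and two-regime choice are assumptions, not the paper's exact configuration.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a crude oil price series.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(0.0002, 0.02, size=2000)).cumprod() * 70)

# Volatility proxy: rolling standard deviation of log returns.
log_ret = np.log(prices).diff().dropna()
vol_proxy = log_ret.rolling(21).std().dropna()  # ~1 trading month window

# Fit a two-regime mixture via EM and assign a task (regime) label per day.
gm = GaussianMixture(n_components=2, random_state=0)
labels = gm.fit_predict(vol_proxy.to_numpy().reshape(-1, 1))
print(pd.Series(labels).value_counts())
```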

Neural Networks Optimizations Against Concept and Data Drift in Malware Detection

  • paper_url: http://arxiv.org/abs/2308.10821
  • repo_url: None
  • paper_authors: William Maillet, Benjamin Marais
  • for: This work aims to harden a baseline neural network against drift, a problem caused by the constant evolution of malware.
  • methods: The authors propose a model-agnostic protocol combining feature reduction, training with the most recent validation set possible, and a loss function named Drift-Resilient Binary Cross-Entropy to improve robustness against drift.
  • results: Trained on the EMBER dataset (2018) and evaluated on malicious files collected between 2020 and 2023, the improved model detects 15.2% more malware than the baseline.
    Abstract Despite the promising results of machine learning models in malware detection, they face the problem of concept drift due to malware's constant evolution. This leads to a decline in performance over time, as the data distribution of the new files differs from the training one, requiring regular model updates. In this work, we propose a model-agnostic protocol to improve a baseline neural network to handle the drift problem. We show the importance of feature reduction and training with the most recent validation set possible, and propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement to the classical Binary Cross-Entropy more effective against drift. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of recent malicious files, collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
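The paper's Drift-Resilient Binary Cross-Entropy is not given in closed form here, so the sketch below shows one plausible drift-aware variant as a clearly hypothetical stand-in: a BCE whose per-sample weight decays with sample age, so recent files dominate the gradient.

```python
import torch
import torch.nn.functional as F

def time_weighted_bce(logits, targets, sample_age, half_life=180.0):
    """Hypothetical stand-in for a drift-resilient BCE: plain binary
    cross-entropy whose per-sample weight decays with sample age (in days),
    so recent files dominate the gradient. This is an illustrative
    assumption, not the paper's published formulation."""
    weights = 0.5 ** (sample_age / half_life)
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * loss).sum() / weights.sum()

# Usage on a dummy batch.
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
age = torch.rand(8) * 365  # days since each sample was collected
print(time_weighted_bce(logits, targets, age))
```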

eess.IV - 2023-08-22

Multitemporal analysis in Google Earth Engine for detecting urban changes using optical data and machine learning algorithms

  • paper_url: http://arxiv.org/abs/2308.11468
  • repo_url: None
  • paper_authors: Mariapia Rita Iandolo, Francesca Razzano, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: This work performs a multitemporal analysis on the Google Earth Engine (GEE) platform to detect changes in urban areas using optical data and specific machine learning (ML) algorithms.
  • methods: Optical data are processed on the GEE platform, where selected ML algorithms carry out classification and change detection.
  • results: Results demonstrate the validity of the proposed method, correctly identifying changed and unchanged urban areas over the selected period, and confirm GEE as an efficient cloud-based solution for managing large quantities of satellite data.
    Abstract The aim of this work is to perform a multitemporal analysis using the Google Earth Engine (GEE) platform for the detection of changes in urban areas using optical data and specific machine learning (ML) algorithms. As a case study, Cairo City has been identified, in Egypt country, as one of the five most populous megacities of the last decade in the world. Classification and change detection analysis of the region of interest (ROI) have been carried out from July 2013 to July 2021. Results demonstrate the validity of the proposed method in identifying changed and unchanged urban areas over the selected period. Furthermore, this work aims to evidence the growing significance of GEE as an efficient cloud-based solution for managing large quantities of satellite data.
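A hedged sketch of how such a GEE workflow might look, using post-classification comparison between two epochs. The ROI, Landsat-8 collection, band choice, and the two training points are placeholder assumptions, not the paper's configuration.

```python
import ee
ee.Initialize()

# Rough Cairo bounding box; the paper's exact ROI is not reproduced here.
roi = ee.Geometry.Rectangle([31.1, 29.9, 31.5, 30.2])

# Placeholder training points; a real study would use many labeled samples.
training = ee.FeatureCollection([
    ee.Feature(ee.Geometry.Point([31.24, 30.04]), {"class": 1}),  # urban
    ee.Feature(ee.Geometry.Point([31.45, 30.18]), {"class": 0}),  # non-urban
])

def composite(start, end):
    # Median Landsat-8 surface-reflectance composite over the ROI.
    return (ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
            .filterBounds(roi)
            .filterDate(start, end)
            .median()
            .clip(roi))

bands = ["SR_B2", "SR_B3", "SR_B4", "SR_B5"]
img_2013 = composite("2013-07-01", "2013-09-30").select(bands)
img_2021 = composite("2021-07-01", "2021-09-30").select(bands)

# Train a random forest on the 2013 composite and classify both epochs.
samples = img_2013.sampleRegions(collection=training, properties=["class"], scale=30)
clf = ee.Classifier.smileRandomForest(100).train(samples, "class", bands)
change = img_2021.classify(clf).neq(img_2013.classify(clf))  # 1 where class changed
```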

Integration of Sentinel-1 and Sentinel-2 data for Earth surface classification using Machine Learning algorithms implemented on Google Earth Engine

  • paper_url: http://arxiv.org/abs/2308.11340
  • repo_url: None
  • paper_authors: Francesca Razzano, Mariapia Rita Iandolo, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: This study addresses Earth surface classification by integrating Sentinel-1 (S-1) and Sentinel-2 (S-2) data through supervised machine learning algorithms on the Google Earth Engine platform.
  • methods: Supervised ML algorithms implemented on GEE process and classify the combined S-1/S-2 data for a region of interest.
  • results: Radar and optical remote sensing prove complementary, improving land cover classification accuracy, and the study further demonstrates GEE's emerging role as an effective cloud-based tool for handling large amounts of satellite data.
    Abstract In this study, Synthetic Aperture Radar (SAR) and optical data are both considered for Earth surface classification. Specifically, the integration of Sentinel-1 (S-1) and Sentinel-2 (S-2) data is carried out through supervised Machine Learning (ML) algorithms implemented on the Google Earth Engine (GEE) platform for the classification of a particular region of interest. Achieved results demonstrate how in this case radar and optical remote detection provide complementary information, benefiting surface cover classification and generally leading to increased mapping accuracy. In addition, this paper works in the direction of proving the emerging role of GEE as an effective cloud-based tool for handling large amounts of satellite data.
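A short companion sketch of the S-1/S-2 integration step: median composites of radar and optical bands stacked into a single feature image, after which training can proceed as in the previous sketch. Collection dates, region, and band selections are illustrative assumptions.

```python
import ee
ee.Initialize()

# Placeholder region and season; the paper's ROI is not reproduced here.
roi = ee.Geometry.Point([14.25, 40.85]).buffer(10000)

s2 = (ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED")
      .filterBounds(roi)
      .filterDate("2022-06-01", "2022-09-01")
      .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 10))
      .median()
      .select(["B2", "B3", "B4", "B8"]))  # optical: blue/green/red/NIR

s1 = (ee.ImageCollection("COPERNICUS/S1_GRD")
      .filterBounds(roi)
      .filterDate("2022-06-01", "2022-09-01")
      .filter(ee.Filter.eq("instrumentMode", "IW"))
      .median()
      .select(["VV", "VH"]))              # radar backscatter

stack = s2.addBands(s1)  # complementary radar + optical features in one image
print(stack.bandNames().getInfo())
```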

PCMC-T1: Free-breathing myocardial T1 mapping with Physically-Constrained Motion Correction

  • paper_url: http://arxiv.org/abs/2308.11281
  • repo_url: None
  • paper_authors: Eyal Hanania, Ilya Volovik, Lilach Barkat, Israel Cohen, Moti Freiman
  • for: T1 mapping as a diagnostic tool for diffuse myocardial diseases.
  • methods: A deep learning model performs motion correction for free-breathing T1 mapping, with the physical signal decay model incorporated into the network architecture.
  • results: Compared with baseline methods, PCMC-T1 achieves superior model-fitting quality (R2: 0.955) and the highest clinical impact (clinical score: 3.93).
    Abstract T1 mapping is a quantitative magnetic resonance imaging (qMRI) technique that has emerged as a valuable tool in the diagnosis of diffuse myocardial diseases. However, prevailing approaches have relied heavily on breath-hold sequences to eliminate respiratory motion artifacts. This limitation hinders accessibility and effectiveness for patients who cannot tolerate breath-holding. Image registration can be used to enable free-breathing T1 mapping. Yet, inherent intensity differences between the different time points make the registration task challenging. We introduce PCMC-T1, a physically-constrained deep-learning model for motion correction in free-breathing T1 mapping. We incorporate the signal decay model into the network architecture to encourage physically-plausible deformations along the longitudinal relaxation axis. We compared PCMC-T1 to baseline deep-learning-based image registration approaches using a 5-fold experimental setup on a publicly available dataset of 210 patients. PCMC-T1 demonstrated superior model fitting quality (R2: 0.955) and achieved the highest clinical impact (clinical score: 3.93) compared to baseline methods (0.941, 0.946 and 3.34, 3.62 respectively). Anatomical alignment results were comparable (Dice score: 0.9835 vs. 0.984, 0.988). Our code and trained models are available at https://github.com/eyalhana/PCMC-T1.
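The physical constraint at the heart of PCMC-T1 is the longitudinal relaxation signal model. Below is a minimal sketch of the standard three-parameter decay curve as a differentiable function, with a toy gradient-descent fit; how the network wires this constraint in is specific to the authors' released code.

```python
import torch

def t1_signal(t, A, B, T1):
    # Three-parameter relaxation model S(t) = A - B*exp(-t/T1*), the standard
    # MOLLI-style decay curve; differentiable, so it can serve as a physical
    # constraint inside a network.
    return A - B * torch.exp(-t / T1.clamp(min=1e-3))

# Toy per-pixel fit by gradient descent (inversion times in seconds).
t = torch.tensor([0.1, 0.2, 0.4, 0.8, 1.6, 3.2])
target = 1.0 - 1.8 * torch.exp(-t / 1.2)  # synthetic ground truth

params = torch.tensor([0.5, 1.0, 0.8], requires_grad=True)  # A, B, T1* init
opt = torch.optim.Adam([params], lr=0.01)
for _ in range(2000):
    opt.zero_grad()
    loss = ((t1_signal(t, *params) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(params.detach())  # should move toward (1.0, 1.8, 1.2)
```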

Validation of apparent intra- and extra-myocellular lipid content indicator using spiral spectroscopic imaging at 3T

  • paper_url: http://arxiv.org/abs/2308.11668
  • repo_url: None
  • paper_authors: Antoine Naëgel, Magalie Viallon, Jabrane Karkouri, Thomas Troalen, Pierre Croisille, Hélène Ratiney
  • for: This study proposes a fast and simple spiral MRSI method for mapping the apparent intramyocellular (IMCL) and extramyocellular (EMCL) lipid content in muscle, a challenging task, and compares this indicator to classical quantification results.
  • methods: The method uses spiral MRSI to separate and assess the IMCL and EMCL signals in muscles of interest.
  • results: Spiral MRSI maps the apparent IMCL and EMCL content quickly and simply, in agreement with classical quantification results.
    Abstract This work presents a fast and simple method based on spiral MRSI for mapping the IMCL and EMCL apparent content, which is a challenging task, and compares this indicator to classical quantification results in muscles of interest.

Phase Aberration Correction: A Deep Learning-Based Aberration to Aberration Approach

  • paper_url: http://arxiv.org/abs/2308.11149
  • repo_url: None
  • paper_authors: Mostafa Sharifzadeh, Sobhan Goudarzi, An Tang, Habib Benali, Hassan Rivaz
  • for: correction of phase aberration in ultrasound imaging
  • methods: deep learning-based approach that does not require ground truth, using an adaptive mixed loss function with both B-mode and RF data
  • results: enhanced performance and more efficient convergence compared to using a conventional loss function such as mean square error, as demonstrated on a publicly released dataset of 161,701 single plane-wave images (RF data)
    Abstract One of the primary sources of suboptimal image quality in ultrasound imaging is phase aberration. It is caused by spatial changes in sound speed over a heterogeneous medium, which disturbs the transmitted waves and prevents coherent summation of echo signals. Obtaining non-aberrated ground truths in real-world scenarios can be extremely challenging, if not impossible. This challenge hinders training of deep learning-based techniques' performance due to the presence of domain shift between simulated and experimental data. Here, for the first time, we propose a deep learning-based method that does not require ground truth to correct the phase aberration problem, and as such, can be directly trained on real data. We train a network wherein both the input and target output are randomly aberrated radio frequency (RF) data. Moreover, we demonstrate that a conventional loss function such as mean square error is inadequate for training such a network to achieve optimal performance. Instead, we propose an adaptive mixed loss function that employs both B-mode and RF data, resulting in more efficient convergence and enhanced performance. Finally, we publicly release our dataset, including 161,701 single plane-wave images (RF data). This dataset serves to mitigate the data scarcity problem in the development of deep learning-based techniques for phase aberration correction.
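To illustrate the flavor of a mixed RF/B-mode objective, here is a sketch that combines RF-domain MSE with MSE on log-compressed envelopes computed via an FFT-based Hilbert transform. The fixed 0.5 weighting is an assumption; the paper's loss is adaptive.

```python
import torch

def bmode(rf, eps=1e-8):
    """Log-compressed envelope (B-mode) of RF lines via an FFT-based
    Hilbert transform along the fast-time axis."""
    n = rf.shape[-1]
    spec = torch.fft.fft(rf, dim=-1)
    h = torch.zeros(n, device=rf.device)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    envelope = torch.fft.ifft(spec * h, dim=-1).abs()
    return 20.0 * torch.log10(envelope + eps)

def mixed_loss(pred_rf, target_rf, alpha=0.5):
    """Illustrative mixed loss: RF-domain MSE plus B-mode-domain MSE.
    The paper's adaptive weighting schedule is not reproduced; the fixed
    `alpha` here is an assumption."""
    rf_term = torch.mean((pred_rf - target_rf) ** 2)
    bm_term = torch.mean((bmode(pred_rf) - bmode(target_rf)) ** 2)
    return alpha * rf_term + (1 - alpha) * bm_term

pred = torch.randn(2, 64, 1024, requires_grad=True)  # batch, lines, samples
target = torch.randn(2, 64, 1024)
loss = mixed_loss(pred, target)
loss.backward()
```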

Hey That’s Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs

  • paper_url: http://arxiv.org/abs/2308.11123
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: Protecting the intellectual property of content shared online.
  • methods: Imperceptible watermarking is used to generate watermarked images, and statistical tests determine whether a model was trained on watermarked data.
  • results: Statistical tests can determine whether a model was trained on watermarked data, and can recover correlations between a watermark and specific features of the training data, showing the system can protect the intellectual property of online content.
    Abstract Generative models have seen an explosion in popularity with the release of huge generative Diffusion models like Midjourney and Stable Diffusion to the public. Because of this new ease of access, questions surrounding the automated collection of data and issues regarding content ownership have started to build. In this paper we present new work which aims to provide ways of protecting content when shared with the public. We show that a generative Diffusion model trained on data that has been imperceptibly watermarked will generate new images with these watermarks present. We further show that if a given watermark is correlated with a certain feature of the training data, the generated images will also have this correlation. Using statistical tests we show that we are able to determine whether a model has been trained on marked data, and what data was marked. As a result our system offers a solution to protect intellectual property when sharing content online.
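The detection step reduces to a hypothesis test on watermark decodes. The sketch below uses a placeholder decoder and a one-sided binomial test; the real watermarking scheme, detector, and chance rate are assumptions here.

```python
import numpy as np
from scipy.stats import binomtest

# Sketch under stated assumptions: `decode_watermark` is a hypothetical
# detector returning True when the owner's imperceptible mark is found in an
# image; the paper's actual watermarking scheme differs.
rng = np.random.default_rng(0)

def decode_watermark(image):
    # Placeholder: real detectors recover the mark from pixel statistics.
    return bool(image.mean() > 0.5)

# Suppose the mark is detected in most of 100 images sampled from a suspect
# diffusion model, against a 50% chance rate on unmarked data.
generated = [rng.random((64, 64)) + 0.05 * (i < 74) for i in range(100)]
detections = sum(decode_watermark(img) for img in generated)
result = binomtest(detections, n=100, p=0.5, alternative="greater")
print(detections, result.pvalue)  # small p-value -> likely trained on marked data
```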

Switched auxiliary loss for robust training of transformer models for histopathological image segmentation

  • paper_url: http://arxiv.org/abs/2308.10994
  • repo_url: None
  • paper_authors: Mustaffa Hussain, Saharsh Barve
  • for: This work aims to improve transformer performance on dense prediction tasks in medical image analysis and investigates a shifted auxiliary loss to overcome the diminishing gradient problem.
  • methods: Using the HuBMAP + HPA - Hacking the Human Body competition dataset, the authors train a multi-organ functional tissue unit (FTU) segmentation model with the proposed shifted auxiliary loss.
  • results: The model achieves a Dice score of 0.793 on the public dataset and 0.778 on the private dataset, a 1% improvement from the proposed method, supporting the use of transformer models for dense prediction in medical image analysis.
    Abstract Functional tissue Units (FTUs) are cell population neighborhoods local to a particular organ performing its main function. The FTUs provide crucial information to the pathologist in understanding the disease affecting a particular organ by providing information at the cellular level. In our research, we have developed a model to segment multi-organ FTUs across 5 organs namely: the kidney, large intestine, lung, prostate and spleen by utilizing the HuBMAP + HPA - Hacking the Human Body competition dataset. We propose adding shifted auxiliary loss for training models like the transformers to overcome the diminishing gradient problem which poses a challenge towards optimal training of deep models. Overall, our model achieved a dice score of 0.793 on the public dataset and 0.778 on the private dataset and shows a 1% improvement with the use of the proposed method. The findings also bolster the use of transformers models for dense prediction tasks in the field of medical image analysis. The study assists in understanding the relationships between cell and tissue organization thereby providing a useful medium to look at the impact of cellular functions on human health.
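The exact switching schedule of the auxiliary loss is not spelled out in this abstract, so the sketch below shows a generic deep-supervision setup as an assumption: an auxiliary head on intermediate features contributes a weighted Dice term that is switched off partway through training.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a switched auxiliary loss. The tiny network,
# weighting, and switch point are illustrative assumptions, not the
# paper's published configuration.
def dice_loss(p, t, eps=1.0):
    return 1 - (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)

class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 8, 3, padding=1)
        self.aux_head = nn.Conv2d(8, 1, 1)   # auxiliary prediction, mid-network
        self.dec = nn.Conv2d(8, 1, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        return torch.sigmoid(self.dec(h)), torch.sigmoid(self.aux_head(h))

model = TinySegNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(2, 3, 64, 64)
y = torch.randint(0, 2, (2, 1, 64, 64)).float()

for epoch in range(20):
    aux_w = 0.4 if epoch < 10 else 0.0  # switch the auxiliary term off late on
    main, aux = model(x)
    loss = dice_loss(main, y) + aux_w * dice_loss(aux, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```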

Debiasing Counterfactuals In the Presence of Spurious Correlations

  • paper_url: http://arxiv.org/abs/2308.10984
  • repo_url: None
  • paper_authors: Amar Kumar, Nima Fathi, Raghav Mehta, Brennan Nichyporuk, Jean-Pierre R. Falet, Sotirios Tsaftaris, Tal Arbel
  • for: This paper aims to improve the performance of deep learning models on medical imaging classification tasks by addressing the issue of spurious correlations in the training data.
  • methods: The proposed method integrates two techniques: (1) popular debiasing classifiers such as distributionally robust optimization (DRO), and (2) counterfactual image generation.
  • results: The method learns generalizable markers across the population while ignoring spurious correlations. A novel metric, the Spurious Correlation Latching Score (SCLS), quantifies the extent of classifier reliance on spurious correlations, and comprehensive experiments on two public datasets with simulated and real visual artifacts show the method successfully focuses on the underlying disease pathology.
    Abstract Deep learning models can perform well in complex medical imaging classification tasks, even when basing their conclusions on spurious correlations (i.e. confounders), should they be prevalent in the training dataset, rather than on the causal image markers of interest. This would thereby limit their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g. distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with the simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
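Of the two ingredients, the debiasing classifier is the more standard: below is a minimal group-DRO sketch in the style of Sagawa et al. (2020), which the abstract's mention of distributionally robust optimization suggests. Groups, model, and step size `eta` are illustrative; the counterfactual-generation half is not shown.

```python
import torch
import torch.nn as nn

# Minimal group-DRO sketch: maintain adversarial group weights q and
# up-weight the worst-performing group each step. Grouping by an artifact
# flag is an illustrative assumption.
n_groups, eta = 2, 0.1
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
q = torch.ones(n_groups) / n_groups          # adversarial group weights
bce = nn.BCEWithLogitsLoss(reduction="none")

x = torch.randn(64, 16)
y = torch.randint(0, 2, (64, 1)).float()
g = torch.randint(0, n_groups, (64,))        # group id, e.g. artifact present/absent

for step in range(100):
    losses = bce(model(x), y).squeeze(1)
    group_loss = torch.stack([losses[g == k].mean() for k in range(n_groups)])
    q = q * torch.exp(eta * group_loss.detach())  # up-weight the worst group
    q = q / q.sum()
    opt.zero_grad()
    (q * group_loss).sum().backward()
    opt.step()
```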

BundleSeg: A versatile, reliable and reproducible approach to white matter bundle segmentation

  • paper_url: http://arxiv.org/abs/2308.10958
  • repo_url: None
  • paper_authors: Etienne St-Onge, Kurt G Schilling, Francois Rheault
  • for: This paper presents a reliable, reproducible, and fast method for extracting white matter pathways.
  • methods: The proposed method combines an iterative registration procedure with a recently developed precise streamline search algorithm, enabling efficient segmentation of streamlines without tractogram clustering or simplifying assumptions.
  • results: BundleSeg achieves improved repeatability and reproducibility over state-of-the-art segmentation methods, with significant speed improvements; the enhanced precision and reduced variability in extracting white matter connections offer a valuable tool for neuroinformatic studies, increasing the sensitivity and specificity of tractography-based research.
    Abstract This work presents BundleSeg, a reliable, reproducible, and fast method for extracting white matter pathways. The proposed method combines an iterative registration procedure with a recently developed precise streamline search algorithm that enables efficient segmentation of streamlines without the need for tractogram clustering or simplifying assumptions. We show that BundleSeg achieves improved repeatability and reproducibility than state-of-the-art segmentation methods, with significant speed improvements. The enhanced precision and reduced variability in extracting white matter connections offer a valuable tool for neuroinformatic studies, increasing the sensitivity and specificity of tractography-based studies of white matter pathways.

Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10820
  • repo_url: None
  • paper_authors: Miaoyu Li, Ying Fu, Ji Liu, Yulun Zhang
  • for: High-accuracy hyperspectral image (HSI) reconstruction.
  • methods: A Pixel Adaptive Deep Unfolding Transformer (PADUT) comprising a data module with a pixel-adaptive descent step and a prior module built on a Non-local Spectral Transformer (NST), with stage interaction improved via the Fast Fourier Transform (FFT).
  • results: Outperforms state-of-the-art HSI reconstruction methods on both simulated and real scenes.
    Abstract Hyperspectral Image (HSI) reconstruction has made gratifying progress with the deep unfolding framework by formulating the problem into a data module and a prior module. Nevertheless, existing methods still face the problem of insufficient matching with HSI data. The issues lie in three aspects: 1) fixed gradient descent step in the data module while the degradation of HSI is agnostic in the pixel-level. 2) inadequate prior module for 3D HSI cube. 3) stage interaction ignoring the differences in features at different stages. To address these issues, in this work, we propose a Pixel Adaptive Deep Unfolding Transformer (PADUT) for HSI reconstruction. In the data module, a pixel adaptive descent step is employed to focus on pixel-level agnostic degradation. In the prior module, we introduce the Non-local Spectral Transformer (NST) to emphasize the 3D characteristics of HSI for recovering. Moreover, inspired by the diverse expression of features in different stages and depths, the stage interaction is improved by the Fast Fourier Transform (FFT). Experimental results on both simulated and real scenes exhibit the superior performance of our method compared to state-of-the-art HSI reconstruction methods. The code is released at: https://github.com/MyuLi/PADUT.
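A sketch of the data module's pixel-adaptive descent step, x <- x - alpha(x) * A^T(Ax - y), where a small CNN predicts a positive per-pixel step size alpha instead of a fixed scalar. The degradation operator A is modeled as a plain convolution for illustration; PADUT's actual operator and NST prior are in the paper.

```python
import torch
import torch.nn as nn

class PixelAdaptiveDataModule(nn.Module):
    """One unfolding stage's gradient step with a per-pixel step size.
    A is a stand-in convolution; alpha is predicted by a small CNN."""
    def __init__(self, channels=28):
        super().__init__()
        self.A = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.step_net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # alpha > 0, per pixel
        )

    def forward(self, x, y):
        residual = self.A(x) - y
        grad = torch.nn.functional.conv_transpose2d(
            residual, self.A.weight, padding=1)  # A^T applied to the residual
        alpha = self.step_net(x)                 # (B, 1, H, W) adaptive step
        return x - alpha * grad

x = torch.rand(1, 28, 64, 64)
y = torch.rand(1, 28, 64, 64)
x_next = PixelAdaptiveDataModule()(x, y)
print(x_next.shape)
```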