cs.LG - 2023-09-17

Mitigating Over-Smoothing and Over-Squashing using Augmentations of Forman-Ricci Curvature

  • paper_url: http://arxiv.org/abs/2309.09384
  • repo_url: None
  • paper_authors: Lukas Fesser, Melanie Weber
  • For: This paper proposes a rewiring technique based on Augmented Forman-Ricci curvature (AFRC) to mitigate over-smoothing and over-squashing effects in message-passing Graph Neural Networks (GNNs).
  • Methods: The proposed technique uses AFRC, a scalable curvature notion that can be computed in linear time, to characterize over-smoothing and over-squashing effects in GNNs.
  • Results: The proposed approach achieves state-of-the-art performance while significantly reducing the computational cost in comparison with other methods. The paper also provides effective heuristics for hyperparameters in curvature-based rewiring, which avoid expensive hyperparameter searches.
    Abstract While Graph Neural Networks (GNNs) have been successfully leveraged for learning on graph-structured data across domains, several potential pitfalls have been described recently. Those include the inability to accurately leverage information encoded in long-range connections (over-squashing), as well as difficulties distinguishing the learned representations of nearby nodes with growing network depth (over-smoothing). An effective way to characterize both effects is discrete curvature: Long-range connections that underlie over-squashing effects have low curvature, whereas edges that contribute to over-smoothing have high curvature. This observation has given rise to rewiring techniques, which add or remove edges to mitigate over-smoothing and over-squashing. Several rewiring approaches utilizing graph characteristics, such as curvature or the spectrum of the graph Laplacian, have been proposed. However, existing methods, especially those based on curvature, often require expensive subroutines and careful hyperparameter tuning, which limits their applicability to large-scale graphs. Here we propose a rewiring technique based on Augmented Forman-Ricci curvature (AFRC), a scalable curvature notation, which can be computed in linear time. We prove that AFRC effectively characterizes over-smoothing and over-squashing effects in message-passing GNNs. We complement our theoretical results with experiments, which demonstrate that the proposed approach achieves state-of-the-art performance while significantly reducing the computational cost in comparison with other methods. Utilizing fundamental properties of discrete curvature, we propose effective heuristics for hyperparameters in curvature-based rewiring, which avoids expensive hyperparameter searches, further improving the scalability of the proposed approach.
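To make the rewiring idea concrete, the sketch below computes a triangle-augmented Forman curvature for each edge (the commonly used form AF_3(u,v) = 4 - deg(u) - deg(v) + 3·#triangles) and performs one generic curvature-based rewiring pass. The exact augmentation, thresholds, and edge-selection rule used by the paper are not reproduced here; the values shown are illustrative assumptions.

```python
# Sketch (not the authors' code): triangle-augmented Forman curvature and a
# generic curvature-based rewiring step. Thresholds and the choice of which
# edge to add are illustrative assumptions, not taken from the paper.
import networkx as nx

def augmented_forman_curvature(G, u, v):
    """AF_3 curvature of edge (u, v): 4 - deg(u) - deg(v) + 3 * (#triangles on the edge)."""
    triangles = len(set(G.neighbors(u)) & set(G.neighbors(v)))
    return 4 - G.degree(u) - G.degree(v) + 3 * triangles

def rewire_once(G, add_threshold=-4, remove_threshold=6):
    """One rewiring pass: support the most negatively curved edge by adding a
    shortcut between its endpoints' neighbourhoods, and drop the most positively
    curved edge (over-squashing vs. over-smoothing)."""
    curv = {e: augmented_forman_curvature(G, *e) for e in G.edges()}
    (u, v), c_min = min(curv.items(), key=lambda kv: kv[1])
    if c_min < add_threshold:
        candidates = [(a, b) for a in G.neighbors(u) for b in G.neighbors(v)
                      if a != b and not G.has_edge(a, b)]
        if candidates:
            G.add_edge(*candidates[0])
    (p, q), c_max = max(curv.items(), key=lambda kv: kv[1])
    if c_max > remove_threshold:
        G.remove_edge(p, q)
    return G

G = nx.karate_club_graph()
G = rewire_once(G)
print(G.number_of_edges())
```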

Federated Learning in Temporal Heterogeneity

  • paper_url: http://arxiv.org/abs/2309.09381
  • repo_url: None
  • paper_authors: Junghwan Lee
  • for: This work explores federated learning under temporal heterogeneity across clients.
  • methods: Based on empirical observations, methods are proposed to mitigate temporal heterogeneity for efficient federated learning.
  • results: The global model obtained by FedAvg converges faster when trained with fixed-length sequences than with varying-length sequences.
    Abstract In this work, we explored federated learning in temporal heterogeneity across clients. We observed that global model obtained by \texttt{FedAvg} trained with fixed-length sequences shows faster convergence than varying-length sequences. We proposed methods to mitigate temporal heterogeneity for efficient federated learning based on the empirical observation.
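As an illustration of the empirical setup described above, the sketch below runs FedAvg with each client's local sequences truncated or zero-padded to a fixed length before training. The linear local model, clients, and lengths are placeholders rather than the paper's configuration.

```python
# Minimal FedAvg sketch with fixed-length sequence preprocessing (illustrative;
# the clients, model, and sequence handling here are assumptions, not the paper's setup).
import numpy as np

def to_fixed_length(seq, length):
    """Truncate or zero-pad a 1-D sequence to a fixed length."""
    seq = np.asarray(seq, dtype=float)[:length]
    return np.pad(seq, (0, length - len(seq)))

def local_update(weights, sequences, labels, lr=0.1, fixed_len=16):
    """One epoch of local linear-model training on fixed-length sequences."""
    w = weights.copy()
    for seq, y in zip(sequences, labels):
        x = to_fixed_length(seq, fixed_len)
        grad = (x @ w - y) * x          # squared-error gradient
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """clients: list of (sequences, labels, n_samples); returns the weighted average."""
    updates, sizes = [], []
    for sequences, labels, n in clients:
        updates.append(local_update(global_w, sequences, labels))
        sizes.append(n)
    sizes = np.array(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())

rng = np.random.default_rng(0)
clients = [([rng.normal(size=rng.integers(5, 30)) for _ in range(20)],
            rng.normal(size=20), 20) for _ in range(3)]
w = np.zeros(16)
for _ in range(5):
    w = fedavg_round(w, clients)
print(w[:4])
```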

Fully Convolutional Generative Machine Learning Method for Accelerating Non-Equilibrium Greens Function Simulations

  • paper_url: http://arxiv.org/abs/2309.09374
  • repo_url: None
  • paper_authors: Preslav Aleksandrov, Ali Rezaei, Nikolas Xeni, Tapas Dutta, Asen Asenov, Vihar Georgiev
  • for: This paper describes a novel simulation approach that combines machine learning with device modelling. The device simulations are based on the quantum mechanical non-equilibrium Greens function (NEGF) method, extended with a convolutional generative network; the approach is named ML-NEGF and is implemented in the in-house simulator NESS (nano-electronics simulations software).
  • methods: A machine learning model, a convolutional generative network, learns the underlying physics of nano-sheet transistor behaviour.
  • results: The ML-NEGF method converges faster than the standard NEGF approach at the same accuracy, achieving an average convergence acceleration of 60% and substantially reducing computational time.
    Abstract This work describes a novel simulation approach that combines machine learning and device modelling simulations. The device simulations are based on the quantum mechanical non-equilibrium Greens function (NEGF) approach and the machine learning method is an extension to a convolutional generative network. We have named our new simulation approach ML-NEGF and we have implemented it in our in-house simulator called NESS (nano-electronics simulations software). The reported results demonstrate the improved convergence speed of the ML-NEGF method in comparison to the standard NEGF approach. The trained ML model effectively learns the underlying physics of nano-sheet transistor behaviour, resulting in faster convergence of the coupled Poisson-NEGF simulations. Quantitatively, our ML- NEGF approach achieves an average convergence acceleration of 60%, substantially reducing the computational time while maintaining the same accuracy.

A Survey on Congestion Control and Scheduling for Multipath TCP: Machine Learning vs Classical Approaches

  • paper_url: http://arxiv.org/abs/2309.09372
  • repo_url: None
  • paper_authors: Maisha Maliha, Golnaz Habibi, Mohammed Atiquzzaman
  • for: This survey addresses key challenges of Multipath TCP (MPTCP), including congestion in the subflows and communication latency.
  • methods: Techniques are reviewed under two main approaches: non-data-driven (classical) and data-driven (machine learning) methods.
  • results: The two approaches are compared and their strengths and weaknesses are highlighted; details are also provided on simulating MPTCP and implementing it in real environments.
    Abstract Multipath TCP (MPTCP) has been widely used as an efficient way for communication in many applications. Data centers, smartphones, and network operators use MPTCP to balance the traffic in a network efficiently. MPTCP is an extension of TCP (Transmission Control Protocol), which provides multiple paths, leading to higher throughput and low latency. Although MPTCP has shown better performance than TCP in many applications, it has its own challenges. The network can become congested due to heavy traffic in the multiple paths (subflows) if the subflow rates are not determined correctly. Moreover, communication latency can occur if the packets are not scheduled correctly between the subflows. This paper reviews techniques to solve the above-mentioned problems based on two main approaches; non data-driven (classical) and data-driven (Machine Learning) approaches. This paper compares these two approaches and highlights their strengths and weaknesses with a view to motivating future researchers in this exciting area of machine learning for communications. This paper also provides details on the simulation of MPTCP and its implementations in real environments.

An Automatic Tuning MPC with Application to Ecological Cruise Control

  • paper_url: http://arxiv.org/abs/2309.09358
  • repo_url: None
  • paper_authors: Mohammad Abtahi, Mahdis Rabbani, Shima Nazari
  • for: This paper studies online automatic tuning of a model predictive control (MPC) controller, with an example application to an ecological cruise control system that saves fuel by using a preview of road grade.
  • methods: The global fuel consumption minimization problem is solved offline with dynamic programming, the corresponding MPC cost function is recovered by inverse optimization, and a neural network fitted to these offline results generates the MPC cost-function weight during online operation, adapting to the previewed road grade.
  • results: Simulation results show that the proposed approach effectively optimizes the fuel consumption of the ecological cruise control system under different road geometries.
    Abstract Model predictive control (MPC) is a powerful tool for planning and controlling dynamical systems due to its capacity for handling constraints and taking advantage of preview information. Nevertheless, MPC performance is highly dependent on the choice of cost function tuning parameters. In this work, we demonstrate an approach for online automatic tuning of an MPC controller with an example application to an ecological cruise control system that saves fuel by using a preview of road grade. We solve the global fuel consumption minimization problem offline using dynamic programming and find the corresponding MPC cost function by solving the inverse optimization problem. A neural network fitted to these offline results is used to generate the desired MPC cost function weight during online operation. The effectiveness of the proposed approach is verified in simulation for different road geometries.
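The sketch below illustrates only the last stage described in the abstract: a small regressor fitted to (road-grade preview, optimal cost weight) pairs and queried online. The offline dynamic-programming and inverse-optimization results are mocked with a hypothetical relation, and the feature layout and network size are assumptions.

```python
# Sketch of the online weight-generation stage (assumed interface): a small
# regressor maps road-grade preview features to an MPC cost-function weight.
# The offline dynamic-programming / inverse-optimization results are mocked here.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Mock offline results: each row = road-grade preview (next 10 samples),
# target = cost weight recovered by inverse optimization (placeholder relation).
previews = rng.uniform(-0.06, 0.06, size=(500, 10))          # grades in rad
weights = 1.0 + 5.0 * np.abs(previews.mean(axis=1))          # hypothetical mapping

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(previews, weights)

# Online: at each control step, feed the current grade preview to get the weight.
current_preview = rng.uniform(-0.06, 0.06, size=(1, 10))
mpc_weight = float(model.predict(current_preview)[0])
print(f"cost-function weight for this step: {mpc_weight:.3f}")
```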

Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties

  • paper_url: http://arxiv.org/abs/2309.09355
  • repo_url: https://github.com/dmamur/elembert
  • paper_authors: Shokirbek Shermukhamedov, Dilorom Mamurjonova, Michael Probst
  • for: This work uses machine learning to predict chemical properties, accelerating drug discovery and materials design.
  • methods: A deep learning model based on a multilayer encoder and decoder architecture is introduced for classification tasks.
  • results: The model achieves high predictive power on diverse input data, including the Matbench and MoleculeNet benchmarks (for example, 96% average accuracy on Tox21, surpassing the previous best result by 10%), and a comprehensive analysis of vector representations of chemical compounds reveals underlying patterns in molecular data.
    Abstract The application of machine learning (ML) techniques in computational chemistry has led to significant advances in predicting molecular properties, accelerating drug discovery, and material design. ML models can extract hidden patterns and relationships from complex and large datasets, allowing for the prediction of various chemical properties with high accuracy. The use of such methods has enabled the discovery of molecules and materials that were previously difficult to identify. This paper introduces a new ML model based on deep learning techniques, such as a multilayer encoder and decoder architecture, for classification tasks. We demonstrate the opportunities offered by our approach by applying it to various types of input data, including organic and inorganic compounds. In particular, we developed and tested the model using the Matbench and Moleculenet benchmarks, which include crystal properties and drug design-related benchmarks. We also conduct a comprehensive analysis of vector representations of chemical compounds, shedding light on the underlying patterns in molecular data. The models used in this work exhibit a high degree of predictive power, underscoring the progress that can be made with refined machine learning when applied to molecular and material datasets. For instance, on the Tox21 dataset, we achieved an average accuracy of 96%, surpassing the previous best result by 10%. Our code is publicly available at https://github.com/dmamur/elembert.

Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows

  • paper_url: http://arxiv.org/abs/2309.09337
  • repo_url: https://github.com/astroai-cfa/ariel_data_challenge_2023_solution
  • paper_authors: Mayeul Aubin, Carolina Cuesta-Lazaro, Ethan Tregidga, Javier Viaña, Cecilia Garraffo, Iouli E. Gordon, Mercedes López-Morales, Robert J. Hargreaves, Vladimir Yu. Makhnev, Jeremy J. Drake, Douglas P. Finkbeiner, Phillip Cargile
  • for: This work presents new machine learning models for analyzing exoplanet atmosphere spectra, developed for the Ariel Data Challenge 2023.
  • methods: Normalizing Flows are used to predict the posterior probability distribution of atmospheric parameters under different atmospheric assumptions.
  • results: One of the models secured the top position among 293 competitors; an alternative model exhibits higher performance potential despite scoring lower in the challenge. These findings highlight the need to reevaluate the evaluation metric and motivate further exploration of more efficient and accurate approaches for exoplanet atmosphere spectra analysis.
    Abstract Advancements in space telescopes have opened new avenues for gathering vast amounts of data on exoplanet atmosphere spectra. However, accurately extracting chemical and physical properties from these spectra poses significant challenges due to the non-linear nature of the underlying physics. This paper presents novel machine learning models developed by the AstroAI team for the Ariel Data Challenge 2023, where one of the models secured the top position among 293 competitors. Leveraging Normalizing Flows, our models predict the posterior probability distribution of atmospheric parameters under different atmospheric assumptions. Moreover, we introduce an alternative model that exhibits higher performance potential than the winning model, despite scoring lower in the challenge. These findings highlight the need to reevaluate the evaluation metric and prompt further exploration of more efficient and accurate approaches for exoplanet atmosphere spectra analysis. Finally, we present recommendations to enhance the challenge and models, providing valuable insights for future applications on real observational data. These advancements pave the way for more effective and timely analysis of exoplanet atmospheric properties, advancing our understanding of these distant worlds.

Experiential-Informed Data Reconstruction for Fishery Sustainability and Policies in the Azores

  • paper_url: http://arxiv.org/abs/2309.09326
  • repo_url: None
  • paper_authors: Brenda Nogueira, Gui M. Menezes, Nuno Moniz
  • for: The goal is to reconstruct a fishery data set from the Azores' data collection programs (2010-2017), in which metier (fishing gear) information is sparse, to better understand the impact of fishing methods on marine ecosystems.
  • methods: Domain knowledge and machine learning methods are used to retrieve or associate metier-related information with each fish landing.
  • results: A diverse set of modeling approaches empirically validates the feasibility of the reconstruction and provides new insights into the behavior of different fisheries and the impact of metiers over time, which are essential for future fish population assessments, management, and conservation efforts.
    Abstract Fishery analysis is critical in maintaining the long-term sustainability of species and the livelihoods of millions of people who depend on fishing for food and income. The fishing gear, or metier, is a key factor significantly impacting marine habitats, selectively targeting species and fish sizes. Analysis of commercial catches or landings by metier in fishery stock assessment and management is crucial, providing robust estimates of fishing efforts and their impact on marine ecosystems. In this paper, we focus on a unique data set from the Azores' fishing data collection programs between 2010 and 2017, where little information on metiers is available and sparse throughout our timeline. Our main objective is to tackle the task of data set reconstruction, leveraging domain knowledge and machine learning methods to retrieve or associate metier-related information to each fish landing. We empirically validate the feasibility of this task using a diverse set of modeling approaches and demonstrate how it provides new insights into different fisheries' behavior and the impact of metiers over time, which are essential for future fish population assessments, management, and conservation efforts.
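A minimal sketch of the reconstruction task, under the assumption that it can be cast as supervised imputation: a classifier trained on landings with known metiers predicts the metier of unlabeled landings. The feature names, table, and model choice below are illustrative placeholders, not the study's actual pipeline.

```python
# Illustrative sketch: train on landings that have a metier label, then impute
# the metier of unlabeled landings. Features and model choice are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical landings table: species composition, vessel and trip descriptors.
landings = pd.DataFrame({
    "species_code": [1, 3, 1, 2, 3, 1, 2, 2],
    "weight_kg":    [120, 40, 90, 300, 35, 150, 280, 260],
    "vessel_length": [12.0, 8.5, 12.0, 18.0, 8.5, 12.0, 18.0, 18.0],
    "month":        [3, 6, 4, 7, 6, 3, 8, 7],
    "metier":       ["handline", "bottom_longline", "handline", "purse_seine",
                     "bottom_longline", "handline", "purse_seine", None],
})

labeled = landings.dropna(subset=["metier"])
unlabeled = landings[landings["metier"].isna()]
features = ["species_code", "weight_kg", "vessel_length", "month"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, labeled[features], labeled["metier"], cv=2).mean())

clf.fit(labeled[features], labeled["metier"])
print("imputed metier:", clf.predict(unlabeled[features]))
```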

Kinematics-aware Trajectory Generation and Prediction with Latent Stochastic Differential Modeling

  • paper_url: http://arxiv.org/abs/2309.09317
  • repo_url: None
  • paper_authors: Ruochen Jiao, Yixuan Wang, Xiangguo Liu, Chao Huang, Qi Zhu
  • for: This work aims to improve trajectory generation and prediction for autonomous vehicles, so that complex traffic scenarios can be handled during development and operation.
  • methods: Kinematic knowledge is integrated with neural stochastic differential equations (SDEs), and a variational autoencoder based on a novel latent kinematics-aware SDE (LK-SDE) is developed to generate vehicle motions, combining the advantages of model-based and deep-learning-based techniques.
  • results: The method significantly outperforms baseline approaches, producing realistic, physically feasible, and precisely controllable vehicle trajectories that benefit both generation and prediction tasks.
    Abstract Trajectory generation and trajectory prediction are two critical tasks for autonomous vehicles, which generate various trajectories during development and predict the trajectories of surrounding vehicles during operation, respectively. However, despite significant advances in improving their performance, it remains a challenging problem to ensure that the generated/predicted trajectories are realistic, explainable, and physically feasible. Existing model-based methods provide explainable results, but are constrained by predefined model structures, limiting their capabilities to address complex scenarios. Conversely, existing deep learning-based methods have shown great promise in learning various traffic scenarios and improving overall performance, but they often act as opaque black boxes and lack explainability. In this work, we integrate kinematic knowledge with neural stochastic differential equations (SDE) and develop a variational autoencoder based on a novel latent kinematics-aware SDE (LK-SDE) to generate vehicle motions. Our approach combines the advantages of both model-based and deep learning-based techniques. Experimental results demonstrate that our method significantly outperforms baseline approaches in producing realistic, physically-feasible, and precisely-controllable vehicle trajectories, benefiting both generation and prediction tasks.
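As background on the kind of kinematic knowledge such models build in, the sketch below rolls out a standard kinematic bicycle model, which maps (acceleration, steering) controls to a physically feasible trajectory. This is textbook kinematics, not the paper's LK-SDE decoder; the wheelbase and controls are placeholders.

```python
# Background sketch: a standard kinematic bicycle model step, the kind of
# physical constraint a kinematics-aware generator can decode latent controls
# through. Generic textbook kinematics, not the paper's LK-SDE decoder.
import numpy as np

def bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """state = (x, y, heading, speed); controls = longitudinal accel, steering angle."""
    x, y, theta, v = state
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += v / wheelbase * np.tan(steer) * dt
    v = max(0.0, v + accel * dt)          # no reversing in this sketch
    return np.array([x, y, theta, v])

def rollout(initial_state, controls):
    """Integrate a sequence of (accel, steer) controls into a trajectory."""
    traj = [np.asarray(initial_state, dtype=float)]
    for accel, steer in controls:
        traj.append(bicycle_step(traj[-1], accel, steer))
    return np.stack(traj)

controls = [(1.0, 0.02)] * 20 + [(-0.5, -0.05)] * 20      # accelerate, then brake and turn
trajectory = rollout([0.0, 0.0, 0.0, 5.0], controls)
print(trajectory[-1])                                      # final (x, y, heading, speed)
```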

Energy stable neural network for gradient flow equations

  • paper_url: http://arxiv.org/abs/2309.10002
  • repo_url: None
  • paper_authors: Ganghua Fan, Tianyu Jin, Yuan Lan, Yang Xiang, Luchan Zhang
  • for: A neural-network method for solving gradient flow equations.
  • methods: The solution update scheme is inspired by an auxiliary-variable-based equivalent form of the gradient flow equation; the network consists of energy decay blocks, so that a discrete energy decreases along the network, consistent with the evolution of the gradient flow equation.
  • results: Numerical experiments demonstrate that the network generates accurate and stable predictions.
    Abstract In this paper, we propose an energy stable network (EStable-Net) for solving gradient flow equations. The solution update scheme in our neural network EStable-Net is inspired by a proposed auxiliary variable based equivalent form of the gradient flow equation. EStable-Net enables decreasing of a discrete energy along the neural network, which is consistent with the property in the evolution process of the gradient flow equation. The architecture of the neural network EStable-Net consists of a few energy decay blocks, and the output of each block can be interpreted as an intermediate state of the evolution process of the gradient flow equation. This design provides a stable, efficient and interpretable network structure. Numerical experimental results demonstrate that our network is able to generate high accuracy and stable predictions.

Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets

  • paper_url: http://arxiv.org/abs/2309.09258
  • repo_url: None
  • paper_authors: Pulkit Gopalani, Samyak Jha, Anirbit Mukherjee
  • for: This note proves that SGD converges to the global minima of appropriately regularized logistic empirical risk on depth-2 nets, for arbitrary data and any number of gates with adequately smooth and bounded activations such as sigmoid and tanh.
  • methods: The key idea is to show that Frobenius-norm-regularized logistic loss functions on constant-sized neural nets are "Villani functions", building on recent progress in analyzing SGD on such objectives; an exponentially fast convergence rate is also proved for continuous-time SGD, which extends to smooth unbounded activations such as SoftPlus.
  • results: A first-of-its-kind provable convergence of SGD to the global minima of these objectives is established, for arbitrary data.
    Abstract In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions" and thus be able to build on recent progress with analyzing SGD on such objectives.
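A minimal sketch of the objective class the note analyzes: a depth-2 (one hidden layer) net with sigmoid gates and a Frobenius-norm-regularized logistic loss, minimized with plain minibatch SGD. The width, regularization strength, and data below are placeholders.

```python
# Sketch of the objective class analyzed in the note: a depth-2 net with sigmoid
# gates, logistic loss, and Frobenius-norm regularization on the weights,
# minimized with plain SGD. Width, lambda, and the data are placeholders.
import torch

torch.manual_seed(0)
n, d, width, lam = 200, 10, 16, 1e-2
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()                      # labels in {0, 1}

W = torch.randn(width, d, requires_grad=True)  # inner (gate) weights
a = torch.randn(width, requires_grad=True)     # outer weights

opt = torch.optim.SGD([W, a], lr=0.05)
loss_fn = torch.nn.BCEWithLogitsLoss()

for step in range(500):
    idx = torch.randint(0, n, (32,))           # SGD minibatch
    logits = torch.sigmoid(X[idx] @ W.T) @ a   # depth-2 net: sigmoid gates, linear output
    loss = loss_fn(logits, y[idx]) + lam * (W.norm() ** 2 + a.norm() ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (((torch.sigmoid(X @ W.T) @ a) > 0).float() == y).float().mean()
print(f"train accuracy: {acc:.2f}")
```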

User Assignment and Resource Allocation for Hierarchical Federated Learning over Wireless Networks

  • paper_url: http://arxiv.org/abs/2309.09253
  • repo_url: None
  • paper_authors: Tinghao Zhang, Kwok-Yan Lam, Jun Zhao
  • for: This paper focuses on reducing energy consumption and latency in federated learning over wireless networks while keeping data on users' devices.
  • methods: A hierarchical federated learning (HFL) framework is considered, and two algorithms are proposed: a spectrum resource optimization algorithm (SROA) that optimizes CPU frequency, transmit power, and bandwidth for a given user assignment, and a two-stage iterative algorithm (TSIA) that searches for a user assignment pattern that considerably reduces the total system cost.
  • results: Experimental results demonstrate that the proposed HFL framework outperforms existing studies in energy and latency reduction.
    Abstract The large population of wireless users is a key driver of data-crowdsourced Machine Learning (ML). However, data privacy remains a significant concern. Federated Learning (FL) encourages data sharing in ML without requiring data to leave users' devices but imposes heavy computation and communications overheads on mobile devices. Hierarchical FL (HFL) alleviates this problem by performing partial model aggregation at edge servers. HFL can effectively reduce energy consumption and latency through effective resource allocation and appropriate user assignment. Nevertheless, resource allocation in HFL involves optimizing multiple variables, and the objective function should consider both energy consumption and latency, making the development of resource allocation algorithms very complicated. Moreover, it is challenging to perform user assignment, which is a combinatorial optimization problem in a large search space. This article proposes a spectrum resource optimization algorithm (SROA) and a two-stage iterative algorithm (TSIA) for HFL. Given an arbitrary user assignment pattern, SROA optimizes CPU frequency, transmit power, and bandwidth to minimize system cost. TSIA aims to find a user assignment pattern that considerably reduces the total system cost. Experimental results demonstrate the superiority of the proposed HFL framework over existing studies in energy and latency reduction.
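The sketch below shows the partial-aggregation structure HFL relies on: clients average at their assigned edge server, and edge models average at the cloud. The user-assignment pattern and model vectors are placeholders; the SROA and TSIA optimization steps themselves are not reproduced.

```python
# Sketch of hierarchical aggregation in HFL: clients -> edge servers -> cloud.
# The user-assignment pattern here is a placeholder; SROA/TSIA are not reproduced.
import numpy as np

def weighted_average(models, sizes):
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(models), axis=0, weights=sizes / sizes.sum())

def hfl_round(global_model, edge_assignment, local_updates, sample_counts):
    """edge_assignment: {edge_id: [client_ids]}; local_updates/sample_counts per client."""
    edge_models, edge_sizes = [], []
    for edge_id, client_ids in edge_assignment.items():
        models = [local_updates[c] for c in client_ids]
        sizes = [sample_counts[c] for c in client_ids]
        edge_models.append(weighted_average(models, sizes))   # partial aggregation at the edge
        edge_sizes.append(sum(sizes))
    return weighted_average(edge_models, edge_sizes)           # global aggregation at the cloud

rng = np.random.default_rng(0)
global_model = np.zeros(8)
edge_assignment = {0: [0, 1, 2], 1: [3, 4]}                    # hypothetical assignment
local_updates = {c: global_model + 0.1 * rng.normal(size=8) for c in range(5)}
sample_counts = {c: rng.integers(50, 200) for c in range(5)}
print(hfl_round(global_model, edge_assignment, local_updates, sample_counts))
```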

High-dimensional manifold of solutions in neural networks: insights from statistical physics

  • paper_url: http://arxiv.org/abs/2309.09240
  • repo_url: None
  • paper_authors: Enrico M. Malatesta
  • for: These pedagogic notes review the statistical mechanics approach to neural networks, focusing on the perceptron architecture with binary and continuous weights in the classification setting.
  • methods: Gardner's replica-method approach is reviewed, including the derivation of the SAT/UNSAT transition in the storage setting.
  • results: The notes discuss how zero-training-error configurations are geometrically arranged and how this arrangement changes as the training set grows; in binary-weight models, algorithmic hardness is shown to be a consequence of the disappearance of a clustered region of solutions that extends to very large distances; finally, studying linear mode connectivity between solutions gives insight into the average shape of the solution manifold.
    Abstract In these pedagogic notes I review the statistical mechanics approach to neural networks, focusing on the paradigmatic example of the perceptron architecture with binary an continuous weights, in the classification setting. I will review the Gardner's approach based on replica method and the derivation of the SAT/UNSAT transition in the storage setting. Then, I discuss some recent works that unveiled how the zero training error configurations are geometrically arranged, and how this arrangement changes as the size of the training set increases. I also illustrate how different regions of solution space can be explored analytically and how the landscape in the vicinity of a solution can be characterized. I give evidence how, in binary weight models, algorithmic hardness is a consequence of the disappearance of a clustered region of solutions that extends to very large distances. Finally, I demonstrate how the study of linear mode connectivity between solutions can give insights into the average shape of the solution manifold.

Globally Convergent Accelerated Algorithms for Multilinear Sparse Logistic Regression with $\ell_0$-constraints

  • paper_url: http://arxiv.org/abs/2309.09239
  • repo_url: https://github.com/weifeng-yang/mlsr
  • paper_authors: Weifeng Yang, Wenwen Min
  • For: The paper analyzes multidimensional data using a Multilinear Sparse Logistic Regression model with $\ell_0$-constraints ($\ell_0$-MLSR).
  • Methods: The paper proposes an Accelerated Proximal Alternating Linearized Minimization with Adaptive Momentum (APALM$^+$) method to solve the $\ell_0$-MLSR model; in contrast to the $\ell_1$-norm and $\ell_2$-norm, the $\ell_0$-norm constraint is better suited for feature selection, but its nonconvex and nonsmooth nature makes the problem challenging.
  • Results: The paper proves convergence of the objective function of $\ell_0$-MLSR under APALM$^+$ (global convergence to a first-order critical point, with a rate established via the Kurdyka-Lojasiewicz property) and demonstrates superior accuracy and speed compared to other state-of-the-art methods on synthetic and real-world datasets.
    Abstract Tensor data represents a multidimensional array. Regression methods based on low-rank tensor decomposition leverage structural information to reduce the parameter count. Multilinear logistic regression serves as a powerful tool for the analysis of multidimensional data. To improve its efficacy and interpretability, we present a Multilinear Sparse Logistic Regression model with $\ell_0$-constraints ($\ell_0$-MLSR). In contrast to the $\ell_1$-norm and $\ell_2$-norm, the $\ell_0$-norm constraint is better suited for feature selection. However, due to its nonconvex and nonsmooth properties, solving it is challenging and convergence guarantees are lacking. Additionally, the multilinear operation in $\ell_0$-MLSR also brings non-convexity. To tackle these challenges, we propose an Accelerated Proximal Alternating Linearized Minimization with Adaptive Momentum (APALM$^+$) method to solve the $\ell_0$-MLSR model. We provide a proof that APALM$^+$ can ensure the convergence of the objective function of $\ell_0$-MLSR. We also demonstrate that APALM$^+$ is globally convergent to a first-order critical point as well as establish convergence rate by using the Kurdyka-Lojasiewicz property. Empirical results obtained from synthetic and real-world datasets validate the superior performance of our algorithm in terms of both accuracy and speed compared to other state-of-the-art methods.
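The $\ell_0$-constraint typically enters such proximal alternating schemes through its projection step, which keeps the $s$ largest-magnitude entries of a block and zeroes the rest; a minimal sketch of that step is below. APALM$^+$ itself, with its adaptive momentum and multilinear block updates, is not reproduced.

```python
# Minimal sketch of the projection onto the l0-ball used inside proximal
# alternating schemes for l0-constrained models: keep the s largest-magnitude
# entries and zero the rest. APALM+ itself is not reproduced here.
import numpy as np

def project_l0(v, s):
    """Euclidean projection of v onto {x : ||x||_0 <= s}."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]      # indices of the s largest magnitudes
    out[keep] = v[keep]
    return out

# Proximal-gradient-style update for one block of coefficients (illustrative):
# w <- project_l0(w - step * gradient, s)
w = np.array([0.9, -0.1, 0.03, -1.4, 0.2])
grad = np.array([0.3, -0.2, 0.1, 0.0, 0.5])
print(project_l0(w - 0.1 * grad, s=2))     # only the two largest entries survive
```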

Provable learning of quantum states with graphical models

  • paper_url: http://arxiv.org/abs/2309.09235
  • repo_url: None
  • paper_authors: Liming Zhao, Naixu Guo, Ming-Xing Luo, Patrick Rebentrost
  • for: This work studies efficiently learnable quantum states, specifically states close to neural network quantum states that can be represented by restricted Boltzmann machines (RBMs).
  • methods: Robustness results are established for efficient, provable two-hop neighborhood learning algorithms for ferromagnetic and locally consistent RBMs, with closeness measured in the $L_p$-norm, including total variation distance and max-norm distance in the limit.
  • results: Certain quantum states can be learned with a sample complexity exponentially better than naive tomography, yielding new classes of efficiently learnable quantum states and new strategies to learn them.
    Abstract The complete learning of an $n$-qubit quantum state requires samples exponentially in $n$. Several works consider subclasses of quantum states that can be learned in polynomial sample complexity such as stabilizer states or high-temperature Gibbs states. Other works consider a weaker sense of learning, such as PAC learning and shadow tomography. In this work, we consider learning states that are close to neural network quantum states, which can efficiently be represented by a graphical model called restricted Boltzmann machines (RBMs). To this end, we exhibit robustness results for efficient provable two-hop neighborhood learning algorithms for ferromagnetic and locally consistent RBMs. We consider the $L_p$-norm as a measure of closeness, including both total variation distance and max-norm distance in the limit. Our results allow certain quantum states to be learned with a sample complexity \textit{exponentially} better than naive tomography. We hence provide new classes of efficiently learnable quantum states and apply new strategies to learn them.

Double Normalizing Flows: Flexible Bayesian Gaussian Process ODEs Learning

  • paper_url: http://arxiv.org/abs/2309.09222
  • repo_url: None
  • paper_authors: Jian Xu, Shian Du, Junmei Yang, Xinghao Ding, John Paisley, Delu Zeng
  • for: Bayesian inference for Gaussian process models of the vector fields of continuous dynamical systems (GP ODEs).
  • methods: Normalizing flows are incorporated to reparameterize the vector field of the ODEs, giving a more flexible and expressive prior, and are also applied to the posterior inference of GP ODEs, producing a non-Gaussian posterior.
  • results: The model improves accuracy and uncertainty estimates on simulated dynamical systems and real-world human motion data, including time series prediction and missing data recovery tasks.
    Abstract Recently, Gaussian processes have been utilized to model the vector field of continuous dynamical systems. Bayesian inference for such models \cite{hegde2022variational} has been extensively studied and has been applied in tasks such as time series prediction, providing uncertain estimates. However, previous Gaussian Process Ordinary Differential Equation (ODE) models may underperform on datasets with non-Gaussian process priors, as their constrained priors and mean-field posteriors may lack flexibility. To address this limitation, we incorporate normalizing flows to reparameterize the vector field of ODEs, resulting in a more flexible and expressive prior distribution. Additionally, due to the analytically tractable probability density functions of normalizing flows, we apply them to the posterior inference of GP ODEs, generating a non-Gaussian posterior. Through these dual applications of normalizing flows, our model improves accuracy and uncertainty estimates for Bayesian Gaussian Process ODEs. The effectiveness of our approach is demonstrated on simulated dynamical systems and real-world human motion data, including tasks such as time series prediction and missing data recovery. Experimental results indicate that our proposed method effectively captures model uncertainty while improving accuracy.

MFRL-BI: Design of a Model-free Reinforcement Learning Process Control Scheme by Using Bayesian Inference

  • paper_url: http://arxiv.org/abs/2309.09205
  • repo_url: None
  • paper_authors: Yanrong Li, Juan Du, Wei Jiang
  • for: This work proposes a model-free reinforcement learning (MFRL) control scheme that conducts experiments and optimizes control simultaneously from real-time data, addressing model inaccuracy in real manufacturing systems.
  • methods: The MFRL control scheme updates the distribution of disturbances using Bayesian inference, reducing their large variations during manufacturing processes.
  • results: The proposed MFRL controller performs well in a nonlinear chemical mechanical planarization (CMP) process when the process model is unknown; theoretical properties are guaranteed when disturbances are additive, and numerical studies demonstrate the effectiveness and efficiency of the methodology.
    Abstract Design of process control scheme is critical for quality assurance to reduce variations in manufacturing systems. Taking semiconductor manufacturing as an example, extensive literature focuses on control optimization based on certain process models (usually linear models), which are obtained by experiments before a manufacturing process starts. However, in real applications, pre-defined models may not be accurate, especially for a complex manufacturing system. To tackle model inaccuracy, we propose a model-free reinforcement learning (MFRL) approach to conduct experiments and optimize control simultaneously according to real-time data. Specifically, we design a novel MFRL control scheme by updating the distribution of disturbances using Bayesian inference to reduce their large variations during manufacturing processes. As a result, the proposed MFRL controller is demonstrated to perform well in a nonlinear chemical mechanical planarization (CMP) process when the process model is unknown. Theoretical properties are also guaranteed when disturbances are additive. The numerical studies also demonstrate the effectiveness and efficiency of our methodology.
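A minimal sketch of the Bayesian-inference ingredient, under the simplifying assumption of a Gaussian disturbance with known variance (a conjugate Normal-Normal model): the belief about the disturbance mean is updated from observed process deviations. The paper's actual disturbance model and control law are not reproduced.

```python
# Minimal sketch: conjugate Normal-Normal update of a disturbance-mean belief
# from observed process deviations, assuming known observation variance.
# The paper's actual disturbance model is not reproduced here.
import numpy as np

def update_disturbance_belief(prior_mean, prior_var, observations, noise_var):
    """Posterior mean/variance of the disturbance mean after new observations."""
    obs = np.asarray(observations, dtype=float)
    n = obs.size
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + obs.sum() / noise_var)
    return post_mean, post_var

mean, var = 0.0, 1.0                      # vague prior belief about the disturbance mean
rng = np.random.default_rng(0)
for batch in range(3):                    # each batch = deviations measured in one run
    deviations = rng.normal(0.4, 0.2, size=10)
    mean, var = update_disturbance_belief(mean, var, deviations, noise_var=0.04)
    print(f"after batch {batch}: mean={mean:.3f}, var={var:.5f}")
```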

End-to-End Optimized Pipeline for Prediction of Protein Folding Kinetics

  • paper_url: http://arxiv.org/abs/2309.09191
  • repo_url: None
  • paper_authors: Vijay Arvind. R, Haribharathi Sivakumar, Brindha. R
  • for: An end-to-end optimized pipeline for predicting protein folding kinetics with high accuracy and a low memory footprint.
  • methods: A machine learning model is used for the prediction.
  • results: The deployed model outperforms state-of-the-art ML models by 4.8% in accuracy while consuming 327x less memory and running 7.3% faster.
    Abstract Protein folding is the intricate process by which a linear sequence of amino acids self-assembles into a unique three-dimensional structure. Protein folding kinetics is the study of pathways and time-dependent mechanisms a protein undergoes when it folds. Understanding protein kinetics is essential as a protein needs to fold correctly for it to perform its biological functions optimally, and a misfolded protein can sometimes be contorted into shapes that are not ideal for a cellular environment giving rise to many degenerative, neuro-degenerative disorders and amyloid diseases. Monitoring at-risk individuals and detecting protein discrepancies in a protein's folding kinetics at the early stages could majorly result in public health benefits, as preventive measures can be taken. This research proposes an efficient pipeline for predicting protein folding kinetics with high accuracy and low memory footprint. The deployed machine learning (ML) model outperformed the state-of-the-art ML models by 4.8% in terms of accuracy while consuming 327x lesser memory and being 7.3% faster.

Data-Driven Reachability Analysis of Stochastic Dynamical Systems with Conformal Inference

  • paper_url: http://arxiv.org/abs/2309.09187
  • repo_url: None
  • paper_authors: Navid Hashemi, Xin Qin, Lars Lindemann, Jyotirmoy V. Deshmukh
  • for: Data-driven reachability analysis of discrete-time stochastic dynamical systems using conformal inference, given only a dataset of $K$-step trajectories rather than a symbolic system description.
  • methods: A surrogate predictor model is learned from data, reachability analysis is performed with the surrogate, and conformal inference quantifies the surrogate's incurred error to give probabilistic reachability guarantees.
  • results: The method constructs probabilistic flowpipes whose violation probability does not exceed a user-specified threshold, and is demonstrated on learning-enabled cyber-physical systems with complex closed-loop dynamics.
    Abstract We consider data-driven reachability analysis of discrete-time stochastic dynamical systems using conformal inference. We assume that we are not provided with a symbolic representation of the stochastic system, but instead have access to a dataset of $K$-step trajectories. The reachability problem is to construct a probabilistic flowpipe such that the probability that a $K$-step trajectory can violate the bounds of the flowpipe does not exceed a user-specified failure probability threshold. The key ideas in this paper are: (1) to learn a surrogate predictor model from data, (2) to perform reachability analysis using the surrogate model, and (3) to quantify the surrogate model's incurred error using conformal inference in order to give probabilistic reachability guarantees. We focus on learning-enabled control systems with complex closed-loop dynamics that are difficult to model symbolically, but where state transition pairs can be queried, e.g., using a simulator. We demonstrate the applicability of our method on examples from the domain of learning-enabled cyber-physical systems.
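A sketch of the split-conformal step described above: nonconformity scores are the surrogate's $K$-step prediction errors on held-out trajectories, and their adjusted $(1-\alpha)$ quantile gives a radius by which the surrogate's predictions are inflated. The surrogate, mock dynamics, and dimensions are placeholders.

```python
# Sketch of the split-conformal step: calibrate the surrogate's K-step prediction
# error on held-out trajectories and inflate its predictions by the resulting
# radius, so the true state lies inside with probability at least 1 - alpha
# (exchangeability assumed). Surrogate, data, and K are placeholders.
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal, dim = 0.05, 200, 2

def surrogate_predict(x0):
    """Placeholder surrogate for the K-step successor state."""
    return 0.9 * x0 + 0.1

# Calibration set: (initial state, true K-step state) pairs from the simulator.
x0 = rng.normal(size=(n_cal, dim))
x_true = 0.9 * x0 + 0.1 + 0.05 * rng.normal(size=(n_cal, dim))   # mock dynamics + noise

scores = np.linalg.norm(x_true - surrogate_predict(x0), axis=1)   # nonconformity scores
k = int(np.ceil((n_cal + 1) * (1 - alpha)))                        # adjusted quantile index
radius = np.sort(scores)[min(k, n_cal) - 1]

# Probabilistic flowpipe for a new initial state: a ball around the surrogate prediction.
x_new = np.array([0.3, -0.7])
print("center:", surrogate_predict(x_new), "radius:", radius)
```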

On the Connection Between Riemann Hypothesis and a Special Class of Neural Networks

  • paper_url: http://arxiv.org/abs/2309.09171
  • repo_url: None
  • paper_authors: Soufiane Hayou
  • for: These notes revisit and extend an analytic criterion for the Riemann hypothesis (RH), aimed at an audience unfamiliar with RH.
  • methods: The Nyman-Beurling criterion is used, which connects RH to a minimization problem involving a special class of neural networks.
  • results: The notes provide an extended analytic criterion and a new way of examining RH, together with a gentle introduction to RH.
    Abstract The Riemann hypothesis (RH) is a long-standing open problem in mathematics. It conjectures that non-trivial zeros of the zeta function all have real part equal to 1/2. The extent of the consequences of RH is far-reaching and touches a wide spectrum of topics including the distribution of prime numbers, the growth of arithmetic functions, the growth of Euler totient, etc. In this note, we revisit and extend an old analytic criterion of the RH known as the Nyman-Beurling criterion which connects the RH to a minimization problem that involves a special class of neural networks. This note is intended for an audience unfamiliar with RH. A gentle introduction to RH is provided.
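For readers unfamiliar with the criterion, one standard formulation from the literature (paraphrased as background, not quoted from this paper) is the following, where $\{\cdot\}$ denotes the fractional part; the finite sums appearing in it are the sort of shallow parametric approximators whose minimization the note relates to neural networks.

```latex
% One standard statement of the Nyman--Beurling criterion (background, not a
% quote from the paper): RH holds iff the indicator of (0,1) can be approximated
% in L^2(0,1) by finite combinations of dilated fractional-part functions.
\[
  \text{RH} \iff
  \inf_{n,\; c_k,\; \theta_k \in (0,1]}
  \int_0^1 \Bigl| \mathbf{1}_{(0,1)}(x) - \sum_{k=1}^{n} c_k \Bigl\{\tfrac{\theta_k}{x}\Bigr\} \Bigr|^2 \, dx \;=\; 0 .
\]
```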

Integration of geoelectric and geochemical data using Self-Organizing Maps (SOM) to characterize a landfill

  • paper_url: http://arxiv.org/abs/2309.09164
  • repo_url: None
  • paper_authors: Camila Juliao, Johan Diaz, Yosmely BermÚdez, Milagrosa Aldana
  • For: The objective is to determine whether potential contamination-risk zones exist in the areas surrounding a landfill, integrating different measurement methods.
  • Methods: Geoelectric data (resistivity and IP) and surface methane measurements are integrated and classified using an unsupervised Kohonen-type neural network, which generates Self-Organizing Maps (SOMs).
  • Results: Groups of neurons with similar behaviour were selected from the two graphic outputs of the training; contour maps of these groups and of the individual variables show that two of the groups correspond to typical values of liquids percolated in the landfill, yielding a precise delimitation of the affected areas.
    Abstract Leachates from garbage dumps can significantly compromise their surrounding area. Even if the distance between these and the populated areas could be considerable, the risk of affecting the aquifers for public use is imminent in most cases. For this reason, the delimitation and monitoring of the leachate plume are of significant importance. Geoelectric data (resistivity and IP), and surface methane measurements, are integrated and classified using an unsupervised Neural Network to identify possible risk zones in areas surrounding a landfill. The Neural Network used is a Kohonen type, which generates; as a result, Self-Organizing Classification Maps or SOM (Self-Organizing Map). Two graphic outputs were obtained from the training performed in which groups of neurons that presented a similar behaviour were selected. Contour maps corresponding to the location of these groups and the individual variables were generated to compare the classification obtained and the different anomalies associated with each of these variables. Two of the groups resulting from the classification are related to typical values of liquids percolated in the landfill for the parameters evaluated individually. In this way, a precise delimitation of the affected areas in the studied landfill was obtained, integrating the input variables via SOMs. The location of the study area is not detailed for confidentiality reasons.
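A minimal from-scratch Kohonen SOM sketch on stacked (resistivity, IP, methane) feature vectors, illustrating how measurement points are assigned to map nodes that can then be grouped; the data are synthetic placeholders and the grid size and learning schedule are illustrative choices, not the study's configuration.

```python
# Minimal Kohonen SOM sketch on stacked (resistivity, IP, methane) feature vectors.
# Data are synthetic placeholders; grid size and learning schedule are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = np.column_stack([
    rng.lognormal(3.0, 0.5, 300),    # resistivity (ohm.m), placeholder
    rng.normal(10.0, 3.0, 300),      # chargeability / IP (mV/V), placeholder
    rng.exponential(2.0, 300),       # surface methane, placeholder
])
data = (data - data.mean(0)) / data.std(0)           # standardize features

rows, cols, n_iter = 6, 6, 2000
weights = rng.normal(size=(rows, cols, data.shape[1]))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for t in range(n_iter):
    lr = 0.5 * (1 - t / n_iter)                      # decaying learning rate
    sigma = 3.0 * (1 - t / n_iter) + 0.5             # decaying neighbourhood radius
    x = data[rng.integers(len(data))]
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
    h = np.exp(-np.linalg.norm(grid - np.array(bmu), axis=-1) ** 2 / (2 * sigma ** 2))
    weights += lr * h[..., None] * (x - weights)

# Assign each measurement point to its winning node (its "group" on the map).
labels = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (rows, cols))
          for x in data]
print(labels[:5])
```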

Total Variation Distance Estimation Is as Easy as Probabilistic Inference

  • paper_url: http://arxiv.org/abs/2309.09134
  • repo_url: None
  • paper_authors: Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, N. V. Vinodchandran
  • for: This paper establishes a novel connection between total variation (TV) distance estimation and probabilistic inference.
  • methods: An efficient, structure-preserving reduction from relative approximation of TV distance to probabilistic inference over directed graphical models is presented, employing a new notion of partial couplings of high-dimensional distributions.
  • results: The reduction yields a fully polynomial randomized approximation scheme (FPRAS) for estimating TV distances between distributions over any class of Bayes nets admitting an efficient probabilistic inference algorithm, in particular Bayes nets of bounded treewidth; prior to this work, such schemes existed only for product distributions.
    Abstract In this paper, we establish a novel connection between total variation (TV) distance estimation and probabilistic inference. In particular, we present an efficient, structure-preserving reduction from relative approximation of TV distance to probabilistic inference over directed graphical models. This reduction leads to a fully polynomial randomized approximation scheme (FPRAS) for estimating TV distances between distributions over any class of Bayes nets for which there is an efficient probabilistic inference algorithm. In particular, it leads to an FPRAS for estimating TV distances between distributions that are defined by Bayes nets of bounded treewidth. Prior to this work, such approximation schemes only existed for estimating TV distances between product distributions. Our approach employs a new notion of $partial$ couplings of high-dimensional distributions, which might be of independent interest.
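As standard background (not specific to this paper): for distributions on a finite domain, the TV distance and its coupling characterization are given below. The pointwise sum is intractable for high-dimensional distributions such as those defined by Bayes nets, which is what motivates reductions to probabilistic inference.

```latex
% Standard background definitions (not specific to this paper). For distributions
% P, Q on a finite domain X, the total variation distance and its coupling
% characterization are
\[
  d_{\mathrm{TV}}(P,Q) \;=\; \max_{A \subseteq X}\,\bigl|P(A)-Q(A)\bigr|
  \;=\; \tfrac12 \sum_{x \in X} \bigl|P(x)-Q(x)\bigr|
  \;=\; \min_{(X,Y)\,:\,X \sim P,\; Y \sim Q} \Pr[X \neq Y],
\]
% where the minimum is over all couplings, i.e., joint distributions of (X, Y)
% with the prescribed marginals.
```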

eess.SP - 2023-09-17

Climate-Resilient UAVs: Enhancing Energy-Efficient B5G Communication in Harsh Environments

  • paper_url: http://arxiv.org/abs/2309.09387
  • repo_url: None
  • paper_authors: Abdu Saif, Saeed Hamood Alsamhi, Edward Curry
  • for: This paper explores the crucial role of Unmanned Aerial Vehicles (UAVs) in advancing Beyond Fifth Generation (B5G) communication networks, especially in adverse weather conditions such as rain, fog, and snow.
  • methods: The synergy between climate-resilient UAVs and energy-efficient B5G communication is investigated, analyzing the impact of weather elements on UAV coverage and communication dynamics.
  • results: Climate-resilient UAVs yield significant enhancements in energy efficiency, reduced interference, increased data transmission rates, and optimal channel gain under various weather conditions.
    Abstract This paper explores the crucial role of Unmanned Aerial Vehicles (UAVs) in advancing Beyond Fifth Generation (B5G) communication networks, especially in adverse weather conditions like rain, fog, and snow. The study investigates the synergy between climate-resilient UAVs and energy-efficient B5G communication. Key findings include the impact of weather elements on UAV coverage and communication dynamics. The research demonstrates significant enhancements in energy efficiency, reduced interference, increased data transmission rates, and optimal channel gain under various weather conditions. Overall, this paper emphasizes the potential of climate-resilient UAVs to improve energy-efficient B5G communication and highlights technology's role in mitigating climate change's impact on communication systems, promoting sustainability and resilience.

Frequency-Domain Detection for Molecular Communication with Cross-Reactive Receptors

  • paper_url: http://arxiv.org/abs/2309.09377
  • repo_url: None
  • paper_authors: Meltem Civas, Murat Kuscu, Ozgur B. Akan
  • for: This paper is written for the development of a frequency-domain detection (FDD) technique for bioFET-based molecular communication receivers (MC-Rxs) to overcome molecular cross-talk in the time domain.
  • methods: The paper proposes the use of a frequency-domain detection technique that exploits the difference in binding reaction rates of different ligand types reflected in the power spectrum of the ligand-receptor binding noise to decode transmitted concentration signals.
  • results: The paper demonstrates the effectiveness of the proposed FDD technique in decoding transmitted concentration signals under stochastic molecular interference compared to a widely used time-domain detection (TDD) technique, and verifies the analytical performance bounds of the FDD through a particle-based spatial stochastic simulator simulating reactions on the MC-Rx in microfluidic channels.
    Abstract Molecular Communications (MC) is a bio-inspired communication paradigm that uses molecules as information carriers, requiring unconventional transceivers and modulation/detection techniques. Practical MC receivers (MC-Rxs) can be implemented using field-effect transistor biosensor (bioFET) architectures, where surface receptors reversibly react with ligands. The time-varying concentration of ligand-bound receptors is translated into electrical signals via field effect, which is used to decode the transmitted information. However, ligand-receptor interactions do not provide an ideal molecular selectivity, as similar ligand types, i.e., interferers, co-existing in the MC channel, can interact with the same type of receptors. Overcoming this molecular cross-talk in the time domain can be challenging, especially when Rx has no knowledge of the interferer statistics or operates near saturation. Therefore, we propose a frequency-domain detection (FDD) technique for bioFET-based MC-Rxs that exploits the difference in binding reaction rates of different ligand types reflected in the power spectrum of the ligand-receptor binding noise. We derive the bit error probability (BEP) of the FDD technique and demonstrate its effectiveness in decoding transmitted concentration signals under stochastic molecular interference compared to a widely used time-domain detection (TDD) technique. We then verified the analytical performance bounds of the FDD through a particle-based spatial stochastic simulator simulating reactions on the MC-Rx in microfluidic channels.

Frequency Estimation Using Complex-Valued Shifted Window Transformer

  • paper_url: http://arxiv.org/abs/2309.09352
  • repo_url: https://github.com/josiahwsmith10/spectral-super-resolution-swin
  • paper_authors: Josiah W. Smith, Murat Torlak
  • for: This work targets estimating closely spaced frequency components of a signal, a fundamental problem in statistical signal processing.
  • methods: 1-D real-valued and complex-valued shifted window (Swin) transformers, SwinFreq and CVSwinFreq, are introduced for line-spectra frequency estimation on 1-D complex-valued signals, including a complex-valued Swin module designed to leverage the complex-valued nature of the signals.
  • results: Compared with the classical periodogram, MUSIC, and OMP algorithms and the state-of-the-art deep learning approach cResFreq, SwinFreq and CVSwinFreq offer superior performance at low SNR and improved resolution while requiring fewer model parameters, making them more suitable for edge and mobile applications; the real-valued SwinFreq outperforms its complex-valued counterpart on several tasks with a smaller model size, and both are validated on radar range profile super-resolution with synthetic and real data.
    Abstract Estimating closely spaced frequency components of a signal is a fundamental problem in statistical signal processing. In this letter, we introduce 1-D real-valued and complex-valued shifted window (Swin) transformers, referred to as SwinFreq and CVSwinFreq, respectively, for line-spectra frequency estimation on 1-D complex-valued signals. Whereas 2-D Swin transformer-based models have gained traction for optical image super-resolution, we introduce for the first time a complex-valued Swin module designed to leverage the complex-valued nature of signals for a wide array of applications. The proposed approach overcomes the limitations of the classical algorithms such as the periodogram, MUSIC, and OMP in addition to state-of-the-art deep learning approach cResFreq. SwinFreq and CVSwinFreq boast superior performance at low signal-to-noise ratio SNR and improved resolution capability while requiring fewer model parameters than cResFreq, thus deeming it more suitable for edge and mobile applications. We find that the real-valued Swin-Freq outperforms its complex-valued counterpart CVSwinFreq for several tasks while touting a smaller model size. Finally, we apply the proposed techniques for radar range profile super-resolution using real data. The results from both synthetic and real experimentation validate the numerical and empirical superiority of SwinFreq and CVSwinFreq to the state-of-the-art deep learning-based techniques and traditional frequency estimation algorithms. The code and models are publicly available at https://github.com/josiahwsmith10/spectral-super-resolution-swin.

  • paper_url: http://arxiv.org/abs/2309.09273
  • repo_url: None
  • paper_authors: Itsik Bergel, Siddhartan Govindasamy
  • For: The paper is written for a cooperative cellular communications system, where multiple base stations around each mobile station cooperate and perform zero-forcing to reduce interference and improve system performance.
  • Methods: The paper derives closed-form expressions for the asymptotic performance of the network as the number of antennas per base station grows large, verifies them with Monte Carlo simulations, and proposes a power allocation algorithm that achieves near-optimal performance with reduced coordination overhead between base stations.
  • Results: The asymptotic results capture the trade-off between various system parameters and characterize the joint effect of noise and interference; they remain useful even when the number of antennas per base station is only moderately large.
    Abstract We consider the downlink of a cooperative cellular communications system, where several base-stations around each mobile cooperate and perform zero-forcing to reduce the received interference at the mobile. We derive closed-form expressions for the asymptotic performance of the network as the number of antennas per base station grows large. These expressions capture the trade off between various system parameters, and characterize the joint effect of noise and interference (where either noise or interference is asymptotically dominant and where both are asymptotically relevant). The asymptotic results are verified using Monte Carlo simulations, which indicate that they are useful even when the number of antennas per base station is only moderately large. Additionally, we show that when the number of antennas per base station grows large, power allocation can be optimized locally at each base station. We hence present a power allocation algorithm that achieves near optimal performance while significantly reducing the coordination overhead between base stations. The presented analysis is significantly more challenging than the uplink analysis, due to the dependence between beamforming vectors of nearby base stations. This statistical dependence is handled by introducing novel bounds on marked shot-noise point processes with dependent marks, which are also useful in other contexts.
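As background, the sketch below shows the zero-forcing step a cooperating base station performs: it beamforms toward its served mobile while nulling the channels of the other mobiles it protects, by projecting onto the null space of those channels. The dimensions and synthetic channels are placeholders.

```python
# Background sketch of the zero-forcing step a cooperating base station performs:
# beamform to its served mobile while nulling the channels of the other mobiles
# it cooperates with. Dimensions and channels are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_antennas, n_protected = 16, 3

h_served = (rng.normal(size=n_antennas) + 1j * rng.normal(size=n_antennas)) / np.sqrt(2)
H_protected = (rng.normal(size=(n_protected, n_antennas))
               + 1j * rng.normal(size=(n_protected, n_antennas))) / np.sqrt(2)

# Project the served user's channel onto the null space of the protected users' channels.
P_null = np.eye(n_antennas) - H_protected.conj().T @ np.linalg.pinv(H_protected.conj().T)
w = P_null @ h_served
w /= np.linalg.norm(w)                       # unit-power zero-forcing beam

print("signal gain |h^H w|^2 :", abs(h_served.conj() @ w) ** 2)
print("residual interference :", np.abs(H_protected @ w).max())   # ~ 0 up to numerics
```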

Toward Beamfocusing-Aided Near-Field Communications: Research Advances, Potential, and Challenges

  • paper_url: http://arxiv.org/abs/2309.09242
  • repo_url: None
  • paper_authors: Jiancheng An, Chau Yuen, Linglong Dai, Marco Di Renzo, Merouane Debbah, Lajos Hanzo
  • for: 这篇论文旨在探讨未来无线通信技术的发展,尤其是EXTREMELY大规模天线阵列(ELAA)和tera响communications(NFC)的潜在应用。
  • methods: 论文使用基于圆柱形波front的模型来准确描述近场无线传播通道特性。同时,论文还提出了NFC与传统远场通信的比较,并评估了NFC的频率响应和硬件设计等挑战。
  • results: 数值结果表明NFC可以提高空间多重化增量和位置准确性。此外,论文还提出了一些未来研究的开问,以便进一步探索NFC的潜在应用和发展。
    Abstract Next-generation mobile networks promise to support high throughput, massive connectivity, and improved energy efficiency. To achieve these ambitious goals, extremely large-scale antenna arrays (ELAAs) and terahertz communications constitute a pair of promising technologies. This will result in future wireless communications occurring in the near-field regions. To accurately portray the channel characteristics of near-field wireless propagation, spherical wavefront-based models are required and present both opportunities as well as challenges. Following the basics of near-field communications (NFC), we contrast it to conventional far-field communications. Moreover, we cover the key challenges of NFC, including its channel modeling and estimation, near-field beamfocusing, as well as hardware design. Our numerical results demonstrate the potential of NFC in improving the spatial multiplexing gain and positioning accuracy. Finally, a suite of open issues are identified for motivating future research.
    摘要 Next-generation mobile networks 将支持高速、大量连接和改善能源效率。为实现这些目标,极大规模天线阵列(ELAAs)和teraHz通信技术是两种承诺技术。这将导致未来无线通信发生在近场区域。为准确描述近场无线媒体特性,球形波front基于模型是必需的,并提供了机会和挑战。根据近场通信(NFC)的基本原理,我们对它与传统远场通信进行了比较。此外,我们还讨论了NFC的主要挑战,包括通道模型化和估计、近场焦点定向以及硬件设计。我们的数字结果表明NFC可以提高空间复用增量和位置准确性。最后,我们确定了一些未解决的问题,以便激励未来的研究。
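To make the spherical- versus planar-wavefront distinction concrete, the sketch below compares a near-field and a far-field steering vector for a uniform linear array. Carrier frequency, array size, and user position are assumed values, not taken from the paper.

```python
import numpy as np

wavelength = 0.01            # 30 GHz carrier -> 1 cm wavelength (assumption)
n_elem = 256
d = wavelength / 2           # half-wavelength element spacing
x_elem = (np.arange(n_elem) - (n_elem - 1) / 2) * d   # ULA on the x-axis

r, theta = 5.0, np.deg2rad(30)                   # user range (m) and angle (assumed)
user = np.array([r * np.sin(theta), r * np.cos(theta)])

# Near-field (spherical wavefront): phases follow the exact element-to-user distances.
dist = np.sqrt((user[0] - x_elem) ** 2 + user[1] ** 2)
a_near = np.exp(-1j * 2 * np.pi * dist / wavelength)

# Far-field (planar wavefront): phases depend only on the angle of arrival.
a_far = np.exp(-1j * 2 * np.pi * x_elem * np.sin(theta) / wavelength)

# The mismatch grows as the array aperture becomes large relative to the range.
mismatch = 1 - np.abs(a_far.conj() @ a_near) / n_elem
print(f"correlation loss between the two models: {mismatch:.3f}")
```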

Cramer-Rao Bound Optimization for Active RIS-Empowered ISAC Systems

  • paper_url: http://arxiv.org/abs/2309.09207
  • repo_url: None
  • paper_authors: Qi Zhu, Ming Li, Rang Liu, Qian Liu
  • for: 这篇论文的目的是探讨使用活动智能表面(RIS)强化探测信号质量和通信性能的Integrated sensing and communication(ISAC)系统。
  • methods: 论文使用了基站传输预编码和活动RIS反射扩散 beamforming的共同设计来优化参数估算性能,并提出了一种高效的算法基于块坐标降解(BCD)、半definite relaxation(SDR)和主导化最小化(MM)来解决非核心问题。
  • results: simulation results validate the effectiveness of the developed algorithm and the potential of employing active RIS in ISAC systems to enhance direct-of-arrival(DoA)估算性能。
    Abstract Integrated sensing and communication (ISAC), which simultaneously performs sensing and communication functions using the same frequency band and hardware platform, has emerged as a promising technology for future wireless systems. However, the weak echo signal received by the low-sensitivity ISAC receiver severely limits the sensing performance. Active reconfigurable intelligent surface (RIS) has become a prospective solution by situationally manipulating the wireless propagations and amplifying the signals. In this paper, we investigate the deployment of active RIS-empowered ISAC systems to enhance radar echo signal quality as well as communication performance. In particular, we focus on the joint design of the base station (BS) transmit precoding and the active RIS reflection beamforming to optimize the parameter estimation performance in terms of Cramer-Rao bound (CRB) subject to the service users' signal-to-interference-plus-noise ratio (SINR) requirements. An efficient algorithm based on block coordinate descent (BCD), semidefinite relaxation (SDR), and majorization-minimization (MM) is proposed to solve the formulated challenging non-convex problem. Finally, simulation results validate the effectiveness of the developed algorithm and the potential of employing active RIS in ISAC systems to enhance direction-of-arrival (DoA) estimation performance.
    摘要 Integrated sensing and communication (ISAC)技术,它同时执行感知和通信功能,使用同一个频率带和硬件平台,已经成为未来无线系统的一种优秀技术。然而,低敏感度ISAC接收器接收到的弱回声信号,严重限制了感知性能。活动可重配置表面(RIS)已成为一种可能的解决方案,通过 Situationally manipulating wireless propagation和增强信号。在这篇论文中,我们 investigate了活动RIS-empowered ISAC系统的部署,以提高雷达回声信号质量以及通信性能。具体来说,我们关注了基站(BS)传输 precoding 和活动RIS反射扩散的共同设计,以优化参数估计性能,并达到服务用户的信号干扰plus noise ratio(SINR)要求。我们提出了一种高效的算法,基于块坐标降解(BCD)、凸relaxation(SDR)和主要化-最小化(MM)来解决复杂非对称问题。最后,我们的实验结果证明了我们提出的算法的有效性,以及活动RIS在ISAC系统中的潜在优势。
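The proposed algorithm alternates between the BS precoder and the RIS reflection coefficients. The sketch below only shows the block-coordinate-descent structure on a toy least-squares objective; the actual SDR/MM subproblems of the paper are not reproduced, and all variables are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.standard_normal((20, 5)), rng.standard_normal((20, 5))
c = rng.standard_normal(20)
x, y = np.zeros(5), np.zeros(5)      # stand-ins for the two variable blocks

for it in range(50):
    # Block 1: update x with y held fixed (a least-squares subproblem).
    x, *_ = np.linalg.lstsq(A, c - B @ y, rcond=None)
    # Block 2: update y with x held fixed.
    y, *_ = np.linalg.lstsq(B, c - A @ x, rcond=None)
    obj = np.linalg.norm(A @ x + B @ y - c)
    # The objective is monotonically non-increasing across iterations.
print(f"final objective: {obj:.4f}")
```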

NOMA-Based Coexistence of Near-Field and Far-Field Massive MIMO Communications

  • paper_url: http://arxiv.org/abs/2309.09185
  • repo_url: None
  • paper_authors: Zhiguo Ding, Robert Schober, H. Vincent Poor
  • for: 提供了一种使用NOMA原理来支持传统近场用户的合作,以提高大量MIMO网络的性能。
  • methods: 使用预先配置的近场用户的空间束来服务更多的远场用户,并通过增加基站antenna数来提高NOMA-assisted大量MIMO网络的性能。
  • results: 研究结果表明,通过NOMA原理可以有效地支持近场和远场通信的合作,并且可以通过增加基站antenna数来提高NOMA-assisted大量MIMO网络的性能。
    Abstract This letter considers a legacy massive multiple-input multiple-output (MIMO) network, in which spatial beams have been preconfigured for near-field users, and proposes to use the non-orthogonal multiple access (NOMA) principle to serve additional far-field users by exploiting the spatial beams preconfigured for the legacy near-field users. Our results reveal that the coexistence between near-field and far-field communications can be effectively supported via NOMA, and that the performance of NOMA-assisted massive MIMO can be efficiently improved by increasing the number of antennas at the base station.
    摘要 这封信件考虑了一个传统的大规模多输入多输出(MIMO)网络,在near-field用户中预先配置了空间扩散,并提议使用非对称多access(NOMA)原则来为更远的far-field用户提供服务,利用预先配置的near-field用户的空间扩散。我们的结果表明,near-field和far-field通信的共存可以通过NOMA进行有效支持,并且通过提高基站antenna数量来提高NOMA-assisted大规模MIMO的性能。
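A minimal numerical illustration of how two users can share one preconfigured beam under power-domain NOMA with successive interference cancellation (SIC) is given below. The channel gains and the power split are assumed values, not results from the letter.

```python
import numpy as np

# Assumed effective beam gains and power split: the near-field user is much stronger.
g_near, g_far = 10.0, 0.5
p_near, p_far = 0.2, 0.8            # power fractions on the shared beam (sum to 1)
noise = 0.1

# Far-field user decodes its own signal, treating the near user's signal as noise.
r_far = np.log2(1 + p_far * g_far / (p_near * g_far + noise))

# Near-field user first decodes and cancels the far user's signal (SIC),
# then decodes its own signal interference-free.
r_far_at_near = np.log2(1 + p_far * g_near / (p_near * g_near + noise))
assert r_far_at_near >= r_far       # SIC feasibility in this toy setting
r_near = np.log2(1 + p_near * g_near / noise)

print(f"far-user rate {r_far:.2f} bps/Hz, near-user rate {r_near:.2f} bps/Hz")
```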

Throughput Analysis of IEEE 802.11bn Coordinated Spatial Reuse

  • paper_url: http://arxiv.org/abs/2309.09169
  • repo_url: None
  • paper_authors: Francesc Wilhelmi, Lorenzo Galati-Giordano, Giovanni Geraci, Boris Bellalta, Gianluca Fontanesi, David Nuñez
  • for: This paper focuses on the Coordinated Spatial Reuse (C-SR) feature of the Multi-Access Point Coordination (MAPC) in the IEEE 802.11bn amendment (Wi-Fi 8).
  • methods: The authors use an analytical model based on Continuous Time Markov Chains (CTMCs) to characterize the throughput and spatial efficiency of C-SR.
  • results: The authors show that C-SR can opportunistically enable parallel high-quality transmissions and achieve an average throughput gain of up to 59% compared to the legacy 802.11 Distributed Coordination Function (DCF) and up to 42% compared to the 802.11ax Overlapping Basic Service Set Packet Detect (OBSS/PD) mechanism.Here is the result in Simplified Chinese text:
    Abstract Multi-Access Point Coordination (MAPC) is becoming the cornerstone of the IEEE 802.11bn amendment, alias Wi-Fi 8. Among the MAPC features, Coordinated Spatial Reuse (C-SR) stands as one of the most appealing due to its capability to orchestrate simultaneous access point transmissions at a low implementation complexity. In this paper, we contribute to the understanding of C-SR by introducing an analytical model based on Continuous Time Markov Chains (CTMCs) to characterize its throughput and spatial efficiency. Applying the proposed model to several network topologies, we show that C-SR opportunistically enables parallel high-quality transmissions and yields an average throughput gain of up to 59% in comparison to the legacy 802.11 Distributed Coordination Function (DCF) and up to 42% when compared to the 802.11ax Overlapping Basic Service Set Packet Detect (OBSS/PD) mechanism.
    摘要 多点存取协调(MAP)正成为IEEE 802.11bn 修订(即Wi-Fi 8)的核心。其中,协调空间重复(C-SR)是最吸引人的功能之一,因为它可以实现低实现 Complexity 下的同时多点变数通信。在本文中,我们对C-SR进行了分析,并基于状态空间过程(CTMC)引入了一个分析模型,以描述它的吞吐率和空间效率。我们将这个模型应用到了多个网络架构上,结果显示,C-SR可以允许高质量的平行传输,并产生了与传统802.11 DCF和802.11ax OBSS/PD Mechanism 相比的吞吐率增加,具体是59%。
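As a toy version of the CTMC-based throughput analysis the paper describes, the sketch below solves the stationary distribution of a four-state chain for two overlapping access points and reads off a throughput figure. The rates and the state space are assumptions chosen only to show the mechanics, not the paper's model.

```python
import numpy as np

lam, mu = 1.0, 2.0   # per-AP packet arrival and service rates (assumed)
# States: 0 = both idle, 1 = AP1 transmits, 2 = AP2 transmits, 3 = both transmit.
# With C-SR, state 3 (coordinated parallel transmissions) is reachable; without
# spatial reuse, transitions into state 3 would simply be forbidden.
Q = np.array([
    [-2 * lam,        lam,        lam,     0.0],
    [      mu, -(mu + lam),       0.0,     lam],
    [      mu,        0.0, -(mu + lam),    lam],
    [     0.0,         mu,         mu, -2 * mu],
])

# Stationary distribution: solve pi Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

throughput = mu * (pi[1] + pi[2] + 2 * pi[3])   # served packets per unit time
print(np.round(pi, 3), f"throughput = {throughput:.3f}")
```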

Sparse Code Multiple Access (SCMA) Technique

  • paper_url: http://arxiv.org/abs/2309.09127
  • repo_url: None
  • paper_authors: Sanjeev Sharma, Kuntal Deka
  • For: This paper introduces and analyzes the code-domain sparse code multiple access (SCMA) non-orthogonal multiple access (NOMA) scheme to enhance the spectral efficiency of wireless networks.
  • Methods: The chapter analyzes the design and detection of an SCMA system, discusses the codebook design method and its impact on system performance, and introduces a hybrid multiple access scheme that combines code-domain and power-domain NOMA.
  • Results: Simulation results show the impact of various SCMA system parameters, such as the number of users, the spreading factor, and the code length, on system performance, and demonstrate the potential of SCMA to enhance the spectral efficiency of wireless networks.
    Abstract Next-generation wireless networks require higher spectral efficiency and lower latency to meet the demands of various upcoming applications. Recently, non-orthogonal multiple access (NOMA) schemes are introduced in the literature for 5G and beyond. Various forms of NOMA are considered like power domain, code domain, pattern division multiple access, etc. to enhance the spectral efficiency of wireless networks. In this chapter, we introduce the code domain-based sparse code multiple access (SCMA) NOMA scheme to enhance the spectral efficiency of a wireless network. The design and detection of an SCMA system are analyzed in this chapter. Also, the method for codebooks design and its impact on system performance are highlighted. A hybrid multiple access scheme is also introduced using both code-domain and power-domain NOMA. Furthermore, simulation results are included to show the impact of various SCMA system parameters.
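To show the sparse code-domain structure the chapter builds on, the sketch below sets up the classic 6-user/4-resource SCMA factor graph (150% overloading) and superposes one sparse codeword per user. The codebooks here are random placeholders; designing good codebooks is exactly the topic the chapter discusses.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, M = 4, 6, 4          # resources, users, codebook size (2 bits per user)

# Factor-graph indicator: F[k, j] = 1 if user j occupies resource k.
# Each user spreads over 2 resources; each resource is shared by 3 users.
F = np.array([[1, 1, 1, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])

# Placeholder codebooks: M sparse K-dim codewords per user, nonzero only where F allows.
codebooks = np.zeros((J, M, K), dtype=complex)
for j in range(J):
    rows = np.flatnonzero(F[:, j])
    codebooks[j][:, rows] = rng.standard_normal((M, 2)) + 1j * rng.standard_normal((M, 2))

# Each user maps 2 bits to one codeword; the channel superposes all users' codewords.
bits = rng.integers(0, M, size=J)          # symbol index per user
received = sum(codebooks[j, bits[j]] for j in range(J))
print(received)
```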

cs.SD - 2023-09-16

Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

  • paper_url: http://arxiv.org/abs/2309.09088
  • repo_url: None
  • paper_authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland
  • for: 本研究旨在提高 vocoder 模型在数据有限情况下的质量,不修改模型结构或添加更多数据。
  • methods: 本研究使用了对mel-spectrogram进行对比学习,以提高 vocoder 模型的语音质量。此外, authors 还尝试了在多Modal情况下使用waveform进行学习,以解决权值逐出问题。
  • results: 研究结果表明,通过对 vocoder 模型进行对比学习,可以在数据有限情况下提高模型性能,并且分析结果表明,提posed方法可以successfully解决权值逐出问题,并生成高质量的语音。
    Abstract Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual quality of the vocoder without modifying its architecture or adding more data. We design an auxiliary task with mel-spectrogram contrastive learning to enhance the utterance-level quality of the vocoder model under data-limited conditions. We also extend the task to include waveforms to improve the multi-modality comprehension of the model and address the discriminator overfitting problem. We optimize the additional task simultaneously with GAN training objectives. Our result shows that the tasks improve model performance substantially in data-limited settings. Our analysis based on the result indicates that the proposed design successfully alleviates discriminator overfitting and produces audio of higher fidelity.
    摘要 很多最新的 vocoder 模型已经取得了很大的进步,可以生成比人类质量更高的真实音频,同时减少了内存需求和计算时间。然而,这些数据夹带的生成模型需要大量的音频数据来学习良好的表示。在这篇论文中,我们使用了对比学习方法来在训练 vocoder 模型中提高模型的感知质量,无需修改模型结构或添加更多的数据。我们设计了一项 auxiliary 任务,通过 mel-spectrogram 对比学习来提高 vocoder 模型在数据有限的情况下的话语质量。我们还将这项任务扩展到包括波形,以提高模型的多模式理解和解决探测器过拟合问题。我们同时优化了这些额外任务和 GAN 训练目标。我们的结果表明,这些任务可以在数据有限情况下提高模型性能的极大程度。我们的分析表明,我们的设计成功解决了探测器过拟合问题,并生成了更高的准确性和音频质量。
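The auxiliary mel-spectrogram contrastive objective can be illustrated with a standard InfoNCE-style loss over batch embeddings, as sketched below. The embedding dimensions, temperature, and the way positive pairs are formed are assumptions; the paper's exact auxiliary task and GAN coupling are not reproduced.

```python
import torch
import torch.nn.functional as F

def mel_contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss: matching mel-spectrogram embeddings (same utterance)
    are pulled together; all other items in the batch act as negatives."""
    a = F.normalize(anchor, dim=-1)          # (batch, dim)
    p = F.normalize(positive, dim=-1)        # (batch, dim)
    logits = a @ p.t() / temperature         # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for the vocoder's mel encoder output.
anchor = torch.randn(8, 128)
positive = anchor + 0.05 * torch.randn(8, 128)   # lightly perturbed views
print(mel_contrastive_loss(anchor, positive).item())
```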

SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

  • paper_url: http://arxiv.org/abs/2309.09085
  • repo_url: None
  • paper_authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan
  • for: 这篇论文的目的是提高电子琴 Tablature Transcription (GTT) 模型的准确性和通用性,以应对现有的数据集规模和范围有限,导致现有的 GTT 模型容易过滤和没有通用性。
  • methods: 作者采用了多个商业电子琴和普通琴插件来生成 SynthTab,一个大规模的电子琴 Tablature Transcription 数据集。这个数据集是基于 DadaGP 提供的广泛的 Tablature 集,并且具有丰富的特征和技巧。
  • results: 实验显示,将先进 GTT 模型在 SynthTab 上进行预训后,可以提高同数据集的准确性,并且在跨数据集评估中具有较好的适应性和减少了过滤问题。
    Abstract Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education and entertainment. Existing datasets are limited in size and scope, causing state-of-the-art GTT models trained on such datasets to suffer from overfitting and to fail in generalization across datasets. To address this issue, we developed a methodology for synthesizing SynthTab, a large-scale guitar tablature transcription dataset using multiple commercial acoustic and electric guitar plugins. This dataset is built on tablatures from DadaGP, which offers a vast collection and the degree of specificity we wish to transcribe. The proposed synthesis pipeline produces audio which faithfully adheres to the original fingerings, styles, and techniques specified in the tablature with diverse timbre. Experiments show that pre-training state-of-the-art GTT model on SynthTab improves transcription accuracy in same-dataset tests. More importantly, it significantly mitigates overfitting problems of GTT models in cross-dataset evaluation.
    摘要 吉他标谱是一种广泛用于吉他演奏的音乐notation,不仅记录了音乐内容,还包括 instru Meyer 和ornamentation。吉他标谱转写(GTT)是一个重要的任务,有广泛的应用在音乐教育和娱乐领域。现有的数据集 limitation 的size和scope,导致现有的GTT模型在这些数据集上进行训练后会出现过拟合和泛化问题。为解决这个问题,我们开发了一种方法ологи для生成 SynthTab,一个大规模的吉他标谱转写数据集,使用多种商业钢琴和电吉他插件。这个数据集基于DadaGP提供的大量标谱,我们可以根据我们的要求进行特定的转写。我们的合成管道可以生成具有多样 timbre 的音频,忠实地实现原始的手套、风格和技巧 specified in the tablature。实验表明,在 SynthTab 上先行训练 state-of-the-art GTT 模型可以提高同一个数据集的转写精度。更重要的是,它可以减轻 GTT 模型在不同数据集之间的过拟合问题。
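The core of any tablature-rendering pipeline is mapping (string, fret) pairs to pitches before synthesis. The hypothetical sketch below uses pretty_midi for that step only; SynthTab itself renders DadaGP tablatures with commercial guitar plugins, so the library choice, the tab events, and the instrument program here are purely illustrative assumptions.

```python
import pretty_midi

# Standard-tuning open-string MIDI pitches: E2 A2 D3 G3 B3 E4 (low to high).
OPEN_STRINGS = [40, 45, 50, 55, 59, 64]

# Hypothetical tab events: (string index 0 = low E, fret, start time s, duration s).
tab = [(0, 3, 0.0, 0.5), (1, 2, 0.5, 0.5), (2, 0, 1.0, 0.5), (3, 0, 1.0, 0.5)]

pm = pretty_midi.PrettyMIDI()
guitar = pretty_midi.Instrument(program=25)   # steel-string acoustic guitar
for string, fret, start, dur in tab:
    pitch = OPEN_STRINGS[string] + fret       # each fret raises pitch by one semitone
    guitar.notes.append(pretty_midi.Note(velocity=90, pitch=pitch,
                                         start=start, end=start + dur))
pm.instruments.append(guitar)
pm.write("tab_rendering.mid")                 # feed to any synth/plugin downstream
```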

Music Generation based on Generative Adversarial Networks with Transformer

  • paper_url: http://arxiv.org/abs/2309.09075
  • repo_url: None
  • paper_authors: Ziyi Jiang, Yi Zhong, Ruoxue Wu, Zhenghan Chen, Xiaoxuan Liang
  • for: 本研究旨在提高基于Transformers的自动生成音乐作品的质量,并且减少曝光偏见的影响。
  • methods: 我们使用了一种基于GAN框架的敌方损失函数,并使用了一个预训练的Span-BERT模型作为推论器。我们还使用了Gumbel-Softmax trick来实现整数序列的可微分化。
  • results: 我们通过人工评估和引入一种新的探测指标,证明了我们的方法比基于likelihood最大化的基eline模型具有更高的质量。
    Abstract Autoregressive models based on Transformers have become the prevailing approach for generating music compositions that exhibit comprehensive musical structure. These models are typically trained by minimizing the negative log-likelihood (NLL) of the observed sequence in an autoregressive manner. However, when generating long sequences, the quality of samples from these models tends to significantly deteriorate due to exposure bias. To address this issue, we leverage classifiers trained to differentiate between real and sampled sequences to identify these failures. This observation motivates our exploration of adversarial losses as a complement to the NLL objective. We employ a pre-trained Span-BERT model as the discriminator in the Generative Adversarial Network (GAN) framework, which enhances training stability in our experiments. To optimize discrete sequences within the GAN framework, we utilize the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. Additionally, we partition the sequences into smaller chunks to ensure that memory constraints are met. Through human evaluations and the introduction of a novel discriminative metric, we demonstrate that our approach outperforms a baseline model trained solely on likelihood maximization.
    摘要 自适应模型基于Transformers已成为生成具有完整音乐结构的乐曲主要方法。这些模型通常通过逐步式拟合方式进行训练,以最小化负对数梯度(NLL)为目标。然而,在生成长序列时,这些模型的样本质量往往会受到曝光偏见的影响,导致样本质量下降。为 Addressing this issue, we leveraged classifiers trained to distinguish between real and sampled sequences to identify these failures. This observation motivates our exploration of adversarial losses as a complement to the NLL objective. We employed a pre-trained Span-BERT model as the discriminator in the Generative Adversarial Network (GAN) framework, which enhances training stability in our experiments. To optimize discrete sequences within the GAN framework, we utilized the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. Additionally, we partitioned the sequences into smaller chunks to ensure that memory constraints were met. Through human evaluations and the introduction of a novel discriminative metric, we demonstrated that our approach outperformed a baseline model trained solely on likelihood maximization.
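The Gumbel-Softmax trick mentioned in the abstract can be demonstrated in a few lines of PyTorch: the forward pass yields (approximately) discrete token samples while gradients still reach the generator's logits. The vocabulary size and the placeholder discriminator below are assumptions.

```python
import torch
import torch.nn.functional as F

# Logits over a vocabulary of 128 musical tokens for 4 sequence positions (assumed sizes).
logits = torch.randn(4, 128, requires_grad=True)

# Straight-through Gumbel-Softmax: the forward pass yields one-hot samples,
# while gradients flow through the underlying softmax relaxation.
samples = F.gumbel_softmax(logits, tau=1.0, hard=True)

# The one-hot samples can be fed to a discriminator; gradients reach `logits`.
fake_score = samples.sum()        # placeholder for a discriminator output
fake_score.backward()
print(logits.grad.shape)          # torch.Size([4, 128])
```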

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

  • paper_url: http://arxiv.org/abs/2309.09028
  • repo_url: None
  • paper_authors: Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu
  • for: 提高噪音环境下的语音信号质量
  • methods: 使用预训练的生成方法重新生成干净的语音信号
  • results: 实验表明,使用代码生成器可以获得更高的主观分数,并且生成的语音质量更高,噪音和反射减少。
    Abstract Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. Especially by leveraging codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality, reduced background noise, and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.
    摘要 增强语音信号质量在不利的听音环境中是一个长期挑战的问题。现有的深度学习基于的增强方法经常在实际场景中不能有效地除去背景噪声和反射,从而影响听众体验。为解决这些挑战,我们提出了一种新的方法,使用预训练的生成方法将清晰、无反射的语音重新生成出来。这项研究利用预训练的 vocoder 或 codec 模型来生成高质量的语音,同时提高了对挑战性场景的抗性。生成方法可以有效处理语音信号中的信息损失,从而生成具有提高的听音质量和减少的artefacts的语音。通过利用预训练模型的能力,我们实现了原始语音的忠实复制在不利条件下。实验评估表明,我们的提议方法在模拟数据集和实际采样中具有显著的效果和稳定性。特别是通过利用 codec,我们在模拟和实际录音中获得了更高的主观评分。生成的语音具有提高的听音质量、减少的背景噪声和反射。我们的发现表明,预训练的生成技术在语音处理中具有潜在的潜力,特别是在传统方法失效的场景下。 Demo 可以在 中找到。

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

  • paper_url: http://arxiv.org/abs/2309.08876
  • repo_url: None
  • paper_authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
  • for: 提高自动语音识别(ASR)模型的精度和效率,使其可以使用文本数据进行训练。
  • methods: 采用decoder-only架构,使用简单的文本扩充,并使用CTC预测来提供音频信息。
  • results: 在LibriSpeech和Switchboard datasets上,提出的模型比普通CTC预测减少了0.3%和1.4%的单词错误率,并在LibriSpeech 100h和Switchboard训练场景中超过了传统的encoder-decoder ASR模型。
    Abstract Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR with simple text augmentation. To provide audio information, encoder features compressed by CTC prediction are used as prompts for the decoder, which can be regarded as refining CTC prediction using the decoder-only model. Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training. An experimental comparison using LibriSpeech and Switchboard shows that our proposed models with text augmentation training reduced word error rates from ordinary CTC by 0.3% and 1.4% on the LibriSpeech test-clean and test-other sets, respectively, and 2.9% and 5.0% on Switchboard and CallHome. The proposed model had an advantage in computational efficiency compared with conventional encoder-decoder ASR models with a similar parameter setup, and outperformed them in the LibriSpeech 100h and Switchboard training scenarios.
    摘要 收集音频文本对是costly的;但是可以轻松地获取文本数据。除非使用浅层融合,否则末端自动语音识别(ASR)模型需要建筑修改或额外训练方式来使用文本数据。受最近的语言模型(LM)的进步启发,我们提议使用decoder-only架构 для ASR,并使用简单的文本扩展。为了提供音频信息,encoder特征被CTC预测压缩后用作decoder的激活器,可以视为通过decoder-only模型来更正CTC预测。由于decoder架构与 autoregressive LM 相同,因此可以通过外部文本数据进行LM训练来增强模型。我们在LibriSpeech和Switchboard上进行了实验比较,发现我们提议的模型在文本扩展训练下降低了word error rate(PER)by 0.3%和1.4%在LibriSpeech test-clean和test-otherSet上,并且在Switchboard和CallHome上降低了2.9%和5.0%。此外,我们的模型在计算效率方面具有优势,并在LibriSpeech 100h和Switchboard训练enario上超过了传统的末端encoder-decoder ASR模型。
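One common way to "compress encoder features by CTC prediction" is to keep only the frames whose greedy CTC output is a new non-blank token and hand those frames to the decoder as prompts. The sketch below shows that idea with toy tensors; it is a plausible reading of the abstract, not the paper's exact implementation.

```python
import torch

def ctc_prompt(encoder_out, ctc_logits, blank_id=0):
    """Greedy CTC compression: keep encoder frames whose CTC prediction is a new
    non-blank token, and use them as prompts for a decoder-only LM."""
    pred = ctc_logits.argmax(dim=-1)                       # (T,) greedy token per frame
    keep = (pred != blank_id) & torch.cat(
        [torch.tensor([True]), pred[1:] != pred[:-1]])     # drop blanks and repeats
    return encoder_out[keep], pred[keep]                   # compressed features + tokens

# Toy tensors standing in for a speech encoder and its CTC head.
T, D, V = 50, 256, 32
encoder_out = torch.randn(T, D)
ctc_logits = torch.randn(T, V)
prompt_feats, prompt_tokens = ctc_prompt(encoder_out, ctc_logits)
print(prompt_feats.shape, prompt_tokens)
```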

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

  • paper_url: http://arxiv.org/abs/2309.08839
  • repo_url: None
  • paper_authors: Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao
  • for: 这个论文主要针对 audio-to-text 模式下的跨模态检索问题,即使用 audio clips 和文本进行对应。
  • methods: 该论文提出了一种新的 Contrastive Latent Space Reconstruction Learning (CLSR) 方法,它在对比表示学习中考虑了内模态分离性,并采用了 adaptive temperature control 策略。此外,该方法还包含了模态交互的latent representation reconstruction模块。
  • results: 对两个 audio-text 数据集进行比较,CLSR 方法表现出了较高的效果,胜过了一些当前最佳方法。
    Abstract Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty to uncover discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only the cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades the performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, the latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR.
    摘要 跨模式检索(CMR)在不同领域得到了广泛应用,如多媒体搜索引擎和推荐系统。现有的大多数CMR方法强调图像到文本检索,而听音到文本检索则是一个未得到充分发展的领域,这主要是因为听音clip和文本之间找到特征点具有很大的挑战性。现有的研究受到以下两种限制:1. 大多数研究人员采用对偶学习来构建共同的特征空间,以便在数据之间可以测量相似性。然而,他们只考虑了跨模式变换,忽略了内模态分离性。此外,温度参数不适应性地调整,这会下降性能。2. 这些方法不会考虑隐藏表示的重建,这是必要的 дляsemantic alignment。本文提出了一种新的听音到文本 oriented CMR方法,称为对偶特征空间重建学习(CLSR)。CLSR方法改进了对偶表示学习,通过考虑内模态分离性和适应性温度控制策略。此外,模态交互模块被引入到CMR框架中,以提高模态交互。对两个音频到文本数据集进行比较 экспериментирова, Validated the superiority of CLSR。

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

  • paper_url: http://arxiv.org/abs/2309.08837
  • repo_url: None
  • paper_authors: Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao
  • for: 这篇论文旨在 integrate graph-to-sequence 到一个终端文本至语音框架中,以实现 syntax-aware 模型化。
  • methods: 这篇论文使用了 dependency parsing 模块将输入文本解析成一个 sintactic graph,然后使用 graph encoder 对这个 sintactic graph 进行编码,提取 sintactic hidden information,并与 phoneme embedding 进行拼接,并输入到 alignment 和 flow-based decoding 模块中,生成 raw audio waveform。
  • results: 实验结果表明,这种模型可以提供更好的语音合成效果,并且在 subjective prosodic evaluation 中获得了更高的分数。此外,模型还可以进行voice conversion。此外,通过 AI chip operator 的设计,模型的效率得到了5x的加速。
    Abstract This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with the phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using a single-speaker dataset, a few-sample target-speaker dataset, and a multi-speaker dataset, respectively. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. Besides, the efficiency of the model is largely boosted through the design of an AI chip operator with 5x acceleration.
    摘要 这篇论文将graph-to-sequence integrate到了一个端到端的文本到语音框架中,以实现文本的 syntax-aware 模型化。具体来说,输入文本首先被依赖分析模块解析,形成一个语法图。然后,语法图被图编码器编码,以提取语法隐藏信息。这些隐藏信息与phoneme embedding相加,并输入到对齐和流程基于解码模块中,以生成原始的音频波形。模型在英语和普通话两种语言上进行了实验,使用单个说话者、少量目标说话者和多个说话者的数据集,分别进行了实验。实验结果表明,模型可以更好地保持输入文本和生成的音频波形之间的PROSODIC 一致性,并在主观的PROSODIC 评价中获得更高的分数。此外,模型的效率得到了通过AI芯片运算符的5倍加速的大幅提升。
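The input to such a graph encoder is essentially the adjacency structure produced by a dependency parser. The sketch below builds that adjacency matrix with spaCy; spaCy and the specific model name are assumptions for illustration, since the paper does not prescribe a particular parser here.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")          # any dependency parser would do
doc = nlp("The quick brown fox jumps over the lazy dog")

n = len(doc)
adj = np.zeros((n, n), dtype=int)
for tok in doc:
    if tok.i != tok.head.i:                 # skip the root's self-loop
        adj[tok.head.i, tok.i] = 1          # head -> dependent edge
        adj[tok.i, tok.head.i] = 1          # treat the syntactic graph as undirected

# `adj`, together with token/phoneme embeddings, is what a graph encoder would consume.
print(adj)
```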

Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints

  • paper_url: http://arxiv.org/abs/2309.08828
  • repo_url: None
  • paper_authors: Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee
  • for: 该研究旨在提出一种多语言自动语音识别(ASR)系统,以利用语音生成器的知识来提高系统的性能。
  • methods: 该研究使用了一种基于语音特征的 attribute-to-phoneme 映射方法,将知识基于生成器的特征映射到输出phoneme上,以限制系统的预测。
  • results: 该研究在多种语言的测试数据上进行了比较,并发现了与传统多语言方法相比,提出的解决方案能够提高系统的性能,平均提高6.85%。此外,研究还发现了该解决方案能够消除与特征不一致的phoneme预测。
    Abstract We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally" across all spoken languages, referred to as speech attributes, namely manner and place of articulation. Specifically, several deterministic attribute-to-phoneme mapping matrices are constructed based on the predefined set of universal attribute inventory, which projects the knowledge-rich articulatory attribute logits, into output phoneme logits. The mapping puts knowledge-based constraints to limit inconsistency with acoustic-phonetic evidence in the integrated prediction. Combined with phoneme recognition, our phone recognizer is able to infer from both attribute and phoneme information. The proposed joint multilingual model is evaluated through phoneme recognition. In multilingual experiments over 6 languages on benchmark datasets LibriSpeech and CommonVoice, we find that our proposed solution outperforms conventional multilingual approaches with a relative improvement of 6.85% on average, and it also demonstrates a much better performance compared to monolingual model. Further analysis conclusively demonstrates that the proposed solution eliminates phoneme predictions that are inconsistent with attributes.
    摘要 我们提出一个初步的多语言端到端自动语音识别(ASR)方法,通过 интеGRATE知识About speech articulators。关键思想是利用一个丰富的基本单元,可以在所有的口语语言中 Universally defined,称为speech attributes,namely manner and place of articulation。特别是,我们构建了一些决定性的 attribute-to-phoneme mapping矩阵,基于预定的universal attribute inventory,将知识医学特征logits项目到输出phoneme logits。这种映射带有知识基础的约束,以限制与语音-phonetic证据的不一致。与phoneme recognition结合,我们的电话识别器能够从both attribute和phoneme信息中进行推理。我们提出的联合多语言模型在LibriSpeech和CommonVoice多语言测试集上进行了phoneme recognition测试,并 obtAIN了相对改善6.85%的平均提升,以及和单语言模型的较好表现。进一步的分析表明,我们的解决方案可以消除与 attribute不一致的phoneme预测。

cs.CV - 2023-09-16

FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector

  • paper_url: http://arxiv.org/abs/2309.09083
  • repo_url: https://github.com/qiqianfu/framemae
  • paper_authors: Qiqian Fu, Guanhong Wang, Gaoang Wang
  • for: 本研究提出了一种框架重建模型:FrameRS,用于自动重建视频帧。
  • methods: 模型包括一个自我supervised的视频帧重建器和一个关键帧选择器。帧重建器FrameMAE是基于图像MAE的原理,适应视频上下文。关键帧选择器是基于CNN结构,通过从encoder中获取高级别semantic信息,可以低计算成本预测关键帧。
  • results: 模型可以有效地压缩视频clip,保留约30%的重要帧。性能方面,我们的模型在计算效率和竞争准确性方面表现出色,与传统的关键帧提取算法相比有所提升。代码可以在Github上下载。
    Abstract In this paper, we present a frame reconstruction model, FrameRS. It consists of a self-supervised video frame reconstructor and a key frame selector. The frame reconstructor, FrameMAE, is developed by adapting the principles of the Masked Autoencoder for Images (MAE) to the video context. The key frame selector, Frame Selector, is built on a CNN architecture. By taking the high-level semantic information from the encoder of FrameMAE as its input, it can predict the key frames with low computation costs. Integrated with our bespoke Frame Selector, FrameMAE can effectively compress a video clip by retaining approximately 30% of its pivotal frames. Performance-wise, our model showcases computational efficiency and competitive accuracy, marking a notable improvement over traditional key frame extraction algorithms. The implementation is available on GitHub.
    摘要 本文提出了一种框架重建模型:FrameRS。它包含自我超级视频框架重建器和关键帧选择器。框架重建器 FrameMAE 是基于图像隐藏autoencoder(MAE)的视频上的应用,而关键帧选择器 Frame Selector 是基于卷积神经网络架构。通过将高级 semantic 信息从 FrameMAE 的解码器作为输入,Frame Selector 可以预测关键帧,计算成本低。将 FrameMAE 与我们自己的 Frame Selector 结合使用,可以有效地压缩视频clip,保留约 30% 的关键帧。在性能方面,我们的模型表现出了计算效率和竞争性准确率,代表了传统关键帧提取算法的 Notable Improvement。实现可以在 Github 上找到。

Multi-camera Bird’s Eye View Perception for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.09080
  • repo_url: None
  • paper_authors: David Unger, Nikhil Gosala, Varun Ravi Kumar, Shubhankar Borse, Abhinav Valada, Senthil Yogamani
  • for: 这 paper 主要是为了探讨多摄像头基于深度学习模型在逻辑投影空间(BEV)中的物体表示方法。
  • methods: 这 paper 使用的方法主要是基于深度学习模型,将摄像头图像转换到逻辑投影空间(BEV)中,并使用几何约束来保证转换的准确性。
  • results: 这 paper 的结果表明,使用深度学习模型对摄像头图像的转换在逻辑投影空间(BEV)中可以实现更高的准确性和灵活性,并且可以与其他感知器结合进行有效的感知融合。
    Abstract Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring a complete 360° coverage in near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable the spatial reasoning of other agents and structures for optimal path planning. The 3D space is typically simplified to the BEV space by omitting the less relevant Z-coordinate, which corresponds to the height dimension. The most basic approach to achieving the desired BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface. Surround-view systems, which are common in new vehicles, use the IPM principle to generate a BEV image and show it on a display to the driver. However, this approach is not suited for autonomous driving since there are severe distortions introduced by this too-simplistic transformation method. More recent approaches use deep neural networks to output directly in BEV space. These methods transform camera images into BEV space using geometric constraints implicitly or explicitly in the network. As CNNs have more context information and a learnable transformation can be more flexible and adapt to image content, deep learning-based methods set the new benchmark for BEV transformation and achieve state-of-the-art performance. First, this chapter discusses the contemporary trends of multi-camera-based DNN (deep neural network) models outputting object representations directly in the BEV space. Then, we discuss how this approach can extend to effective sensor fusion and coupling downstream tasks like situation analysis and prediction. Finally, we show challenges and open problems in BEV perception.
    摘要 现代自动驾驶系统通常包括多种感知器,包括数个摄像头、雷达和LiDAR,以确保完整的360度覆盖 both near and far regions。不同于雷达和LiDAR,摄像头会 Capture a 2D perspective projection with inherent depth ambiguity。然而,以便实现最佳路径规划,需要生成3D感知输出。为了简化3D空间,通常会将Z坐标(高度维度)排除,得到BEV空间(bird's eye view)。将摄像头图像转换为BEV空间的最基本方法是IPM(平面地面假设)。许多新车型的围视系统都使用IPM原理生成BEV图像,并将其显示给司机。然而,这种方法不适用于自动驾驶,因为它会引入严重的扭曲。更近期的方法使用深度神经网络直接将摄像头图像转换为BEV空间。这些方法使用摄像头图像中的几何约束,以及深度神经网络学习的变换,以生成BEV图像。由于神经网络具有更多的内容信息和可学习的变换,深度学习基于方法已经设置了新的标准 дляBEV转换,并实现了状态前景性的表现。本章首先介绍了当代多摄像头基于神经网络(deep neural network)模型,输出对象表示直接在BEV空间。然后,我们讨论了如何扩展到有效的感知融合和下游任务,如情况分析和预测。最后,我们介绍了BEV感知的挑战和开放问题。
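As a minimal sketch of the IPM baseline the chapter contrasts deep methods against, the code below back-projects pixels onto a flat ground plane via the ground-plane homography. The intrinsics, camera height, and pitch are assumed values; real surround-view setups require calibrated extrinsics per camera.

```python
import numpy as np

# Toy camera: 1.5 m above a flat ground plane (Z = 0), pitched down by 10 degrees.
# World axes: X forward, Y left, Z up. All values are assumptions for illustration.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
h, pitch = 1.5, np.deg2rad(10.0)
sa, ca = np.sin(pitch), np.cos(pitch)
R = np.array([[0.0, -1.0, 0.0],      # rows: camera x (right), y (down), z (optical axis)
              [-sa,  0.0, -ca],
              [ ca,  0.0, -sa]])
t = -R @ np.array([0.0, 0.0, h])     # camera center at (0, 0, h) in world coordinates

# For ground points (Z = 0) the projection collapses to a 3x3 homography:
#   pixel ~ K [r1 r2 t] [X, Y, 1]^T, with r1, r2 the first two columns of R.
H = K @ np.column_stack((R[:, 0], R[:, 1], t))
H_inv = np.linalg.inv(H)

def pixel_to_ground(u, v):
    """Back-project a pixel onto the flat ground plane (the core IPM assumption)."""
    g = H_inv @ np.array([u, v, 1.0])
    return g[0] / g[2], g[1] / g[2]  # (X, Y) in the bird's-eye-view frame

print(pixel_to_ground(640.0, 600.0))  # a pixel below the horizon maps to a ground point
```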

Unsupervised Green Object Tracker (GOT) without Offline Pre-training

  • paper_url: http://arxiv.org/abs/2309.09078
  • repo_url: None
  • paper_authors: Zhiruo Zhou, Suya You, C. -C. Jay Kuo
  • for: 提高单个目标跟踪精度,降低标注成本和计算复杂性,实现灵活可 deployment on edge devices。
  • methods: ensemble of three prediction branches:1) 全局对象基 correlator,2) 本地 patch-based correlator,3) superpixel-based segmentator,使用了简单的模型和低计算复杂性。
  • results: 与现有的半监督跟踪器相当,需要大量的Offline预训练,但GOT具有较低的计算复杂性和小型模型大小,可以轻松部署于移动和边缘设备。
    Abstract Supervised trackers trained on labeled data dominate the single object tracking field owing to their superior tracking accuracy. The labeling cost and the huge computational complexity hinder their applications on edge devices. Unsupervised learning methods have also been investigated to reduce the labeling cost, but their complexity remains high. Aiming at lightweight high-performance tracking, feasibility without offline pre-training, and algorithmic transparency, we propose a new single object tracking method, called the green object tracker (GOT), in this work. GOT conducts an ensemble of three prediction branches for robust box tracking: 1) a global object-based correlator to predict the object location roughly, 2) a local patch-based correlator to build temporal correlations of small spatial units, and 3) a superpixel-based segmentator to exploit the spatial information of the target frame. GOT offers competitive tracking accuracy with state-of-the-art unsupervised trackers, which demand heavy offline pre-training, at a lower computation cost. GOT has a tiny model size (<3k parameters) and low inference complexity (around 58M FLOPs per frame). Since its inference complexity is between 0.1% and 10% of that of DL trackers, it can be easily deployed on mobile and edge devices.
    摘要 <>将文本翻译成简化中文。<>supervised trackers在标注数据上训练的场景下占据单个对象跟踪领域的先导地位,因为标签成本和计算复杂性而降低其应用于边缘设备。不supervised learning方法也被研究以减少标签成本,但它们的复杂度仍然高。为了实现轻量级高性能的跟踪,不需要线上预训练、算法透明度和可行性,我们在这里提出了一种新的单对象跟踪方法,称为绿色对象跟踪器(GOT)。GOT使用三个预测分支来提供粗略对象位置预测和高精度跟踪:1)全局对象基于相关器来预测对象位置,2)本地小区域基于相关器来建立时间相关性,3)超像素基于分割器来利用目标帧中的空间信息。GOT可以与现有的无监督跟踪器相比,它们需要大量的线上预训练,并且具有较低的计算成本(<3k参数)和低的计算复杂度(约58M FLOPs每帧)。由于其计算复杂度在0.1%-10%之间,因此它可以轻松部署在移动和边缘设备上。

MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer

  • paper_url: http://arxiv.org/abs/2309.09067
  • repo_url: https://github.com/fudong03/mmst-vit
  • paper_authors: Fudong Lin, Summer Crawford, Kaleb Guillot, Yihe Zhang, Yan Chen, Xu Yuan, Li Chen, Shelby Williams, Robert Minvielle, Xiangming Xiao, Drew Gholson, Nicolas Ashwell, Tri Setiyono, Brenda Tubana, Lu Peng, Magdy Bayoumi, Nian-Feng Tzeng
  • for: 预测美国县级作物产量,考虑植物生长季节的天气变化和气候变化对作物的影响。
  • methods: 我们开发了一种深度学习基于解决方案,即多Modal Spatial-Temporal Vision Transformer(MMST-ViT),利用视觉遥感数据和短期天气数据来模型植物生长季节的天气变化对作物生长的影响。
  • results: 我们的MMST-ViT在200个美国县的实验中表现出色,与三个性能指标之间的比较结果都高于其他相似方法。
    Abstract Precise crop yield prediction provides valuable information for agricultural planning and decision-making processes. However, timely predicting crop yields remains challenging as crop growth is sensitive to growing season weather variation and climate change. In this work, we develop a deep learning-based solution, namely Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT), for predicting crop yields at the county level across the United States, by considering the effects of short-term meteorological variations during the growing season and the long-term climate change on crops. Specifically, our MMST-ViT consists of a Multi-Modal Transformer, a Spatial Transformer, and a Temporal Transformer. The Multi-Modal Transformer leverages both visual remote sensing data and short-term meteorological data for modeling the effect of growing season weather variations on crop growth. The Spatial Transformer learns the high-resolution spatial dependency among counties for accurate agricultural tracking. The Temporal Transformer captures the long-range temporal dependency for learning the impact of long-term climate change on crops. Meanwhile, we also devise a novel multi-modal contrastive learning technique to pre-train our model without extensive human supervision. Hence, our MMST-ViT captures the impacts of both short-term weather variations and long-term climate change on crops by leveraging both satellite images and meteorological data. We have conducted extensive experiments on over 200 counties in the United States, with the experimental results exhibiting that our MMST-ViT outperforms its counterparts under three performance metrics of interest.
    摘要 precise 农作物产量预测提供了重要的农业规划和决策过程中的信息。然而,在季节变化和气候变化的影响下,准确地预测农作物产量仍然是一项挑战。在这种情况下,我们开发了一种深度学习基于解决方案,即多Modal空间时间变换器(MMST-ViT),用于预测美国各县的农作物产量,并考虑了季节变化的短期天气影响和气候变化对农作物的影响。具体来说,我们的MMST-ViT包括多Modal变换器、空间变换器和时间变换器。多Modal变换器利用了远程感知数据和季节变化天气数据来模型季节变化对农作物生长的影响。空间变换器学习了高分辨率的空间相关性,以便准确地跟踪农作物的生长。时间变换器捕捉了长期时间相关性,以便学习气候变化对农作物的影响。此外,我们还开发了一种新的多Modal对比学习技术,以不需要大量的人工监督来预处理我们的模型。因此,我们的MMST-ViT可以 capture季节变化和气候变化对农作物的影响,并且在美国200多个县的实验结果表明,我们的MMST-ViT在三个关键性能指标上表现出色。

Sub-action Prototype Learning for Point-level Weakly-supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2309.09060
  • repo_url: None
  • paper_authors: Yueyang Li, Yonghong Hou, Wanqing Li
  • for: 提高点级弱监督时间动作地理化(PWTAL)的性能,使其能够基于唯一时间戳注释每个动作实例。
  • methods: 提出了一种新的子动作原型学习框架(SPL-Loc),包括子动作原型归一化(SPC)和顺序原型对齐(OPA)。 SPC 适应性地提取了表现出时间尺度和空间内容变化的动作实例的表示性质。 OPA 选择了相关的原型,以提供完整性提示符 дляpseudo标签生成。
  • results: 与现有SOTA PWTAL方法进行了广泛的实验,并显示了提档SPL-Loc可以准确地地理化动作边界。
    Abstract Point-level weakly-supervised temporal action localization (PWTAL) aims to localize actions with only a single timestamp annotation for each action instance. Existing methods tend to mine dense pseudo labels to alleviate the label sparsity, but overlook the potential sub-action temporal structures, resulting in inferior performance. To tackle this problem, we propose a novel sub-action prototype learning framework (SPL-Loc) which comprises Sub-action Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC adaptively extracts representative sub-action prototypes which are capable of perceiving the temporal scale and spatial content variation of action instances. OPA selects relevant prototypes to provide a completeness clue for pseudo label generation by applying a temporal alignment loss. As a result, pseudo labels are derived from alignment results to improve action boundary prediction. Extensive experiments on three popular benchmarks demonstrate that the proposed SPL-Loc significantly outperforms existing SOTA PWTAL methods.
    摘要 <>转换文本为简化中文。<>点级弱监视时间动作地点(PWTAL)目标是在每个动作实例只有一个时间标签的情况下地址动作。现有方法通常是利用密集假标签来减轻标签稀疏性,但是忽略了可能存在的次动作时间结构,导致性能较差。为解决这个问题,我们提出了一种新的子动作原型学习框架(SPL-Loc),它包括子动作原型聚类(SPC)和有序原型对齐(OPA)。SPC可适应性EXTRACT Representative sub-action prototypes, which are capable of perceiving the temporal scale and spatial content variation of action instances. OPA选择相关的原型来提供完整的假标签生成的准备,通过应用时间对齐损失。因此,假标签来自对齐结果进行改进动作边界预测。广泛的实验表明,提出的 SPL-Loc 明显超过现有SOTA PWTAL方法。

Microscale 3-D Capacitance Tomography with a CMOS Sensor Array

  • paper_url: http://arxiv.org/abs/2309.09039
  • repo_url: None
  • paper_authors: Manar Abdelatty, Joseph Incandela, Kangping Hu, Joseph W. Larkin, Sherief Reda, Jacob K. Rosenstein
  • for: 这个论文主要是用来描述电容 Tomatoesography(ECT)技术在微型系统中的应用。
  • methods: 该论文使用了CMOS微电极阵列来实现ECT成像,并提出了一种深度学习架构和改进的多目标训练方法来重建射电常数图像。
  • results: 实验结果表明,提议的方法能够高精度地重建微型系统中的3D结构,包括精确地测量微球体积和细菌生物胶囊的尺寸。 predictions accuracy为91.5%和82.7%。
    Abstract Electrical capacitance tomography (ECT) is a nonoptical imaging technique in which a map of the interior permittivity of a volume is estimated by making capacitance measurements at its boundary and solving an inverse problem. While previous ECT demonstrations have often been at centimeter scales, ECT is not limited to macroscopic systems. In this paper, we demonstrate ECT imaging of polymer microspheres and bacterial biofilms using a CMOS microelectrode array, achieving spatial resolution of 10 microns. Additionally, we propose a deep learning architecture and an improved multi-objective training scheme for reconstructing out-of-plane permittivity maps from the sensor measurements. Experimental results show that the proposed approach is able to resolve microscopic 3-D structures, achieving 91.5% prediction accuracy on the microsphere dataset and 82.7% on the biofilm dataset, including an average of 4.6% improvement over baseline computational methods.
    摘要 电容测量探测技术(ECT)是一种非光学图像技术,可以测量物体内部电容 coefficient的地图,并解决一个倒逼问题。而在过去的ECT示范中,通常是在厘米级别进行,但ECT并不限于巨观系统。在这篇论文中,我们使用CMOS微电极阵列进行ECT探测,实现了10μ米的空间分辨率。此外,我们提出了深度学习架构和改进的多目标训练方案,用于从传感器测量数据中重建垂直电容地图。实验结果表明,我们的方法能够分解微观三维结构,达到91.5%的预测精度(在微球数据集上)和82.7%的预测精度(在生物质层数据集上),其中平均与基线计算方法相差4.6%。
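A classic baseline for the inverse problem the paper's deep network addresses is linearized, Tikhonov-regularized reconstruction from boundary capacitance measurements, sketched below. The sensitivity matrix here is a random stand-in; in practice it comes from an electrostatic forward model of the electrode array.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_vox = 120, 400      # boundary capacitance measurements, voxel unknowns (assumed)

# J: linearized sensitivity (Jacobian) of capacitance w.r.t. voxel permittivity.
J = rng.standard_normal((n_meas, n_vox))
x_true = np.zeros(n_vox)
x_true[180:220] = 1.0         # a small permittivity perturbation (e.g., a microsphere)
y = J @ x_true + 0.01 * rng.standard_normal(n_meas)

# Tikhonov-regularized least squares:
#   x = argmin ||J x - y||^2 + lam ||x||^2  =>  (J^T J + lam I) x = J^T y
lam = 1.0
x_hat = np.linalg.solve(J.T @ J + lam * np.eye(n_vox), J.T @ y)
print(np.round(x_hat[175:225], 2))
```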

RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework

  • paper_url: http://arxiv.org/abs/2309.09003
  • repo_url: None
  • paper_authors: Yuelei Wang, Ting Zhang, Liangjin Zhao, Lin Hu, Zhechao Wang, Ziqing Niu, Peirui Cheng, Kaiqiang Chen, Xuan Zeng, Zhirui Wang, Hongqi Wang, Xian Sun
  • for: 这个论文旨在提出一个轻量级的RS类型视觉基础模型,以便在边缘设备上进行RS影像解释。
  • methods: 这个论文使用了一个CNN-Transformer混合架构,具有一个双支结构,其中使用了Transformer模组作为低通架构,以EXTRACTRS影像的全球特征;而CNN模组则被用作堆叠高通架构,以EXTRACTRS影像的细节特征。
  • results: 相比于RingMo,这个提案的RingMo-lite将参数减少了大约60%,并在不同的RS影像解释任务中保持了缩减的几成比,而且在大多数场景下,其精度下降了不到2%。此外,这个研究将在未来与MindSpore computng平台集成。
    Abstract In recent years, remote sensing (RS) vision foundation models such as RingMo have emerged and achieved excellent performance in various downstream tasks. However, the high demand for computing resources limits the application of these models on edge devices. It is necessary to design a more lightweight foundation model to support on-orbit RS image interpretation. Existing methods face challenges in achieving lightweight solutions while retaining generalization in RS image interpretation. This is due to the complex high- and low-frequency spectral components in RS images, which make traditional single CNN or Vision Transformer methods unsuitable for the task. Therefore, this paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework, which effectively exploits the frequency-domain properties of RS to optimize the interpretation process. It combines a Transformer module, which acts as a low-pass filter extracting global features of RS images through a dual-branch structure, with a CNN module, which acts as a stacked high-pass filter extracting fine-grained details effectively. Furthermore, in the pretraining stage, the designed frequency-domain masked image modeling (FD-MIM) combines each image patch's high-frequency and low-frequency characteristics, effectively capturing the latent feature representation in RS data. Compared with RingMo, the proposed RingMo-lite reduces the parameters by over 60% in various RS image interpretation tasks, while the average accuracy drops by less than 2% in most scenes, and it achieves SOTA performance compared to models of similar size. In addition, our work will be integrated into the MindSpore computing platform in the near future.
    摘要 在近年,远程感知(RS)视觉基础模型如RingMo出现并在各种下游任务中表现出色。然而,计算资源的高需求限制了这些模型在边缘设备上的应用。为了解决这个问题,这篇论文提出了RingMo-lite,一种RS多任务轻量级网络,它采用了CNN-Transformer混合框架,并且有效地利用RS图像的频率频谱特性来优化解释过程。RingMo-lite由Transformer模块作为低通滤波器,EXTRACTRS图像的全面特征,而CNN模块作为堆叠高通滤波器,EXTRACTRS图像的细腻细节。此外,在预训练阶段,我们设计了频率频谱遮盲图像模型(FD-MIM),该模型可以有效地捕捉RS数据中各个图像块的高频和低频特征,从而获得RS数据的秘密特征表示。根据图1,相比RingMo,我们提出的RingMo-lite减少了参数超过60%,在各种RS图像解释任务中,均低于2%的场景下,保持了SOTA的性能。此外,我们计划将这些工作与MindSpore计算平台集成。
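The low-pass/high-pass split that motivates the hybrid design can be illustrated with a simple FFT-based frequency partition of an image patch, as below. The patch size and cut-off radius are assumptions; the paper's modules learn these roles rather than hard-coding a mask.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))          # stand-in for a remote-sensing patch

# Centered 2-D spectrum and a radial low-frequency mask.
F_img = np.fft.fftshift(np.fft.fft2(img))
yy, xx = np.mgrid[:64, :64]
radius = np.hypot(yy - 32, xx - 32)
low_mask = radius <= 8                        # cut-off radius is an assumption

low = np.fft.ifft2(np.fft.ifftshift(F_img * low_mask)).real      # global structure
high = np.fft.ifft2(np.fft.ifftshift(F_img * ~low_mask)).real    # fine details
assert np.allclose(low + high, img)           # the two branches partition the content
```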

OmniLRS: A Photorealistic Simulator for Lunar Robotics

  • paper_url: http://arxiv.org/abs/2309.08997
  • repo_url: https://github.com/antoinerichard/lunarsim
  • paper_authors: Antoine Richard, Junnosuke Kamohara, Kentaro Uno, Shreya Santra, Dave van der Meer, Miguel Olivares-Mendez, Kazuya Yoshida
  • for: The paper is written for developers and researchers who are interested in developing algorithms for lunar robotic exploration and need a high-fidelity simulator to evaluate their algorithms.
  • methods: The paper proposes a new lunar simulator called OmniLRS, which is based on Nvidia’s robotic simulator Isaac Sim. The simulator provides fast procedural environment generation, multi-robot capabilities, and a synthetic data pipeline for machine-learning applications.
  • results: The paper demonstrates the effectiveness of the simulator for image-based perception by performing sim-to-real rock instance segmentation. The results show that a YOLOv8 model trained on the simulator’s synthetic data achieves performance close to a model trained on real-world data, with a 5% performance gap. When finetuned with real data, the model achieves 14% higher average precision than the model trained on real-world data, demonstrating the simulator’s photorealism.
    Abstract Developing algorithms for extra-terrestrial robotic exploration has always been challenging. Along with the complexity associated with these environments, one of the main issues remains the evaluation of said algorithms. With the regained interest in lunar exploration, there is also a demand for quality simulators that will enable the development of lunar robots. In this paper, we propose Omniverse Lunar Robotic-Sim (OmniLRS), a photorealistic lunar simulator based on Nvidia's robotic simulator Isaac Sim. This simulator provides fast procedural environment generation, multi-robot capabilities, along with a synthetic data pipeline for machine-learning applications. It comes with ROS1 and ROS2 bindings to control not only the robots, but also the environments. This work also performs sim-to-real rock instance segmentation to show the effectiveness of our simulator for image-based perception. Trained on our synthetic data, a YOLOv8 model achieves performance close to a model trained on real-world data, with a 5% performance gap. When finetuned with real data, the model achieves 14% higher average precision than the model trained on real-world data, demonstrating our simulator's photorealism. The code is fully open-source, accessible here: https://github.com/AntoineRichard/LunarSim, and comes with demonstrations.
    摘要 开发外星 robotic 探索算法总是是一个挑战。随着这些环境的复杂性,一个主要的问题是评估这些算法。与月球探索的重新兴起相关,有一个需求是高质量的月球 simulator,可以帮助月球探索机器人的开发。在这篇论文中,我们介绍了我们如何基于 Isaac Sim 和 Nvidia 的机器人 simulator 建立了一个名为 Omniverse Lunar Robotic-Sim(OmniLRS)的月球 simulator。这个 simulate 提供了快速的过程生成环境、多机器人功能以及synthetic data pipeline для机器学习应用。它还包括 ROS1 和 ROS2 绑定,可以控制不仅机器人,还可以控制环境。此外,我们还实现了 sim-to-real 的岩Instance segmentation,以示我们的 simulate 的实用性。我们在我们的synthetic数据上训练了一个 yolov8 模型,与实际数据训练的模型之间的性能差距只有5%。当 fins 化 With real data 时,模型的性能高于实际数据训练的模型, demonstrating 我们的 simulate 的 photorealism。我们的代码是完全开源的,可以在以下 GitHub 上获取:https://github.com/AntoineRichard/LunarSim,并包括示例。

RMP: A Random Mask Pretrain Framework for Motion Prediction

  • paper_url: http://arxiv.org/abs/2309.08989
  • repo_url: https://github.com/kth-rpl/rmp
  • paper_authors: Yi Yang, Qingwen Zhang, Thomas Gilles, Nazre Batool, John Folkesson
  • for: 这篇论文是针对自驾车中的路径预测问题提出了一个框架。
  • methods: 本文使用了随机遮盾模型,将物体位置在随机时间步上遮盾,然后由学习的神经网络(NN)填充。可以根据遮盾profile的变化,轻松地切换到不同的动作相关任务。
  • results: 本文透过评估Argoverse和NuScenes dataset,表明我们的提案的预训框架可以处理噪音输入,提高路径预测精度和缺失率,特别是在时间遮盾下的物体遮盾。
    Abstract As the pretraining technique is growing in popularity, little work has been done on pretrained learning-based motion prediction methods in autonomous driving. In this paper, we propose a framework to formalize the pretraining task for trajectory prediction of traffic participants. Within our framework, inspired by the random masked model in natural language processing (NLP) and computer vision (CV), objects' positions at random timesteps are masked and then filled in by the learned neural network (NN). By changing the mask profile, our framework can easily switch among a range of motion-related tasks. Evaluations on the Argoverse and NuScenes datasets show that our proposed pretraining framework is able to deal with noisy inputs and improves the motion prediction accuracy and miss rate, especially for objects occluded over time.

Comparative study of Deep Learning Models for Binary Classification on Combined Pulmonary Chest X-ray Dataset

  • paper_url: http://arxiv.org/abs/2309.10829
  • repo_url: None
  • paper_authors: Shabbir Ahmed Shuvo, Md Aminul Islam, Md. Mozammel Hoque, Rejwan Bin Sulaiman
  • for: 这个研究的目的是比较八种深度学习模型在同一个肺病影像数据集上的二分类表现。
  • methods: 这个研究使用了 eight 种深度学习模型,包括 DenseNet 121、DenseNet 169、DenseNet 201、EffecientNet b0、EffecientNet lite4、GoogleNet、MobileNet 和 ResNet18。
  • results: 研究发现,当应用于肺病影像数据集时,DenseNet 169 表现最佳,准确率为 89.38%,MobileNet 表现次之,准确率为 92.2%。
    Abstract CNN-based deep learning models for disease detection have become popular recently. We compared the binary classification performance of eight prominent deep learning models: DenseNet 121, DenseNet 169, DenseNet 201, EfficientNet b0, EfficientNet lite4, GoogleNet, MobileNet, and ResNet18 on a combined pulmonary chest X-ray dataset. Despite their widespread application to different kinds of medical images, there remains a knowledge gap in determining their relative performance when applied to the same dataset, a gap this study aimed to address. The dataset combined Shenzhen, China (CH) and Montgomery, USA (MC) data. We trained the models for binary classification, calculated different parameters of the mentioned models, and compared them. All models were trained with the same training parameters to maintain a controlled comparison environment. At the end of the study, we found a distinct difference in performance among the models when applied to the pulmonary chest X-ray image dataset, where DenseNet169 performed with 89.38 percent and MobileNet with 92.2 percent precision. Keywords: Pulmonary, Deep Learning, Tuberculosis, Disease detection, X-ray
    摘要 Recently, CNN基于深度学习模型在疾病检测中获得了广泛应用。我们对8种知名深度学习模型进行比较:DenseNet 121、DenseNet 169、DenseNet 201、EffecientNet b0、EffecientNet lite4、GoogleNet、MobileNet和ResNet18,以确定它们在同一个 dataset 上的二分类性表现。尽管这些模型在医疗图像领域的不同应用场景中广泛使用,但是在应用于同一个 dataset 上的表现却存在知识空白,这项研究意图填补这个空白。我们使用了combined Shenzhen、China(CH)和Montgomery、USA(MC)的数据集。我们将模型进行二分类训练,计算了不同模型的参数,并进行了比较。所有模型均遵循同样的训练参数,以保证比较环境的控制。研究结束后,我们发现在肺部X射影像数据集上,DenseNet169表现出了89.38%的准确率,而MobileNet表现出了92.2%的准确率。关键词:肺部、深度学习、肺病、疾病检测、X射影像
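Setting up any of the compared backbones for binary classification amounts to swapping the final classification head, as sketched below with torchvision. The weights argument and the two-class head are illustrative; the paper's exact training configuration (optimizer, augmentation, input size) is not reproduced here.

```python
import torch.nn as nn
from torchvision import models

# Reuse a DenseNet169 backbone and replace the 1000-class head with a 2-class one
# (e.g., normal vs. abnormal chest X-ray).
model = models.densenet169(weights=None)   # or pass ImageNet weights for transfer learning
model.classifier = nn.Linear(model.classifier.in_features, 2)
print(model.classifier)

# The same pattern adapts other backbones from the comparison, e.g. MobileNetV2:
# m = models.mobilenet_v2(weights=None)
# m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, 2)
```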

FF-LOGO: Cross-Modality Point Cloud Registration with Feature Filtering and Local to Global Optimization

  • paper_url: http://arxiv.org/abs/2309.08966
  • repo_url: None
  • paper_authors: Nan Ma, Mohan Wang, Yiheng Han, Yong-Jin Liu
  • for: 本文是一篇关于多modal点云注册的研究 paper, aiming to address the challenges of cross-modality point cloud registration.
  • methods: 本文提出了一种名为 FF-LOGO 的多modal点云注册方法,包括Feature Filtering和本地 global optimization两个模块。Feature Filtering 模块可以抽象出不同感知器的点云特征,并通过特征匹配来进行点云选择。本地 Adaptive Key Region Aggregation 模块和全Modal Consistency Fusion Optimization 模块是为了优化点云注册精度。
  • results: 实验结果表明,我们的两阶段优化可以显著提高点云注册精度,特征关联和选择模块的准确率从40.59%提高到75.74%。这表明我们的方法可以有效地解决跨模态点云注册中的挑战。
    Abstract Cross-modality point cloud registration is confronted with significant challenges due to inherent differences in modalities between different sensors. We propose FF-LOGO, a cross-modality point cloud registration framework with feature filtering and local-to-global optimization. The cross-modality feature correlation filtering module extracts geometric transformation-invariant features from cross-modality point clouds and achieves point selection by feature matching. We also introduce a cross-modality optimization process, including a local adaptive key region aggregation module and a global modality consistency fusion optimization module. Experimental results demonstrate that our two-stage optimization significantly improves the registration accuracy of the feature association and selection module. Our method achieves a substantial increase in recall rate compared to the current state-of-the-art methods on the 3DCSR dataset, improving from 40.59% to 75.74%. Our code will be available at https://github.com/wangmohan17/FFLOGO.
    摘要 cross-modality point cloud registration 面临着不同感知器的内生差异问题,我们提出了一种 cross-modality point cloud registration 框架 FF-LOGO:一种通过特征筛选和本地adaptive key region aggregation模块、全Modal consistency fusion优化模块实现的 cross-modality point cloud registration 方法。在Feature correlation filtering模块中,我们提取了不同感知器的Point cloud中的几何变换不变的特征,并通过特征匹配来实现点选择。我们还引入了一种 across-modality optimization process,包括一个本地adaptive key region aggregation模块和一个全Modal consistency fusion优化模块。实验结果表明,我们的两stage优化significantly improves the registration accuracy of the feature association and selection module。我们的方法在3DCSR dataset上实现了75.74%的回卷率提升,比前一个state-of-the-art方法提高了40.59%。我们的代码将在https://github.com/wangmohan17/FFLOGO上公开。
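Once feature matching and selection have produced point correspondences, the rigid transform is conventionally recovered with the SVD-based Kabsch solution, sketched below. This is the standard correspondence-based alignment step, not FF-LOGO's own local-to-global optimization modules; the toy data are assumptions.

```python
import numpy as np

def rigid_from_correspondences(P, Q):
    """Least-squares rigid transform (R, t) aligning matched points P -> Q (Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

# Toy check: recover a known rotation/translation from noiseless matches.
rng = np.random.default_rng(0)
P = rng.standard_normal((100, 3))
angle = np.deg2rad(25)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
Q = P @ R_true.T + np.array([0.5, -0.2, 1.0])
R_est, t_est = rigid_from_correspondences(P, Q)
print(np.allclose(R_est, R_true), np.round(t_est, 3))
```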

Tightening Classification Boundaries in Open Set Domain Adaptation through Unknown Exploitation

  • paper_url: http://arxiv.org/abs/2309.08964
  • repo_url: https://github.com/LucasFernando-aes/UnkE-OVANet
  • paper_authors: Lucas Fernando Alvarenga e Silva, Nicu Sebe, Jurandy Almeida
  • for: 本研究旨在提高Open Set Domain Adaptation (OSDA)方法的性能,以应对具有不同类别和预设变数的非控制环境。
  • methods: 本研究提出了一种基于高信度Unknown Instance的强制约束,以提高OSDA方法的分类精度。具体来说,我们采用了三种不同的损失函数来训练OSDA模型,包括直接使用静止负项目集,随机扭转负项目集使用数据增强技术,以及将Synthetically生成的负项目集中的反击特征加以训练。
  • results: 我们基于OVANet进行了广泛的实验,发现这些方法在Office-31和Office-Home数据集上均带来一致的改进:在Office-31上精度和H-Score最高各提升1.3%,在Office-Home上精度最高提升5.8%、H-Score最高提升4.7%。
    Abstract Convolutional Neural Networks (CNNs) have brought revolutionary advances to many research areas due to their capacity of learning from raw data. However, when those methods are applied to non-controllable environments, many different factors can degrade the model's expected performance, such as unlabeled datasets with different levels of domain shift and category shift. Particularly, when both issues occur at the same time, we tackle this challenging setup as Open Set Domain Adaptation (OSDA) problem. In general, existing OSDA approaches focus their efforts only on aligning known classes or, if they already extract possible negative instances, use them as a new category learned with supervision during the course of training. We propose a novel way to improve OSDA approaches by extracting a high-confidence set of unknown instances and using it as a hard constraint to tighten the classification boundaries of OSDA methods. Especially, we adopt a new loss constraint evaluated in three different means, (1) directly with the pristine negative instances; (2) with randomly transformed negatives using data augmentation techniques; and (3) with synthetically generated negatives containing adversarial features. We assessed all approaches in an extensive set of experiments based on OVANet, where we could observe consistent improvements for two public benchmarks, the Office-31 and Office-Home datasets, yielding absolute gains of up to 1.3% for both Accuracy and H-Score on Office-31 and 5.8% for Accuracy and 4.7% for H-Score on Office-Home.
    摘要 卷积神经网络(CNN)已经为许多研究领域带来了革命性的进步,因为它们可以从原始数据中学习。然而,当这些方法应用于不可控的环境时,许多不同的因素可以降低模型的预期性能,如不标注数据集中的不同水平域转移和类别转移。特别是当这两个问题同时出现时,我们称之为开放集领域适应(OSDA)问题。总的来说,现有的OSDA方法通常只关注知道的类别的Alignment,或者如果已经提取了可能的负样本,则在训练过程中使用它们作为新学习的类别。我们提出了一种新的方法来改进OSDA方法,通过提取高 confidence 的未知实例集并将其作为硬件约束使用,以紧张化OSDA方法的分类边界。特别是,我们采用了三种不同的损失约束,分别是:(1) 直接使用不损失的负样本;(2) 使用数据扩展技术生成的随机变换负样本;以及(3) 使用生成的负样本,含有抗击性特征。我们在一系列实验中评估了所有方法,基于 OVANet,可以看到在 Office-31 和 Office-Home 两个公共 benchmark 上,我们可以获得相对准确率和 H-Score 的绝对提升,最高达 1.3% 和 5.8%。
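
One way to picture the "hard constraint" from high-confidence unknowns is a loss that selects confidently-unknown samples and pushes their known-class posterior toward maximum uncertainty. The sketch below is a hedged illustration of that idea in PyTorch, not the exact OVANet-based formulation; the `unknown_score` head and the threshold are assumptions, and the same loss would be applied to pristine, augmented, or adversarially perturbed negatives.

```python
import torch
import torch.nn.functional as F

def unknown_constraint_loss(known_logits, unknown_score, threshold=0.9):
    """known_logits: (B, C) closed-set scores; unknown_score: (B,) estimated P(unknown)."""
    with torch.no_grad():
        mask = unknown_score > threshold            # keep only high-confidence unknowns
    if mask.sum() == 0:
        return known_logits.new_zeros(())
    p = F.softmax(known_logits[mask], dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)
    return -entropy.mean()                          # minimizing this maximizes entropy on unknowns
```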

ExBluRF: Efficient Radiance Fields for Extreme Motion Blurred Images

  • paper_url: http://arxiv.org/abs/2309.08957
  • repo_url: None
  • paper_authors: Dongwoo Lee, Jeongtaek Oh, Jaesung Lim, Sunghyun Cho, Kyoung Mu Lee
  • for: 本研究旨在提出一种基于高效光场优化的新视觉合成方法,用于处理极大运动模糊图像。
  • methods: 本方法包括两个主要组件:6DOF摄像机轨迹基于运动模糊形式和 voxel-based 光场。从极其模糊图像中,我们优化锐化的光场,并通过相互关联摄像机轨迹来生成极大运动模糊图像。在训练中,我们将多个射线与摄像机轨迹相集成,以重建单个模糊颜色图像,这与物理运动模糊操作相同。我们在模糊图像空间内减少照度一致损失,并通过摄像机轨迹来获得锐化的光场。
  • results: 与现有方法相比,我们的方法可以很好地还原极大运动模糊视图中的3D场景,并且具有训练时间和GPU内存占用的减少。
    Abstract We present ExBluRF, a novel view synthesis method for extreme motion blurred images based on efficient radiance fields optimization. Our approach consists of two main components: 6-DOF camera trajectory-based motion blur formulation and voxel-based radiance fields. From extremely blurred images, we optimize the sharp radiance fields by jointly estimating the camera trajectories that generate the blurry images. In training, multiple rays along the camera trajectory are accumulated to reconstruct single blurry color, which is equivalent to the physical motion blur operation. We minimize the photo-consistency loss on blurred image space and obtain the sharp radiance fields with camera trajectories that explain the blur of all images. The joint optimization on the blurred image space demands painfully increasing computation and resources proportional to the blur size. Our method solves this problem by replacing the MLP-based framework to low-dimensional 6-DOF camera poses and voxel-based radiance fields. Compared with the existing works, our approach restores much sharper 3D scenes from challenging motion blurred views with the order of 10 times less training time and GPU memory consumption.
    摘要 我们提出了ExBluRF,一种基于高效辐射场优化的新视图合成方法,用于处理极端运动模糊图像。我们的方法包括两个主要组成部分:基于6自由度相机轨迹的运动模糊建模和基于体素的辐射场。从极端模糊的图像出发,我们在联合估计产生模糊图像的相机轨迹的同时,优化清晰的辐射场。在训练中,沿相机轨迹累积多条光线以重建单个模糊颜色,这等价于物理上的运动模糊过程。我们在模糊图像空间上最小化光度一致性损失,从而获得清晰的辐射场以及能够解释所有图像模糊的相机轨迹。在模糊图像空间上的联合优化所需的计算与资源会随模糊尺度成比例地急剧增加;我们的方法通过将基于MLP的框架替换为低维的6自由度相机位姿和基于体素的辐射场来解决这一问题。与现有工作相比,我们的方法能够从具有挑战性的运动模糊视图中恢复出更加清晰的3D场景,同时训练时间和GPU内存消耗约降低10倍。
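
The physical blur model above can be written directly: a blurry pixel is the average of sharp renderings along the estimated exposure trajectory, and the photo-consistency loss compares it with the observed pixel. A hedged PyTorch sketch, with `render_fn` and the 6-DOF pose parameterization left as assumed placeholders:

```python
import torch

def blurry_color(render_fn, poses, pixel_uv):
    """poses: (K, 6) camera poses sampled along the exposure trajectory;
    render_fn(pose, pixel_uv) -> (3,) sharp RGB predicted by the radiance field (assumed)."""
    colors = torch.stack([render_fn(p, pixel_uv) for p in poses])   # (K, 3)
    return colors.mean(dim=0)             # averaging emulates the physical motion blur

def photo_consistency_loss(render_fn, poses, pixel_uv, observed_rgb):
    return ((blurry_color(render_fn, poses, pixel_uv) - observed_rgb) ** 2).mean()
```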

IntelliBeeHive: An Automated Honey Bee, Pollen, and Varroa Destructor Monitoring System

  • paper_url: http://arxiv.org/abs/2309.08955
  • repo_url: None
  • paper_authors: Christian I. Narcia-Macias, Joselito Guardado, Jocell Rodriguez, Joanne Rampersad-Ammons, Erik Enriquez, Dong-Chul Kim
  • for: 这个研究旨在通过开发一个基于计算机视觉的蜜蜂监测系统,加深我们对蜂群崩溃失调、蜜蜂行为、种群数量下降和蜂巢健康的理解。
  • methods: 这个监测系统使用计算机视觉技术,包括机器学习算法,在不干扰蜜蜂的情况下实时监测蜂群活动和健康状况。
  • results: 我们的监测系统可以准确地追踪蜜蜂、监测采粉活动并检测 Varroa 螨,并且可以实时提供蜂群活动和健康状况的数据。系统的整体追踪准确率达到 96.28%,蜜蜂模型的 F1 分数为 0.95,花粉模型的 F1 分数为 0.831。
    Abstract Utilizing computer vision and the latest technological advancements, in this study, we developed a honey bee monitoring system that aims to enhance our understanding of Colony Collapse Disorder, honey bee behavior, population decline, and overall hive health. The system is positioned at the hive entrance providing real-time data, enabling beekeepers to closely monitor the hive's activity and health through an account-based website. Using machine learning, our monitoring system can accurately track honey bees, monitor pollen-gathering activity, and detect Varroa mites, all without causing any disruption to the honey bees. Moreover, we have ensured that the development of this monitoring system utilizes cost-effective technology, making it accessible to apiaries of various scales, including hobbyists, commercial beekeeping businesses, and researchers. The inference models used to detect honey bees, pollen, and mites are based on the YOLOv7-tiny architecture trained with our own data. The F1-score for honey bee model recognition is 0.95 and the precision and recall value is 0.981. For our pollen and mite object detection model F1-score is 0.95 and the precision and recall value is 0.821 for pollen and 0.996 for "mite". The overall performance of our IntelliBeeHive system demonstrates its effectiveness in monitoring the honey bee's activity, achieving an accuracy of 96.28 % in tracking and our pollen model achieved a F1-score of 0.831.
    摘要 本研究利用计算机视觉和最新的技术进展,开发了一套蜜蜂监测系统,旨在加深我们对蜂群崩溃失调、蜜蜂行为、种群数量下降以及蜂巢整体健康状况的理解。该系统安装在蜂巢入口处,提供实时数据,使养蜂人能够通过账户制网站密切监测蜂巢的活动和健康状况。借助机器学习,该监测系统能够在不干扰蜜蜂的情况下准确追踪蜜蜂、监测采粉活动并检测瓦螨(Varroa mites)。此外,该系统采用低成本技术开发,使业余爱好者、商业养蜂企业和研究人员等不同规模的蜂场都能够使用。用于检测蜜蜂、花粉和螨虫的推理模型基于YOLOv7-tiny架构,并使用我们自己的数据进行训练。蜜蜂识别模型的F1分数为0.95,精确率与召回率为0.981;花粉与螨虫检测模型的F1分数为0.95,其中花粉的精确率与召回率为0.821,螨虫为0.996。IntelliBeeHive系统的整体表现证明了其监测蜜蜂活动的有效性,追踪准确率达96.28%,花粉模型的F1分数为0.831。
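
The precision, recall and F1 figures quoted above follow from the detector's true-positive, false-positive and false-negative counts; a small reference helper (the example counts are hypothetical):

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. hypothetical counts for a bee detector
print(detection_metrics(tp=981, fp=19, fn=19))
```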

Robust Backdoor Attacks on Object Detection in Real World

  • paper_url: http://arxiv.org/abs/2309.08953
  • repo_url: None
  • paper_authors: Yaguan Qian, Boyuan Ji, Shuke He, Shenhui Huang, Xiang Ling, Bin Wang, Wei Wang
  • for: 这个论文的目的是提出一种适应不同大小攻击对象的变量大小触发器,以及一种基于恶意对抗训练的后门训练方法,以提高在实际世界中的攻击成功率。
  • methods: 该论文使用的方法包括变量大小触发器和基于恶意对抗训练的后门训练方法,以适应不同大小攻击对象和物理干扰。
  • results: 实验结果显示,这种 Robust Backdoor Attack (RBA) 可以在实际世界中提高攻击成功率。
    Abstract Deep learning models are widely deployed in many applications, such as object detection in various security fields. However, these models are vulnerable to backdoor attacks. Most backdoor attacks were intensively studied on classified models, but little on object detection. Previous works mainly focused on the backdoor attack in the digital world, but neglect the real world. Especially, the backdoor attack's effect in the real world will be easily influenced by physical factors like distance and illumination. In this paper, we proposed a variable-size backdoor trigger to adapt to the different sizes of attacked objects, overcoming the disturbance caused by the distance between the viewing point and attacked object. In addition, we proposed a backdoor training named malicious adversarial training, enabling the backdoor object detector to learn the feature of the trigger with physical noise. The experiment results show this robust backdoor attack (RBA) could enhance the attack success rate in the real world.
    摘要 深度学习模型在多个应用领域广泛应用,如物体检测等安全领域。然而,这些模型容易受到后门攻击。大多数后门攻击研究targeted于分类模型,而对物体检测模型的研究很少。先前的工作主要集中在数字世界中进行后门攻击研究,忽略了实际世界。特别是,实际世界中后门攻击的效果会受到物体距离观察点以及照明等物理因素的影响。在这篇论文中,我们提出了可变大小的后门触发器,以适应不同大小的攻击对象,并且在不同距离和照明条件下对后门触发器进行了适应。此外,我们还提出了一种名为“邪恶学习训练”的后门训练方法,使得后门检测器能够学习触发器的特征与物理噪声。实验结果显示,我们的robust后门攻击(RBA)可以在实际世界中提高攻击成功率。
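
The variable-size trigger scales the backdoor patch with the attacked object's bounding box so it stays effective at different viewing distances. Below is a hedged numpy sketch of that idea; the scale ratio, paste position, and nearest-neighbour resize are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def paste_variable_size_trigger(image, box, trigger, ratio=0.2):
    """image: (H, W, 3) uint8; box: (x1, y1, x2, y2); trigger: small (h, w, 3) patch."""
    x1, y1, x2, y2 = box
    side = max(1, int(ratio * min(x2 - x1, y2 - y1)))        # trigger side scales with object size
    side = min(side, image.shape[0] - y1, image.shape[1] - x1)  # keep the patch inside the image
    # nearest-neighbour resize of the trigger to (side, side)
    ys = np.arange(side) * trigger.shape[0] // side
    xs = np.arange(side) * trigger.shape[1] // side
    patch = trigger[ys][:, xs]
    out = image.copy()
    out[y1:y1 + side, x1:x1 + side] = patch                   # paste at the box corner
    return out
```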

Staged Contact-Aware Global Human Motion Forecasting

  • paper_url: http://arxiv.org/abs/2309.08947
  • repo_url: https://github.com/l-scofano/stag
  • paper_authors: Luca Scofano, Alessio Sampieri, Elisabeth Schiele, Edoardo De Matteis, Laura Leal-Taixé, Fabio Galasso
  • for: Scene-aware global human motion forecasting is critical for manifold applications, including virtual reality, robotics, and sports. The task combines human trajectory and pose forecasting within the provided scene context, which represents a significant challenge.
  • methods: We propose STAG, a staged contact-aware global human motion forecasting method: a novel three-stage pipeline for predicting global human motion in a 3D environment. We first consider the scene and the respective human interaction as contact points. Secondly, we model the human trajectory forecasting within the scene, predicting the coarse motion of the human body as a whole. The third and last stage matches a plausible fine human joint motion to complement the trajectory, considering the estimated contacts.
  • results: Compared to the state-of-the-art (SoA), STAG achieves a 1.8% and 16.2% overall improvement in pose and trajectory prediction, respectively, on the scene-aware GTA-IM dataset. A comprehensive ablation study confirms the advantages of staged modeling over end-to-end approaches.
    Abstract Scene-aware global human motion forecasting is critical for manifold applications, including virtual reality, robotics, and sports. The task combines human trajectory and pose forecasting within the provided scene context, which represents a significant challenge. So far, only Mao et al. NeurIPS'22 have addressed scene-aware global motion, cascading the prediction of future scene contact points and the global motion estimation. They perform the latter as the end-to-end forecasting of future trajectories and poses. However, end-to-end contrasts with the coarse-to-fine nature of the task and it results in lower performance, as we demonstrate here empirically. We propose a STAGed contact-aware global human motion forecasting STAG, a novel three-stage pipeline for predicting global human motion in a 3D environment. We first consider the scene and the respective human interaction as contact points. Secondly, we model the human trajectory forecasting within the scene, predicting the coarse motion of the human body as a whole. The third and last stage matches a plausible fine human joint motion to complement the trajectory considering the estimated contacts. Compared to the state-of-the-art (SoA), STAG achieves a 1.8% and 16.2% overall improvement in pose and trajectory prediction, respectively, on the scene-aware GTA-IM dataset. A comprehensive ablation study confirms the advantages of staged modeling over end-to-end approaches. Furthermore, we establish the significance of a newly proposed temporal counter called the "time-to-go", which tells how long it is before reaching scene contact and endpoints. Notably, STAG showcases its ability to generalize to datasets lacking a scene and achieves a new state-of-the-art performance on CMU-Mocap, without leveraging any social cues. Our code is released at: https://github.com/L-Scofano/STAG
    摘要 场景感知的全局人体运动预测对于虚拟现实、机器人和体育等众多应用至关重要。该任务需要在给定的场景上下文中同时预测人体轨迹和姿态,极具挑战性。到目前为止,只有Mao等人(NeurIPS'22)处理过场景感知的全局运动预测:他们级联地先预测未来的场景接触点,再进行全局运动估计,并将后者作为未来轨迹与姿态的端到端预测。然而,端到端方式与该任务由粗到细的本质相悖,会导致性能下降,我们在实验中给出了实证。我们提出了STAG,一种用于在3D环境中预测全局人体运动的新颖三阶段管线。我们首先将场景及相应的人体交互建模为接触点;其次在场景内预测人体轨迹,即人体整体的粗略运动;最后一个阶段在考虑估计接触点的情况下,匹配合理的精细人体关节运动以补全轨迹。与最先进方法(SoA)相比,STAG在场景感知的GTA-IM数据集上在姿态和轨迹预测上分别取得了1.8%和16.2%的整体提升。完整的消融研究证实了分阶段建模相对端到端方法的优势。此外,我们提出并验证了一种新的时间计数器"time-to-go",用于表示距离到达场景接触点和终点还有多长时间。值得注意的是,STAG能够泛化到缺少场景信息的数据集,并且在不利用任何社交线索的情况下,在CMU-Mocap上取得了新的最先进性能。我们的代码发布于:https://github.com/L-Scofano/STAG
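
The "time-to-go" counter mentioned above simply encodes, at every frame, how many frames remain before the predicted scene contact (or trajectory endpoint) is reached, and is fed to the model as an extra feature. A minimal illustration:

```python
import numpy as np

def time_to_go(num_frames, contact_frame):
    """Per-frame countdown until the predicted scene-contact frame."""
    t = np.arange(num_frames)
    return np.clip(contact_frame - t, 0, None)   # stays at 0 once contact is reached

print(time_to_go(num_frames=10, contact_frame=6))  # [6 5 4 3 2 1 0 0 0 0]
```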

AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose

  • paper_url: http://arxiv.org/abs/2309.08942
  • repo_url: https://github.com/gentlesjan/affordpose
  • paper_authors: Juntao Jian, Xiuping Liu, Manyi Li, Ruizhen Hu, Jian Liu
  • for: 这篇论文主要针对的是人机交互中对象的函数角色如何影响人的手势,以及如何通过大量人类示例来学习和理解合适的手势交互。
  • methods: 该论文提出了AffordPose数据集,包括26.7K个手势交互示例,每个示例包括3D对象形状、部分可行性标签和手势位置。
  • results: 数据分析表明,手势交互中的具体手势pose受到对象的可行性影响,同时也展现了一定的多样性。实验表明,AffordPose数据集可以有效地学习和理解细腻的手势交互。
    Abstract How human interact with objects depends on the functional roles of the target objects, which introduces the problem of affordance-aware hand-object interaction. It requires a large number of human demonstrations for the learning and understanding of plausible and appropriate hand-object interactions. In this work, we present AffordPose, a large-scale dataset of hand-object interactions with affordance-driven hand pose. We first annotate the specific part-level affordance labels for each object, e.g. twist, pull, handle-grasp, etc, instead of the general intents such as use or handover, to indicate the purpose and guide the localization of the hand-object interactions. The fine-grained hand-object interactions reveal the influence of hand-centered affordances on the detailed arrangement of the hand poses, yet also exhibit a certain degree of diversity. We collect a total of 26.7K hand-object interactions, each including the 3D object shape, the part-level affordance label, and the manually adjusted hand poses. The comprehensive data analysis shows the common characteristics and diversity of hand-object interactions per affordance via the parameter statistics and contacting computation. We also conduct experiments on the tasks of hand-object affordance understanding and affordance-oriented hand-object interaction generation, to validate the effectiveness of our dataset in learning the fine-grained hand-object interactions. Project page: https://github.com/GentlesJan/AffordPose.
    摘要 人类与物体的交互方式取决于目标物体的功能角色,这引出了可供性感知的手-物交互问题。要学习并理解合理且恰当的手-物交互,需要大量的人类示例。在这项工作中,我们提出了AffordPose,一个大规模、由可供性驱动手部姿态的手-物交互数据集。我们首先为每个物体标注了部件级的可供性标签,如拧转(twist)、拉拽(pull)、柄握(handle-grasp)等,而不是"使用"或"递交"这类笼统意图,以指明交互目的并引导手-物交互的定位。细粒度的手-物交互揭示了以手为中心的可供性对手部姿态细节布局的影响,同时也表现出一定的多样性。我们共收集了26.7K个手-物交互样本,每个样本包括3D物体形状、部件级可供性标签以及人工调整的手部姿态。全面的数据分析通过参数统计和接触计算展示了各可供性下手-物交互的共同特征与多样性。我们还在手-物可供性理解和面向可供性的手-物交互生成两项任务上进行了实验,验证了我们的数据集在学习细粒度手-物交互方面的有效性。项目页面:https://github.com/GentlesJan/AffordPose。
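
Each AffordPose sample couples a 3D object shape with a part-level affordance label and a manually adjusted hand pose. A hypothetical record layout is sketched below for orientation; the field names and the 48-dimensional hand parameterization are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordPoseSample:
    object_points: np.ndarray    # (N, 3) points sampled from the 3D object shape
    part_affordance: str         # e.g. "twist", "pull", "handle-grasp"
    hand_pose: np.ndarray        # parameters of the adjusted hand pose

sample = AffordPoseSample(
    object_points=np.zeros((2048, 3)),
    part_affordance="handle-grasp",
    hand_pose=np.zeros(48),      # e.g. axis-angle parameters of a parametric hand model
)
print(sample.part_affordance)
```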

Semantics-aware LiDAR-Only Pseudo Point Cloud Generation for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.08932
  • repo_url: None
  • paper_authors: Tiago Cortinhal, Idriss Gouigah, Eren Erdal Aksoy
  • for: 提高自动驾驶系统中LiDAR感知器的精度和精细度,使其能够更好地检测远距离的细节物体。
  • methods: 利用LiDAR感知器alone,通过提取场景 semantics并使用多Modal domain translator生成增强的Synthetic dense point clouds,提高3D对象检测性能。
  • results: 在多种先进的3D目标检测方法上实现了最高2.9%的性能提升,并在KITTI 3D目标检测数据集上取得了与其他最先进的 LiDAR-only 检测器相当的性能。
    Abstract Although LiDAR sensors are crucial for autonomous systems due to providing precise depth information, they struggle with capturing fine object details, especially at a distance, due to sparse and non-uniform data. Recent advances introduced pseudo-LiDAR, i.e., synthetic dense point clouds, using additional modalities such as cameras to enhance 3D object detection. We present a novel LiDAR-only framework that augments raw scans with denser pseudo point clouds by solely relying on LiDAR sensors and scene semantics, omitting the need for cameras. Our framework first utilizes a segmentation model to extract scene semantics from raw point clouds, and then employs a multi-modal domain translator to generate synthetic image segments and depth cues without real cameras. This yields a dense pseudo point cloud enriched with semantic information. We also introduce a new semantically guided projection method, which enhances detection performance by retaining only relevant pseudo points. We applied our framework to different advanced 3D object detection methods and reported up to 2.9% performance upgrade. We also obtained comparable results on the KITTI 3D object detection dataset, in contrast to other state-of-the-art LiDAR-only detectors.
    摘要 尽管LiDAR传感器能够提供精确的深度信息,对自动驾驶系统至关重要,但由于点云稀疏且分布不均,它们难以捕捉物体的精细细节,尤其是在远距离处。近期的研究提出了伪LiDAR,即借助相机等额外模态生成的合成稠密点云,以增强3D目标检测。我们提出了一种仅依赖LiDAR的框架,它只利用LiDAR传感器和场景语义,为原始扫描补充更稠密的伪点云,而无需相机。我们的框架首先使用分割模型从原始点云中提取场景语义,然后利用多模态域转换器在没有真实相机的情况下生成合成图像分割与深度线索,从而得到富含语义信息的稠密伪点云。我们还引入了一种新的语义引导投影方法,仅保留相关的伪点以提升检测性能。我们将该框架应用于多种先进的3D目标检测方法,性能最高提升2.9%。在KITTI 3D目标检测数据集上,我们也取得了与其他最先进的仅LiDAR检测器相当的结果。
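
The semantically guided projection keeps only pseudo points whose predicted class is relevant before merging them with the raw scan. A minimal numpy sketch of that filtering step; the class IDs are placeholders.

```python
import numpy as np

def filter_pseudo_points(pseudo_xyz, pseudo_labels, relevant_classes=(0, 1, 2)):
    """pseudo_xyz: (N, 3) dense pseudo points; pseudo_labels: (N,) semantic class ids."""
    keep = np.isin(pseudo_labels, relevant_classes)
    return pseudo_xyz[keep]

def augment_scan(raw_xyz, pseudo_xyz, pseudo_labels):
    kept = filter_pseudo_points(pseudo_xyz, pseudo_labels)
    return np.concatenate([raw_xyz, kept], axis=0)   # denser, semantics-aware point cloud
```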

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

  • paper_url: http://arxiv.org/abs/2309.08928
  • repo_url: https://github.com/ninatu/in_style
  • paper_authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne
  • for: 文章目的是提出一种新的文本视频检索任务 setting,即使没有手动标注的对应文本视频数据。
  • methods: 提出了一种名为 In-Style 的方法,可以学习文本查询的风格并将其传递给网络视频。此外,我们还提出了一种多种风格冲突训练方法,以提高模型的通用性。
  • results: 我们通过在多个数据集上的检索性能测试,证明了我们的风格迁移框架可以在无需人工标注配对文本-视频数据的情况下提升文本-视频检索性能,并在零样本文本-视频检索任务上超越了最先进(state-of-the-art)方法的表现。
    Abstract Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. To this end, we introduce a multi-style contrastive training procedure that improves the generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval and improve state-of-the-art performance on zero-shot text-video retrieval.
    摘要 大规模含噪的网络图文数据集已被证明可以有效地学习鲁棒的视觉-语言模型。然而,在将其迁移到视频检索任务时,模型仍需在人工整理的配对文本-视频数据上进行微调,以适应多样的视频描述风格。为了在不依赖人工标注配对数据的情况下解决这一问题,我们提出了一种新的任务设定:基于未整理且未配对数据的文本-视频检索,即训练时仅使用文本查询和未整理的网络视频,而不使用任何配对的文本-视频数据。为此,我们提出了In-Style方法,它学习文本查询的风格,并将其迁移到未整理的网络视频上。此外,为提高泛化能力,我们表明一个模型可以用多种文本风格进行训练,并为此引入了多风格对比训练流程,同时提升在多个数据集上的泛化性。我们在多个数据集上评测了检索性能,以证明我们的风格迁移框架在这一新任务上的优势,并在零样本文本-视频检索上超越了最先进方法的性能。
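
Multi-style contrastive training pairs each (stylized) text query with its web video and contrasts it against the other videos in the batch. A standard symmetric InfoNCE loss captures the mechanism; the sketch below is generic PyTorch with the style conditioning left implicit, and the temperature is an assumed value.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, video_emb, temperature=0.07):
    """text_emb, video_emb: (B, D) L2-normalized embeddings of matched pairs."""
    logits = text_emb @ video_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # symmetric loss: text-to-video and video-to-text retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```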

DynaMoN: Motion-Aware Fast And Robust Camera Localization for Dynamic NeRF

  • paper_url: http://arxiv.org/abs/2309.08927
  • repo_url: None
  • paper_authors: Mert Asim Karaoglu, Hannah Schieber, Nicolas Schischka, Melih Görgülü, Florian Grötzner, Alexander Ladikos, Daniel Roth, Nassir Navab, Benjamin Busam
  • for: 该论文旨在提出一种基于同时定位和地图建模(SLAM)的动态重建方法,以便更好地处理动态场景中的相机位置和场景内容的变化。
  • methods: 该方法使用了同时定位和地图建模(SLAM)和动态掩码(motion masking)结合,以提高动态重建的精度和效率。
  • results: 广泛的实验 validate了该方法在真实世界 dataset 上的优势,包括 TUM RGB-D、BONN RGB-D 动态和 DyCheck 的 iPhone 数据集。该方法不仅提高了相机pose估计的精度,还提高了动态重建的质量。
    Abstract Dynamic reconstruction with neural radiance fields (NeRF) requires accurate camera poses. These are often hard to retrieve with existing structure-from-motion (SfM) pipelines as both camera and scene content can change. We propose DynaMoN that leverages simultaneous localization and mapping (SLAM) jointly with motion masking to handle dynamic scene content. Our robust SLAM-based tracking module significantly accelerates the training process of the dynamic NeRF while improving the quality of synthesized views at the same time. Extensive experimental validation on TUM RGB-D, BONN RGB-D Dynamic and the DyCheck's iPhone dataset, three real-world datasets, shows the advantages of DynaMoN both for camera pose estimation and novel view synthesis.
    摘要 基于神经辐射场(NeRF)的动态重建需要准确的相机位姿。由于相机和场景内容都可能发生变化,现有的运动恢复结构(SfM)流程往往难以获取这些位姿。我们提出 DynaMoN,将同时定位与建图(SLAM)和运动掩码相结合,以处理动态场景内容。我们基于SLAM的鲁棒跟踪模块显著加速了动态NeRF的训练过程,同时提高了合成视图的质量。在 TUM RGB-D、BONN RGB-D Dynamic 和 DyCheck 的 iPhone 数据集这三个真实世界数据集上的大量实验验证表明,DynaMoN 在相机位姿估计和新视图合成两方面均具有优势。

Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2309.08919
  • repo_url: https://github.com/wenyu1009/rtsrn
  • paper_authors: Wenyu Zhang, Xin Deng, Baojun Jia, Xingtong Yu, Yifan Chen, jin Ma, Qing Ding, Xinming Zhang
  • for: 提高场景文本图像超分辨率重建的精度和效率。
  • methods: 提出基于图注意力的 Pixel Adapter Module (PAM) 来解决上采样导致的像素扭曲问题,引入 MLP-based Sequential Residual Block (MSRB) 来提取文本图像的稳健特征,并使用 Local Contour Awareness 损失增强模型对细节的感知。
  • results: 在 TextZoom 上进行了广泛的实验,生成的超分辨率图像质量高,识别准确率超过现有方法;对单阶段和多阶段策略分别提高了 0.7% 和 2.6%,从 52.6% 和 53.7% 提升至 53.3% 和 56.3%。
    Abstract Current Scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7\% and 2.6\%, respectively, increasing the performance from 52.6\% and 53.7\% to 53.3\% and 56.3\%. The code is available at https://github.com/wenyu1009/RTSRN.
    摘要 当前的场景文本图像超分辨率方法主要集中在提取鲁棒特征、获取文本信息以及复杂的训练策略上,以生成超分辨率图像。然而,在将低分辨率图像转换为高分辨率图像的过程中起关键作用的上采样模块,在现有工作中却很少受到关注。为了解决这个问题,我们提出了基于图注意力的Pixel Adapter Module(PAM),用于处理上采样带来的像素扭曲。PAM允许每个像素与其邻居交互并更新特征,从而有效地捕捉局部结构信息。与以往的图注意力机制不同,我们的方法摒弃了对稀疏邻接矩阵的依赖,并引入滑动窗口方式进行高效并行计算,在效率和内存利用率上实现了2-3个数量级的提升,同时保持甚至提升了性能。此外,我们还引入了基于MLP的Sequential Residual Block(MSRB),用于从文本图像中提取鲁棒特征,以及Local Contour Awareness损失($\mathcal{L}_{lca}$),以增强模型对细节的感知。在TextZoom上的全面实验表明,我们提出的方法能够生成高质量的超分辨率图像,识别准确率超过现有方法。对单阶段和多阶段策略,我们分别取得了0.7%和2.6%的提升,将性能从52.6%和53.7%提高到53.3%和56.3%。代码可在https://github.com/wenyu1009/RTSRN获取。
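
The sliding-window design lets every pixel attend only to a small neighbourhood, so no sparse adjacency matrix is needed and the computation stays dense and parallel. A compact PyTorch sketch of such local attention; the window size and dot-product scaling are illustrative, not the exact PAM design.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(feat, window=3):
    """feat: (B, C, H, W) feature map; each pixel attends to its window x window neighbours."""
    B, C, H, W = feat.shape
    pad = window // 2
    neigh = F.unfold(feat, kernel_size=window, padding=pad)          # (B, C*w*w, H*W)
    neigh = neigh.view(B, C, window * window, H * W)
    q = feat.view(B, C, 1, H * W)                                     # query = center pixel
    attn = (q * neigh).sum(dim=1, keepdim=True) / C ** 0.5            # dot-product scores
    attn = attn.softmax(dim=2)                                        # over the local window
    out = (attn * neigh).sum(dim=2)                                   # weighted neighbour sum
    return out.view(B, C, H, W)

x = torch.randn(1, 16, 32, 32)
print(sliding_window_attention(x).shape)   # torch.Size([1, 16, 32, 32])
```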

Delving into Multimodal Prompting for Fine-grained Visual Classification

  • paper_url: http://arxiv.org/abs/2309.08912
  • repo_url: None
  • paper_authors: Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
  • for: 这个研究旨在应用多modal描述来解决细分类的挑战,特别是在细分类中存在轻微的间隔和大量内分量的情况下。
  • methods: 这个研究基于对比语言-图像预训练模型(CLIP),提出了一个新的多模态提示解决方案(MP-FGVC),包括多模态提示方案和多模态适配方案。多模态提示方案包括子类别特定视觉提示(SsVP)和差异感知文本提示(DaTP),它们从视觉和语言两个角度强调子类别特定的差异。多模态适配方案将视觉和文本提示元素对齐到共同语义空间,以便进行跨模态协同推理,并通过视觉-语言融合模块(VLFM)进一步提升FGVC性能。
  • results: 实验结果显示MP-FGVC在四个FGVC数据集上具有优秀的效果。
    Abstract Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pertaining (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlights the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
    摘要 细粒度视觉分类(FGVC)需要在更宽泛的类别内区分细微的子类别,类间差异细微且类内变化大,因而极具挑战。然而,现有方法主要聚焦于单模态的视觉概念。近期预训练视觉-语言模型在多种高层视觉任务上表现出色,但这类模型能否用于FGVC任务仍不确定。本文旨在充分利用跨模态描述的能力来解决FGVC任务,基于对比语言-图像预训练(CLIP)模型提出了一种新的多模态提示方案MP-FGVC。MP-FGVC包含多模态提示机制和多模态适配机制:前者由子类别特定视觉提示(SsVP)和差异感知文本提示(DaTP)组成,从视觉与语言两个角度显式突出子类别特定的差异;后者将视觉与文本提示元素对齐到共同语义空间,并通过视觉-语言融合模块(VLFM)进行跨模态协同推理,进一步提升FGVC性能。此外,我们为MP-FGVC设计了两阶段优化策略,以充分利用预训练的CLIP模型并加快面向FGVC的高效适配。在四个FGVC数据集上的大量实验证明了MP-FGVC的有效性。
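
MP-FGVC builds on CLIP's shared text-image embedding space. As a hedged illustration of the underlying mechanism only, the snippet below runs zero-shot subcategory scoring with the public `clip` package; the prompt wording and class names are assumptions, and MP-FGVC's learned prompt tokens, adaptation scheme, and fusion module are not reproduced here.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# hypothetical fine-grained subcategories and prompt template
subcategories = ["Laysan albatross", "sooty albatross", "black-footed albatross"]
prompts = clip.tokenize([f"a photo of a {c}, a type of bird" for c in subcategories]).to(device)

def classify(image_tensor):
    """image_tensor: preprocessed (1, 3, 224, 224) image batch."""
    with torch.no_grad():
        img = model.encode_image(image_tensor)
        txt = model.encode_text(prompts)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.t()).softmax(dim=-1)
    return subcategories[probs.argmax().item()]
```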

Enhancing Visual Perception in Novel Environments via Incremental Data Augmentation Based on Style Transfer

  • paper_url: http://arxiv.org/abs/2309.08851
  • repo_url: https://github.com/abhibha1807/robustifying_visual_perception
  • paper_authors: Abhibha Gupta, Rully Agus Hendrawan, Mansur Arief
  • for: 强化自适应代理在真实世界场景中的部署,应对“未知未知”( novell 和不期望的环境)。
  • methods: 使用变分原型编码器(Variational Prototyping Encoder, VPE)可靠地识别和处理新颖输入,然后利用神经风格迁移对代表性不足的数据进行增量式数据增强。
  • results: 比较仅从原始数据训练的模型和从原始和增强数据训练的模型,发现增强数据训练对模型性能有着明显的改善,这证明了数据增强的重要性。
    Abstract The deployment of autonomous agents in real-world scenarios is challenged by "unknown unknowns", i.e. novel unexpected environments not encountered during training, such as degraded signs. While existing research focuses on anomaly detection and class imbalance, it often fails to address truly novel scenarios. Our approach enhances visual perception by leveraging the Variational Prototyping Encoder (VPE) to adeptly identify and handle novel inputs, then incrementally augmenting data using neural style transfer to enrich underrepresented data. By comparing models trained solely on original datasets with those trained on a combination of original and augmented datasets, we observed a notable improvement in the performance of the latter. This underscores the critical role of data augmentation in enhancing model robustness. Our findings suggest the potential benefits of incorporating generative models for domain-specific augmentation strategies.
    摘要 在真实场景中部署自主智能体面临"未知的未知"的挑战,即训练中未曾遇到的全新、意料之外的环境,例如破损的标志。现有研究主要关注异常检测和类别不平衡,但往往无法处理真正新颖的场景。我们的方法通过变分原型编码器(Variational Prototyping Encoder, VPE)增强视觉感知,从而有效识别并处理新颖输入,随后利用神经风格迁移对代表性不足的数据进行增量式数据增强。通过比较仅在原始数据集上训练的模型与在原始数据集和增强数据集组合上训练的模型,我们观察到后者的性能有明显提升。这凸显了数据增强在增强模型鲁棒性方面的关键作用。我们的发现表明,将生成式模型用于面向特定领域的数据增强策略具有潜在价值。

MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.08842
  • repo_url: https://github.com/cchen-cc/ma-sam
  • paper_authors: Cheng Chen, Juzheng Miao, Dufan Wu, Zhiling Yan, Sekeun Kim, Jiang Hu, Aoxiao Zhong, Zhengliang Liu, Lichao Sun, Xiang Li, Tianming Liu, Pheng-Ann Heng, Quanzheng Li
  • for: 提高基于自然图像训练的通用分割基础模型(Segment Anything Model,SAM)在三维医学图像分割任务中的表现。
  • methods: 保留并充分利用 SAM 的预训练权重,在参数高效微调过程中向图像编码器的 Transformer 块注入一系列 3D adapter,使 2D 骨干网络能够从输入数据中提取第三维(体积或时间)信息。
  • results: 在 4 个医学图像分割任务、共 10 个公开数据集(涵盖 CT、MRI 和手术视频)上进行了广泛评估,MA-SAM 在不使用提示的情况下持续优于各种最先进的 3D 方法,包括 nnU-Net。
    Abstract The Segment Anything Model (SAM), a foundation model for general image segmentation, has demonstrated impressive zero-shot performance across numerous natural image segmentation tasks. However, SAM's performance significantly declines when applied to medical images, primarily due to the substantial disparity between natural and medical image domains. To effectively adapt SAM to medical images, it is important to incorporate critical third-dimensional information, i.e., volumetric or temporal knowledge, during fine-tuning. Simultaneously, we aim to harness SAM's pre-trained weights within its original 2D backbone to the fullest extent. In this paper, we introduce a modality-agnostic SAM adaptation framework, named as MA-SAM, that is applicable to various volumetric and video medical data. Our method roots in the parameter-efficient fine-tuning strategy to update only a small portion of weight increments while preserving the majority of SAM's pre-trained weights. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data. The effectiveness of our method has been comprehensively evaluated on four medical image segmentation tasks, by using 10 public datasets across CT, MRI, and surgical video data. Remarkably, without using any prompt, our method consistently outperforms various state-of-the-art 3D approaches, surpassing nnU-Net by 0.9%, 2.6%, and 9.9% in Dice for CT multi-organ segmentation, MRI prostate segmentation, and surgical scene segmentation respectively. Our model also demonstrates strong generalization, and excels in challenging tumor segmentation when prompts are used. Our code is available at: https://github.com/cchen-cc/MA-SAM.
    摘要 Segment Anything Model(SAM)是一种通用图像分割的基础模型,在众多自然图像分割任务上展现出令人印象深刻的零样本性能。然而,由于自然图像与医学图像领域之间存在巨大差异,SAM在医学图像上的表现明显下降。要使SAM有效适配医学图像,需要在微调过程中引入关键的第三维信息,即体积或时间信息;同时,我们希望尽可能充分利用SAM原有2D骨干中的预训练权重。本文提出了一种模态无关的SAM适配框架MA-SAM,适用于各种体积和视频医学数据。我们的方法基于参数高效微调策略,只更新一小部分权重增量,保留SAM绝大部分预训练权重。通过向图像编码器的Transformer块中注入一系列3D adapter,我们的方法使预训练的2D骨干能够从输入数据中提取三维信息。我们在4个医学图像分割任务、共10个公开数据集(涵盖CT、MRI和手术视频)上全面评估了该方法的有效性。值得注意的是,在不使用任何提示的情况下,我们的方法持续优于各种最先进的3D方法,在CT多器官分割、MRI前列腺分割和手术场景分割上分别以0.9%、2.6%和9.9%的Dice超越nnU-Net。我们的模型还表现出强大的泛化能力,在使用提示时于具有挑战性的肿瘤分割任务中同样表现出色。代码可在https://github.com/cchen-cc/MA-SAM获取。
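
The adaptation keeps SAM's 2D backbone (and most of its pre-trained weights) intact and inserts small 3D adapters into the transformer blocks so information can flow across slices or frames. A hedged PyTorch sketch of such an adapter; the bottleneck width, GELU activations, and the depth-wise 3D convolution are illustrative choices, not the exact MA-SAM design.

```python
import torch
import torch.nn as nn

class Adapter3D(nn.Module):
    """Bottleneck adapter that mixes information along the third (slice/frame) dimension."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.depth_conv = nn.Conv3d(bottleneck, bottleneck, kernel_size=(3, 1, 1),
                                    padding=(1, 0, 0), groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, D, H, W, C) tokens of D slices coming from the frozen 2D image encoder
        h = self.act(self.down(x))                       # (B, D, H, W, b)
        h = h.permute(0, 4, 1, 2, 3)                     # (B, b, D, H, W) for Conv3d
        h = self.depth_conv(h).permute(0, 2, 3, 4, 1)    # back to (B, D, H, W, b)
        return x + self.up(self.act(h))                  # residual update, backbone untouched

tokens = torch.randn(2, 8, 14, 14, 256)
print(Adapter3D(256)(tokens).shape)   # torch.Size([2, 8, 14, 14, 256])
```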

AOSR-Net: All-in-One Sandstorm Removal Network

  • paper_url: http://arxiv.org/abs/2309.08838
  • repo_url: None
  • paper_authors: Yazhong Si, Xulong Zhang, Fan Yang, Jianzong Wang, Ning Cheng, Jing Xiao
  • for: 本研究旨在解决现有的尘埃图像提高方法存在的应用限制和复杂算法结构问题。
  • methods: 该研究提出了一种全新的图像恢复模型,名为“一体化尘埃除网络”(AOSR-Net),该模型基于重新定义的尘埃散射模型,直接建立了图像映射关系,并且将中间参数集成到模型中,从而有效地解决了过强增强和Week Generalization问题。
  • results: 对于 synthetic 和实际尘埃图像进行了实验,结果显示,提出的 AOSR-Net 模型在对尘埃图像进行提高时表现出色,超过了当前最佳算法(SOTA)的性能。
    Abstract Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve the issue, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. Such integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.
    摘要 现有的沙尘图像增强方法大多基于传统理论和先验知识,这常常限制其在真实场景中的应用。另外,这些方法通常采用先色彩校正、后除尘的策略,使算法结构过于复杂。为解决这一问题,我们提出了一种新的图像恢复模型,即一体化沙尘去除网络(AOSR-Net)。该模型基于重新表述的沙尘散射模型,通过整合中间参数直接建立图像映射关系。这种整合方案有效地解决了沙尘图像增强领域中过度增强和泛化能力弱的问题。在合成与真实世界沙尘图像上的实验结果表明,所提出的AOSR-Net优于最先进(SOTA)算法。

Dual-Camera Joint Deblurring-Denoising

  • paper_url: http://arxiv.org/abs/2309.08826
  • repo_url: None
  • paper_authors: Shayan Shekarforoush, Amanpreet Walia, Marcus A. Brubaker, Konstantinos G. Derpanis, Alex Levinshtein
  • for: 提高低光照照片质量
  • methods: 使用同步短暂曝光图像和长暂曝光图像,并将其进行拼接和权重调整
  • results: 实现了对同步双摄像头图像进行优化,并在实验中显示出比其他方法更高的质量和稳定性
    Abstract Recent image enhancement methods have shown the advantages of using a pair of long and short-exposure images for low-light photography. These image modalities offer complementary strengths and weaknesses. The former yields an image that is clean but blurry due to camera or object motion, whereas the latter is sharp but noisy due to low photon count. Motivated by the fact that modern smartphones come equipped with multiple rear-facing camera sensors, we propose a novel dual-camera method for obtaining a high-quality image. Our method uses a synchronized burst of short exposure images captured by one camera and a long exposure image simultaneously captured by another. Having a synchronized short exposure burst alongside the long exposure image enables us to (i) obtain better denoising by using a burst instead of a single image, (ii) recover motion from the burst and use it for motion-aware deblurring of the long exposure image, and (iii) fuse the two results to further enhance quality. Our method is able to achieve state-of-the-art results on synthetic dual-camera images from the GoPro dataset with five times fewer training parameters compared to the next best method. We also show that our method qualitatively outperforms competing approaches on real synchronized dual-camera captures.
    摘要 近期的图像增强方法表明,在低光照摄影中使用一对长曝光和短曝光图像具有优势。这两种图像模态各有互补的优缺点:前者干净但由于相机或物体运动而模糊,后者清晰锐利但由于光子数量少而带有噪声。鉴于现代智能手机配备了多个后置摄像头传感器,我们提出了一种新的双摄像头方法来获取高质量图像。我们的方法使用一个摄像头同步拍摄的短曝光连拍图像,以及另一个摄像头同时拍摄的长曝光图像。借助与长曝光图像同步的短曝光连拍,我们可以:(i)利用连拍而非单张图像获得更好的去噪效果;(ii)从连拍中恢复运动信息,并用于对长曝光图像进行运动感知去模糊;(iii)融合两者的结果以进一步提升质量。在GoPro数据集的合成双摄像头图像上,我们的方法以比次优方法少五倍的训练参数取得了最先进的结果。我们还表明,在真实的同步双摄像头拍摄数据上,我们的方法在质量上优于其他竞争方法。

cs.AI - 2023-09-16

Interactively Teaching an Inverse Reinforcement Learner with Limited Feedback

  • paper_url: http://arxiv.org/abs/2309.09095
  • repo_url: https://github.com/rzayanov/irl-teaching-limited-feedback
  • paper_authors: Rustam Zayanov, Francisco S. Melo, Manuel Lopes
  • for: 本研究强调教学via示例在顺序决策任务中,特别是教师无法访问学生模型和策略的情况下。
  • methods: 本文使用 inverse reinforcement learning 和 active learning 方法,教师可以通过选择开始状态和推断学生策略来解决这个教学问题。
  • results: 在一个人工汽车驾驶环境中测试了提议的算法,结果显示该算法在学生反馈有限时是一个有效的解决方案。
    Abstract We study the problem of teaching via demonstrations in sequential decision-making tasks. In particular, we focus on the situation when the teacher has no access to the learner's model and policy, and the feedback from the learner is limited to trajectories that start from states selected by the teacher. The necessity to select the starting states and infer the learner's policy creates an opportunity for using the methods of inverse reinforcement learning and active learning by the teacher. In this work, we formalize the teaching process with limited feedback and propose an algorithm that solves this teaching problem. The algorithm uses a modified version of the active value-at-risk method to select the starting states, a modified maximum causal entropy algorithm to infer the policy, and the difficulty score ratio method to choose the teaching demonstrations. We test the algorithm in a synthetic car driving environment and conclude that the proposed algorithm is an effective solution when the learner's feedback is limited.
    摘要 我们研究在序列决策任务中通过示范进行教学的问题。特别地,我们关注教师无法访问学习者的模型和策略、且学习者的反馈仅限于从教师选定的起始状态出发的轨迹这一情形。需要选择起始状态并推断学习者的策略,这为教师使用逆强化学习和主动学习方法创造了机会。在这项工作中,我们将这种有限反馈下的教学过程形式化,并提出了一个解决该教学问题的算法。该算法使用改进版的active value-at-risk方法选择起始状态,使用改进的最大因果熵算法推断学习者策略,并使用难度分数比方法选择教学示范。我们在一个合成的汽车驾驶环境中测试了该算法,结果表明:当学习者的反馈有限时,所提出的算法是一种有效的解决方案。

RMDM: A Multilabel Fakenews Dataset for Vietnamese Evidence Verification

  • paper_url: http://arxiv.org/abs/2309.09071
  • repo_url: None
  • paper_authors: Hai-Long Nguyen, Thi-Kieu-Trang Pham, Thai-Son Le, Tan-Minh Nguyen, Thi-Hai-Yen Vuong, Ha-Thanh Nguyen
  • for: 这个研究是为了评估大语言模型(LLM)在电子信息相关法律上的性能,特别是用于识别假新闻作为电子证据的输入。
  • methods: 该研究使用了一个新的和挑战性的多标签越南语 dataset(RMDM),包括四个标签:实用、误差、恶意和恶假,表示实际信息、误差信息、恶意信息和假信息。
  • results: 研究发现,使用 GPT 和 BERT 模型测试 RMDM 数据集时,每个标签的性能异常,表明该数据集能够挑战不同语言模型对于识别各种类型的电子信息的能力。研究结果表明,用于识别电子信息相关法律上的 fake news 仍然是一个困难的问题,需要更多的研究人员投入,以提高 AI 模型的可靠性。
    Abstract In this study, we present a novel and challenging multilabel Vietnamese dataset (RMDM) designed to assess the performance of large language models (LLMs), in verifying electronic information related to legal contexts, focusing on fake news as potential input for electronic evidence. The RMDM dataset comprises four labels: real, mis, dis, and mal, representing real information, misinformation, disinformation, and mal-information, respectively. By including these diverse labels, RMDM captures the complexities of differing fake news categories and offers insights into the abilities of different language models to handle various types of information that could be part of electronic evidence. The dataset consists of a total of 1,556 samples, with 389 samples for each label. Preliminary tests on the dataset using GPT-based and BERT-based models reveal variations in the models' performance across different labels, indicating that the dataset effectively challenges the ability of various language models to verify the authenticity of such information. Our findings suggest that verifying electronic information related to legal contexts, including fake news, remains a difficult problem for language models, warranting further attention from the research community to advance toward more reliable AI models for potential legal applications.
    摘要 在这项研究中,我们提出了一个新颖且具有挑战性的多标签越南语数据集(RMDM),用于评估大语言模型(LLM)在法律相关电子信息验证上的性能,重点关注作为电子证据潜在输入的假新闻。RMDM数据集包含四个标签:real、mis、dis和mal,分别表示真实信息、错误信息(misinformation)、虚假信息(disinformation)和恶意信息(mal-information)。由于包含了这些多样化的标签,RMDM能够刻画不同类别假新闻的复杂性,并为评估不同语言模型处理可能构成电子证据的各类信息的能力提供洞见。该数据集共包含1,556个样本,每个标签各有389个样本。使用基于GPT和BERT的模型在该数据集上进行的初步测试显示,各模型在不同标签上的表现存在差异,表明该数据集能够有效挑战各类语言模型验证此类信息真实性的能力。我们的发现表明,验证法律相关的电子信息(包括假新闻)对语言模型而言仍是一个困难的问题,值得研究界进一步关注,以推动面向潜在法律应用的更可靠AI模型的发展。

  • paper_url: http://arxiv.org/abs/2309.09070
  • repo_url: None
  • paper_authors: Tan-Minh Nguyen, Xuan-Hoa Nguyen, Ngoc-Duy Mai, Minh-Quan Hoang, Van-Huan Nguyen, Hoang-Viet Nguyen, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
  • for: 本研究旨在提高法律任务性能,通过精心搭配经典统计模型和预训练语言模型(PLMs)。
  • methods: 我们实施了一个预处理步骤,以解决输入限制,并应用学习到rank方法,以整合各种模型中的特征。在问答任务中,我们分成两个子任务:句子分类和答案提取。我们采用了当今最佳实践,以开发每个子任务的独特系统,并利用经典统计模型和预训练语言模型。
  • results: 实验结果表明,我们提出的方法在比赛中具有扎实的潜力。
    Abstract This paper describes the NOWJ1 Team's approach for the Automated Legal Question Answering Competition (ALQAC) 2023, which focuses on enhancing legal task performance by integrating classical statistical models and Pre-trained Language Models (PLMs). For the document retrieval task, we implement a pre-processing step to overcome input limitations and apply learning-to-rank methods to consolidate features from various models. The question-answering task is split into two sub-tasks: sentence classification and answer extraction. We incorporate state-of-the-art models to develop distinct systems for each sub-task, utilizing both classic statistical models and pre-trained Language Models. Experimental results demonstrate the promising potential of our proposed methodology in the competition.
    摘要 本报告描述了NOWJ1队在2023年自动法律问答竞赛(ALQAC 2023)中的方法,其重点是通过结合经典统计模型和预训练语言模型(PLMs)来提升法律任务性能。对于文档检索任务,我们实现了预处理步骤以克服输入限制,并使用learning-to-rank方法整合来自不同模型的特征。问答任务被划分为两个子任务:句子分类和答案抽取。我们结合经典统计模型与预训练语言模型,为每个子任务分别开发了不同的系统。实验结果显示了我们所提方法在竞赛中的良好潜力。

GenDOM: Generalizable One-shot Deformable Object Manipulation with Parameter-Aware Policy

  • paper_url: http://arxiv.org/abs/2309.09051
  • repo_url: None
  • paper_authors: So Kuroki, Jiaxian Guo, Tatsuya Matsushima, Takuya Okubo, Masato Kobayashi, Yuya Ikeda, Ryosuke Takanami, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa
  • for: 一个可以实现单一示范的弹性物品操作框架 (a framework that can achieve one-shot deformable object manipulation)
  • methods: 使用弹性物品参数来条件政策并在训练过程中使用多种弹性物品模拟,以将政策适应不同弹性物品 (using deformable object parameters to condition the policy and training it with a diverse range of simulated deformable objects, so that the policy can adapt to different objects)
  • results: 实际验证范例显示,our方法可以实现不同弹性物品的一个示范操作 (empirical validations show that our method can manipulate different objects with a single demonstration),并在实际环境中比基eline表现更好 (and significantly outperform the baseline in both simulation and real-world environments)
    Abstract Due to the inherent uncertainty in their deformability during motion, previous methods in deformable object manipulation, such as rope and cloth, often required hundreds of real-world demonstrations to train a manipulation policy for each object, which hinders their applications in our ever-changing world. To address this issue, we introduce GenDOM, a framework that allows the manipulation policy to handle different deformable objects with only a single real-world demonstration. To achieve this, we augment the policy by conditioning it on deformable object parameters and training it with a diverse range of simulated deformable objects so that the policy can adjust actions based on different object parameters. At the time of inference, given a new object, GenDOM can estimate the deformable object parameters with only a single real-world demonstration by minimizing the disparity between the grid density of point clouds of real-world demonstrations and simulations in a differentiable physics simulator. Empirical validations on both simulated and real-world object manipulation setups clearly show that our method can manipulate different objects with a single demonstration and significantly outperforms the baseline in both environments (a 62% improvement for in-domain ropes and a 15% improvement for out-of-distribution ropes in simulation, as well as a 26% improvement for ropes and a 50% improvement for cloths in the real world), demonstrating the effectiveness of our approach in one-shot deformable object manipulation.
    摘要 由于可变形物体在运动过程中形变的固有不确定性,以往的绳索、布料等可变形物体操作方法往往需要针对每个物体收集数百次真实世界示范来训练操作策略,这限制了它们在不断变化的真实世界中的应用。为了解决这个问题,我们提出了GenDOM框架,使操作策略仅凭单次真实世界示范即可处理不同的可变形物体。为此,我们让策略以可变形物体参数为条件,并使用多种仿真的可变形物体进行训练,使策略能够根据不同的物体参数调整动作。在推理时,对于一个新物体,GenDOM只需一次真实世界示范,即可通过在可微分物理仿真器中最小化真实示范点云与仿真点云栅格密度之间的差异,来估计该物体的形变参数。在仿真和真实世界的操作实验中,我们的方法都能够凭单次示范操作不同物体,并显著优于基线(仿真中域内绳索提升62%、分布外绳索提升15%;真实世界中绳索提升26%、布料提升50%),证明了我们的方法在单次示范可变形物体操作中的有效性。
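
The parameter-estimation step compares real and simulated point clouds through their grid densities and searches for the deformable-object parameters that minimize the disparity. A simple numpy sketch of that density disparity; the grid resolution and bounds are assumptions, and the paper computes the comparison inside a differentiable physics simulator rather than over a fixed set of candidates.

```python
import numpy as np

def grid_density(points, bounds=(-1.0, 1.0), resolution=16):
    """points: (N, 3). Returns a normalized occupancy histogram over a regular 3D grid."""
    hist, _ = np.histogramdd(points, bins=(resolution,) * 3, range=[bounds] * 3)
    return hist / max(points.shape[0], 1)

def density_disparity(real_points, sim_points, **kwargs):
    return np.abs(grid_density(real_points, **kwargs) - grid_density(sim_points, **kwargs)).sum()

# toy usage: pick the simulated parameter whose rollout best matches the real demonstration
real = np.random.rand(500, 3) * 2 - 1
candidates = {0.1: np.random.rand(500, 3) * 2 - 1, 0.5: np.random.rand(500, 3) * 2 - 1}
best_param = min(candidates, key=lambda k: density_disparity(real, candidates[k]))
print(best_param)
```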

Generative AI-Driven Storytelling: A New Era for Marketing

  • paper_url: http://arxiv.org/abs/2309.09048
  • repo_url: None
  • paper_authors: Marko Vidrih, Shiva Mayahi
  • for: 这篇论文探讨了用生成AI驱动的故事创作在市场策略中的转变力。
  • methods: 该论文使用实际业界例子,如Google、Netflix和Stitch Fix,解释了如何使用此技术自定义消费者体验, navigate 相关挑战。
  • results: 论文描述了未来的发展方向和建议,包括实时个性化叙事、沉浸式叙事体验和社交媒体叙事等前瞻性应用。
    Abstract This paper delves into the transformative power of Generative AI-driven storytelling in the realm of marketing. Generative AI, distinct from traditional machine learning, offers the capability to craft narratives that resonate with consumers on a deeply personal level. Through real-world examples from industry leaders like Google, Netflix and Stitch Fix, we elucidate how this technology shapes marketing strategies, personalizes consumer experiences, and navigates the challenges it presents. The paper also explores future directions and recommendations for generative AI-driven storytelling, including prospective applications such as real-time personalized storytelling, immersive storytelling experiences, and social media storytelling. By shedding light on the potential and impact of generative AI-driven storytelling in marketing, this paper contributes to the understanding of this cutting-edge approach and its transformative power in the field of marketing.
    摘要 这篇论文探讨了生成AI驱动的故事创作在市场营销中的变革力。生成AI与传统机器学习不同,具有制作吸引消费者深层次共鸣的故事的能力。通过实业领导者如Google、Netflix和Stitch Fix的实践例子,我们详细介绍了该技术如何影响市场策略、个性化消费者经验和解决相关挑战。这篇论文还探讨了未来生成AI驱动的故事创作的发展趋势和建议,包括实时个性化storytelling、 immerse storytelling经验和社交媒体storytelling。这篇论文通过探讨生成AI驱动的故事创作在市场营销中的潜力和影响,贡献于这一领域的理解和发展。

A store-and-forward cloud-based telemonitoring system for automatic assessing dysarthria evolution in neurological diseases from video-recording analysis

  • paper_url: http://arxiv.org/abs/2309.09038
  • repo_url: None
  • paper_authors: Lucia Migliorelli, Daniele Berardini, Kevin Cela, Michela Coccia, Laura Villani, Emanuele Frontoni, Sara Moccia
  • For: This study aims to provide a remote telemonitoring system to support clinicians in monitoring the evolution of dysarthria in patients with neurological diseases.
  • Methods: The system uses a convolutional neural network (CNN) to analyze video recordings of individuals with dysarthria, and locates facial landmarks as a prior for assessing orofacial functions related to speech.
  • Results: The proposed CNN achieved a normalized mean error of 1.79 on localizing facial landmarks when tested on a public dataset, and showed promising outcomes in a real-life scenario with 11 bulbar-onset ALS subjects.
    Abstract Background and objectives: Patients suffering from neurological diseases may develop dysarthria, a motor speech disorder affecting the execution of speech. Close and quantitative monitoring of dysarthria evolution is crucial for enabling clinicians to promptly implement patient management strategies and maximizing effectiveness and efficiency of communication functions in term of restoring, compensating or adjusting. In the clinical assessment of orofacial structures and functions, at rest condition or during speech and non-speech movements, a qualitative evaluation is usually performed, throughout visual observation. Methods: To overcome limitations posed by qualitative assessments, this work presents a store-and-forward self-service telemonitoring system that integrates, within its cloud architecture, a convolutional neural network (CNN) for analyzing video recordings acquired by individuals with dysarthria. This architecture, called facial landmark Mask RCNN, aims at locating facial landmarks as a prior for assessing the orofacial functions related to speech and examining dysarthria evolution in neurological diseases. Results: When tested on the Toronto NeuroFace dataset, a publicly available annotated dataset of video recordings from patients with amyotrophic lateral sclerosis (ALS) and stroke, the proposed CNN achieved a normalized mean error equal to 1.79 on localizing the facial landmarks. We also tested our system in a real-life scenario on 11 bulbar-onset ALS subjects, obtaining promising outcomes in terms of facial landmark position estimation. Discussion and conclusions: This preliminary study represents a relevant step towards the use of remote tools to support clinicians in monitoring the evolution of dysarthria.
    摘要 背景与目标:神经系统疾病患者可能出现构音障碍(dysarthria),这是一种影响言语执行的运动性言语障碍。对构音障碍演变进行密切且定量的监测,对于临床医生及时制定患者管理策略、最大化言语功能恢复、代偿或调整的有效性与效率至关重要。在对口面部结构与功能进行临床评估时,无论是静息状态还是言语与非言语运动过程中,通常只能通过视觉观察进行定性评估。方法:为克服定性评估的局限,本研究提出了一种存储转发式自助远程监测系统,其云端架构集成了卷积神经网络(CNN),用于分析构音障碍患者录制的视频。该架构称为 facial landmark Mask RCNN,旨在定位面部关键点,作为评估与言语相关的口面部功能、考察神经系统疾病中构音障碍演变的先验。结果:在公开标注的 Toronto NeuroFace 数据集(包含肌萎缩侧索硬化症(ALS)和中风患者的视频)上测试时,所提出的CNN在面部关键点定位上取得了1.79的归一化平均误差。我们还在11名延髓起病型ALS受试者的真实场景中测试了该系统,在面部关键点位置估计方面取得了令人鼓舞的结果。讨论与结论:这项初步研究是迈向利用远程工具支持临床医生监测构音障碍演变的重要一步。
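
The normalized mean error (NME) reported for landmark localization averages point-to-point distances between predicted and ground-truth landmarks and divides by a per-face normalization distance. A small numpy sketch of the metric; the choice of normalization distance (for example, inter-ocular distance) is an assumption here.

```python
import numpy as np

def normalized_mean_error(pred, gt, norm_dist):
    """pred, gt: (L, 2) landmark coordinates; norm_dist: scalar normalization distance."""
    per_point = np.linalg.norm(pred - gt, axis=1)
    return per_point.mean() / norm_dist

pred = np.random.rand(68, 2) * 100
gt = pred + np.random.randn(68, 2)
print(normalized_mean_error(pred, gt, norm_dist=60.0))
```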

Improve Deep Forest with Learnable Layerwise Augmentation Policy Schedule

  • paper_url: http://arxiv.org/abs/2309.09030
  • repo_url: https://github.com/dbsxfz/augdf
  • paper_authors: Hongyu Zhu, Sichu Liang, Wentao Hu, Fang-Qi Li, Yali yuan, Shi-Lin Wang, Guang Cheng
  • for: 提高 Deep Forest 模型的表现和泛化能力,对 tabular 数据进行更好的处理和分类。
  • methods: 使用可学习的层 wise 数据增强策略,包括 Cut Mix for Tabular data 技术,并使用人口基数搜索算法来调整增强程度。
  • results: 在多个 tabular 分类任务中达到新的 state-of-the-art 水平,超过了树ensemble、深度森林、深度神经网络和 AutoML 竞争对手。
    Abstract As a modern ensemble technique, Deep Forest (DF) employs a cascading structure to construct deep models, providing stronger representational power compared to traditional decision forests. However, its greedy multi-layer learning procedure is prone to overfitting, limiting model effectiveness and generalizability. This paper presents an optimized Deep Forest, featuring learnable, layerwise data augmentation policy schedules. Specifically, We introduce the Cut Mix for Tabular data (CMT) augmentation technique to mitigate overfitting and develop a population-based search algorithm to tailor augmentation intensity for each layer. Additionally, we propose to incorporate outputs from intermediate layers into a checkpoint ensemble for more stable performance. Experimental results show that our method sets new state-of-the-art (SOTA) benchmarks in various tabular classification tasks, outperforming shallow tree ensembles, deep forests, deep neural network, and AutoML competitors. The learned policies also transfer effectively to Deep Forest variants, underscoring its potential for enhancing non-differentiable deep learning modules in tabular signal processing.
    摘要 作为一种现代集成技术,深度森林(DF)采用级联结构构建深度模型,相较传统决策森林具有更强的表示能力。然而,其贪心的多层学习过程容易过拟合,限制了模型的有效性和泛化能力。本文提出了一种优化的深度森林,具有可学习的逐层数据增强策略调度。具体而言,我们引入了面向表格数据的Cut Mix(CMT)增强技术来缓解过拟合,并设计了一种基于种群的搜索算法,为每一层定制增强强度。此外,我们提出将中间层的输出纳入检查点集成(checkpoint ensemble),以获得更稳定的性能。实验结果显示,我们的方法在多种表格分类任务上创造了新的最先进(SOTA)基准,超越了浅层树集成、深度森林、深度神经网络和AutoML等竞争方法。学习到的策略还能有效迁移到其他深度森林变体上,这凸显了其在表格信号处理中增强不可微深度学习模块的潜力。
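
CutMix for Tabular data (CMT) grafts a random subset of feature columns from one training row into another and mixes the labels by the replaced fraction. The exact formulation is the paper's; the sketch below is a generic hedged variant in numpy.

```python
import numpy as np

def tabular_cutmix(x1, y1, x2, y2, rng=None):
    """x1, x2: (D,) feature rows; y1, y2: one-hot (or soft) label vectors."""
    rng = rng or np.random.default_rng()
    n_swap = rng.integers(0, len(x1) + 1)            # how many columns to replace
    cols = rng.choice(len(x1), size=n_swap, replace=False)
    lam = n_swap / len(x1)
    x_new = x1.copy()
    x_new[cols] = x2[cols]                            # graft columns from the second row
    y_new = (1 - lam) * y1 + lam * y2                 # mix labels by the replaced fraction
    return x_new, y_new

x_new, y_new = tabular_cutmix(np.arange(10.0), np.array([1.0, 0.0]),
                              np.ones(10), np.array([0.0, 1.0]))
print(x_new, y_new)
```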

Earth Virtualization Engines – A Technical Perspective

  • paper_url: http://arxiv.org/abs/2309.09002
  • repo_url: None
  • paper_authors: Torsten Hoefler, Bjorn Stevens, Andreas F. Prein, Johanna Baehr, Thomas Schulthess, Thomas F. Stocker, John Taylor, Daniel Klocke, Pekka Manninen, Piers M. Forster, Tobias Kölling, Nicolas Gruber, Hartwig Anzt, Claudia Frauen, Florian Ziemen, Milan Klöwer, Karthik Kashinath, Christoph Schär, Oliver Fuhrer, Bryan N. Lawrence
  • for: 提高气候变化应对能力
  • methods: 结合物理模型和机器学习技术,提高气候预测的准确性、效率和可读性
  • results: 实现了高分辨率的气候数据访问和分析,为气候变化的研究和应对做出了重要贡献
    Abstract Participants of the Berlin Summit on Earth Virtualization Engines (EVEs) discussed ideas and concepts to improve our ability to cope with climate change. EVEs aim to provide interactive and accessible climate simulations and data for a wide range of users. They combine high-resolution physics-based models with machine learning techniques to improve the fidelity, efficiency, and interpretability of climate projections. At their core, EVEs offer a federated data layer that enables simple and fast access to exabyte-sized climate data through simple interfaces. In this article, we summarize the technical challenges and opportunities for developing EVEs, and argue that they are essential for addressing the consequences of climate change.
    摘要 柏林地球虚拟化引擎(EVEs)峰会的参与者们讨论了改进我们应对气候变化能力的想法与概念。EVEs旨在为广泛的用户群体提供可交互、易获取的气候模拟与数据。它们将高分辨率的物理模型与机器学习技术相结合,以提高气候预估的保真度、效率和可解释性。EVEs的核心是一个联邦数据层,使用户能够通过简单的接口便捷、快速地访问EB级的气候数据。本文总结了开发EVEs所面临的技术挑战与机遇,并指出它们对于应对气候变化后果至关重要。

Deliberative Context-Aware Ambient Intelligence System for Assisted Living Homes

  • paper_url: http://arxiv.org/abs/2309.08984
  • repo_url: None
  • paper_authors: Mohannad Babli, Jaime A Rincon, Eva Onaindia, Carlos Carrascosa, Vicente Julian
  • for: 这个论文的目的是提出一种 ambient intelligence 健康应用的决策架构,用于舒缓受抚恤的老年人受到负面情绪的情况下,并在助生活机构中进行实施。
  • methods: 该架构使用了决策函数,以实现 Context-aware 的人机交互、感知、规划功能、反应性和环境意识等特性。文章还进行了一些实验研究,用以证明方法的效果和有效性。
  • results: 实验结果表明,提出的决策函数已经成功地实现了其决策目标,并且在 simulate 的助生活机构enario 中得到了有效的结果。
    Abstract Monitoring wellbeing and stress is one of the problems covered by ambient intelligence, as stress is a significant cause of human illnesses directly affecting our emotional state. The primary aim was to propose a deliberation architecture for an ambient intelligence healthcare application. The architecture provides a plan for comforting stressed seniors suffering from negative emotions in an assisted living home and executes the plan considering the environment's dynamic nature. Literature was reviewed to identify the convergence between deliberation and ambient intelligence and the latter's latest healthcare trends. A deliberation function was designed to achieve context-aware dynamic human-robot interaction, perception, planning capabilities, reactivity, and context-awareness with regard to the environment. A number of experimental case studies in a simulated assisted living home scenario were conducted to demonstrate the approach's behavior and validity. The proposed methods were validated to show classification accuracy. The validation showed that the deliberation function has effectively achieved its deliberative objectives.
    摘要 监测健康与压力是环境智能技术关注的主要问题之一,因为压力直接影响人的情绪状态,也是导致人类疾病的重要原因。本文的主要目标是为环境智能健康应用提出一种深思决策架构。该架构生成一个用于安抚辅助生活机构中承受负面情绪的受压老年人的计划,并在考虑环境动态特性的前提下执行该计划。我们回顾了相关文献,以梳理深思决策与环境智能的交汇点以及环境智能在医疗领域的最新趋势。我们设计了一种决策函数,以实现情境感知的动态人机交互、感知、规划能力、反应性以及对环境的情境感知。我们在模拟的辅助生活场景中开展了多个实验案例,以展示该方法的行为与有效性,并验证了所提方法的分类准确率。验证结果表明,该决策函数有效地实现了其深思决策目标。

Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations

  • paper_url: http://arxiv.org/abs/2309.08978
  • repo_url: None
  • paper_authors: Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang
  • for: 这篇论文的目的是提出一个首个在浏览器中进行深度学习(DL)推理的系统,以提高浏览器中DL推理的性能。
  • methods: 这篇论文使用了两种新的网络编程技术来实现自动生成优化的加速器,包括:Tensor-Web Compiling Co-Design和Web-Specific Lite Kernel Optimization Space Design。这两种技术可以减少加速器生成成本,同时维持或甚至提高性能。
  • results: 对现代转换器模型进行评估,nn-JIT.web可以在各种客户端设备上实现到8.2倍的加速,包括ARM、Intel、AMD和Nvidia的主流CPUs和GPUs。
    Abstract Web applications are increasingly becoming the primary platform for AI service delivery, making in-browser deep learning (DL) inference more prominent. However, current in-browser inference systems fail to effectively utilize advanced web programming techniques and customize kernels for various client devices, leading to suboptimal performance. To address the issues, this paper presents the first in-browser inference system, nn-JIT.web, which enables just-in-time (JIT) auto-generation of optimized kernels for both CPUs and GPUs during inference. The system achieves this by using two novel web programming techniques that can significantly reduce kernel generation time, compared to other tensor compilers such as TVM, while maintaining or even improving performance. The first technique, Tensor-Web Compiling Co-Design, lowers compiling costs by unifying tensor and web compiling and eliminating redundant and ineffective compiling passes. The second technique, Web-Specific Lite Kernel Optimization Space Design, reduces kernel tuning costs by focusing on web programming requirements and efficient hardware resource utilization, limiting the optimization space to only dozens. nn-JIT.web is evaluated for modern transformer models on a range of client devices, including the mainstream CPUs and GPUs from ARM, Intel, AMD and Nvidia. Results show that nn-JIT.web can achieve up to 8.2x faster within 30 seconds compared to the baselines across various models.
    摘要 现代浏览器中的 Web 应用程序正在成为人工智能服务的主要平台,使得在浏览器内进行深度学习(DL)推理变得更加重要。然而,当前的浏览器推理系统无法有效利用高级网络编程技术和自定义核心 для各种客户端设备,导致性能下降。为解决这些问题,这篇论文提出了第一个在浏览器内进行推理的系统,即 nn-JIT.web。该系统可以在推理过程中通过实时生成优化的核心,以提高 CPU 和 GPU 的性能。该系统使用了两种新的网络编程技术来减少核心生成时间,相比于其他tensor编译器such as TVM,而无需增加编译成本。第一种技术是tensor-web编译合理化,它将tensor编译和网络编译融合起来,从而减少了无用的编译步骤。第二种技术是网络特定的轻量级核心优化空间设计,它将精力集中在了网络编程需求和高效硬件资源利用上,从而减少了优化空间的范围,只有几十。nn-JIT.web在多种现代转换器模型上进行了测试,包括ARM、Intel、AMD和Nvidia等主流CPU和GPU。结果显示,nn-JIT.web可以在30秒内达到8.2倍的速度提升,与基准值相比。

Data-driven Reachability using Christoffel Functions and Conformal Prediction

  • paper_url: http://arxiv.org/abs/2309.08976
  • repo_url: None
  • paper_authors: Abdelmouaiz Tebjou, Goran Frehse, Faïcel Chamroukhi
  • for: 本研究旨在提出一种数据驱动的方法,用于估计动力系统中可达的状态集(reach set),而无需知道系统动力学模型的准确参数。
  • methods: 本方法基于Christoffel函数的approximation来估计 reach set,并且通过使用数据驱动的方法来提高样本效率和鲁棒性。
  • results: 本研究显示,使用这种方法可以提高样本效率和鲁棒性,并且可以避免出现在训练集和校准集中的异常样本的影响。
    Abstract An important mathematical tool in the analysis of dynamical systems is the approximation of the reach set, i.e., the set of states reachable after a given time from a given initial state. This set is difficult to compute for complex systems even if the system dynamics are known and given by a system of ordinary differential equations with known coefficients. In practice, parameters are often unknown and mathematical models difficult to obtain. Data-based approaches are promised to avoid these difficulties by estimating the reach set based on a sample of states. If a model is available, this training set can be obtained through numerical simulation. In the absence of a model, real-life observations can be used instead. A recently proposed approach for data-based reach set approximation uses Christoffel functions to approximate the reach set. Under certain assumptions, the approximation is guaranteed to converge to the true solution. In this paper, we improve upon these results by notably improving the sample efficiency and relaxing some of the assumptions by exploiting statistical guarantees from conformal prediction with training and calibration sets. In addition, we exploit an incremental way to compute the Christoffel function to avoid the calibration set while maintaining the statistical convergence guarantees. Furthermore, our approach is robust to outliers in the training and calibration set.
    摘要 动态系统分析中的一个重要数学工具是可达集的近似,即从给定初始状态出发、在给定时间后可达的状态集合。即使系统动力学已知并由系数已知的常微分方程给出,对复杂系统计算该集合也十分困难。实际中,参数往往未知,数学模型也难以获得。数据驱动方法有望避免这些困难:基于状态样本来估计可达集。若有模型可用,训练集可通过数值仿真获得;在缺乏模型时,则可使用真实观测数据。最近提出的一种数据驱动可达集近似方法使用 Christoffel 函数来近似可达集,在一定假设下,该近似可保证收敛到真实解。本文改进了这些结果:借助共形预测(conformal prediction)在训练集与校准集上的统计保证,显著提升了样本效率并放宽了部分假设。此外,我们利用 Christoffel 函数的增量计算方式,在保持统计收敛保证的同时免去校准集。我们的方法对训练集和校准集中的离群点也具有鲁棒性。
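To make the Christoffel-function construction concrete, here is a minimal NumPy sketch of the plain split-conformal variant: fit the empirical moment matrix of monomial features on sampled reachable states, score points by the inverse Christoffel function, and pick the sublevel-set threshold as a conformal quantile on a calibration split. The monomial degree, ridge term, and the use of a separate calibration set are assumptions (the paper also describes an incremental variant that avoids the calibration set).

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, degree=2):
    """All monomials of the state up to a given total degree (including the constant)."""
    n, d = X.shape
    feats = [np.ones(n)]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), k):
            feats.append(np.prod(X[:, idx], axis=1))
    return np.stack(feats, axis=1)

def fit_christoffel(X_train, degree=2, ridge=1e-8):
    V = monomial_features(X_train, degree)
    M = V.T @ V / len(X_train)                       # empirical moment matrix
    M_inv = np.linalg.inv(M + ridge * np.eye(M.shape[0]))
    def score(X):                                    # inverse Christoffel function value
        Vx = monomial_features(X, degree)
        return np.einsum('ij,jk,ik->i', Vx, M_inv, Vx)
    return score

def conformal_threshold(score, X_calib, alpha=0.05):
    """Split-conformal quantile so that ~(1-alpha) of unseen reachable states
    fall inside the estimated sublevel set."""
    s = np.sort(score(X_calib))
    k = int(np.ceil((1 - alpha) * (len(s) + 1))) - 1
    return s[min(k, len(s) - 1)]

# Usage: X_train / X_calib are sampled reachable states (simulation or observations)
X_train, X_calib = np.random.randn(500, 2), np.random.randn(200, 2)
score = fit_christoffel(X_train, degree=2)
tau = conformal_threshold(score, X_calib, alpha=0.05)
inside = score(np.zeros((1, 2))) <= tau              # membership test for a query state
```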

Multiagent Reinforcement Learning with an Attention Mechanism for Improving Energy Efficiency in LoRa Networks

  • paper_url: http://arxiv.org/abs/2309.08965
  • repo_url: None
  • paper_authors: Xu Zhang, Ziqi Lin, Shimin Gong, Bo Gu, Dusit Niyato
  • for: 该研究旨在提高LoRa网络的能源效率(EE),适用于工业互联网物联网(IIoT)。
  • methods: 该研究首先提出了一个分析模型来计算LoRa网络的系统EE性能。然后,基于多代理强化学习(MALoRa)算法,对LoRa网络中每个终端设备(ED)的传输参数分配进行优化,以最大化系统EE。
  • results: simulation结果表明,相比基eline算法,MALoRa算法可以显著提高LoRa网络的系统EE,但是同时也导致了一定的数据包交换率(PDR)的下降。
    Abstract Long Range (LoRa) wireless technology, characterized by low power consumption and a long communication range, is regarded as one of the enabling technologies for the Industrial Internet of Things (IIoT). However, as the network scale increases, the energy efficiency (EE) of LoRa networks decreases sharply due to severe packet collisions. To address this issue, it is essential to appropriately assign transmission parameters such as the spreading factor and transmission power for each end device (ED). However, due to the sporadic traffic and low duty cycle of LoRa networks, evaluating the system EE performance under different parameter settings is time-consuming. Therefore, we first formulate an analytical model to calculate the system EE. On this basis, we propose a transmission parameter allocation algorithm based on multiagent reinforcement learning (MALoRa) with the aim of maximizing the system EE of LoRa networks. Notably, MALoRa employs an attention mechanism to guide each ED to better learn how much ''attention'' should be given to the parameter assignments for relevant EDs when seeking to improve the system EE. Simulation results demonstrate that MALoRa significantly improves the system EE compared with baseline algorithms with an acceptable degradation in packet delivery rate (PDR).
    摘要 长距离(LoRa)无线技术以低功耗和长通信距离为特点,被视为工业物联网(IIoT)的关键使能技术之一。然而,随着网络规模的扩大,严重的数据包碰撞会使 LoRa 网络的能效(EE)急剧下降。为了解决这一问题,需要为每个终端设备(ED)合理分配扩频因子、发射功率等传输参数。但由于 LoRa 网络流量零散、占空比低,在不同参数设置下评估系统能效非常耗时。因此,我们首先建立了一个用于计算系统能效的解析模型;在此基础上,提出了一种基于多智能体强化学习的传输参数分配算法 MALoRa,以最大化 LoRa 网络的系统能效。值得注意的是,MALoRa 引入注意力机制,引导每个终端设备在提升系统能效时学习应对相关终端设备的参数分配给予多少“注意力”。仿真结果表明,与基线算法相比,MALoRa 在数据包交付率(PDR)仅有可接受程度下降的情况下显著提升了系统能效。

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

  • paper_url: http://arxiv.org/abs/2309.08958
  • repo_url: None
  • paper_authors: Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Barry Haddow, Kenneth Heafield
  • for: 本研究旨在探讨基础大语言模型(LLM)可以如何通过具体的指令微调来开发开放式问答能力,以便应用程序如AI助手等。
  • methods: 本研究采用了Alpaca数据集和机器翻译对其进行多语言训练数据的组合,然后通过低级别适应和全参数训练来微调LLMs。
  • results: 研究发现,虽然多语言微调不对英语表现有直接影响,但它对多语言环境下LLM的稳定性至关重要。具有固定预算的情况下,一个多语言指令微调模型,只需在减少数据上进行微调,可以和每种语言单独训练的模型相比。这些发现可以为减少计算资源的情况下扩展语言支持而提供指南。
    Abstract Foundational large language models (LLMs) can be instruction-tuned to develop open-ended question-answering capability, facilitating applications such as the creation of AI assistants. While such efforts are often carried out in a single language, building on prior research, we empirically analyze cost-efficient approaches of monolingual and multilingual tuning, shedding light on the efficacy of LLMs in responding to queries across monolingual and multilingual contexts. Our study employs the Alpaca dataset and machine translations of it to form multilingual training data, which is then used to tune LLMs through low-rank adaptation and full-parameter training. Comparisons reveal that multilingual tuning is not crucial for an LLM's English performance, but is key to its robustness in a multilingual environment. With a fixed budget, a multilingual instruction-tuned model, merely trained on downsampled data, can be as powerful as training monolingual models for each language. Our findings serve as a guide for expanding language support through instruction tuning with constrained computational resources.
    摘要 基础大语言模型(LLM)可以通过指令微调获得开放式问答能力,从而支持诸如构建人工智能助手等应用。此类工作通常只在单一语言上进行;在先前研究的基础上,我们实证分析了单语言与多语言微调的高性价比方案,以考察 LLM 在单语言和多语言情境下回答问题的能力。我们使用 Alpaca 数据集及其机器翻译版本构建多语言训练数据,再通过低秩适应(LoRA)和全参数训练来微调 LLM。对比结果显示,多语言微调对 LLM 的英语性能影响不大,但对其在多语言环境中的稳健性至关重要。在固定预算下,一个仅用降采样数据训练的多语言指令微调模型,可以与为每种语言单独训练的模型相媲美。我们的发现可为在计算资源受限的情况下通过指令微调扩展语言支持提供指导。
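The fixed-budget setup above (one multilingual model trained on downsampled data instead of one model per language) can be illustrated with a tiny data-mixing sketch. The uniform per-language split and the placeholder dataset structure are assumptions for illustration; in practice the inputs would be Alpaca and its machine translations.

```python
import random

def build_multilingual_mix(datasets_by_lang, budget, seed=0):
    """Downsample each language's instruction data so the combined set stays
    within a fixed example budget (uniform split across languages)."""
    random.seed(seed)
    langs = list(datasets_by_lang)
    per_lang = budget // len(langs)
    mixed = []
    for lang in langs:
        data = datasets_by_lang[lang]
        mixed.extend(random.sample(data, min(per_lang, len(data))))
    random.shuffle(mixed)
    return mixed

# Toy example: three languages, same total budget as one monolingual run
datasets = {
    "en": [{"instruction": f"en-{i}"} for i in range(52000)],
    "zh": [{"instruction": f"zh-{i}"} for i in range(52000)],
    "de": [{"instruction": f"de-{i}"} for i in range(52000)],
}
train_set = build_multilingual_mix(datasets, budget=52000)
```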

Cross-Lingual Knowledge Editing in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08952
  • repo_url: None
  • paper_authors: Jiaan Wang, Yunlong Liang, Zengkui Sun, Yuxuan Cao, Jiarong Xu
  • for: 本研究旨在 investigate the cross-lingual effect of knowledge editing in natural language processing.
  • methods: 我们首先收集了一个大规模的 across-lingual synthetic dataset,并对不同知识编辑方法进行了英语编辑。然后,我们对这些编辑后的模型进行了中文评估,并 vice versa。
  • results: 我们发现了编辑后模型的可靠性、通用性、地域性和可移植性在不同语言之间存在差异。此外,我们还分析了编辑后模型的不一致行为和特定挑战。
    Abstract Knowledge editing aims to change language models' performance on several special cases (i.e., editing scope) by infusing the corresponding expected knowledge into them. With the recent advancements in large language models (LLMs), knowledge editing has been shown as a promising technique to adapt LLMs to new knowledge without retraining from scratch. However, most of the previous studies neglect the multi-lingual nature of some main-stream LLMs (e.g., LLaMA, ChatGPT and GPT-4), and typically focus on monolingual scenarios, where LLMs are edited and evaluated in the same language. As a result, it is still unknown the effect of source language editing on a different target language. In this paper, we aim to figure out this cross-lingual effect in knowledge editing. Specifically, we first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese. Then, we conduct English editing on various knowledge editing methods covering different paradigms, and evaluate their performance in Chinese, and vice versa. To give deeper analyses of the cross-lingual effect, the evaluation includes four aspects, i.e., reliability, generality, locality and portability. Furthermore, we analyze the inconsistent behaviors of the edited models and discuss their specific challenges.
    摘要 知识编辑旨在通过向语言模型注入相应的目标知识,改变其在若干特定情形(即编辑范围)上的表现。随着大语言模型(LLM)的发展,知识编辑被证明是一种无需从头重新训练即可让 LLM 适应新知识的有前景技术。然而,先前的研究大多忽视了主流 LLM(如 LLaMA、ChatGPT 和 GPT-4)的多语言特性,通常只关注单语言场景,即在同一种语言中编辑并评估 LLM。因此,源语言编辑对其他目标语言的影响仍不清楚。本文旨在探究知识编辑中的这种跨语言效应。具体而言,我们首先通过将 ZsRE 从英语翻译为中文,构建了一个大规模的跨语言合成数据集;随后,对涵盖不同范式的多种知识编辑方法进行英语编辑并在中文上评估,反之亦然。为深入分析跨语言效应,评估涵盖四个方面:可靠性、泛化性、局部性和可迁移性。此外,我们还分析了编辑后模型的不一致行为,并讨论了其面临的特定挑战。
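The four evaluation aspects can be scored with a simple harness like the one below. This is a hedged sketch: exact-match scoring and the case structure are simplifications I introduce for illustration, not the paper's evaluation code, and `model_answer` is a hypothetical callable that returns the edited model's answer for a prompt in the target language.

```python
def evaluate_edit(model_answer, cases):
    """cases maps each aspect to (prompt, expected) pairs.
    reliability: the edited fact itself; generality: paraphrases of it;
    locality: unrelated facts that must keep their pre-edit answer;
    portability: questions requiring reasoning with the edited fact."""
    def acc(pairs):
        if not pairs:
            return None
        hits = sum(model_answer(p).strip() == e.strip() for p, e in pairs)
        return hits / len(pairs)
    return {aspect: acc(pairs) for aspect, pairs in cases.items()}

# Cross-lingual protocol: edit in English, then score all aspects with
# prompts in the other language (placeholders shown).
cases = {
    "reliability": [("prompt about the edited fact (target language)", "new answer")],
    "generality":  [("a paraphrase of that prompt", "new answer")],
    "locality":    [("a prompt about an unrelated fact", "original answer")],
    "portability": [("a question that requires reasoning with the edit", "derived answer")],
}
# report = evaluate_edit(edited_model_generate, cases)
```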

Universal Metric Learning with Parameter-Efficient Transfer Learning

  • paper_url: http://arxiv.org/abs/2309.08944
  • repo_url: None
  • paper_authors: Sungyeon Kim, Donghyun Kim, Suha Kwak
  • for: 本文提出了一种新的度量学习方法,即通用度量学习(UML),该方法可以捕捉多个不同分布数据之间的关系。
  • methods: 本文提出了一种新的度量学习方法,即通用度量学习(UML),该方法包括一个预先固定的模型和两个附加模块:随机适应器和提示池。这些模块可以捕捉 dataset-specific 知识,同时避免倾斜到主导分布的偏见。
  • results: 本文的实验结果表明,使用 Parametric Universal Metric leArning(PUMA)方法可以在多个不同分布数据上实现更好的性能,并使用约 69 倍 fewer 可变参数。
    Abstract A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard, we introduce a novel metric learning paradigm, called Universal Metric Learning (UML), which learns a unified distance metric capable of capturing relations across multiple data distributions. UML presents new challenges, such as imbalanced data distribution and bias towards dominant distributions. To address these challenges, we propose Parameter-efficient Universal Metric leArning (PUMA), which consists of a pre-trained frozen model and two additional modules, stochastic adapter and prompt pool. These modules enable to capture dataset-specific knowledge while avoiding bias towards dominant distributions. Additionally, we compile a new universal metric learning benchmark with a total of 8 different datasets. PUMA outperformed the state-of-the-art dataset-specific models while using about 69 times fewer trainable parameters.
    摘要 度量学习中的常见做法是为每个数据集分别训练和测试一个嵌入模型。这种面向单一数据集的做法无法模拟现实世界中涉及多个异构数据分布的场景。为此,我们提出了一种新的度量学习范式,即通用度量学习(UML),其目标是学习一个能够刻画多个数据分布之间关系的统一距离度量。UML 带来了新的挑战,例如数据分布不均衡以及向主导分布的偏向。为了解决这些挑战,我们提出了参数高效的通用度量学习方法 PUMA,它由一个冻结的预训练模型和两个附加模块(随机适配器与提示池)组成。这些模块能够捕获各数据集特有的知识,同时避免偏向主导分布。此外,我们构建了一个包含 8 个不同数据集的全新通用度量学习基准。PUMA 在该基准上的表现优于当前最佳的数据集专用模型,而可训练参数量约减少 69 倍。
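The summary only names the two add-on modules, so the following PyTorch sketch is a guess at their shape rather than the paper's architecture: a bottleneck adapter that is applied stochastically during training (one reading of "stochastic adapter") and a pool of learnable prompt tokens prepended to the frozen backbone's token sequence. Dimensions and the drop probability are illustrative.

```python
import torch
import torch.nn as nn

class StochasticAdapter(nn.Module):
    """Bottleneck adapter applied with probability p during training,
    always applied (residually) at inference."""
    def __init__(self, dim, bottleneck=64, p=0.5):
        super().__init__()
        self.down, self.up, self.p = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), p

    def forward(self, x):
        if self.training and torch.rand(()) > self.p:
            return x
        return x + self.up(torch.relu(self.down(x)))

class PromptPool(nn.Module):
    """A small pool of learnable prompt tokens prepended to the token sequence."""
    def __init__(self, num_prompts, dim):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens):                 # tokens: (B, N, dim)
        return torch.cat([self.prompts.expand(tokens.size(0), -1, -1), tokens], dim=1)

# Only the adapter, prompt pool and embedding head are trained; the backbone stays frozen:
# for p in backbone.parameters(): p.requires_grad_(False)
```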

An Unified Search and Recommendation Foundation Model for Cold-Start Scenario

  • paper_url: http://arxiv.org/abs/2309.08939
  • repo_url: None
  • paper_authors: Yuqi Gong, Xichen Ding, Yehui Su, Kaiming Shen, Zhongyi Liu, Guannan Zhang
  • for: 这种 paper 的目的是提出一种基于多个领域的搜索和推荐系统模型,以提高系统的性能和灵活性。
  • methods: 该 paper 使用了大语言模型(LLM)来提取域无关的文本特征,并使用方面闭合合并来将 ID 特征、域无关文本特征和任务特定的多元稀热特征 merge 到获得查询和 Item 的表示。同时,该 paper 还提出了多个搜索和推荐enario 的适应Multi-task 模块来训练多个领域的基础模型。
  • results: 该 paper 通过在冷启动场景中使用 pre-train finetune 方式应用 S&R Multi-Domain Foundation 模型,实现了与其他 SOTA 传输学习方法相比较好的性能。此外,S&R Multi-Domain Foundation 模型已经成功应用在阿里巴巴手机应用程序中的内容查询推荐和服务卡推荐等方面。
    Abstract In modern commercial search engines and recommendation systems, data from multiple domains is available to jointly train the multi-domain model. Traditional methods train multi-domain models in the multi-task setting, with shared parameters to learn the similarity of multiple tasks, and task-specific parameters to learn the divergence of features, labels, and sample distributions of individual tasks. With the development of large language models, LLM can extract global domain-invariant text features that serve both search and recommendation tasks. We propose a novel framework called S\&R Multi-Domain Foundation, which uses LLM to extract domain invariant features, and Aspect Gating Fusion to merge the ID feature, domain invariant text features and task-specific heterogeneous sparse features to obtain the representations of query and item. Additionally, samples from multiple search and recommendation scenarios are trained jointly with Domain Adaptive Multi-Task module to obtain the multi-domain foundation model. We apply the S\&R Multi-Domain foundation model to cold start scenarios in the pretrain-finetune manner, which achieves better performance than other SOTA transfer learning methods. The S\&R Multi-Domain Foundation model has been successfully deployed in Alipay Mobile Application's online services, such as content query recommendation and service card recommendation, etc.
    摘要 在现代商业搜索引擎和推荐系统中,可以利用来自多个领域的数据联合训练多领域模型。传统方法在多任务设定下训练多领域模型,用共享参数学习多任务之间的相似性,用任务特定参数学习各任务在特征、标签和样本分布上的差异。随着大语言模型的发展,LLM 能够提取可同时服务于搜索与推荐任务的全局领域不变文本特征。我们提出了一种名为 S&R 多领域基础模型(S&R Multi-Domain Foundation)的新框架:利用 LLM 提取领域不变特征,并通过方面门控融合(Aspect Gating Fusion)模块将 ID 特征、领域不变文本特征和任务特定的异构稀疏特征融合,得到查询与物品的表示。此外,来自多个搜索与推荐场景的样本通过领域自适应多任务模块联合训练,从而得到多领域基础模型。我们以预训练-微调的方式将 S&R 多领域基础模型应用于冷启动场景,取得了优于其他最新迁移学习方法的性能。该模型已成功部署于支付宝移动应用的内容查询推荐、服务卡推荐等线上服务中。
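Aspect Gating Fusion is only named, not specified, in the abstract; the PyTorch sketch below shows one plausible gated fusion of the three feature groups (ID embeddings, domain-invariant LLM text features, task-specific sparse features), where a learned softmax gate weights each projected aspect. Layer shapes and the single-gate design are my assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class AspectGatingFusion(nn.Module):
    """Gated fusion of three feature groups into one query/item representation."""
    def __init__(self, id_dim, text_dim, sparse_dim, out_dim):
        super().__init__()
        self.proj = nn.ModuleList([
            nn.Linear(id_dim, out_dim),
            nn.Linear(text_dim, out_dim),
            nn.Linear(sparse_dim, out_dim),
        ])
        self.gate = nn.Linear(id_dim + text_dim + sparse_dim, 3)

    def forward(self, id_feat, text_feat, sparse_feat):
        aspects = [p(f) for p, f in zip(self.proj, (id_feat, text_feat, sparse_feat))]
        w = torch.softmax(self.gate(torch.cat([id_feat, text_feat, sparse_feat], dim=-1)), dim=-1)
        return sum(w[..., i:i + 1] * a for i, a in enumerate(aspects))

fusion = AspectGatingFusion(id_dim=64, text_dim=768, sparse_dim=128, out_dim=128)
rep = fusion(torch.randn(4, 64), torch.randn(4, 768), torch.randn(4, 128))
```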

A Novel Neural-symbolic System under Statistical Relational Learning

  • paper_url: http://arxiv.org/abs/2309.08931
  • repo_url: None
  • paper_authors: Dongran Yu, Xueyan Liu, Shirui Pan, Anchen Li, Bo Yang
  • for: The paper aims to develop a cognitive model that can exhibit human-like intellectual capabilities through neural-symbolic systems, which combine the strengths of deep learning and symbolic reasoning.
  • methods: The proposed method is a general bi-level probabilistic graphical reasoning framework called GBPGR, which leverages statistical relational learning to effectively integrate deep learning models and symbolic reasoning in a mutually beneficial manner.
  • results: The approach achieves high performance and exhibits effective generalization in both transductive and inductive tasks, as demonstrated through extensive experiments.Here’s the same information in Simplified Chinese:
  • for: 本研究旨在通过神经符号系统实现人类智能水平的认知模型。
  • methods: 提议的方法是一种通用二级概率图解架构(GBPGR),利用统计关系学来有效地结合深度学习模型和符号逻辑。
  • results: 方法在推uctive和概率任务中具有高性能和有效的泛化能力,经过广泛的实验证明。
    Abstract A key objective in field of artificial intelligence is to develop cognitive models that can exhibit human-like intellectual capabilities. One promising approach to achieving this is through neural-symbolic systems, which combine the strengths of deep learning and symbolic reasoning. However, current approaches in this area have been limited in their combining way, generalization and interpretability. To address these limitations, we propose a general bi-level probabilistic graphical reasoning framework called GBPGR. This framework leverages statistical relational learning to effectively integrate deep learning models and symbolic reasoning in a mutually beneficial manner. In GBPGR, the results of symbolic reasoning are utilized to refine and correct the predictions made by the deep learning models. At the same time, the deep learning models assist in enhancing the efficiency of the symbolic reasoning process. Through extensive experiments, we demonstrate that our approach achieves high performance and exhibits effective generalization in both transductive and inductive tasks.
    摘要 人工智能领域的一个关键目标是开发能够展现类人智能水平的认知模型。实现这一目标的一条有前景的途径是神经符号系统,它结合了深度学习与符号推理的优势。然而,该领域现有方法在二者的结合方式、泛化能力和可解释性方面仍存在局限。为了解决这些问题,我们提出了一个通用的双层概率图推理框架 GBPGR。该框架利用统计关系学习(Statistical Relational Learning),以互利的方式有效整合深度学习模型与符号推理。在 GBPGR 中,符号推理的结果用于修正和改进深度学习模型的预测;同时,深度学习模型又能提升符号推理过程的效率。大量实验表明,我们的方法在直推(transductive)和归纳(inductive)任务中都取得了高性能,并表现出有效的泛化能力。

DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.08925
  • repo_url: None
  • paper_authors: Xiao-Yin Liu, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou
  • for: This paper proposes a new model-based reinforcement learning algorithm called DOMAIN to address the problem of distribution shift in offline RL.
  • methods: The DOMAIN algorithm uses adaptive sampling of model samples to adjust the model data penalty and does not rely on model uncertainty estimation.
  • results: The paper shows that the DOMAIN algorithm is less conservative than previous model-based offline RL algorithms and achieves better performance than other RL algorithms on tasks that require generalization.Here’s the Chinese translation of the three points:
  • for: 这篇论文提出了一种名为 DOMAIN 的新型基于模型的强化学习算法,以解决离线强化学习中的分布偏移问题。
  • methods: DOMAIN 算法通过对模型样本进行自适应采样来调整模型数据惩罚,且不依赖模型不确定性估计。
  • results: 论文表明,DOMAIN 算法的保守性低于以往基于模型的离线强化学习算法,并在需要泛化能力的任务上取得了优于其他强化学习算法的性能。
    Abstract Model-based reinforcement learning (RL), which learns environment model from offline dataset and generates more out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and the previous methods ignore differences between the model data, which brings great conservatism. Therefore, this paper proposes a milDly cOnservative Model-bAsed offlINe RL algorithm (DOMAIN) without estimating model uncertainty to address the above issues. DOMAIN introduces adaptive sampling distribution of model samples, which can adaptively adjust the model data penalty. In this paper, we theoretically demonstrate that the Q value learned by the DOMAIN outside the region is a lower bound of the true Q value, the DOMAIN is less conservative than previous model-based offline RL algorithms and has the guarantee of security policy improvement. The results of extensive experiments show that DOMAIN outperforms prior RL algorithms on the D4RL dataset benchmark, and achieves better performance than other RL algorithms on tasks that require generalization.
    摘要 基于模型的强化学习从离线数据集中学习环境模型并生成更多超出数据分布的模型数据,已成为应对离线强化学习中分布偏移问题的有效途径。由于所学模型与真实环境之间存在差距,算法需要引入保守性,以平衡准确的离线数据与不精确的模型数据。现有算法的保守性大多依赖模型不确定性估计,然而不确定性估计并不可靠,在某些场景下会导致性能下降;此外,以往方法忽视了模型数据之间的差异,带来了过度的保守性。为此,本文提出了一种温和保守的基于模型的离线强化学习算法 DOMAIN,无需估计模型不确定性即可解决上述问题。DOMAIN 引入模型样本的自适应采样分布,能够自适应地调整模型数据惩罚。我们在理论上证明,DOMAIN 在数据覆盖区域之外学习到的 Q 值是真实 Q 值的下界,其保守性低于以往基于模型的离线强化学习算法,并且具有策略安全改进的保证。大量实验结果表明,DOMAIN 在 D4RL 基准数据集上优于已有强化学习算法,并在需要泛化能力的任务上取得了更好的表现。

Exploration of TPUs for AI Applications

  • paper_url: http://arxiv.org/abs/2309.08918
  • repo_url: None
  • paper_authors: Diego Sanmartín Carrión, Vera Prohaska
  • for: 这篇论文主要是关于 tensor processing units (TPU) 的特有硬件加速器,用于深度学习,以及其在边缘计算中的实现。
  • methods: 论文首先提供了 TPU 的概述,包括神经网络设计、通用架构、编译技术和支持框架。然后进行了云和边缘 TPU 性能对比其他相似架构芯片。
  • results: 结果显示,TPU 可以在云和边缘计算中提供显著性能提升。同时,论文还提出了在边缘 TPU 上部署更多架构的需求,以及在边缘计算中需要更多robust的比较。
    Abstract Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper explores the performance of TPU with a focus on AI and its implementation in edge computing. It first provides an overview of TPUs, specifically their design in relation to neural networks, their general architecture, compilation techniques and supporting frameworks. Furthermore, we provide a comparative analysis of Cloud and Edge TPU performance against other counterpart chip architectures. It is then discussed how TPUs can be used to speed up AI workloads. The results show that TPUs can provide significant performance improvements both in cloud and edge computing. Additionally, we address the need for further research for the deployment of more architectures in the Edge TPU, as well as the need for the development of more robust comparisons in edge computing.
    摘要 张量处理单元(TPU)是 Google 为深度学习开发的专用硬件加速器。本文围绕人工智能及其在边缘计算中的部署,探讨 TPU 的性能。文章首先概述了 TPU,包括其面向神经网络的设计、总体架构、编译技术及配套框架;随后将云端与边缘 TPU 的性能与其他同类芯片架构进行了对比分析;最后讨论了如何利用 TPU 加速 AI 工作负载。结果表明,TPU 在云端和边缘计算中均能带来显著的性能提升。此外,文章还指出需要进一步研究在边缘 TPU 上部署更多模型架构,并在边缘计算中建立更为完善的对比评测。

Bidirectional Graph GAN: Representing Brain Structure-Function Connections for Alzheimer’s Disease

  • paper_url: http://arxiv.org/abs/2309.08916
  • repo_url: None
  • paper_authors: Shuqiang Wang, Chen Ding
  • for: 本研究旨在探讨脑结构和功能之间的关系,以揭示脑病、如阿尔茨海默病(AD)的发生机制。
  • methods: 本研究提出了一种bidirectional graph生成对抗网络(BGGAN),用于表征脑结构和功能之间的连接。 Specifically, InnerGCN模块被设计来使生成器使用直接和间接脑区域的特征来学习映射函数。 另外,一个名为Balancer的模块被设计来对生成器和判别器进行平衡优化。
  • results: 对ADNI数据集进行实验表明,生成的结构连接和功能连接都可以提高识别AD的准确率。此外,基于提出的模型,发现脑结构和功能之间不是一对一的对应关系。脑结构是脑功能的基础,强的结构连接通常 accompanies 强的功能连接。
    Abstract The relationship between brain structure and function is critical for revealing the pathogenesis of brain disease, including Alzheimer's disease (AD). However, it is a great challenge to map brain structure-function connections due to various reasons. In this work, a bidirectional graph generative adversarial networks (BGGAN) is proposed to represent brain structure-function connections. Specifically, by designing a module incorporating inner graph convolution network (InnerGCN), the generators of BGGAN can employ features of direct and indirect brain regions to learn the mapping function between structural domain and functional domain. Besides, a new module named Balancer is designed to counterpoise the optimization between generators and discriminators. By introducing the Balancer into BGGAN, both the structural generator and functional generator can not only alleviate the issue of mode collapse but also learn complementarity of structural and functional features. Experimental results using ADNI datasets show that the both the generated structure connections and generated function connections can improve the identification accuracy of AD. More importantly, based the proposed model, it is found that the relationship between brain structure and function is not a complete one-to-one correspondence. Brain structure is the basis of brain function. The strong structural connections are almost accompanied by strong functional connections.
    摘要 《Brain Structure-Function Relationship and Alzheimer's Disease》Introduction: brain structure-function relationship is crucial for understanding the pathogenesis of brain diseases, including Alzheimer's disease (AD). However, mapping brain structure-function connections is a significant challenge due to various reasons. In this work, we propose a bidirectional graph generative adversarial networks (BGGAN) to represent brain structure-function connections.Methodology:1. Inner Graph Convolution Network (InnerGCN): We design a module incorporating InnerGCN to enable the generators of BGGAN to learn the mapping function between structural domain and functional domain using features of direct and indirect brain regions.2. Balancer: To counterpoise the optimization between generators and discriminators, we introduce a new module named Balancer. This module allows both the structural generator and functional generator to alleviate the issue of mode collapse and learn complementarity of structural and functional features.Results:1. Improved identification accuracy of AD: Experimental results using ADNI datasets show that the generated structure connections and functional connections can improve the identification accuracy of AD.2. Relationship between brain structure and function: Based on the proposed model, we found that the relationship between brain structure and function is not a complete one-to-one correspondence. Brain structure is the basis of brain function, and strong structural connections are almost accompanied by strong functional connections.Conclusion:BGGAN provides a novel approach to mapping brain structure-function connections, which can be used to better understand the pathogenesis of brain diseases such as AD. The proposed model highlights the importance of considering both structural and functional features when studying brain function and disease.

A Statistical Turing Test for Generative Models

  • paper_url: http://arxiv.org/abs/2309.08913
  • repo_url: None
  • paper_authors: Hayden Helm, Carey E. Priebe, Weiwei Yang
  • for: 本研究旨在量化人类和机器生成内容的分布差异,以便评估生成模型是否具备人类化能力。
  • methods: 本研究使用统计模式识别语言框架,描述当前生成模型的评估方法,并用该框架进行生成模型的评估。
  • results: 研究发现,当前的生成模型在评估上的表现有所提高,但仍有一定的差异与人类生成内容的分布。
    Abstract The emergence of human-like abilities of AI systems for content generation in domains such as text, audio, and vision has prompted the development of classifiers to determine whether content originated from a human or a machine. Implicit in these efforts is an assumption that the generation properties of a human are different from that of the machine. In this work, we provide a framework in the language of statistical pattern recognition that quantifies the difference between the distributions of human and machine-generated content conditioned on an evaluation context. We describe current methods in the context of the framework and demonstrate how to use the framework to evaluate the progression of generative models towards human-like capabilities, among many axes of analysis.
    摘要 人类化能力的AI系统在文本、音频和视觉等领域的内容生成中得到了广泛应用,这导致了判断内容是人类还是机器生成的分类器的发展。这种假设是人类生成的特征和机器生成的特征不同。在这项工作中,我们提供了一个基于统计模式识别语言的框架,以量化人类和机器生成的内容在评估上下文中的分布差异。我们将当前方法与此框架中的方法进行描述,并通过这个框架来评估生成模型在多个轴上的进步,包括人类化能力。
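A standard way to instantiate the "distance between the distributions of human and machine content" described above is a classifier two-sample test: train a classifier to separate the two sources given some feature or embedding map, and read the cross-validated AUC as discriminability (0.5 means indistinguishable under that evaluation context). This is one possible instantiation for illustration, not necessarily the exact statistic used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def discriminability(human_feats, machine_feats, seed=0):
    """Cross-validated ROC-AUC of a classifier separating human- from
    machine-generated content under a fixed feature/embedding map."""
    X = np.vstack([human_feats, machine_feats])
    y = np.r_[np.zeros(len(human_feats)), np.ones(len(machine_feats))]
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Toy example with random features; in practice these would be document embeddings.
auc = discriminability(np.random.randn(200, 32), np.random.randn(200, 32) + 0.3)
print(auc)  # closer to 0.5 means the generator is harder to tell apart from humans
```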

V2CE: Video to Continuous Events Simulator

  • paper_url: http://arxiv.org/abs/2309.08891
  • repo_url: None
  • paper_authors: Zhongyang Zhang, Shuyang Cui, Kaidong Chai, Haowen Yu, Subhasis Dasgupta, Upal Mahbub, Tauhidur Rahman
  • for: 本文旨在提出一种基于动态视场传感器(DVS)的视频转事件流转换方法,以提高DVS在计算机视觉任务中的表现。
  • methods: 本文提出了一种基于多视角的事件流转换方法,并采用了一系列特别设计的损失函数来提高生成的事件VOXEL的质量。此外,本文还提出了一种基于本地动态特征的时间推测策略,以准确地恢复事件时间排序和消除时间层次问题。
  • results: 根据对量化指标进行严格验证,本文的方法在所有阶段的管道中具有最高精度,可以视为当前最佳实践(SOTA)。
    Abstract Dynamic Vision Sensor (DVS)-based solutions have recently garnered significant interest across various computer vision tasks, offering notable benefits in terms of dynamic range, temporal resolution, and inference speed. However, as a relatively nascent vision sensor compared to Active Pixel Sensor (APS) devices such as RGB cameras, DVS suffers from a dearth of ample labeled datasets. Prior efforts to convert APS data into events often grapple with issues such as a considerable domain shift from real events, the absence of quantified validation, and layering problems within the time axis. In this paper, we present a novel method for video-to-events stream conversion from multiple perspectives, considering the specific characteristics of DVS. A series of carefully designed losses helps enhance the quality of generated event voxels significantly. We also propose a novel local dynamic-aware timestamp inference strategy to accurately recover event timestamps from event voxels in a continuous fashion and eliminate the temporal layering problem. Results from rigorous validation through quantified metrics at all stages of the pipeline establish our method unquestionably as the current state-of-the-art (SOTA).
    摘要 基于动态视觉传感器(DVS)的方案近来在各类计算机视觉任务中受到广泛关注,在动态范围、时间分辨率和推理速度方面具有显著优势。然而,与 RGB 相机等主动像素传感器(APS)相比,DVS 是相对较新的视觉传感器,缺乏充足的标注数据集。以往将 APS 数据转换为事件流的尝试常常面临与真实事件之间领域差距较大、缺乏量化验证以及时间轴分层等问题。本文从多个角度提出了一种新的视频到事件流转换方法,充分考虑了 DVS 的特性:一系列精心设计的损失函数显著提升了生成事件体素(voxel)的质量;我们还提出了一种局部动态感知的时间戳推断策略,能够以连续的方式从事件体素中准确恢复事件时间戳,消除时间分层问题。在流程各阶段基于量化指标进行的严格验证表明,我们的方法无疑达到了当前最佳水平(SOTA)。
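For readers unfamiliar with the "event voxel" representation mentioned above, the sketch below builds the standard voxel grid from a stream of events (timestamp, x, y, polarity) with bilinear weighting along the time axis. It only illustrates the representation; V2CE's losses and timestamp-inference strategy are not reproduced here, and the bin count and sensor size are arbitrary.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (t, x, y, polarity in {-1, +1}) into a
    (num_bins, H, W) voxel grid, splitting each event between the two
    nearest temporal bins."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t, x, y, p = (events[:, i] for i in range(4))
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t_norm).astype(int)
    frac = t_norm - t0
    for bin_idx, w in ((t0, 1.0 - frac), (np.minimum(t0 + 1, num_bins - 1), frac)):
        np.add.at(voxel, (bin_idx, y.astype(int), x.astype(int)), p * w)
    return voxel

# 1000 random events on a 260x346 sensor, binned into 5 temporal slices
ev = np.column_stack([np.sort(np.random.rand(1000)),
                      np.random.randint(0, 346, 1000),
                      np.random.randint(0, 260, 1000),
                      np.random.choice([-1, 1], 1000)])
grid = events_to_voxel_grid(ev, num_bins=5, height=260, width=346)
```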

GCL: Gradient-Guided Contrastive Learning for Medical Image Segmentation with Multi-Perspective Meta Labels

  • paper_url: http://arxiv.org/abs/2309.08888
  • repo_url: None
  • paper_authors: Yixuan Wu, Jintai Chen, Jiahuan Yan, Yiheng Zhu, Danny Z. Chen, Jian Wu
  • for: 降低标注成本,为医疗影像分割任务提供一种标注高效的预训练方法
  • methods: 利用Gradient Mitigator方法与Gradient Filter方法,将多种多面 semantics整合为一个单一的高级semantic recognition能力
  • results: 透过实验证明,新方法GCL可以从有限标签的情况下,学习出有用的医疗影像表示,并且在不同数据集上展现出良好的一致性和普遍性
    Abstract Since annotating medical images for segmentation tasks commonly incurs expensive costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited a great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable "semantic contradiction" when combining different meta labels. In this paper, we tackle the issue of "semantic contradiction" in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain a better high-level semantic recognition ability. Moreover, we emphasize that the fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.
    摘要 由于为医疗图像分割任务进行标注通常成本高昂,亟需设计一种标注高效的方法来减轻标注负担。近年来,对比学习在学习鲁棒表示、以有限标签提升下游任务方面展现出巨大潜力。在医疗影像场景中,现成的元标签(即医疗图像的特定属性信息)天然地揭示了图像之间的语义关系,先前工作已将其用于定义正样本对。然而,不同元标签所揭示的多视角语义往往互不兼容,组合不同元标签时会产生难以处理的“语义矛盾”。本文以梯度引导的方式解决“语义矛盾”问题:我们提出的 Gradient Mitigator 方法系统地统一多视角元标签,使预训练模型获得更好的高层语义识别能力。此外,我们强调细粒度判别能力对面向分割的预训练至关重要,并提出了名为 Gradient Filter 的新方法,依据梯度幅值动态筛选最具判别力的像素对。在四个医疗图像分割数据集上的全面实验验证了我们的新方法 GCL:(1)学习到信息丰富的图像表示,在标签有限的情况下显著提升分割性能;(2)在分布外数据集上表现出良好的泛化能力。
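The Gradient Filter idea (keep only the pixel pairs with the most discriminating power, measured by gradient magnitude) can be sketched as a top-k selection over per-pixel gradient norms. The sketch below is my simplification: it assumes the gradients of a dense pre-training loss with respect to the feature map are available and simply returns the most discriminative pixel indices per sample; the paper's exact screening rule may differ.

```python
import torch

def gradient_filter(feat, loss, k):
    """feat: (B, C, H, W) feature map with requires_grad=True; loss: a scalar
    pre-training loss. Returns flattened indices of the k pixels per sample
    with the largest gradient magnitude, i.e. the most discriminative pixels."""
    grads, = torch.autograd.grad(loss, feat, retain_graph=True)
    mag = grads.norm(dim=1).flatten(1)          # (B, H*W) per-pixel gradient norm
    return mag.topk(k, dim=1).indices           # (B, k)

# Usage sketch: compute a dense contrastive loss, then restrict the pairwise
# loss to the selected pixels before the real backward pass.
feat = torch.randn(2, 16, 32, 32, requires_grad=True)
loss = (feat ** 2).mean()                       # stand-in for a contrastive loss
top_idx = gradient_filter(feat, loss, k=64)
```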

Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration With Provable Guarantees

  • paper_url: http://arxiv.org/abs/2309.08883
  • repo_url: None
  • paper_authors: Jinzhao Li, Nan Jiang, Yexiang Xue
  • for: 这篇论文主要是解决 Symbolic Artificial Intelligence 和 Statistical Artificial Intelligence 的交叉问题,即 Satisfiability Modulo Counting (SMC) 问题。
  • methods: 该论文提出了一种基于 NP-oracle 的多项式算法 XOR-SMC,可以解决高度NP-完全的 SMC 问题,并提供了常量近似保证。 XOR-SMC 将 SMC 问题转化为满足随机 XOR 约束的 SAT 方程问题。
  • results: experiments 表明,XOR-SMC 能够在解决重要的 SMC 问题时,与基线相比,提供更好的近似解决方案,并且其近似精度较高。
    Abstract Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical Artificial Intelligence. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature($\text{NP}^{\text{PP}$-complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from sub-optimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems, by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC finds solutions close to the true optimum, outperforming several baselines which struggle to find good approximations for the intractable model counting in SMC.
    摘要 计数模可满足性(Satisfiability Modulo Counting, SMC)涵盖同时需要符号决策与统计推理的问题,其一般形式刻画了符号人工智能与统计人工智能交叉处的许多现实问题。SMC 旨在寻找能够控制概率结果的策略干预。由于 SMC 同时包含统计推断与符号推理,其复杂度极高(NP^PP-完全),求解十分困难。以往关于 SMC 求解的研究缺乏可证明的保证,且在存在组合约束时经验性能欠佳。为此,本文提出 XOR-SMC:一种可调用 NP oracle 的多项式时间算法,用于求解高度难解的 SMC 问题并给出常数近似保证。XOR-SMC 通过将 SMC 中的模型计数替换为带随机 XOR 约束的 SAT 公式,把高度难解的 SMC 转化为可满足性问题。在面向社会公益 AI 的重要 SMC 问题上的实验表明,XOR-SMC 找到的解接近真实最优,优于多种难以在 SMC 的模型计数上取得良好近似的基线方法。
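The "randomized XOR constraints" mentioned above are parity constraints over random subsets of the Boolean variables; each added constraint roughly halves the solution set, which is the classical trick for reducing counting to a sequence of satisfiability checks. The sketch below only generates and checks such constraints (the density and the integration with a SAT solver are left out and would be assumptions).

```python
import random

def random_xor_constraints(num_vars, num_constraints, density=0.5, seed=0):
    """Each constraint is (S, b): the variables indexed by S must XOR to bit b."""
    rng = random.Random(seed)
    constraints = []
    for _ in range(num_constraints):
        subset = [v for v in range(num_vars) if rng.random() < density]
        constraints.append((subset, rng.randint(0, 1)))
    return constraints

def satisfies(assignment, constraints):
    """assignment: list of 0/1 values; check every parity constraint."""
    return all(sum(assignment[v] for v in s) % 2 == b for s, b in constraints)

xors = random_xor_constraints(num_vars=10, num_constraints=3)
print(satisfies([1, 0, 1, 1, 0, 0, 1, 0, 1, 0], xors))
```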

ChatGPT-4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems

  • paper_url: http://arxiv.org/abs/2309.08881
  • repo_url: None
  • paper_authors: Tanuj Kumar, Mikhail A. Kats
  • for: 这个论文是为了测试 chatGPT 3.5、4、4 with Code Interpreter 在大学二年级电工和电磁学问题上的性能。
  • methods: 作者使用了一组13个问题,并使用了不同的 chatGPT 实例来解决这些问题多次。
  • results: 结果显示,使用 Code Interpreter 的 chatGPT 4 可以成功解决大多数问题,而不使用 Code Interpreter 的 chatGPT 3.5 和 chatGPT 4 的性能较差。
    Abstract We evaluated ChatGPT 3.5, 4, and 4 with Code Interpreter on a set of college-level engineering-math and electromagnetism problems, such as those often given to sophomore electrical engineering majors. We selected a set of 13 problems, and had ChatGPT solve them multiple times, using a fresh instance (chat) each time. We found that ChatGPT-4 with Code Interpreter was able to satisfactorily solve most problems we tested most of the time -- a major improvement over the performance of ChatGPT-4 (or 3.5) without Code Interpreter. The performance of ChatGPT was observed to be somewhat stochastic, and we found that solving the same problem N times in new ChatGPT instances and taking the most-common answer was an effective strategy. Based on our findings and observations, we provide some recommendations for instructors and students of classes at this level.
    摘要 我们在一组大学工程数学与电磁学问题(类似于电子工程专业二年级学生常见的题目)上评测了 ChatGPT 3.5、ChatGPT 4 以及带 Code Interpreter 的 ChatGPT 4。我们选取了 13 道题目,让 ChatGPT 多次求解,每次都使用全新的会话实例。我们发现,带 Code Interpreter 的 ChatGPT 4 在大多数情况下能够令人满意地解出我们测试的大部分题目,相比不带 Code Interpreter 的 ChatGPT 4(或 3.5)有显著提升。ChatGPT 的表现带有一定随机性;我们发现,在新的会话中将同一道题求解 N 次并取最常见的答案是一种有效的策略。基于这些发现与观察,我们为该层次课程的教师和学生提供了一些建议。
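The "solve the same problem N times in fresh sessions and take the most-common answer" strategy reported above is simple majority voting; the snippet below shows it with a toy list of answer strings (the example answers are placeholders, not data from the study).

```python
from collections import Counter

def most_common_answer(answers):
    """Return the most frequent answer across N independent sessions,
    plus the fraction of sessions that agreed with it."""
    counts = Counter(a.strip() for a in answers if a is not None)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / sum(counts.values())

# e.g. the same vector-calculus problem solved in 5 fresh sessions
runs = ["2*pi*R^2", "2*pi*R^2", "pi*R^2", "2*pi*R^2", "2*pi*R^2"]
print(most_common_answer(runs))   # ('2*pi*R^2', 0.8)
```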

Data-Driven H-infinity Control with a Real-Time and Efficient Reinforcement Learning Algorithm: An Application to Autonomous Mobility-on-Demand Systems

  • paper_url: http://arxiv.org/abs/2309.08880
  • repo_url: None
  • paper_authors: Ali Aalipour, Alireza Khani
  • for: 这篇论文旨在开发一个基于Q学习的无模型实时控制算法,用于解决线性碎时系统的H$_{\infty}$控制问题。
  • methods: 提议的算法使用了Q学习方法,并且在线性碎时系统中实现了实时和数据效率的控制。computational complexity降低至$\mathcal{O}(\underline{q}^2)$,比Literature中的$\mathcal{O}(\underline{q}^3)$更低。
  • results: 实验研究显示,提议的算法可以实现实时和数据效率的控制,并且不需要初始稳定政策。在一个实际应用中,将提议的算法应用到了一个自主移动需求系统中,并且得到了良好的效果。
    Abstract Reinforcement learning (RL) is a class of artificial intelligence algorithms being used to design adaptive optimal controllers through online learning. This paper presents a model-free, real-time, data-efficient Q-learning-based algorithm to solve the H$_{\infty}$ control of linear discrete-time systems. The computational complexity is shown to reduce from $\mathcal{O}(\underline{q}^3)$ in the literature to $\mathcal{O}(\underline{q}^2)$ in the proposed algorithm, where $\underline{q}$ is quadratic in the sum of the size of state variables, control inputs, and disturbance. An adaptive optimal controller is designed and the parameters of the action and critic networks are learned online without the knowledge of the system dynamics, making the proposed algorithm completely model-free. Also, a sufficient probing noise is only needed in the first iteration and does not affect the proposed algorithm. With no need for an initial stabilizing policy, the algorithm converges to the closed-form solution obtained by solving the Riccati equation. A simulation study is performed by applying the proposed algorithm to real-time control of an autonomous mobility-on-demand (AMoD) system for a real-world case study to evaluate the effectiveness of the proposed algorithm.
    摘要 强化学习(RL)是一类通过在线学习来设计自适应最优控制器的人工智能算法。本文提出了一种无模型、实时、数据高效的基于 Q 学习的算法,用于求解线性离散时间系统的 H$_{\infty}$ 控制问题。所提算法的计算复杂度由文献中的 $\mathcal{O}(\underline{q}^3)$ 降低到 $\mathcal{O}(\underline{q}^2)$,其中 $\underline{q}$ 与状态变量、控制输入和扰动的维数之和呈二次关系。我们设计了自适应最优控制器,并在不依赖系统动力学知识的情况下在线学习执行网络与评价网络的参数,因此该算法是完全无模型的。此外,算法只需在第一次迭代中加入充分的探测噪声,且该噪声不影响算法本身;无需初始稳定策略,算法即可收敛到由黎卡提(Riccati)方程求得的闭式解。我们将所提算法应用于一个真实案例中的自主按需出行(AMoD)系统的实时控制,通过仿真研究评估了其有效性。

PDFTriage: Question Answering over Long, Structured Documents

  • paper_url: http://arxiv.org/abs/2309.08872
  • repo_url: None
  • paper_authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, Franck Dernoncourt
  • for: 本研究旨在解决大语言模型(LLM)在文档问答(QA)中遇到的问题,即当文档不能适应LLM的小上下文长度时。
  • methods: 本研究提议一种名为PDFTriage的方法,可以基于结构或内容来检索文档上下文。
  • results: 我们的实验显示,在多种问题类型上,PDFTriage 增强的模型能够有效完成文档问答,而现有的检索增强 LLM 则会失败。此外,我们还发布了一个数据集,包含 80 篇结构化文档和 900 多个人工生成的问题,以促进对这一基础问题的进一步研究。
    Abstract Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA.
    摘要 当文档超出大语言模型(LLM)较短的上下文长度时,LLM 在文档问答(QA)上会遇到困难。为了解决这一问题,现有工作大多专注于从文档中检索相关上下文,并将其表示为纯文本。然而,PDF、网页、演示文稿等文档天然具有页面、表格、章节等结构。将这类结构化文档表示为纯文本,与用户对这些富结构文档的心智模型并不一致。当系统需要向文档查询上下文时,这种不一致便会凸显,一些看似简单的问题也可能让问答系统出错。为弥合处理结构化文档的这一根本差距,我们提出了 PDFTriage 方法,使模型能够基于结构或内容来检索上下文。实验表明,在现有检索增强 LLM 失效的多类问题上,PDFTriage 增强的模型表现有效。为促进对这一基础问题的进一步研究,我们发布了一个基准数据集,包含覆盖 10 类问题类型、80 篇结构化文档的 900 多个人工生成的问题。
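The abstract does not describe PDFTriage's prompting scheme in detail, so the following is only a toy sketch of the general idea: parse the document into structural elements with metadata, and run a triage step that selects elements relevant to the question before handing them to the LLM. The `Element` type, the keyword-overlap scoring, and the function names are hypothetical stand-ins; a real system would let the LLM issue structured queries instead.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str        # "section", "table", "page", ...
    title: str
    page: int
    text: str

def triage(question, elements, k=3):
    """Score structural elements by keyword overlap between the question and
    the element metadata, and return the top-k as context for the LLM."""
    q_terms = set(question.lower().split())
    def score(el):
        meta = f"{el.kind} {el.title}".lower().split()
        return len(q_terms.intersection(meta))
    return sorted(elements, key=score, reverse=True)[:k]

doc = [Element("section", "Experimental Results", 5, "..."),
       Element("table", "Table 2: Ablation study", 6, "..."),
       Element("section", "Introduction", 1, "...")]
context = triage("What does the ablation table show?", doc)
```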

MHLAT: Multi-hop Label-wise Attention Model for Automatic ICD Coding

  • paper_url: http://arxiv.org/abs/2309.08868
  • repo_url: None
  • paper_authors: Junwen Duan, Han Jiang, Ying Yu
  • for: 国际疾病分类(ICD)编码任务旨在为临床记录分配相应的 ICD 诊断代码。
  • methods: 我们提出了一种简单 yet effective 的模型,即 Multi-Hop Label-wise ATtention(MHLAT),其中使用多步标签层权重来获得更精准和有用的表示。
  • results: 我们在三个 MIMIC 数据集上进行了广泛的实验,并证明了我们的方法在七个指标中具有显著更好或竞争性表现,并且具有更少的参数优化。
    Abstract International Classification of Diseases (ICD) coding is the task of assigning ICD diagnosis codes to clinical notes. This can be challenging given the large quantity of labels (nearly 9,000) and lengthy texts (up to 8,000 tokens). However, unlike the single-pass reading process in previous works, humans tend to read the text and label definitions again to get more confident answers. Moreover, although pretrained language models have been used to address these problems, they suffer from huge memory usage. To address the above problems, we propose a simple but effective model called the Multi-Hop Label-wise ATtention (MHLAT), in which multi-hop label-wise attention is deployed to get more precise and informative representations. Extensive experiments on three benchmark MIMIC datasets indicate that our method achieves significantly better or competitive performance on all seven metrics, with much fewer parameters to optimize.
    摘要 国际疾病分类(ICD)编码任务是为临床笔记分配 ICD 诊断代码。由于标签数量庞大(近 9,000 个)且文本很长(最多 8,000 个 token),该任务颇具挑战。与以往工作中的单遍阅读不同,人类往往会反复阅读文本和标签定义,以得到更有把握的答案。此外,虽然预训练语言模型已被用于解决这些问题,但其内存开销巨大。针对上述问题,我们提出了一种简单而有效的模型:多跳标签注意力模型(MHLAT),通过多跳的标签级注意力获得更精确、更有信息量的表示。在三个 MIMIC 基准数据集上的大量实验表明,我们的方法在全部七项指标上均取得显著更优或相当的性能,且需要优化的参数量更少。
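Label-wise attention with multiple hops can be sketched as follows: each ICD code keeps a query vector, at every hop the label queries attend over the token representations and are refined with the attended context, and the final per-label vectors are scored. This PyTorch sketch is an assumed structure for illustration; hidden sizes, the update rule, and the number of hops are not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiHopLabelAttention(nn.Module):
    """One label-specific document vector (and logit) per ICD code."""
    def __init__(self, hidden, num_labels, hops=2):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden) * 0.02)
        self.update = nn.ModuleList([nn.Linear(2 * hidden, hidden) for _ in range(hops)])
        self.out = nn.Linear(hidden, 1)

    def forward(self, tokens):                       # tokens: (B, T, hidden)
        b = tokens.size(0)
        q = self.label_queries.unsqueeze(0).expand(b, -1, -1)        # (B, L, hidden)
        for upd in self.update:
            attn = torch.softmax(q @ tokens.transpose(1, 2), dim=-1)  # (B, L, T)
            ctx = attn @ tokens                                        # (B, L, hidden)
            q = torch.tanh(upd(torch.cat([q, ctx], dim=-1)))
        return self.out(q).squeeze(-1)                # (B, L) per-label logits

logits = MultiHopLabelAttention(hidden=256, num_labels=50, hops=2)(torch.randn(2, 128, 256))
```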

Trajectory Tracking Control of Skid-Steering Mobile Robots with Slip and Skid Compensation using Sliding-Mode Control and Deep Learning

  • paper_url: http://arxiv.org/abs/2309.08863
  • repo_url: None
  • paper_authors: Payam Nourizadeh, Fiona J Stevens McFadden, Will N Browne
  • For: 这篇研究旨在提供一种可行的线上运行于开放环境中的游戏机器人运行控制系统,以减少游戏机器人在不可预测的环境中的追踪错误。* Methods: 本研究使用陡缓度控制技术设计了一个可靠的轨迹追踪系统,并将两个先前开发的深度学习模型 [1], [2] 组合到控制反馈循环中,以实时估算游戏机器人的滑行和不适合的滑行,并将补偿值传递到补偿器中。* Results: 实验结果显示,提案的控制器与补偿器可以将轨迹追踪系统的表现提高超过27%。
    Abstract Slip and skid compensation is crucial for mobile robots' navigation in outdoor environments and uneven terrains. In addition to the general slipping and skidding hazards for mobile robots in outdoor environments, slip and skid cause uncertainty for the trajectory tracking system and put the validity of stability analysis at risk. Despite research in this field, having a real-world feasible online slip and skid compensation is still challenging due to the complexity of wheel-terrain interaction in outdoor environments. This paper presents a novel trajectory tracking technique with real-world feasible online slip and skid compensation at the vehicle-level for skid-steering mobile robots in outdoor environments. The sliding mode control technique is utilized to design a robust trajectory tracking system to be able to consider the parameter uncertainty of this type of robot. Two previously developed deep learning models [1], [2] are integrated into the control feedback loop to estimate the robot's slipping and undesired skidding and feed the compensator in a real-time manner. The main advantages of the proposed technique are (1) considering two slip-related parameters rather than the conventional three slip parameters at the wheel-level, and (2) having an online real-world feasible slip and skid compensator to be able to reduce the tracking errors in unforeseen environments. The experimental results show that the proposed controller with the slip and skid compensator improves the performance of the trajectory tracking system by more than 27%.
    摘要 滑动和滑倒补偿是移动机器人在户外环境中的导航关键,它们会导致轨迹追踪系统的不确定性和稳定分析的风险。尽管在这一领域进行了大量研究,但实现在线可行的滑动和滑倒补偿仍然是一个挑战,因为轮胎与地面的互动在户外环境中非常复杂。这篇论文提出了一种新的轨迹追踪技术,使用滑模控制技术设计一个可靠的轨迹追踪系统,并将两个先前开发的深度学习模型[1]、[2]integrated into the control feedback loop来估计机器人的滑动和不良滑倒,并在实时 manner中将其传递给补偿器。该技术的主要优点包括:一、考虑了机器人的两个滑动参数而不是传统的三个滑动参数,二、在实时可行的情况下实现了滑动和滑倒的补偿,从而降低了轨迹追踪系统的跟踪错误。实验结果显示,提案的控制器与滑动和滑倒补偿器可以提高轨迹追踪系统的性能,比例超过27%。

Emerging Approaches for THz Array Imaging: A Tutorial Review and Software Tool

  • paper_url: http://arxiv.org/abs/2309.08844
  • repo_url: None
  • paper_authors: Josiah W. Smith, Murat Torlak
  • for: 该文章主要介绍近场太赫兹(THz)频段中的合成孔径雷达(SAR)系统与算法。
  • methods: 该文章介绍了结合信号处理与机器学习技术的新兴算法,以及经典与数据驱动的 THz SAR 算法,包括安全应用中的目标检测和 SAR 图像超分辨。
  • results: 该文章提出了若干未来研究方向,包括系统与算法的标准化基准测试、采用最新的深度学习技术、面向信号处理优化的机器学习,以及混合数据驱动的信号处理算法。
    Abstract Accelerated by the increasing attention drawn by 5G, 6G, and Internet of Things applications, communication and sensing technologies have rapidly evolved from millimeter-wave (mmWave) to terahertz (THz) in recent years. Enabled by significant advancements in electromagnetic (EM) hardware, mmWave and THz frequency regimes spanning 30 GHz to 300 GHz and 300 GHz to 3000 GHz, respectively, can be employed for a host of applications. The main feature of THz systems is high-bandwidth transmission, enabling ultra-high-resolution imaging and high-throughput communications; however, challenges in both the hardware and algorithmic arenas remain for the ubiquitous adoption of THz technology. Spectra comprising mmWave and THz frequencies are well-suited for synthetic aperture radar (SAR) imaging at sub-millimeter resolutions for a wide spectrum of tasks like material characterization and nondestructive testing (NDT). This article provides a tutorial review of systems and algorithms for THz SAR in the near-field with an emphasis on emerging algorithms that combine signal processing and machine learning techniques. As part of this study, an overview of classical and data-driven THz SAR algorithms is provided, focusing on object detection for security applications and SAR image super-resolution. We also discuss relevant issues, challenges, and future research directions for emerging algorithms and THz SAR, including standardization of system and algorithm benchmarking, adoption of state-of-the-art deep learning techniques, signal processing-optimized machine learning, and hybrid data-driven signal processing algorithms...
    摘要 在 5G、6G 和物联网应用日益受到关注的推动下,通信与感知技术近年来迅速从毫米波(mmWave)演进到太赫兹(THz)频段。得益于电磁(EM)硬件的重大进步,毫米波与太赫兹频段(分别为 30 GHz 至 300 GHz 和 300 GHz 至 3000 GHz)可用于众多应用。太赫兹系统的主要特点是高带宽传输,使超高分辨率成像和高吞吐量通信成为可能;但要普及太赫兹技术,硬件与算法两方面仍存在挑战。覆盖毫米波与太赫兹频率的频谱非常适合亚毫米分辨率的合成孔径雷达(SAR)成像,可用于材料表征、无损检测(NDT)等多种任务。本文以教程综述的形式介绍近场太赫兹 SAR 的系统与算法,重点关注结合信号处理与机器学习技术的新兴算法。作为该研究的一部分,文章概述了经典与数据驱动的太赫兹 SAR 算法,聚焦安全应用中的目标检测与 SAR 图像超分辨,并讨论了相关问题、挑战与未来研究方向,包括系统与算法的标准化基准测试、采用最新的深度学习技术、面向信号处理优化的机器学习以及混合数据驱动的信号处理算法。

Bias and Fairness in Chatbots: An Overview

  • paper_url: http://arxiv.org/abs/2309.08836
  • repo_url: None
  • paper_authors: Jintang Xue, Yun-Cheng Wang, Chengwei Wei, Xiaofeng Liu, Jonghye Woo, C. -C. Jay Kuo
  • for: 本研究旨在提供一份对聊天机器人系统偏见和公平性的全面综述,以帮助开发者更好地设计和实现公平和无偏见的聊天机器人系统。
  • methods: 本研究使用了大量的文献综述和分析方法,检视了聊天机器人系统的历史和类别,分析了偏见的来源和应用中的可能的危害,并考虑了设计公平和无偏见的聊天机器人系统的因素。
  • results: 本研究结果表明,现代聊天机器人系统具有更高的功能和应用前景,但也存在偏见和公平性的担忧。通过分析偏见的来源和应用中的影响,以及考虑设计公平和无偏见的因素,可以更好地设计和实现公平和无偏见的聊天机器人系统。
    Abstract Chatbots have been studied for more than half a century. With the rapid development of natural language processing (NLP) technologies in recent years, chatbots using large language models (LLMs) have received much attention nowadays. Compared with traditional ones, modern chatbots are more powerful and have been used in real-world applications. There are however, bias and fairness concerns in modern chatbot design. Due to the huge amounts of training data, extremely large model sizes, and lack of interpretability, bias mitigation and fairness preservation of modern chatbots are challenging. Thus, a comprehensive overview on bias and fairness in chatbot systems is given in this paper. The history of chatbots and their categories are first reviewed. Then, bias sources and potential harms in applications are analyzed. Considerations in designing fair and unbiased chatbot systems are examined. Finally, future research directions are discussed.
    摘要 聊天机器人的研究已有半个多世纪的历史。随着近年来自然语言处理(NLP)技术的快速发展,基于大语言模型(LLM)的聊天机器人受到了广泛关注。与传统聊天机器人相比,现代聊天机器人功能更强大,并已在实际应用中落地。然而,现代聊天机器人的设计中存在偏见与公平性问题。由于训练数据规模庞大、模型体量极大且缺乏可解释性,现代聊天机器人的偏见缓解与公平性保障面临挑战。因此,本文对聊天机器人系统中的偏见与公平性问题进行了全面综述:首先回顾聊天机器人的发展历史及其类别;随后分析偏见的来源及其在应用中可能造成的危害;接着讨论设计公平、无偏见聊天机器人系统时需要考虑的因素;最后展望未来的研究方向。

SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window

  • paper_url: http://arxiv.org/abs/2309.08832
  • repo_url: None
  • paper_authors: Vikas Raunak, Tom Kocmi, Matt Post
  • for: 这篇论文是关于语义评估的研究,旨在检验文档中的句子上是否可以提供同样的信息,如同人工参考。
  • methods: 这篇论文使用了一个新的评估指标,即SLIDE(SLiding Document Evaluator),它在文档中的块 Sentence 上使用了一个滑动窗口,将每个块传递给一个未修改的、市场上可得的质量评估模型进行评估。
  • results: 研究发现,SLIDE 可以达到高级系统精度,在某些情况下,甚至可以与人工参考 metric 减少差距。这表明,文档中的源Context 可以提供同样的信息,如同人工参考。
    Abstract Reference-based metrics that operate at the sentence level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. We investigate whether additional source context can effectively substitute for a reference. We present a metric, SLIDE (SLiding Document Evaluator), which operates on blocks of sentences using a window that slides over each document in the test set, feeding each chunk into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-base metrics. This suggests that source context may provide the same information as a human reference.
    摘要 在句子级别上,基于参考译文的评测指标通常优于只能访问源文与系统输出的质量评估指标。这并不意外,因为参考译文可以消解源文中可能存在的歧义。我们研究额外的源端上下文能否有效替代参考译文。我们提出了一种名为 SLIDE(SLiding Document Evaluator)的指标:它以句子块为单位,在测试集中的每篇文档上滑动一个窗口,并将每个文本块输入一个未经修改、现成可用的质量评估模型。我们发现,SLIDE 的系统级成对判别准确率显著高于其句子级基线,在某些情况下甚至消除了与基于参考译文的指标之间的差距。这表明源端上下文可能提供与人工参考译文等价的信息。
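The sliding-window mechanic described above is easy to make concrete. In the sketch below, source and translation sentences are chunked with a fixed window and stride and each chunk is then scored by an off-the-shelf quality-estimation model; the window size, stride, averaging rule, and the `qe_model.predict` interface are assumptions for illustration, not SLIDE's exact configuration.

```python
def sliding_chunks(src_sentences, mt_sentences, window=6, stride=3):
    """Slide a fixed-size window over aligned source/MT sentences and join
    each window into one (source chunk, MT chunk) pair for reference-free QE."""
    assert len(src_sentences) == len(mt_sentences)
    chunks = []
    for start in range(0, max(len(src_sentences) - window, 0) + 1, stride):
        end = start + window
        chunks.append((" ".join(src_sentences[start:end]),
                       " ".join(mt_sentences[start:end])))
    return chunks

# One possible system-level aggregation: average the QE scores over all chunks.
# chunks = sliding_chunks(doc_src, doc_mt)
# score = sum(qe_model.predict(s, t) for s, t in chunks) / len(chunks)
```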

S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs

  • paper_url: http://arxiv.org/abs/2309.08827
  • repo_url: None
  • paper_authors: Sarkar Snigdha Sarathi Das, Chirag Shah, Mengting Wan, Jennifer Neville, Longqi Yang, Reid Andersen, Georg Buscher, Tara Safavi
  • for: 提高open-domain对话系统中 Dialogue State Tracking(DST)的精度和 robustness,以适应大语言模型(LLM)驱动的对话系统中的复杂性和多样性。
  • methods: 提出了一种joint dialogue segmentation和state tracking的方法,使用Pre-Analytical Recollection机制来改进长期上下文跟踪。
  • results: 在一个Proprietary anonymous open-domain对话数据集以及公共可用的DST和分割数据集上进行了评估,与现状态的最佳方法进行比较,结果表明S3-DST在joint segmentation和state tracking中具有强大和稳定的性能。
    Abstract The traditional Dialogue State Tracking (DST) problem aims to track user preferences and intents in user-agent conversations. While sufficient for task-oriented dialogue systems supporting narrow domain applications, the advent of Large Language Model (LLM)-based chat systems has introduced many real-world intricacies in open-domain dialogues. These intricacies manifest in the form of increased complexity in contextual interactions, extended dialogue sessions encompassing a diverse array of topics, and more frequent contextual shifts. To handle these intricacies arising from evolving LLM-based chat systems, we propose joint dialogue segmentation and state tracking per segment in open-domain dialogue systems. Assuming a zero-shot setting appropriate to a true open-domain dialogue system, we propose S3-DST, a structured prompting technique that harnesses Pre-Analytical Recollection, a novel grounding mechanism we designed for improving long context tracking. To demonstrate the efficacy of our proposed approach in joint segmentation and state tracking, we evaluate S3-DST on a proprietary anonymized open-domain dialogue dataset, as well as publicly available DST and segmentation datasets. Across all datasets and settings, S3-DST consistently outperforms the state-of-the-art, demonstrating its potency and robustness the next generation of LLM-based chat systems.
    摘要 传统的对话状态追踪(DST)问题旨在跟踪用户与智能体对话中的用户偏好与意图。这对于支持窄领域应用的任务型对话系统已经足够,但基于大语言模型(LLM)的聊天系统的出现给开放域对话带来了许多现实世界中的复杂性:上下文交互更加复杂,对话会话更长且涵盖更多样的话题,上下文切换也更为频繁。为应对不断演进的基于 LLM 的聊天系统所带来的这些复杂性,我们提出在开放域对话系统中对对话进行分段,并按分段进行状态追踪。在适用于真正开放域对话系统的零样本设定下,我们提出了 S3-DST:一种结构化提示技术,其核心是我们为改进长上下文跟踪而设计的新型锚定机制 Pre-Analytical Recollection。为验证所提方法在联合分段与状态追踪上的有效性,我们在一个匿名化的专有开放域对话数据集以及公开的 DST 与分段数据集上评估了 S3-DST。在所有数据集和设定下,S3-DST 均稳定地超越当前最佳方法,展现了其面向下一代基于 LLM 的聊天系统的实力与鲁棒性。

Distributionally Robust Post-hoc Classifiers under Prior Shifts

  • paper_url: http://arxiv.org/abs/2309.08825
  • repo_url: https://github.com/weijiaheng/drops
  • paper_authors: Jiaheng Wei, Harikrishna Narasimhan, Ehsan Amid, Wen-Sheng Chu, Yang Liu, Abhishek Kumar
  • for: 本研究旨在增强机器学习模型在类别先验或群体先验发生分布变化时的鲁棒性。
  • methods: 我们提出了一种极其轻量级的后处理方法,在验证集上求解一个约束优化问题,对预训练模型的预测进行缩放调整,以最小化围绕所选目标分布的分布鲁棒损失。
  • results: 我们的方法具有可证明的保证,并且在实验中为分布鲁棒的后处理分类器提供了有力的支持。
    Abstract The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributional robust post-hoc classifiers. An empirical implementation is available at https://github.com/weijiaheng/Drops.
    摘要
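As a rough illustration of what a post-hoc, prior-aware scaling adjustment looks like, here is a minimal numpy sketch of the classic prior-correction rule; it is not the paper's constrained distributionally robust optimization, and all priors and probabilities below are toy values.

```python
import numpy as np

def prior_shift_adjust(probs, train_prior, target_prior):
    """Rescale predicted class probabilities from a pre-trained model
    toward a chosen target class-prior (classic prior-correction rule)."""
    w = np.asarray(target_prior) / np.asarray(train_prior)  # per-class scaling weights
    adjusted = probs * w                                     # apply scaling to each prediction
    return adjusted / adjusted.sum(axis=1, keepdims=True)    # renormalise rows

# toy example: 3-class model trained under a skewed prior
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3]])
train_prior  = [0.6, 0.3, 0.1]   # estimated from the training set
target_prior = [1/3, 1/3, 1/3]   # distribution we want to be robust around
print(prior_shift_adjust(probs, train_prior, target_prior).round(3))
```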

GPT as a Baseline for Recommendation Explanation Texts

  • paper_url: http://arxiv.org/abs/2309.08817
  • repo_url: None
  • paper_authors: Joyce Zhou, Thorsten Joachims
  • for: 这个研究探讨了现代模型生成的电影推荐文本解释如何帮助用户,以及用户对不同组成部分的评价。
  • methods: 研究使用现代自然语言处理技术生成电影推荐文本解释,并对用户的评价进行分析。
  • results: 研究发现参与者对电影推荐文本解释的评价没有显著差异,但参与者对已经见过的电影的评价更高。此外,参与者还标记了电影评论文本中对各项质量而言重要的特征。
    Abstract In this work, we establish a baseline potential for how modern model-generated text explanations of movie recommendations may help users, and explore what different components of these text explanations that users like or dislike, especially in contrast to existing human movie reviews. We found that participants gave no significantly different rankings between movies, nor did they give significantly different individual quality scores to reviews of movies that they had never seen before. However, participants did mark reviews as significantly better when they were movies they had seen before. We also explore specific aspects of movie review texts that participants marked as important for each quality. Overall, we establish that modern LLMs are a promising source of recommendation explanations, and we intend on further exploring personalizable text explanations in the future.
    摘要 在这项工作中,我们建立了现代模型生成的电影推荐文本解释可以如何帮助用户的基线,并探索用户喜欢或不喜欢这些文本解释的哪些组成部分,特别是与现有的人类电影评论相比。我们发现参与者对不同电影给出的排名没有显著差异,对未曾看过的电影的评论给出的质量评分也没有显著差异;但参与者确实将已看过的电影的评论评为明显更好。我们还探究了参与者认为电影评论文本中对各项质量而言重要的具体方面。总的来说,我们确认现代LLM是有前景的推荐解释来源,并计划在未来进一步探索个性化的文本解释。

cs.CL - 2023-09-16

The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated

  • paper_url: http://arxiv.org/abs/2309.09092
  • repo_url: None
  • paper_authors: Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
  • for: 这篇论文主要探讨了预训模型中的社会偏见,以及如何对这些偏见进行调整。
  • methods: 论文使用了多种方法来对预训模型进行调整,以消除社会偏见。
  • results: 论文的实验结果显示,对于不同的下游任务,去偏见的影响都被低估了。此外,为了正确评估去偏见的影响,应该分别考虑包含女性词、男性词和刻板印象词的实例。
    Abstract Pre-trained language models trained on large-scale data have learned serious levels of social biases. Consequently, various methods have been proposed to debias pre-trained models. Debiasing methods need to mitigate only discriminatory bias information from the pre-trained models, while retaining information that is useful for the downstream tasks. In previous research, whether useful information is retained has been confirmed by the performance of downstream tasks in debiased pre-trained models. On the other hand, it is not clear whether these benchmarks consist of data pertaining to social biases and are appropriate for investigating the impact of debiasing. For example in gender-related social biases, data containing female words (e.g. ``she, female, woman''), male words (e.g. ``he, male, man''), and stereotypical words (e.g. ``nurse, doctor, professor'') are considered to be the most affected by debiasing. If there is not much data containing these words in a benchmark dataset for a target task, there is the possibility of erroneously evaluating the effects of debiasing. In this study, we compare the impact of debiasing on performance across multiple downstream tasks using a wide-range of benchmark datasets that containing female, male, and stereotypical words. Experiments show that the effects of debiasing are consistently \emph{underestimated} across all tasks. Moreover, the effects of debiasing could be reliably evaluated by separately considering instances containing female, male, and stereotypical words than all of the instances in a benchmark dataset.
    摘要 各种方法已经被提议来减少预训练模型中的社会偏见。这些方法需要从预训练模型中除去排斥性偏见信息,而不是抹除有用的信息。在之前的研究中,已经证明了这些减少后的模型在下游任务中的表现。然而,不清楚这些标准 benchmark 数据是否包含社会偏见的信息,并不适用于研究减少的影响。例如,在性别相关的社会偏见中,包含女性词汇(如“她,女性,女人”)、男性词汇(如“他,男性,男人”)和 gender 刻板印象(如“护士,医生,教授”)被视为最受减少影响。如果 benchmark 数据中不含这些词汇的数据,那么可能会误判减少的影响。在这种研究中,我们比较了减少对多个下游任务的表现,使用包含女性、男性和 gender 刻板印象的多种 benchmark 数据。实验显示,减少的影响被一致地低估,并且可以通过分别考虑包含女性、男性和 gender 刻板印象的实例来可靠地评估减少的影响。
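A minimal sketch of the evaluation idea of scoring female-, male-, and stereotype-word instances separately; the word lists and helper names below are illustrative rather than the paper's actual lexicons.

```python
# Hypothetical helper: split a benchmark into instances containing female, male,
# or stereotypical words, then score each subset separately.
FEMALE = {"she", "her", "female", "woman"}
MALE   = {"he", "his", "male", "man"}
STEREO = {"nurse", "doctor", "professor"}

def group_of(text):
    tokens = set(text.lower().split())
    groups = [name for name, words in
              (("female", FEMALE), ("male", MALE), ("stereotype", STEREO))
              if tokens & words]
    return groups or ["other"]

def per_group_accuracy(texts, gold, pred):
    correct, total = {}, {}
    for t, g, p in zip(texts, gold, pred):
        for grp in group_of(t):
            total[grp] = total.get(grp, 0) + 1
            correct[grp] = correct.get(grp, 0) + int(g == p)
    return {grp: correct[grp] / total[grp] for grp in total}

texts = ["she is a nurse", "he plays chess", "the weather is nice"]
print(per_group_accuracy(texts, gold=[1, 0, 1], pred=[1, 0, 0]))
```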

Improving Speech Recognition for African American English With Audio Classification

  • paper_url: http://arxiv.org/abs/2309.09996
  • repo_url: None
  • paper_authors: Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara Sainath, Françoise Beaufays, Pedro Moreno Mengibar
  • for: 缓解自动语音识别系统在不同语言变体之间的质量差异。
  • methods: 使用少量域外(长篇)非裔美国英语(AAE)数据来提高美国英语短篇语音识别系统的鲁棒性。
  • results: 使用CORAAL、YouTube和Mozilla Common Voice等数据集训练一个音频分类器,近似判断一条语音是AAE还是包括主流美国英语(MAE)在内的其他变体。将分类器输出与粗粒度的地理信息相结合,可以从大规模未转写的短篇查询语料中选择一部分语音用于大规模半监督学习。在这些数据上微调后,AAE与MAE之间的词错误率差距相对降低了38.5%,且不损失MAE的识别质量。
    Abstract Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
    摘要 We use CORAAL, YouTube, and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety, including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
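A small sketch of the data-selection step described above, assuming hypothetical per-utterance classifier scores and geographic priors; the thresholds and helper names are illustrative only.

```python
# Hypothetical selection step: keep untranscribed utterances that the audio
# classifier scores as likely AAE and that come from regions with a high prior.
def select_for_semisupervised(utterances, aae_score, region_prior,
                              score_thresh=0.8, prior_thresh=0.5):
    """utterances: list of ids; aae_score/region_prior: dicts keyed by id."""
    return [u for u in utterances
            if aae_score[u] >= score_thresh and region_prior[u] >= prior_thresh]

utts = ["u1", "u2", "u3"]
scores = {"u1": 0.92, "u2": 0.40, "u3": 0.85}
priors = {"u1": 0.7,  "u2": 0.9,  "u3": 0.3}
print(select_for_semisupervised(utts, scores, priors))  # ['u1']
```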

  • paper_url: http://arxiv.org/abs/2309.09069
  • repo_url: None
  • paper_authors: Thi-Hai-Yen Vuong, Minh-Quan Hoang, Tan-Minh Nguyen, Hoang-Trung Nguyen, Ha-Thanh Nguyen
  • for: 本文提出了一种法律案例文档和相关法律知识图构建方法,以提高法律信息的有效组织和下游任务的提高。
  • methods: 本方法包括三个主要步骤:数据抓取、信息提取和知识图部署。首先,数据抓取器从多种来源收集了大量的法律案例文档和相关法律信息,为后续处理提供了丰富的数据库。然后,信息提取步骤使用自然语言处理技术提取了法律案例文档中的法院、案件、领域和法律等实体,以及它们之间的关系。最后,知识图被部署,将这些实体连接起来,创建了一个多元图,有效地表示了法律信息,且适用于法律领域的用户,如律师、法官和学者。
  • results: 建立的基线模型通过不监督学习方法,并通过知识图的支持,能够为给定的法律案例提取相关法律。这种方法开启了法律领域的多种应用,如法律案例分析、法律建议和决策支持。
    Abstract This paper presents a knowledge graph construction method for legal case documents and related laws, aiming to organize legal information efficiently and enhance various downstream tasks. Our approach consists of three main steps: data crawling, information extraction, and knowledge graph deployment. First, the data crawler collects a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing. Next, the information extraction step employs natural language processing techniques to extract entities such as courts, cases, domains, and laws, as well as their relationships from the unstructured text. Finally, the knowledge graph is deployed, connecting these entities based on their extracted relationships, creating a heterogeneous graph that effectively represents legal information and caters to users such as lawyers, judges, and scholars. The established baseline model leverages unsupervised learning methods, and by incorporating the knowledge graph, it demonstrates the ability to identify relevant laws for a given legal case. This approach opens up opportunities for various applications in the legal domain, such as legal case analysis, legal recommendation, and decision support.
    摘要
  1. Data crawling: Collect a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing.2. Information extraction: Employ natural language processing techniques to extract entities such as courts, cases, domains, and laws, as well as their relationships from the unstructured text.3. Knowledge graph deployment: Connect these entities based on their extracted relationships, creating a heterogeneous graph that effectively represents legal information and caters to users such as lawyers, judges, and scholars.The established baseline model leverages unsupervised learning methods, and by incorporating the knowledge graph, it demonstrates the ability to identify relevant laws for a given legal case. This approach opens up opportunities for various applications in the legal domain, such as legal case analysis, legal recommendation, and decision support.
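A toy sketch of the knowledge-graph deployment step using networkx; the entity names, relation types, and the simple lookup at the end are illustrative, not the paper's actual schema or baseline model.

```python
import networkx as nx

# Toy triples as they might come out of the information-extraction step.
triples = [
    ("Case_001", "heard_by", "Supreme_Court"),
    ("Case_001", "belongs_to", "Civil_Law"),
    ("Case_001", "cites", "Law_33/2005"),
    ("Case_002", "cites", "Law_33/2005"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)  # heterogeneous edges keyed by relation type

# Simple lookup: laws directly connected to a case via the 'cites' relation.
def cited_laws(case):
    return [tail for _, tail, data in kg.out_edges(case, data=True)
            if data["relation"] == "cites"]

print(cited_laws("Case_002"))  # ['Law_33/2005']
```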

Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF

  • paper_url: http://arxiv.org/abs/2309.09055
  • repo_url: https://github.com/simengsun/alpaca_farm_lora
  • paper_authors: Simeng Sun, Dhawal Gupta, Mohit Iyyer
  • for: 本技术报告旨在实证研究使用低秩适配(LoRA)高效实现RLHF,使得仅用两块A100 GPU即可完成RLHF,而无需全模型微调所需的八块GPU。
  • methods: 本文使用LoRA实现RLHF,并评估了基于LoRA的PPO实现的多种配置,包括移除KL正则项、使用Jensen-Shannon散度等其他正则项,以及使用LoRA训练对模型的影响。
  • results: 本文发现:(1) 在LoRA设置下,移除KL正则项不会损害在AlpacaFarm评估集上的性能;(2) 使用Jensen-Shannon散度等其他正则项可以提升性能;(3) PPO训练会降低模型生成回答的真实性,而使用LoRA训练可以在很大程度上缓解这一影响。
    Abstract During the last stage of RLHF, a large language model is aligned to human intents via PPO training, a process that generally requires large-scale computational resources. In this technical report, we empirically investigate an efficient implementation of RLHF using low-rank adaptation (LoRA), which allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs instead of the eight required for full model fine-tuning. Despite tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning. Next, we analyze several configurations of our LoRA-based PPO implementation, varying the form of the KL regularization term in the training objective. We find that (1) removing this penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other regularizers, such as Jensen-Shannon divergence, lead to improved performance; and (3) while PPO training negatively impacts the factuality of model-generated responses, training with LoRA largely mitigates this effect. We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
    摘要
  1. Removing the penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup.2. Other regularizers, such as Jensen-Shannon divergence, lead to improved performance.3. PPO training negatively impacts the factuality of model-generated responses, but training with LoRA largely mitigates this effect.We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
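A minimal sketch of attaching LoRA adapters to a 7B causal LM with the peft library; the base checkpoint, rank, and target modules below are illustrative choices rather than the report's exact configuration.

```python
# Sketch only: hyperparameters and the base checkpoint are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of the 7B parameters trains
# The PPO stage then optimizes only these adapter weights; the KL penalty against
# the reference policy can be dropped or swapped for a Jensen-Shannon term.
```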

Context-aware Adversarial Attack on Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2309.08999
  • repo_url: None
  • paper_authors: Shuguang Chen, Leonardo Neves, Thamar Solorio
  • for: 研究命名实体识别任务中模型的鲁棒性,采用上下文感知的对抗攻击方法。
  • methods: 提出通过扰动对识别实体最有信息量的词语来构造对抗样本,并研究不同的候选替换方法以生成自然且合理的对抗样本。
  • results: 实验和分析表明,我们的方法比强基线更能有效地诱使模型作出错误预测。
    Abstract In recent years, large pre-trained language models (PLMs) have achieved remarkable performance on many natural language processing benchmarks. Despite their success, prior studies have shown that PLMs are vulnerable to attacks from adversarial examples. In this work, we focus on the named entity recognition task and study context-aware adversarial attack methods to examine the model's robustness. Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples and investigate different candidate replacement methods to generate natural and plausible adversarial examples. Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
    摘要 近年来,大型预训练语言模型(PLM)在许多自然语言处理基准上表现出色。然而,先前的研究表明PLM容易受到对抗样本的攻击。在这项工作中,我们聚焦于命名实体识别任务,研究上下文感知的对抗攻击方法,以考察模型的鲁棒性。具体而言,我们提议扰动对识别实体最有信息量的词语来构造对抗样本,并比较不同的候选替换方法,以生成自然且合理的对抗样本。实验和分析表明,我们的方法比强基线更能有效地诱使模型做出错误预测。
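A generic occlusion-style sketch of ranking the most informative words for an entity prediction; the paper's informativeness measure and candidate replacement strategy may differ, and the toy model below is purely illustrative.

```python
# Generic occlusion-style scoring: rank words by how much masking them changes
# the model's confidence in the entity label. `entity_confidence` is a stand-in
# for any NER model's probability of the original prediction.
def rank_informative_words(tokens, entity_confidence):
    base = entity_confidence(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores.append((base - entity_confidence(masked), tokens[i], i))
    return sorted(scores, reverse=True)  # most informative words first

# toy stand-in model: confidence drops when a cue word disappears
def toy_confidence(tokens):
    return 0.9 if "president" in tokens else 0.4

print(rank_informative_words("the president visited Paris".split(), toy_confidence))
```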

Rethinking STS and NLI in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08969
  • repo_url: None
  • paper_authors: Yuxia Wang, Minghan Wang, Preslav Nakov
  • for: 这项研究旨在在大语言模型(LLM)时代重新思考语义文本相似度(STS)和自然语言推理(NLI)任务。
  • methods: 我们首先在五个数据集上评估临床/生物医学STS和NLI的准确率,然后评估LLM的预测置信度及其捕捉人类集体意见的能力。
  • results: 我们发现LLM可以为特定话题提供个性化描述,或以不同语气生成语义相似的内容,但目前的LLM很难做出个性化的判断或决策。此外,零样本ChatGPT在临床和生物医学STS/NLI上取得了与微调BERT-base相当的准确率;但不同采样之间差异较大,集成后的结果表现最佳。
    Abstract In this study, we aim to rethink STS and NLI in the era of large language models (LLMs). We first evaluate the accuracy of clinical/biomedical STS and NLI over five datasets, and then we assess LLM predictive confidence and their capability of capturing collective human opinions. We find that LLMs may be able to provide personalised descriptions for a specific topic, or to generate semantically similar content in different tones, but that this is hard for current LLMs to make personalised judgements or decisions. We further find that zero-shot ChatGPT achieves competitive accuracy over clinical and biomedical STS/NLI, constraining to the fine-tuned BERT-base. However, there is a large variation in sampling, ensembled results perform the best.
    摘要 在本研究中,我们旨在在大语言模型(LLM)时代重新思考语义文本相似度(STS)和自然语言推理(NLI)。我们首先在五个数据集上评估临床/生物医学STS和NLI的准确率,然后评估LLM的预测置信度及其捕捉人类集体意见的能力。我们发现LLM或许能够为特定主题提供个性化描述,或以不同语气生成语义相似的内容,但目前的LLM很难为个人做出个性化的判断或决策。此外,我们发现零样本ChatGPT在临床和生物医学STS/NLI上取得了与微调BERT-base相当的准确率;然而不同采样之间差异较大,集成后的结果表现最佳。

Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

  • paper_url: http://arxiv.org/abs/2309.08968
  • repo_url: None
  • paper_authors: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
  • for: 这个论文旨在探讨如何使用SortedNet训练技术来实现大语言模型的动态推论,并且不需要预先训练和专门的硬件支持。
  • methods: 这个论文使用了SortedNet训练技术,将深度神经网络分成多个子模型,并且根据计算/准确性特征进行排序,以获得不同 Computational loads的子模型。
  • results: 这个论文的结果显示,使用Sorted Fine-Tuning(SoFT)技术可以实现大语言模型的动态推论,并且可以提高模型的效率,不需要预先训练和专门的硬件支持。
    Abstract The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference for deep neural networks. It leverages network modularity to create sub-models with varying computational loads, sorting them based on computation/accuracy characteristics in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining and by only replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same costs. Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that using this approach, we are able to unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. By applying this approach on LLaMa 2 13B for tuning on the Stanford Alpaca dataset and comparing it to normal tuning and early exit via PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding performance.
    摘要 大语言模型(LLM)的快速进步已经革新了自然语言处理(NLP)领域。虽然这些模型能够理解和生成类人文本,但其广泛部署的成本可能非常高昂。SortedNet是一种近期提出的训练技术,用于实现深度神经网络的动态推理:它利用网络的模块化特性创建具有不同计算负担的子模型,并按计算量/准确率特征以嵌套方式排序。我们将SortedNet扩展到生成式NLP任务,使大语言模型无需任何预训练即可实现动态推理,只需以相同成本将标准的监督微调(SFT)替换为排序微调(SoFT)。我们的方法提升了模型效率,消除了推理时针对不同场景维护多个模型的需求。我们展示了这种方法能够释放Transformer中间层生成目标输出的潜力;子模型仍是原模型的组成部分,从而最大限度地降低了存储需求以及在不同计算/延迟预算之间切换的成本。我们在斯坦福Alpaca数据集上对LLaMa 2 13B应用该方法,并与普通微调及基于PandaLM基准的提前退出方法进行比较,结果表明排序微调可以在保持或超越原模型性能的同时,使模型推理速度提高一倍。
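A toy PyTorch sketch of the core idea of supervising intermediate depths with a shared output head so that layer prefixes become usable sub-models; the architecture and the uniform loss averaging are illustrative, not the exact SoFT recipe.

```python
import torch
import torch.nn as nn

class ToySortedLM(nn.Module):
    """Toy transformer LM stack where every prefix of layers shares one LM head,
    so each prefix can be used as a cheaper sub-model at inference time."""
    def __init__(self, vocab=100, dim=64, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(layers))
        self.head = nn.Linear(dim, vocab)  # shared across all exit points

    def forward(self, ids, targets):
        h, losses = self.embed(ids), []
        for block in self.blocks:
            h = block(h)
            logits = self.head(h)          # predict from this intermediate depth
            losses.append(nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)))
        return torch.stack(losses).mean()  # sorted-style objective: average over exits

ids = torch.randint(0, 100, (2, 16))
loss = ToySortedLM()(ids, ids)  # toy target = input, just to exercise the loss
loss.backward()
```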

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

  • paper_url: http://arxiv.org/abs/2309.08963
  • repo_url: https://github.com/gersteinlab/struc-bench
  • paper_authors: Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein
  • for: 这个研究是为了评估当前的大型自然语言模型(LLMs)在生成复杂结构数据方面的能力,并提出了一种结构意识细化适应方法来改进这种能力。
  • methods: 研究者们提出了Struc-Bench,一个包括五种代表性的LLMs(即GPT-NeoX 20B、GPT-3.5、GPT-4、Vicuna)的评估方法,并对这些模型在手动构建的文本、HTML和LaTeX表格上进行了全面的评估。
  • results: 研究者们发现了当前模型在处理复杂结构输出时存在一些共同的格式错误和改进的可能性,并使用FormatCoT(链式思维)生成Format指令从目标输出中提取了 Format 信息。在应用结构意识细化适应方法后,LLaMA-7B模型在遵守自然语言约束方面表现出色,超过了其他评估的LLMs。
    Abstract Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of Current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, include five representative LLMs (i.e., GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna) and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities from six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
    摘要 尽管大型语言模型(LLM)如GPT-4具有强大的语言生成能力,但它们仍然在需要生成复杂结构化输出的任务上遇到困难。在这项研究中,我们评估当今LLM在生成复杂结构数据方面的能力,并提出一种结构感知微调方法来改进这种能力。为了进行全面的评估,我们提出了Struc-Bench,包括多个代表性的LLM(如GPT-NeoX 20B、GPT-3.5、GPT-4、Vicuna),并在我们精心构建的涵盖原始文本、HTML和LaTeX表格的数据集上对它们进行评估。根据我们对当前模型性能的分析,我们识别出了常见的格式错误和潜在的改进方向。为了处理复杂的格式要求,我们使用FormatCoT(思维链)从目标输出生成格式指令。我们的实验表明,当我们将结构感知微调方法应用到LLaMA-7B时,可以显著提高其遵循自然语言约束的能力,超过其他被评估的LLM。基于这些结果,我们提出了涵盖六个维度(即覆盖率、格式、推理、理解、语用和幻觉)的模型能力图,这一能力图突出了LLM在处理复杂结构化输出上的弱点,并指出了未来工作的有希望的方向。我们的代码和模型可以在 https://github.com/gersteinlab/Struc-Bench 找到。

ODSum: New Benchmarks for Open Domain Multi-Document Summarization

  • paper_url: http://arxiv.org/abs/2309.08960
  • repo_url: None
  • paper_authors: Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
  • for: 这篇论文的主要目标是提出一种基于规则的方法,将基于查询的文档摘要数据集转换为开放领域多文档摘要(ODMDS)数据集。
  • methods: 该方法采用 retrieve-then-summarize 范式,并引入一个新的数据集 ODSum,其文档索引相互依赖且往往相互关联。
  • results: 通过大量实验,作者考察了不同评价指标的差异及其可靠性,并发现检索错误会给 LLM 带来较大的性能损失;作者还尝试了改进性能的方法,并考察了其对不完美检索的鲁棒性。
    Abstract Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case with its document index interdependent and often interrelated. We tackle ODMDS with the \textit{retrieve-then-summarize} method, and the performance of a list of retrievers and summarizers is investigated. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also found that LLMs suffer great performance loss from retrieving errors. We further experimented methods to improve the performance as well as investigate their robustness against imperfect retrieval. We will release our data and code at https://github.com/yale-nlp/ODSum.
    摘要 开放领域多文摘要 (ODMDS) 是一种重要的工具,可以将庞大的文档缩写成一个准确、简洁的摘要。由于文档集之间存在较多的相互关联,因此在 Retrieval 中不存在答案,这使得评估表现变得更加困难。我们提出了一种基于规则的方法,用于将查询基于文档摘要 dataset 转换为 ODMDS dataset。基于这种方法,我们提出了一个新的 dataset,ODSum,它的文档索引相互关联,并且经常存在相互关联。我们使用“Retrieve-then-summarize”方法来解决 ODMDS,并investigate了一系列的检索和摘要器的性能。通过广泛的实验,我们发现了评估指标之间的差异和其可靠性。此外,我们还发现了 LLMS 在检索错误时的性能下降。我们进一步调查了改进性能的方法,以及其对不完整检索的Robustness。我们将在 GitHub 上发布数据和代码。
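A minimal retrieve-then-summarize sketch with TF-IDF retrieval and a placeholder summarizer; the actual ODSum pipelines use stronger retrievers and LLM summarizers, so everything below is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["doc about solar power ...", "doc about wind farms ...", "doc about opera ..."]
query = "renewable energy sources"

vec = TfidfVectorizer().fit(docs)
scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
top_k = scores.argsort()[::-1][:2]          # indices of the best-matching documents

def summarize(texts):                       # placeholder for an LLM / fine-tuned summarizer
    return " | ".join(t[:20] for t in texts)

print(summarize([docs[i] for i in top_k]))
```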

Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals

  • paper_url: http://arxiv.org/abs/2309.08949
  • repo_url: None
  • paper_authors: Zhiyuan Hu, Yue Feng, Yang Deng, Zekun Li, See-Kiong Ng, Anh Tuan Luu, Bryan Hooi
  • for: 提高对话系统的效率和成功率,以及用户满意度。
  • methods: 采用了前期预测对话动作的方法,并将目标奖励信号纳入对话系统中。
  • results: 在MultiWoZ 2.1 dataset上实现了比前一代完全监督模型更高的性能,同时用户满意度和系统效率也得到了提高。
    Abstract Recently, the development of large language models (LLMs) has been significantly enhanced the question answering and dialogue generation, and makes them become increasingly popular in current practical scenarios. While unlike the general dialogue system which emphasizes the semantic performance, the task-oriented dialogue (ToD) systems aim to achieve the dialogue goal efficiently and successfully in multiple turns. Unfortunately, existing LLM-induced ToD systems lack the direct reward toward the final goal and do not take account of the dialogue proactivity that can strengthen the dialogue efficiency. To fill these gaps, we introduce the ProToD (Proactively Goal-Driven LLM-Induced ToD) approach, which anticipates the future dialogue actions and incorporates the goal-oriented reward signal to enhance ToD systems. Additionally, we present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations. This method allows us to gauge user satisfaction, system efficiency and successful rate while overcoming the limitations of current Information and Success metrics. Empirical experiments conducted on the MultiWoZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data compared to previous end-to-end fully supervised models. This improvement is accompanied by enhanced user satisfaction and efficiency.
    摘要 (Simplified Chinese translation)近期,大型语言模型(LLM)的开发已经大幅提高了问答和对话生成,使其在现实场景中越来越受欢迎。而不同于普通的对话系统,任务对话(ToD)系统的目标是在多个转换中efficiently完成对话目标。然而,现有的LLM引导的ToD系统缺乏直接奖励最终目标,并不考虑对话的积极性,这可以增强对话效率。为了填补这些空白,我们介绍了ProToD(主动目标驱动LLM引导ToD)方法,预测未来对话动作并将目标奖励信号纳入ToD系统。此外,我们还提出了一种新的评估方法,基于目标驱动对话 simulations,以评估ToD系统的用户满意度、系统效率和成功率。这种方法可以超越现有的信息和成功指标,评估ToD系统的性能。实验表明,我们的模型在MultiWoZ 2.1数据集上可以使用只有10%的数据达到前一个完全监督模型的性能,这种改进是由用户满意度和效率增加而成就的。

Contextual Label Projection for Cross-Lingual Structure Extraction

  • paper_url: http://arxiv.org/abs/2309.08943
  • repo_url: None
  • paper_authors: Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, Nanyun Peng
  • For: The paper is written for the task of creating pseudo-training data in target languages for structure extraction tasks, specifically event argument extraction.* Methods: The paper proposes a method called CLAP, which translates text to the target language and performs contextual translation on the labels using the translated text as the context. The method uses instruction-tuned language models with multilingual capabilities as the contextual translator.* Results: The paper reports that CLAP improves the F1-score by 2-2.5 points over other label projection techniques on the Chinese and Arabic ACE05 datasets.
    Abstract Translating training data into target languages has proven beneficial for cross-lingual transfer. However, for structure extraction tasks, translating data requires a label projection step, which translates input text and obtains translated labels in the translated text jointly. Previous research in label projection mostly compromises translation quality by either facilitating easy identification of translated labels from translated text or using word-level alignment between translation pairs to assemble translated phrase-level labels from the aligned words. In this paper, we introduce CLAP, which first translates text to the target language and performs contextual translation on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We compare CLAP with other label projection techniques for creating pseudo-training data in target languages on event argument extraction, a representative structure extraction task. Results show that CLAP improves by 2-2.5 F1-score over other methods on the Chinese and Arabic ACE05 datasets.
    摘要 训练数据的翻译到目标语言有助于cross-lingual transfer。然而, для结构提取任务,翻译数据的翻译需要一个标签投影步骤,该步骤将输入文本和其翻译后的文本一起翻译标签。过去的研究中,大多数标签投影方法会牺牲翻译质量, either by facilitating the identification of translated labels from the translated text or by using word-level alignment between translation pairs to assemble translated phrase-level labels from the aligned words。在这篇论文中,我们介绍了CLAP,它首先将文本翻译到目标语言,然后使用翻译后的文本作为 Context,以确保更高的翻译标签准确性。我们利用了Multilingual可调语言模型作为我们的上下文翻译器,并对翻译后的标签进行上下文翻译,以便在翻译后的文本中找到翻译标签。我们与其他标签投影技术进行比较,在目标语言中的ACE05数据集上进行pseudo-training数据的创建。结果显示,CLAP在中文和阿拉伯语ACE05数据集上提高了2-2.5个F1分的性能。

Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

  • paper_url: http://arxiv.org/abs/2309.08929
  • repo_url: None
  • paper_authors: Kaiyan Zhao, Qiyu Wu, Xin-Qiang Cai, Yoshimasa Tsuruoka
  • for: 学习多语言句子表示是自然语言处理领域的基本和重要任务之一。
  • methods: 本文提出了一种新的方法MPCL,利用多个正例来改进学习多语言句子表示。
  • results: 我们的实验结果表明,相比 conventional CL,MPCL可以提高句子embedding模型的检索、Semantic相似性和分类性能。此外,我们还发现在未经看过的语言上,基于多个正例进行学习的句子embedding模型在跨语言传播性能更好。
    Abstract Learning multi-lingual sentence embeddings is a fundamental and significant task in natural language processing. Recent trends of learning both mono-lingual and multi-lingual sentence embeddings are mainly based on contrastive learning (CL) with an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multi-lingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information to learn. In order to investigate the impact of CL with multiple positives, we propose a novel approach MPCL to effectively utilize multiple positive instances to improve learning multi-lingual sentence embeddings. Our experimental results on various backbone models and downstream tasks support that compared with conventional CL, MPCL leads to better retrieval, semantic similarity, and classification performances. We also observe that on unseen languages, sentence embedding models trained on multiple positives have better cross-lingual transferring performance than models trained on a single positive instance.
    摘要 学习多语言句子嵌入是自然语言处理中的基础和重要任务。现今的多语言句子嵌入学习主要基于对比学习(CL)的 anchor、一个正例和多个负例。在这项工作中,我们认为可以利用多个正例,因为(1)多语言正例可以提高语言之间的学习,和(2)多个正例之间的相互关系可以提供可靠的结构信息来学习。为了研究CL与多个正例的影响,我们提出了一种新的方法MPCL,可以有效地利用多个正例来提高多语言句子嵌入学习。我们的实验结果表明,与传统CL相比,MPCL在不同的基础模型和下游任务中都具有更好的检索、Semantic相似性和分类性能。此外,我们还发现在未看过的语言上,基于多个正例的句子嵌入模型在cross-lingual传输性能上比基于单个正例的模型更好。
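One common way to write a multi-positive contrastive objective, as a hedged PyTorch sketch; the paper's exact MPCL loss may differ in how the positives are combined or weighted.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(anchor, candidates, positive_mask, tau=0.05):
    """Average the standard InfoNCE loss over every positive of the anchor.
    anchor: (d,), candidates: (n, d), positive_mask: (n,) bool."""
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates) / tau  # (n,)
    log_denom = torch.logsumexp(sims, dim=0)
    return -(sims[positive_mask] - log_denom).mean()

anchor = torch.randn(8)
cands = torch.randn(6, 8)
mask = torch.tensor([True, True, False, False, False, False])  # e.g. two translations as positives
print(multi_positive_info_nce(anchor, cands, mask))
```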

Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08922
  • repo_url: None
  • paper_authors: Hossein Rajabzadeh, Suyuchen Wang, Hyock Ju Kwon, Bang Liu
  • for: Answering complex multimodal multi-hop questions
  • methods: 使用大语言模型(LLM)和预定的工具集来分解问题,并通过多模式多步骤的问题分解来提高 LLM 的理解能力
  • results: 在两个最新引入的复杂问答数据集上进行评估,实验结果显示了substantial improvement over existing state-of-the-art solutions, indicating the effectiveness and generality of the proposed strategy
    Abstract We employ a tool-interacting divide-and-conquer strategy enabling large language models (LLMs) to answer complex multimodal multi-hop questions. In particular, we harness the power of large language models to divide a given multimodal multi-hop question into unimodal single-hop sub-questions to be answered by the appropriate tool from a predefined set of tools. After all corresponding tools provide the LLM with their answers, the LLM generates the next relevant unimodal single-hop question. To increase the reasoning ability of LLMs, we prompt chatGPT to generate a tool-interacting divide-and-conquer dataset. This dataset is then used to efficiently finetune the corresponding LLM. To assess the effectiveness of this approach, we conduct an evaluation on two recently introduced complex question-answering datasets. The experimental analysis demonstrate substantial improvements over existing state-of-the-art solutions, indicating the efficacy and generality of our strategy
    摘要 我们采用一种工具交互式的分而治之策略,使大型语言模型(LLM)能够回答复杂的多模态多跳问题。具体来说,我们利用大型语言模型将给定的多模态多跳问题分解为单模态单跳子问题,由预定义工具集中相应的工具来回答。在所有相应工具把答案提供给LLM之后,LLM生成下一个相关的单模态单跳问题。为了提高LLM的推理能力,我们让ChatGPT生成一个工具交互式分而治之数据集,然后用该数据集高效地微调相应的LLM。为评估该方法的效果,我们在两个最近提出的复杂问答数据集上进行评估。实验分析表明,我们的策略相比现有最先进方案有显著提升,体现了其有效性和通用性。
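A skeleton of the tool-interacting divide-and-conquer loop; `llm` and the tool registry below are stand-ins, not the paper's prompts or tool set.

```python
# Placeholder tools keyed by modality; each answers one unimodal single-hop question.
TOOLS = {
    "image_qa": lambda q, ctx: "a red bicycle",
    "text_qa":  lambda q, ctx: "in 1998",
}

def answer_multihop(question, llm, max_hops=4):
    """llm(history) -> {"tool": ..., "sub_q": ...} or {"final": ...}."""
    history = [f"Question: {question}"]
    for _ in range(max_hops):
        step = llm("\n".join(history))
        if "final" in step:
            return step["final"]
        sub_answer = TOOLS[step["tool"]](step["sub_q"], history)
        history.append(f"{step['sub_q']} -> {sub_answer}")   # feed the answer back to the LLM
    return "no answer within hop budget"

# toy scripted "LLM" to exercise the loop
scripted = iter([
    {"tool": "text_qa", "sub_q": "When was the bridge built?"},
    {"final": "in 1998"},
])
print(answer_multihop("When was the bridge in the photo built?", lambda _: next(scripted)))
```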

Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models

  • paper_url: http://arxiv.org/abs/2309.08902
  • repo_url: None
  • paper_authors: Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim
  • for: 本文探讨了LLM是否对各种社会群体存在偏见,以及这些偏见是否会影响诸如招聘、人员绩效评估和刑事量刑等重要决策。
  • methods: 作者使用句子补全任务来测试LLM的偏见,并结合多种社会群体和不同属性来检测偏见。
  • results: 作者发现LLM在不同社会群体与属性之间存在偏见,包括年龄和外貌等维度;这些偏见与实验心理学中发现的人类偏见相似。
    Abstract LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less studied, but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs (specially autoregressive language models) make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the ``what is beautiful is good'' bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. Finally, we report the correlations that we find for multiple cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.
    摘要

Semantic Information Extraction for Text Data with Probability Graph

  • paper_url: http://arxiv.org/abs/2309.08879
  • repo_url: None
  • paper_authors: Zhouxiang Zhao, Zhaohui Yang, Ye Hu, Licheng Lin, Zhaoyang Zhang
  • for: 本文研究了在有限通信资源下传输 semantic information,以提高文本数据的传输效率。
  • methods: 本文使用自然语言处理技术提取原始文本数据,然后将提取的semantic information capture在知识图中。提取semantic information的问题被设为优化框架,目标是提取最重要的semantic information进行传输。
  • results: 提议的算法可以减少semantic uncertainty和semantic similarity,并且与基于文本 Similarity的方法进行比较。
    Abstract In this paper, the problem of semantic information extraction for resource constrained text data transmission is studied. In the considered model, a sequence of text data need to be transmitted within a communication resource-constrained network, which only allows limited data transmission. Thus, at the transmitter, the original text data is extracted with natural language processing techniques. Then, the extracted semantic information is captured in a knowledge graph. An additional probability dimension is introduced in this graph to capture the importance of each information. This semantic information extraction problem is posed as an optimization framework whose goal is to extract most important semantic information for transmission. To find an optimal solution for this problem, a Floyd's algorithm based solution coupled with an efficient sorting mechanism is proposed. Numerical results testify the effectiveness of the proposed algorithm with regards to two novel performance metrics including semantic uncertainty and semantic similarity.
    摘要 在本文中,我们研究了在通信资源受限的文本数据传输场景下的语义信息提取问题。在所考虑的模型中,一个文本数据序列需要在通信资源受限、只允许有限数据传输的网络中传输。因此,在发送端,首先使用自然语言处理技术对原始文本数据进行提取,然后将提取到的语义信息存入一个知识图中,并在图中引入额外的概率维度来刻画每条信息的重要性。该语义信息提取问题被建模为一个优化框架,其目标是提取最重要的语义信息用于传输。为求得最优解,我们提出了一种基于Floyd算法并结合高效排序机制的求解方案。数值结果从语义不确定性和语义相似度这两个新性能指标的角度验证了所提算法的有效性。
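A small sketch of the Floyd's-algorithm core on a probability-weighted semantic graph, using -log(probability) edge weights so that shorter paths correspond to jointly more important chains of information; the actual optimization objective and graph construction in the paper differ, and the nodes and probabilities below are toy values.

```python
import math

INF = float("inf")
prob = {                                # toy importance probabilities between semantic units
    ("fire", "location"): 0.9,
    ("location", "time"): 0.7,
    ("fire", "time"): 0.3,
}
nodes = ["fire", "location", "time"]
idx = {n: i for i, n in enumerate(nodes)}

dist = [[0 if i == j else INF for j in range(len(nodes))] for i in range(len(nodes))]
for (u, v), p in prob.items():
    dist[idx[u]][idx[v]] = -math.log(p)  # high importance -> short edge

# Floyd's algorithm: all-pairs shortest paths in O(n^3)
for k in range(len(nodes)):
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])

print(math.exp(-dist[idx["fire"]][idx["time"]]))  # best joint importance fire -> time: 0.63
```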

X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs

  • paper_url: http://arxiv.org/abs/2309.08873
  • repo_url: None
  • paper_authors: Juan Diego Rodriguez, Katrin Erk, Greg Durrett
  • for: 本研究的目的是解决跨语言文本之间的信息差异问题。
  • methods: 本研究使用了多种方法来解决这个问题,包括经典的机器翻译tokenAlignment、文本推理方法和大语言模型的推断。
  • results: 研究发现这些方法在处理可推理信息方面表现不一,但都落后于人类表现。
    Abstract Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting of large language models. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance.
    摘要 理解两段文本何时传达相同信息是许多自然语言处理(NLP)子问题的共同目标,包括文本蕴涵和事实核查。当这两段文本属于不同语言时,这个问题会变得更加复杂。我们在此提出X-PARADE(跨语言段落级异同与蕴涵分析),这是首个段落级跨语言信息差异数据集。标注者对目标语言中的段落进行跨度级标注,并相对于源语言中的对应段落评估每条信息是相同的、新的、还是新的但可以推理出来的;后一种情形与跨语言NLI建立了联系。对齐的段落来自不同语言版本的维基百科页面,反映了真实场景中观察到的信息差异。借助该数据集,我们考察了多种方法,包括机器翻译中的经典token对齐、对决策进行定位的文本蕴涵方法,以及对大语言模型的提示。我们的结果表明,这些方法在处理可推理信息方面的能力各不相同,但都远未达到人类水平。

Has Sentiment Returned to the Pre-pandemic Level? A Sentiment Analysis Using U.S. College Subreddit Data from 2019 to 2022

  • paper_url: http://arxiv.org/abs/2309.08845
  • repo_url: https://github.com/alvayan/postcovidsentianalysis
  • paper_authors: Tian Yan, Fang Liu
  • for: 这项研究旨在探讨从2019年到2022年人们情绪的变化,特别是在疫情紧急状态过后情绪是否已回到疫情前水平。
  • methods: 研究收集了2019年(疫情前)、2020年(疫情高峰)、2021年和2022年(疫情后期、向后紧急阶段过渡期)来自美国128所大学/学院Subreddit的Reddit数据,并使用预训练的RoBERTa和图注意力网络(GAT)预测情绪,再通过逻辑回归堆叠(logistic stacking)得到最终的情绪分类。
  • results: 与2019年相比,2020年、2021年和2022年出现负面情绪的几率分别高出24%、4.3%和10.3%,且均具有统计显著性(校正后 $p$<0.05);这表明在疫情紧急状态过后,情绪构成有所部分恢复。
    Abstract As impact of COVID-19 pandemic winds down, both individuals and society gradually return to pre-pandemic activities. This study aims to explore how people's emotions have changed from the pre-pandemic during the pandemic to post-emergency period and whether it has returned to pre-pandemic level. We collected Reddit data in 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of pandemic, transitioning period to post-emergency period) from subreddits in 128 universities/colleges in the U.S., and a set of school-level characteristics. We predicted two sets of sentiments from a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and graph attention network (GAT) that leverages both rich semantic and relational information among posted messages and then applied a logistic stacking method to obtain the final sentiment classification. After obtaining sentiment label for each message, we used a generalized linear mixed-effects model to estimate temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Compared to the year 2019, the odds of negative sentiment in years 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, which are all statistically significant(adjusted $p$<0.05). Our study findings suggest a partial recovery in the sentiment composition in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments have evolved from 2019 to 2022.
    摘要 随着COVID-19大流行的影响逐渐减轻,个人和社会正逐渐恢复疫情前的活动。这项研究旨在探讨人们的情绪从疫情前、疫情期间到后紧急阶段如何变化,以及情绪是否已回归疫情前水平。我们收集了2019年、2020年、2021年和2022年的Reddit数据(来自美国128所大学/学院的Subreddit),以及一组学校层面的特征。我们利用预训练的RoBERTa和能同时利用帖文丰富语义与关系信息的图注意力网络(GAT)预测两组情绪,然后应用逻辑回归堆叠(logistic stacking)方法得到最终的情绪分类。在获得每条消息的情绪标签后,我们使用广义线性混合效应模型估计2019年至2022年间情绪的时间趋势,以及学校层面因素对情绪的影响。相比2019年,2020年、2021年和2022年出现负面情绪的几率分别高出24%、4.3%和10.3%,均具有统计显著性(校正后 $p$<0.05)。我们的研究结果表明,在疫情紧急状态过后,情绪构成有部分恢复。该结果与常见预期相符,并对2019年至2022年间情绪的演变提供了详细的量化。
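A minimal sketch of the logistic stacking step that combines the two base sentiment models into a meta-classifier; the scores and labels below are toy values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy held-out scores: columns = [p_negative from RoBERTa, p_negative from GAT]
base_scores = np.array([[0.9, 0.8], [0.2, 0.4], [0.7, 0.3], [0.1, 0.2]])
labels = np.array([1, 0, 1, 0])          # 1 = negative sentiment

stacker = LogisticRegression().fit(base_scores, labels)
print(stacker.predict_proba([[0.6, 0.7]])[:, 1])  # final probability of negative sentiment
```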

EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning

  • paper_url: http://arxiv.org/abs/2309.10687
  • repo_url: None
  • paper_authors: Rajasekhar Reddy Mekala, Yasaman Razeghi, Sameer Singh
  • for: 提高大型语言模型在上下文学习中的表现
  • methods: 引入EchoPrompt策略,让模型重复提问以提高它的答案
  • results: 实验结果表明,EchoPrompt可以在标准和链式提问下提高zero-shot和几少shot上下文学习的表现,并且在不同的数学计算(GSM8K、SVAMP、MultiArith、SingleOp)、阅读理解(DROP、SQuAD)和逻辑理解(Shuffled Objects、Date Understanding、Coin Flipping)任务中均有显著提高。
    Abstract Large language models primarily rely on incontext learning to execute tasks. We introduce EchoPrompt, a simple yet effective approach to prompt the model to rephrase its queries before answering them. EchoPrompt is inspired by self-questioning, a cognitive strategy humans use to vocalize queries before providing answers, thereby reducing misconceptions. Experimental results demonstrate that EchoPrompt leads to substantial improvements in both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting on four families of causal language models. These improvements are observed across various numerical reasoning (GSM8K, SVAMP, MultiArith, SingleOp), reading comprehension (DROP, SQuAD), and logical reasoning (Shuffled Objects, Date Understanding, Coin Flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. We investigate the effectiveness of EchoPrompt through ablation studies, which reveal the significance of both original and rephrased queries for EchoPrompt's efficacy. Our empirical results show that EchoPrompt is an effective technique that can easily augment in-context learning for better performance.
    摘要
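A minimal sketch of what an EchoPrompt-style instruction could look like; the exact wording used in the paper may differ, and `call_model` is a stand-in for any LLM API.

```python
def echo_prompt(question):
    # Ask the model to restate the query in its own words before answering.
    return (
        f"Question: {question}\n"
        "First, rephrase the question in your own words.\n"
        "Then, let's think step by step and give the final answer."
    )

def call_model(prompt):          # placeholder LLM call
    return "(model completion)"

print(call_model(echo_prompt("If 3 pens cost $6, how much do 7 pens cost?")))
```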

cs.LG - 2023-09-16

Reducing sequential change detection to sequential estimation

  • paper_url: http://arxiv.org/abs/2309.09111
  • repo_url: None
  • paper_authors: Shubhanshu Shekhar, Aaditya Ramdas
  • for: 这个论文是为了解决数据流中参数或功能θ的变化检测问题,目标是设计一种具有小检测延迟,但保证在无变化情况下降低假阳性发生频率的变化检测方案。
  • methods: 这篇论文使用了信息理论Sequential estimation和 confidence sequences的方法,开始每个时间步骤上一个(1-α)信心序列,并在所有活跃信心序列交叉为空时宣布变化。
  • results: 论文证明了变化检测方案的平均运行长度至少为1/α,从而实现了具有小结构假设(即可能存在依赖性观察和非 Parametric 分布类),但强制 garanties。
    Abstract We consider the problem of sequential change detection, where the goal is to design a scheme for detecting any changes in a parameter or functional $\theta$ of the data stream distribution that has small detection delay, but guarantees control on the frequency of false alarms in the absence of changes. In this paper, we describe a simple reduction from sequential change detection to sequential estimation using confidence sequences: we begin a new $(1-\alpha)$-confidence sequence at each time step, and proclaim a change when the intersection of all active confidence sequences becomes empty. We prove that the average run length is at least $1/\alpha$, resulting in a change detection scheme with minimal structural assumptions~(thus allowing for possibly dependent observations, and nonparametric distribution classes), but strong guarantees. Our approach bears an interesting parallel with the reduction from change detection to sequential testing of Lorden (1971) and the e-detector of Shin et al. (2022).
    摘要 我们考虑序贯变化检测问题,其目标是设计一种方案,用以检测数据流分布的参数或泛函 $\theta$ 的任何变化,使其具有较小的检测延迟,同时保证在没有变化时控制误报频率。在本文中,我们描述了一种从序贯变化检测到基于置信序列的序贯估计的简单归约:我们在每个时间步开始一个新的 $(1-\alpha)$ 置信序列,当所有活跃置信序列的交集为空时即宣布发生变化。我们证明平均运行长度至少为 $1/\alpha$,从而得到一个结构假设极少(因此允许可能相依的观测和非参数分布族)但保证很强的变化检测方案。我们的方法与 Lorden (1971) 将变化检测归约为序贯检验的思路以及 Shin et al. (2022) 的 e-detector 有着有趣的相似之处。
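A simplified sketch of the reduction for a [0,1]-bounded mean, using a crude union-bound (Hoeffding-style) confidence sequence; the paper's construction is far more general, and this particular boundary and the toy stream are only illustrative.

```python
import math

def cs_radius(n, alpha):
    """Width of a crude anytime-valid interval for a [0,1]-bounded mean:
    Hoeffding at each n with a union bound over n (alpha_n = alpha / (n(n+1)))."""
    return math.sqrt(math.log(2 * n * (n + 1) / alpha) / (2 * n))

def detect_change(stream, alpha=0.05):
    """Start a new (1-alpha) confidence sequence at every step; declare a change
    when the running intersection of all active sequences becomes empty."""
    sums, counts = [], []                 # one entry per active confidence sequence
    lo, hi = 0.0, 1.0                     # running intersection over time and sequences
    for t, x in enumerate(stream, start=1):
        sums.append(0.0); counts.append(0)
        for i in range(len(sums)):        # update every active sequence (O(t) per step)
            sums[i] += x; counts[i] += 1
            r = cs_radius(counts[i], alpha)
            m = sums[i] / counts[i]
            lo, hi = max(lo, m - r), min(hi, m + r)
        if lo > hi:
            return t                      # intersection empty: change declared here
    return None

stream = [0.2] * 200 + [0.8] * 200        # mean shifts from 0.2 to 0.8 at t = 201
print(detect_change(stream))
```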

Test-Time Compensated Representation Learning for Extreme Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2309.09074
  • repo_url: None
  • paper_authors: Zhiwei Zhang, Weizhong Zhang, Yaowei Huang, Kani Chen
  • for: 预测交通流量是一项复杂的任务,因为交通系列具有复杂的空间时间相关性。这篇论文探讨了一个尚未得到充分关注的问题:极端事件。
  • methods: 我们提出了一种在测试阶段补偿表示学习框架,包括空间时间分解数据银行和多头空间变换模型(CompFormer)。前者分解所有训练数据的时间维度按照 periodic 特征,而后者通过空间注意力矩阵与最近观察和历史序列在数据银行之间建立连接,以便将稳定特征传递到抗EXTREME事件。
  • results: 我们的方法可以在极端事件下 достичь显著改进,比如METR-LA和PEMS-BAY 测试环境中的6个强基elines。总体而言,我们的方法可以提高交通预测的准确率,最高达28.2%。
    Abstract Traffic forecasting is a challenging task due to the complex spatio-temporal correlations among traffic series. In this paper, we identify an underexplored problem in multivariate traffic series prediction: extreme events. Road congestion and rush hours can result in low correlation in vehicle speeds at various intersections during adjacent time periods. Existing methods generally predict future series based on recent observations and entirely discard training data during the testing phase, rendering them unreliable for forecasting highly nonlinear multivariate time series. To tackle this issue, we propose a test-time compensated representation learning framework comprising a spatio-temporal decomposed data bank and a multi-head spatial transformer model (CompFormer). The former component explicitly separates all training data along the temporal dimension according to periodicity characteristics, while the latter component establishes a connection between recent observations and historical series in the data bank through a spatial attention matrix. This enables the CompFormer to transfer robust features to overcome anomalous events while using fewer computational resources. Our modules can be flexibly integrated with existing forecasting methods through end-to-end training, and we demonstrate their effectiveness on the METR-LA and PEMS-BAY benchmarks. Extensive experimental results show that our method is particularly important in extreme events, and can achieve significant improvements over six strong baselines, with an overall improvement of up to 28.2%.
    摘要 很多时候,交通预测是一项非常困难的任务,这是因为交通系列之间存在复杂的空间-时间相关性。在这篇论文中,我们提出了一个未得到足够关注的问题:极端事件。路口拥堵和高峰时段可能导致不同的交通枢纽的车速在相邻时期之间存在低相关性。现有的方法通常是根据最近观察值预测未来系列,并完全抛弃测试阶段的训练数据,这使得它们在预测非线性多变量时间系列方面不可靠。为解决这个问题,我们提议一个在测试阶段补偿表示学习框架,包括空间-时间分解的数据银行和多头空间变换模型(CompFormer)。前者组件可以显式分解所有的训练数据以 Temp 度量的时间维度,而后者组件可以通过空间注意力矩阵与最近观察值和历史系列在数据银行中建立连接,使CompFormer可以传递 Robust 特征以抵御特殊事件,并使用更少的计算资源。我们的模块可以与现有的预测方法进行灵活的集成,我们在 METR-LA 和 PEMS-BAY 测试准则上进行了综合实验,结果表明我们的方法在极端事件中尤为重要,可以在六个强大基准集上实现显著提高,最大提高达28.2%。

Enhancing personalised thermal comfort models with Active Learning for improved HVAC controls

  • paper_url: http://arxiv.org/abs/2309.09073
  • repo_url: None
  • paper_authors: Zeynep Duygu Tekler, Yue Lei, Xilei Dai, Adrian Chong
  • for: 本研究旨在开发个性化热舒适模型,为建筑中以用户为中心的控制(OCC)系统提供依据。
  • methods: 本研究使用主动学习(AL)技术,以应对此类OCC系统在实际部署中面临的数据挑战。
  • results: 结果表明,与传统OCC相比,基于AL的OCC可将总体标注工作量减少31.0%,同时带来1.3%的额外节能,且热舒适满意度保持在98%以上;这表明此类系统有望在未来的实际部署中实现个性化舒适与节能的建筑运行。
    Abstract Developing personalised thermal comfort models to inform occupant-centric controls (OCC) in buildings requires collecting large amounts of real-time occupant preference data. This process can be highly intrusive and labour-intensive for large-scale implementations, limiting the practicality of real-world OCC implementations. To address this issue, this study proposes a thermal preference-based HVAC control framework enhanced with Active Learning (AL) to address the data challenges related to real-world implementations of such OCC systems. The proposed AL approach proactively identifies the most informative thermal conditions for human annotation and iteratively updates a supervised thermal comfort model. The resulting model is subsequently used to predict the occupants' thermal preferences under different thermal conditions, which are integrated into the building's HVAC controls. The feasibility of our proposed AL-enabled OCC was demonstrated in an EnergyPlus simulation of a real-world testbed supplemented with the thermal preference data of 58 study occupants. The preliminary results indicated a significant reduction in overall labelling effort (i.e., 31.0%) between our AL-enabled OCC and conventional OCC while still achieving a slight increase in energy savings (i.e., 1.3%) and thermal satisfaction levels above 98%. This result demonstrates the potential for deploying such systems in future real-world implementations, enabling personalised comfort and energy-efficient building operations.
    摘要 开发个性化热舒适模型以支持建筑中以用户为中心的控制(OCC),需要收集大量实时的用户偏好数据。该过程对大规模部署而言可能非常干扰且耗费人力,限制了OCC系统在现实中的可行性。为解决这一问题,本研究提出了一种结合主动学习(AL)的基于热偏好的HVAC控制框架,以应对此类OCC系统实际部署中的数据挑战。所提出的AL方法主动识别最具信息量的热环境条件供人工标注,并迭代更新一个有监督的热舒适模型;由此得到的模型用于预测用户在不同热环境条件下的热偏好,并将其集成到建筑的HVAC控制中。我们在一个基于EnergyPlus的真实试验平台仿真中验证了所提AL-OCC的可行性,仿真使用了58名受试者的热偏好数据。初步结果表明,与传统OCC相比,AL-OCC将总体标注工作量减少了31.0%,同时仍实现了1.3%的额外节能,且热舒适满意度保持在98%以上。这一结果表明此类系统有望在未来的实际部署中实现个性化舒适与高能效的建筑运行。

Recovering Missing Node Features with Local Structure-based Embeddings

  • paper_url: http://arxiv.org/abs/2309.09068
  • repo_url: None
  • paper_authors: Victor M. Tenorio, Madeline Navarro, Santiago Segarra, Antonio G. Marques
  • for: recover completely missing node features for a set of graphs
  • methods: incorporate prior information from both graph topology and existing nodal values, use a Graph AutoEncoder to train a node embedding space
  • results: accurate feature estimation approach, valuable for downstream graph classification
    Abstract Node features bolster graph-based learning when exploited jointly with network structure. However, a lack of nodal attributes is prevalent in graph data. We present a framework to recover completely missing node features for a set of graphs, where we only know the signals of a subset of graphs. Our approach incorporates prior information from both graph topology and existing nodal values. We demonstrate an example implementation of our framework where we assume that node features depend on local graph structure. Missing nodal values are estimated by aggregating known features from the most similar nodes. Similarity is measured through a node embedding space that preserves local topological features, which we train using a Graph AutoEncoder. We empirically show not only the accuracy of our feature estimation approach but also its value for downstream graph classification. Our success embarks on and implies the need to emphasize the relationship between node features and graph structure in graph-based learning.
    摘要 当节点特征与网络结构结合使用时,可以增强基于图的学习。然而,图数据中缺失节点特征的情况非常普遍。我们提出了一个框架,用于在只知道部分图的信号的情况下,为一组图恢复完全缺失的节点特征。我们的方法同时利用了图拓扑结构和已有节点取值中的先验信息。在一个示例实现中,我们假设节点特征取决于局部图结构:缺失的节点取值通过聚合最相似节点的已知特征来估计,而相似性在一个保留局部拓扑特征的节点嵌入空间中度量,该嵌入空间由图自编码器训练得到。实验不仅验证了我们特征估计方法的准确性,也证明了其对下游图分类任务的价值。我们的结果表明,在基于图的学习中应重视节点特征与图结构之间的关系。
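A small numpy sketch of the imputation step: average the features of the most similar known nodes, with similarity measured in a structure-based embedding space (here plain cosine similarity over precomputed embeddings); the embeddings and features below are random stand-ins.

```python
import numpy as np

def impute_features(emb, feats, known_mask, k=2):
    """Estimate missing node features by averaging the features of the k most
    similar known nodes in the embedding space."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                         # cosine similarities
    out = feats.copy()
    for i in np.where(~known_mask)[0]:
        neighbors = np.argsort(-sim[i][known_mask])[:k]       # k nearest known nodes
        out[i] = feats[known_mask][neighbors].mean(axis=0)
    return out

emb = np.random.rand(5, 8)              # e.g. from a Graph AutoEncoder
feats = np.random.rand(5, 3)
known = np.array([True, True, False, True, False])
print(impute_features(emb, feats, known))
```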

  • paper_url: http://arxiv.org/abs/2309.09045
  • repo_url: None
  • paper_authors: Manuel Dileo, Pasquale Minervini, Matteo Zignani, Sabrina Gaito
  • for: 本研究旨在提高知识图中链接预测的精度,并能够处理时间约束。
  • methods: 本研究基于四阶张量的标准分解,并结合时间平滑正则化来预测知识图中的链接。
  • results: 研究发现,通过选择合适的时间平滑正则项和正则化权重,简单的 TNTComplEx 方法即可在三个时序链接预测数据集上取得新的最优结果;我们还评估了多种时间平滑正则项对两个最先进模型的影响。
    Abstract Most algorithms for representation learning and link prediction on relational data are designed for static data. However, the data to which they are applied typically evolves over time, including online social networks or interactions between users and items in recommender systems. This is also the case for graph-structured knowledge bases -- knowledge graphs -- which contain facts that are valid only for specific points in time. In such contexts, it becomes crucial to correctly identify missing links at a precise time point, i.e. the temporal prediction link task. Recently, Lacroix et al. and Sadeghian et al. proposed a solution to the problem of link prediction for knowledge graphs under temporal constraints inspired by the canonical decomposition of 4-order tensors, where they regularise the representations of time steps by enforcing temporal smoothing, i.e. by learning similar transformation for adjacent timestamps. However, the impact of the choice of temporal regularisation terms is still poorly understood. In this work, we systematically analyse several choices of temporal smoothing regularisers using linear functions and recurrent architectures. In our experiments, we show that by carefully selecting the temporal smoothing regulariser and regularisation weight, a simple method like TNTComplEx can produce significantly more accurate results than state-of-the-art methods on three widely used temporal link prediction datasets. Furthermore, we evaluate the impact of a wide range of temporal smoothing regularisers on two state-of-the-art temporal link prediction models. Our work shows that simple tensor factorisation models can produce new state-of-the-art results using newly proposed temporal regularisers, highlighting a promising avenue for future research.
    摘要 大多数面向关系数据的表示学习和链接预测算法都是为静态数据设计的。然而,它们所应用的数据通常会随时间演化,例如在线社交网络或推荐系统中用户与物品之间的交互。图结构知识库(知识图)也是如此,其中的事实只在特定时间点有效。在这类场景中,关键在于在精确的时间点上正确识别缺失的链接,即时序链接预测任务。最近,Lacroix et al. 和 Sadeghian et al. 受四阶张量标准分解的启发,提出了在时间约束下进行知识图链接预测的方案,他们通过施加时间平滑来正则化各时间步的表示,即让相邻时间戳学习相似的变换。然而,时间正则项选择的影响仍缺乏深入理解。在本工作中,我们使用线性函数和循环结构,系统地分析了多种时间平滑正则项的选择。实验表明,通过仔细选择时间平滑正则项及正则化权重,像 TNTComplEx 这样简单的方法就能在三个广泛使用的时序链接预测数据集上取得明显优于现有最先进方法的结果。此外,我们还评估了多种时间平滑正则项对两个最先进时序链接预测模型的影响。我们的工作表明,简单的张量分解模型配合新提出的时间正则项即可取得新的最优结果,为未来研究指明了一个有希望的方向。
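A minimal PyTorch sketch of one such temporal smoothing regulariser on timestamp embeddings; the norm power and weight are illustrative, and the study compares several variants.

```python
import torch

def temporal_smoothness(time_emb, p=3, weight=0.01):
    """Penalise differences between embeddings of adjacent time steps, pushing
    neighbouring timestamps toward similar representations."""
    diff = time_emb[1:] - time_emb[:-1]            # (T-1, d) adjacent-step differences
    return weight * diff.abs().pow(p).sum()

T, d = 12, 32
time_emb = torch.nn.Parameter(torch.randn(T, d))
loss = temporal_smoothness(time_emb)               # added to the tensor factorisation loss
loss.backward()
```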

Study of Enhanced MISC-Based Sparse Arrays with High uDOFs and Low Mutual Coupling

  • paper_url: http://arxiv.org/abs/2309.09044
  • repo_url: None
  • paper_authors: X. Sheng, D. Lu, Y. Li, R. C. de Lamare
  • for: 提高稀疏阵列(SA)的均匀自由度并降低互耦。
  • methods: 提出基于最大阵元间距(IES)约束(MISC)准则的增强型MISC稀疏阵列(EMISC SA):先由最大阵元间距和阵元数目确定IES集合,再由该集合导出七个均匀线性子阵列(ULSA)构成EMISC SA。
  • results: 对均匀自由度和权函数的分析以及仿真结果表明,所提出的EMISC SA在均匀自由度和互耦方面优于IMISC SA,并显著优于其他现有稀疏阵列。
    Abstract In this letter, inspired by the maximum inter-element spacing (IES) constraint (MISC) criterion, an enhanced MISC-based (EMISC) sparse array (SA) with high uniform degrees-of-freedom (uDOFs) and low mutual-coupling (MC) is proposed, analyzed and discussed in detail. For the EMISC SA, an IES set is first determined by the maximum IES and number of elements. Then, the EMISC SA is composed of seven uniform linear sub-arrays (ULSAs) derived from an IES set. An analysis of the uDOFs and weight function shows that, the proposed EMISC SA outperforms the IMISC SA in terms of uDOF and MC. Simulation results show a significant advantage of the EMISC SA over other existing SAs.
    摘要 在这封快报中,受最大阵元间距(IES)约束(MISC)准则的启发,我们提出、分析并详细讨论了一种具有高均匀自由度(uDOF)和低互耦(MC)的增强型MISC稀疏阵列(EMISC SA)。对于EMISC SA,首先由最大阵元间距和阵元数目确定一个IES集合,然后由该集合导出七个均匀线性子阵列(ULSA)构成EMISC SA。对均匀自由度和权函数的分析表明,所提出的EMISC SA在均匀自由度和互耦方面优于IMISC SA。仿真结果显示,EMISC SA相对其他现有稀疏阵列具有显著优势。

Forward Invariance in Neural Network Controlled Systems

  • paper_url: http://arxiv.org/abs/2309.09043
  • repo_url: https://github.com/gtfactslab/harapanahalli_lcss2024
  • paper_authors: Akash Harapanahalli, Saber Jafarpour, Samuel Coogan
  • for: 这个paper是为了证明和搜索非线性系统中神经网络控制器的前向不变集的 certificates。
  • methods: 该 frameworks使用了间隔分析和卷积系统理论,constructs localized first-order inclusion functions for the closed-loop system using Jacobian bounds and existing neural network verification tools,并builds a dynamical embedding system to directly correspond with a nested family of hyper-rectangles provably converging to an attractive set of the original system。
  • results: 该 Framework是自动化的,使用了interval analysis toolbox $\texttt{npinterval}$, along with the symbolic arithmetic toolbox $\texttt{sympy}$,demonstrated on an $8$-dimensional leader-follower system。
    Abstract We present a framework based on interval analysis and monotone systems theory to certify and search for forward invariant sets in nonlinear systems with neural network controllers. The framework (i) constructs localized first-order inclusion functions for the closed-loop system using Jacobian bounds and existing neural network verification tools; (ii) builds a dynamical embedding system where its evaluation along a single trajectory directly corresponds with a nested family of hyper-rectangles provably converging to an attractive set of the original system; (iii) utilizes linear transformations to build families of nested paralleletopes with the same properties. The framework is automated in Python using our interval analysis toolbox $\texttt{npinterval}$, in conjunction with the symbolic arithmetic toolbox $\texttt{sympy}$, demonstrated on an $8$-dimensional leader-follower system.
    摘要 我们提出了基于间隔分析和升降系统理论的框架,用于证明和搜索非线性系统中神经网络控制器的前向不变集。这个框架包括以下三个步骤:(i) 使用Jacobian bound和现有的神经网络验证工具来构建封闭循环系统的本地第一阶 inclusion函数;(ii) 使用动力系统来构建一个嵌入系统,其评估路径的评估直接对应于一个嵌入在原系统中的嵌入集;(iii) 使用线性变换来构建一个家族的嵌入多面体,这些多面体具有相同的性质。我们使用Python中的$\texttt{npinterval}$工具包,以及$\texttt{sympy}$工具包,在一个8维领导者-跟随者系统上进行了实验。

Solving Quadratic Systems with Full-Rank Matrices Using Sparse or Generative Priors

  • paper_url: http://arxiv.org/abs/2309.09032
  • repo_url: None
  • paper_authors: Junren Chen, Shuai Huang, Michael K. Ng, Zhaoqiang Liu
  • For: This paper addresses the problem of recovering a high-dimensional signal from a quadratic system with full-rank matrices, with a focus on incorporating prior knowledge of the signal to improve recovery performance.* Methods: The paper proposes two algorithms for signal recovery: the thresholded Wirtinger flow (TWF) algorithm and the projected gradient descent (PGD) algorithm. The TWF algorithm consists of a spectral initialization and a thresholded gradient descent step, while the PGD algorithm uses a projected power method and a projected gradient descent step.* Results: The paper reports experimental results that demonstrate the effectiveness of the proposed algorithms for signal recovery. Specifically, the results show that the proposed approach outperforms existing provable algorithms for the sparse case, and leveraging the generative prior allows for precise image recovery in the MNIST dataset from a small number of quadratic measurements.
    Abstract The problem of recovering a signal $\boldsymbol{x} \in \mathbb{R}^n$ from a quadratic system $\{y_i=\boldsymbol{x}^\top\boldsymbol{A}_i\boldsymbol{x},\ i=1,\ldots,m\}$ with full-rank matrices $\boldsymbol{A}_i$ frequently arises in applications such as unassigned distance geometry and sub-wavelength imaging. With i.i.d. standard Gaussian matrices $\boldsymbol{A}_i$, this paper addresses the high-dimensional case where $m\ll n$ by incorporating prior knowledge of $\boldsymbol{x}$. First, we consider a $k$-sparse $\boldsymbol{x}$ and introduce the thresholded Wirtinger flow (TWF) algorithm that does not require the sparsity level $k$. TWF comprises two steps: the spectral initialization that identifies a point sufficiently close to $\boldsymbol{x}$ (up to a sign flip) when $m=O(k^2\log n)$, and the thresholded gradient descent (with a good initialization) that produces a sequence linearly converging to $\boldsymbol{x}$ with $m=O(k\log n)$ measurements. Second, we explore the generative prior, assuming that $\boldsymbol{x}$ lies in the range of an $L$-Lipschitz continuous generative model with $k$-dimensional inputs in an $\ell_2$-ball of radius $r$. We develop the projected gradient descent (PGD) algorithm that also comprises two steps: the projected power method that provides an initial vector with $O\big(\sqrt{\frac{k \log L}{m}}\big)$ $\ell_2$-error given $m=O(k\log(Lnr))$ measurements, and the projected gradient descent that refines the $\ell_2$-error to $O(\delta)$ at a geometric rate when $m=O(k\log\frac{Lrn}{\delta^2})$. Experimental results corroborate our theoretical findings and show that: (i) our approach for the sparse case notably outperforms the existing provable algorithm sparse power factorization; (ii) leveraging the generative prior allows for precise image recovery in the MNIST dataset from a small number of quadratic measurements.
    摘要 文本中的问题是从quadratic system $\{\mathbf{y}_i = \mathbf{x}^{\top}\mathbf{A}_i\mathbf{x}, \ i=1,\ldots,m\}$中recover $\mathbf{x} \in \mathbb{R}^n$的信号,其中 $\mathbf{A}_i$ 是高级别矩阵。这种问题在应用中如不分配距离几何和sub-波长成像频繁出现。本文使用独立标准 Gaussian 矩阵 $\mathbf{A}_i$ ,解决高维度情况($m \ll n$),并借助先验知识来$\mathbf{x}$。首先,我们考虑 $k$-sparse $\mathbf{x}$ 情况,并引入 thresholded Wirtinger flow(TWF)算法,不需要稀疏程度 $k$。 TWF 包括两个步骤:spectral initialization,可以在 $m = O(k^2\log n)$ 个探测中找到一个 sufficiently close to $\mathbf{x}$ 的点(Up to a sign flip),以及 thresholded gradient descent(with good initialization),可以在 $m = O(k\log n)$ 个探测中产生一个线性收敛到 $\mathbf{x}$ 的序列。其次,我们探索生成先验,假设 $\mathbf{x}$ 在一个 $L$-Lipschitz 连续的生成模型中,其中输入是 $k$-维的 $\ell_2$ 球中的径 $r$。我们开发了 projected gradient descent(PGD)算法,该算法包括两个步骤:projected power method,可以在 $m = O\big(\sqrt{\frac{k \log L}{m}\big)$ 个探测中提供一个初始向量,其 $\ell_2$ 误差为 $O\big(\sqrt{\frac{k \log L}{m}\big)$;以及 projected gradient descent,可以在 $m = O(k\log\frac{Lrn}{\delta^2})$ 个探测中精确地将 $\ell_2$ 误差降至 $O(\delta)$ 的水平。实验结果证明了我们的理论发现,以及使用生成先验可以在 MNIST 数据集中从一小数量的quadratic measurements中准确地还原图像。 Specifically, our approach for the sparse case significantly outperforms the existing provable algorithm sparse power factorization; leveraging the generative prior allows for precise image recovery in the MNIST dataset from a small number of quadratic measurements.
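
To make the thresholded-gradient idea concrete, here is a small numpy sketch of gradient descent on the quadratic-measurement least-squares loss with soft-thresholding after each step. The step size, threshold, and the perturbation-of-the-truth initialization (standing in for the paper's spectral initialization) are illustrative assumptions; this is a stand-in for, not a reproduction of, the paper's TWF algorithm.

```python
import numpy as np

def quad_measure(A_list, x):
    """Evaluate y_i = x^T A_i x for every sensing matrix."""
    return np.array([x @ A @ x for A in A_list])

def thresholded_gradient_descent(A_list, y, x0, step=0.01, tau=0.005, iters=300):
    """Gradient descent on the least-squares loss for the quadratic system,
    with soft-thresholding after each step to promote sparsity (a simple
    stand-in for the paper's thresholding rule)."""
    x = x0.copy()
    m = len(A_list)
    for _ in range(iters):
        r = quad_measure(A_list, x) - y
        grad = sum(ri * (A + A.T) @ x for ri, A in zip(r, A_list)) / m
        x = x - step * grad
        x = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)   # soft threshold
    return x

# toy instance: n = 20, k = 3 sparse signal, m = 80 Gaussian matrices
rng = np.random.default_rng(0)
n, m, k = 20, 80, 3
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
A_list = [rng.normal(size=(n, n)) for _ in range(m)]
y = quad_measure(A_list, x_true)

# the spectral initialization is replaced here by a small perturbation of the
# true signal, purely for illustration
x0 = x_true + 0.05 * rng.normal(size=n)
x_hat = thresholded_gradient_descent(A_list, y, x0)
err = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
print(f"recovery error up to global sign: {err:.3f}")
```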

gym-saturation: Gymnasium environments for saturation provers (System description)

  • paper_url: http://arxiv.org/abs/2309.09022
  • repo_url: None
  • paper_authors: Boris Shminke
  • for: 这篇论文描述了一个新版本的 Python 包 - gym-saturation,该包包含基于给定Clause算法的 OpenAI Gym 环境,用于导引滥化风格的证明器。
  • methods: 论文使用了两种不同的证明器:Vampire 和 iProver,并提供了解除证明状态表示与强化学习之间的关系的示例。此外,论文还提供了使用已知 ast2vec Python 代码嵌入模型作为一阶逻辑表示的示例。
  • results: 论文示例了如何使用 Ray RLlib 实现 Thompson 抽样和 proximal policy 优化两种强化学习算法,以便轻松实验新版本的 package。
    Abstract This work describes a new version of a previously published Python package - gym-saturation: a collection of OpenAI Gym environments for guiding saturation-style provers based on the given clause algorithm with reinforcement learning. We contribute usage examples with two different provers: Vampire and iProver. We also have decoupled the proof state representation from reinforcement learning per se and provided examples of using a known ast2vec Python code embedding model as a first-order logic representation. In addition, we demonstrate how environment wrappers can transform a prover into a problem similar to a multi-armed bandit. We applied two reinforcement learning algorithms (Thompson sampling and Proximal policy optimisation) implemented in Ray RLlib to show the ease of experimentation with the new release of our package.
    摘要 这个工作描述了一个新版本的Python包 - gym-saturation:一个基于给定的句法算法的OpenAI Gym环境,用于指导吸血者和iProver等逻辑推理器。我们提供了使用示例,包括使用known ast2vec Python代码嵌入模型作为逻辑表示。此外,我们还示出了如何使用环境包装器将推理器转化成多重机关问题。我们在Ray RLlib中实现了两种回归学习算法(托мп逊抽样和距离策略优化),以示新版本包的使用方便。

Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection

  • paper_url: http://arxiv.org/abs/2309.08971
  • repo_url: None
  • paper_authors: Ilyass Moummad, Romain Serizel, Nicolas Farrugia
  • for: bioacoustic sound event detection, 了解动物行为和监测生物多样性使用音频
  • methods: 深度学习系统和超级vised contrastive pre-training
  • results: 61.52%(0.48)和68.19%(0.75)的F-score,无需特定的 annotated data 训练
    Abstract Bioacoustic sound event detection allows for better understanding of animal behavior and for better monitoring biodiversity using audio. Deep learning systems can help achieve this goal; however, it is difficult to acquire sufficient annotated data to train these systems from scratch. To address this limitation, the Detection and Classification of Acoustic Scenes and Events (DCASE) community has recast the problem within the framework of few-shot learning and organizes an annual challenge for learning to detect animal sounds from only five annotated examples. In this work, we regularize supervised contrastive pre-training to learn features that can transfer well to new target tasks with animal sounds unseen during training, achieving a high F-score of 61.52%(0.48) when no feature adaptation is applied, and an F-score of 68.19%(0.75) when we further adapt the learned features for each new target task. This work aims to lower the entry bar to few-shot bioacoustic sound event detection by proposing a simple and yet effective framework for this task, and by also providing open-source code.
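
The pre-training objective above is a supervised contrastive loss; a minimal numpy version is sketched below, assuming L2-normalised embeddings and integer class labels. The temperature and toy batch are placeholders, and the additional regularisation the paper applies on top of this loss is not reproduced.

```python
import numpy as np
from scipy.special import logsumexp

def supervised_contrastive_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over a batch of embeddings z (batch, dim)
    with integer class labels: samples sharing a label act as positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalise
    sim = z @ z.T / tau                                    # scaled cosine similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)                # never contrast with self
    log_prob = sim - logsumexp(sim, axis=1, keepdims=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.sum(axis=1) > 0                          # anchors with >= 1 positive
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos] / pos.sum(axis=1)[has_pos]
    return float(per_anchor.mean())

# toy batch: 8 clips, 16-d embeddings, 3 sound classes
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
print(f"SupCon loss: {supervised_contrastive_loss(z, labels):.3f}")
```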

UNIDEAL: Curriculum Knowledge Distillation Federated Learning

  • paper_url: http://arxiv.org/abs/2309.08961
  • repo_url: None
  • paper_authors: Yuwen Yang, Chang Liu, Xun Cai, Suizhi Huang, Hongtao Lu, Yue Ding
  • for: 该论文旨在解决跨Domain federated learning中的异常性问题,提出了一种名为UNIDEAL的新的 federated learning算法。
  • methods: 该算法使用了调整able Teacher-Student Mutual Evaluation Curriculum Learning,以进一步提高 federated learning 中的知识储存效果。
  • results: 对多个数据集进行了广泛的实验,与state-of-the-art基eline进行比较,结果显示,UNIDEAL可以在模型精度和通信效率两个方面达到更高的性能。此外,还提供了算法的收敛分析,显示其在非对称条件下的收敛率为O(1/T)。
    Abstract Federated Learning (FL) has emerged as a promising approach to enable collaborative learning among multiple clients while preserving data privacy. However, cross-domain FL tasks, where clients possess data from different domains or distributions, remain a challenging problem due to the inherent heterogeneity. In this paper, we present UNIDEAL, a novel FL algorithm specifically designed to tackle the challenges of cross-domain scenarios and heterogeneous model architectures. The proposed method introduces Adjustable Teacher-Student Mutual Evaluation Curriculum Learning, which significantly enhances the effectiveness of knowledge distillation in FL settings. We conduct extensive experiments on various datasets, comparing UNIDEAL with state-of-the-art baselines. Our results demonstrate that UNIDEAL achieves superior performance in terms of both model accuracy and communication efficiency. Additionally, we provide a convergence analysis of the algorithm, showing a convergence rate of O(1/T) under non-convex conditions.
    摘要 《联合学习(FL)》已经出现为解决多个客户合作学习的有效方法,同时保护数据隐私。然而,跨Domain FL任务,客户拥有不同Domain或分布的数据,仍然是一个困难的问题,归因于内在的不一致性。本文提出了UNIDEAL算法,专门为跨Domainenario和多种模型架构设计。提议的方法 introduce Adjustable Teacher-Student Mutual Evaluation Curriculum Learning,对FL设置进行显著提升知识填充效果。我们在多个数据集上进行了广泛的实验,与现有的基线进行比较。我们的结果表明,UNIDEAL在模型准确率和通信效率两个方面均达到了Superior性能。此外,我们还提供了算法的收敛分析,显示其在非对称条件下的收敛率为O(1/T)。
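
The knowledge-distillation term that curriculum schemes like this build on is a temperature-scaled KL divergence between teacher and student predictions; a minimal numpy sketch follows. The temperature and toy logits are illustrative, and UNIDEAL's teacher-student mutual-evaluation curriculum itself is not modelled here.

```python
import numpy as np

def softmax(logits, axis=-1):
    logits = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """Temperature-scaled KL divergence between teacher and student predictions,
    the usual knowledge-distillation term."""
    t = temperature
    p_teacher = softmax(teacher_logits / t)
    log_p_student = np.log(softmax(student_logits / t) + 1e-12)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - log_p_student), axis=-1)
    return float(t * t * kl.mean())

# toy batch of 4 examples, 5 classes
rng = np.random.default_rng(0)
s_logits = rng.normal(size=(4, 5))
t_logits = rng.normal(size=(4, 5))
print(f"KD loss: {distillation_loss(s_logits, t_logits):.3f}")
```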

Reducing Memory Requirements for the IPU using Butterfly Factorizations

  • paper_url: http://arxiv.org/abs/2309.08946
  • repo_url: None
  • paper_authors: S. -Kazem Shekofteh, Christian Alles, Holger Fröning
  • for: 本研究旨在探讨 Intellectual Processing Unit (IPU) 如何实现精简模型,提高高性能计算的可扩展性。
  • methods: 本研究使用了翅膀结构来取代完全连接和 convolutional 层,实现模型压缩。
  • results: 实验结果表明,使用翅膀结构可以提供 98.5% 压缩率,减少巨大的内存需求。IPU 实现可以获得 1.3x 和 1.6x 性能提升,并在实际数据集 CIFAR10 上达到 1.62x 训练时间速度提升。
    Abstract High Performance Computing (HPC) has benefited from many improvements during the last decades, especially in terms of hardware platforms that provide more processing power while maintaining power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speed up parallel computations with a huge number of processing cores and on-chip memory components connected with high-speed fabrics. IPUs mainly target machine learning applications; however, due to the architectural differences between GPUs and IPUs, especially the significantly smaller memory capacity on an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. In this paper, we examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU. Experimental results indicate that these methods can provide a 98.5% compression ratio, greatly decreasing the immense need for memory, while the IPU implementation benefits from 1.3x and 1.6x performance improvements for butterfly and pixelated butterfly, respectively. We also reach a 1.62x training time speedup on a real-world dataset such as CIFAR10.
    摘要 高性能计算(HPC)在过去几十年中得到了不同的改进,尤其是硬件平台,以提供更多的处理力而不超过合理的能耗水平。知识处理单元(IPU)是一种新型的极大并行处理器,旨在加速并行计算,特别是机器学习应用中的大量并行计算。由于IPU的architecture和GPU不同,特别是IPU的内存容量远少于GPU,因此需要考虑减少模型大小的方法。蝴蝶分解是机器学习中广泛使用的替换方法,可以完全或部分替换完全连接和卷积层。在这篇论文中,我们研究了如何在IPU上实现蝴蝶结构,并研究其行为和性能,与GPU相比。实验结果表明,这些方法可以提供98.5%的压缩率,减少巨大的内存需求,IPU实现可以 benefit From 1.3x和1.6x的性能提升,分别是蝴蝶和像素化蝴蝶。此外,我们还达到了1.62x的训练时间加速,在实际数据集CIFAR10上。
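
A butterfly factorization replaces a dense n x n weight matrix with log2(n) sparse factors, each coupling index pairs that differ in a single bit, so the parameter count drops from n^2 to 2n log2(n). The sketch below builds such random factors in numpy to show the parameter saving; it is illustrative only and does not reflect the paper's IPU implementation or the pixelated-butterfly variant.

```python
import numpy as np

def random_butterfly_factors(n, rng):
    """Build log2(n) sparse butterfly factors; the factor with stride s couples
    index pairs (i, i XOR s), so each factor holds only 2n nonzeros."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    factors, s = [], n // 2
    while s >= 1:
        F = np.zeros((n, n))
        for i in range(n):
            F[i, i] = rng.normal()
            F[i, i ^ s] = rng.normal()     # partner index differs in one bit
        factors.append(F)
        s //= 2
    return factors

n = 64
rng = np.random.default_rng(0)
factors = random_butterfly_factors(n, rng)

# the product of the factors acts like a dense n x n weight matrix
W = np.linalg.multi_dot(factors)
x = rng.normal(size=n)
y = W @ x

dense_params = n * n
butterfly_params = len(factors) * 2 * n
print(f"dense params: {dense_params}, butterfly params: {butterfly_params} "
      f"({100 * (1 - butterfly_params / dense_params):.1f}% fewer)")
```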

Inverse classification with logistic and softmax classifiers: efficient optimization

  • paper_url: http://arxiv.org/abs/2309.08945
  • repo_url: None
  • paper_authors: Miguel Á. Carreira-Perpiñán, Suryabhan Singh Hada
  • for: 本研究目的是解决 inverse classification 问题,即查询一个训练好的分类器的 closest instance,以使得分类器预测的标签发生某种旨在的变化。
  • methods: 本研究使用了逻辑回归和 softmax 分类器,并利用了这两种模型的特殊性,实现了快速的优化解决方案。
  • results: 研究人员表明,可以通过 closed form 的方法解决逻辑回归模型,并通过迭代但非常快的方法解决 softmax 模型,可以准确地解决 inverse classification 问题,并且可以在毫秒级别的时间内解决高维度的实例和多个类型。
    Abstract In recent years, a certain type of problem has become of interest where one wants to query a trained classifier. Specifically, one wants to find the closest instance to a given input instance such that the classifier's predicted label is changed in a desired way. Examples of these "inverse classification" problems are counterfactual explanations, adversarial examples and model inversion. All of them are fundamentally optimization problems over the input instance vector involving a fixed classifier, and it is of interest to achieve a fast solution for interactive or real-time applications. We focus on solving this problem efficiently for two of the most widely used classifiers: logistic regression and softmax classifiers. Owing to special properties of these models, we show that the optimization can be solved in closed form for logistic regression, and iteratively but extremely fast for the softmax classifier. This allows us to solve either case exactly (to nearly machine precision) in a runtime of milliseconds to around a second even for very high-dimensional instances and many classes.
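
For a linear logistic classifier, the closest (Euclidean) input whose predicted probability hits a target value is an orthogonal projection onto the corresponding hyperplane, which is presumably the kind of closed form the abstract refers to. The sketch below implements that unconstrained case; any additional constraints or distance choices used in the paper are not modelled.

```python
import numpy as np

def closest_counterfactual(x, w, b, target_prob=0.5):
    """Closest point (in Euclidean distance) to x at which a logistic classifier
    p(y=1|x) = sigmoid(w.x + b) outputs target_prob: an orthogonal projection of
    x onto the hyperplane w.x + b = logit(target_prob)."""
    target_logit = np.log(target_prob / (1.0 - target_prob))
    shift = (w @ x + b - target_logit) / (w @ w)
    return x - shift * w

# toy 2-D classifier and a query point classified as positive
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 0.2])
x_cf = closest_counterfactual(x, w, b, target_prob=0.5)
print("counterfactual:", x_cf, " logit at counterfactual:", w @ x_cf + b)
```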

Fast Approximation of the Shapley Values Based on Order-of-Addition Experimental Designs

  • paper_url: http://arxiv.org/abs/2309.08923
  • repo_url: None
  • paper_authors: Liuqing Yang, Yongdao Zhou, Haoda Fu, Min-Qian Liu, Wei Zheng
  • for: 评估多方协作中每个 player 的贡献,以便做出公平的分配成本和利润。
  • methods: 使用 Shapley value 算法,但是因为计算复杂度太高,采用随机抽样法来估算 Shapley value。采用了基于 эксперименталь设计的 combinatorial structures 来实现更高精度的估算。
  • results: 对比 SRS 随机抽样法,DOE 采样 schemes 具有更高精度和可靠性,并且在某些情况下可以 deterministically recover 原始 Shapley value。在实验和实际应用中,DOE 采样 schemes 也表现出较快的计算速度。
    Abstract Shapley value is originally a concept in econometrics to fairly distribute both gains and costs to players in a coalition game. In the recent decades, its application has been extended to other areas such as marketing, engineering and machine learning. For example, it produces reasonable solutions for problems in sensitivity analysis, local model explanation towards the interpretable machine learning, node importance in social network, attribution models, etc. However, its heavy computational burden has been long recognized but rarely investigated. Specifically, in a $d$-player coalition game, calculating a Shapley value requires the evaluation of $d!$ or $2^d$ marginal contribution values, depending on whether we are taking the permutation or combination formulation of the Shapley value. Hence it becomes infeasible to calculate the Shapley value when $d$ is reasonably large. A common remedy is to take a random sample of the permutations to surrogate for the complete list of permutations. We find an advanced sampling scheme can be designed to yield much more accurate estimation of the Shapley value than the simple random sampling (SRS). Our sampling scheme is based on combinatorial structures in the field of design of experiments (DOE), particularly the order-of-addition experimental designs for the study of how the orderings of components would affect the output. We show that the obtained estimates are unbiased, and can sometimes deterministically recover the original Shapley value. Both theoretical and simulations results show that our DOE-based sampling scheme outperforms SRS in terms of estimation accuracy. Surprisingly, it is also slightly faster than SRS. Lastly, real data analysis is conducted for the C. elegans nervous system and the 9/11 terrorist network.
    摘要 沙普利值是原本 econometrics 中的一种概念,用于公平分配合作者在协同游戏中的收益和成本。在过去几十年中,它的应用范围已经扩展到了其他领域,如市场营销、工程和机器学习。例如,它可以解决敏感分析、地方模型解释、社交网络中节点重要性、负责模型等问题。然而,它的计算束缚非常重,长期被注意。特别是在 d 个玩家协同游戏中,计算沙普利值需要评估 d! 或 2^d 个边缘贡献值,这取决于我们是使用 permutation 还是 combination 的 Shapley value формулиров。当 d 较大时,计算沙晶利值变得不可能。通常的解决方案是随机抽样 permutations 来代替完整的 permutations 列表。我们发现了一种高级的随机抽样方案,可以为 estimation 提供更高的准确性。我们的采样方案基于设计实验 (DOE) 中的 combinatorial 结构,特别是 order-of-addition 实验设计,用于研究不同的组件顺序对输出的影响。我们证明了 obtained estimates 是无偏的,并可以 deterministically 恢复原始沙晶利值。 Both theoretical 和 simulations 结果表明,我们的 DOE-based 采样方案在 estimation 精度方面超过 SRS,并且在一些情况下可以 deterministically 恢复原始沙晶利值。 surprisingly,它还是微妙些快于 SRS。最后,我们对 C. elegans 神经系统和 9/11 恐怖袭击网络进行了实际数据分析。
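
For reference, the simple-random-sampling baseline that the proposed design-of-experiments scheme improves upon can be written in a few lines: sample permutations uniformly and average each player's marginal contributions. The toy cooperative game below is invented purely for illustration; the DOE-based sampling itself is not reproduced.

```python
import numpy as np

def shapley_srs(value_fn, d, num_perms, rng):
    """Monte-Carlo Shapley estimate by simple random sampling of permutations.
    value_fn maps a frozenset of player indices to the coalition's value."""
    phi = np.zeros(d)
    for _ in range(num_perms):
        order = rng.permutation(d)
        coalition, prev_value = set(), value_fn(frozenset())
        for player in order:
            coalition.add(player)
            new_value = value_fn(frozenset(coalition))
            phi[player] += new_value - prev_value      # marginal contribution
            prev_value = new_value
    return phi / num_perms

# toy game: the value of a coalition is the square of its total weight,
# for which the exact Shapley value of player i is w_i * sum(w)
weights = np.array([1.0, 2.0, 3.0])
game = lambda S: sum(weights[i] for i in S) ** 2

rng = np.random.default_rng(0)
print("estimated Shapley values:", shapley_srs(game, d=3, num_perms=2000, rng=rng))
```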

Efficient Methods for Non-stationary Online Learning

  • paper_url: http://arxiv.org/abs/2309.08911
  • repo_url: None
  • paper_authors: Peng Zhao, Yan-Feng Xie, Lijun Zhang, Zhi-Hua Zhou
  • for: 这个论文主要针对非站点环境下的在线减少偏差和适应偏差的优化问题。
  • methods: 这个论文提出了一种基于 parameter-free online learning 的减少偏差和适应偏差优化方法,通过减少每轮投影数量从 $\mathcal{O}(\log T)$ 降低到 $1$,并且只需一次 gradient query 和一次函数评估。
  • results: 该论文的实验结果验证了论文中的理论结论,并且显示了该方法在非站点环境下的高效性和稳定性。
    Abstract Non-stationary online learning has drawn much attention in recent years. In particular, dynamic regret and adaptive regret are proposed as two principled performance measures for online convex optimization in non-stationary environments. To optimize them, a two-layer online ensemble is usually deployed due to the inherent uncertainty of the non-stationarity, in which a group of base-learners are maintained and a meta-algorithm is employed to track the best one on the fly. However, the two-layer structure raises the concern about the computational complexity -- those methods typically maintain $\mathcal{O}(\log T)$ base-learners simultaneously for a $T$-round online game and thus perform multiple projections onto the feasible domain per round, which becomes the computational bottleneck when the domain is complicated. In this paper, we present efficient methods for optimizing dynamic regret and adaptive regret, which reduce the number of projections per round from $\mathcal{O}(\log T)$ to $1$. Moreover, our obtained algorithms require only one gradient query and one function evaluation at each round. Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial twists on non-stationary online methods. Empirical studies verify our theoretical findings.
    摘要 在这篇论文中,我们提出高效的方法来优化动态 regret和适应 regret,从而减少每个回合的射影数量从 $\mathcal{O}(\log T)$ 到 1。此外,我们的算法只需要一次 gradient query 和一次函数评估在每个回合。我们的技术基于在参数自由线上学习中的减少机制,需要非常简单的非站点线上方法的非常简单的非站点线上方法的修改。实验证明了我们的理论发现。

Robust Online Covariance and Sparse Precision Estimation Under Arbitrary Data Corruption

  • paper_url: http://arxiv.org/abs/2309.08884
  • repo_url: None
  • paper_authors: Tong Yao, Shreyas Sundaram
  • for: 本研究旨在提出一种在在线enario中Robustly estimate covariance matrix的方法,以适应数据损害和攻击。
  • methods: 本文提出了一种基于trimmed-inner-product算法的在线方法,可以在面临arbitrary和敌意数据攻击的情况下 robustly estimate covariance matrix。
  • results: 本文提供了error bound和 convergence property的分析,证明了该方法可以准确地估计精度矩阵,即precision matrix。
    Abstract Gaussian graphical models are widely used to represent correlations among entities but remain vulnerable to data corruption. In this work, we introduce a modified trimmed-inner-product algorithm to robustly estimate the covariance in an online scenario even in the presence of arbitrary and adversarial data attacks. At each time step, data points, drawn nominally independently and identically from a multivariate Gaussian distribution, arrive. However, a certain fraction of these points may have been arbitrarily corrupted. We propose an online algorithm to estimate the sparse inverse covariance (i.e., precision) matrix despite this corruption. We provide the error-bound and convergence properties of the estimates to the true precision matrix under our algorithms.
    摘要 Here's the Simplified Chinese translation: Gaussian 图模型广泛用于表示实体之间的相关性,但它们受到数据损害的威胁。在这项工作中,我们提出了一种修改后的内积法来在在线场景中稳定地估计协方差,即使在数据攻击中存在arbitrary和敌意的数据攻击。在每个时间步骤中,数据点会被独立地和identically从多变量 Gaussian 分布中采样,但一部分这些点可能被arbitrarily corrupted。我们提议一种在线算法来估计稀缺 inverse covariance(即精度)矩阵,即使在这些损害下。我们提供了估计错误 bound 和估计 converge 到真正精度矩阵的性质。
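
To illustrate the trimmed-inner-product idea, the sketch below estimates each covariance entry as a trimmed mean of coordinate products, which discards a fraction of extreme values and thus resists gross corruption. It is an offline, batch analogue written in numpy; the paper's online updates and sparse precision estimation are not reproduced, and the trimming fraction is an assumption.

```python
import numpy as np

def trimmed_mean(values, trim_frac=0.1):
    """Mean after discarding the largest and smallest trim_frac of the values."""
    v = np.sort(values)
    k = int(len(v) * trim_frac)
    return v[k:len(v) - k].mean()

def robust_covariance(X, trim_frac=0.1):
    """Entry-wise trimmed-inner-product covariance estimate for roughly
    zero-mean data X of shape (num_samples, dim)."""
    _, d = X.shape
    S = np.empty((d, d))
    for i in range(d):
        for j in range(i, d):
            S[i, j] = S[j, i] = trimmed_mean(X[:, i] * X[:, j], trim_frac)
    return S

# Gaussian samples with a handful of grossly corrupted rows
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:10] = 50.0                                   # arbitrary corruption
print("sample covariance diag:", np.round(np.cov(X, rowvar=False).diagonal(), 2))
print("trimmed estimate diag: ", np.round(robust_covariance(X).diagonal(), 2))
```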

Rethinking Learning Rate Tuning in the Era of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.08859
  • repo_url: https://github.com/mlsysx/lrbenchplusplus
  • paper_authors: Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang, Yanzhao Wu
  • for: 本研究旨在探讨大语言模型(LLM)精度预测性能的最佳化问题,尤其是学习率的调整问题。
  • methods: 本研究使用了现有的学习率策略分析LLM fine-tuning中的挑战和机遇,并提出了LRBench++来 benchmark学习率策略并且为LLM fine-tuning和传统的深度神经网络(DNN)训练提供了一个共享的 benchmarking工具。
  • results: 实验分析表明,LRBench++可以帮助找到最佳的学习率策略,并且在LLM fine-tuning和传统DNN训练中显示出了不同的特点。
    Abstract Large Language Models (LLMs) represent the recent success of deep learning in achieving remarkable human-like predictive performance. It has become a mainstream strategy to leverage fine-tuning to adapt LLMs for various real-world applications due to the prohibitive expenses associated with LLM training. The learning rate is one of the most important hyperparameters in LLM fine-tuning with direct impacts on both fine-tuning efficiency and fine-tuned LLM quality. Existing learning rate policies are primarily designed for training traditional deep neural networks (DNNs), which may not work well for LLM fine-tuning. We reassess the research challenges and opportunities of learning rate tuning in the coming era of Large Language Models. This paper makes three original contributions. First, we revisit existing learning rate policies to analyze the critical challenges of learning rate tuning in the era of LLMs. Second, we present LRBench++ to benchmark learning rate policies and facilitate learning rate tuning for both traditional DNNs and LLMs. Third, our experimental analysis with LRBench++ demonstrates the key differences between LLM fine-tuning and traditional DNN training and validates our analysis.
    摘要 Translated into Simplified Chinese:大型语言模型(LLM)表示深度学习最近的成功,它们可以达到人类预测性能的惊人水平。由于LLM训练的成本过高,因此现在主流的策略是通过微调来适应LLM应用。学习率是微调LLM中最重要的超参数,它直接影响微调效率和微调后LLM质量。现有的学习率策略主要适用于训练传统深度神经网络(DNN),可能不适用于LLM微调。我们重新评估LLM微调中的研究挑战和机遇,并提出三项原创贡献。首先,我们回顾现有的学习率策略,分析LLM微调中学习率的挑战。其次,我们提出LRBench++来评估学习率策略,并且为传统DNN和LLM进行微调。第三,我们通过LRBench++的实验分析,发现LLM微调和传统DNN训练存在重要的差异,并证明我们的分析。

Intelligent machines work in unstructured environments by differential neural computing

  • paper_url: http://arxiv.org/abs/2309.08835
  • repo_url: None
  • paper_authors: Shengbo Wang, Shuo Gao, Chenyu Tang, Cong Li, Shurui Wang, Jiaqi Wang, Hubin Zhao, Guohua Hu, Arokia Nathan, Ravinder Dahiya, Luigi Occhipinti
  • for: 提高智能机器在实际世界中高效地处理未知环境信息,如人类一样。
  • methods: 基于卷积神经计算的干扰信号处理和学习方法,通过提取环境信息的主要特征并应用相关编码刺激到 memristors 中,成功实现了人类类似的处理未知环境信息能力,包括增强(>720%)和适应(<50%)机械刺激。
  • results: 方法在两种常见的智能机器应用中展现了良好的扩展性和泛化性,即物品抓取和自动驾驶。在前者中,通过学习未知物体特征(如锋利角和平滑表面),一个 memristor 在 1 ms 内实现了安全和稳定的抓取。在后者中,通过使用 40x25 个 memristor 数组,成功EXTRACTED 10 种未知环境决策信息(如超越车辆和行人),准确率达 94%。
    Abstract Expecting intelligent machines to efficiently work in real world requires a new method to understand unstructured information in unknown environments with good accuracy, scalability and generalization, like human. Here, a memristive neural computing based perceptual signal differential processing and learning method for intelligent machines is presented, via extracting main features of environmental information and applying associated encoded stimuli to memristors, we successfully obtain human-like ability in processing unstructured environmental information, such as amplification (>720%) and adaptation (<50%) of mechanical stimuli. The method also exhibits good scalability and generalization, validated in two typical applications of intelligent machines: object grasping and autonomous driving. In the former, a robot hand experimentally realizes safe and stable grasping, through learning unknown object features (e.g., sharp corner and smooth surface) with a single memristor in 1 ms. In the latter, the decision-making information of 10 unstructured environments in autonomous driving (e.g., overtaking cars, pedestrians) are accurately (94%) extracted with a 40x25 memristor array. By mimicking the intrinsic nature of human low-level perception mechanisms in electronic memristive neural circuits, the proposed method is adaptable to diverse sensing technologies, helping intelligent machines to generate smart high-level decisions in real world.
    摘要 要让智能机器在真实世界中有效工作,需要一种新的方法来理解未知环境中的无结构信息,具有人类水平的准确性、可扩展性和泛化能力。在这篇文章中,我们提出了基于干扰神经计算的嗅收信号处理和学习方法,通过提取环境信息的主要特征并将其与干扰器相关的编码刺激应用于干扰器中,成功地实现了人类水平的环境信息处理能力,包括增强(>720%)和适应(<50%)的机械刺激。这种方法还具有良好的扩展性和泛化能力,在智能机器的两个典型应用中进行验证:物体抓取和自动驾驶。在前者中,一个机器人手经过实验性地实现了安全和稳定的抓取,通过学习未知物体特征(如锋利角和平滑面)的一个干扰器,在1毫秒内完成。在后者中,一个40x25干扰器数组可以高精度地提取10种不同的自动驾驶环境中的决策信息(如超越车辆和步行人),准确率达94%。通过模仿人类低级感觉机制的内在性,我们的方法可以与多种感知技术结合,帮助智能机器在真实世界中产生智能高级决策。

eess.IV - 2023-09-16

Wavelet-based Topological Loss for Low-Light Image Denoising

  • paper_url: http://arxiv.org/abs/2309.08975
  • repo_url: None
  • paper_authors: Alexandra Malyugina, Nantheera Anantrasirichai, David Bull
  • for: 提高图像减雷的效果,增强图像的对比度和保留Texture信息
  • methods: 提出一种新的减雷损失函数,包括图像结构信息和常见的深度学习任务中的空间信息
  • results: 对BVI-Lowlight dataset进行训练,并在LPIPS metric中提高了25%,表明提出的损失函数能够更好地训练神经网络,提高图像减雷的效果。
    Abstract Despite extensive research conducted in the field of image denoising, many algorithms still heavily depend on supervised learning and their effectiveness primarily relies on the quality and diversity of training data. It is widely assumed that digital image distortions are caused by spatially invariant Additive White Gaussian Noise (AWGN). However, the analysis of real-world data suggests that this assumption is invalid. Therefore, this paper tackles image corruption by real noise, providing a framework to capture and utilise the underlying structural information of an image along with the spatial information conventionally used for deep learning tasks. We propose a novel denoising loss function that incorporates topological invariants and is informed by textural information extracted from the image wavelet domain. The effectiveness of this proposed method was evaluated by training state-of-the-art denoising models on the BVI-Lowlight dataset, which features a wide range of real noise distortions. Adding a topological term to common loss functions leads to a significant increase in the LPIPS (Learned Perceptual Image Patch Similarity) metric, with the improvement reaching up to 25\%. The results indicate that the proposed loss function enables neural networks to learn noise characteristics better. We demonstrate that they can consequently extract the topological features of noise-free images, resulting in enhanced contrast and preserved textural information.
    摘要 尽管在图像噪声除除领域进行了广泛的研究,许多算法仍然依赖于指导学习,其效果主要取决于训练数据的质量和多样性。通常认为数字图像扭曲是由空间不变的加速白噪声(AWGN)引起的,但是分析实际数据表示这个假设是无效的。因此,这篇论文通过图像扭曲实际噪声,提供了一个捕捉图像下的结构信息以及传统用于深度学习任务中的空间信息的框架。我们提议一种新的减噪损失函数,该函数包含拓扑 invariants 和通过图像振荡频谱领域提取的文本信息。我们通过在 BVI-Lowlight 数据集上训练现状顶峰的减噪模型,评估了该提议的效果。添加拓扑项到常见损失函数后,LPIPS(学习感知图像补充相似度)指标上的提升可达 25%。结果表明,我们的损失函数使得神经网络学习噪声特征更好。我们示出,神经网络可以根据噪声free图像的拓扑特征提取图像的增强对比度和保持文本信息。

eess.SP - 2023-09-16

Optimal Photodetector Size for High-Speed Free-Space Optics Receivers

  • paper_url: http://arxiv.org/abs/2309.09090
  • repo_url: None
  • paper_authors: Muhammad Salman Bashir, Qasim Zeeshan Ahmed, Mohamed-Slim Alouini
  • for: 优化光电器面积以实现高速度数据传输
  • methods: 使用closed-form解题法优化光电器面积,以 maximize通道容量
  • results: 实现了各种光无线通信系统的最大可达数据速率,包括长距离深空光链和短距离室内可见光通信系统
    Abstract The selection of an optimal photodetector area is closely linked to the attainment of higher data rates in optical wireless communication receivers. If the photodetector area is too large, the channel capacity degrades due to lower modulation bandwidth of the detector. A smaller photodetector maximizes the bandwidth, but minimizes the captured signal power and the subsequent signal-to-noise ratio. Therein lies an opportunity in this trade-off to maximize the channel rate by choosing the optimal photodetector area. In this study, we have optimized the photodetector area in order to maximize the channel capacity of a free-space optical link for a diverse set of communication scenarios. We believe that the study in this paper in general -- and the closed-form solutions derived in this study in particular -- will be helpful to maximize achievable data rates of a wide gamut of optical wireless communication systems: from long range deep space optical links to short range indoor visible light communication systems.
    摘要 选择最佳光探测面积对于光无线通信接收器的数据速率的实现具有紧密的关系。如果光探测面积太大,通道容量会降低因为探测器的模拟宽频率下降。小型光探测器可以最大化宽频率,但是它会降低捕获信号强度和相应的噪声比。这里就存在一个利点,通过选择最佳光探测面积可以最大化通道容量。在这项研究中,我们对光无线通信系统中的多种通信场景进行了优化光探测面积,以实现最大化通道容量。我们认为这项研究的总体成果以及 derive的关闭形解决方案会对各种光无线通信系统中的数据速率帮助提高。从深空光学链到indoor可见光通信系统,我们认为这项研究将对各种系统的实现数据速率具有帮助。
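
The bandwidth-versus-captured-power trade-off can be illustrated with a toy numerical sweep: assume received power scales with detector area, bandwidth scales inversely with area, and pick the area maximising Shannon capacity. All constants below are made up for illustration; the paper derives the optimum in closed form under specific link and noise models.

```python
import numpy as np

# Toy capacity model: captured power grows with detector area, while the
# detector's modulation bandwidth shrinks with area (capacitance-limited).
areas = np.linspace(0.01, 5.0, 500)            # photodetector area (arbitrary units)
P_rx = 1e-3 * areas                             # captured optical power  ~ area
B = 1e9 / areas                                 # modulation bandwidth    ~ 1/area
N0 = 1e-15                                      # noise spectral density (assumed)
snr = P_rx / (N0 * B)
capacity = B * np.log2(1.0 + snr)               # Shannon capacity of the link

best = np.argmax(capacity)
print(f"toy-optimal area: {areas[best]:.2f}, capacity: {capacity[best] / 1e9:.2f} Gbit/s")
```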

Split Federated Learning for 6G Enabled-Networks: Requirements, Challenges and Future Directions

  • paper_url: http://arxiv.org/abs/2309.09086
  • repo_url: None
  • paper_authors: Houda Hafi, Bouziane Brik, Pantelis A. Frangoudis, Adlen Ksentini
  • for: This paper is written to explore the potential of Split Federated Learning (SFL) in 6G wireless networks and its applications in various use cases.
  • methods: The paper uses a comprehensive study of SFL techniques and their deployment over 6G wireless networks, including an overview of three emerging collaborative learning paradigms and their comparison with existing approaches.
  • results: The paper highlights the need for SFL in 6G networks and its potential benefits in improving data privacy and reducing communication overhead, and identifies key technical challenges and future research directions in this area.
    Abstract Sixth-generation (6G) networks anticipate intelligently supporting a wide range of smart services and innovative applications. Such a context urges a heavy usage of Machine Learning (ML) techniques, particularly Deep Learning (DL), to foster innovation and ease the deployment of intelligent network functions/operations, which are able to fulfill the various requirements of the envisioned 6G services. Specifically, collaborative ML/DL consists of deploying a set of distributed agents that collaboratively train learning models without sharing their data, thus improving data privacy and reducing the time/communication overhead. This work provides a comprehensive study on how collaborative learning can be effectively deployed over 6G wireless networks. In particular, our study focuses on Split Federated Learning (SFL), a technique recently emerged promising better performance compared with existing collaborative learning approaches. We first provide an overview of three emerging collaborative learning paradigms, including federated learning, split learning, and split federated learning, as well as of 6G networks along with their main vision and timeline of key developments. We then highlight the need for split federated learning towards the upcoming 6G networks in every aspect, including 6G technologies (e.g., intelligent physical layer, intelligent edge computing, zero-touch network management, intelligent resource management) and 6G use cases (e.g., smart grid 2.0, Industry 5.0, connected and autonomous systems). Furthermore, we review existing datasets along with frameworks that can help in implementing SFL for 6G networks. We finally identify key technical challenges, open issues, and future research directions related to SFL-enabled 6G networks.
    摘要 We first provide an overview of three emerging collaborative learning paradigms, including federated learning, split learning, and split federated learning, as well as an overview of 6G networks along with their main vision and timeline of key developments. We then highlight the need for split federated learning towards the upcoming 6G networks in every aspect, including 6G technologies (e.g., intelligent physical layer, intelligent edge computing, zero-touch network management, intelligent resource management) and 6G use cases (e.g., smart grid 2.0, Industry 5.0, connected and autonomous systems). Furthermore, we review existing datasets along with frameworks that can help in implementing SFL for 6G networks.We finally identify key technical challenges, open issues, and future research directions related to SFL-enabled 6G networks. These include the need for better privacy guarantees, more efficient communication protocols, and better handling of non-iid data distributions. Additionally, there is a need for more research on the intersection of SFL and other emerging technologies such as edge computing, blockchain, and quantum computing.In summary, this work provides a comprehensive study on the potential of SFL for 6G networks, highlighting its benefits, challenges, and future research directions. The findings of this study can help researchers and practitioners to better understand the potential of SFL in 6G networks and to develop innovative solutions that can leverage the advantages of collaborative learning while ensuring data privacy and reducing communication overhead.

Blind Deconvolution of Sparse Graph Signals in the Presence of Perturbations

  • paper_url: http://arxiv.org/abs/2309.09063
  • repo_url: None
  • paper_authors: Victor M. Tenorio, Samuel Rey, Antonio G. Marques
  • for: 解压缩图像信号,以获取输入(源)和图像扩散过程中的筛选器(模型)。
  • methods: 提议使用优化算法来解压缩图像信号,并考虑图像扩散过程中的瑕疵。
  • results: 预liminary numerical experiments表明,提议的算法可以有效地解压缩图像信号。
    Abstract Blind deconvolution over graphs involves using (observed) output graph signals to obtain both the inputs (sources) as well as the filter that drives (models) the graph diffusion process. This is an ill-posed problem that requires additional assumptions, such as the sources being sparse, to be solvable. This paper addresses the blind deconvolution problem in the presence of imperfect graph information, where the observed graph is a perturbed version of the (unknown) true graph. While not having perfect knowledge of the graph is arguably more the norm than the exception, the body of literature on this topic is relatively small. This is partly due to the fact that translating the uncertainty about the graph topology to standard graph signal processing tools (e.g. eigenvectors or polynomials of the graph) is a challenging endeavor. To address this limitation, we propose an optimization-based estimator that solves the blind identification in the vertex domain, aims at estimating the inverse of the generating filter, and accounts explicitly for additive graph perturbations. Preliminary numerical experiments showcase the effectiveness and potential of the proposed algorithm.
    摘要 盲损减殖在图上 involve 使用(观察)输出图像信号来获得输入(源)以及驱动图像扩散过程的筛选器。这是一个不充分定义的问题,需要更多的假设,如源是稀疏的,才能解决。本文 Addresses 盲损减殖问题在不完整的图像信息下,其中观察到的图像是真实图像未知的 perturbed 版本。尽管不具备完整的图像信息是更常见的情况,但相关的文献较少,这可能是因为将图像 topology 不确定性翻译到标准的图像处理工具(例如对角线或图像权值)是一个困难的任务。为解决这些限制,我们提出了一种优化基于的估计器,解决盲损减殖问题在顶点域,计算出生成器的逆,并考虑到添加性的图像杂化。初步的数字实验显示了提案的算法的有效性和潜力。

A Low-Latency FFT-IFFT Cascade Architecture

  • paper_url: http://arxiv.org/abs/2309.09035
  • repo_url: None
  • paper_authors: Keshab K. Parhi
  • for: 这篇论文描述了一种不需要中间缓冲的幂等快速傅立卷-快速傅立卷架构的设计。
  • methods: 该架构使用叠加来实现部分并行的FFT和IFFT架构。通过不同的叠加集来设计FFT和IFFT架构,但是对于给定的叠加FFT架构,存在一个唯一的叠加集来设计IFFT架构,无需中间缓冲。
  • results: 该方法可以避免中间缓冲,降低延迟和释放内存空间。此外,该方法还可以扩展到多通道时间序列的并行处理。相比一个具有相同叠加集的设计,该架构可以节省约N/2个存储元素和N/4个时钟周期的延迟。对于2个扩展FFT-IFFT架构,则分别节省约N/2个存储元素和N/2个时钟周期的延迟。
    Abstract This paper addresses the design of a partly-parallel cascaded FFT-IFFT architecture that does not require any intermediate buffer. Folding can be used to design partly-parallel architectures for FFT and IFFT. While many cascaded FFT-IFFT architectures can be designed using various folding sets for the FFT and the IFFT, for a specified folded FFT architecture, there exists a unique folding set to design the IFFT architecture that does not require an intermediate buffer. Such a folding set is designed by processing the output of the FFT as soon as possible (ASAP) in the folded IFFT. Elimination of the intermediate buffer reduces latency and saves area. The proposed approach is also extended to interleaved processing of multi-channel time-series. The proposed FFT-IFFT cascade architecture saves about N/2 memory elements and N/4 clock cycles of latency compared to a design with identical folding sets. For the 2-interleaved FFT-IFFT cascade, the memory and latency savings are, respectively, N/2 units and N/2 clock cycles, compared to a design with identical folding sets.

Localization with Noisy Android Raw GNSS Measurements

  • paper_url: http://arxiv.org/abs/2309.08936
  • repo_url: None
  • paper_authors: Xu Weng, Keck Voon Ling
  • for: 本研究旨在利用AndroidRaw全球导航卫星系统(GNSS)测量来进行高精度定位任务,传统上由特殊GNSS接收器进行。
  • methods: 本研究使用Moveing Horizon Estimation(MHE)、Extended Kalman Filter(EKF)和Rauch-Tung-Striebel(RTS)缓和器来抑制噪声。
  • results: 实验结果显示,RTS缓和器可以实现最佳的定位性能,在静止和动态情况下对应位置误差降低76.4%和46.5%,相比基准weighted least squares(WLS)方法。
    Abstract Android raw Global Navigation Satellite System (GNSS) measurements are expected to bring power to take on demanding localization tasks that are traditionally performed by specialized GNSS receivers. The hardware constraints, however, make Android raw GNSS measurements much noisier than geodetic-quality ones. This study elucidates the principles of localization using Android raw GNSS measurements and leverages Moving Horizon Estimation (MHE), Extended Kalman Filter (EKF), and Rauch-Tung-Striebel (RTS) smoother for noise suppression. The experiment results showcase that RTS smoother achieves the best localization performance and yields a remarkable reduction of 76.4\% and 46.5\% in horizontal positioning error during static and dynamic scenarios compared to the baseline weighted least squares (WLS) method.
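
For readers unfamiliar with the estimators compared above, the sketch below runs a forward Kalman filter followed by a backward Rauch-Tung-Striebel pass on a toy 1-D constant-velocity track with noisy position fixes. The model, noise levels, and scalar "position" measurements are illustrative assumptions; real GNSS processing operates on raw pseudorange measurements rather than positions.

```python
import numpy as np

def kalman_rts(zs, dt=1.0, q=0.01, r=4.0):
    """Forward Kalman filter + backward Rauch-Tung-Striebel smoother for a 1-D
    constant-velocity model; zs are noisy position fixes."""
    F = np.array([[1.0, dt], [0.0, 1.0]])           # state transition
    H = np.array([[1.0, 0.0]])                      # observe position only
    Q = q * np.eye(2)                               # process noise
    R = np.array([[r]])                             # measurement noise
    n = len(zs)
    xf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))          # filtered
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))          # predicted
    x, P = np.array([zs[0], 0.0]), np.eye(2) * 10.0
    for k in range(n):
        xp[k], Pp[k] = F @ x, F @ P @ F.T + Q                 # predict
        S = H @ Pp[k] @ H.T + R
        K = Pp[k] @ H.T @ np.linalg.inv(S)                    # Kalman gain
        x = xp[k] + (K @ (zs[k] - H @ xp[k])).ravel()
        P = (np.eye(2) - K @ H) @ Pp[k]
        xf[k], Pf[k] = x, P
    xs = xf.copy()
    for k in range(n - 2, -1, -1):                            # backward RTS pass
        C = Pf[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = xf[k] + C @ (xs[k + 1] - xp[k + 1])
    return xf[:, 0], xs[:, 0]

rng = np.random.default_rng(0)
truth = 0.5 * np.arange(60)                                   # constant-velocity track
zs = truth + rng.normal(scale=2.0, size=60)                   # noisy "positions"
filt, smooth = kalman_rts(zs)
print(f"RMSE raw {np.sqrt(np.mean((zs - truth) ** 2)):.2f}, "
      f"filtered {np.sqrt(np.mean((filt - truth) ** 2)):.2f}, "
      f"smoothed {np.sqrt(np.mean((smooth - truth) ** 2)):.2f}")
```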

Scalable Multiuser Immersive Communications with Multi-numerology and Mini-slot

  • paper_url: http://arxiv.org/abs/2309.08906
  • repo_url: None
  • paper_authors: Ming Hu, Jiazhi Peng, Lifeng Wang, Kai-Kit Wong
  • for: 这个论文是为了研究多用户 immerse 通信网络,在这些网络中不同的用户设备可能需要不同的扩展现实(XR)服务。
  • methods: 该论文提出了一种可扩展的时间频率资源分配方法,基于多 numerology 和 mini-slot。
  • results: 该方法可以有效地提高多用户 immerse 通信网络中的总体品质经验(QoE),并且可以适应不同用户的 QoE 限制。
    Abstract This paper studies multiuser immersive communications networks in which different user equipment may demand various extended reality (XR) services. In such heterogeneous networks, time-frequency resource allocation needs to be more adaptive since XR services are usually multi-modal and latency-sensitive. To this end, we develop a scalable time-frequency resource allocation method based on multi-numerology and mini-slot. To appropriately determine the discrete parameters of multi-numerology and mini-slot for multiuser immersive communications, the proposed method first presents a novel flexible time-frequency resource block configuration, and then leverages deep reinforcement learning to maximize the total quality-of-experience (QoE) under different users' QoE constraints. The results confirm the efficiency and scalability of the proposed time-frequency resource allocation method.
    摘要 To determine the appropriate discrete parameters of multi-numerology and mini-slot for multi-user immersive communications, the proposed method begins by presenting a flexible time-frequency resource block configuration. Then, it utilizes deep reinforcement learning to maximize the total quality-of-experience (QoE) while meeting the different users' QoE constraints.The results demonstrate the efficiency and scalability of the proposed time-frequency resource allocation method.

CDDM: Channel Denoising Diffusion Models for Wireless Semantic Communications

  • paper_url: http://arxiv.org/abs/2309.08895
  • repo_url: None
  • paper_authors: Tong Wu, Zhiyong Chen, Dazhi He, Liang Qian, Yin Xu, Meixia Tao, Wenjun Zhang
  • for: 这篇论文主要目的是提出一种新的物理层模块——渠道减噪扩散模型(CDDM),用于semantic通信系统中减噪。
  • methods: 该论文使用了扩散模型(DM),特别是针对渠道模型的扩散进行特定的设计,以及针对渠道模型的特殊化采样和训练算法。
  • results: 实验结果表明,CDDM可以减少接收信号的条件熵,并且在小步骤下可以有效减少MSE。此外,joint CDDM和JSCC系统在图像传输中表现更好,并且比JSCC系统和传统的JPEG2000与LDPC编码方法更好。
    Abstract Diffusion models (DM) can gradually learn to remove noise, which have been widely used in artificial intelligence generated content (AIGC) in recent years. The property of DM for eliminating noise leads us to wonder whether DM can be applied to wireless communications to help the receiver mitigate the channel noise. To address this, we propose channel denoising diffusion models (CDDM) for semantic communications over wireless channels in this paper. CDDM can be applied as a new physical layer module after the channel equalization to learn the distribution of the channel input signal, and then utilizes this learned knowledge to remove the channel noise. We derive corresponding training and sampling algorithms of CDDM according to the forward diffusion process specially designed to adapt the channel models and theoretically prove that the well-trained CDDM can effectively reduce the conditional entropy of the received signal under small sampling steps. Moreover, we apply CDDM to a semantic communications system based on joint source-channel coding (JSCC) for image transmission. Extensive experimental results demonstrate that CDDM can further reduce the mean square error (MSE) after minimum mean square error (MMSE) equalizer, and the joint CDDM and JSCC system achieves better performance than the JSCC system and the traditional JPEG2000 with low-density parity-check (LDPC) code approach.
    摘要 Diffusion models (DM) 可以慢慢地学习去除噪音,已经广泛应用于人工智能生成内容 (AIGC) 领域的最新技术。DM 对于通信频率 Canal 的噪音 mitigation 提供了一个可能性,因此我们在本文中提出了通道减噪扩散模型 (CDDM)。CDDM 可以作为通信物理层模块,在通道均衡后学习通道输入信号的分布,然后利用这些学习知识来去除通道噪音。我们提出了对应的训练和采样算法,并经过特殊设计的前进扩散过程来适应通道模型,并理论上证明了充分训练 CDDM 可以在小步骤下降低接收信号的 conditional entropy。此外,我们应用 CDDM 到基于联合源-通道编码 (JSCC) 的图像传输系统中。实验结果表明,CDDM 可以在 MMSE 等式后进行加权平均值补做,并且联合 CDDM 和 JSCC 系统可以在 JSCC 系统和传统的 JPEG2000 低密度极性码 (LDPC) 方法之上具有更好的性能。

Demo: Intelligent Radar Detection in CBRS Band in the Colosseum Wireless Network Emulator

  • paper_url: http://arxiv.org/abs/2309.08861
  • repo_url: None
  • paper_authors: Davide Villa, Daniel Uvaydov, Leonardo Bonati, Pedram Johari, Josep Miquel Jornet, Tommaso Melodia
  • for: 这个论文是为了研究商业激光波形与无线网络共同运行的技术。
  • methods: 这个研究使用了Colosseum,全球最大的无线网络模拟器,以及硬件在回路的技术来模拟实际的无线网络环境。
  • results: 实验结果显示,使用机器学习代理人在基站中训练时,可以实现88%的检测精度,检测时间为137ms。
    Abstract The ever-growing number of wireless communication devices and technologies demands spectrum-sharing techniques. Effective coexistence management is crucial to avoid harmful interference, especially with critical systems like nautical and aerial radars in which incumbent radios operate mission-critical communication links. In this demo, we showcase a framework that leverages Colosseum, the world's largest wireless network emulator with hardware-in-the-loop, as a playground to study commercial radar waveforms coexisting with a cellular network in CBRS band in complex environments. We create an ad-hoc high-fidelity spectrum-sharing scenario for this purpose. We deploy a cellular network to collect IQ samples with the aim of training an ML agent that runs at the base station. The agent has the goal of detecting incumbent radar transmissions and vacating the cellular bandwidth to avoid interfering with the radar operations. Our experiment results show an average detection accuracy of 88%, with an average detection time of 137 ms.
    摘要 随着无线通信设备和技术的不断增加,需要 spectrum-sharing 技术来实现共享频率。有效地管理共享是关键,以避免干扰,特别是与航空和海上雷达系统相关的核心通信链接。在这个 demo 中,我们利用 Colosseum,全球最大的无线网络模拟器,作为一个实验室,研究商业雷达波形在 CBRS 频级上与无线网络共享资源的可行性。我们创建了一个高精度的 spectrum-sharing enario,并将一个 cellular 网络部署到收集 IQ 样本,以用于训练基站上运行的机器学习代理。这个代理的目标是检测 incumbent 雷达传输,并让 cellular 频率占用避免与雷达操作干扰。我们的实验结果显示,检测精度平均为 88%,检测时间平均为 137 ms。

cs.SD - 2023-09-15

Stack-and-Delay: a new codebook pattern for music generation

  • paper_url: http://arxiv.org/abs/2309.08804
  • repo_url: None
  • paper_authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra
  • for: 这 paper 是为了提高语音生成模型的执行速度而写的。
  • methods: 这 paper 使用了一种新的 stack-and-delay 解码策略,以提高 auto-regressive 解码的速度。
  • results: 对于同等效果预算,这 paper 的新策略可以在对 GPU 进行批处理时提高生成速度,并且在对比 vanilla flat 解码法时,质量几乎相当。
    Abstract In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay style of decoding strategy to improve upon the flat pattern decoding where generation speed is four times faster as opposed to vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy, and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the gap with the flat pattern in terms of quality. The results are corroborated by subjective evaluations which show that samples generated by the new model are slightly more often preferred to samples generated by the competing model given the same text prompts.
    摘要 在语言模型基于音乐生成中,生成波形被表示为一个层次化token堆,可以在某些情况下以自动递归方式或平行方式解码,具体取决于codebook Pattern。特别是,平滑codebooks表示最高质量解码策略,但却非常慢。为此,我们提出了一种新的堆延式解码策略,以提高对于平滑解码的速度。这将执行时间与延迟解码策略相似,并在小批处理时在GPU上进行更快的执行。对于同样的推理效率预算,我们表明了我们的方法在对象评价中比 delay Pattern 更好,几乎与平滑Pattern 相近的质量。结果得到了subjective评价的支持,显示新模型生成的样本在同一个文本提示下被轻微更多地选择。
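
The codebook patterns discussed here amount to bookkeeping over a (codebooks x frames) token grid; as a point of reference, the sketch below builds the standard "delay" arrangement, where codebook k is shifted right by k decoding steps. The stack-and-delay variant proposed in the paper is not reproduced, and the token values and padding symbol are placeholders.

```python
import numpy as np

def delay_pattern(tokens, pad=-1):
    """Arrange a (K codebooks, T frames) token grid into the 'delay' decoding
    pattern: row k is shifted right by k steps and padded elsewhere."""
    K, T = tokens.shape
    out = np.full((K, T + K - 1), pad, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + T] = tokens[k]
    return out

# toy grid: 4 codebooks, 6 frames, tokens labelled 10*codebook + frame
K, T = 4, 6
tokens = np.arange(K)[:, None] * 10 + np.arange(T)[None, :]
print(delay_pattern(tokens))
```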

Music Source Separation Based on a Lightweight Deep Learning Framework (DTTNET: DUAL-PATH TFC-TDF UNET)

  • paper_url: http://arxiv.org/abs/2309.08684
  • repo_url: https://github.com/junyuchen-cjy/dttnet-pytorch
  • paper_authors: Junyu Chen, Susmitha Vekkot, Pancham Shukla
  • for: 本研究旨在提出一种轻量级的音乐源分离模型(DTTNet),以提高音乐源分离的效果。
  • methods: 本文使用了一种基于双路模块和时域频域卷积的时间分布式全连接卷积神经网络(TFC-TDF UNet),并对模型进行了训练。
  • results: 对于 vocals 部分,DTTNet 可以达到 10.12 dB cSDR,比 Bandsplit RNN (BSRNN) 高出 0.11 dB,但具有 86.7% fewer 参数。
    Abstract Music source separation (MSS) aims to extract 'vocals', 'drums', 'bass' and 'other' tracks from a piece of mixed music. While deep learning methods have shown impressive results, there is a trend toward larger models. In our paper, we introduce a novel and lightweight architecture called DTTNet, which is based on Dual-Path Module and Time-Frequency Convolutions Time-Distributed Fully-connected UNet (TFC-TDF UNet). DTTNet achieves 10.12 dB cSDR on 'vocals' compared to 10.01 dB reported for Bandsplit RNN (BSRNN) but with 86.7% fewer parameters. We also assess pattern-specific performance and model generalization for intricate audio patterns.
    摘要 音乐源分离(MSS)目标是从混合音乐中提取“声乐”、“鼓”、“低音”和“其他”多个轨道。深度学习方法已经表现出色,但是现在有一趋势是增大模型。在我们的论文中,我们介绍了一种新的轻量级架构,称为DTTNet,它基于双路模块和时域频域卷积(TFC-TDF UNet)。DTTNet实现了10.12 dB的清晰度(cSDR),比BSRNN(Bandsplit RNN)的10.01 dB高,但具有86.7%的参数数量少。我们还评估了模型对复杂音乐 patrern的特定性能和通用性。

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.08436
  • repo_url: None
  • paper_authors: Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney
  • for: 这篇论文旨在提出一种流处理的注意力基于encoder-decoder模型,其中 either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks.
  • methods: 该模型使用特殊的 end-of-chunk (EOC) 符号来进行chunk boundaries的标识,而不是 convential end-of-sequence 符号。此外,模型还 explores 其他与标准转录器模型的差异。
  • results: 通过在 Librispeech 和 TED-LIUM-v2 上进行实验,并将 consecutives sequences concatenated for long-form trials,发现该流处理模型与非流处理模型的性能相对 compatible,并且在长型语音总结very well。
    Abstract We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
    摘要 我们研究了一个流处理器基于注意力的编解码器模型,其中编码器或解码器都操作在预定的固定大小窗口(chunk)上。特殊的结束chunk(EOC)符号在一个chunk到下一个chunk之间进行转移,从而替代传统的结束序列符号。这种修改虽小,但将我们的模型与帧模型等同起来,其中EOC符号与空符号相对应。我们进一步探讨了标准转录器和我们模型之间的剩余差异。我们还考虑了长форма语音总体化、扫描大小和长度 нормализация等相关因素。通过对Librispeech和TED-LIUM-v2上进行实验,并将 consecutivesequences concatenate для长形试验,我们发现我们的流处理器模型与非流处理器模型的性能相比具有竞争力,并且对长形语音总体化很好。

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

  • paper_url: http://arxiv.org/abs/2309.08408
  • repo_url: https://github.com/mrjunjieli/activeextract
  • paper_authors: Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li
  • for: 本研究旨在提高target speaker抽取的精度,特别是在稀有 overlap的场景下。
  • methods: 本文提出了一种名为ActiveExtract的音频视频 speaker抽取模型,该模型利用音频视频活跃 speaker检测(ASD)来直接提供目标说话者的帧级活动信息,同时使用ASD的中间特征表示来鉴别说话lip同步。
  • results: 实验结果表明, compared to基线,我们的模型在不同的 overlap ratio下均表现出超过4dB的提升,这表明我们的模型可以在稀有 overlap的场景下提高target speaker抽取的精度。
    Abstract Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. Experimental results show our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
    摘要 target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. however, this scenario only accounts for a small percentage of real-world conversations. in this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. we propose an audio-visual speaker extraction model named activeextract, which leverages speaking activity from audio-visual active speaker detection (asd). the asd directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. experimental results show our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of si-snr.Note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you need Traditional Chinese, please let me know.

Audio-free Prompt Tuning for Language-Audio Models

  • paper_url: http://arxiv.org/abs/2309.08357
  • repo_url: None
  • paper_authors: Yiming Li, Xiangdong Wang, Hong Liu
  • for: 这个论文想要协助CLAP模型从语言 Audio域预训练中提取特征,以便在不需要标注的域音频数据下进行适应。
  • methods: 我们提议一种不需要域音频数据的CLAP模型调教方法,通过利用CLAP模型的modalitiesAlignment来调整一些文本提示符,以便更好地调整模型空间,避免过拟合见到的类别。此外,我们还探索了多层排序提示的策略,以 fusionglobal和local信息。
  • results: 我们的方法可以提高CLAP模型的性能和训练效率,并在零例推理中对未经见的类别进行识别,并且比vanilla CLAP更好地转移知识。此外,我们的方法还可以在只知道下游类别名称的情况下进行适应。
    Abstract Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, making it a natural zero-shot classifier to recognize unseen sound categories. To adapt CLAP to downstream tasks, prior works inevitably require labeled domain audios, which limits their scalability under data scarcity and deprives them of the capability to detect novel classes as the original CLAP. In this work, by leveraging the modality alignment in CLAP, we propose an efficient audio-free prompt tuning scheme aimed at optimizing a few prompt tokens from texts instead of audios, which regularizes the model space to avoid overfitting the seen classes as well. Based on this, a multi-grained prompt design is further explored to fuse global and local information. Experiments on several tasks demonstrate that our approach can boost the CLAP and outperform other training methods on model performance and training efficiency. While conducting zero-shot inference on unseen categories, it still shows better transferability than the vanilla CLAP. Moreover, our method is flexible enough even if only knowing the downstream class names. The code will be released soon.
    摘要 “对于语音识别任务,我们提出了一个有效的无音训练方法,可以将文本提示调整为CLAP模型的条件,以提高模型的性能和训练效率。这个方法基于CLAP模型中的modalità对齐,可以将文本提示调整为CLAP模型的条件,以避免模型过拟合见到的类别。我们还提出了一个多层次提示设计,可以融合全球和本地信息。实验结果显示,我们的方法可以提高CLAP模型的性能和训练效率,并且在零shot推断中也表现出比vanilla CLAP更好的转移性。此外,我们的方法可以让你只知道下游类别名称来进行训练,并且还可以在零shot推断中进行推断。我们将将代码发布 soon。”Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need Traditional Chinese, please let me know and I can provide that as well.

Semi-supervised Sound Event Detection with Local and Global Consistency Regularization

  • paper_url: http://arxiv.org/abs/2309.08355
  • repo_url: None
  • paper_authors: Yiming Li, Xiangdong Wang, Hong Liu, Rui Tao, Long Yan, Kazushige Ouchi
  • for: 这篇论文旨在提升半监督声音事件检测（SED）的性能。
  • methods: 论文采用局部与全局一致性（LGC）正则化方案，包括音频 CutMix 和专门设计的对比损失，在标签和特征两个层面增强模型。
  • results: 实验结果表明，在与基线系统相同的设置下，LGC 大幅超越相应的对比方法，并且可以与现有方法结合以获得进一步提升。
    Abstract Learning meaningful frame-wise features on a partially labeled dataset is crucial to semi-supervised sound event detection. Prior works either maintain consistency on frame-level predictions or seek feature-level similarity among neighboring frames, which cannot exploit the potential of unlabeled data. In this work, we design a Local and Global Consistency (LGC) regularization scheme to enhance the model on both label- and feature-level. The audio CutMix is introduced to change the contextual information of clips. Then, the local consistency is adopted to encourage the model to leverage local features for frame-level predictions, and the global consistency is applied to force features to align with global prototypes through a specially designed contrastive loss. Experiments on the DESED dataset indicate the superiority of LGC, surpassing its respective competitors largely with the same settings as the baseline system. Besides, combining LGC with existing methods can obtain further improvements. The code will be released soon.
    摘要 在部分标注的数据集上学习有意义的帧级特征，对半监督声音事件检测至关重要。先前的工作要么在帧级预测上保持一致性，要么在相邻帧之间寻求特征级相似性，都无法充分利用无标注数据的潜力。在本工作中，我们设计了局部与全局一致性（LGC）正则化方案，在标签和特征两个层面增强模型。我们引入音频 CutMix 来改变片段的上下文信息；随后采用局部一致性，鼓励模型利用局部特征进行帧级预测，并通过专门设计的对比损失施加全局一致性，使特征与全局原型对齐。在 DESED 数据集上的实验表明了 LGC 的优越性，在与基线系统相同的设置下大幅超越相应的对比方法。此外，将 LGC 与现有方法结合可以获得进一步提升。代码即将发布。
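
As a rough illustration of the audio CutMix operation mentioned above, the sketch below swaps a contiguous time span between two clips and their frame-level labels. The span fractions and tensor layouts are illustrative assumptions, not the paper's settings.

```python
import torch

def audio_cutmix(feats_a, labels_a, feats_b, labels_b, min_frac=0.2, max_frac=0.5):
    """Swap a contiguous time span between two clips and their frame labels.

    feats_*:  (T, F) frame-level features (e.g., log-mel spectrograms)
    labels_*: (T, C) frame-level event labels
    An illustrative sketch of the audio-CutMix idea, not the paper's code.
    """
    T = feats_a.size(0)
    span = int(T * torch.empty(1).uniform_(min_frac, max_frac).item())
    start = torch.randint(0, T - span + 1, (1,)).item()

    mixed_feats = feats_a.clone()
    mixed_labels = labels_a.clone()
    mixed_feats[start:start + span] = feats_b[start:start + span]
    mixed_labels[start:start + span] = labels_b[start:start + span]
    return mixed_feats, mixed_labels
```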

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

  • paper_url: http://arxiv.org/abs/2309.08348
  • repo_url: None
  • paper_authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao
  • for: 本研究旨在通过音视频目标说话人提取（AVTSE）任务，提升后端语音识别系统在真实声学环境下的准确率。
  • methods: 本文介绍了 MISP 2023 挑战赛的任务设置和数据集，并提供了一个基线系统供参赛者使用。
  • results: 实验结果表明，AVTSE 任务在真实声学环境中极具挑战性，文中还对参赛者可能遇到的困难进行了深入分析。
    Abstract Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhance-ment challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the ac-curacy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
    摘要

Speech-dependent Modeling of Own Voice Transfer Characteristics for In-ear Microphones in Hearables

  • paper_url: http://arxiv.org/abs/2309.08294
  • repo_url: None
  • paper_authors: Mattes Ohlenbusch, Christian Rollwage, Simon Doclo
  • for: 利用联合带宽扩展、均衡和降噪算法，提升可听戴设备（hearables）中入耳式麦克风信号的质量。
  • methods: 提出基于音素识别的语音相关（speech-dependent）系统辨识模型，用于建模耳道入口与入耳式麦克风之间的自声传递特性。
  • results: 仿真结果表明，与语音无关模型相比，所提出的语音相关模型能更准确地模拟入耳式录音，并考察了不同建模方式对不同说话人的泛化能力。
    Abstract Many hearables contain an in-ear microphone, which may be used to capture the own voice of its user in noisy environments. Since the in-ear microphone mostly records body-conducted speech due to ear canal occlusion, it suffers from band-limitation effects while only capturing a limited amount of external noise. To enhance the quality of the in-ear microphone signal using algorithms aiming at joint bandwidth extension, equalization, and noise reduction, it is desirable to have an accurate model of the own voice transfer characteristics between the entrance of the ear canal and the in-ear microphone. Such a model can be used, e.g., to simulate a large amount of in-ear recordings to train supervised learning-based algorithms. Since previous research on ear canal occlusion suggests that own voice transfer characteristics depend on speech content, in this contribution we propose a speech-dependent system identification model based on phoneme recognition. We assess the accuracy of simulating own voice speech by speech-dependent and speech-independent modeling and investigate how well modeling approaches are able to generalize to different talkers. Simulation results show that using the proposed speech-dependent model is preferable for simulating in-ear recordings compared to using a speech-independent model.
    摘要
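
One simple way to realize a speech-dependent transfer model of the kind described above is to estimate a separate relative transfer function per recognized phoneme. The least-squares sketch below is an illustrative assumption, not the paper's exact system-identification model.

```python
import numpy as np

def per_phoneme_transfer(stft_ref, stft_inear, phoneme_ids, n_phonemes):
    """Estimate a speech-dependent (per-phoneme) own-voice transfer function.

    stft_ref, stft_inear: (frames, bins) STFTs of the ear-canal-entrance
    reference and the in-ear microphone; phoneme_ids: (frames,) frame-level
    phoneme labels from a recognizer. A least-squares sketch only.
    """
    bins = stft_ref.shape[1]
    H = np.zeros((n_phonemes, bins), dtype=complex)
    for p in range(n_phonemes):
        idx = phoneme_ids == p
        if idx.sum() == 0:
            continue                     # phoneme not present in this recording
        X, Y = stft_ref[idx], stft_inear[idx]
        # Per-bin Wiener-style estimate: H = E[Y X*] / E[|X|^2]
        H[p] = (Y * X.conj()).sum(0) / (np.abs(X) ** 2).sum(0).clip(1e-8)
    return H
```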

  • paper_url: http://arxiv.org/abs/2309.08290
  • repo_url: https://github.com/xingyuaudio/HRTF-SCNN
  • paper_authors: Xingyu Chen, Fei Ma, Yile Zhang, Amy Bastine, Prasanga N. Samarasinghe
  • for: 这篇论文旨在提出一种基于深度学习的高分辨率头相关传递函数（HRTF）插值方法，以便在虚拟现实应用中实现高保真的空间声场重现。
  • methods: 该方法基于球面卷积神经网络，通过球谐函数（SHs）对 HRTF 进行分解与重建来实现卷积过程。球谐函数是定义在球面上的正交函数集，使卷积层能够有效捕捉 HRTF 的空间声学特征。
  • results: 仿真结果表明，所提方法能够从稀疏测量中准确插值 HRTF，优于 SH 方法和其他基于学习的方法。
    Abstract Head-related transfer functions (HRTFs) are crucial for spatial soundfield reproduction in virtual reality applications. However, obtaining personalized, high-resolution HRTFs is a time-consuming and costly task. Recently, deep learning-based methods showed promise in interpolating high-resolution HRTFs from sparse measurements. Some of these methods treat HRTF interpolation as an image super-resolution task, which neglects spatial acoustic features. This paper proposes a spherical convolutional neural network method for HRTF interpolation. The proposed method realizes the convolution process by decomposing and reconstructing HRTF through the Spherical Harmonics (SHs). The SHs, an orthogonal function set defined on a sphere, allow the convolution layers to effectively capture the spatial features of HRTFs, which are sampled on a sphere. Simulation results demonstrate the effectiveness of the proposed method in achieving accurate interpolation from sparse measurements, outperforming the SH method and learning-based methods.
    摘要 头相关传递函数（HRTF）是虚拟现实应用中空间声场重现的关键。然而，获取个性化、高分辨率的 HRTF 既耗时又昂贵。最近，基于深度学习的方法在从稀疏测量插值高分辨率 HRTF 方面展现出潜力，其中一些方法将 HRTF 插值视为图像超分辨率任务，忽略了空间声学特征。本文提出一种用于 HRTF 插值的球面卷积神经网络方法，通过球谐函数（SHs）对 HRTF 进行分解与重建来实现卷积过程。球谐函数是定义在球面上的正交函数集，使卷积层能够有效捕捉在球面上采样的 HRTF 的空间特征。仿真结果表明，所提方法能够从稀疏测量中实现准确插值，优于 SH 方法及其他基于学习的方法。
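
For context, the classical spherical-harmonic (SH) baseline that the paper compares against can be written as a regularized least-squares fit of SH coefficients followed by re-synthesis on a dense grid, as sketched below. The SH order and regularization weight are assumptions; this is the baseline, not the proposed spherical CNN.

```python
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azi, col):
    """Complex spherical-harmonic basis at P directions on the sphere.
    azi: azimuth in [0, 2*pi), col: colatitude in [0, pi]; both shape (P,)."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azi, col))
    return np.stack(cols, axis=1)          # (P, (order+1)^2)

def sh_interpolate(hrtf_sparse, azi_s, col_s, azi_dense, col_dense, order=8, lam=1e-3):
    """SH-baseline interpolation of HRTFs from sparse to dense directions.
    hrtf_sparse: (P_sparse, bins) complex HRTFs at the measured directions."""
    Y_s = sh_basis(order, azi_s, col_s)
    Y_d = sh_basis(order, azi_dense, col_dense)
    # Regularized least-squares SH coefficients, then re-synthesis on the dense grid.
    coeffs = np.linalg.solve(Y_s.conj().T @ Y_s + lam * np.eye(Y_s.shape[1]),
                             Y_s.conj().T @ hrtf_sparse)
    return Y_d @ coeffs
```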

One-Class Knowledge Distillation for Spoofing Speech Detection

  • paper_url: http://arxiv.org/abs/2309.08285
  • repo_url: None
  • paper_authors: Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, Pengyuan Zhang
  • for: 本研究旨在解决由未见算法生成的欺骗语音的检测问题；传统检测系统遵循二分类范式，难以泛化到未知的欺骗语音。
  • methods: 本研究提出一种教师-学生框架，由教师模型为一类（one-class）模型的训练提供指导，从而提升欺骗语音检测的泛化能力。
  • results: 实验结果表明，所提出的一类知识蒸馏方法在 ASVspoof 21DF 和 InTheWild 数据集上优于其他现有方法，展现出更强的泛化能力。
    Abstract The detection of spoofing speech generated by unseen algorithms remains an unresolved challenge. One reason for the lack of generalization ability is traditional detecting systems follow the binary classification paradigm, which inherently assumes the possession of prior knowledge of spoofing speech. One-class methods attempt to learn the distribution of bonafide speech and are inherently suited to the task where spoofing speech exhibits significant differences. However, training a one-class system using only bonafide speech is challenging. In this paper, we introduce a teacher-student framework to provide guidance for the training of a one-class model. The proposed one-class knowledge distillation method outperforms other state-of-the-art methods on the ASVspoof 21DF dataset and InTheWild dataset, which demonstrates its superior generalization ability.
    摘要 检测由未见算法生成的欺骗语音仍然是一个未解决的挑战。泛化能力不足的原因之一在于传统检测系统遵循二分类范式，其本质上假设已经掌握欺骗语音的先验知识。一类（one-class）方法试图学习真实语音的分布，天然适合欺骗语音与真实语音差异显著的情形；然而，仅使用真实语音训练一类系统具有挑战性。本文提出一种教师-学生框架，为一类模型的训练提供指导。所提出的一类知识蒸馏方法在 ASVspoof 21DF 数据集和 InTheWild 数据集上优于其他最新方法，展现出更强的泛化能力。
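
The sketch below illustrates one plausible form of a one-class distillation objective consistent with the description above: the student, trained only on bonafide speech, is pulled toward a frozen teacher's embeddings and toward a single bonafide center, and test-time spoofing scores are distances to that center. The specific loss terms and weighting are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def one_class_kd_loss(student_emb, teacher_emb, center, alpha=1.0):
    """One-class knowledge-distillation objective (illustrative).

    The distillation term pulls the student toward the frozen teacher's
    embeddings of bonafide speech; the compactness term pulls all bonafide
    embeddings toward a single center.
    """
    distill = F.mse_loss(student_emb, teacher_emb.detach())
    compact = ((student_emb - center) ** 2).sum(dim=-1).mean()
    return distill + alpha * compact

def spoof_score(student_emb, center):
    # Larger distance from the bonafide center -> more likely spoofed.
    return ((student_emb - center) ** 2).sum(dim=-1)
```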

Improving Short Utterance Anti-Spoofing with AASIST2

  • paper_url: http://arxiv.org/abs/2309.08279
  • repo_url: None
  • paper_authors: Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang
  • for: 语音反欺骗（anti-spoofing）。
  • methods: 使用 wav2vec 2.0 和集成谱-时图注意力网络（AASIST），将其残差块改为 Res2Net 块，并在训练中应用动态块大小（DCS）和自适应大间隔微调（ALMFT）策略。
  • results: 提升短语音评估性能，同时在不同数据集上保持常规评估的性能。
    Abstract The wav2vec 2.0 and integrated spectro-temporal graph attention network (AASIST) based countermeasure achieves great performance in speech anti-spoofing. However, current spoof speech detection systems have fixed training and evaluation durations, while the performance degrades significantly during short utterance evaluation. To solve this problem, AASIST can be improved to AASIST2 by modifying the residual blocks to Res2Net blocks. The modified Res2Net blocks can extract multi-scale features and improve the detection performance for speech of different durations, thus improving the short utterance evaluation performance. On the other hand, adaptive large margin fine-tuning (ALMFT) has achieved performance improvement in short utterance speaker verification. Therefore, we apply Dynamic Chunk Size (DCS) and ALMFT training strategies in speech anti-spoofing to further improve the performance of short utterance evaluation. Experiments demonstrate that the proposed AASIST2 improves the performance of short utterance evaluation while maintaining the performance of regular evaluation on different datasets.
    摘要 基于 wav2vec 2.0 和集成谱-时图注意力网络（AASIST）的对策在语音反欺骗中表现出色。然而，现有的欺骗语音检测系统的训练和评估时长固定，在短语音评估时性能显著下降。为解决这一问题，可将 AASIST 中的残差块替换为 Res2Net 块，改进为 AASIST2。修改后的 Res2Net 块能够提取多尺度特征，提升对不同时长语音的检测性能，从而改善短语音评估表现。另一方面，自适应大间隔微调（ALMFT）已在短语音说话人验证中取得性能提升。因此，我们在语音反欺骗中应用动态块大小（DCS）和 ALMFT 训练策略，以进一步提升短语音评估性能。实验表明，所提出的 AASIST2 在不同数据集上提升了短语音评估性能，同时保持了常规评估的性能。
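
The Res2Net modification referred to above replaces a plain residual block with a split-and-hierarchical-convolution structure that mixes several receptive-field sizes in one block. The minimal PyTorch sketch below shows that structure; channel counts, activation, and normalization choices are illustrative, not AASIST2's exact configuration.

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Minimal Res2Net-style block: channels are split into `scale` groups and
    filtered hierarchically, so one block mixes several receptive-field sizes."""
    def __init__(self, channels, scale=4, kernel_size=3):
        super().__init__()
        assert channels % scale == 0
        self.scale, self.width = scale, channels // scale
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, kernel_size, padding=kernel_size // 2)
            for _ in range(scale - 1))

    def forward(self, x):
        chunks = torch.split(x, self.width, dim=1)
        out, prev = [chunks[0]], None        # the first split passes through unchanged
        for i, conv in enumerate(self.convs):
            inp = chunks[i + 1] if prev is None else chunks[i + 1] + prev
            prev = torch.relu(conv(inp))     # each split sees the previous split's output
            out.append(prev)
        return torch.cat(out, dim=1) + x     # residual connection
```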

Improving Voice Conversion for Dissimilar Speakers Using Perceptual Losses

  • paper_url: http://arxiv.org/abs/2309.08263
  • repo_url: None
  • paper_authors: Suhita Ghosh, Yamini Sinha, Ingo Siegert, Sebastian Stober
  • for: 保护用户隐私和数据安全
  • methods: 使用语音转换技术实现语音数据匿名化
  • results: 成功隐藏了语音数据的源说话人身份，同时保留了语言内容。
    Abstract The rising trend of using voice as a means of interacting with smart devices has sparked worries over the protection of users' privacy and data security. These concerns have become more pressing, especially after the European Union's adoption of the General Data Protection Regulation (GDPR). The information contained in an utterance encompasses critical personal details about the speaker, such as their age, gender, socio-cultural origins and more. If there is a security breach and the data is compromised, attackers may utilise the speech data to circumvent the speaker verification systems or imitate authorised users. Therefore, it is pertinent to anonymise the speech data before being shared across devices, such that the source speaker of the utterance cannot be traced. Voice conversion (VC) can be used to achieve speech anonymisation, which involves altering the speaker's characteristics while preserving the linguistic content.
    摘要 将语音作为与智能设备交互方式的趋势日益增长，引发了对用户隐私和数据安全保护的担忧。这些担忧在欧盟通过《通用数据保护条例》（GDPR）之后变得更加紧迫。一段语音中包含说话人的关键个人信息，如年龄、性别、社会文化背景等。一旦发生安全事件、数据被泄露，攻击者可能利用语音数据绕过说话人验证系统或冒充已授权用户。因此，在设备间共享语音数据之前，需要对其进行匿名化处理，使语音的源说话人无法被追溯。语音转换（VC）可用于实现语音匿名化，即在保留语言内容的同时改变说话人特征。

TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification

  • paper_url: http://arxiv.org/abs/2309.08200
  • repo_url: None
  • paper_authors: Yiqiang Cai, Peihong Zhang, Shengchen Li
  • for: 这篇论文关注使用卷积神经网络（CNN）开发高效的声学场景分类系统，以获得更好的性能与效率。
  • methods: 提出的 TF-SepNet 架构将特征处理沿时间和频率两个维度分离，两条路径得到的特征按通道合并后直接送入分类器。相比传统的二维（2D）卷积核，TF-SepNet 使用一维（1D）卷积核，以降低计算成本。
  • results: 实验结果显示，TF-SepNet 在 TAU Urban Acoustic Scene 2022 Mobile development 数据集上表现出色，优于使用连续卷积核的同类最新方法。进一步分析发现，分离的卷积核带来了更大的有效感受野（ERF），使模型能更好地捕捉时-频特征。
    Abstract Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates the feature processing along the time and frequency dimensions. Features resulted from the separate paths are then merged by channels and directly forwarded to the classifier. Instead of the conventional two dimensional (2D) kernel, TF-SepNet incorporates one dimensional (1D) kernels to reduce the computational costs. Experiments have been conducted using the TAU Urban Acoustic Scene 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-arts that use consecutive kernels. A further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.
    摘要 近期研究着力于使用卷积神经网络（CNN）开发高效的声学场景分类（ASC）系统，这类网络通常由连续堆叠的卷积核构成。本文指出，在 ASC 任务中使用分离的卷积核是一种更强大且高效的设计方式。受音频信号时-频特性的启发，我们提出 TF-SepNet，一种将特征处理沿时间维度和频率维度分离的 CNN 架构。两条路径得到的特征按通道合并后直接送入分类器。TF-SepNet 采用一维（1D）卷积核替代传统的二维（2D）卷积核，以降低计算成本。我们在 TAU Urban Acoustic Scene 2022 Mobile development 数据集上进行了实验，结果表明 TF-SepNet 优于使用连续卷积核的同类最新方法。进一步的分析表明，分离的卷积核带来了更大的有效感受野（ERF），使 TF-SepNet 能够捕捉更多的时-频特征。
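
A minimal sketch of the separate-kernel idea follows: one path applies a 1-D kernel along time and another along frequency, and the two paths are merged channel-wise. The layer sizes, normalization, and activation are assumptions for illustration, not TF-SepNet's published configuration.

```python
import torch
import torch.nn as nn

class TFSeparateBlock(nn.Module):
    """Parallel 1-D processing along time and frequency, merged by channels."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        half = out_ch // 2
        # (1 x k) kernel slides along the time axis, (k x 1) along frequency.
        self.time_path = nn.Sequential(
            nn.Conv2d(in_ch, half, kernel_size=(1, k), padding=(0, k // 2)),
            nn.BatchNorm2d(half), nn.ReLU())
        self.freq_path = nn.Sequential(
            nn.Conv2d(in_ch, half, kernel_size=(k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(half), nn.ReLU())

    def forward(self, x):                     # x: (batch, in_ch, freq, time)
        return torch.cat([self.time_path(x), self.freq_path(x)], dim=1)
```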

Controllable Residual Speaker Representation for Voice Conversion

  • paper_url: http://arxiv.org/abs/2309.08166
  • repo_url: None
  • paper_authors: Le Xu, Jiangyan Yi, Jianhua Tao, Tao Wang, Yong Ren, Rongxiu Zhong
  • for: 提升语音转换（voice conversion）的质量与鲁棒性。
  • methods: 使用多层残差近似 token 来增强对未见说话人的鲁棒性，并实现对音色的有效控制。
  • results: 在主观与客观评估中均优于基线，表现出更高的性能与鲁棒性。
    Abstract Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. The introduction of multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in both subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
    摘要 近来，语音转换取得了显著进展，已能实现高质量的转换效果。然而，该领域仍存在两个关键挑战：其一，现有语音转换方法在遇到未见说话人时鲁棒性有限；其二，对音色表示的控制能力有限。为应对这些挑战，本文提出一种新方法，利用多层残差近似的 token 来增强处理未见说话人时的鲁棒性，称为残差说话人模块。多层近似的引入有助于将音色信息与其他信息分离，从而在语音转换中实现对音色的有效控制。所提方法在主观与客观评估中均优于基线，表现出更优的性能和更强的鲁棒性。我们的演示页面已公开。
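
The sketch below gives one plausible reading of "tokens of multi-layer residual approximations": each layer quantizes the residual left by previous layers, so the token sum approximates the speaker embedding coarse-to-fine. Codebook sizes, nearest-neighbour selection, and the absence of a straight-through estimator are simplifying assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class ResidualSpeakerTokens(nn.Module):
    """Multi-layer residual approximation of a speaker embedding (illustrative)."""
    def __init__(self, dim=256, n_layers=4, codebook_size=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_layers))

    def forward(self, spk_emb):               # spk_emb: (batch, dim)
        residual, approx, ids = spk_emb, 0.0, []
        for cb in self.codebooks:
            # Pick the nearest code to the current residual.
            d = torch.cdist(residual, cb.weight)          # (batch, codebook_size)
            idx = d.argmin(dim=-1)
            code = cb(idx)
            approx = approx + code
            residual = residual - code
            ids.append(idx)
        return approx, ids    # coarse-to-fine tokens allow timbre control
```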

RVAE-EM: Generative speech dereverberation based on recurrent variational auto-encoder and convolutive transfer function

  • paper_url: http://arxiv.org/abs/2309.08157
  • repo_url: https://github.com/audio-westlakeu/rvae-em
  • paper_authors: Pengyu Wang, Xiaofei Li
  • for: 在室内场景中，混响是降低语音感知质量和可懂度的关键因素。本文提出一种生成式语音去混响方法。
  • methods: 该方法基于概率模型，利用循环变分自编码器（RVAE）网络和卷积传递函数（CTF）近似。与大多数先前方法不同，RVAE 的输出作为纯净语音的先验；目标是纯净语音的最大后验（MAP）估计，通过期望最大化（EM）算法迭代求解。
  • results: 单通道语音去混响实验表明，所提出的生成式方法显著优于先进的判别式网络。
    Abstract In indoor scenes, reverberation is a crucial factor in degrading the perceived quality and intelligibility of speech. In this work, we propose a generative dereverberation method. Our approach is based on a probabilistic model utilizing a recurrent variational auto-encoder (RVAE) network and the convolutive transfer function (CTF) approximation. Different from most previous approaches, the output of our RVAE serves as the prior of the clean speech. And our target is the maximum a posteriori (MAP) estimation of clean speech, which is achieved iteratively through the expectation maximization (EM) algorithm. The proposed method integrates the capabilities of network-based speech prior modelling and CTF-based observation modelling. Experiments on single-channel speech dereverberation show that the proposed generative method noticeably outperforms the advanced discriminative networks.
    摘要 在室内场景中，混响是降低语音感知质量和可懂度的关键因素。本文提出一种生成式去混响方法。该方法基于概率模型，利用循环变分自编码器（RVAE）网络和卷积传递函数（CTF）近似。与大多数先前方法不同，我们的 RVAE 输出作为纯净语音的先验，目标是纯净语音的最大后验（MAP）估计，并通过期望最大化（EM）算法迭代实现。该方法融合了基于网络的语音先验建模与基于 CTF 的观测建模的能力。单通道语音去混响实验表明，所提出的生成式方法显著优于先进的判别式网络。
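
The convolutive transfer function (CTF) observation model mentioned above treats each frequency band of the reverberant STFT as a short convolution, across frames, of the clean STFT with a per-band filter. The NumPy sketch below implements only that forward model; the CTF length is an assumed value, and the EM updates that alternate between this model and the RVAE speech prior are not shown.

```python
import numpy as np

def ctf_forward(clean_stft, ctf):
    """CTF observation model: per-band convolution across frames.

    clean_stft: (T, F) complex STFT of clean speech
    ctf:        (L, F) complex per-band filter taps (L is the CTF length)
    Returns the modeled reverberant STFT of shape (T, F).
    """
    T, _ = clean_stft.shape
    L = ctf.shape[0]
    rev = np.zeros_like(clean_stft)
    for l in range(L):
        # Frame t of the reverberant STFT accumulates ctf[l] * clean[t - l].
        rev[l:] += ctf[l] * clean_stft[:T - l]
    return rev
```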

Fine-tune the pretrained ATST model for sound event detection

  • paper_url: http://arxiv.org/abs/2309.08153
  • repo_url: https://github.com/Audio-WestlakeU/ATST-SED
  • paper_authors: Nian Shao, Xian Li, Xiaofei Li
  • for: 本研究旨在缓解声音事件检测（SED）中的数据不足问题。
  • methods: 研究对大规模预训练的自监督学习（SelfSL）模型进行微调，使其为 SED 产生更具判别性的特征。
  • results: 实验表明，所提出的微调方法克服了微调大型预训练网络时的过拟合问题，并在 DCASE 挑战任务 4 数据集上取得了 0.587/0.812 的 PSDS1/PSDS2 新 SOTA 成绩。
    Abstract Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages the large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in the challenge baseline system and most of the challenge submissions, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method of the pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was especially designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performances on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that, the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
    摘要 声音事件检测（SED）常常面临数据不足的问题。DCASE2023 挑战任务 4 中最新的基线系统利用大规模预训练的自监督学习（SelfSL）模型来缓解这一限制，预训练模型有助于为 SED 产生更具判别性的特征。然而，在该基线系统和大多数参赛提交中，预训练模型都被当作冻结的特征提取器使用，对预训练模型微调的研究很少。本文研究了用于 SED 的预训练模型微调方法。我们首先将新提出的 SelfSL 模型 ATST-Frame 引入 SED 系统；ATST-Frame 专为学习音频信号的帧级表示而设计，并在一系列下游任务上取得了最先进（SOTA）的性能。随后，我们提出一种同时利用（域内）无标注与有标注 SED 数据对 ATST-Frame 进行微调的方法。实验表明，所提方法克服了微调大型预训练网络时的过拟合问题，我们的 SED 系统在 DCASE 挑战任务 4 数据集上取得了 0.587/0.812 的 PSDS1/PSDS2 新 SOTA 成绩。

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

  • paper_url: http://arxiv.org/abs/2309.08131
  • repo_url: None
  • paper_authors: Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li
  • for: 这篇论文针对流式多说话人自动语音识别（ASR）的挑战，对 token 级序列化输出训练（t-SOT）方法进行改进，使其具备纯文本域自适应能力。
  • methods: t-SOT 通过在单一 token 流中穿插 $\langle \text{cc}\rangle$ 符号来表示多说话人转录，但朴素的神经转写器（neural transducer）结构限制了其纯文本自适应的应用。为此，我们提出一种结合分解式神经转写器（FNT）思想的新 t-SOT 模型结构：将语言模型（LM）与转写器的 predictor 分离，并通过维护多个隐藏状态以及在 LM 中对 $\langle \text{cc}\rangle$ 符号的特殊处理，应对其带来的不自然的 token 顺序。
  • results: 所提出的 t-SOT FNT 模型取得了与原始 t-SOT 模型相当的性能，同时保留了通过纯文本自适应在单说话人和多说话人数据集上降低词错误率（WER）的能力。
    Abstract Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single and multi-talker datasets through text-only adaptation.
    摘要
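
To make the $\langle \text{cc}\rangle$ serialization concrete, the sketch below builds a t-SOT-style target sequence for a two-speaker mixture by ordering tokens by emission time and inserting a channel-change token whenever the active speaker switches; the input tuple format is an assumption for illustration.

```python
def serialize_t_sot(utterances, cc_token="<cc>"):
    """Build a t-SOT-style target token stream from overlapping transcripts.

    utterances: list of (token, emission_time, speaker) tuples (assumed format).
    Tokens from all speakers are merged in emission-time order, and `cc_token`
    marks every switch of the active 'virtual output channel'.
    """
    tokens = sorted(utterances, key=lambda t: t[1])     # order by emission time
    out, prev_spk = [], None
    for tok, _, spk in tokens:
        if prev_spk is not None and spk != prev_spk:
            out.append(cc_token)
        out.append(tok)
        prev_spk = spk
    return out

# Example: overlapping "hi there" (speaker A) and "ok" (speaker B)
print(serialize_t_sot([("hi", 0.1, "A"), ("ok", 0.2, "B"), ("there", 0.3, "A")]))
# -> ['hi', '<cc>', 'ok', '<cc>', 'there']
```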

Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

  • paper_url: http://arxiv.org/abs/2309.08127
  • repo_url: None
  • paper_authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
  • for: 在保证合成语音质量的前提下，从文本转语音（TTS）语料库中提取轻量级子集。
  • methods: 针对由 audiobooks、YouTube 等海量来源构建的大规模 TTS 语料库，使用衡量子集覆盖范围广度的多样性度量，选择核心子集（core-set）。
  • results: 在不同语言和语料规模下，所提方法显著优于基于音素平衡的基线数据选择方法。
    Abstract This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities of TTS systems, they often prioritize collecting vast amounts of data without considering practical constraints like storage capacity and computation time in training, which limits the available data quantity. Consequently, the need arises to efficiently collect data within these volume constraints. To address this, we propose a method for selecting the core subset~(known as \textit{core-set}) from a TTS corpus on the basis of a \textit{diversity metric}, which measures the degree to which a subset encompasses a wide range. Experimental results demonstrate that our proposed method performs significantly better than the baseline phoneme-balanced data selection across language and corpus size.
    摘要 本文提出一种在保证合成语音质量的前提下，从文本转语音（TTS）语料库中提取轻量级子集的方法。近年来，已有方法通过从 audiobooks 和 YouTube 等海量来源收集多样化数据来构建大规模 TTS 语料库。尽管这些方法因增强 TTS 系统的表现力而备受关注，但它们往往只追求收集尽可能多的数据，而忽略了训练中存储容量和计算时间等实际约束，限制了可用的数据量。因此，需要在这些容量约束下高效地收集数据。为此，我们提出一种基于多样性度量（衡量子集覆盖范围的广度）从 TTS 语料库中选择核心子集（core-set）的方法。实验结果表明，在不同语言和语料规模下，所提方法显著优于基于音素平衡的基线数据选择方法。
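
As one concrete instance of diversity-driven selection, the sketch below uses greedy farthest-point sampling over per-utterance feature vectors; the actual diversity metric and the linguistic/acoustic features used in the paper may differ.

```python
import numpy as np

def greedy_diverse_coreset(feats, k, seed=0):
    """Greedy farthest-point selection of k utterances.

    feats: (N, D) per-utterance feature vectors (e.g., linguistic + acoustic).
    Repeatedly picks the utterance farthest from the already-selected set, so
    the core-set spans a wide range of the feature space.
    """
    n = feats.shape[0]
    chosen = [int(np.random.default_rng(seed).integers(n))]
    d_min = np.linalg.norm(feats - feats[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(d_min.argmax())            # farthest remaining point
        chosen.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(feats - feats[nxt], axis=1))
    return chosen
```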

Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting

  • paper_url: http://arxiv.org/abs/2309.08108
  • repo_url: None
  • paper_authors: Tiantian Feng, Shrikanth Narayanan
  • for: 本研究旨在探讨利用基础模型将语音情感识别（SER）流程自动化，涵盖转写、标注到数据增强。
  • methods: 本研究使用基础模型辅助自动化 SER，包括自动生成转写文本和情感标注。
  • results: 研究发现，融合多个基础模型的输出可以提升情感标注质量，并且可以通过为未标注语音样本生成标注来扩充现有语音情感数据集。
    Abstract Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples.
    摘要 基于深度学习模型的语音情感识别（SER）正在取得显著进展。然而，训练 SER 系统仍然具有挑战性，需要耗费大量时间和昂贵的资源。与许多其他机器学习任务类似，获取 SER 数据集需要大量的数据标注工作，包括转写和情感标注，这些标注流程给扩展传统 SER 系统带来了挑战。基础模型的最新进展产生了巨大影响，催生了 ChatGPT 等应用；这些模型增强了人机交互，也为 SER 等领域的数据收集带来了新的可能。在本研究中，我们探索利用基础模型来辅助 SER 的自动化，涵盖转写、标注到数据增强。研究表明，这些模型生成的转写能够提升仅依赖语音数据的 SER 系统的性能。此外，我们注意到从转写文本中标注情感仍具有挑战性，但融合多个大语言模型（LLM）的输出可以提升标注质量。最后，我们的结果表明，通过为未标注的语音样本生成标注来扩充现有语音情感数据集是可行的。
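
A simple way to combine emotion labels proposed by several LLMs, in the spirit of the abstract's observation that fusing multiple model outputs improves annotation quality, is majority voting with an agreement score, as sketched below; the paper's actual fusion scheme may differ.

```python
from collections import Counter

def fuse_llm_annotations(annotations):
    """Fuse emotion labels from several LLMs for one utterance by majority vote.

    annotations: list of labels, e.g. ["happy", "happy", "neutral"].
    Returns the winning label and the agreement ratio, which downstream code
    can threshold to keep only confidently annotated samples.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    return label, agreement

print(fuse_llm_annotations(["happy", "happy", "neutral"]))   # -> ('happy', 0.666...)
```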

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

  • paper_url: http://arxiv.org/abs/2309.08105
  • repo_url: https://github.com/k2-fsa/libriheavy
  • paper_authors: Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey
  • for: 本研究构建了名为 Libriheavy 的大规模语音识别语料库，包含 50,000 小时来自 LibriVox 的英语朗读语音，是目前已知最大的带监督信息的开源语音数据集。
  • methods: 作者提出了一条通用且高效的数据集构建管线，用于将 Librilight 中的音频与其对应文本进行定位、对齐和切分，并保留标点、大小写和文本上下文等信息。
  • results: 作者基于流行的 CTC-Attention 和转写器（transducer）模型构建了基线系统，并开源了数据集创建管线，该管线也可用于其他音频对齐任务。
    Abstract In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casing and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align and segment the audios in previously published Librilight to its corresponding texts. The same as Librilight, Libriheavy also has three training subsets small, medium, large of the sizes 500h, 5000h, 50000h respectively. We also extract the dev and test evaluation sets from the aligned audios and guarantee there is no overlapping speakers and books in training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creatation pipeline which can also be used to other audio alignment tasks.
    摘要 在这篇论文中，我们介绍了 Libriheavy，一个大规模的语音识别语料库，包含 50,000 小时来自 LibriVox 的英语朗读语音。据我们所知，Libriheavy 是目前最大的带监督信息的免费语音语料库。与其他仅提供规范化转写的开源数据集不同，Libriheavy 包含更丰富的信息，例如标点、大小写和文本上下文，这为系统构建带来了更大的灵活性。具体而言，我们提出了一条通用且高效的管线，用于将先前发布的 Librilight 中的音频与其对应文本进行定位、对齐和切分。与 Librilight 相同，Libriheavy 也包含 small、medium、large 三个训练子集，规模分别为 500h、5000h 和 50000h。我们还从对齐后的音频中抽取了 dev 和 test 评估集，并保证其说话人和书籍与训练集不重叠。基线系统基于流行的 CTC-Attention 和转写器（transducer）模型构建。此外，我们开源了数据集创建管线，它也可用于其他音频对齐任务。

SSL-Net: A Synergistic Spectral and Learning-based Network for Efficient Bird Sound Classification

  • paper_url: http://arxiv.org/abs/2309.08072
  • repo_url: None
  • paper_authors: Yiyuan Yang, Kaichen Zhou, Niki Trigoni, Andrew Markham
  • for: 鸟叫声分类在监测鸟种分布和数量方面发挥核心作用，对生态学、栖息地保护和科学研究都十分重要。
  • methods: 我们提出了一种名为 SSL-Net 的高效通用框架，将谱特征与学习特征相结合，用于区分不同的鸟叫声。
  • results: 在一个标准的野外采集鸟叫声数据集上取得了令人鼓舞的实验结果，表明即使样本数量有限，我们的方法也能高效提取特征并在鸟叫声分类中取得较高的性能。此外，我们还提出了三种特征融合策略，并通过定量分析帮助工程师和研究人员进行选择。
    Abstract Efficient and accurate bird sound classification is of important for ecology, habitat protection and scientific research, as it plays a central role in monitoring the distribution and abundance of species. However, prevailing methods typically demand extensively labeled audio datasets and have highly customized frameworks, imposing substantial computational and annotation loads. In this study, we present an efficient and general framework called SSL-Net, which combines spectral and learned features to identify different bird sounds. Encouraging empirical results gleaned from a standard field-collected bird audio dataset validate the efficacy of our method in extracting features efficiently and achieving heightened performance in bird sound classification, even when working with limited sample sizes. Furthermore, we present three feature fusion strategies, aiding engineers and researchers in their selection through quantitative analysis.
    摘要 高效而准确的鸟叫声分类对生态学、栖息地保护和科学研究都十分重要，它在监测物种分布和数量方面发挥着核心作用。然而，现有方法通常需要大量标注的音频数据集，并依赖高度定制的框架，带来了可观的计算和标注负担。在本研究中，我们提出一种高效且通用的框架 SSL-Net，它结合谱特征与学习特征来识别不同的鸟叫声。在一个标准的野外采集鸟叫声数据集上取得的实验结果验证了我们方法的有效性：即使样本数量有限，也能高效提取特征并在鸟叫声分类中取得更高的性能。此外，我们提出了三种特征融合策略，并通过定量分析帮助工程师和研究人员进行选择。
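
The sketch below shows the simplest of the fusion strategies the abstract alludes to: concatenating a fixed spectral descriptor with a learned CNN embedding before classification. The backbone interface, feature dimensions, and class count are illustrative assumptions, and the paper compares three fusion strategies rather than only this one.

```python
import torch
import torch.nn as nn

class SpectralLearnedFusion(nn.Module):
    """Concatenation-based fusion of spectral and learned features (illustrative)."""
    def __init__(self, cnn_backbone, spectral_dim=128, cnn_dim=256, n_classes=50):
        super().__init__()
        self.backbone = cnn_backbone                   # spectrogram -> (B, cnn_dim)
        self.head = nn.Linear(spectral_dim + cnn_dim, n_classes)

    def forward(self, spectrogram, spectral_feats):
        learned = self.backbone(spectrogram)           # learned embedding
        fused = torch.cat([learned, spectral_feats], dim=-1)
        return self.head(fused)                        # class logits
```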