cs.LG - 2023-08-29

Minimizing Quasi-Self-Concordant Functions by Gradient Regularization of Newton Method

  • paper_url: http://arxiv.org/abs/2308.14742
  • repo_url: None
  • paper_authors: Nikita Doikov
  • for: This work considers composite convex optimization problems in which one component is a quasi-self-concordant smooth function. This problem class naturally interpolates between classic self-concordant functions and functions with Lipschitz continuous Hessian.
  • methods: The basic Newton method with gradient regularization is used to solve the problem. In the unconstrained case, each step of the algorithm requires only a simple matrix inversion (solving a linear system). The algorithm is proven to achieve a fast global linear rate, matching the complexity bound of trust-region schemes while remaining much simpler to implement.
  • results: For quasi-self-concordant functions, simple variants of the Newton method achieve fast global linear rates on several practical problems, including logistic regression, soft maximum, and matrix scaling, without additional assumptions of strong or uniform convexity on the objective.
    Abstract We study the composite convex optimization problems with a Quasi-Self-Concordant smooth component. This problem class naturally interpolates between classic Self-Concordant functions and functions with Lipschitz continuous Hessian. Previously, the best complexity bounds for this problem class were associated with trust-region schemes and implementations of a ball-minimization oracle. In this paper, we show that for minimizing Quasi-Self-Concordant functions we can use instead the basic Newton Method with Gradient Regularization. For unconstrained minimization, it only involves a simple matrix inversion operation (solving a linear system) at each step. We prove a fast global linear rate for this algorithm, matching the complexity bound of the trust-region scheme, while our method remains especially simple to implement. Then, we introduce the Dual Newton Method, and based on it, develop the corresponding Accelerated Newton Scheme for this problem class, which further improves the complexity factor of the basic method. As a direct consequence of our results, we establish fast global linear rates of simple variants of the Newton Method applied to several practical problems, including Logistic Regression, Soft Maximum, and Matrix Scaling, without requiring additional assumptions on strong or uniform convexity for the target objective.
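A minimal NumPy sketch of a gradient-regularized Newton step, illustrated on logistic regression (one of the applications named in the abstract). The regularization schedule lambda_k proportional to the square root of the gradient norm is an assumption for illustration; the paper's exact rule may differ.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Logistic loss: mean log(1 + exp(-y * Xw))."""
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z))

def grad_hess(w, X, y):
    z = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(z))                 # sigmoid(-z)
    g = -(X * (y * s)[:, None]).mean(axis=0)    # gradient
    d = s * (1.0 - s)                           # per-sample Hessian weights
    H = (X * d[:, None]).T @ X / len(y)
    return g, H

def newton_gradient_regularization(X, y, steps=50, c=1.0):
    """Gradient-regularized Newton: solve (H + lambda_k I) step = -g,
    with lambda_k = c * sqrt(||g||)  (assumed schedule)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g, H = grad_hess(w, X, y)
        lam = c * np.sqrt(np.linalg.norm(g))
        w -= np.linalg.solve(H + lam * np.eye(len(w)), g)  # one linear system per step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=200))
w = newton_gradient_regularization(X, y)
print("final loss:", logistic_loss(w, X, y))
```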

Diversified Ensemble of Independent Sub-Networks for Robust Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.14705
  • repo_url: None
  • paper_authors: Amirhossein Vahidi, Lisa Wimmer, Hüseyin Anil Gündüz, Bernd Bischl, Eyke Hüllermeier, Mina Rezaei
  • for: The main goal of this paper is to improve the performance, uncertainty estimation, and robustness of deep learning models.
  • methods: A novel self-supervised training regime that leverages an ensemble of independent sub-networks, together with a new loss function designed to encourage diversity.
  • results: Experiments show that the method efficiently builds a highly diverse sub-model ensemble, yielding well-calibrated estimates of model uncertainty with minimal computational overhead. The approach performs well across a range of tasks, including in-distribution generalization, out-of-distribution detection, dataset corruption, and semi-supervised settings.
    Abstract Ensembling a neural network is a widely recognized approach to enhance model performance, estimate uncertainty, and improve robustness in deep supervised learning. However, deep ensembles often come with high computational costs and memory demands. In addition, the efficiency of a deep ensemble is related to diversity among the ensemble members which is challenging for large, over-parameterized deep neural networks. Moreover, ensemble learning has not yet seen such widespread adoption, and it remains a challenging endeavor for self-supervised or unsupervised representation learning. Motivated by these challenges, we present a novel self-supervised training regime that leverages an ensemble of independent sub-networks, complemented by a new loss function designed to encourage diversity. Our method efficiently builds a sub-model ensemble with high diversity, leading to well-calibrated estimates of model uncertainty, all achieved with minimal computational overhead compared to traditional deep self-supervised ensembles. To evaluate the effectiveness of our approach, we conducted extensive experiments across various tasks, including in-distribution generalization, out-of-distribution detection, dataset corruption, and semi-supervised settings. The results demonstrate that our method significantly improves prediction reliability. Our approach not only achieves excellent accuracy but also enhances calibration, surpassing baseline performance across a wide range of self-supervised architectures in computer vision, natural language processing, and genomics data.
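The abstract does not spell out the diversity loss, so the sketch below shows one plausible PyTorch instantiation: an ensemble of independent sub-networks whose embeddings are penalized for pairwise similarity. All module sizes and the loss form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubNetworkEnsemble(nn.Module):
    """Toy ensemble of independent MLP sub-networks sharing an input."""
    def __init__(self, in_dim=32, emb_dim=16, n_members=4):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
            for _ in range(n_members)
        ])

    def forward(self, x):
        return [m(x) for m in self.members]          # one embedding per member

def diversity_loss(embeddings):
    """Penalize pairwise cosine similarity between member embeddings
    (an assumed stand-in for the paper's diversity term)."""
    loss, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = F.cosine_similarity(embeddings[i], embeddings[j], dim=-1)
            loss = loss + sim.pow(2).mean()
            count += 1
    return loss / max(count, 1)

model = SubNetworkEnsemble()
x = torch.randn(8, 32)
embs = model(x)
# total loss = per-member self-supervised objective + weighted diversity penalty (weight is a guess)
loss = diversity_loss(embs)
print(float(loss))
```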

Hybrid PLS-ML Authentication Scheme for V2I Communication Networks

  • paper_url: http://arxiv.org/abs/2308.14693
  • repo_url: None
  • paper_authors: Hala Amin, Jawaher Kaldari, Nora Mohamed, Waqas Aman, Saif Al-Kuwari
  • for: This work proposes a hybrid physical layer security (PLS)-machine learning (ML) authentication scheme to support safe and effective traffic management for smart vehicles.
  • methods: A ToA-based localization mechanism that uses the position of the transmitter vehicle as a device fingerprint, combined with ML models (support vector regression and decision tree) to track the mobility of the legitimate vehicle.
  • results: Experimental results show that the scheme reduces the probabilities of false alarm and missed detection, and outperforms the baseline scheme in terms of missed detections.
    Abstract Vehicular communication networks are rapidly emerging as vehicles become smarter. However, these networks are increasingly susceptible to various attacks. The situation is exacerbated by the rise of automated vehicles, emphasizing the need for security and authentication measures to ensure safe and effective traffic management. In this paper, we propose a novel hybrid physical layer security (PLS)-machine learning (ML) authentication scheme by exploiting the position of the transmitter vehicle as a device fingerprint. We use a time-of-arrival (ToA) based localization mechanism where the ToA is estimated at roadside units (RSUs), and the coordinates of the transmitter vehicle are extracted at the base station (BS). Furthermore, to track the mobility of the moving legitimate vehicle, we use an ML model trained on several system parameters. We try two ML models for this purpose, i.e., support vector regression and decision tree. To evaluate our scheme, we conduct binary hypothesis testing on the estimated positions with the help of the ground truths provided by the ML model, which classifies the transmitter node as legitimate or malicious. Moreover, we consider the probability of false alarm and the probability of missed detection as performance metrics resulting from the binary hypothesis testing, and mean absolute error (MAE), mean square error (MSE), and coefficient of determination $\text{R}^2$ to further evaluate the ML models. We also compare our scheme with a baseline scheme that exploits the angle of arrival at RSUs for authentication. We observe that our proposed position-based mechanism outperforms the baseline scheme significantly in terms of missed detections.
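A toy NumPy sketch of the two core steps described in the abstract: estimating the transmitter position from ToA measurements at several RSUs via least squares, and a binary hypothesis test that flags the transmitter as malicious when the estimate deviates too far from the ML-predicted position. The solver, threshold, and geometry are illustrative.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def estimate_position(rsu_xy, toa, iters=20):
    """Gauss-Newton least squares on range residuals r_i = c*toa_i - ||p - rsu_i||."""
    p = np.mean(rsu_xy, axis=0)              # initialize at the RSU centroid
    ranges = C * toa
    for _ in range(iters):
        d = np.linalg.norm(rsu_xy - p, axis=1)
        J = (p - rsu_xy) / d[:, None]         # Jacobian of ||p - rsu_i|| w.r.t. p
        res = ranges - d
        p = p + np.linalg.lstsq(J, res, rcond=None)[0]
    return p

def authenticate(p_est, p_predicted, threshold=5.0):
    """Binary hypothesis test: legitimate (H0) if the position error is small."""
    return "legitimate" if np.linalg.norm(p_est - p_predicted) <= threshold else "malicious"

rsu_xy = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
true_pos = np.array([40.0, 25.0])
toa = np.linalg.norm(rsu_xy - true_pos, axis=1) / C + 1e-9 * np.random.randn(4)  # noisy ToA
p_est = estimate_position(rsu_xy, toa)
print(p_est, authenticate(p_est, p_predicted=np.array([41.0, 24.0])))
```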

cs.CV - 2023-08-26

Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2308.13505
  • repo_url: None
  • paper_authors: Jiaming Zhang, Yutao Cui, Gangshan Wu, Limin Wang
  • for: Improving video object segmentation (VOS) by addressing two limitations of existing methods: dense matching performed only after feature extraction, and the lack of holistic target understanding caused by pixel-wise matching.
  • methods: A unified model, named JointFormer, that jointly models three elements: feature, correspondence, and a compressed memory. Its core design is the Joint Block, which uses the flexibility of attention to simultaneously extract features and propagate target information to the current tokens and the compressed memory token, enabling extensive information propagation and discriminative feature learning.
  • results: On the DAVIS 2017 val/test-dev and YouTube-VOS 2018/2019 val benchmarks, the method achieves new state-of-the-art performance (89.7% and 87.6%, and 87.0% and 87.0%, respectively), outperforming existing methods by a large margin.
    Abstract Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting their features. On one hand, the decoupled modeling restricts the targets information propagation only at high-level feature space. On the other hand, the pixel-wise matching leads to a lack of holistic understanding of the targets. To overcome these issues, we propose a unified VOS framework, coined as JointFormer, for joint modeling the three elements of feature, correspondence, and a compressed memory. The core design is the Joint Block, utilizing the flexibility of attention to simultaneously extract feature and propagate the targets information to the current tokens and the compressed memory token. This scheme allows to perform extensive information propagation and discriminative feature learning. To incorporate the long-term temporal targets information, we also devise a customized online updating mechanism for the compressed memory token, which can prompt the information flow along the temporal dimension and thus improve the global modeling capability. Under the design, our method achieves a new state-of-the-art performance on DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.
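A rough PyTorch sketch of the joint-modeling idea: current-frame tokens, a compressed memory token, and reference tokens are concatenated into one sequence so a single attention call both extracts features and propagates target information. Dimensions and layer choices are assumptions, not the paper's exact Joint Block.

```python
import torch
import torch.nn as nn

class JointBlockSketch(nn.Module):
    """Self-attention over [current tokens | compressed memory token | reference tokens]."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cur_tokens, mem_token, ref_tokens):
        n_cur = cur_tokens.shape[1]
        x = torch.cat([cur_tokens, mem_token, ref_tokens], dim=1)   # joint sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]           # joint feature/propagation step
        x = x + self.mlp(self.norm2(x))
        # split back: updated current tokens and updated compressed memory token
        return x[:, :n_cur], x[:, n_cur:n_cur + 1]

block = JointBlockSketch()
cur = torch.randn(2, 1024, 256)     # current-frame tokens
mem = torch.randn(2, 1, 256)        # compressed memory token
ref = torch.randn(2, 512, 256)      # reference-frame tokens carrying target information
cur_out, mem_out = block(cur, mem, ref)
print(cur_out.shape, mem_out.shape)
```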

A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance

  • paper_url: http://arxiv.org/abs/2308.13504
  • repo_url: None
  • paper_authors: Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig
  • for: Training quantized neural networks (QNNs) to avoid overflow when low-precision accumulators are used during inference.
  • methods: A new quantization method, accumulator-aware quantization (A2Q), that constrains the L1-norm of model weights according to accumulator bit-width bounds during QNN training; the formulation also inherently promotes unstructured weight sparsity to guarantee overflow avoidance.
  • results: A2Q trains QNNs for low-precision accumulators on deep-learning-based computer vision tasks while keeping accuracy competitive with a floating-point baseline. The evaluation considers both general-purpose platforms and programmable hardware, with a primary focus on FPGAs, which can fully exploit custom accumulator bit widths. Experiments show that accumulator bit width significantly affects the resource efficiency of FPGA-based accelerators: on average across the benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts while retaining 99.2% of the floating-point model accuracy.
    Abstract We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds that we derive. Thus, in training QNNs for low-precision accumulation, A2Q also inherently promotes unstructured weight sparsity to guarantee overflow avoidance. We apply our method to deep learning-based computer vision tasks to show that A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware. However, we primarily target model deployment on FPGAs because they can be programmed to fully exploit custom accumulator bit widths. Our experimentation shows accumulator bit width significantly impacts the resource efficiency of FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts with 99.2% of the floating-point model accuracy.
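A rough PyTorch sketch of the weight-normalization-style idea: after an update, each output channel is rescaled so its L1 norm stays under a bound tied to the accumulator and input bit widths. The bound below is a standard conservative overflow condition and only approximates the bound derived in the paper; the projection is a stand-in for A2Q's training-time parameterization.

```python
import torch
import torch.nn as nn

def l1_bound(acc_bits=16, input_bits=8):
    """Conservative overflow-avoidance bound for a signed accumulator with
    unsigned input activations: |sum_i x_i * w_i| <= max_x * ||w||_1."""
    return (2 ** (acc_bits - 1) - 1) / (2 ** input_bits - 1)

@torch.no_grad()
def constrain_l1_per_channel(layer: nn.Conv2d, bound: float):
    """Rescale each output channel so its L1 norm stays within the bound
    (a projection-style stand-in for A2Q's weight-normalization formulation)."""
    w = layer.weight                                   # (out_ch, in_ch, kH, kW)
    l1 = w.abs().flatten(1).sum(dim=1)                 # per-output-channel L1 norm
    scale = (bound / l1).clamp(max=1.0)                # shrink only channels that violate it
    w.mul_(scale.view(-1, 1, 1, 1))

conv = nn.Conv2d(64, 64, kernel_size=3)
bound = l1_bound(acc_bits=16, input_bits=8)
constrain_l1_per_channel(conv, bound)
print("max per-channel L1 after projection:",
      float(conv.weight.abs().flatten(1).sum(dim=1).max()), "<= bound", bound)
```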

Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.13494
  • repo_url: https://github.com/WISION-Lab/eventful-transformer
  • paper_authors: Matthew Dutson, Yin Li, Mohit Gupta
  • for: Reducing the computational cost of vision Transformers on visual recognition tasks while preserving their accuracy.
  • methods: A method called "Eventful Transformers", which converts existing Transformers (often without any retraining) into models whose compute cost can be adaptively controlled at runtime by re-processing only the tokens that have changed significantly over time.
  • results: Evaluated on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100), the approach achieves computational savings on the order of 2-4x with only minor reductions in accuracy.
    Abstract Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
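A minimal PyTorch sketch of the token-gating idea behind Eventful Transformers: compare the current frame's tokens with cached ones, re-run an expensive module only for tokens whose change exceeds a threshold, and reuse cached outputs for the rest. The gating rule and threshold are illustrative, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Re-process only tokens that changed significantly since the last frame (inference-time cache)."""
    def __init__(self, module: nn.Module, threshold: float = 0.5):
        super().__init__()
        self.module = module
        self.threshold = threshold
        self.ref_in = None      # cached inputs from the previous frame
        self.ref_out = None     # cached outputs from the previous frame

    def forward(self, tokens):                      # tokens: (batch, n_tokens, dim)
        if self.ref_in is None:                     # first frame: compute everything
            out = self.module(tokens)
            self.ref_in, self.ref_out = tokens.detach(), out.detach()
            return out
        delta = (tokens - self.ref_in).norm(dim=-1)              # per-token change
        active = delta > self.threshold                          # boolean gate
        out = self.ref_out.clone()
        if active.any():
            out[active] = self.module(tokens[active])            # recompute changed tokens only
            self.ref_in[active] = tokens[active].detach()
            self.ref_out[active] = out[active].detach()
        return out

gate = TokenGate(nn.Linear(256, 256))
frame1 = torch.randn(1, 196, 256)
frame2 = frame1.clone()
frame2[0, :10] += 2.0                    # only 10 tokens change between frames
with torch.no_grad():
    _ = gate(frame1)                     # first frame populates the cache
    out2 = gate(frame2)                  # second frame: partial recompute
changed = int(((frame2 - frame1).norm(dim=-1) > gate.threshold).sum())
print("tokens recomputed on frame 2:", changed, "of", frame2.shape[1])
```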

Unlocking the Performance of Proximity Sensors by Utilizing Transient Histograms

  • paper_url: http://arxiv.org/abs/2308.13473
  • repo_url: None
  • paper_authors: Carter Sifferman, Yeping Wang, Mohit Gupta, Michael Gleicher
  • for: Improving the accuracy of scene geometry recovered from close-range time-of-flight (ToF) distance sensors.
  • methods: Directly using the transient histograms captured by the sensor, together with a differentiable rendering pipeline that simulates the transient imaging process and optimizes scene geometry to match the observations.
  • results: Across 3,800 measurements of eight planar surfaces from a wide range of viewpoints, the method outperforms a proprietary distance-estimate baseline by an order of magnitude in most scenarios. A simple robotics application is also demonstrated, in which the method senses the distance to and slope of a planar surface from a sensor mounted on the end effector of a robot arm.
    Abstract We provide methods which recover planar scene geometry by utilizing the transient histograms captured by a class of close-range time-of-flight (ToF) distance sensor. A transient histogram is a one dimensional temporal waveform which encodes the arrival time of photons incident on the ToF sensor. Typically, a sensor processes the transient histogram using a proprietary algorithm to produce distance estimates, which are commonly used in several robotics applications. Our methods utilize the transient histogram directly to enable recovery of planar geometry more accurately than is possible using only proprietary distance estimates, and consistent recovery of the albedo of the planar surface, which is not possible with proprietary distance estimates alone. This is accomplished via a differentiable rendering pipeline, which simulates the transient imaging process, allowing direct optimization of scene geometry to match observations. To validate our methods, we capture 3,800 measurements of eight planar surfaces from a wide range of viewpoints, and show that our method outperforms the proprietary-distance-estimate baseline by an order of magnitude in most scenarios. We demonstrate a simple robotics application which uses our method to sense the distance to and slope of a planar surface from a sensor mounted on the end effector of a robot arm.
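A simplified NumPy/SciPy sketch of the fitting idea: simulate the transient histogram a single-pixel ToF sensor would record for a candidate surface (reduced here to a depth and an albedo with a Gaussian pulse), and optimize those parameters so the simulated histogram matches the observed one. This toy forward model stands in for the paper's differentiable rendering pipeline.

```python
import numpy as np
from scipy.optimize import minimize

C = 3e8                      # speed of light (m/s)
BIN_WIDTH = 100e-12          # 100 ps histogram bins
N_BINS = 128
PULSE_SIGMA = 2.0            # laser pulse width, in bins

def simulate_histogram(depth_m, albedo):
    """Toy forward model: a Gaussian pulse centered at the round-trip time,
    scaled by albedo and inverse-square falloff."""
    t_bins = np.arange(N_BINS)
    center = 2.0 * depth_m / C / BIN_WIDTH          # round-trip time in bins
    amplitude = albedo / max(depth_m, 1e-3) ** 2
    return amplitude * np.exp(-0.5 * ((t_bins - center) / PULSE_SIGMA) ** 2)

def fit_plane(observed):
    """Recover (depth, albedo) by minimizing the histogram mismatch."""
    def loss(params):
        depth, albedo = params
        return np.sum((simulate_histogram(depth, albedo) - observed) ** 2)
    # coarse grid over depth for a good starting point, then local refinement
    depths = np.linspace(0.05, 2.0, 80)
    d0 = depths[np.argmin([loss([d, 0.5]) for d in depths])]
    res = minimize(loss, x0=[d0, 0.5], bounds=[(0.05, 2.0), (0.0, 1.0)])
    return res.x

true_depth, true_albedo = 0.80, 0.6
observed = simulate_histogram(true_depth, true_albedo)
observed += 1e-4 * np.random.default_rng(0).normal(size=N_BINS)   # measurement noise
depth_hat, albedo_hat = fit_plane(observed)
print(f"depth: {depth_hat:.3f} m (true {true_depth}), albedo: {albedo_hat:.2f} (true {true_albedo})")
```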

A Fast Minimization Algorithm for the Euler Elastica Model Based on a Bilinear Decomposition

  • paper_url: http://arxiv.org/abs/2308.13471
  • repo_url: None
  • paper_authors: Zhifang Liu, Baochen Sun, Xue-Cheng Tai, Qi Wang, Huibin Chang
  • for: This paper proposes a new, fast, and stable hybrid alternating minimization (HALM) algorithm to handle the strong nonlinearity and singularity introduced by the curvature term in the Euler Elastica (EE) model.
  • methods: The algorithm is based on a bilinear decomposition of the gradient of the underlying image and comprises three sub-minimization problems, each solved either in closed form or by fast solvers.
  • results: Compared with other state-of-the-art algorithms, the new algorithm solves the EE model faster and more stably, performing well across a host of numerical experiments. For example, its average running time is at most one-quarter of that of the fast operator-splitting-based Deng-Glowinski-Tai algorithm.
    Abstract The Euler Elastica (EE) model with surface curvature can generate artifact-free results compared with the traditional total variation regularization model in image processing. However, strong nonlinearity and singularity due to the curvature term in the EE model pose a great challenge for one to design fast and stable algorithms for the EE model. In this paper, we propose a new, fast, hybrid alternating minimization (HALM) algorithm for the EE model based on a bilinear decomposition of the gradient of the underlying image and prove the global convergence of the minimizing sequence generated by the algorithm under mild conditions. The HALM algorithm comprises three sub-minimization problems and each is either solved in the closed form or approximated by fast solvers making the new algorithm highly accurate and efficient. We also discuss the extension of the HALM strategy to deal with general curvature-based variational models, especially with a Lipschitz smooth functional of the curvature. A host of numerical experiments are conducted to show that the new algorithm produces good results with much-improved efficiency compared to other state-of-the-art algorithms for the EE model. As one of the benchmarks, we show that the average running time of the HALM algorithm is at most one-quarter of that of the fast operator-splitting-based Deng-Glowinski-Tai algorithm.

RestNet: Boosting Cross-Domain Few-Shot Segmentation with Residual Transformation Network

  • paper_url: http://arxiv.org/abs/2308.13469
  • repo_url: None
  • paper_authors: Xinyang Huang, Chuang Zhu, Wenkai Chen
  • for: This paper proposes a new cross-domain few-shot segmentation model that performs semantic segmentation in previously unseen domains from a limited number of annotated samples.
  • methods: A few-shot segmentation model named RestNet that transfers knowledge through a Semantic Enhanced Anchor Transform (SEAT) module and an Intra-domain Residual Enhancement (IRE) module while retaining intra-domain support-query feature information, together with a mask prediction strategy based on prototype fusion that helps the model gradually learn how to segment.
  • results: Experiments show that RestNet achieves state-of-the-art performance on ISIC, Chest X-ray, and FSS-1000 without requiring additional fine-tuning.
    Abstract Cross-domain few-shot segmentation (CD-FSS) aims to achieve semantic segmentation in previously unseen domains with a limited number of annotated samples. Although existing CD-FSS models focus on cross-domain feature transformation, relying exclusively on inter-domain knowledge transfer may lead to the loss of critical intra-domain information. To this end, we propose a novel residual transformation network (RestNet) that facilitates knowledge transfer while retaining the intra-domain support-query feature information. Specifically, we propose a Semantic Enhanced Anchor Transform (SEAT) module that maps features to a stable domain-agnostic space using advanced semantics. Additionally, an Intra-domain Residual Enhancement (IRE) module is designed to maintain the intra-domain representation of the original discriminant space in the new space. We also propose a mask prediction strategy based on prototype fusion to help the model gradually learn how to segment. Our RestNet can transfer cross-domain knowledge from both inter-domain and intra-domain without requiring additional fine-tuning. Extensive experiments on ISIC, Chest X-ray, and FSS-1000 show that our RestNet achieves state-of-the-art performance. Our code will be available soon.

cs.AI - 2023-08-26

ChatGPT as Data Augmentation for Compositional Generalization: A Case Study in Open Intent Detection

  • paper_url: http://arxiv.org/abs/2308.13517
  • repo_url: None
  • paper_authors: Yihao Fang, Xianzhi Li, Stephen W. Thomas, Xiaodan Zhu
  • for: Enhancing compositional generalization in natural language understanding tasks.
  • methods: Using ChatGPT as a data augmentation technique to improve compositional generalization in open intent detection.
  • results: Rigorous evaluation on multiple benchmarks shows that the approach clearly improves model performance and significantly enhances open intent detection capabilities.
    Abstract Open intent detection, a crucial aspect of natural language understanding, involves the identification of previously unseen intents in user-generated text. Despite the progress made in this field, challenges persist in handling new combinations of language components, which is essential for compositional generalization. In this paper, we present a case study exploring the use of ChatGPT as a data augmentation technique to enhance compositional generalization in open intent detection tasks. We begin by discussing the limitations of existing benchmarks in evaluating this problem, highlighting the need for constructing datasets for addressing compositional generalization in open intent detection tasks. By incorporating synthetic data generated by ChatGPT into the training process, we demonstrate that our approach can effectively improve model performance. Rigorous evaluation of multiple benchmarks reveals that our method outperforms existing techniques and significantly enhances open intent detection capabilities. Our findings underscore the potential of large language models like ChatGPT for data augmentation in natural language understanding tasks.

Does Asking Clarifying Questions Increases Confidence in Generated Code? On the Communication Skills of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.13507
  • repo_url: None
  • paper_authors: Jie JW Wu
  • for: Improving the capability of large language models (LLMs) on code generation tasks.
  • methods: Using an LLM-generated communicator to identify problem descriptions and generated code with high ambiguity or low confidence, and then asking users clarifying questions to obtain feedback for refining the code.
  • results: Better communication skills lead to greater confidence in the quality of the generated code.
    Abstract Large language models (LLMs) have significantly improved the ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. By asking probing questions in various topics before generating the final code, the challenges of programming with LLMs, such as unclear intent specification, lack of computational thinking, and undesired code quality, may be alleviated. This, in turn, increases confidence in the generated code. In this work, we explore how to leverage better communication skills to achieve greater confidence in generated code. We propose a communication-centered process that uses an LLM-generated communicator to identify issues with high ambiguity or low confidence in problem descriptions and generated code. We then ask clarifying questions to obtain responses from users for refining the code.

Attending Generalizability in Course of Deep Fake Detection by Exploring Multi-task Learning

  • paper_url: http://arxiv.org/abs/2308.13503
  • repo_url: None
  • paper_authors: Pranav Balaji, Abhijit Das, Srijan Das, Antitza Dantcheva
  • for: This work explores multi-task learning (MTL) techniques for classifying videos as original or manipulated in a cross-manipulation scenario, in order to improve generalizability in deepfake detection.
  • methods: The evaluation uses the FaceForensics++ dataset, which contains 1000 original videos and 5000 videos manipulated by four different techniques. Extensive experiments are conducted on multi-task learning and contrastive techniques, which are well studied in the literature for their generalization benefits.
  • results: The results show that the proposed detection model generalizes well, accurately detecting manipulation methods not encountered during training and outperforming the state of the art.
    Abstract This work explores various ways of applying multi-task learning (MTL) techniques aimed at classifying videos as original or manipulated in a cross-manipulation scenario to attain generalizability in deepfake scenarios. The dataset used in our evaluation is FaceForensics++, which features 1000 original videos manipulated by four different techniques, with a total of 5000 videos. We conduct extensive experiments on multi-task learning and contrastive techniques, which are well studied in literature for their generalization benefits. It can be concluded that the proposed detection model is quite generalized, i.e., it accurately detects manipulation methods not encountered during training, as compared to the state of the art.
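A minimal PyTorch sketch of one multi-task setup consistent with the abstract: a shared backbone with one head for real/fake classification and one for identifying the manipulation technique, trained with a weighted sum of the two losses. The backbone, head sizes, and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskDeepfakeNet(nn.Module):
    """Shared features + two task heads: real/fake and manipulation type."""
    def __init__(self, feat_dim=128, n_manipulations=4):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a frame/video encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.real_fake_head = nn.Linear(feat_dim, 2)
        self.manipulation_head = nn.Linear(feat_dim, n_manipulations + 1)  # +1 for "pristine"

    def forward(self, x):
        f = self.backbone(x)
        return self.real_fake_head(f), self.manipulation_head(f)

model = MultiTaskDeepfakeNet()
frames = torch.randn(8, 3, 112, 112)
y_binary = torch.randint(0, 2, (8,))            # 0 = real, 1 = fake
y_method = torch.randint(0, 5, (8,))            # manipulation technique (or pristine)
logits_bin, logits_method = model(frames)
ce = nn.CrossEntropyLoss()
loss = ce(logits_bin, y_binary) + 0.5 * ce(logits_method, y_method)  # illustrative weighting
loss.backward()
print(float(loss))
```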

Escaping the Sample Trap: Fast and Accurate Epistemic Uncertainty Estimation with Pairwise-Distance Estimators

  • paper_url: http://arxiv.org/abs/2308.13498
  • repo_url: None
  • paper_authors: Lucas Berry, David Meger
  • for: This work proposes a new approach for estimating the epistemic uncertainty of ensemble models using pairwise-distance estimators (PaiDEs).
  • methods: PaiDEs use the pairwise distances between model components to establish bounds on entropy and use these bounds as estimates of information-based criteria. Unlike recent sample-based Monte Carlo deep learning methods, PaiDEs can estimate epistemic uncertainty up to 100x faster, over a larger space (up to 100x), and more accurately in higher dimensions.
  • results: The advantages of PaiDEs for epistemic uncertainty estimation are demonstrated on a series of experiments commonly used for this purpose (1D sinusoidal data, Pendulum-v0, Hopper-v2, Ant-v2, and Humanoid-v2), in each case within an active learning framework.
    Abstract This work introduces a novel approach for epistemic uncertainty estimation for ensemble models using pairwise-distance estimators (PaiDEs). These estimators utilize the pairwise-distance between model components to establish bounds on entropy and uses said bounds as estimates for information-based criterion. Unlike recent deep learning methods for epistemic uncertainty estimation, which rely on sample-based Monte Carlo estimators, PaiDEs are able to estimate epistemic uncertainty up to 100$\times$ faster, over a larger space (up to 100$\times$) and perform more accurately in higher dimensions. To validate our approach, we conducted a series of experiments commonly used to evaluate epistemic uncertainty estimation: 1D sinusoidal data, Pendulum-v0, Hopper-v2, Ant-v2 and Humanoid-v2. For each experimental setting, an Active Learning framework was applied to demonstrate the advantages of PaiDEs for epistemic uncertainty estimation.
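A NumPy sketch of a pairwise-distance estimator of epistemic uncertainty for an ensemble of Gaussian predictive distributions, in the spirit of mixture-entropy bounds built from pairwise KL divergences (e.g., Kolchinsky and Tracey, 2017). It is one possible instantiation, not necessarily the exact estimator used in the paper.

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag var0) || N(mu1, diag var1) ) for diagonal Gaussians."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0))

def paide_epistemic(mus, variances, weights=None):
    """Pairwise-distance estimate of the ensemble's epistemic uncertainty
    (mixture entropy minus mean component entropy), using KL as the pairwise distance:
    epistemic ~= -sum_i w_i log sum_j w_j exp(-KL(p_i || p_j))."""
    n = len(mus)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    D = np.array([[kl_diag_gauss(mus[i], variances[i], mus[j], variances[j])
                   for j in range(n)] for i in range(n)])
    return float(-np.sum(w * np.log(np.sum(w * np.exp(-D), axis=1))))

# Ensemble of 5 Gaussian predictions over a 2-D output.
rng = np.random.default_rng(0)
agree = [rng.normal(0.0, 0.01, size=2) for _ in range(5)]        # members agree -> low epistemic
disagree = [rng.normal(0.0, 2.0, size=2) for _ in range(5)]      # members disagree -> high epistemic
variances = [np.full(2, 0.1) for _ in range(5)]
print("agreeing ensemble   :", paide_epistemic(agree, variances))
print("disagreeing ensemble:", paide_epistemic(disagree, variances))
```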

Open Gaze: An Open-Source Implementation Replicating Google’s Eye Tracking Paper

  • paper_url: http://arxiv.org/abs/2308.13495
  • repo_url: None
  • paper_authors: Sushmanth reddy Mereddy, Jyothi Swaroop Reddy, Somnath Sharma
  • for: This paper aims to develop an open-source implementation of a smartphone-based gaze tracker that can accurately track eye movements without the need for specialized hardware.
  • methods: The authors use machine learning techniques to develop an eye tracking solution that is native to smartphones, and they validate their approach using the MIT GazeCapture dataset.
  • results: The authors demonstrate that their approach can accurately track eye movements during natural image observation and reading comprehension tasks, and they show that their smartphone-based gaze tracker is comparable in accuracy to state-of-the-art mobile eye trackers that are two orders of magnitude more expensive.
    Abstract Eye tracking has been a pivotal tool in diverse fields such as vision research, language analysis, and usability assessment. The majority of prior investigations, however, have concentrated on expansive desktop displays employing specialized, costly eye tracking hardware that lacks scalability. Remarkably little insight exists into ocular movement patterns on smartphones, despite their widespread adoption and significant usage. In this manuscript, we present an open-source implementation of a smartphone-based gaze tracker that emulates the methodology proposed by a Google paper (whose source code remains proprietary). Our focus is on attaining accuracy comparable to that attained through the Google paper's methodology, without the necessity for supplementary hardware. Through the integration of machine learning techniques, we unveil an accurate eye tracking solution that is native to smartphones. Our approach demonstrates precision akin to the state-of-the-art mobile eye trackers, which are characterized by a cost that is two orders of magnitude higher. Leveraging the vast MIT GazeCapture dataset, which is available through registration on the dataset's website, we successfully replicate crucial findings from previous studies concerning ocular motion behavior in oculomotor tasks and saliency analyses during natural image observation. Furthermore, we emphasize the applicability of smartphone-based gaze tracking in discerning reading comprehension challenges. Our findings exhibit the inherent potential to amplify eye movement research by significant proportions, accommodating participation from thousands of subjects with explicit consent. This scalability not only fosters advancements in vision research, but also extends its benefits to domains such as accessibility enhancement and healthcare applications.

Ultrafast-and-Ultralight ConvNet-Based Intelligent Monitoring System for Diagnosing Early-Stage Mpox Anytime and Anywhere

  • paper_url: http://arxiv.org/abs/2308.13492
  • repo_url: None
  • paper_authors: Yubiao Yue, Xiaoqiang Shi, Li Qin, Xinyue Zhang, Yanmei Chen, Jialong Xu, Zipei Zheng, Yujun Cao, Di Liu, Zhenzhang Li, Yang Li
  • for: The paper aims to develop a real-time diagnostic tool for monkeypox, addressing the lack of efficient diagnostic tools and the challenges of high inference speed, large parameter size, and limited diagnosis performance for early-stage monkeypox.
  • methods: The proposed method, Fast-MpoxNet, is an ultrafast and ultralight deep learning network that integrates attention-based feature fusion and multiple auxiliary losses enhancement. It uses transfer learning and five-fold cross-validation, achieving 94.26% Accuracy on the Mpox dataset with a recall of 93.65% for early-stage monkeypox.
  • results: Fast-MpoxNet achieves high accuracy and practicality in real-time diagnosis, with an Accuracy of 98.40% and a Practicality Score of 0.80 when adopting data augmentation. An application system named Mpox-AISM V2 was also developed for both personal computers and mobile phones, featuring ultrafast responses, offline functionality, and easy deployment. The proposed method has the potential to mitigate future monkeypox outbreaks and provides a new paradigm for developing real-time diagnostic tools in the healthcare field.
    Abstract Due to the lack of more efficient diagnostic tools for monkeypox, its spread remains unchecked, presenting a formidable challenge to global health. While the high efficacy of deep learning models for monkeypox diagnosis has been demonstrated in related studies, the overlook of inference speed, the parameter size and diagnosis performance for early-stage monkeypox renders the models inapplicable in real-world settings. To address these challenges, we proposed an ultrafast and ultralight network named Fast-MpoxNet. Fast-MpoxNet possesses only 0.27M parameters and can process input images at 68 frames per second (FPS) on the CPU. To counteract the diagnostic performance limitation brought about by the small model capacity, it integrates the attention-based feature fusion module and the multiple auxiliary losses enhancement strategy for better detecting subtle image changes and optimizing weights. Using transfer learning and five-fold cross-validation, Fast-MpoxNet achieves 94.26% Accuracy on the Mpox dataset. Notably, its recall for early-stage monkeypox achieves 93.65%. By adopting data augmentation, our model's Accuracy rises to 98.40% and attains a Practicality Score (A new metric for measuring model practicality in real-time diagnosis application) of 0.80. We also developed an application system named Mpox-AISM V2 for both personal computers and mobile phones. Mpox-AISM V2 features ultrafast responses, offline functionality, and easy deployment, enabling accurate and real-time diagnosis for both the public and individuals in various real-world settings, especially in populous settings during the outbreak. Our work could potentially mitigate future monkeypox outbreak and illuminate a fresh paradigm for developing real-time diagnostic tools in the healthcare field.

Towards Optimal Head-to-head Autonomous Racing with Curriculum Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.13491
  • repo_url: None
  • paper_authors: Dvij Kalaria, Qin Lin, John M. Dolan
  • for: This work proposes a head-to-head autonomous racing environment for reinforcement learning, with the goal of learning an optimal policy.
  • methods: A curriculum learning framework that transitions from a simpler vehicle model to a more complex, realistic environment, combined with a control-barrier-function-based safe reinforcement learning algorithm, to teach the RL agent a policy closer to the optimal one.
  • results: The results show that curriculum learning together with the safe RL algorithm trains the agent toward a more optimal policy more effectively, while also making training safer.
    Abstract Head-to-head autonomous racing is a challenging problem, as the vehicle needs to operate at the friction or handling limits in order to achieve minimum lap times while also actively looking for strategies to overtake/stay ahead of the opponent. In this work we propose a head-to-head racing environment for reinforcement learning which accurately models vehicle dynamics. Some previous works have tried learning a policy directly in the complex vehicle dynamics environment but have failed to learn an optimal policy. In this work, we propose a curriculum learning-based framework by transitioning from a simpler vehicle model to a more complex real environment to teach the reinforcement learning agent a policy closer to the optimal policy. We also propose a control barrier function-based safe reinforcement learning algorithm to enforce the safety of the agent in a more effective way while not compromising on optimality.
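A toy NumPy sketch of a control-barrier-function safety filter for a single-integrator "ego vehicle" following an opponent: the RL policy's nominal speed command is clipped so the barrier h = gap - d_min satisfies h_dot + alpha*h >= 0. The dynamics and constants are far simpler than the racing environment in the paper and are purely illustrative.

```python
import numpy as np

ALPHA = 1.0      # CBF class-K gain
D_MIN = 5.0      # minimum allowed gap to the opponent (m)
DT = 0.05        # simulation step (s)

def cbf_filter(u_nominal, gap, v_opponent):
    """Single-integrator CBF filter.  Barrier h = gap - D_MIN with
    h_dot = v_opponent - u  =>  safe iff u <= v_opponent + ALPHA * h."""
    u_max_safe = v_opponent + ALPHA * (gap - D_MIN)
    return min(u_nominal, u_max_safe)

# Ego speed is the control; the opponent drives at constant speed ahead of us.
x_ego, x_opp, v_opp = 0.0, 20.0, 10.0
for step in range(400):
    gap = x_opp - x_ego
    u_rl = 15.0                       # aggressive nominal speed command from the RL policy
    u = cbf_filter(u_rl, gap, v_opp)  # safety-filtered command
    x_ego += u * DT
    x_opp += v_opp * DT
print("final gap (should stay >= D_MIN):", round(x_opp - x_ego, 2))
```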

Temporal Uncertainty Localization to Enable Human-in-the-loop Analysis of Dynamic Contrast-enhanced Cardiac MRI Datasets

  • paper_url: http://arxiv.org/abs/2308.13488
  • repo_url: None
  • paper_authors: Dilek M. Yalcinkaya, Khalid Youssef, Bobak Heydari, Orlando Simonetti, Rohan Dharmakumar, Subha Raman, Behzad Sharif
  • for: This paper proposes a deep-neural-network-based dynamic quality control (dQC) tool for detecting segmentation failures in DCE-CMRI datasets.
  • methods: The approach combines DCE-CMRI segmentation with deep neural networks, a space-time uncertainty metric, and a human-in-the-loop refinement framework.
  • results: The proposed dQC tool accurately identifies failed segmentations; referring the most uncertain cases to a human expert improves segmentation accuracy and reduces the number of images with failed segmentation.
    Abstract Dynamic contrast-enhanced (DCE) cardiac magnetic resonance imaging (CMRI) is a widely used modality for diagnosing myocardial blood flow (perfusion) abnormalities. During a typical free-breathing DCE-CMRI scan, close to 300 time-resolved images of myocardial perfusion are acquired at various contrast "wash in/out" phases. Manual segmentation of myocardial contours in each time-frame of a DCE image series can be tedious and time-consuming, particularly when non-rigid motion correction has failed or is unavailable. While deep neural networks (DNNs) have shown promise for analyzing DCE-CMRI datasets, a "dynamic quality control" (dQC) technique for reliably detecting failed segmentations is lacking. Here we propose a new space-time uncertainty metric as a dQC tool for DNN-based segmentation of free-breathing DCE-CMRI datasets by validating the proposed metric on an external dataset and establishing a human-in-the-loop framework to improve the segmentation results. In the proposed approach, we referred the top 10% most uncertain segmentations as detected by our dQC tool to the human expert for refinement. This approach resulted in a significant increase in the Dice score (p<0.001) and a notable decrease in the number of images with failed segmentation (16.2% to 11.3%) whereas the alternative approach of randomly selecting the same number of segmentations for human referral did not achieve any significant improvement. Our results suggest that the proposed dQC framework has the potential to accurately identify poor-quality segmentations and may enable efficient DNN-based analysis of DCE-CMRI in a human-in-the-loop pipeline for clinical interpretation and reporting of dynamic CMRI datasets.
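A short NumPy sketch of the human-in-the-loop selection described in the abstract: rank time-frames by the dQC uncertainty score and refer the top 10% most uncertain segmentations to an expert for refinement. The uncertainty values here are synthetic placeholders.

```python
import numpy as np

def select_for_referral(uncertainty_scores, fraction=0.10):
    """Return indices of the most uncertain segmentations (top `fraction`),
    to be sent to a human expert for refinement."""
    scores = np.asarray(uncertainty_scores)
    n_refer = max(1, int(np.ceil(fraction * len(scores))))
    return np.argsort(scores)[::-1][:n_refer]

# Synthetic dQC scores for 300 time-frames of a DCE-CMRI series.
rng = np.random.default_rng(0)
scores = rng.beta(2, 8, size=300)          # most frames are confident, a few are not
referred = select_for_referral(scores, fraction=0.10)
print(f"referring {len(referred)} of {len(scores)} frames; "
      f"score range {scores[referred].min():.2f}-{scores[referred].max():.2f}")
```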

Leveraging Knowledge and Reinforcement Learning for Enhanced Reliability of Language Models

  • paper_url: http://arxiv.org/abs/2308.13467
  • repo_url: None
  • paper_authors: Nancy Tyagi, Surjodeep Sarkar, Manas Gaur
  • for: This paper aims to improve the reliability and accuracy of modern language models such as BERT.
  • methods: A knowledge-guided LM ensembling approach that uses reinforcement learning to integrate knowledge graph embeddings from ConceptNet and Wikipedia, mimicking human annotators who resort to external knowledge to compensate for information deficits in the datasets.
  • results: The study shows that this knowledge-guided ensembling strengthens reliability and accuracy across nine GLUE tasks, outperforming the state of the art.
    Abstract The Natural Language Processing (NLP) community has been using crowdsourcing techniques to create benchmark datasets such as General Language Understanding and Evaluation (GLUE) for training modern Language Models such as BERT. GLUE tasks measure the reliability scores using inter-annotator metrics, i.e., Cohen's Kappa. However, the reliability aspect of LMs has often been overlooked. To counter this problem, we explore a knowledge-guided LM ensembling approach that leverages reinforcement learning to integrate knowledge from ConceptNet and Wikipedia as knowledge graph embeddings. This approach mimics human annotators resorting to external knowledge to compensate for information deficits in the datasets. Across nine GLUE datasets, our research shows that ensembling strengthens reliability and accuracy scores, outperforming the state of the art.

cs.CL - 2023-08-26

Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level

  • paper_url: http://arxiv.org/abs/2308.13506
  • repo_url: None
  • paper_authors: Daniel Deutsch, Juraj Juraska, Mara Finkelstein, and Markus Freitag
  • for: This work evaluates how effective automatic metrics are at scoring machine translations of text beyond the sentence level, rather than only single sentences.
  • methods: A method for creating paragraph-level data from existing sentence-level data, which is then used to train and meta-evaluate evaluation metrics.
  • results: Experiments show that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed specifically for the paragraph level.
    Abstract As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is equally as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.
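A small Python sketch of one way to build paragraph-level examples from sentence-level MT evaluation data, as the abstract describes: concatenate consecutive source/hypothesis sentences from the same document and aggregate their human scores. Averaging the scores and using fixed-size chunks are assumptions; the paper's exact construction may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SentenceExample:
    doc_id: str
    src: str
    hyp: str        # machine translation hypothesis
    score: float    # human quality judgment for this sentence

@dataclass
class ParagraphExample:
    src: str
    hyp: str
    score: float

def build_paragraph_examples(sentences: List[SentenceExample],
                             sents_per_paragraph: int = 3) -> List[ParagraphExample]:
    """Group consecutive sentences of the same document into paragraph-level examples."""
    by_doc = {}
    for s in sentences:                               # preserve document order
        by_doc.setdefault(s.doc_id, []).append(s)
    paragraphs = []
    for doc in by_doc.values():
        for i in range(0, len(doc), sents_per_paragraph):
            chunk = doc[i:i + sents_per_paragraph]
            paragraphs.append(ParagraphExample(
                src=" ".join(s.src for s in chunk),
                hyp=" ".join(s.hyp for s in chunk),
                score=sum(s.score for s in chunk) / len(chunk),   # assumed aggregation
            ))
    return paragraphs

data = [SentenceExample("d1", f"src {i}", f"hyp {i}", score=float(i % 3)) for i in range(7)]
for p in build_paragraph_examples(data):
    print(p.score, "|", p.hyp)
```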

Ngambay-French Neural Machine Translation (sba-Fr)

  • paper_url: http://arxiv.org/abs/2308.13497
  • repo_url: https://github.com/Toadoum/Ngambay-French-Neural-Machine-Translation-sba_fr_v1-
  • paper_authors: Sakayo Toadoum Sari, Angela Fan, Lema Logamou Seknewna
  • for: This work aims to develop a neural machine translation (NMT) system to help overcome language barriers. NMT for low-resource languages is particularly compelling, as it involves learning with limited labelled data.
  • methods: Three pre-trained models are fine-tuned on the first sba-Fr dataset, a Ngambay-to-French corpus created using a guided data-gathering approach to produce training data.
  • results: Experiments show that the M2M100 model achieves high BLEU scores on both the original and the original+synthetic data, outperforming the other models. The bitext dataset is publicly available for research purposes.
    Abstract In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers. NMT for Low-resource language is particularly compelling as it involves learning with limited labelled data. However, obtaining a well-aligned parallel corpus for low-resource languages can be challenging. The disparity between the technological advancement of a few global languages and the lack of research on NMT for local languages in Chad is striking. End-to-end NMT trials on low-resource Chad languages have not been attempted. Additionally, there is a dearth of online and well-structured data gathering for research in Natural Language Processing, unlike some African languages. However, a guided approach for data gathering can produce bitext data for many Chadian language translation pairs with well-known languages that have ample data. In this project, we created the first sba-Fr Dataset, which is a corpus of Ngambay-to-French translations, and fine-tuned three pre-trained models using this dataset. Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data. The publicly available bitext dataset can be used for research purposes.

Prompting a Large Language Model to Generate Diverse Motivational Messages: A Comparison with Human-Written Messages

  • paper_url: http://arxiv.org/abs/2308.13479
  • repo_url: None
  • paper_authors: Samuel Rhys Cox, Ashraf Abdul, Wei Tsang Ooi
  • for: This paper examines how large language models (LLMs) can be used to produce creative content, and how more specific prompts can improve the quality of what an LLM generates.
  • methods: A previous crowdsourcing pipeline, which gave examples to workers to help them generate a collectively diverse corpus of motivational messages, was reused as a prompt for GPT-4; the resulting messages were compared with those from crowd writers and from two baseline GPT-4 prompts.
  • results: Prompting GPT-4 with the crowdsourcing pipeline caused it to produce more diverse messages than the two baseline prompts. The paper also discusses implications of messages generated by human writers and by LLMs.
    Abstract Large language models (LLMs) are increasingly capable and prevalent, and can be used to produce creative content. The quality of content is influenced by the prompt used, with more specific prompts that incorporate examples generally producing better results. Following from this, instructions written for crowdsourcing tasks (which are specific and include examples to guide workers) could prove to be effective LLM prompts. To explore this, we used a previous crowdsourcing pipeline that gave examples to people to help them generate a collectively diverse corpus of motivational messages. We then used this same pipeline to generate messages using GPT-4, and compared the collective diversity of messages from: (1) crowd-writers, (2) GPT-4 using the pipeline, and (3 & 4) two baseline GPT-4 prompts. We found that the LLM prompts using the crowdsourcing pipeline caused GPT-4 to produce more diverse messages than the two baseline prompts. We also discuss implications from messages generated by both human writers and LLMs.

ARTIST: ARTificial Intelligence for Simplified Text

  • paper_url: http://arxiv.org/abs/2308.13458
  • repo_url: https://github.com/delftcrowd/artist
  • paper_authors: Lorenzo Corti, Jie Yang
  • for: This work investigates text simplification in natural language processing in order to improve access to public information and knowledge.
  • methods: Recent generative artificial intelligence techniques are used, including language models, domain and reader adaptation, and visualisation modules, to automatically simplify text at the lexical and syntactic levels.
  • results: The study identifies the strengths and limitations of automatic text simplification, including the challenge of handling cultural and commonsense knowledge. These outcomes represent a first step in the exploration of Dutch text simplification and shed light on future research and practice.
    Abstract Complex text is a major barrier for many citizens when accessing public information and knowledge. While often done manually, Text Simplification is a key Natural Language Processing task that aims for reducing the linguistic complexity of a text while preserving the original meaning. Recent advances in Generative Artificial Intelligence (AI) have enabled automatic text simplification both on the lexical and syntactical levels. However, as applications often focus on English, little is understood about the effectiveness of Generative AI techniques on low-resource languages such as Dutch. For this reason, we carry out empirical studies to understand the benefits and limitations of applying generative technologies for text simplification and provide the following outcomes: 1) the design and implementation for a configurable text simplification pipeline that orchestrates state-of-the-art generative text simplification models, domain and reader adaptation, and visualisation modules; 2) insights and lessons learned, showing the strengths of automatic text simplification while exposing the challenges in handling cultural and commonsense knowledge. These outcomes represent a first step in the exploration of Dutch text simplification and shed light on future endeavours both for research and practice.

cs.LG - 2023-08-26

Unveiling the Role of Message Passing in Dual-Privacy Preservation on GNNs

  • paper_url: http://arxiv.org/abs/2308.13513
  • repo_url: None
  • paper_authors: Tianyi Zhao, Hui Hu, Lu Cheng
  • for: This paper aims to address the privacy leakage issue in Graph Neural Networks (GNNs) and propose a principled privacy-preserving GNN framework.
  • methods: The proposed framework consists of three major modules: Sensitive Information Obfuscation Module, Dynamic Structure Debiasing Module, and Adversarial Learning Module.
  • results: Experimental results on four benchmark datasets show that the proposed model effectively protects both node and link privacy while preserving high utility for downstream tasks such as node classification.
    Abstract Graph Neural Networks (GNNs) are powerful tools for learning representations on graphs, such as social networks. However, their vulnerability to privacy inference attacks restricts their practicality, especially in high-stake domains. To address this issue, privacy-preserving GNNs have been proposed, focusing on preserving node and/or link privacy. This work takes a step back and investigates how GNNs contribute to privacy leakage. Through theoretical analysis and simulations, we identify message passing under structural bias as the core component that allows GNNs to \textit{propagate} and \textit{amplify} privacy leakage. Building upon these findings, we propose a principled privacy-preserving GNN framework that effectively safeguards both node and link privacy, referred to as dual-privacy preservation. The framework comprises three major modules: a Sensitive Information Obfuscation Module that removes sensitive information from node embeddings, a Dynamic Structure Debiasing Module that dynamically corrects the structural bias, and an Adversarial Learning Module that optimizes the privacy-utility trade-off. Experimental results on four benchmark datasets validate the effectiveness of the proposed model in protecting both node and link privacy while preserving high utility for downstream tasks, such as node classification.

TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

  • paper_url: http://arxiv.org/abs/2308.13490
  • repo_url: https://github.com/google-research-datasets/tpu_graphs
  • paper_authors: Phitchaya Mangpo Phothilimthana, Sami Abu-El-Haija, Kaidi Cao, Bahare Fatemi, Charith Mendis, Bryan Perozzi
  • for: This paper provides a large performance prediction dataset on tensor computational graphs for optimizing machine learning compilers and autotuners.
  • methods: Machine learning workloads are represented as computational graphs, collected from open-source machine learning programs covering popular model architectures.
  • results: The resulting dataset, TpuGraphs, in which each graph represents the main computation of a workload such as a training epoch or an inference step, provides 25x more graphs than the largest comparable graph property prediction dataset and graphs that are on average 770x larger than those in existing performance prediction datasets for machine learning programs. It also introduces new challenges, such as scalability, training efficiency, and model quality.
    Abstract Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with the configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures, e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer. TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability, training efficiency, to model quality.

Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction

  • paper_url: http://arxiv.org/abs/2308.13466
  • repo_url: None
  • paper_authors: Guangji Bai, Ziyang Yu, Zheng Chai, Yue Cheng, Liang Zhao
  • for: This paper addresses the difficulty of training Graph Neural Networks (GNNs) on large-scale graphs, in particular the massive communication overhead of distributed training caused by node dependencies.
  • methods: Distributed computing with historical value approximation, which caches historical node embeddings in an offline memory to achieve high concurrency.
  • results: The paper proposes SAT (Staleness-Alleviated Training), a new distributed GNN training framework that effectively reduces the staleness of cached node embeddings. Experiments show that SAT achieves better performance and faster convergence on multiple large-scale graph datasets.
    Abstract Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train GNNs on large-scale graphs due to neighbor explosions. As a remedy, distributed computing becomes a promising solution by leveraging abundant computing resources (e.g., GPU). However, the node dependency of graph data increases the difficulty of achieving high concurrency in distributed GNN training, which suffers from the massive communication overhead. To address it, Historical value approximation is deemed a promising class of distributed training techniques. It utilizes an offline memory to cache historical information (e.g., node embedding) as an affordable approximation of the exact value and achieves high concurrency. However, such benefits come at the cost of involving dated training information, leading to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that reduces the embedding staleness adaptively. The key idea of SAT is to model the GNN's embedding evolution as a temporal graph and build a model upon it to predict future embedding, which effectively alleviates the staleness of the cached historical embedding. We propose an online algorithm to train the embedding predictor and the distributed GNN alternatively and further provide a convergence analysis. Empirically, we demonstrate that SAT can effectively reduce embedding staleness and thus achieve better performance and convergence speed on multiple large-scale graph datasets.
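As a rough illustration of the caching-plus-forecasting idea described above, the sketch below keeps a short history of node embeddings and predicts a fresher value from it; the class name, buffer layout, and linear predictor are illustrative assumptions, not SAT's actual architecture or training procedure.

```python
import torch
import torch.nn as nn

class StalenessCorrectedCache(nn.Module):
    """Minimal sketch: cache historical node embeddings and predict a fresher
    value from the last `history` snapshots (illustrative only)."""

    def __init__(self, num_nodes: int, dim: int, history: int = 3):
        super().__init__()
        self.history = history
        # ring buffer of the last `history` embedding snapshots per node
        self.register_buffer("cache", torch.zeros(history, num_nodes, dim))
        # lightweight predictor mapping the concatenated history to a forecast
        self.predictor = nn.Linear(history * dim, dim)

    def push(self, node_ids: torch.Tensor, embeddings: torch.Tensor) -> None:
        # shift the history and store the newest embeddings for the touched nodes
        self.cache = torch.roll(self.cache, shifts=-1, dims=0)
        self.cache[-1, node_ids] = embeddings.detach()

    def lookup(self, node_ids: torch.Tensor) -> torch.Tensor:
        # predict the "current" embedding from the stale snapshots
        hist = self.cache[:, node_ids]                     # (history, n, dim)
        flat = hist.permute(1, 0, 2).reshape(len(node_ids), -1)
        return self.predictor(flat)

# Usage sketch: out-of-partition neighbors are read via lookup() instead of the
# raw stale cache, and push() is called after each local embedding update.
```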

cs.CV - 2023-08-25

ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors

  • paper_url: http://arxiv.org/abs/2308.12969
  • repo_url: None
  • paper_authors: Wanyue Zhang, Rishabh Dabral, Thomas Leimkühler, Vladislav Golyanik, Marc Habermann, Christian Theobalt
  • for: This work addresses the limited robustness and generalization of existing automatic methods to novel objects, improving how naturally 3D virtual characters adapt to objects unseen during training in motion synthesis.
  • methods: The method trains a motion model with as few as one reference object and leverages an implicit, SE(3)-equivariant feature representation learned on object-only datasets to improve robustness and generalization to new objects.
  • results: Comparisons with state-of-the-art methods and a user study show improved quality and robustness of 3D virtual character motion and interaction, including high-quality motion generation for unseen objects.
    Abstract Existing automatic approaches for 3D virtual character motion synthesis supporting scene interactions do not generalise well to new objects outside training distributions, even when trained on extensive motion capture datasets with diverse objects and annotated interactions. This paper addresses this limitation and shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object. We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object. Given an unseen object and a reference pose-object pair, we optimise for the object-aware pose that is closest in the feature space to the reference pose. Finally, we use l-NSM, i.e., our motion generation model that is trained to seamlessly transition from locomotion to object interaction with the proposed bidirectional pose blending scheme. Through comprehensive numerical comparisons to state-of-the-art methods and in a user study, we demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects. Our project page is available at https://vcai.mpi-inf.mpg.de/projects/ROAM/.

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.12968
  • repo_url: https://github.com/yuxinn-j/scenimefy
  • paper_authors: Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy
  • for: The goal is automatic high-quality rendering of anime scenes from complex real-world images, bridging the domain gap while improving semantic preservation and fine details.
  • methods: The method is a semi-supervised image-to-image translation framework that uses structure-consistent pseudo paired data derived from a semantic-constrained StyleGAN, segmentation-guided data selection, and a patch-wise contrastive style loss to improve stylization and fine details.
  • results: Compared with state-of-the-art baselines, the method performs better in both perceptual quality and quantitative performance.
    Abstract Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts are still incompetent in achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the pure unsupervised setting. The pseudo data are derived uniquely from a semantic-constrained StyleGAN leveraging rich model priors like CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. Besides, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance.

POCO: 3D Pose and Shape Estimation with Confidence

  • paper_url: http://arxiv.org/abs/2308.12965
  • repo_url: None
  • paper_authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas
  • for: The paper aims to improve the accuracy of 3D human pose and shape estimation from images and to provide uncertainty estimates for downstream tasks.
  • methods: The paper proposes a novel framework called POCO, which uses a Dual Conditioning Strategy (DCS) to estimate both the 3D body pose and a per-sample variance in a single feed-forward pass.
  • results: Training the network to reason about uncertainty helps it learn to estimate 3D pose more accurately; applying the method to three state-of-the-art HPS regressors improves accuracy, and the uncertainty estimates prove useful for downstream tasks such as bootstrapped HPS training and video pose estimation.
    Abstract The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames. Code and models will be available for research at https://poco.is.tue.mpg.de.
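The abstract does not spell out the Dual Conditioning Strategy, but the broader idea of predicting a per-sample variance alongside a regression output can be sketched with a standard Gaussian negative log-likelihood. Everything below (function name, shapes) is a generic, hypothetical stand-in for uncertainty-aware regression, not POCO's actual method.

```python
import torch

def heteroscedastic_regression_loss(pred_pose, log_var, gt_pose):
    """Generic sketch: train a regressor to output a per-sample variance
    alongside its prediction via a Gaussian negative log-likelihood.

    pred_pose: (batch, d) predicted pose parameters
    log_var:   (batch,)   predicted log-variance, one scalar per sample
    gt_pose:   (batch, d) ground-truth pose parameters
    """
    sq_err = (pred_pose - gt_pose).pow(2).mean(dim=-1)     # per-sample error
    nll = 0.5 * (torch.exp(-log_var) * sq_err + log_var)   # Gaussian NLL up to a constant
    return nll.mean()

# Samples with large predicted variance contribute less to the error term,
# so log_var learns to flag inputs the regressor is unsure about.
```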

Dense Text-to-Image Generation with Attention Modulation

  • paper_url: http://arxiv.org/abs/2308.12964
  • repo_url: https://github.com/naver-ai/densediffusion
  • paper_authors: Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
  • for: To synthesize realistic images from dense captions, where each text prompt provides a detailed description for a specific image region.
  • methods: The method adapts a pre-trained text-to-image diffusion model without additional training and modulates its attention to control the scene layout so that objects appear in the specified regions.
  • results: Without additional fine-tuning or datasets, the approach improves image generation from dense captions on both automatic and human evaluation, achieving visual quality comparable to models trained specifically with layout conditions.
    Abstract Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions.
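The attention-modulation idea can be sketched as biasing pre-softmax cross-attention scores so that each region-specific token attends inside its target region. The function below is an illustrative approximation with assumed argument shapes, not DenseDiffusion's released implementation.

```python
import torch

def modulate_cross_attention(scores, region_masks, token_to_region, strength=1.0):
    """Illustrative layout-guided attention modulation.

    scores:          (batch, n_pixels, n_tokens) pre-softmax attention logits
    region_masks:    (n_regions, n_pixels) binary masks, 1 inside a region
    token_to_region: (n_tokens,) region index per text token, -1 for global tokens
    """
    bias = torch.zeros_like(scores)
    for tok, reg in enumerate(token_to_region.tolist()):
        if reg < 0:
            continue  # global tokens are left untouched
        inside = region_masks[reg].bool()            # (n_pixels,)
        # push the token's attention toward its region, away from the rest
        bias[:, inside, tok] += strength
        bias[:, ~inside, tok] -= strength
    return torch.softmax(scores + bias, dim=-1)
```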

MapPrior: Bird’s-Eye View Map Layout Estimation with Generative Models

  • paper_url: http://arxiv.org/abs/2308.12963
  • repo_url: None
  • paper_authors: Xiyue Zhu, Vlas Zyrianov, Zhijian Liu, Shenlong Wang
  • for: To produce more accurate, realistic, and uncertainty-aware semantic map layouts for bird's-eye view (BEV) perception.
  • methods: MapPrior combines a traditional discriminative BEV perception model with a learned generative model for semantic map layouts.
  • results: On the nuScenes benchmark, MapPrior outperforms the strongest competing method, with significantly improved MMD and ECE scores for camera- and LiDAR-based BEV perception.
    Abstract Despite tremendous advancements in bird's-eye view (BEV) perception, existing models fall short in generating realistic and coherent semantic map layouts, and they fail to account for uncertainties arising from partial sensor information (such as occlusion or limited coverage). In this work, we introduce MapPrior, a novel BEV perception framework that combines a traditional discriminative BEV perception model with a learned generative model for semantic map layouts. Our MapPrior delivers predictions with better accuracy, realism, and uncertainty awareness. We evaluate our model on the large-scale nuScenes benchmark. At the time of submission, MapPrior outperforms the strongest competing method, with significantly improved MMD and ECE scores in camera- and LiDAR-based BEV perception.

Motion-Guided Masking for Spatiotemporal Representation Learning

  • paper_url: http://arxiv.org/abs/2308.12962
  • repo_url: None
  • paper_authors: David Fan, Jue Wang, Shuai Liao, Yi Zhu, Vimal Bhat, Hector Santos-Villalobos, Rohith MV, Xinyu Li
  • for: This paper aims to improve video representation learning by designing a masking strategy better suited to video masked autoencoders than the random masking inherited from image MAE.
  • methods: It proposes a motion-guided masking algorithm (MGM) that uses motion vectors, obtained directly from the compressed video format, to guide the position of each mask over time.
  • results: On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), the method improves over previous state-of-the-art video MAE methods by up to +1.3%, matches previous video MAE performance with up to 66% fewer training epochs, and generalizes better to downstream transfer learning and domain adaptation on UCF101, HMDB51, and Diving48, with up to +4.9% improvement over baselines.
    Abstract Several recent works have directly extended the image masked autoencoder (MAE) with random masking into video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +$1.3\%$ improvement compared to previous state-of-the-art methods. Additionally, our MGM achieves equivalent performance to previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$ improvement compared to baseline methods.
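A toy version of motion-guided mask placement might rank patches by motion magnitude and mask the most-moving ones. The sketch below assumes per-patch motion magnitudes have already been extracted from the compressed stream and simplifies away MGM's temporal tracking of mask positions.

```python
import torch

def motion_guided_mask(motion_mag, mask_ratio=0.75):
    """Toy placement of masks on the most-moving patches in each frame.

    motion_mag: (frames, n_patches) motion magnitude per patch per frame
    returns:    (frames, n_patches) boolean mask, True = masked
    """
    frames, n_patches = motion_mag.shape
    n_masked = int(mask_ratio * n_patches)
    top = motion_mag.topk(n_masked, dim=1).indices   # indices of most-moving patches
    mask = torch.zeros(frames, n_patches)
    mask.scatter_(1, top, 1.0)
    return mask.bool()
```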

Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks

  • paper_url: http://arxiv.org/abs/2308.12961
  • repo_url: https://github.com/yangyangyang127/tfs3d
  • paper_authors: Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Jiaming Liu, Hao Dong, Peng Gao
  • for: To improve few-shot 3D semantic segmentation while reducing the reliance on large-scale datasets.
  • methods: The paper proposes a training-free few-shot 3D segmentation network (TFS3D) and a training-based variant, TFS3D-T. TFS3D has no learnable parameters and extracts dense representations with trigonometric positional encodings, achieving performance comparable to previous training-based methods. TFS3D-T additionally trains a lightweight query-support transferring attention (QUEST) module to enhance the interaction between the few-shot query and support data.
  • results: On S3DIS and ScanNet, TFS3D-T improves mIoU over previous state-of-the-art methods by +6.93% and +17.96% respectively, while reducing training time by 90%, indicating better effectiveness and efficiency.
    Abstract To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot semantic segmentation methods first pre-train the models on `seen' classes, and then evaluate their generalization performance on `unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead, but also incurs a significant domain gap on `unseen' classes. To tackle these issues, we propose an efficient Training-free Few-shot 3D Segmentation network, TFS3D, and a further training-based variant, TFS3D-T. Without any learnable parameters, TFS3D extracts dense representations by trigonometric positional encodings, and achieves comparable performance to previous training-based methods. Due to the elimination of pre-training, TFS3D can alleviate the domain gap issue and save a substantial amount of time. Building upon TFS3D, TFS3D-T only requires to train a lightweight query-support transferring attention (QUEST), which enhances the interaction between the few-shot query and support data. Experiments demonstrate TFS3D-T improves previous state-of-the-art methods by +6.93% and +17.96% mIoU respectively on S3DIS and ScanNet, while reducing the training time by -90%, indicating superior effectiveness and efficiency.
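A training-free trigonometric encoding of point coordinates, the kind of dense representation TFS3D builds on, can be sketched as follows; the frequency schedule and normalization here are illustrative assumptions rather than the paper's exact choices.

```python
import torch

def trig_positional_encoding(xyz, num_freqs=6):
    """Sketch of a trigonometric (sinusoidal) encoding of 3D points.

    xyz: (n_points, 3) point coordinates, assumed roughly normalized
    returns: (n_points, 3 * 2 * num_freqs) encoded features
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)   # 1, 2, 4, ...
    angles = xyz[:, :, None] * freqs                           # (n, 3, num_freqs)
    feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return feats.reshape(xyz.shape[0], -1)
```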

Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

  • paper_url: http://arxiv.org/abs/2308.12960
  • repo_url: https://github.com/sheng-eatamath/S3A
  • paper_authors: Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman Khan, Kun Zhang, Fahad Khan
  • for: The work targets realistic zero-shot classification in an open-world setting with no annotations but a broad vocabulary.
  • methods: It proposes the Self Structural Semantic Alignment (S^3A) framework, which extracts structural semantics from unlabeled data while self-learning. Its core is a Cluster-Vote-Prompt-Realign (CVPR) algorithm that iteratively clusters images, votes within each cluster to identify candidate classes from the vocabulary, generates discriminative prompts with large language models to discern confusing candidates, and realigns images with the vocabulary as structural semantic alignment.
  • results: Extensive experiments on generic and fine-grained benchmarks show that S^3A substantially outperforms existing VLM-based approaches, with a more than 15% average accuracy improvement over CLIP.
    Abstract Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address this challenge, we propose the Self Structural Semantic Alignment (S^3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S^3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR process includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-learn the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S^3A method offers substantial improvements over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
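The cluster-then-vote portion of the CVPR loop can be sketched as clustering CLIP image embeddings and majority-voting the nearest vocabulary class inside each cluster. The prompt-generation and realignment steps are omitted, and the specific clustering choice (k-means) is an assumption made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(image_feats, text_feats, num_clusters=100):
    """Illustrative cluster-and-vote step for pseudo-labeling.

    image_feats: (n_images, d) L2-normalized image embeddings
    text_feats:  (n_classes, d) L2-normalized text embeddings of the vocabulary
    returns:     dict mapping cluster id -> voted class index
    """
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(image_feats)
    # nearest vocabulary class for every image, by cosine similarity
    per_image_class = (image_feats @ text_feats.T).argmax(axis=1)
    voted = {}
    for c in range(num_clusters):
        members = per_image_class[labels == c]
        if members.size:
            # majority vote inside the cluster gives its pseudo-label candidate
            voted[c] = int(np.bincount(members).argmax())
    return voted
```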

Label Budget Allocation in Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2308.12949
  • repo_url: None
  • paper_authors: Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian
  • for: To improve the performance of machine learning systems under labeling cost constraints.
  • methods: The paper formulates the label budget allocation problem in multi-task learning and proposes a Task-Adaptive Budget Allocation algorithm to solve it.
  • results: Experiments show the approach outperforms other widely used heuristic labeling strategies for multi-task learning.
    Abstract The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to propose and formally define the label budget allocation problem in multi-task learning and to empirically show that different budget allocation strategies make a big difference to its performance. We propose a Task-Adaptive Budget Allocation algorithm to robustly generate the optimal budget allocation adaptive to different multi-task learning settings. Specifically, we estimate and then maximize the extent of new information obtained from the allocated budget as a proxy for multi-task learning performance. Experiments on PASCAL VOC and Taskonomy demonstrate the efficacy of our approach over other widely used heuristic labeling strategies.

Perspective-aware Convolution for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.12938
  • repo_url: https://github.com/KenYu910645/perspective-aware-convolution
  • paper_authors: Jia-Quan Yu, Soo-Chang Pei
  • for: To improve the accuracy of monocular 3D object detection for autonomous driving.
  • methods: The paper proposes a perspective-aware convolutional layer that extracts features along the depth axis of each image pixel, capturing long-range dependencies and incorporating perspective information into the network architecture.
  • results: On the KITTI3D dataset the method improves 3D object detection, reaching 23.9% average precision on the easy benchmark, underscoring the value of modeling scene cues for depth inference.
    Abstract Monocular 3D object detection is a crucial and challenging task for autonomous driving vehicle, while it uses only a single camera image to infer 3D objects in the scene. To address the difficulty of predicting depth using only pictorial clue, we propose a novel perspective-aware convolutional layer that captures long-range dependencies in images. By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporate perspective information into the network architecture. We integrate our perspective-aware convolutional layer into a 3D object detector and demonstrate improved performance on the KITTI3D dataset, achieving a 23.9\% average precision in the easy benchmark. These results underscore the importance of modeling scene clues for accurate depth inference and highlight the benefits of incorporating scene structure in network design. Our perspective-aware convolutional layer has the potential to enhance object detection accuracy by providing more precise and context-aware feature extraction.

Panoptic-Depth Color Map for Combination of Depth and Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12937
  • repo_url: None
  • paper_authors: Jia-Quan Yu, Soo-Chang Pei
  • for: The paper proposes a new method that combines image segmentation and depth estimation to improve the accuracy and safety of scene understanding in autonomous driving.
  • methods: The network, Panoptic-DepthLab, adds a depth estimation branch to the segmentation network to predict the depth of each instance segment.
  • results: Evaluated on the Cityscapes dataset, the method achieves high-quality segmentation results with depth, visualized with a color map, demonstrating a new way to combine tasks and networks into a more comprehensive recognition result for safer autonomous driving.
    Abstract Image segmentation and depth estimation are crucial tasks in computer vision, especially in autonomous driving scenarios. Although these tasks are typically addressed separately, we propose an innovative approach to combine them in our novel deep learning network, Panoptic-DepthLab. By incorporating an additional depth estimation branch into the segmentation network, it can predict the depth of each instance segment. Evaluating on Cityscape dataset, we demonstrate the effectiveness of our method in achieving high-quality segmentation results with depth and visualize it with a color map. Our proposed method demonstrates a new possibility of combining different tasks and networks to generate a more comprehensive image recognition result to facilitate the safety of autonomous driving vehicles.

Towards Realistic Unsupervised Fine-tuning with CLIP

  • paper_url: http://arxiv.org/abs/2308.12919
  • repo_url: None
  • paper_authors: Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, Tieniu Tan
  • for: The work studies unsupervised fine-tuning of the CLIP vision-language model for downstream tasks, assuming the unlabeled data may contain out-of-distribution samples from unknown classes.
  • methods: It proposes a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO), which uses sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Besides the textual prompts, UEO also optimizes channel-wise affine transformations in CLIP's visual branch.
  • results: Extensive experiments across 15 domains and 4 types of prior knowledge show that UEO surpasses baseline methods in both generalization and out-of-distribution detection.
    Abstract The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. In this paper, we delve into a realistic unsupervised fine-tuning scenario by assuming that the unlabeled data might contain out-of-distribution samples from unknown classes. Furthermore, we emphasize the importance of simultaneously enhancing out-of-distribution detection capabilities alongside the recognition of instances associated with predefined class labels. To tackle this problem, we present a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompts, UEO also incorporates optimization of channel-wise affine transformations within the visual branch of CLIP. Through extensive experiments conducted across 15 domains and 4 different types of prior knowledge, we demonstrate that UEO surpasses baseline methods in terms of both generalization and out-of-distribution detection.
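An entropy objective in the spirit of UEO can be sketched by weighting samples with a confidence score so that confident ones have their conditional entropy minimized while less confident ones raise the marginal entropy. The exact confidence weighting in the paper may differ, so treat the function below as an illustrative variant.

```python
import math
import torch
import torch.nn.functional as F

def ueo_style_objective(logits, tau=1.0):
    """Illustrative confidence-weighted entropy objective.

    logits: (batch, n_classes) zero-shot classification logits, e.g. from CLIP prompts
    """
    probs = F.softmax(logits / tau, dim=-1)
    log_probs = F.log_softmax(logits / tau, dim=-1)
    ent = -(probs * log_probs).sum(dim=-1)            # per-sample entropy
    conf = 1.0 - ent / math.log(logits.shape[-1])     # crude confidence in [0, 1]

    # confident samples: minimize their own (conditional) entropy
    cond_term = (conf * ent).sum() / conf.sum().clamp_min(1e-8)

    # less confident samples: maximize the entropy of their weighted average prediction
    w = (1.0 - conf).unsqueeze(-1)
    marg = (w * probs).sum(dim=0) / w.sum().clamp_min(1e-8)
    marg_term = -(marg * torch.log(marg.clamp_min(1e-8))).sum()

    return cond_term - marg_term   # minimized by the optimizer
```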

Robot Pose Nowcasting: Forecast the Future to Improve the Present

  • paper_url: http://arxiv.org/abs/2308.12914
  • repo_url: None
  • paper_authors: Alessandro Simoni, Francesco Marchetti, Guido Borghi, Federico Becattini, Lorenzo Seidenari, Roberto Vezzani, Alberto Del Bimbo
  • for: To provide accurate vision-based 3D pose estimation of robots so that humans and machines can collaborate safely and effectively in Industry 4.0 scenarios.
  • methods: The system leverages depth data and improves its current pose estimation accuracy by jointly learning to forecast future poses, a capability the authors call pose nowcasting.
  • results: Experiments on two different datasets show state-of-the-art, real-time performance, confirming the method's validity in both robotic and human scenarios.
    Abstract In recent years, the effective and safe collaboration between humans and machines has gained significant importance, particularly in the Industry 4.0 scenario. A critical prerequisite for realizing this collaborative paradigm is precisely understanding the robot's 3D pose within its environment. Therefore, in this paper, we introduce a novel vision-based system leveraging depth data to accurately establish the 3D locations of robotic joints. Specifically, we prove the ability of the proposed system to enhance its current pose estimation accuracy by jointly learning to forecast future poses. Indeed, we introduce the concept of Pose Nowcasting, denoting the capability of a system to exploit the learned knowledge of the future to improve the estimation of the present. The experimental evaluation is conducted on two different datasets, providing state-of-the-art and real-time performance and confirming the validity of the proposed method on both the robotic and human scenarios.

SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

  • paper_url: http://arxiv.org/abs/2308.12910
  • repo_url: None
  • paper_authors: Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez
  • for: To predict, conditioned on a subject, all of its relations to other objects in a scene along with their locations.
  • methods: The paper proposes Subject-Conditional Relation Detection (SCoRD) with an autoregressive model that, given a subject, predicts its relations, objects, and object locations as a sequence of tokens.
  • results: On the proposed OIv6-SCoRD benchmark built from the Open Images dataset, the method produces a more exhaustive enumeration of relation-object pairs (83.8% recall@3 versus 49.75% for a recent scene-graph detector) and generalizes better by leveraging relation-object pairs automatically mined from textual captions during training.
    Abstract We propose Subject-Conditional Relation Detection SCoRD, where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that given a subject, it predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. Particularly, we obtain a recall@3 of 83.8% for our relation-object predictions compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging during training relation-object pairs obtained automatically from textual captions and for which no object-box annotations are available. Particularly, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we are able to obtain a recall@3 of 42.59% for relation-object pairs and 32.27% for their box locations.

Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings

  • paper_url: http://arxiv.org/abs/2308.12894
  • repo_url: None
  • paper_authors: Yuhe Liu, Chuanjian Liu, Kai Han, Quan Tang, Zengchang Qin
  • for: To improve the accuracy and efficiency of semantic segmentation by making deeper use of category semantics through explicit class embeddings.
  • methods: The paper proposes ECENet, in which class embeddings are obtained and enhanced explicitly while interacting with multi-stage image features; it also revisits the decoding process with an inverted information flow between segmentation masks and class embeddings, and adds a Feature Reconstruction module that combines intrinsic and diverse branches to balance diversity and redundancy in backbone features.
  • results: Experiments show that ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on the PASCAL-Context dataset.
    Abstract Semantic segmentation is a computer vision task that associates a label with each pixel in an image. Modern approaches tend to introduce class embeddings into semantic segmentation for deeply utilizing category semantics, and regard supervised class masks as final predictions. In this paper, we explore the mechanism of class embeddings and have an insight that more explicit and meaningful class embeddings can be generated based on class masks purposely. Following this observation, we propose ECENet, a new segmentation paradigm, in which class embeddings are obtained and enhanced explicitly during interacting with multi-stage image features. Based on this, we revisit the traditional decoding process and explore inverted information flow between segmentation masks and class embeddings. Furthermore, to ensure the discriminability and informativity of features from backbone, we propose a Feature Reconstruction module, which combines intrinsic and diverse branches together to ensure the concurrence of diversity and redundancy in features. Experiments show that our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on PASCAL-Context dataset. The code will be released at https://gitee.com/mindspore/models and https://github.com/Carol-lyh/ECENet.

Multi-stage feature decorrelation constraints for improving CNN classification performance

  • paper_url: http://arxiv.org/abs/2308.12880
  • repo_url: None
  • paper_authors: Qiuyu Zhu, Xuewen Zu, Chengfei Liu
  • for: To improve the classification accuracy of convolutional neural networks (CNNs).
  • methods: The paper proposes a Multi-stage Feature Decorrelation Loss (MFD Loss) that constrains the correlation of features at the front stages of the network, refining effective features and eliminating information redundancy; it is applied to multiple front layers and channels and trained jointly with the classification loss.
  • results: Experiments on several commonly used datasets and several typical CNNs show that Softmax Loss + MFD Loss significantly outperforms Softmax Loss alone, and combinations with other typical loss functions confirm its good generality.
    Abstract For the convolutional neural network (CNN) used for pattern classification, the training loss function is usually applied to the final output of the network, except for some regularization constraints on the network parameters. However, with the increasing of the number of network layers, the influence of the loss function on the network front layers gradually decreases, and the network parameters tend to fall into local optimization. At the same time, it is found that the trained network has significant information redundancy at all stages of features, which reduces the effectiveness of feature mapping at all stages and is not conducive to the change of the subsequent parameters of the network in the direction of optimality. Therefore, it is possible to obtain a more optimized solution of the network and further improve the classification accuracy of the network by designing a loss function for restraining the front stage features and eliminating the information redundancy of the front stage features .For CNN, this article proposes a multi-stage feature decorrelation loss (MFD Loss), which refines effective features and eliminates information redundancy by constraining the correlation of features at all stages. Considering that there are many layers in CNN, through experimental comparison and analysis, MFD Loss acts on multiple front layers of CNN, constrains the output features of each layer and each channel, and performs supervision training jointly with classification loss function during network training. Compared with the single Softmax Loss supervised learning, the experiments on several commonly used datasets on several typical CNNs prove that the classification performance of Softmax Loss+MFD Loss is significantly better. Meanwhile, the comparison experiments before and after the combination of MFD Loss and some other typical loss functions verify its good universality.
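The idea of constraining feature correlation at front stages can be illustrated with a penalty on the off-diagonal of a channel correlation matrix, added to the classification loss. The sketch below is a plausible reading of such a decorrelation term, not the paper's exact formulation of MFD Loss.

```python
import torch

def feature_decorrelation_loss(feature_maps):
    """Sketch of a multi-stage feature decorrelation penalty: for each selected
    front stage, penalize off-diagonal entries of the channel correlation matrix."""
    loss = 0.0
    for feat in feature_maps:                      # each: (batch, channels, h, w)
        b, c = feat.shape[:2]
        x = feat.reshape(b, c, -1).mean(dim=-1)    # (batch, channels) pooled responses
        x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-6)
        corr = (x.T @ x) / max(b - 1, 1)           # (channels, channels)
        off_diag = corr - torch.diag(torch.diagonal(corr))
        loss = loss + (off_diag ** 2).sum() / max(c * (c - 1), 1)
    return loss

# Joint training sketch:
# total_loss = cross_entropy + lambda_mfd * feature_decorrelation_loss(front_stage_feats)
```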

cs.AI - 2023-08-25

NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes

  • paper_url: http://arxiv.org/abs/2308.12967
  • repo_url: https://github.com/zubair-irshad/NeO-360
  • paper_authors: Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Vitor Guizilini, Thomas Kollar, Adrien Gaidon, Zsolt Kira, Rares Ambrus
  • for: To propose a generalizable method for synthesizing 360-degree outdoor scenes from a single or a few posed RGB images.
  • methods: NeO 360 captures the distribution of complex real-world outdoor 3D scenes and uses a hybrid image-conditional triplanar representation that can be queried from any world point.
  • results: On the proposed challenging 360-degree unbounded dataset, NeRDS 360, NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis on new views and novel scenes, while also offering editing and composition capabilities.
    Abstract Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360{\deg} scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360{\deg} unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: https://zubair-irshad.github.io/projects/neo360.html

DLIP: Distilling Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.12956
  • repo_url: None
  • paper_authors: Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji
  • for: To make vision-language pre-training (VLP) models practical to deploy by improving their efficiency while preserving performance.
  • methods: The paper compresses VLP models through knowledge distillation, analyzing the distillation from multiple dimensions such as the architectural characteristics of different modules and the information transfer of different modalities.
  • results: The proposed Distilling Language-Image Pre-training (DLIP) framework achieves a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, with comparable or better performance, and retains more than 95% of the performance with 22.4% of the parameters and 24.8% of the FLOPs while accelerating inference by 2.7x.
    Abstract Vision-Language Pre-training (VLP) shows remarkable progress with the assistance of extremely heavy parameters, which challenges deployment in real applications. Knowledge distillation is well recognized as the essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation are still not yet explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect the model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP model. Experimental results reveal that DLIP can achieve a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks, e.g., image-text retrieval, image captioning and visual question answering. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while achieving comparable or better performance. Furthermore, DLIP succeeds in retaining more than 95% of the performance with 22.4% parameters and 24.8% FLOPs compared to the teacher model and accelerates inference speed by 2.7x.
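One standard ingredient of such compression is a temperature-scaled distillation term between teacher and student logits. The sketch below shows only that generic term and does not capture DLIP's module-wise and multimodal distillation choices.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic temperature-scaled logit distillation term (illustrative only)."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student distributions, scaled by t^2
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)
```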

Low-count Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.12925
  • repo_url: None
  • paper_authors: Philipp Renz, Kurt Cutajar, Niall Twomey, Gavin K. C. Cheung, Hanting Xie
  • for: To address time series anomaly detection in low-count data settings, where signal-to-noise ratios are low and performance is non-uniform.
  • methods: The paper introduces a novel generative procedure for creating benchmark datasets of low-count time series with anomalous segments, and combines theoretical and empirical analysis to explain why widely used algorithms struggle with the distribution overlap between normal and anomalous segments.
  • results: The analysis shows that anomaly score smoothing consistently improves detection performance, and the practical utility of this recommendation is validated on a real-world dataset of retail store sales.
    Abstract Low-count time series describe sparse or intermittent events, which are prevalent in large-scale online platforms that capture and monitor diverse data types. Several distinct challenges surface when modelling low-count time series, particularly low signal-to-noise ratios (when anomaly signatures are provably undetectable), and non-uniform performance (when average metrics are not representative of local behaviour). The time series anomaly detection community currently lacks explicit tooling and processes to model and reliably detect anomalies in these settings. We address this gap by introducing a novel generative procedure for creating benchmark datasets comprising of low-count time series with anomalous segments. Via a mixture of theoretical and empirical analysis, our work explains how widely-used algorithms struggle with the distribution overlap between normal and anomalous segments. In order to mitigate this shortcoming, we then leverage our findings to demonstrate how anomaly score smoothing consistently improves performance. The practical utility of our analysis and recommendation is validated on a real-world dataset containing sales data for retail stores.
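Anomaly score smoothing, the paper's main practical recommendation, can be as simple as a moving average over the raw scores. The window length and the Poisson stand-in data below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def smooth_anomaly_scores(scores, window=5):
    """Centered moving-average smoothing of per-timestep anomaly scores."""
    kernel = np.ones(window) / window
    # 'same' keeps the output aligned with the input series
    return np.convolve(scores, kernel, mode="same")

# usage sketch
raw = np.random.poisson(lam=0.3, size=200).astype(float)  # stand-in for low-count scores
smoothed = smooth_anomaly_scores(raw, window=7)
```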

Evaluating the Vulnerabilities in ML systems in terms of adversarial attacks

  • paper_url: http://arxiv.org/abs/2308.12918
  • repo_url: None
  • paper_authors: John Harshith, Mantej Singh Gill, Madhan Jothimani
  • for: This research explores the latest adversarial attack methods and their impact on current deep learning cyber defense systems.
  • methods: The research uses Randomized and adversarial examples to examine the influence of vulnerabilities.
  • results: The study finds that Randomized examples may lead to the creation of vulnerabilities, while adversarial examples may exacerbate them. Additionally, the research discusses the ethical implications of these vulnerabilities.
    Abstract There have been recent adversarial attacks that are difficult to find. These new adversarial attacks methods may pose challenges to current deep learning cyber defense systems and could influence the future defense of cyberattacks. The authors focus on this domain in this research paper. They explore the consequences of vulnerabilities in AI systems. This includes discussing how they might arise, differences between randomized and adversarial examples and also potential ethical implications of vulnerabilities. Moreover, it is important to train the AI systems appropriately when they are in testing phase and getting them ready for broader use.

Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI

  • paper_url: http://arxiv.org/abs/2308.12915
  • repo_url: None
  • paper_authors: Yuqian Sun, Zhouyi Li, Ke Fang, Chang Hee Lee, Ali Asadipour
  • for: The paper is written to explore the potential of using advanced AI tools like GPT-4 and Stable Diffusion to create an AI-native game that blends interactive narrative and text-to-image transformation, and to enhance the narrative game genre with AI-generated content.
  • methods: The paper uses a game called “1001 Nights” as a case study to demonstrate the use of AI tools in game development. The game features a protagonist, Shahrzad, who is driven by a large language model and can realize words and stories in her world through conversation with the AI King. The player can steer the conversation towards specific keywords, which become battle equipment in the game.
  • results: The paper presents the results of the second iteration of the game, which challenges the conventional border between the game world and reality through a dual perspective. The game allows the player to collaborate with AI to craft narratives and shape the game world, and explores the technical and design elements of implementing such a game.
    Abstract In this paper, we present "1001 Nights", an AI-native game that allows players lead in-game reality through co-created storytelling with the character driven by large language model. The concept is inspired by Wittgenstein's idea of the limits of one's world being determined by the bounds of their language. Using advanced AI tools like GPT-4 and Stable Diffusion, the second iteration of the game enables the protagonist, Shahrzad, to realize words and stories in her world. The player can steer the conversation with the AI King towards specific keywords, which then become battle equipment in the game. This blend of interactive narrative and text-to-image transformation challenges the conventional border between the game world and reality through a dual perspective. We focus on Shahrzad, who seeks to alter her fate compared to the original folklore, and the player, who collaborates with AI to craft narratives and shape the game world. We explore the technical and design elements of implementing such a game with an objective to enhance the narrative game genre with AI-generated content and to delve into AI-native gameplay possibilities.

CDAN: Convolutional Dense Attention-guided Network for Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.12902
  • repo_url: None
  • paper_authors: Hossein Shakibania, Sina Raoufi, Hassan Khotanlou
  • for: This paper targets the enhancement of low-light images.
  • methods: It proposes the Convolutional Dense Attention-guided Network (CDAN), which integrates an autoencoder-based architecture with convolutional and dense blocks, an attention mechanism, and skip connections, followed by a dedicated post-processing stage that refines color balance and contrast.
  • results: Across multiple benchmark datasets, CDAN shows notable progress over state-of-the-art methods, effectively mitigating under-exposure and restoring textures and colors in diverse low-light scenarios.
    Abstract Low-light images, characterized by inadequate illumination, pose challenges of diminished clarity, muted colors, and reduced details. Low-light image enhancement, an essential task in computer vision, aims to rectify these issues by improving brightness, contrast, and overall perceptual quality, thereby facilitating accurate analysis and interpretation. This paper introduces the Convolutional Dense Attention-guided Network (CDAN), a novel solution for enhancing low-light images. CDAN integrates an autoencoder-based architecture with convolutional and dense blocks, complemented by an attention mechanism and skip connections. This architecture ensures efficient information propagation and feature learning. Furthermore, a dedicated post-processing phase refines color balance and contrast. Our approach demonstrates notable progress compared to state-of-the-art results in low-light image enhancement, showcasing its robustness across a wide range of challenging scenarios. Our model performs remarkably on benchmark datasets, effectively mitigating under-exposure and proficiently restoring textures and colors in diverse low-light scenarios. This achievement underscores CDAN's potential for diverse computer vision tasks, notably enabling robust object detection and recognition in challenging low-light conditions.
    摘要 低光照图像,受到不足照明的影响,具有清晰度下降、颜色黯淡、细节缺失等问题。低光照图像增强是计算机视觉中的关键任务,旨在通过提高亮度、对比度和总体品质来促进正确的分析和解释。本文介绍了一种新的低光照图像增强方法——卷积密集注意力引导网络(CDAN)。CDAN 将基于自编码器的架构与卷积块、密集块相结合,并加入了注意力机制和跳跃连接。这种架构确保了信息传递的高效和特征学习。此外,专门的后处理阶段进一步调整颜色均衡和对比度。我们的方法在低光照图像增强中显示了明显的进步,与当前最佳结果相比,在多种复杂的场景中表现出了稳定和可靠的特点。我们的模型在标准 benchmark 数据集上表现出色,有效缓解了曝光不足的问题,并在多种低光照场景中出色地恢复了纹理和颜色。这一成就表明 CDAN 在计算机视觉任务中具有广泛的潜力,特别是在低光照条件下进行稳定和准确的对象检测和识别。
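To make the architectural idea above more concrete, the following PyTorch sketch shows a generic convolutional block combining channel attention with a skip connection. It is only an illustration of the kind of component the abstract describes, not the authors' released CDAN implementation; the layer sizes and the squeeze-and-excitation-style attention are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGuidedBlock(nn.Module):
    """Toy conv block with channel attention and a skip connection,
    loosely in the spirit of attention-guided enhancement networks."""
    def __init__(self, channels: int = 64, reduction: int = 8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Squeeze-and-excitation style channel attention (an assumption, not CDAN's exact design).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.convs(x)
        feat = feat * self.attn(feat)   # reweight channels by attention
        return x + feat                  # skip connection

if __name__ == "__main__":
    block = AttentionGuidedBlock(64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```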

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

  • paper_url: http://arxiv.org/abs/2308.12898
  • repo_url: https://github.com/wangfei-2019/snare
  • paper_authors: Fei Wang, Liang Ding, Jun Rao, Ye Liu, Li Shen, Changxing Ding
  • for: 本研究旨在探讨语义知识和语法结构是否可以在视觉语言关联(VLP)中提取,以及这些语言知识如何影响或改善多模态对应。
  • methods: 我们设计了首个大规模多模态对齐探测 benchmark,名为 SNARE,以检测关键的语言组件,如词汇、语义和句法知识。我们对五种先进的 VLP 模型进行了整体分析,发现这些模型:i) 对复杂句法结构不敏感,主要依赖内容词来理解句子;ii) 对句子与否定逻辑组合的理解有限;iii) 难以在视觉信息中判断动作或空间关系的存在,也难以验证三元组组合的正确性。
  • results: 我们的研究发现,VLP 模型在复杂句法结构以及句子与否定逻辑的组合上存在困难,并且难以在视觉信息中定位动作或空间关系。
    Abstract The multimedia community has shown a significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among them, the visual-language pertaining (VLP) is, currently, the most captivating topic. However, there have been few endeavors dedicated to the exploration of 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impact or enhance the multimodal alignment. In response, here we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release the SNARE, the first large-scale multimodal alignment probing benchmark, to detect the vital linguistic components, e.g., lexical, semantic, and syntax knowledge, containing four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on our proposed probing benchmarks, our holistic analyses of five advanced VLP models illustrate that the VLP model: i) shows insensitivity towards complex syntax structures and relies on content words for sentence comprehension; ii) demonstrates limited comprehension of combinations between sentences and negations; iii) faces challenges in determining the presence of actions or spatial relationships within visual information and struggles with verifying the correctness of triple combinations. We make our benchmark and code available at \url{https://github.com/WangFei-2019/SNARE/}.
    摘要 多媒体社区对利用多模态预训练神经网络模型来感知和表示物理世界表现出浓厚兴趣,其中最受关注的话题当属视觉-语言预训练(VLP)。然而,很少有工作专门探讨以下两个问题:一是在 VLP 过程中能否提取关键的语言知识(如语义和句法),二是这些语言知识如何影响或增强多模态对齐。为此,我们旨在阐明包括语义表达和句法结构在内的全面语言知识对多模态对齐的影响。具体而言,我们设计并发布了首个大规模多模态对齐探测基准 SNARE,用于检测关键语言组件(如词汇、语义和句法知识),包含语义结构、否定逻辑、属性归属和关系组合四项任务。基于所提出的探测基准,我们对五种先进 VLP 模型进行了整体分析,结果表明:i) VLP 模型对复杂句法结构不敏感,主要依赖内容词理解句子;ii) 对句子与否定的组合理解有限;iii) 难以在视觉信息中判断动作或空间关系的存在,也难以验证三元组组合的正确性。我们的基准和代码已发布于 \url{https://github.com/WangFei-2019/SNARE/}。

Large Language Models Vote: Prompting for Rare Disease Identification

  • paper_url: http://arxiv.org/abs/2308.12890
  • repo_url: https://github.com/oniani/llms-vote
  • paper_authors: David Oniani, Jordan Hilsman, Hang Dong, Fengyi Gao, Shiven Verma, Yanshan Wang
  • for: 该论文旨在提出一种灵活的提示方法,以提高基于大语言模型(LLM)的少样本学习(FSL)任务的性能。
  • methods: 该方法称为模型投票提示(MVP),它通过提交多个LLM执行同一任务,并将其结果进行多数投票来提高任务的性能。
  • results: 在一次性(one-shot)稀有疾病识别和分类任务上,MVP 方法取得了更好的结果,优于集成中的任意单个模型。此外,作者还发布了一个新的稀有疾病数据集,供同意 MIMIC-IV 数据使用协议(DUA)的研究者使用。
    Abstract The emergence of generative Large Language Models (LLMs) emphasizes the need for accurate and efficient prompting approaches. LLMs are often applied in Few-Shot Learning (FSL) contexts, where tasks are executed with minimal training data. FSL has become popular in many Artificial Intelligence (AI) subdomains, including AI for health. Rare diseases, affecting a small fraction of the population, inherently require FSL techniques due to limited data availability, though manual data collection and annotation is costly and time-consuming. In this paper, we propose Models-Vote Prompting (MVP), a flexible prompting approach for improving the performance of LLM queries in FSL settings. MVP works by prompting numerous LLMs to perform the same tasks and then conducting a majority vote on the resulting outputs. This method achieves improved results to any one model in the ensemble on one-shot rare disease identification and classification tasks. We also release a novel rare disease dataset for FSL, available to those who agreed to the MIMIC-IV Data Use Agreement (DUA). Furthermore, in using MVP, each model is prompted multiple times, substantially increasing the time needed for manual annotation, and to address this, we assess the feasibility of using JSON for automating generative LLM evaluation.
    摘要 生成式大语言模型(LLM)的出现凸显了对准确且高效的提示方法的需求。LLM 经常被应用于少样本学习(FSL)场景中,即在极少训练数据的条件下执行任务。FSL 已在包括医疗 AI 在内的许多人工智能子领域流行起来。稀有疾病只影响很小一部分人群,由于数据有限,本质上需要 FSL 技术,而人工收集和标注数据既昂贵又耗时。在这篇论文中,我们提出了模型投票提示(Models-Vote Prompting,MVP),一种灵活的提示方法,用于提升 LLM 查询在 FSL 设置下的性能。MVP 让多个 LLM 执行同一任务,然后对输出结果进行多数投票。在一次性稀有疾病识别和分类任务上,这种方法的结果优于集成中的任何单个模型。我们还发布了一个面向 FSL 的新稀有疾病数据集,供同意 MIMIC-IV 数据使用协议(DUA)的研究者使用。此外,使用 MVP 时每个模型都会被多次提示,这会显著增加人工标注所需的时间;为此,我们评估了使用 JSON 自动化生成式 LLM 评估的可行性。
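The voting step described in the abstract can be illustrated with a short, self-contained sketch. The model callables below are hypothetical stand-ins for real LLM API calls, and the normalization rule is an assumption; the idea is simply to collect one answer per model and keep the majority.

```python
from collections import Counter
from typing import Callable, List

def normalize(text: str) -> str:
    # Collapse superficial variation before voting (placeholder rule).
    return text.strip().lower()

def models_vote(prompt: str, models: List[Callable[[str], str]]) -> str:
    """Query several LLMs with the same prompt and return the majority answer.
    A simplified sketch of ensemble voting; ties break on first occurrence."""
    answers = [normalize(m(prompt)) for m in models]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

if __name__ == "__main__":
    # Hypothetical stand-ins for real LLM calls.
    fake_models = [lambda p: "Yes", lambda p: "yes", lambda p: "No"]
    print(models_vote("Does this note mention a rare disease?", fake_models))  # "yes"
```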

Inducing Causal Structure for Abstractive Text Summarization

  • paper_url: http://arxiv.org/abs/2308.12888
  • repo_url: None
  • paper_authors: Lu Chen, Ruqing Zhang, Wei Huang, Wei Chen, Jiafeng Guo, Xueqi Cheng
  • for: 本研究旨在提高数据驱动抽象摘要模型的效果,通过强调 causal 关系而不是相关性。
  • methods: 我们引入了结构因果模型(SCM),假设文档和摘要中存在若干隐含的因果因素和非因果因素,用于刻画文档与摘要的内容和风格。我们证明了在一定条件下,可以通过拟合观测到的训练数据来识别这些隐含因素。在此基础上,我们提出了因果启发的序列到序列模型(CI-Seq2Seq),用于学习能够模拟因果因素的因果表示,从而为摘要生成获取因果信息。
  • results: 我们在两个常用的文本摘要数据集上进行了实验,结果显示了我们的方法的优势。
    Abstract The mainstream of data-driven abstractive summarization models tends to explore the correlations rather than the causal relationships. Among such correlations, there can be spurious ones which suffer from the language prior learned from the training corpus and therefore undermine the overall effectiveness of the learned model. To tackle this issue, we introduce a Structural Causal Model (SCM) to induce the underlying causal structure of the summarization data. We assume several latent causal factors and non-causal factors, representing the content and style of the document and summary. Theoretically, we prove that the latent factors in our SCM can be identified by fitting the observed training data under certain conditions. On the basis of this, we propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) to learn the causal representations that can mimic the causal factors, guiding us to pursue causal information for summary generation. The key idea is to reformulate the Variational Auto-encoder (VAE) to fit the joint distribution of the document and summary variables from the training corpus. Experimental results on two widely used text summarization datasets demonstrate the advantages of our approach.
    摘要 主流的数据驱动抽象摘要模型往往探索相关性而非因果关系。其中一些相关性可能受到训练语料中语言先验的影响,从而削弱所学模型的整体效果。为解决这个问题,我们引入结构因果模型(SCM)来刻画摘要数据的底层因果结构。我们假设若干隐含的因果因素和非因果因素,分别表示文档和摘要的内容与风格。理论上,我们证明了在一定条件下,SCM 中的隐含因素可以通过拟合观测到的训练数据来识别。在此基础上,我们提出因果启发的序列到序列模型(CI-Seq2Seq)来学习因果表示,以便在摘要生成中追踪因果信息。其关键思想是改造变分自编码器(VAE),使其拟合训练语料中文档与摘要变量的联合分布。在两个常用的文本摘要数据集上的实验结果表明了我们方法的优势。

cs.CL - 2023-08-25

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

  • paper_url: http://arxiv.org/abs/2308.12966
  • repo_url: https://github.com/qwenlm/qwen-vl
  • paper_authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou
  • for: 这篇论文旨在描述一种大规模视语言模型系列(Qwen-VL),用于理解文本和图像。
  • methods: 该系列包括 Qwen-VL 和 Qwen-VL-Chat 两个模型;论文介绍了它们的模型架构、训练方式、能力与性能。
  • results: 论文表明,Qwen-VL 系列模型在图像描述、问答、视觉定位和灵活交互等任务中表现出色,并在零样本图像描述、视觉与文档视觉问答以及视觉定位(grounding)等评测中超越现有的大型视觉-语言模型(LVLM)。
    Abstract We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate the Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
    摘要 我们介绍 Qwen-VL 系列,这是一组旨在感知和理解文本与图像的大规模视觉-语言模型。该系列包括 Qwen-VL 和 Qwen-VL-Chat 两个模型,它们在图像描述、问答、视觉定位和灵活交互等任务中表现出色。评估范围涵盖零样本图像描述、视觉与文档视觉问答以及视觉定位(grounding)等多种任务。我们展示了 Qwen-VL 超越现有的大型视觉-语言模型(LVLM),并介绍了它们的架构、训练、能力和性能,强调其对推进多模态人工智能的贡献。代码、演示和模型可在 https://github.com/QwenLM/Qwen-VL 获取。

Code Llama: Open Foundation Models for Code

  • paper_url: http://arxiv.org/abs/2308.12950
  • repo_url: None
  • paper_authors: Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve
  • for: 这个论文是为了推出一个基于 Llama 2 的大语言模型 Code Llama,用于代码处理任务。
  • methods: 该论文以 Llama 2 为基础,通过不同方向的专门化(如 Python 特化)和指令跟随微调来提升模型在代码任务上的能力。
  • results: 论文表明,Code Llama 在多个代码测试 benchmark 上达到了当前开放模型的最佳性能,并且在某些情况下超过了 Llama 2 70B 的性能。
    Abstract We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
    摘要 我们发布了 Code Llama,这是一系列基于 Llama 2 的大型代码语言模型,在开放模型中达到了最先进的性能,支持代码填充(infilling)、大规模输入上下文,并具备针对编程任务的零样本指令跟随能力。我们提供多个版本以覆盖广泛的应用:基础模型(Code Llama)、Python 特化模型(Code Llama - Python)以及指令跟随模型(Code Llama - Instruct),每种均有 7B、13B 和 34B 参数规模。所有模型均在 16k token 的序列上训练,并在长达 100k token 的输入上表现出改进。7B 和 13B 的 Code Llama 及 Code Llama - Instruct 版本支持基于上下文内容的填充。Code Llama 在多个代码基准上达到了开放模型中的最先进水平,在 HumanEval 和 MBPP 上的得分分别高达 53% 和 55%。值得注意的是,Code Llama - Python 7B 在 HumanEval 和 MBPP 上超过了 Llama 2 70B,并且我们的所有模型在 MultiPL-E 上都优于其他公开可用的模型。我们以允许研究和商业用途的宽松许可发布 Code Llama。
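As a usage sketch (not taken from the paper), the snippet below loads a Code Llama checkpoint with Hugging Face transformers and completes a Python function. The hub identifier, hardware setup, and generation settings are assumptions; running it requires accepting the model license and having the weights available.

```python
# Minimal, hypothetical usage sketch of a Code Llama checkpoint via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed hub name; gated behind the model license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs a GPU + accelerate

prompt = "def fibonacci(n: int) -> int:\n    # return the n-th Fibonacci number\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```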

Beyond Document Page Classification: Design, Datasets, and Challenges

  • paper_url: http://arxiv.org/abs/2308.12896
  • repo_url: None
  • paper_authors: Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko, Marie-Francine Moens
  • for: 本研究提出了将文档分类 benchmarking带到实际应用中,包括数据的性质(多通道、多页、多产业)和分类任务(多页文档、页流、文档集 classification)。
  • methods: 本研究识别了公共多页文档分类数据集的缺失,正规化了应用场景中的分类任务,并强调了完整文档表示的重要性。
  • results: 对所提出的多页文档分类数据集的实验研究表明,现有 benchmark 已经失去相关性,需要更新以评估实际中自然出现的完整文档。这也要求更成熟的评估方法,包括校准评估、推理的时间-内存复杂度评估,以及多种贴近实际的分布偏移(例如,原生数字文档与扫描噪声、页面顺序变化)。
    Abstract This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested ($X$: multi-channel, multi-paged, multi-industry; $Y$: class distributions and label set variety) and in classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.}
    摘要 本文强调需要让文档分类基准测试更贴近实际应用,这既体现在被测数据的性质上(多通道、多页、多行业;类别分布与标签集的多样性),也体现在所考虑的分类任务上(多页文档、页面流以及文档集的分类等)。我们指出了公开多页文档分类数据集的缺失,形式化了应用场景中出现的不同分类任务,并论证了高效的多页文档表示的价值。在所提出的多页文档分类数据集上的实验研究表明,现有基准已不再适用,需要更新以评估实际中自然出现的完整文档。这也要求更成熟的评估方法,包括校准评估、推理的时间-内存复杂度,以及多种贴近实际的分布偏移(例如,原生数字文档与扫描噪声、页面顺序变化)。最后,本文给出了未来改进的具体方向。

cs.LG - 2023-08-25

NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory

  • paper_url: http://arxiv.org/abs/2308.12970
  • repo_url: None
  • paper_authors: Navami Kairanda, Marc Habermann, Christian Theobalt, Vladislav Golyanik
  • for: 这篇论文旨在提出一种新的、物理上合理的布料模拟方法,利用 Kirchhoff-Love 薄壳理论来描述布料表面及其动态变形。
  • methods: 该方法用神经网络权重来编码表面的演化(即神经变形场,NDF),并依据非线性 Kirchhoff-Love 薄壳理论对 NDF 的演化进行监督,同时支持施加硬边界条件。
  • results: 实验结果表明,这种布料模拟方法内存占用低且可微,支持在任意空间和时间分辨率下查询表面状态,并可用于材质插值和模拟编辑等应用。
    Abstract Cloth simulation is an extensively studied problem, with a plethora of solutions available in computer graphics literature. Existing cloth simulators produce realistic cloth deformations that obey different types of boundary conditions. Nevertheless, their operational principle remains limited in several ways: They operate on explicit surface representations with a fixed spatial resolution, perform a series of discretised updates (which bounds their temporal resolution), and require comparably large amounts of storage. Moreover, back-propagating gradients through the existing solvers is often not straightforward, which poses additional challenges when integrating them into modern neural architectures. In response to the limitations mentioned above, this paper takes a fundamentally different perspective on physically-plausible cloth simulation and re-thinks this long-standing problem: We propose NeuralClothSim, i.e., a new cloth simulation approach using thin shells, in which surface evolution is encoded in neural network weights. Our memory-efficient and differentiable solver operates on a new continuous coordinate-based representation of dynamic surfaces, i.e., neural deformation fields (NDFs); it supervises NDF evolution with the rules of the non-linear Kirchhoff-Love shell theory. NDFs are adaptive in the sense that they 1) allocate their capacity to the deformation details as the latter arise during the cloth evolution and 2) allow surface state queries at arbitrary spatial and temporal resolutions without retraining. We show how to train our NeuralClothSim solver while imposing hard boundary conditions and demonstrate multiple applications, such as material interpolation and simulation editing. The experimental results highlight the effectiveness of our formulation and its potential impact.
    摘要 布料模拟是一个被广泛研究的问题,计算机图形学文献中已有大量解决方案。现有的布料模拟器可以生成满足多种边界条件的真实布料形变,但其工作原理仍存在一些局限:它们基于固定空间分辨率的显式表面表示,执行一系列离散化的更新(这限制了时间分辨率),并需要相对较大的存储量。此外,通过现有求解器反向传播梯度通常并不容易,这给将其集成到现代神经网络架构中带来了额外挑战。针对上述局限,本文以一种全新的视角重新思考物理上合理的布料模拟问题:我们提出 NeuralClothSim,一种基于薄壳的新型布料模拟方法,其中表面的演化被编码在神经网络权重中。我们的求解器内存占用低且可微,作用于一种新的、基于连续坐标的动态表面表示,即神经变形场(NDF);它依据非线性 Kirchhoff-Love 薄壳理论来监督 NDF 的演化。NDF 具有自适应性:1)随着布料演化中形变细节的出现而分配表示能力;2)无需重新训练即可在任意空间和时间分辨率下查询表面状态。我们展示了如何在施加硬边界条件的情况下训练 NeuralClothSim 求解器,并演示了材质插值和模拟编辑等多种应用。实验结果表明了我们方法的有效性及其潜在影响。
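A neural deformation field, as used in the abstract above, can be pictured as a coordinate MLP that maps material coordinates and time to a displacement. The sketch below is a toy stand-in under that reading, with the thin-shell physics loss omitted; the layer sizes and activation are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NeuralDeformationField(nn.Module):
    """Toy coordinate network: (u, v, t) -> 3D displacement of a cloth mid-surface.
    In a physics-informed setting, a thin-shell energy term would supervise it;
    that part is omitted here."""
    def __init__(self, hidden: int = 128, layers: int = 4):
        super().__init__()
        dims = [3] + [hidden] * layers + [3]
        blocks = []
        for i in range(len(dims) - 1):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                blocks.append(nn.Softplus(beta=100))  # smooth activation keeps the field differentiable
        self.net = nn.Sequential(*blocks)

    def forward(self, uvt: torch.Tensor) -> torch.Tensor:
        return self.net(uvt)

if __name__ == "__main__":
    ndf = NeuralDeformationField()
    coords = torch.rand(1024, 3)    # (u, v, t) samples in [0, 1]^3
    displacement = ndf(coords)       # query at arbitrary spatial/temporal resolution
    print(displacement.shape)        # torch.Size([1024, 3])
```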

NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes

  • paper_url: http://arxiv.org/abs/2308.12967
  • repo_url: https://github.com/zubair-irshad/NeO-360
  • paper_authors: Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Vitor Guizilini, Thomas Kollar, Adrien Gaidon, Zsolt Kira, Rares Ambrus
  • for: 这篇论文旨在解决现有新视角合成方法需要对每个场景进行代价高昂的多视角优化的问题,使其能够应用于真实的、无边界的城市环境——在这种场景中,感兴趣的物体或背景往往只能从很少的视角观察到。
  • methods: 我们提出了一种名为 NeO 360 的新方法,利用神经场实现户外场景的稀疏视角合成,能够从单张或少量带位姿的彩色图像重建 360 度场景。
  • results: 我们的实验表明,NeO 360 可以在 NeRDS 360 提出的挑战性 datasets 上表现出色,并且在新的视角和原始场景中都能够得到高质量的结果。此外,NeO 360 还提供了编辑和组合功能。
    Abstract Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360{\deg} scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360{\deg} unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: https://zubair-irshad.github.io/projects/neo360.html
    摘要 最近的隐式神经表示在新视角合成方面取得了出色的成果。然而,现有方法需要针对每个场景、利用大量视角进行代价高昂的优化,这限制了它们在真实、无边界的城市环境中的应用,因为在这类场景中,感兴趣的物体或背景往往只能从极少的视角观察到。为应对这一挑战,我们提出了一种名为 NeO 360 的新方法,即面向户外场景稀疏视角合成的神经场。NeO 360 是一种可泛化的方法,能够从单张或少量带位姿的 RGB 图像中重建 360 度场景。我们方法的核心在于刻画复杂的真实户外三维场景的分布,并使用一种可在任意世界坐标点查询的、以图像为条件的混合三平面表示。该表示结合了体素表示与鸟瞰(BEV)表示的优点,比二者单独使用更有效、更具表达力。NeO 360 的表示使我们能够从大量无边界三维场景中学习,同时在推理时仅凭一张图像即可泛化到新视角和新场景。我们在所提出的具有挑战性的 360 度无边界数据集 NeRDS 360 上验证了该方法,结果表明 NeO 360 在新视角合成上优于当前最先进的可泛化方法,同时还提供编辑与组合能力。项目页面:https://zubair-irshad.github.io/projects/neo360.html

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.12968
  • repo_url: https://github.com/yuxinn-j/scenimefy
  • paper_authors: Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy
  • for: 从复杂的真实图像自动渲染高质量动漫场景。
  • methods: 使用结构保持 pseudo 对应数据引导学习,利用 CLIP 富有模型先验,并应用 segmentation-guided 数据选择,以提高准确性和细节。
  • results: 在感知质量和定量指标上均优于此前的最先进基线方法。
    Abstract Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts are still incompetent in achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the pure unsupervised setting. The pseudo data are derived uniquely from a semantic-constrained StyleGAN leveraging rich model priors like CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. Besides, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance.
    摘要 从复杂的真实图像自动渲染高质量动漫场景具有重要的实用价值。这一任务的挑战在于场景的复杂性、动漫风格的独特性,以及缺乏可用于弥合领域差距的高质量数据集。尽管已有一些值得肯定的尝试,此前的工作仍难以在保持语义一致、呈现明显风格化和细节精细这几方面同时取得令人满意的结果。在本研究中,我们提出 Scenimefy,一种新颖的半监督图像到图像翻译框架来应对这些挑战。我们的方法利用结构一致的伪配对数据来引导学习,从而简化了纯无监督的设定。伪数据由一个受语义约束的 StyleGAN 生成,该模型利用了 CLIP 等丰富的模型先验。我们进一步采用分割引导的数据筛选来获得高质量的伪监督。此外,我们引入了逐 patch 的对比风格损失,以提升风格化程度和细节表现。我们还贡献了一个高分辨率动漫场景数据集,以促进后续研究。大量实验表明,我们的方法在感知质量和定量指标上均优于最先进的基线方法。

Dense Text-to-Image Generation with Attention Modulation

  • paper_url: http://arxiv.org/abs/2308.12964
  • repo_url: https://github.com/naver-ai/densediffusion
  • paper_authors: Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
  • for: 处理细致的文本描述,生成真实的图像。
  • methods: 利用预训练的文本到图像模型,通过 Layout 指导对象在具体的区域出现。
  • results: 不需要再训练或数据集,可以根据文本描述提高图像生成效果,并且与具体的 Layout 条件下的模型效果相似。
    Abstract Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions.
    摘要 现有的文本到图像扩散模型难以依据密集描述生成逼真的图像,这类描述中的每条文本提示都对图像的特定区域给出详细说明。为解决这一问题,我们提出 DenseDiffusion,一种无需训练的方法,使预训练的文本到图像模型能够处理密集描述,同时提供对场景布局的控制。我们首先分析了生成图像的布局与预训练模型中间注意力图之间的关系,然后提出一种注意力调制方法,根据布局引导让对象出现在指定区域。该方法无需额外微调或数据集,即可在自动评估和人工评估两方面提升密集描述下的图像生成效果。此外,我们还取得了与专门在布局条件下训练的模型相当的视觉质量。
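The core idea of attention modulation can be illustrated with a toy function that biases cross-attention logits toward user-specified regions before the softmax. This is a simplified reading of the abstract, not DenseDiffusion's actual modulation rule; the additive bias and its strength are assumptions.

```python
import torch

def modulated_cross_attention(q, k, v, region_mask, strength=2.0):
    """Toy layout-guided attention: add a positive bias where an image query falls inside
    the region associated with a text token, then renormalize with softmax.
    q: (Nq, d) image queries, k/v: (Nt, d) text keys/values,
    region_mask: (Nq, Nt) binary mask, 1 where a pixel belongs to a token's region."""
    d = q.shape[-1]
    scores = q @ k.t() / d ** 0.5              # (Nq, Nt) raw attention logits
    scores = scores + strength * region_mask   # encourage tokens to attend inside their regions
    attn = scores.softmax(dim=-1)
    return attn @ v

if __name__ == "__main__":
    Nq, Nt, d = 16, 4, 8
    out = modulated_cross_attention(
        torch.randn(Nq, d), torch.randn(Nt, d), torch.randn(Nt, d),
        region_mask=torch.randint(0, 2, (Nq, Nt)).float(),
    )
    print(out.shape)  # torch.Size([16, 8])
```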

DLIP: Distilling Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2308.12956
  • repo_url: None
  • paper_authors: Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji
  • for: 本研究旨在提出一种简单而高效的蒸馏语言-图像预训练框架(DLIP),以实现对 VLP 模型的高效压缩。
  • methods: 本研究从多个维度剖析模型蒸馏,包括不同模块的结构特征以及不同模态之间的信息传递方式。
  • results: 实验结果显示,DLIP 在图文检索、图像描述和视觉问答等多种跨模态任务中实现了最先进的精度/效率权衡。例如,DLIP 将 BLIP 压缩 1.9 倍,参数量从 213M 降至 108M,同时取得相当或更好的性能;此外,DLIP 仅用 22.4% 的参数和 24.8% 的 FLOPs 即可保留教师模型 95% 以上的性能,并将推理速度提升 2.7 倍。
    Abstract Vision-Language Pre-training (VLP) shows remarkable progress with the assistance of extremely heavy parameters, which challenges deployment in real applications. Knowledge distillation is well recognized as the essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation are still not yet explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect the model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP model. Experimental results reveal that DLIP can achieve a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks, e.g., image-text retrieval, image captioning and visual question answering. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while achieving comparable or better performance. Furthermore, DLIP succeeds in retaining more than 95% of the performance with 22.4% parameters and 24.8% FLOPs compared to the teacher model and accelerates inference speed by 2.7x.
    摘要 视觉-语言预训练(VLP)在极其庞大的参数量支撑下取得了显著进展,但这也给实际应用中的部署带来挑战。知识蒸馏被公认为模型压缩的关键手段。然而,现有的知识蒸馏技术缺乏对 VLP 的深入研究与分析,面向 VLP 的蒸馏实践指南也尚未被探索。在这篇论文中,我们提出了 DLIP,一个简单而高效的蒸馏语言-图像预训练框架,并借此研究如何蒸馏出轻量级的 VLP 模型。具体而言,我们从多个维度剖析模型蒸馏,例如不同模块的结构特征以及不同模态之间的信息传递。我们开展了全面的实验,并给出了蒸馏轻量且高性能 VLP 模型的经验洞见。实验结果表明,DLIP 能在图文检索、图像描述和视觉问答等多种跨模态任务中实现最先进的精度/效率权衡。例如,DLIP 将 BLIP 压缩 1.9 倍,参数量从 213M 降至 108M,同时取得相当或更好的性能。此外,DLIP 仅用 22.4% 的参数和 24.8% 的 FLOPs 即可保留教师模型 95% 以上的性能,并将推理速度提升 2.7 倍。
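For readers unfamiliar with distillation, the snippet below shows a generic objective of the kind such frameworks build on: a KL term between temperature-softened logits plus a feature-matching term. It is a textbook sketch, not DLIP's actual loss; the temperature, weighting, and the assumption that student and teacher feature dimensions already match are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Generic distillation objective: KL between temperature-softened logits plus an MSE
    feature-matching term. In practice a projection layer is used when feature dims differ."""
    t = temperature
    soft_kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    feat_mse = F.mse_loss(student_feat, teacher_feat)
    return alpha * soft_kl + (1.0 - alpha) * feat_mse

if __name__ == "__main__":
    s_logits, t_logits = torch.randn(8, 100), torch.randn(8, 100)
    s_feat, t_feat = torch.randn(8, 256), torch.randn(8, 256)
    print(distillation_loss(s_logits, t_logits, s_feat, t_feat).item())
```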

BridgeData V2: A Dataset for Robot Learning at Scale

  • paper_url: http://arxiv.org/abs/2308.12952
  • repo_url: https://github.com/rail-berkeley/BridgeData-V2
  • paper_authors: Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, Sergey Levine
  • For: The paper is written for researchers in the field of robotic manipulation, particularly those interested in scalable robot learning.
  • Methods: The paper introduces BridgeData V2, a large and diverse dataset of robotic manipulation behaviors designed to facilitate research on scalable robot learning. The dataset contains 60,096 trajectories collected across 24 environments on a publicly available low-cost robot, and is compatible with a wide variety of open-vocabulary, multi-task learning methods conditioned on goal images or natural language instructions.
  • Results: The paper reports results from training six state-of-the-art imitation learning and offline reinforcement learning methods on the BridgeData V2 dataset, and finds that they succeed on a suite of tasks requiring varying amounts of generalization. Performance improves with more data and higher-capacity models, and training on a greater variety of skills leads to improved generalization.
    Abstract We introduce BridgeData V2, a large and diverse dataset of robotic manipulation behaviors designed to facilitate research on scalable robot learning. BridgeData V2 contains 60,096 trajectories collected across 24 environments on a publicly available low-cost robot. BridgeData V2 provides extensive task and environment variability, leading to skills that can generalize across environments, domains, and institutions, making the dataset a useful resource for a broad range of researchers. Additionally, the dataset is compatible with a wide variety of open-vocabulary, multi-task learning methods conditioned on goal images or natural language instructions. In our experiments, we train 6 state-of-the-art imitation learning and offline reinforcement learning methods on our dataset, and find that they succeed on a suite of tasks requiring varying amounts of generalization. We also demonstrate that the performance of these methods improves with more data and higher capacity models, and that training on a greater variety of skills leads to improved generalization. By publicly sharing BridgeData V2 and our pre-trained models, we aim to accelerate research in scalable robot learning methods. Project page at https://rail-berkeley.github.io/bridgedata
    摘要 我们介绍 BridgeData V2,一个大型且多样化的机器人操作行为数据集,旨在促进可扩展机器人学习的研究。BridgeData V2 包含在 24 个环境中、基于一款公开可得的低成本机器人采集的 60,096 条轨迹。该数据集提供了广泛的任务和环境多样性,使学到的技能能够跨环境、跨领域、跨机构泛化,从而成为面向广大研究者的有用资源。此外,该数据集兼容多种以目标图像或自然语言指令为条件的开放词汇、多任务学习方法。在实验中,我们在该数据集上训练了 6 种最先进的模仿学习和离线强化学习方法,发现它们能够在一系列需要不同程度泛化的任务上取得成功。我们还证明,这些方法的性能会随着数据量和模型容量的增加而提升,并且在更多样的技能上训练会带来更好的泛化能力。通过公开 BridgeData V2 和我们的预训练模型,我们希望加速可扩展机器人学习方法的研究。项目页面:https://rail-berkeley.github.io/bridgedata

Label Budget Allocation in Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2308.12949
  • repo_url: None
  • paper_authors: Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian
  • for: 提高机器学习系统的性能,因为标签数据的成本限制了系统的性能。
  • methods: 提出了一种名为多任务学习中的标签预算分配问题,并正式定义了这个问题。并通过实验表明,不同的预算分配策略对多任务学习的性能有很大的影响。我们提出了一种适应任务的预算分配算法,可以在不同的多任务学习设置下生成最佳的预算分配策略。
  • results: 我们的方法可以在PASCAL VOC和Taskonomy上实现优于其他常用的标签分配策略的性能。
    Abstract The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to propose and formally define the label budget allocation problem in multi-task learning and to empirically show that different budget allocation strategies make a big difference to its performance. We propose a Task-Adaptive Budget Allocation algorithm to robustly generate the optimal budget allocation adaptive to different multi-task learning settings. Specifically, we estimate and then maximize the extent of new information obtained from the allocated budget as a proxy for multi-task learning performance. Experiments on PASCAL VOC and Taskonomy demonstrate the efficacy of our approach over other widely used heuristic labeling strategies.
    摘要 标注数据的成本常常限制机器学习系统的性能。在多任务学习中,相关任务之间可以互相提供信息并提升整体性能,但不同任务的标注成本可能不同。那么,应如何在不同任务之间分配标注预算(即用于标注的经费),才能获得最优的多任务性能?我们首次提出并正式定义了多任务学习中的标注预算分配问题,并通过实验表明,不同的预算分配策略会对多任务性能产生显著影响。我们提出了一种任务自适应的预算分配算法,能够针对不同的多任务学习设置稳健地生成最优的预算分配方案。具体而言,我们估计并最大化从所分配预算中获得的新信息量,以此作为多任务学习性能的代理指标。在 PASCAL VOC 和 Taskonomy 上的实验表明,我们的方法优于其他常用的启发式标注策略。
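To give a feel for the allocation problem, the sketch below greedily assigns discrete budget units to whichever task currently promises the largest estimated marginal gain. This is only a hypothetical baseline for intuition; the paper's Task-Adaptive Budget Allocation instead estimates and maximizes the new information obtained from the allocated budget, and the gain numbers here are made up.

```python
def greedy_budget_allocation(budget_units: int, gain_estimates: dict[str, list[float]]) -> dict[str, int]:
    """Allocate labeling-budget units across tasks by repeatedly picking the task whose next
    unit has the largest estimated marginal gain. Purely illustrative, not the paper's method."""
    allocation = {task: 0 for task in gain_estimates}
    for _ in range(budget_units):
        best_task, best_gain = None, float("-inf")
        for task, gains in gain_estimates.items():
            nxt = allocation[task]
            gain = gains[nxt] if nxt < len(gains) else 0.0  # diminishing returns beyond estimates
            if gain > best_gain:
                best_task, best_gain = task, gain
        allocation[best_task] += 1
    return allocation

if __name__ == "__main__":
    # Hypothetical per-unit marginal gains for two tasks (e.g. segmentation vs. depth).
    gains = {"segmentation": [0.9, 0.5, 0.2], "depth": [0.7, 0.6, 0.4]}
    print(greedy_budget_allocation(4, gains))  # {'segmentation': 2, 'depth': 2}
```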

Learning Only On Boundaries: a Physics-Informed Neural operator for Solving Parametric Partial Differential Equations in Complex Geometries

  • paper_url: http://arxiv.org/abs/2308.12939
  • repo_url: None
  • paper_authors: Zhiwei Fang, Sifan Wang, Paris Perdikaris
  • for: 在没有标注数据的情况下求解参数化边值问题。
  • methods: 采用物理信息神经算子方法,将 PDE 重新表述为边界积分方程(BIE),从而只需在区域边界上训练算子网络,无需大量标注数据。
  • results: 该方法将所需采样点数从 $O(N^d)$ 降至 $O(N^{d-1})$,显著加速训练;并且能够处理复杂的参数化几何以及现有 PINN 和神经算子无法处理的无界问题。
    Abstract Recently deep learning surrogates and neural operators have shown promise in solving partial differential equations (PDEs). However, they often require a large amount of training data and are limited to bounded domains. In this work, we present a novel physics-informed neural operator method to solve parametrized boundary value problems without labeled data. By reformulating the PDEs into boundary integral equations (BIEs), we can train the operator network solely on the boundary of the domain. This approach reduces the number of required sample points from $O(N^d)$ to $O(N^{d-1})$, where $d$ is the domain's dimension, leading to a significant acceleration of the training process. Additionally, our method can handle unbounded problems, which are unattainable for existing physics-informed neural networks (PINNs) and neural operators. Our numerical experiments show the effectiveness of parametrized complex geometries and unbounded problems.
    摘要 最近,深度学习代理模型和神经算子在求解偏微分方程(PDE)方面展现出潜力。然而,它们通常需要大量训练数据,并且局限于有界区域。在这项工作中,我们提出一种新的物理信息神经算子方法,用于在没有标注数据的情况下求解参数化边值问题。通过将 PDE 重新表述为边界积分方程(BIE),我们可以只在区域边界上训练算子网络。这种做法将所需采样点数从 $O(N^d)$ 降到 $O(N^{d-1})$(其中 $d$ 为区域维数),从而显著加速训练过程。此外,我们的方法还能处理无界问题,而这是现有的物理信息神经网络(PINN)和神经算子所无法做到的。数值实验表明了该方法在参数化复杂几何和无界问题上的有效性。
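As a concrete (and standard) example of the reformulation the abstract relies on, Green's representation formula writes the solution of the Laplace equation entirely in terms of boundary data, which is why training samples are only needed on the boundary. This is a textbook special case, not the paper's general parametrized formulation; the sign convention depends on how the fundamental solution $G$ is defined.

```latex
% Boundary-integral (Green's representation) form of the Laplace problem \Delta u = 0 in \Omega:
% the interior solution is determined entirely by data on \partial\Omega, which is the property
% an operator network exploits when it is trained only on boundary samples.
\[
u(\mathbf{x}) \;=\; \int_{\partial\Omega}
\Big[\, G(\mathbf{x},\mathbf{y})\,\frac{\partial u}{\partial n}(\mathbf{y})
\;-\; u(\mathbf{y})\,\frac{\partial G}{\partial n_{\mathbf{y}}}(\mathbf{x},\mathbf{y}) \Big]\,\mathrm{d}S(\mathbf{y}),
\qquad \mathbf{x}\in\Omega,
\]
% where $G$ is the fundamental solution of the Laplacian (proportional to
% $\ln\lVert\mathbf{x}-\mathbf{y}\rVert$ in 2D), up to the chosen sign convention.
```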

Low-count Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.12925
  • repo_url: None
  • paper_authors: Philipp Renz, Kurt Cutajar, Niall Twomey, Gavin K. C. Cheung, Hanting Xie
  • for: 这篇论文旨在解决低计数(low-count)时间序列中的异常检测问题,尤其是大规模在线平台在采集和监测多种数据类型时所面临的若干独特挑战。
  • methods: 论文提出了一种新的生成流程,用于构建包含异常片段的低计数时间序列基准数据集;并通过理论与实验分析,解释了现有算法在正常片段与异常片段分布重叠时的不足,以及如何通过平滑异常得分来改进性能。
  • results: 论文在一个真实的零售门店销售数据集上验证了其分析与建议的实用价值,证明了异常得分平滑的作用。
    Abstract Low-count time series describe sparse or intermittent events, which are prevalent in large-scale online platforms that capture and monitor diverse data types. Several distinct challenges surface when modelling low-count time series, particularly low signal-to-noise ratios (when anomaly signatures are provably undetectable), and non-uniform performance (when average metrics are not representative of local behaviour). The time series anomaly detection community currently lacks explicit tooling and processes to model and reliably detect anomalies in these settings. We address this gap by introducing a novel generative procedure for creating benchmark datasets comprising of low-count time series with anomalous segments. Via a mixture of theoretical and empirical analysis, our work explains how widely-used algorithms struggle with the distribution overlap between normal and anomalous segments. In order to mitigate this shortcoming, we then leverage our findings to demonstrate how anomaly score smoothing consistently improves performance. The practical utility of our analysis and recommendation is validated on a real-world dataset containing sales data for retail stores.
    摘要 低计数时间序列刻画的是稀疏或间歇发生的事件,这类数据在采集和监测多种数据类型的大规模在线平台中十分常见。为低计数时间序列建模会面临若干明显的挑战,尤其是低信噪比(此时异常特征在理论上不可检测)以及性能的不均匀(平均指标无法代表局部行为)。时间序列异常检测社区目前缺乏在这类设定下建模并可靠检测异常的专门工具和流程。我们通过引入一种新的生成流程来填补这一空白,用于构建包含异常片段的低计数时间序列基准数据集。通过理论与实验相结合的分析,我们解释了广泛使用的算法为何会受困于正常片段与异常片段之间的分布重叠。为缓解这一缺陷,我们进一步展示了对异常得分进行平滑能够稳定地提升性能。我们在一个包含零售门店销售数据的真实数据集上验证了上述分析与建议的实用价值。
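The anomaly-score smoothing recommended above can be as simple as a centered moving average over per-timestep scores. The sketch below illustrates that idea; the window length and the synthetic example are arbitrary choices, not values from the paper.

```python
import numpy as np

def smooth_scores(scores: np.ndarray, window: int = 5) -> np.ndarray:
    """Centered moving-average smoothing of per-timestep anomaly scores.
    Smoothing pools evidence over neighbouring points, which helps when single-point
    scores are unreliable for sparse, low-count series."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.random(200) * 0.2
    raw[90:110] += 0.8                      # injected anomalous segment
    smoothed = smooth_scores(raw, window=7)
    print(raw.argmax(), smoothed.argmax())  # both peaks fall inside the injected segment
```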

An Efficient Distributed Multi-Agent Reinforcement Learning for EV Charging Network Control

  • paper_url: http://arxiv.org/abs/2308.12921
  • repo_url: None
  • paper_authors: Amin Shojaeighadikolaei, Morteza Hashemi
  • For: The paper aims to develop an effective EV charging controller that mitigates the risk of transformer overload in the distribution grid while preserving privacy for EV owners.
  • Methods: The authors propose a decentralized Multi-agent Reinforcement Learning (MARL) charging framework, employing the Centralized Training Decentralized Execution-Deep Deterministic Policy Gradient (CTDE-DDPG) scheme, which provides valuable information to agents during training while maintaining privacy during execution.
  • Results: The CTDE framework improves the performance of the charging network by reducing network costs and lowering the Peak-to-Average Ratio (PAR) of total demand, which in turn reduces the risk of transformer overload during peak hours.
    Abstract The increasing trend in adopting electric vehicles (EVs) will significantly impact the residential electricity demand, which results in an increased risk of transformer overload in the distribution grid. To mitigate such risks, there are urgent needs to develop effective EV charging controllers. Currently, the majority of the EV charge controllers are based on a centralized approach for managing individual EVs or a group of EVs. In this paper, we introduce a decentralized Multi-agent Reinforcement Learning (MARL) charging framework that prioritizes the preservation of privacy for EV owners. We employ the Centralized Training Decentralized Execution-Deep Deterministic Policy Gradient (CTDE-DDPG) scheme, which provides valuable information to users during training while maintaining privacy during execution. Our results demonstrate that the CTDE framework improves the performance of the charging network by reducing the network costs. Moreover, we show that the Peak-to-Average Ratio (PAR) of the total demand is reduced, which, in turn, reduces the risk of transformer overload during the peak hours.
    摘要 随着电动车(EV)的普及,住宅用电需求将受到显著影响,从而增加配电网中变压器过载的风险。为缓解这些风险,亟需开发有效的 EV 充电控制器。目前,大多数 EV 充电控制器采用集中式方法来管理单台或一组 EV。在这篇论文中,我们提出一种去中心化的多智能体强化学习(MARL)充电框架,强调保护 EV 车主的隐私。我们采用集中训练、去中心化执行的深度确定性策略梯度(CTDE-DDPG)方案,在训练阶段为各智能体提供有价值的信息,同时在执行阶段保持隐私。结果表明,CTDE 框架通过降低网络成本提升了充电网络的性能;同时,总需求的峰均比(PAR)也得以降低,从而减少了高峰时段变压器过载的风险。

Towards Realistic Unsupervised Fine-tuning with CLIP

  • paper_url: http://arxiv.org/abs/2308.12919
  • repo_url: None
  • paper_authors: Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, Tieniu Tan
  • for: 本研究探讨如何将 CLIP 等视觉-语言模型(VLM)应用于下游任务的微调,并关注更贴近实际的无监督微调场景。
  • methods: 文章提出了一种简单、高效且实用的微调方法——通用熵优化(Universal Entropy Optimization,UEO)。它利用样本级置信度,近似地最小化高置信度样本的条件熵,同时最大化低置信度样本的边缘熵;除了优化文本提示外,UEO 还对 CLIP 视觉分支中的逐通道仿射变换进行优化。
  • results: 文章在 15 个领域和 4 种不同类型的先验知识上进行了广泛实验,结果表明 UEO 方法在泛化能力和分布外检测两方面均明显优于基线方法。
    Abstract The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. In this paper, we delve into a realistic unsupervised fine-tuning scenario by assuming that the unlabeled data might contain out-of-distribution samples from unknown classes. Furthermore, we emphasize the importance of simultaneously enhancing out-of-distribution detection capabilities alongside the recognition of instances associated with predefined class labels. To tackle this problem, we present a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompts, UEO also incorporates optimization of channel-wise affine transformations within the visual branch of CLIP. Through extensive experiments conducted across 15 domains and 4 different types of prior knowledge, we demonstrate that UEO surpasses baseline methods in terms of both generalization and out-of-distribution detection.
    摘要 随着 CLIP 等视觉-语言模型(VLM)的出现,研究者们投入了大量精力将其应用于下游监督学习任务。尽管已有一些工作探索了 CLIP 的无监督微调,它们通常依赖以类别名称形式给出、与真实标签相关联的先验知识。在这篇论文中,我们研究一种更贴近实际的无监督微调场景,即假设无标注数据中可能包含来自未知类别的分布外样本。此外,我们强调在识别预定义类别样本的同时,还应提升分布外检测能力。为解决这一问题,我们提出了一种简单、高效且有效的微调方法,称为通用熵优化(UEO)。UEO 利用样本级置信度,近似地最小化高置信度样本的条件熵,并最大化低置信度样本的边缘熵。除了优化文本提示外,UEO 还对 CLIP 视觉分支中的逐通道仿射变换进行优化。我们在 15 个领域和 4 种不同类型的先验知识上开展了大量实验,结果表明 UEO 在泛化能力和分布外检测方面均优于基线方法。
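A rough sketch of the entropy objective described above: confident samples have their per-sample (conditional) entropy minimized, while the marginal distribution of the less confident samples has its entropy maximized. The confidence threshold and weighting are assumptions, so this should be read as an illustration in the spirit of UEO rather than its exact loss.

```python
import torch

def entropy_objective(logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Confidence-split entropy objective (illustrative): sharpen confident predictions,
    keep the averaged prediction of uncertain samples diverse."""
    probs = logits.softmax(dim=-1)
    cond_entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)   # per-sample entropy
    confidence = probs.max(dim=-1).values
    confident = (confidence >= tau).float()
    uncertain = 1.0 - confident

    # Term 1: minimize conditional entropy on confident samples.
    loss_confident = (confident * cond_entropy).sum() / confident.sum().clamp_min(1.0)

    # Term 2: maximize marginal entropy over the less confident samples.
    marginal = (uncertain.unsqueeze(-1) * probs).sum(dim=0) / uncertain.sum().clamp_min(1.0)
    marginal_entropy = -(marginal * marginal.clamp_min(1e-8).log()).sum()

    return loss_confident - marginal_entropy

if __name__ == "__main__":
    print(entropy_objective(torch.randn(16, 10)).item())
```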

Evaluating the Vulnerabilities in ML systems in terms of adversarial attacks

  • paper_url: http://arxiv.org/abs/2308.12918
  • repo_url: None
  • paper_authors: John Harshith, Mantej Singh Gill, Madhan Jothimani
  • for: 本研究探讨了最新的 adversarial 攻击方法,以及它们对当前深度学习网络防御系统的挑战。
  • methods: 本研究探讨了 AI 系统漏洞可能带来的后果,包括这些漏洞可能如何产生、随机样本与对抗样本之间的区别,以及漏洞潜在的伦理影响。
  • results: 本研究讨论了这些新型对抗攻击方法及其对当前防御系统的影响,并建议在测试阶段对 AI 系统进行恰当训练,以便为更广泛的应用做好准备。
    Abstract There have been recent adversarial attacks that are difficult to find. These new adversarial attacks methods may pose challenges to current deep learning cyber defense systems and could influence the future defense of cyberattacks. The authors focus on this domain in this research paper. They explore the consequences of vulnerabilities in AI systems. This includes discussing how they might arise, differences between randomized and adversarial examples and also potential ethical implications of vulnerabilities. Moreover, it is important to train the AI systems appropriately when they are in testing phase and getting them ready for broader use.
    摘要 近来出现了一些难以察觉的对抗攻击。这些新的对抗攻击方法可能会对当前的深度学习网络防御系统构成挑战,并可能影响未来对网络攻击的防御。作者在这篇研究论文中关注这一领域,探讨了人工智能系统漏洞的后果,包括这些漏洞可能如何产生、随机样本与对抗样本之间的区别,以及漏洞可能引发的伦理问题。此外,在测试阶段需要对 AI 系统进行恰当的训练,以便为更广泛的应用做好准备。

POLCA: Power Oversubscription in LLM Cloud Providers

  • paper_url: http://arxiv.org/abs/2308.12908
  • repo_url: None
  • paper_authors: Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini
  • for: 论文面向大语言模型(LLM)的创新及其多样化应用场景给数据中心 GPU 算力需求带来的快速增长,以及这些新型工作负载对数据中心电力资源提出的挑战。
  • methods: 论文对多种 LLM 及其不同配置的功耗模式进行了大规模测量和分析,并区分了推理与训练阶段功耗特征的差异。
  • results: 论文发现,LLM 集群在推理阶段的平均和峰值功率利用率并不高,因此可以通过功率超额配置来提升数据中心的用电效率、增加可部署的服务器数量并缩短部署时间。论文进一步提出了名为 POLCA 的稳健、可靠且易于部署的功率超额配置框架;模拟表明,在同一 GPU 集群中可为推理多部署 30% 的服务器,且性能损失极小。
    Abstract Recent innovation in large language models (LLMs), and their myriad use-cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans of growth in their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the increasing model sizes of LLMs, they are becoming increasingly power intensive. In this paper, we show that there is a significant opportunity to oversubscribe power in LLM clusters. Power oversubscription improves the power efficiency of these datacenters, allowing more deployable servers per datacenter, and reduces the deployment time, since building new datacenters is slow. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the inference and training power consumption patterns. Based on our analysis of these LLMs, we claim that the average and peak power utilization in LLM clusters for inference should not be very high. Our deductions align with the data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment, makes it challenging to have a reliable and robust power oversubscription mechanism. We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in the same GPU cluster for inference, with minimal performance loss
    摘要 近来大语言模型(LLM)的创新及其多样化的应用场景,迅速推高了数据中心对 GPU 算力的需求。多家云服务商和企业已制定了大规模扩建数据中心的计划,以支撑这些新型工作负载。数据中心的关键瓶颈资源之一是电力,而随着 LLM 模型规模不断增大,其功耗也越来越高。在这篇论文中,我们表明 LLM 集群中存在对功率进行超额配置(oversubscription)的显著空间。功率超额配置可以提升数据中心的用电效率,使同一数据中心能够部署更多服务器,并缩短部署时间(新建数据中心的周期很长)。我们对多种 LLM 及其配置的功耗模式进行了充分刻画,识别出推理与训练功耗模式的差异。基于对这些 LLM 的分析,我们认为 LLM 集群在推理场景下的平均和峰值功率利用率不会很高;这一结论与生产环境 LLM 集群的数据相吻合,表明推理工作负载为功率超额配置提供了充足的余量。然而,GPU 在虚拟化环境中提供的遥测与控制手段有限,使得构建可靠、稳健的功率超额配置机制颇具挑战。我们提出了 POLCA,一个面向 GPU 集群的稳健、可靠且易于部署的功率超额配置框架。我们利用开源模型复现了生产环境中观察到的功耗模式,并对 POLCA 进行了模拟,结果表明在同一 GPU 集群中可以为推理多部署 30% 的服务器,且性能损失极小。
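The headroom argument can be made concrete with back-of-the-envelope arithmetic: if inference workloads only ever reach a fraction of each server's provisioned peak power, the same facility budget supports proportionally more servers. The sketch below is just that arithmetic with made-up numbers; it is not POLCA's control mechanism, which must also handle telemetry and power capping when the assumption is violated.

```python
def deployable_servers(dc_power_budget_kw: float, provisioned_kw_per_server: float,
                       observed_peak_fraction: float) -> tuple[int, int]:
    """Back-of-the-envelope headroom calculation (not POLCA itself): servers that fit when
    provisioning uses nameplate power vs. the peak fraction actually drawn by the workload."""
    conservative = int(dc_power_budget_kw // provisioned_kw_per_server)
    oversubscribed = int(dc_power_budget_kw // (provisioned_kw_per_server * observed_peak_fraction))
    return conservative, oversubscribed

if __name__ == "__main__":
    base, oversub = deployable_servers(1000.0, 10.0, 0.75)  # hypothetical numbers
    print(base, oversub)  # 100 vs 133 servers under a 75% observed-peak assumption
```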

CDAN: Convolutional Dense Attention-guided Network for Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.12902
  • repo_url: None
  • paper_authors: Hossein Shakibania, Sina Raoufi, Hassan Khotanlou
  • for: 增强低光照图像,解决低光照图像的降低清晰度、降低颜色强度和减少细节等问题,以便进行准确的分析和解释。
  • methods: 该文献提出了一种基于自编码器的Convolutional Dense Attention-guided Network(CDAN),包括卷积和密集块,以及注意力机制和跳过连接。这种架构确保了信息的有效传播和特征学习。
  • results: 该文献的方法在低光照图像增强 tasks中表现出了remarkable进步,与当前最佳结果相比,在多种复杂的低光照场景中表现出了robustness。
    Abstract Low-light images, characterized by inadequate illumination, pose challenges of diminished clarity, muted colors, and reduced details. Low-light image enhancement, an essential task in computer vision, aims to rectify these issues by improving brightness, contrast, and overall perceptual quality, thereby facilitating accurate analysis and interpretation. This paper introduces the Convolutional Dense Attention-guided Network (CDAN), a novel solution for enhancing low-light images. CDAN integrates an autoencoder-based architecture with convolutional and dense blocks, complemented by an attention mechanism and skip connections. This architecture ensures efficient information propagation and feature learning. Furthermore, a dedicated post-processing phase refines color balance and contrast. Our approach demonstrates notable progress compared to state-of-the-art results in low-light image enhancement, showcasing its robustness across a wide range of challenging scenarios. Our model performs remarkably on benchmark datasets, effectively mitigating under-exposure and proficiently restoring textures and colors in diverse low-light scenarios. This achievement underscores CDAN's potential for diverse computer vision tasks, notably enabling robust object detection and recognition in challenging low-light conditions.
    摘要 低光照图像受光照不足的影响,往往清晰度下降、颜色黯淡、细节缺失。低光照图像增强是计算机视觉中的一项重要任务,旨在通过提升亮度、对比度和整体观感质量,使图像更便于准确分析和解读。本文介绍了一种新的低光照图像增强方法——卷积密集注意力引导网络(CDAN)。CDAN 将基于自编码器的架构与卷积块、密集块相结合,并辅以注意力机制和跳跃连接,从而保证了信息的高效传播和特征学习;随后专门的后处理阶段进一步调整色彩平衡与对比度。我们的方法在低光照图像增强上相比当前最先进结果取得了显著进步,在多种具有挑战性的场景中表现稳健。模型在基准数据集上表现出色,能有效缓解曝光不足,并在各种低光照场景中很好地恢复纹理与颜色。这一成果表明 CDAN 在各类计算机视觉任务中具有潜力,尤其是在低光照条件下实现稳健的目标检测与识别。

Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark]

  • paper_url: http://arxiv.org/abs/2308.12899
  • repo_url: https://github.com/libcity/bigscity-libcity
  • paper_authors: Jiawei Jiang, Chengkai Han, Wayne Xin Zhao, Jingyuan Wang
  • for: 这篇论文旨在解决城市时空预测领域中多源异构数据的访问与利用问题,以及深度学习模型层出不穷背景下模型结构与组件的选择问题。
  • methods: 论文提出了面向城市时空大数据的统一存储格式"原子文件",并在 40 个多样化的数据集上验证了其有效性;同时系统梳理了城市时空预测模型的技术进展,并使用多种模型和数据集开展了广泛实验,建立了性能排行榜。
  • results: 通过"原子文件"以及对多种模型和数据集的实验,论文实现了对城市时空数据的有效管理,为后续研究指明了方向,并有望在长期内为提升城市生活质量做出持续贡献。
    Abstract The field of urban spatial-temporal prediction is advancing rapidly with the development of deep learning techniques and the availability of large-scale datasets. However, challenges persist in accessing and utilizing diverse urban spatial-temporal datasets from different sources and stored in different formats, as well as determining effective model structures and components with the proliferation of deep learning models. This work addresses these challenges and provides three significant contributions. Firstly, we introduce "atomic files", a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets, simplifying data management. Secondly, we present a comprehensive overview of technological advances in urban spatial-temporal prediction models, guiding the development of robust models. Thirdly, we conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions. Overall, this work effectively manages urban spatial-temporal data, guides future efforts, and facilitates the development of accurate and efficient urban spatial-temporal prediction models. It can potentially make long-term contributions to urban spatial-temporal data management and prediction, ultimately leading to improved urban living standards.
    摘要 随着深度学习技术的发展和大规模数据集的出现,城市时空预测领域正在快速进步。然而,如何访问和利用来自不同来源、以不同格式存储的多样化城市时空数据,以及在深度学习模型层出不穷的背景下如何确定有效的模型结构与组件,仍然存在挑战。本工作针对这些挑战做出了三项重要贡献。首先,我们提出了"原子文件",一种专为城市时空大数据设计的统一存储格式,并在 40 个多样化数据集上验证了其有效性,从而简化了数据管理。其次,我们对城市时空预测模型的技术进展进行了全面综述,为构建稳健模型提供指导。第三,我们使用多种模型和数据集开展了大量实验,建立了性能排行榜,并指出了有前景的研究方向。总体而言,本工作有效地管理了城市时空数据、指引了后续研究,并促进了准确而高效的城市时空预测模型的发展,有望对城市时空数据管理与预测做出长期贡献,最终提升城市生活水平。

Beyond Document Page Classification: Design, Datasets, and Challenges

  • paper_url: http://arxiv.org/abs/2308.12896
  • repo_url: None
  • paper_authors: Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko, Marie-Francine Moens
  • for: 本文提出了将文档分类测试更加接近实际应用的需求,包括测试数据的性质(多通道、多页、多业务)和分类任务的类型(多页文档、页流和文档套件分类等)。
  • methods: 本文认为现有的公共多页文档分类数据集缺乏,并正式化了应用场景中的多种分类任务,以及需要更好的文档表示。
  • results: 实验表明,现有的基准已经过时,需要更新以评估实际中自然出现的完整文档。这也凸显了对更成熟评估方法的需求,包括校准评估、推理的时间-内存复杂度,以及面向实际的多种分布偏移(例如,原生数字文档与扫描噪声、页面顺序变化)。
    Abstract This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested ($X$: multi-channel, multi-paged, multi-industry; $Y$: class distributions and label set variety) and in classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.}
    摘要 本文强调需要让文档分类基准测试更贴近实际应用,这既体现在被测数据的性质上(多通道、多页、多行业;类别分布与标签集的多样性),也体现在所考虑的分类任务上(多页文档、页面流以及文档集的分类等)。我们指出了公开多页文档分类数据集的缺失,形式化了应用场景中出现的不同分类任务,并论证了高效的多页文档表示的价值。在所提出的多页文档分类数据集上的实验研究表明,现有基准已不再适用,需要更新以评估实际中自然出现的完整文档。这也要求更成熟的评估方法,包括校准评估、推理的时间-内存复杂度,以及多种贴近实际的分布偏移(例如,原生数字文档与扫描噪声、页面顺序变化)。最后,本文给出了未来改进的具体方向。

cs.SD - 2023-08-24

Sparks of Large Audio Models: A Survey and Outlook

  • paper_url: http://arxiv.org/abs/2308.12792
  • repo_url: None
  • paper_authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Heriberto Cuayáhuitl, Björn W. Schuller
  • for: 这项论文提供了大语言模型在音频处理领域的最新进展和挑战。
  • methods: 论文回顾了以 Transformer 架构为代表、依托海量数据训练的大型音频模型,并对这些模型进行了深入分析与评估。
  • results: 论文指出,这些基础音频模型在自动语音识别、文本转语音和音乐生成等多种音频任务中表现出色;以 SeamlessM4T 为代表的模型还开始展现出充当通用翻译器的能力,可在不依赖单独任务系统的情况下支持多达 100 种语言的多种语音任务。
    Abstract This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, \textit{Large Audio Models}, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amount of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, recently these Foundational Audio Models, like SeamlessM4T, have started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding \textit{Foundational Large Audio Models}, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of \textit{Large Audio Models} with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will consistently update the relevant repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.

WavMark: Watermarking for Audio Generation

  • paper_url: http://arxiv.org/abs/2308.12770
  • repo_url: None
  • paper_authors: Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei
  • for: This paper proposes an audio watermarking framework to defend against voice fraud and speaker impersonation.
  • methods: The framework encodes up to 32 bits of watermark within a 1-second audio snippet; the watermark is imperceptible to human listeners and strongly resilient to various attacks.
  • results: Using 10-20 second audio as the host, the framework achieves an average Bit Error Rate of 0.48% across ten common attacks, a reduction of more than 2800% in BER compared with the state-of-the-art watermarking tool.
    Abstract Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
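    As a rough illustration of the robustness metric and the segment-combination idea described above, here is a small numpy sketch; the 32-bit payload repetition, the majority-vote combination, and all function names are assumptions for illustration, not the authors' actual scheme.

```python
import numpy as np

def bit_error_rate(original: np.ndarray, decoded: np.ndarray) -> float:
    """Fraction of watermark bits decoded incorrectly."""
    return float(np.mean(original != decoded))

# Hypothetical setup: the same 32-bit payload embedded in ten 1-second
# segments of a 10-second host signal, then decoded after an attack that
# flips a couple of bits in some segments.
rng = np.random.default_rng(0)
payload = rng.integers(0, 2, size=32)
decoded_segments = np.tile(payload, (10, 1))
decoded_segments[3, 5] ^= 1          # simulate two corrupted bits
decoded_segments[7, 20] ^= 1

# Combining repeated segments by majority vote trades capacity for robustness,
# in the spirit of the multi-segment combination mentioned in the abstract.
recovered = (decoded_segments.mean(axis=0) > 0.5).astype(int)

per_segment = np.mean([bit_error_rate(payload, seg) for seg in decoded_segments])
print(f"mean per-segment BER: {per_segment:.4%}")
print(f"BER after majority vote: {bit_error_rate(payload, recovered):.4%}")
```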

Whombat: An open-source annotation tool for machine learning development in bioacoustics

  • paper_url: http://arxiv.org/abs/2308.12688
  • repo_url: None
  • paper_authors: Santiago Martinez Balvanera, Oisin Mac Aodha, Matthew J. Weldy, Holly Pringle, Ella Browning, Kate E. Jones
  • for: The paper supports machine-learning-based automated analysis of bioacoustic recordings, which has the potential to greatly scale biodiversity monitoring efforts.
  • methods: High-stakes ML applications demand a data-centric approach built on carefully annotated and curated training and evaluation data to ensure model accuracy and reliability.
  • results: The paper introduces Whombat, a user-friendly, browser-based interface for managing audio recordings and annotation projects; it provides visualization, exploration, and annotation tools that let users quickly annotate, review, and share annotations, and visualize and evaluate machine-learning predictions on a dataset.
    Abstract
  1. Automated analysis of bioacoustic recordings using machine learning (ML) methods has the potential to greatly scale biodiversity monitoring efforts. The use of ML for high-stakes applications, such as conservation research, demands a data-centric approach with a focus on utilizing carefully annotated and curated evaluation and training data that is relevant and representative. Creating annotated datasets of sound recordings presents a number of challenges, such as managing large collections of recordings with associated metadata, developing flexible annotation tools that can accommodate the diverse range of vocalization profiles of different organisms, and addressing the scarcity of expert annotators. 2. We present Whombat, a user-friendly, browser-based interface for managing audio recordings and annotation projects, with several visualization, exploration, and annotation tools. It enables users to quickly annotate, review, and share annotations, as well as visualize and evaluate a set of machine learning predictions on a dataset. The tool facilitates an iterative workflow where user annotations and machine learning predictions feedback to enhance model performance and annotation quality. 3. We demonstrate the flexibility of Whombat by showcasing two distinct use cases: a project aimed at enhancing automated UK bat call identification at the Bat Conservation Trust (BCT), and a collaborative effort among the USDA Forest Service and Oregon State University researchers exploring bioacoustic applications and extending automated avian classification models in the Pacific Northwest, USA. 4. Whombat is a flexible tool that can effectively address the challenges of annotation for bioacoustic research. It can be used for individual and collaborative work, hosted on a shared server or accessed remotely, or run on a personal computer without the need for coding skills.

NAaLoss: Rethinking the objective of speech enhancement

  • paper_url: http://arxiv.org/abs/2308.12615
  • repo_url: None
  • paper_authors: Kuan-Hsun Ho, En-Lun Yu, Jeih-weih Hung, Berlin Chen
  • for: This paper aims to improve the performance of automatic speech recognition (ASR) in noisy environments by reducing the impact of processing artifacts generated by single-channel speech enhancement (SE) methods.
  • methods: The paper proposes a novel Noise- and Artifacts-aware loss function (NAaLoss) that considers the loss of estimation, de-artifact, and noise ignorance to improve the quality of SE.
  • results: Experimental results show that NAaLoss significantly improves the ASR performance of most setups while preserving the quality of SE, as demonstrated through visualizations of artifacts in waveforms and spectrograms.
    Abstract Reducing noise interference is crucial for automatic speech recognition (ASR) in a real-world scenario. However, most single-channel speech enhancement (SE) generates "processing artifacts" that negatively affect ASR performance. Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. NAaLoss considers the loss of estimation, de-artifact, and noise ignorance, enabling the learned SE to individually model speech, artifacts, and noise. We examine two SE models (simple/advanced) learned with NAaLoss under various input scenarios (clean/noisy) using two configurations of the ASR system (with/without noise robustness). Experiments reveal that NAaLoss significantly improves the ASR performance of most setups while preserving the quality of SE toward perception and intelligibility. Furthermore, we visualize artifacts through waveforms and spectrograms, and explain their impact on ASR.
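    To make the three ingredients of such a loss more concrete, the sketch below decomposes an enhanced signal into clean-speech, noise, and residual (artifact) components via least-squares projection and combines per-term penalties; the decomposition, term forms, and weights are assumptions for illustration, not the paper's exact NAaLoss formulation.

```python
import numpy as np

def decompose(enhanced, clean, noise):
    """Least-squares fit enhanced ~ a*clean + b*noise; the residual is treated
    as the processing artifact for this illustration."""
    basis = np.stack([clean, noise], axis=1)          # (T, 2)
    coeffs, *_ = np.linalg.lstsq(basis, enhanced, rcond=None)
    artifact = enhanced - basis @ coeffs
    return coeffs, artifact

def composite_se_loss(enhanced, clean, noise, w_est=1.0, w_art=0.5, w_noise=0.5):
    """Illustrative estimation + de-artifact + residual-noise objective."""
    coeffs, artifact = decompose(enhanced, clean, noise)
    l_est = np.mean((enhanced - clean) ** 2)          # estimation term
    l_art = np.mean(artifact ** 2)                    # de-artifact term
    l_noise = float(coeffs[1] ** 2)                   # leftover noise term
    return w_est * l_est + w_art * l_art + w_noise * l_noise

# Toy waveforms: 1 second at 16 kHz.
rng = np.random.default_rng(1)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 220 * t)
noise = 0.3 * rng.standard_normal(t.shape)
enhanced = clean + 0.05 * noise                       # pretend SE output
print(composite_se_loss(enhanced, clean, noise))
```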

Emotion-Aligned Contrastive Learning Between Images and Music

  • paper_url: http://arxiv.org/abs/2308.12610
  • repo_url: None
  • paper_authors: Shanti Stewart, Tiantian Feng, Kleanthis Avramidis, Shrikanth Narayanan
  • for: The paper addresses retrieving emotionally relevant music from image queries.
  • methods: The approach learns an affective alignment between images and music audio via emotion-supervised, cross-modal contrastive learning.
  • results: The method successfully aligns images and music, and the learned embedding space is effective for cross-modal retrieval applications.
    Abstract Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. Most approaches aim to match general music semantics to the input queries, while only a few focus on affective qualities. We address the task of retrieving emotionally-relevant music from image queries by proposing a framework for learning an affective alignment between images and music audio. Our approach focuses on learning an emotion-aligned joint embedding space between images and music. This joint embedding space is learned via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We directly evaluate the joint embeddings with cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. In addition, we investigate the generalizability of the learned music embeddings with automatic music tagging as a downstream task. Our experiments show that our approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications.
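    A minimal numpy sketch of an emotion-supervised, cross-modal SupCon-style loss, where a music clip counts as a positive for an image whenever the two share an emotion label; the batch construction, temperature, and normalization are assumptions rather than the paper's exact loss.

```python
import numpy as np

def cross_modal_supcon(image_emb, music_emb, emotion_labels, temperature=0.07):
    """Supervised contrastive loss between paired image/music embeddings.

    image_emb, music_emb: (B, D) embeddings; row i of each comes from the
    same training pair, so emotion_labels applies to both modalities.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    mus = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    logits = img @ mus.T / temperature                         # (B, B)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = (emotion_labels[:, None] == emotion_labels[None, :])
    # Average log-probability over each image's music positives (the paired
    # clip is always a positive, so every row has at least one).
    per_anchor = (positives * log_prob).sum(axis=1) / positives.sum(axis=1)
    return float(-per_anchor.mean())

rng = np.random.default_rng(0)
loss = cross_modal_supcon(rng.standard_normal((8, 128)),
                          rng.standard_normal((8, 128)),
                          rng.integers(0, 4, size=8))
print(loss)
```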

Hybrid noise shaping for audio coding using perfectly overlapped window

  • paper_url: http://arxiv.org/abs/2308.12566
  • repo_url: None
  • paper_authors: Byeongho Jo, Seungkwon Beack
  • for: Improving low-bit-rate audio coding.
  • methods: A modulated complex lapped transform-based coding framework integrated with transform coded excitation (TCX) and complex LPC-based temporal noise shaping (CTNS), using a 50% overlap window and a switching scheme to improve coding efficiency.
  • results: Objective metrics and subjective listening tests show that the proposed framework achieves strong low-bit-rate audio coding performance.
    Abstract In recent years, audio coding technology has been standardized based on several frameworks that incorporate linear predictive coding (LPC). However, coding the transient signal using frequency-domain LP residual signals remains a challenge. To address this, temporal noise shaping (TNS) can be adapted, although it cannot be effectively operated since the estimated temporal envelope in the modified discrete cosine transform (MDCT) domain is accompanied by the time-domain aliasing (TDA) terms. In this study, we propose the modulated complex lapped transform-based coding framework integrated with transform coded excitation (TCX) and complex LPC-based TNS (CTNS). Our approach uses a 50% overlap window and switching scheme for the CTNS to improve the coding efficiency. Additionally, an adaptive calculation of the target bits for the sub-bands using the frequency envelope information based on the quantized LPC coefficients is proposed. To minimize the quantization mismatch between both modes, an integrated quantization for real and complex values and a TDA augmentation method that compensates for the artificially generated TDA components during switching operations are proposed. The proposed coding framework shows a superior performance in both objective metrics and subjective listening tests, thereby demonstrating its low bit-rate audio coding.
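    For orientation, the sketch below computes the forward MDCT over 50%-overlapped frames, the setting in which the time-domain aliasing discussed above arises; the frame length and test signal are illustrative assumptions, and a real codec would additionally apply an analysis/synthesis window satisfying the Princen-Bradley condition so that the aliasing cancels at overlap-add.

```python
import numpy as np

def mdct(frame: np.ndarray) -> np.ndarray:
    """Forward MDCT of one 2N-sample frame (textbook definition): 2N -> N."""
    two_n = frame.shape[0]
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half).reshape(-1, 1)
    basis = np.cos(np.pi / n_half * (n + 0.5 + n_half / 2.0) * (k + 0.5))
    return basis @ frame

# 50% overlapped framing: each frame shares half its samples with the next,
# which is what allows the per-frame time-domain aliasing to cancel between
# neighbouring frames during windowed overlap-add synthesis.
sr, N = 16000, 256
signal = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)
starts = range(0, len(signal) - 2 * N + 1, N)
coefficients = np.stack([mdct(signal[s:s + 2 * N]) for s in starts])
print(coefficients.shape)        # (num_frames, N)
```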

MultiPA: a multi-task speech pronunciation assessment system for a closed and open response scenario

  • paper_url: http://arxiv.org/abs/2308.12490
  • repo_url: None
  • paper_authors: Yu-Wen Chen, Zhou Yu, Julia Hirschberg
  • for: The study proposes an automatic speech pronunciation assessment system that works in both closed and open response scenarios, catering to diverse learning needs and providing a more precise and holistic assessment of pronunciation skills.
  • methods: The system uses a multi-task learning approach, providing evaluations at both the sentence and word level, to improve the accuracy and reliability of pronunciation assessment.
  • results: Experiments show that the system performs comparably to previous Kaldi-based systems in the closed response scenario and remains more robust when used directly for open responses.
    Abstract The design of automatic speech pronunciation assessment can be categorized into closed and open response scenarios, each with strengths and limitations. A system with the ability to function in both scenarios can cater to diverse learning needs and provide a more precise and holistic assessment of pronunciation skills. In this study, we propose a Multi-task Pronunciation Assessment model called MultiPA. MultiPA provides an alternative to Kaldi-based systems in that it has simpler format requirements and better compatibility with other neural network models. Compared with previous open response systems, MultiPA provides a wider range of evaluations, encompassing assessments at both the sentence and word-level. Our experimental results show that MultiPA achieves comparable performance when working in closed response scenarios and maintains more robust performance when directly used for open responses.
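    To illustrate how a single model can return both sentence-level and word-level pronunciation scores, here is a toy sketch; the mean pooling, linear heads, and word-span alignment are assumptions for illustration, not MultiPA's actual architecture.

```python
import numpy as np

def score_utterance(frame_features, word_spans, w_sentence, w_word):
    """Sentence- and word-level scores from shared frame-level features.

    frame_features: (T, D) encoder output for one utterance
    word_spans:     list of (start, end) frame indices, one per word
    w_sentence, w_word: (D,) hypothetical linear scoring heads
    """
    sentence_score = float(frame_features.mean(axis=0) @ w_sentence)
    word_scores = [float(frame_features[s:e].mean(axis=0) @ w_word)
                   for s, e in word_spans]
    return sentence_score, word_scores

rng = np.random.default_rng(0)
features = rng.standard_normal((120, 64))            # 120 frames, 64-d features
spans = [(0, 40), (40, 80), (80, 120)]               # three aligned words
print(score_utterance(features, spans,
                      rng.standard_normal(64), rng.standard_normal(64)))
```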

Attention-Based Acoustic Feature Fusion Network for Depression Detection

  • paper_url: http://arxiv.org/abs/2308.12478
  • repo_url: https://github.com/xuxiaoooo/abafnet
  • paper_authors: Xiao Xu, Yang Wang, Xinru Wei, Fei Wang, Xizhe Zhang
  • for: The paper proposes a new attention-based acoustic feature fusion network (ABAFnet) for detecting depression and its subtypes.
  • methods: ABAFnet fuses four different acoustic features within a deep learning model to effectively integrate multi-tiered features, and introduces a new weight-adjustment module for late fusion to boost detection performance.
  • results: Experiments show that ABAFnet outperforms previous methods on depression detection and subtype classification; further analysis highlights the key role of MFCC-related features in speech-based depression detection.
    Abstract Depression, a common mental disorder, significantly influences individuals and imposes considerable societal impacts. The complexity and heterogeneity of the disorder necessitate prompt and effective detection, which nonetheless, poses a difficult challenge. This situation highlights an urgent requirement for improved detection methods. Exploiting auditory data through advanced machine learning paradigms presents promising research directions. Yet, existing techniques mainly rely on single-dimensional feature models, potentially neglecting the abundance of information hidden in various speech characteristics. To rectify this, we present the novel Attention-Based Acoustic Feature Fusion Network (ABAFnet) for depression detection. ABAFnet combines four different acoustic features into a comprehensive deep learning model, thereby effectively integrating and blending multi-tiered features. We present a novel weight adjustment module for late fusion that boosts performance by efficaciously synthesizing these features. The effectiveness of our approach is confirmed via extensive validation on two clinical speech databases, CNRAC and CS-NRAC, thereby outperforming previous methods in depression detection and subtype classification. Further in-depth analysis confirms the key role of each feature and highlights the importance of MFCC-related features in speech-based depression detection.
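    As a concrete picture of weighted late fusion over several acoustic feature branches, consider the sketch below; the four branch names, the softmax weighting, and the fixed weights are assumptions for illustration, not ABAFnet's learned module.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_late_fusion(branch_logits, raw_weights):
    """Fuse per-branch class logits with softmax-normalized branch weights."""
    names = list(branch_logits)
    weights = softmax(np.array([raw_weights[name] for name in names]))
    fused = sum(w * branch_logits[name] for w, name in zip(weights, names))
    return fused, dict(zip(names, weights.round(3)))

# Hypothetical branches producing (healthy, depressed) logits.
branch_logits = {
    "mfcc":        np.array([0.2, 1.4]),
    "spectrogram": np.array([0.5, 0.9]),
    "prosodic":    np.array([0.1, 0.3]),
    "embedding":   np.array([0.4, 0.8]),
}
raw_weights = {"mfcc": 1.2, "spectrogram": 0.4, "prosodic": -0.3, "embedding": 0.1}
fused, normalized = weighted_late_fusion(branch_logits, raw_weights)
print(fused, normalized)
```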

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

  • paper_url: http://arxiv.org/abs/2308.12408
  • repo_url: None
  • paper_authors: Matthew Martel, Jackson Wagner
  • for: The goal is to develop a deep-learning-based framework for generating realistic audio effects for movies and other media.
  • methods: Several model architectures are explored that process both video context and previously generated audio, including a deep-fusion CNN, a dilated Wavenet CNN with visual context, and transformer-based architectures.
  • results: The transformer-based architecture matches low-frequency content to visual patterns effectively but fails to generate more nuanced waveforms.
    Abstract Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art. Foley artists create sounds with common objects (e.g., boxing gloves, broken glass) in time with video as it is playing to generate captivating audio tracks. In this work, we aim to develop a deep-learning based framework that does much the same - observes video in its natural sequence and generates realistic audio to accompany it. Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs (e.g., Wavenet conditioned on text). We explore several different model architectures to accomplish this task that process both previously-generated audio and video context. These include deep-fusion CNN, dilated Wavenet CNN with visual context, and transformer-based architectures. We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively, but failing to generate more nuanced waveforms.

AdVerb: Visually Guided Audio Dereverberation

  • paper_url: http://arxiv.org/abs/2308.12370
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
  • for: The paper proposes a visually guided audio dereverberation method that uses visual cues to improve audio quality.
  • methods: The method uses a novel geometry-aware cross-modal transformer architecture that exploits scene geometry and the audio-visual relationship to generate a complex ideal ratio mask, which is applied to the reverberant audio to estimate the clean sound.
  • results: The method outperforms traditional audio-only and audio-visual baselines on three downstream tasks (speech enhancement, speech recognition, and speaker verification), with relative improvements of 18%-82%, and achieves highly satisfactory RT60 error scores on the AVSpeech dataset.
    Abstract We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.
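    To illustrate what the complex ideal ratio mask in the pipeline above does, here is a toy numpy sketch on synthetic spectrograms; the mask definition, the eps regularizer, and the shapes are generic assumptions, not AdVerb's transformer that predicts the mask from audio-visual input.

```python
import numpy as np

def complex_ideal_ratio_mask(clean_stft, reverb_stft, eps=1e-8):
    """Complex mask M with M * reverb_stft ~= clean_stft (elementwise)."""
    return clean_stft / (reverb_stft + eps)

rng = np.random.default_rng(0)
freq_bins, frames = 257, 100
clean = (rng.standard_normal((freq_bins, frames))
         + 1j * rng.standard_normal((freq_bins, frames)))
reverb = clean + 0.2 * (rng.standard_normal((freq_bins, frames))
                        + 1j * rng.standard_normal((freq_bins, frames)))

# In AdVerb the mask is predicted from the reverberant audio plus an image of
# the scene; here we simply compute the oracle mask and apply it.
mask = complex_ideal_ratio_mask(clean, reverb)
estimate = mask * reverb
print(np.max(np.abs(estimate - clean)))     # ~0, up to the eps regularizer
```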