cs.CV - 2023-07-27

Federated Model Aggregation via Self-Supervised Priors for Highly Imbalanced Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14959
  • repo_url: https://github.com/xmed-lab/fed-mas
  • paper_authors: Marawan Elbatel, Hualiang Wang, Robert Martí, Huazhu Fu, Xiaomeng Li
  • for: The paper targets highly imbalanced medical datasets, including skin lesions and gastrointestinal images. Existing federated methods for highly imbalanced datasets primarily optimize a global model without accounting for intra-class variations arising from differing populations, findings, and scanners.
  • methods: The paper uses publicly available self-supervised networks to study inter-client intra-class variations. Specifically, the authors find that employing a shared auxiliary pre-trained model, such as MoCo-V2, locally on every client yields consistent divergence measurements. Based on these findings, they derive a dynamic balanced Model Aggregation via Self-supervised priors (MAS) to guide the global model optimization (see the sketch below).
  • results: Fed-MAS can be combined with different local learning methods for effective model aggregation, producing a highly robust and unbiased global model. Code is available at \url{https://github.com/xmed-lab/Fed-MAS}.
    Abstract In the medical field, federated learning commonly deals with highly imbalanced datasets, including skin lesions and gastrointestinal images. Existing federated methods under highly imbalanced datasets primarily focus on optimizing a global model without incorporating the intra-class variations that can arise in medical imaging due to different populations, findings, and scanners. In this paper, we study the inter-client intra-class variations with publicly available self-supervised auxiliary networks. Specifically, we find that employing a shared auxiliary pre-trained model, like MoCo-V2, locally on every client yields consistent divergence measurements. Based on these findings, we derive a dynamic balanced model aggregation via self-supervised priors (MAS) to guide the global model optimization. Fed-MAS can be utilized with different local learning methods for effective model aggregation toward a highly robust and unbiased global model. Our code is available at \url{https://github.com/xmed-lab/Fed-MAS}.
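For intuition, here is a minimal, hypothetical sketch of divergence-informed aggregation: clients whose local features diverge less from the shared self-supervised prior (e.g., MoCo-V2) receive larger aggregation weights. The `mas_aggregate` name, the softmax weighting, and the temperature `tau` are illustrative assumptions; the paper's exact rule is in its repo.

```python
import torch
import torch.nn as nn

def mas_aggregate(client_states, divergences, tau=1.0):
    """Average client state_dicts, down-weighting clients whose local
    features diverge more from the shared self-supervised prior."""
    d = torch.tensor(divergences, dtype=torch.float32)
    weights = torch.softmax(-d / tau, dim=0)  # smaller divergence -> larger weight
    merged = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        w = weights.view(-1, *([1] * (stacked.dim() - 1)))
        merged[key] = (w * stacked).sum(dim=0)
    return merged

# Two toy clients with divergence scores measured against the shared prior.
m1, m2 = nn.Linear(4, 2), nn.Linear(4, 2)
global_state = mas_aggregate([m1.state_dict(), m2.state_dict()], [0.2, 0.8])
```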

GET3D–: Learning GET3D from Unconstrained Image Collections

  • paper_url: http://arxiv.org/abs/2307.14918
  • repo_url: None
  • paper_authors: Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong
  • for: The demand for efficient 3D model generation is growing exponentially, while manual creation of 3D models requires specialized expertise and substantial time.
  • methods: The authors propose GET3D--, which generates high-quality textured 3D shapes directly from 2D images with unknown pose and scale. It comprises a 3D shape generator and a learnable camera sampler (sketched below), along with a novel training schedule that stably optimizes both in a unified framework.
  • results: Experiments show that GET3D-- precisely fits the 6D camera pose distribution and generates high-quality 3D shapes. Extensive experiments on synthetic and realistic unconstrained datasets demonstrate the reliability and stability of the method.
    Abstract The demand for efficient 3D model generation techniques has grown exponentially, as manual creation of 3D models is time-consuming and requires specialized expertise. While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes. To overcome this limitation, we propose GET3D--, the first method that directly generates textured 3D shapes from 2D images with unknown pose and scale. GET3D-- comprises a 3D shape generator and a learnable camera sampler that captures the 6D external changes on the camera. In addition, we propose a novel training schedule to stably optimize both the shape generator and camera sampler in a unified framework. By controlling external variations using the learnable camera sampler, our method can generate aligned shapes with clear textures. Extensive experiments demonstrate the efficacy of GET3D--, which precisely fits the 6D camera pose distribution and generates high-quality shapes on both synthetic and realistic unconstrained datasets.
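As a rough illustration of a learnable camera sampler, the hypothetical module below parameterizes a Gaussian over 6D camera parameters (3 for rotation, 3 for translation) and draws differentiable samples via the reparameterization trick; the actual pose parameterization in GET3D-- may differ.

```python
import torch
import torch.nn as nn

class CameraSampler(nn.Module):
    """Learnable Gaussian over 6D camera parameters."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(6))     # learned pose mean
        self.log_std = nn.Parameter(torch.zeros(6))  # learned pose spread

    def forward(self, batch_size):
        eps = torch.randn(batch_size, 6)
        return self.mean + self.log_std.exp() * eps  # differentiable samples

sampler = CameraSampler()
poses = sampler(4)  # gradients flow back into mean/log_std during training
```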

NSA: Naturalistic Support Artifact to Boost Network Confidence

  • paper_url: http://arxiv.org/abs/2307.14917
  • repo_url: None
  • paper_authors: Abhijith Sharma, Phil Munz, Apurva Narayan
  • for: The paper addresses the robustness of visual AI systems against natural and synthetic physical corruptions.
  • methods: The paper proposes Naturalistic Support Artifacts (NSAs), natural-looking objects produced through artifact training with a DC-GAN, to boost prediction confidence.
  • results: NSAs counter natural corruptions, improving prediction confidence scores by four times on Imagenette, and also increase adversarial accuracy by 8% on average.
    Abstract Visual AI systems are vulnerable to natural and synthetic physical corruption in the real-world. Such corruption often arises unexpectedly and alters the model's performance. In recent years, the primary focus has been on adversarial attacks. However, natural corruptions (e.g., snow, fog, dust) are an omnipresent threat to visual AI systems and should be considered equally important. Many existing works propose interesting solutions to train robust models against natural corruption. These works either leverage image augmentations, which come with the additional cost of model training, or place suspicious patches in the scene to design unadversarial examples. In this work, we propose the idea of naturalistic support artifacts (NSA) for robust prediction. The NSAs are shown to be beneficial in scenarios where model parameters are inaccessible and adding artifacts in the scene is feasible. The NSAs are natural looking objects generated through artifact training using DC-GAN to have high visual fidelity in the scene. We test against natural corruptions on the Imagenette dataset and observe the improvement in prediction confidence score by four times. We also demonstrate NSA's capability to increase adversarial accuracy by 8\% on average. Lastly, we qualitatively analyze NSAs using saliency maps to understand how they help improve prediction confidence.

Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning

  • paper_url: http://arxiv.org/abs/2307.15099
  • repo_url: None
  • paper_authors: Keisuke Kubota, Masahiro Okuda
  • for: The paper aims to classify illustrations by their elusive "atmosphere", which can help with recommendations and searches.
  • methods: The paper uses both supervised learning with pseudo-labels and unsupervised learning to obtain feature vectors, then clusters on those vectors (see the sketch below).
  • results: Experimental analyses show that the method outperforms conventional methods in human-like clustering.
    Abstract The distribution of illustrations on social media, such as Twitter and Pixiv, has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of illustrations plays an important role in user preferences. Classifying illustrations by atmosphere can be helpful for recommendations and searches. However, assigning clear labels to the elusive "atmosphere" is difficult, so conventional supervised classification is not always practical. Furthermore, even images with similar colors, edges, and low-level features may not have similar atmospheres, making classification based on low-level features challenging. In this paper, this problem is solved using both supervised and unsupervised learning with pseudo-labels. The feature vectors are obtained using the supervised method with pseudo-labels that contribute to an ambiguous atmosphere. Further, clustering is performed based on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.
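A minimal sketch of the pipeline under stated assumptions: features come from a backbone trained against pseudo-labels, and k-means clusters them. The toy backbone, feature dimension, and cluster count are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
head = nn.Linear(128, 10)  # trained against pseudo-labels (training loop omitted)

images = torch.randn(32, 3, 64, 64)  # stand-in illustration batch
with torch.no_grad():
    feats = backbone(images)          # penultimate feature vectors
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(feats.numpy())
```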

Weakly Supervised AI for Efficient Analysis of 3D Pathology Samples

  • paper_url: http://arxiv.org/abs/2307.14907
  • repo_url: https://github.com/mahmoodlab/mamba
  • paper_authors: Andrew H. Song, Mane Williams, Drew F. K. Williamson, Guillaume Jaume, Andrew Zhang, Bowen Chen, Robert Serafin, Jonathan T. C. Liu, Alex Baras, Anil V. Parwani, Faisal Mahmood
  • for: The paper presents a deep-learning platform for analyzing 3D pathology samples, helping clinicians assess cancer prognosis and treatment response.
  • methods: The Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA) platform processes 3D tissue images from diverse imaging modalities and predicts 5-year biochemical recurrence in prostate cancer patients (a generic MIL pooling sketch follows the abstract).
  • results: MAMBA predicts 5-year biochemical recurrence better than 2D single-slice prognostication (AUC 0.86 and 0.74 vs. 0.79 and 0.57), and incorporating greater tissue volume improves prognostic performance while mitigating risk-prediction variability from sampling bias.
    Abstract Human tissue and its constituent cells form a microenvironment that is fundamentally three-dimensional (3D). However, the standard-of-care in pathologic diagnosis involves selecting a few two-dimensional (2D) sections for microscopic evaluation, risking sampling bias and misdiagnosis. Diverse methods for capturing 3D tissue morphologies have been developed, but they have yet had little translation to clinical practice; manual and computational evaluations of such large 3D data have so far been impractical and/or unable to provide patient-level clinical insights. Here we present Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), a deep-learning-based platform for processing 3D tissue images from diverse imaging modalities and predicting patient outcomes. Archived prostate cancer specimens were imaged with open-top light-sheet microscopy or microcomputed tomography and the resulting 3D datasets were used to train risk-stratification networks based on 5-year biochemical recurrence outcomes via MAMBA. With the 3D block-based approach, MAMBA achieves an area under the receiver operating characteristic curve (AUC) of 0.86 and 0.74, superior to 2D traditional single-slice-based prognostication (AUC of 0.79 and 0.57), suggesting superior prognostication with 3D morphological features. Further analyses reveal that the incorporation of greater tissue volume improves prognostic performance and mitigates risk prediction variability from sampling bias, suggesting the value of capturing larger extents of heterogeneous 3D morphology. With the rapid growth and adoption of 3D spatial biology and pathology techniques by researchers and clinicians, MAMBA provides a general and efficient framework for 3D weakly supervised learning for clinical decision support and can help to reveal novel 3D morphological biomarkers for prognosis and therapeutic response.
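The generic mechanism behind weakly supervised multiple-instance pipelines of this kind is attention-based MIL pooling, sketched below; MAMBA's exact aggregator lives in its repo, so treat the dimensions and module as illustrative.

```python
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    """Pool patch features of one tissue block into a single embedding."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, instances):                        # (num_patches, dim)
        a = torch.softmax(self.score(instances), dim=0)  # patch attention
        return (a * instances).sum(dim=0)                # block embedding

pool = AttentionMILPool()
bag = torch.randn(1000, 512)   # features of patches from a 3D tissue block
embedding = pool(bag)          # fed to a risk-stratification head
```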

Mixture of Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.14897
  • repo_url: https://github.com/aristorenaldo/g-ssl
  • paper_authors: Aristo Renaldo Ruslim, Novanto Yudistira, Budi Darma Setiawan
  • for: The paper aims to improve image classification by proposing a Gated Self-Supervised Learning (G-SSL) method that uses multiple pretext tasks and a Mixture of Expert architecture to combine them.
  • methods: The proposed G-SSL method uses a combination of pretext tasks, including rotation prediction, solving jigsaw puzzles, and predicting relative positions on images. It employs a Mixture of Expert architecture as a gating network to combine the pretext tasks (sketched below) so the model automatically focuses on the most useful augmentations for classification.
  • results: The method is tested on several scenarios, including CIFAR imbalance dataset classification, adversarial perturbations, Tiny-Imagenet dataset classification, and semi-supervised learning, where it outperforms previous self-supervised learning methods. Grad-CAM and t-SNE analyses visualize the important features that influence classification and show that each class is represented and separated properly.
    Abstract Self-supervised learning is a popular method because of its ability to learn features in images without using labels and because it can overcome the limited labeled datasets used in supervised learning. Self-supervised learning works by using a pretext task which will be trained on the model before being applied to a specific task. There are some examples of pretext tasks used in self-supervised learning in the field of image recognition, namely rotation prediction, solving jigsaw puzzles, and predicting relative positions on images. Previous studies have only used one type of transformation as a pretext task. This raises the question of what happens if more than one pretext task is used, with a gating network combining all pretext tasks. Therefore, we propose the Gated Self-Supervised Learning method to improve image classification, which uses more than one transformation as pretext tasks and uses the Mixture of Expert architecture as a gating network to combine each pretext task, so that the model can automatically learn and focus more on the most useful augmentations for classification. We test the performance of the proposed method in several scenarios, namely CIFAR imbalance dataset classification, adversarial perturbations, Tiny-Imagenet dataset classification, and semi-supervised learning. Moreover, Grad-CAM and T-SNE analyses are used to examine how the proposed method identifies the important features that influence image classification and represents the data of each class while separating different classes properly. Our code is in https://github.com/aristorenaldo/G-SSL
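A hypothetical sketch of the gating idea: a Mixture-of-Experts layer weights per-pretext-task feature transforms with a learned softmax gate. The module name, dimensions, and number of tasks are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedPretextFusion(nn.Module):
    def __init__(self, dim=256, num_tasks=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_tasks))
        self.gate = nn.Linear(dim, num_tasks)

    def forward(self, x):                                      # shared features
        gates = torch.softmax(self.gate(x), dim=-1)            # (B, num_tasks)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, T, D)
        return (gates.unsqueeze(-1) * outs).sum(dim=1)         # weighted fusion

fusion = GatedPretextFusion()
fused = fusion(torch.randn(8, 256))  # passed on to the classifier head
```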

Self-Supervised Learning for Improved Synthetic Aperture Sonar Target Recognition

  • paper_url: http://arxiv.org/abs/2307.15098
  • repo_url: None
  • paper_authors: BW Sheffield
  • for: This study explores self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery. The unique challenges of underwater environments make traditional computer vision techniques, which rely heavily on optical camera imagery, less effective, while SAS's ability to generate high-resolution imagery makes it the preferred choice for underwater imaging. However, the voluminous high-resolution SAS data poses a significant labeling challenge, and labeling is a crucial step for training deep neural networks (DNNs).
  • methods: The study evaluates two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised model ResNet18 for binary image classification.
  • results: The results show that while both SSL models can outperform the supervised model in few-shot scenarios with a small number of labels, they do not exceed it when all labels are used. This positions SSL as a viable alternative that maintains task performance while reducing the time and cost of data labeling.
    Abstract This study explores the application of self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery. The unique challenges of underwater environments make traditional computer vision techniques, which rely heavily on optical camera imagery, less effective. SAS, with its ability to generate high-resolution imagery, emerges as a preferred choice for underwater imaging. However, the voluminous high-resolution SAS data presents a significant challenge for labeling; a crucial step for training deep neural networks (DNNs). SSL, which enables models to learn features in data without the need for labels, is proposed as a potential solution to the data labeling challenge in SAS. The study evaluates the performance of two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised learning model, ResNet18, for binary image classification tasks. The findings suggest that while both SSL models can outperform a fully supervised model with access to a small number of labels in a few-shot scenario, they do not exceed it when all the labels are used. The results underscore the potential of SSL as a viable alternative to traditional supervised learning, capable of maintaining task performance while reducing the time and costs associated with data labeling. The study also contributes to the growing body of evidence supporting the use of SSL in remote sensing and could stimulate further research in this area.

Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration

  • paper_url: http://arxiv.org/abs/2307.14866
  • repo_url: https://github.com/xacheng1996/sllm
  • paper_authors: Harry Cheng, Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Mohan Kankanhalli
  • for: Improving the efficiency of video action recognition models, especially under limited computational budgets.
  • methods: A novel method that restores the intermediate features of two sparsely sampled, adjacent video frames at negligible additional computational cost (a sketch follows the abstract).
  • results: Extensive experiments on four public datasets show that the method improves the efficiency of three commonly used baselines by over 50% with only a 0.5% drop in recognition accuracy, and it surprisingly also improves generalization under zero-shot settings.
    Abstract Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim to either reduce model size or utilize pre-trained models, limiting their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a prevalent problem in many approaches yet it has received relatively little attention. Despite the use of fewer frames being a potential solution, this approach often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. This feature restoration technique brings a negligible increase in computational requirements compared to resource-intensive image encoders, such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets, including Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy. In addition, our method also surprisingly helps improve the generalization ability of the models under zero-shot settings.
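As a rough, hypothetical sketch of feature restoration: a lightweight module predicts the skipped frame's features from its two sampled neighbors, so the heavy image encoder (e.g., ViT) never runs on the skipped frame. The MLP design is an assumption.

```python
import torch
import torch.nn as nn

class FeatureRestorer(nn.Module):
    """Predict features of an unsampled frame from two sampled neighbors."""
    def __init__(self, dim=768):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, feat_a, feat_b):     # features of frames t and t+2
        return self.mix(torch.cat([feat_a, feat_b], dim=-1))  # frame t+1

restore = FeatureRestorer()
f_mid = restore(torch.randn(4, 768), torch.randn(4, 768))
```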

A full-resolution training framework for Sentinel-2 image fusion

  • paper_url: http://arxiv.org/abs/2307.14864
  • repo_url: https://github.com/matciotola/FR-FUSE
  • paper_authors: Matteo Ciotola, Mario Ragosta, Giovanni Poggi, Giuseppe Scarpa
  • for: This work proposes a new unsupervised framework for training deep learning models for super-resolution of Sentinel-2 images by fusing the 10-m and 20-m bands.
  • methods: The scheme avoids the resolution-downgrade step needed to generate training data in the supervised case, and proposes a loss that enforces cycle-consistency between the network prediction and the input components being fused (sketched below).
  • results: In preliminary experiments, the scheme shows promising results compared with the supervised approach, and by construction of the loss, the trained network can be ascribed to the class of multi-resolution analysis methods.
    Abstract This work presents a new unsupervised framework for training deep learning models for super-resolution of Sentinel-2 images by fusion of its 10-m and 20-m bands. The proposed scheme avoids the resolution downgrade process needed to generate training data in the supervised case. On the other hand, a proper loss that accounts for cycle-consistency between the network prediction and the input components to be fused is proposed. Despite its unsupervised nature, in our preliminary experiments the proposed scheme has shown promising results in comparison to the supervised approach. Besides, by construction of the proposed loss, the resulting trained network can be ascribed to the class of multi-resolution analysis methods.
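A minimal sketch of a cycle-consistency loss of this flavor: the fused prediction, mapped back to each input resolution, should match the original 10-m and 20-m components. The average-pooling degradation operator and L1 penalty are assumptions; the paper's operators may differ.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(sr, bands_10m, bands_20m):
    """sr: fused prediction on the 10-m grid, 10-m bands first."""
    pred_10m = sr[:, :bands_10m.shape[1]]                   # native scale
    pred_20m = F.avg_pool2d(sr[:, bands_10m.shape[1]:], 2)  # back to 20 m
    return F.l1_loss(pred_10m, bands_10m) + F.l1_loss(pred_20m, bands_20m)

sr = torch.rand(2, 10, 64, 64)  # e.g., 4 "10-m" + 6 "20-m" bands, fused
loss = cycle_consistency_loss(sr, torch.rand(2, 4, 64, 64),
                              torch.rand(2, 6, 32, 32))
```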

IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer

  • paper_url: http://arxiv.org/abs/2307.14863
  • repo_url: https://github.com/sunnyhaze/iml-vit
  • paper_authors: Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y. Al Hammadi, Jizhe Zhou
  • for: IML-ViT is designed to capture artifacts for image manipulation localization, which is crucial for ensuring the trustworthiness of multimedia.
  • methods: IML-ViT utilizes a ViT architecture with high-resolution capacity, multi-scale feature extraction, and manipulation edge supervision to capture artifacts; the self-attention mechanism enhances the model's ability to extract non-semantic discrepancies between manipulated and authentic regions.
  • results: Extensive experiments on five benchmark datasets demonstrate that IML-ViT outperforms state-of-the-art manipulation localization methods, showcasing the effectiveness of the proposed approach.
    Abstract Advanced image tampering techniques are increasingly challenging the trustworthiness of multimedia, leading to the development of Image Manipulation Localization (IML). But what makes a good IML model? The answer lies in the way to capture artifacts. Exploiting artifacts requires the model to extract non-semantic discrepancies between manipulated and authentic regions, necessitating explicit comparisons between the two areas. With its self-attention mechanism, the Transformer should naturally be a better candidate to capture artifacts. However, due to limited datasets, there is currently no pure ViT-based approach for IML to serve as a benchmark, and CNNs dominate the entire task. Nevertheless, CNNs suffer from weak long-range and non-semantic modeling. To bridge this gap, based on the fact that artifacts are sensitive to image resolution, amplified under multi-scale features, and massive at the manipulation border, we formulate the answer to the former question as building a ViT with high-resolution capacity, multi-scale feature extraction capability, and manipulation edge supervision that could converge with a small amount of data. We term this simple but effective ViT paradigm IML-ViT, which has significant potential to become a new benchmark for IML. Extensive experiments on five benchmark datasets verified that our model outperforms the state-of-the-art manipulation localization methods. Code and models are available at \url{https://github.com/SunnyHaze/IML-ViT}.

Comparative Evaluation of Digital and Analog Chest Radiographs to Identify Tuberculosis using Deep Learning Model

  • paper_url: http://arxiv.org/abs/2307.14859
  • repo_url: None
  • paper_authors: Subhankar Chattoraj, Bhargava Reddy, Manoj Tadepalli, Preetham Putha
  • for: This study evaluates the performance of a deep learning (DL)-based device in identifying radiological signs of TB on chest X-rays (CXR).
  • methods: A dataset of 10,000 CXR DICOMs (.dcm) and printed film photographs acquired with three different cellular phones - Samsung S8, iPhone 8, and iPhone XS - was retrospectively collected, along with radiological reports, from various sites across India.
  • results: The DL-based device identified radiological signs of TB with an AUC of 0.928, a sensitivity of 0.841, and a specificity of 0.806 on the original DICOMs. The AUC differences between the three smartphones and the original DICOMs were 0.024 (2.55%), 0.048 (5.10%), and 0.038 (1.91%), demonstrating the robustness and flexibility of the DL-based device on both digital and analog CXR.
    Abstract Purpose: Chest X-ray (CXR) is an essential tool and one of the most prescribed imaging modalities to detect pulmonary abnormalities, with a yearly estimate of over 2 billion imaging procedures performed worldwide. However, the accurate and timely diagnosis of TB remains an unmet goal. The prevalence of TB is highest in low-middle-income countries, and a portable, automated, and reliable solution is required. In this study, we compared the performance of DL-based devices on digital and analog CXR. The evaluated DL-based device can be used in resource-constrained settings. Methods: A total of 10,000 CXR DICOMs (.dcm) and printed photos of the films acquired with three different cellular phones - Samsung S8, iPhone 8, and iPhone XS - along with their radiological reports were retrospectively collected from various sites across India from April 2020 to March 2021. Results: 10,000 chest X-rays were utilized to evaluate the DL-based device in identifying radiological signs of TB. The AUC of qXR for detecting signs of tuberculosis on the original DICOMs dataset was 0.928 with a sensitivity of 0.841 at a specificity of 0.806. At an optimal threshold, the difference in the AUC of the three cellular smartphones from the original DICOMs is 0.024 (2.55%), 0.048 (5.10%), and 0.038 (1.91%). The minimal differences demonstrate the robustness of the DL-based device in identifying radiological signs of TB in both digital and analog CXR.

Simplified Concrete Dropout – Improving the Generation of Attribution Masks for Fine-grained Classification

  • paper_url: http://arxiv.org/abs/2307.14825
  • repo_url: None
  • paper_authors: Dimitri Korsch, Maha Shadaydeh, Joachim Denzler
  • for: The paper aims to improve the precision and coherence of visual explanations for fine-grained classification models.
  • methods: The paper builds on perturbation-based explanations, specifically the fill-in of the dropout (FIDO) algorithm, and simplifies its concrete dropout (CD) sampling (sketched below) to reduce reliance on large mini-batch updates of the sampling parameters.
  • results: The results show that the method reduces computational effort while producing finer and more coherent attribution masks, and the resulting masks improve the classification performance of a trained model without additional fine-tuning.
    Abstract Fine-grained classification is a particular case of a classification problem, aiming to classify objects that share the visual appearance and can only be distinguished by subtle differences. Fine-grained classification models are often deployed to determine animal species or individuals in automated animal monitoring systems. Precise visual explanations of the model's decision are crucial to analyze systematic errors. Attention- or gradient-based methods are commonly used to identify regions in the image that contribute the most to the classification decision. These methods deliver either too coarse or too noisy explanations, unsuitable for identifying subtle visual differences reliably. However, perturbation-based methods can precisely identify pixels causally responsible for the classification result. The fill-in of the dropout (FIDO) algorithm is one such method. It utilizes concrete dropout (CD) to sample a set of attribution masks and updates the sampling parameters based on the output of the classification model. A known problem of the algorithm is a high variance in the gradient estimates, which the authors have mitigated until now by mini-batch updates of the sampling parameters. This paper presents a solution to circumvent these computational instabilities by simplifying the CD sampling and reducing reliance on large mini-batch sizes. First, it allows estimating the parameters with smaller mini-batch sizes without losing the quality of the estimates but with a reduced computational effort. Furthermore, our solution produces finer and more coherent attribution masks. Finally, we use the resulting attribution masks to improve the classification performance of a trained model without additional fine-tuning of the model.
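For reference, here is the standard concrete (relaxed Bernoulli) sampling that FIDO-style methods use to draw differentiable attribution masks; the per-pixel logits are the parameters being optimized. The paper's simplification of this sampling is not reproduced here.

```python
import torch

def sample_concrete_mask(logits, temperature=0.1, eps=1e-8):
    """Soft per-pixel keep/drop mask via the binary concrete relaxation."""
    u = torch.rand_like(logits).clamp(eps, 1 - eps)  # uniform noise
    gumbel = torch.log(u) - torch.log(1 - u)
    return torch.sigmoid((logits + gumbel) / temperature)

logits = torch.zeros(1, 1, 224, 224, requires_grad=True)  # mask parameters
mask = sample_concrete_mask(logits)  # differentiable w.r.t. the logits
```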

Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database

  • paper_url: http://arxiv.org/abs/2308.08517
  • repo_url: None
  • paper_authors: Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar
  • for: The goal of this paper is to build a large-scale, automatically annotated dataset of medical radiology images.
  • methods: The paper uses an automated, unsupervised approach over multimodal sources, including images, DICOM metadata, and narrative diagnoses, testing several appropriate feature extractors for each source to determine the best ones.
  • results: The optimal feature extractors are integrated into a multimodal representation, which is evaluated with k-means and k-medoids clustering on a representative subset (a fusion-and-clustering sketch follows the abstract). The results indicate that fusing the embeddings of all data sources works best for the unsupervised clustering task, yielding the most concise clusters.
    Abstract Background and objective: The usage of machine learning in medical diagnosis and treatment has witnessed significant growth in recent years through the development of computer-aided diagnosis systems that are often relying on annotated medical radiology images. However, the availability of large annotated image datasets remains a major obstacle since the process of annotation is time-consuming and costly. This paper explores how to automatically annotate a database of medical radiology images with regard to their semantic similarity. Material and methods: An automated, unsupervised approach is used to construct a large annotated dataset of medical radiology images originating from Clinical Hospital Centre Rijeka, Croatia, utilising multimodal sources, including images, DICOM metadata, and narrative diagnoses. Several appropriate feature extractors are tested for each of the data sources, and their utility is evaluated using k-means and k-medoids clustering on a representative data subset. Results: The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation. Conclusion: The results suggest that fusing the embeddings of all three data sources together works best for the task of unsupervised clustering of large-scale medical data, resulting in the most concise clusters. Hence, this work is the first step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
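A minimal sketch of the fusion-and-clustering step, with random arrays standing in for whichever image, metadata, and text embeddings the paper's evaluation selects; per-modality L2 normalization before concatenation is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def fuse_and_cluster(img_emb, meta_emb, text_emb, n_clusters=50):
    # Normalize each modality so no single source dominates, then concatenate.
    fused = np.concatenate([normalize(img_emb), normalize(meta_emb),
                            normalize(text_emb)], axis=1)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused)

labels = fuse_and_cluster(np.random.rand(1000, 128),   # image embeddings
                          np.random.rand(1000, 32),    # DICOM metadata
                          np.random.rand(1000, 64))    # narrative diagnoses
```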

Fading memory as inductive bias in residual recurrent networks

  • paper_url: http://arxiv.org/abs/2307.14823
  • repo_url: None
  • paper_authors: Igor Dubinin, Felix Effenberger
  • for: The paper studies how residual connections in RNNs influence network performance, dynamics, and fading memory properties.
  • methods: The authors introduce weakly coupled residual recurrent networks (WCRNNs), in which residual connections yield well-defined Lyapunov exponents, and study how different forms of residual connections affect performance on benchmark tasks (a toy cell is sketched below).
  • results: Distinct forms of residual connections provide effective inductive biases that increase network expressivity. In particular, residual connections that place the dynamics near the edge of chaos, that let networks capitalize on characteristic spectral properties of the data, or that produce heterogeneous memory properties all increase practical expressivity. The authors also show how the results extend to non-linear residuals and introduce a weakly coupled residual initialization scheme for Elman RNNs.
    Abstract Residual connections have been proposed as architecture-based inductive bias to mitigate the problem of exploding and vanishing gradients and increase task performance in both feed-forward and recurrent networks (RNNs) when trained with the backpropagation algorithm. Yet, little is known about how residual connections in RNNs influence their dynamics and fading memory properties. Here, we introduce weakly coupled residual recurrent networks (WCRNNs) in which residual connections result in well-defined Lyapunov exponents and allow for studying properties of fading memory. We investigate how the residual connections of WCRNNs influence their performance, network dynamics, and memory properties on a set of benchmark tasks. We show that several distinct forms of residual connections yield effective inductive biases that result in increased network expressivity. In particular, residual connections that (i) result in network dynamics at the proximity of the edge of chaos, (ii) allow networks to capitalize on characteristic spectral properties of the data, and (iii) result in heterogeneous memory properties are shown to increase practical expressivity. In addition, we demonstrate how our results can be extended to non-linear residuals and introduce a weakly coupled residual initialization scheme that can be used for Elman RNNs.
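A toy, hypothetical reading of a weakly coupled residual recurrent cell: the previous hidden state passes through a scaled identity shortcut, so the coupling strength `alpha` tunes how quickly memory fades. The exact coupling form used in the paper may differ.

```python
import torch
import torch.nn as nn

class WeakResidualElmanCell(nn.Module):
    def __init__(self, input_size, hidden_size, alpha=0.9):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size, nonlinearity='tanh')
        self.alpha = alpha  # residual coupling strength; < 1 gives fading memory

    def forward(self, x, h):
        return self.alpha * h + (1 - self.alpha) * self.cell(x, h)

cell = WeakResidualElmanCell(10, 32)
h = torch.zeros(4, 32)
for t in range(20):   # unroll over a toy sequence
    h = cell(torch.randn(4, 10), h)
```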

Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning

  • paper_url: http://arxiv.org/abs/2307.14786
  • repo_url: https://github.com/jwh97nn/DeepDPS
  • paper_authors: Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Yifeng Geng, Xuansong Xie
  • for: depth-aware panoptic segmentation, for more robust scene understanding
  • methods: joint segmentation and depth estimation, performed per segment with identical object queries; a geometric query enhancement method integrates scene geometry into the object queries via latent representations, and bi-directional guidance learning exploits the mutual relations between the two tasks to facilitate cross-task feature learning
  • results: sets the new state of the art on the Cityscapes-DVPS and SemKITTI-DVPS datasets, and the guidance learning approach delivers performance improvements even under incomplete supervision labels
    Abstract Depth-aware panoptic segmentation is an emerging topic in computer vision which combines semantic and geometric understanding for more robust scene interpretation. Recent works pursue unified frameworks to tackle this challenge but mostly still treat it as two individual learning tasks, which limits their potential for exploring cross-domain information. We propose a deeply unified framework for depth-aware panoptic segmentation, which performs joint segmentation and depth estimation both in a per-segment manner with identical object queries. To narrow the gap between the two tasks, we further design a geometric query enhancement method, which is able to integrate scene geometry into object queries using latent representations. In addition, we propose a bi-directional guidance learning approach to facilitate cross-task feature learning by taking advantage of their mutual relations. Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets. Moreover, our guidance learning approach is shown to deliver performance improvement even under incomplete supervision labels.

Contrastive Knowledge Amalgamation for Unsupervised Image Classification

  • paper_url: http://arxiv.org/abs/2307.14781
  • repo_url: None
  • paper_authors: Shangde Gao, Yichao Fu, Ke Liu, Yuqiang Han
  • for: Learning a compact student model that handles the joint objective of multiple teacher models, each specialized for its own task.
  • methods: Introduces contrastive losses and an alignment loss to achieve intra-class cohesion and inter-class separation (sketched below), while soft targets let the student learn heterogeneous unsupervised classification tasks efficiently and flexibly at the task level.
  • results: Extensive experiments on benchmarks demonstrate the generalization capability of CKA when amalgamating both single and multiple tasks, and comprehensive ablation studies provide further insight.
    Abstract Knowledge amalgamation (KA) aims to learn a compact student model to handle the joint objective from multiple teacher models that are each specialized for their own tasks. Current methods focus on coarsely aligning teachers and students in the common representation space, making it difficult for the student to learn the proper decision boundaries from a set of heterogeneous teachers. Besides, the KL divergence in previous works only minimizes the probability distribution difference between teachers and the student, ignoring the intrinsic characteristics of teachers. Therefore, we propose a novel Contrastive Knowledge Amalgamation (CKA) framework, which introduces contrastive losses and an alignment loss to achieve intra-class cohesion and inter-class separation. Intra- and inter-model contrastive losses are designed to widen the distance between representations of different classes. The alignment loss is introduced to minimize the sample-level distribution differences of teacher-student models in the common representation space. Furthermore, the student learns heterogeneous unsupervised classification tasks through soft targets efficiently and flexibly in the task-level amalgamation. Extensive experiments on benchmarks demonstrate the generalization capability of CKA in the amalgamation of specific tasks as well as multiple tasks. Comprehensive ablation studies provide a further insight into our CKA.
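A hypothetical sketch of the two loss ingredients: a contrastive term that pulls same-class representations together and pushes different classes apart, and a sample-level alignment term matching student features to teacher features. The margin-based form and MSE alignment are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def contrastive_separation(feats, labels, margin=1.0):
    d = torch.cdist(feats, feats)                 # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pull = (d * same).mean()                      # intra-class cohesion
    push = (F.relu(margin - d) * (~same)).mean()  # inter-class separation
    return pull + push

def alignment(student_feats, teacher_feats):
    return F.mse_loss(student_feats, teacher_feats)  # sample-level match

f = torch.randn(16, 64)
y = torch.randint(0, 4, (16,))
loss = contrastive_separation(f, y) + alignment(f, torch.randn(16, 64))
```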

pCTFusion: Point Convolution-Transformer Fusion with Semantic Aware Loss for Outdoor LiDAR Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2307.14777
  • repo_url: https://github.com/GeoAI-Research-Lab/PCTFusion
  • paper_authors: Abhishek Kuriyal, Vaibhav Kumar, Bharat Lohani
  • for: Improving LiDAR point cloud semantic segmentation, particularly the recognition of minor classes and points near class boundaries.
  • methods: Proposes pCTFusion, a new architecture that combines kernel-based convolutions and self-attention to improve feature learning and capture local and global dependencies, with local and global self-attention assigned by the hierarchical positions of the encoder blocks. It also models a novel attention-based loss, Pointwise Geometric Anisotropy (PGA), which assigns weights based on the semantic distribution of points in a neighborhood (sketched below).
  • results: Evaluated on the SemanticKITTI outdoor dataset, pCTFusion improves performance by 5-7% over state-of-the-art architectures, with particularly encouraging gains on minor classes that are often misclassified due to class imbalance. The developed methods can be applied to complex point cloud datasets and can drive real-world LiDAR applications.
    Abstract LiDAR-generated point clouds are crucial for perceiving outdoor environments. The segmentation of point clouds is also essential for many applications. Previous research has focused on using self-attention and convolution (local attention) mechanisms individually in semantic segmentation architectures. However, there is limited work on combining the learned representations of these attention mechanisms to improve performance. Additionally, existing research that combines convolution with self-attention relies on global attention, which is not practical for processing large point clouds. To address these challenges, this study proposes a new architecture, pCTFusion, which combines kernel-based convolutions and self-attention mechanisms for better feature learning and capturing local and global dependencies in segmentation. The proposed architecture employs two types of self-attention mechanisms, local and global, based on the hierarchical positions of the encoder blocks. Furthermore, the existing loss functions do not consider the semantic and position-wise importance of the points, resulting in reduced accuracy, particularly at sharp class boundaries. To overcome this, the study models a novel attention-based loss function called Pointwise Geometric Anisotropy (PGA), which assigns weights based on the semantic distribution of points in a neighborhood. The proposed architecture is evaluated on SemanticKITTI outdoor dataset and showed a 5-7% improvement in performance compared to the state-of-the-art architectures. The results are particularly encouraging for minor classes, often misclassified due to class imbalance, lack of space, and neighbor-aware feature encoding. These developed methods can be leveraged for the segmentation of complex datasets and can drive real-world applications of LiDAR point cloud.
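A rough sketch of the idea behind a PGA-style loss: up-weight points whose k-nearest neighbors carry mixed labels (class boundaries). The boundary score and weighting below are illustrative assumptions; the paper defines the exact anisotropy measure.

```python
import torch
import torch.nn.functional as F

def pga_weighted_loss(logits, labels, xyz, k=8):
    d = torch.cdist(xyz, xyz)
    knn = d.topk(k + 1, largest=False).indices[:, 1:]             # drop self-match
    mixed = (labels[knn] != labels.unsqueeze(1)).float().mean(1)  # boundary score
    weights = 1.0 + mixed                                         # heavier at edges
    ce = F.cross_entropy(logits, labels, reduction='none')
    return (weights * ce).mean()

xyz = torch.randn(500, 3)              # toy point cloud
labels = torch.randint(0, 10, (500,))
loss = pga_weighted_loss(torch.randn(500, 10), labels, xyz)
```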

3DPortraitGAN: Learning One-Quarter Headshot 3D GANs from a Single-View Portrait Dataset with Diverse Body Poses

  • paper_url: http://arxiv.org/abs/2307.14770
  • repo_url: https://github.com/oneThousand1000/3DPortraitGAN
  • paper_authors: Yiqian Wu, Hao Xu, Xiangjun Tang, Hongbo Fu, Xiaogang Jin
  • for: Generating one-quarter headshot 3D portraits with complete head, neck, and shoulder geometry from all camera angles.
  • methods: The authors create the 360°PHQ dataset of high-quality single-view real portraits annotated with a variety of camera parameters (yaw angles spanning the full 360° range) and body poses, and propose 3DPortraitGAN, the first one-quarter headshot 3D portrait generator, which learns a canonical 3D avatar distribution from 360°PHQ with body pose self-learning.
  • results: Experiments show that the framework accurately predicts portrait body poses and generates view-consistent, realistic portrait images with complete geometry from all camera angles.
    Abstract 3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data, and as such, they are unable to construct one-quarter headshot 3D portraits with complete head, neck, and shoulder geometry. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360°-Portrait-HQ (360°PHQ for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360°PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.

Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining

  • paper_url: http://arxiv.org/abs/2307.14768
  • repo_url: https://github.com/zhoubenjia/gfslt-vlp
  • paper_authors: Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, Du Zhang
  • for: This paper aims to improve sign language translation (SLT) by removing the dependence on an intermediate gloss representation, whose scarcity of annotations has hindered the development of SLT.
  • methods: The proposed Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP) inherits language-oriented prior knowledge from pre-trained models without any gloss annotation assistance. The approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences (the CLIP-style objective is sketched below), and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
  • results: The proposed method achieves unprecedented improvements in BLEU-4 score on the PHOENIX14T dataset (>+5) and the CSL-Daily dataset (>+3) compared with state-of-the-art gloss-free SLT methods, and competitive results on PHOENIX14T compared with most gloss-based methods.
    Abstract Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language to text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign language translation (SLT). However, the scarcity of gloss-annotated sign language data, combined with the information bottleneck in the mid-level gloss representation, has hindered the further development of the SLT task. To address this challenge, we propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves SLT by inheriting language-oriented prior knowledge from pre-trained models, without any gloss annotation assistance. Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage. The seamless combination of these novel designs forms a robust sign language representation and significantly improves gloss-free sign language translation. In particular, we have achieved unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset (>+5) and the CSL-Daily dataset (>+3) compared to state-of-the-art gloss-free SLT methods. Furthermore, our approach also achieves competitive results on the PHOENIX14T dataset when compared with most of the gloss-based methods. Our code is available at https://github.com/zhoubenjia/GFSLT-VLP.
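For reference, the CLIP-style contrastive objective used in stage (i) pairs matched visual/text embeddings against in-batch negatives; the sketch below omits the encoders and the masked-sentence restoration task, and the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(v.shape[0])    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```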

Semantic Image Completion and Enhancement using GANs

  • paper_url: http://arxiv.org/abs/2307.14748
  • repo_url: None
  • paper_authors: Priyansh Saxena, Raahat Gupta, Akshat Maheshwari, Saumil Maheshwari
  • for: This paper focuses on the tasks of image completion and enhancement.
  • methods: The paper uses Generative Adversarial Networks (GANs) for image completion (a generic loss sketch follows the abstract).
  • results: GANs help complete and enhance images, improving the quality of the output image.
    Abstract Semantic inpainting or image completion alludes to the task of inferring arbitrary large missing regions in images based on image semantics. Since the prediction of image pixels requires an indication of high-level context, this makes it significantly tougher than image completion, which is often more concerned with correcting data corruption and removing entire objects from the input image. On the other hand, image enhancement attempts to eliminate unwanted noise and blur from the image, along with sustaining most of the image details. An efficient image completion and enhancement model should be able to recover the corrupted and masked regions in images and then refine the image further to increase the quality of the output image. Generative Adversarial Networks (GANs) have turned out to be helpful in picture completion tasks. In this chapter, we will discuss the underlying GAN architecture and how it can be used for image completion tasks.
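A generic, hypothetical sketch of GAN-based inpainting losses: the generator fills the masked region and is trained with a reconstruction term on known pixels plus an adversarial term from a discriminator. The stub networks and loss weighting are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Conv2d(4, 3, 3, padding=1)                 # stand-in network
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

image = torch.rand(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.7).float()           # 1 = missing pixel

corrupted = image * (1 - mask)
completed = generator(torch.cat([corrupted, mask], dim=1))
recon = F.l1_loss(completed * (1 - mask), image * (1 - mask))  # known pixels
adv = -discriminator(completed).mean()                    # WGAN-style term
loss = recon + 0.01 * adv
```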

Test Time Adaptation for Blind Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.14735
  • repo_url: https://github.com/shankhanil006/tta-iqa
  • paper_authors: Subhadeep Roy, Shankhanil Mitra, Soma Biswas, Rajiv Soundararajan
  • for: Improving blind image quality assessment (IQA) at inference time, since the distribution shift between training and testing scenarios often degrades existing algorithms.
  • methods: Two novel quality-relevant auxiliary tasks: a group contrastive loss at the batch level and a relative rank loss at the sample level, which make the model quality aware and adapt it to the target data.
  • results: Even a small batch of images from the test distribution yields significant performance improvement by updating the batch normalization statistics of the source model (sketched below).
    Abstract While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.
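A minimal sketch of the batch-norm side of test-time adaptation: putting BN layers into training mode so a forward pass over a small test batch refreshes their running statistics. The paper's quality-aware auxiliary losses would be optimized on top of this; the backbone here is a placeholder.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=None)   # stand-in for the source IQA model
model.eval()
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.train()                        # recompute stats from test batches

test_batch = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    _ = model(test_batch)                # updates BN running mean/variance
```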

Understanding Silent Failures in Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14729
  • repo_url: https://github.com/iml-dkfz/sf-visuals
  • paper_authors: Till J. Bungert, Levin Kobelke, Paul F. Jaeger
  • for: preventing silent failures in medical applications to ensure the reliable use of classification systems
  • methods: either designing classifiers robust enough to avoid failures in the first place, or detecting remaining failures with confidence scoring functions (CSFs); a simple CSF is sketched below
  • results: none of the benchmarked CSFs can reliably prevent silent failures, so a deeper understanding of the root causes of failures in the data is required; the paper introduces SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures
    Abstract To ensure the reliable use of classification systems in medical applications, it is crucial to prevent silent failures. This can be achieved by either designing classifiers that are robust enough to avoid failures in the first place, or by detecting remaining failures using confidence scoring functions (CSFs). A predominant source of failures in image classification is distribution shifts between training data and deployment data. To understand the current state of silent failure prevention in medical imaging, we conduct the first comprehensive analysis comparing various CSFs in four biomedical tasks and a diverse range of distribution shifts. Based on the result that none of the benchmarked CSFs can reliably prevent silent failures, we conclude that a deeper understanding of the root causes of failures in the data is required. To facilitate this, we introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures. On the basis of various examples, we demonstrate how this tool can help researchers gain insight into the requirements for safe application of classification systems in the medical domain. The open-source benchmark and tool are at: https://github.com/IML-DKFZ/sf-visuals.
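For concreteness, the simplest CSF commonly benchmarked is maximum softmax response (MSR): flag a prediction as a potential silent failure when the top softmax probability falls below a threshold. The threshold below is arbitrary.

```python
import torch
import torch.nn.functional as F

def msr_confidence(logits):
    """Maximum softmax response: higher means more confident."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

logits = torch.randn(16, 5)     # classifier outputs for a test batch
conf = msr_confidence(logits)
flagged = conf < 0.7            # candidates for manual review
```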

P2C: Self-Supervised Point Cloud Completion from Single Partial Clouds

  • paper_url: http://arxiv.org/abs/2307.14726
  • repo_url: https://github.com/cuiruikai/partial2complete
  • paper_authors: Ruikai Cui, Shi Qiu, Saeed Anwar, Jiawei Liu, Chaoyue Xing, Jing Zhang, Nick Barnes
  • for: Completing object shapes from partial point cloud observations.
  • methods: The authors propose Partial2Complete (P2C), a self-supervised framework that completes point clouds using training samples consisting of only a single incomplete point cloud per object. It uses a Region-Aware Chamfer Distance to regularize shape mismatch (the vanilla Chamfer distance is sketched below) and a Normal Consistency Constraint that encodes a local planarity assumption.
  • results: P2C is comparable to methods trained with complete point clouds and outperforms methods learned from multiple partial observations. Code is available at https://github.com/CuiRuikai/Partial2Complete.
    Abstract Point cloud completion aims to recover the complete shape based on a partial observation. Existing methods require either complete point clouds or multiple partial observations of the same object for learning. In contrast to previous approaches, we present Partial2Complete (P2C), the first self-supervised framework that completes point cloud objects using training samples consisting of only a single incomplete point cloud per object. Specifically, our framework groups incomplete point clouds into local patches as input and predicts masked patches by learning prior information from different partial objects. We also propose Region-Aware Chamfer Distance to regularize shape mismatch without limiting completion capability, and devise the Normal Consistency Constraint to incorporate a local planarity assumption, encouraging the recovered shape surface to be continuous and complete. In this way, P2C no longer needs multiple observations or complete point clouds as ground truth. Instead, structural cues are learned from a category-specific dataset to complete partial point clouds of objects. We demonstrate the effectiveness of our approach on both synthetic ShapeNet data and real-world ScanNet data, showing that P2C produces comparable results to methods trained with complete shapes, and outperforms methods learned with multiple partial observations. Code is available at https://github.com/CuiRuikai/Partial2Complete.
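For reference, the vanilla symmetric Chamfer distance is sketched below; P2C's Region-Aware variant builds on it by reweighting terms per region, which is not reproduced here.

```python
import torch

def chamfer_distance(a, b):
    """a: (N, 3), b: (M, 3) point clouds."""
    d = torch.cdist(a, b)                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

loss = chamfer_distance(torch.randn(1024, 3), torch.randn(2048, 3))
```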

vox2vec: A Framework for Self-supervised Contrastive Learning of Voxel-level Representations in Medical Images

  • paper_url: http://arxiv.org/abs/2307.14725
  • repo_url: https://github.com/mishgon/vox2vec
  • paper_authors: Mikhail Goncharov, Vera Soboleva, Anvar Kurmukov, Maxim Pisov, Mikhail Belyaev
  • for: This paper proposes a self-supervised learning (SSL) method for learning voxel-level representations in medical images.
  • methods: A contrastive method in which voxel representations are produced by a Feature Pyramid Network (FPN): each voxel representation concatenates the corresponding feature vectors from different pyramid levels, and the FPN is pre-trained to produce similar representations for the same voxel in different augmented contexts and distinctive representations for different voxels (a toy sketch follows the abstract).
  • results: Pre-trained on more than 6,500 publicly available computed tomography images and evaluated on 22 segmentation tasks, vox2vec outperforms existing medical imaging SSL techniques in all three evaluation setups: linear probing, non-linear probing, and end-to-end fine-tuning. A non-linear head trained on frozen vox2vec representations matches an FPN trained from scratch while having 50 times fewer trainable parameters. Code: https://github.com/mishgon/vox2vec.
    Abstract This paper introduces vox2vec - a contrastive method for self-supervised learning (SSL) of voxel-level representations. vox2vec representations are modeled by a Feature Pyramid Network (FPN): a voxel representation is a concatenation of the corresponding feature vectors from different pyramid levels. The FPN is pre-trained to produce similar representations for the same voxel in different augmented contexts and distinctive representations for different voxels. This results in unified multi-scale representations that capture both global semantics (e.g., body part) and local semantics (e.g., different small organs or healthy versus tumor tissue). We use vox2vec to pre-train a FPN on more than 6500 publicly available computed tomography images. We evaluate the pre-trained representations by attaching simple heads on top of them and training the resulting models for 22 segmentation tasks. We show that vox2vec outperforms existing medical imaging SSL techniques in three evaluation setups: linear and non-linear probing and end-to-end fine-tuning. Moreover, a non-linear head trained on top of the frozen vox2vec representations achieves competitive performance with the FPN trained from scratch while having 50 times fewer trainable parameters. The code is available at https://github.com/mishgon/vox2vec .

EFLNet: Enhancing Feature Learning for Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2307.14723
  • repo_url: None
  • paper_authors: Bo Yang, Xinyu Zhang, Jiahao Zhu, Jian Zhang, Dongjian Tian, Jun Luo, Mingliang Zhou, Yangjun Pi
  • for: Addressing the challenges of single-frame infrared small target detection: the extreme imbalance between target and background, the extreme sensitivity of bounding box regression to infrared small targets, and the ease with which small-target information is lost in high-level semantic layers.
  • methods: An enhancing feature learning network (EFLNet) built on the YOLOv7 framework: an adaptive threshold focal loss that adjusts loss weights automatically so the model attends more to target features; a normalized Gaussian Wasserstein distance that mitigates the sensitivity of bounding box regression to small targets (sketched after the abstract); and a dynamic head mechanism that adaptively learns the relative importance of each semantic layer.
  • results: Experiments show better infrared small target detection performance than state-of-the-art deep-learning-based methods.
    Abstract Single-frame infrared small target detection is considered a challenging task: there is an extreme imbalance between target and background, bounding box regression is extremely sensitive to infrared small targets, and small target information is easily lost in the high-level semantic layers. In this paper, we propose an enhancing feature learning network (EFLNet) based on the YOLOv7 framework to solve these problems. First, we notice that the extreme imbalance between target and background in infrared images makes the model pay more attention to background features, resulting in missed detections. To address this problem, we propose a new adaptive threshold focal loss function that adjusts the loss weight automatically, compelling the model to allocate greater attention to target features. Second, we introduce the normalized Gaussian Wasserstein distance to alleviate the difficulty of model convergence caused by the extreme sensitivity of bounding box regression to infrared small targets. Finally, we incorporate a dynamic head mechanism into the network to enable adaptive learning of the relative importance of each semantic layer. Experimental results demonstrate that our method achieves better detection performance on infrared small targets than state-of-the-art deep-learning-based methods.

Robust vertebra identification using simultaneous node and edge predicting Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.02509
  • repo_url: https://github.com/imfusiongmbh/vid-vertebra-identification-dataset
  • paper_authors: Vincent Bürgin, Raphael Prevost, Marijn F. Stollenga
  • for: Automatically localizing and identifying vertebrae in CT scans, which is important for numerous clinical applications.
  • methods: A simple pipeline: a standard U-Net prediction followed by a single graph neural network that associates and classifies vertebrae with full orientation (a toy node-and-edge GNN sketch follows the abstract).
  • results: The method correctly associates vertebra bodies with pedicle landmarks, ignores false positives, and performs competitively on the standard VerSe challenge body identification task.
    Abstract Automatic vertebra localization and identification in CT scans is important for numerous clinical applications. Much progress has been made on this topic, but it mostly targets positional localization of vertebrae, ignoring their orientation. Additionally, most methods employ heuristics in their pipeline that can be sensitive in real clinical images which tend to contain abnormalities. We introduce a simple pipeline that employs a standard prediction with a U-Net, followed by a single graph neural network to associate and classify vertebrae with full orientation. To test our method, we introduce a new vertebra dataset that also contains pedicle detections that are associated with vertebra bodies, creating a more challenging landmark prediction, association and classification task. Our method is able to accurately associate the correct body and pedicle landmarks, ignore false positives and classify vertebrae in a simple, fully trainable pipeline avoiding application-specific heuristics. We show our method outperforms traditional approaches such as Hungarian Matching and Hidden Markov Models. We also show competitive performance on the standard VerSe challenge body identification task.

GaitMorph: Transforming Gait by Optimally Transporting Discrete Codes

  • paper_url: http://arxiv.org/abs/2307.14713
  • repo_url: None
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: Training gait recognition systems without explicit human annotations, via self-supervised learning.
  • methods: A high-compression model for gait skeleton sequences that leverages unlabelled data to construct a discrete and interpretable latent space preserving identity-related features, plus a method based on optimal transport theory that learns latent transport maps on the discrete codebook to morph gait sequences between variations (a generic Sinkhorn sketch follows the abstract).
  • results: Extensive experiments show the method can synthesize additional views for an input sequence.
    Abstract Gait, the manner of walking, has been proven to be a reliable biometric with uses in surveillance, marketing and security. A promising new direction for the field is training gait recognition systems without explicit human annotations, through self-supervised learning approaches. Such methods are heavily reliant on strong augmentations for the same walking sequence to induce more data variability and to simulate additional walking variations. Current data augmentation schemes are heuristic and cannot provide the necessary data variation as they are only able to provide simple temporal and spatial distortions. In this work, we propose GaitMorph, a novel method to modify the walking variation for an input gait sequence. Our method entails the training of a high-compression model for gait skeleton sequences that leverages unlabelled data to construct a discrete and interpretable latent space, which preserves identity-related features. Furthermore, we propose a method based on optimal transport theory to learn latent transport maps on the discrete codebook that morph gait sequences between variations. We perform extensive experiments and show that our method is suitable to synthesize additional views for an input sequence.

Pre-training Vision Transformers with Very Limited Synthesized Images

  • paper_url: http://arxiv.org/abs/2307.14710
  • repo_url: https://github.com/ryoo-nakamura/ofdb
  • paper_authors: Ryo Nakamura, Hirokatsu Kataoka, Sora Takashima, Edgar Josafat Martinez Noriega, Rio Yokota, Nakamasa Inoue
  • for: Studying formula-driven supervised learning (FDSL), a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals.
  • methods: Pre-training vision transformers on formula-generated synthetic image sets, hypothesizing that generating different instances of the same category acts as a form of data augmentation, so a single image per category plus augmentation suffices (see the dataset sketch after the abstract).
  • results: The resulting one-instance fractal database (OFDB) outperforms the original dataset with explicitly generated instances; scaled up to 21,000 categories, OFDB matches or even surpasses ImageNet-21k pre-training on ImageNet-1k fine-tuning, using 21k images versus ImageNet-21k's 14M.
    Abstract Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters in the mathematical formula that generate them. In the present work, we hypothesize that the process for generating different instances for the same category in FDSL can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we only need a single image per category. Our experiments show that this one-instance fractal database (OFDB) performs better than the original dataset where instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is 21k, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.

Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation

  • paper_url: http://arxiv.org/abs/2307.14709
  • repo_url: https://github.com/camwew/tada-mi
  • paper_authors: Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, Weidong Cai
  • for: Improving automated medical image analysis while alleviating the burden of labeled data collection, under both domain shifts and inconsistent label sets.
  • methods: A new method, optimization trajectory distillation, which exploits the low-rank nature of gradient space and uses a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domains and classes with external guidance from reliable sources; evaluated extensively across diverse prediction tasks.
  • results: The method demonstrates effectiveness and improvements over previous approaches.
    Abstract The success of automated medical image analysis depends on large-scale and expert-annotated training sets. Unsupervised domain adaptation (UDA) has been raised as a promising approach to alleviate the burden of labeled data collection. However, they generally operate under the closed-set adaptation setting assuming an identical label set between the source and target domains, which is over-restrictive in clinical practice where new classes commonly exist across datasets due to taxonomic inconsistency. While several methods have been presented to tackle both domain shifts and incoherent label sets, none of them take into account the common characteristics of the two issues and consider the learning dynamics along network training. In this work, we propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective. It exploits the low-rank nature of gradient space and devises a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domain and classes with the external guidance obtained from reliable sources. Our approach resolves the issue of inadequate navigation along network optimization, which is the major obstacle in the taxonomy adaptive cross-domain adaptation scenario. We evaluate the proposed method extensively on several tasks towards various endpoints with clinical and open-world significance. The results demonstrate its effectiveness and improvements over previous methods.

High Dynamic Range Imaging via Visual Attention Modules

  • paper_url: http://arxiv.org/abs/2307.14705
  • repo_url: https://github.com/alirezaomrani95/hdr-vam
  • paper_authors: Ali Reza Omrani, Davide Moroni
  • for: A new High Dynamic Range (HDR) imaging method that enhances image quality and detail.
  • methods: A deep learning architecture that extracts the most informative parts of each image with a visual attention module (VAM) derived from a segmentation strategy, and uses the extracted areas to produce the final HDR image.
  • results: Experiments show the method outperforms most state-of-the-art algorithms.
    Abstract Thanks to High Dynamic Range (HDR) imaging methods, the scope of photography has seen profound changes recently. Specifically, such methods try to reconstruct, from Low Dynamic Range (LDR) images, the real-world luminosity that is lost due to the limitations of regular cameras. Additionally, although the state-of-the-art methods in this area perform well, they mainly concentrate on combining different exposures and pay less attention to extracting the informative parts of the images. Thus, this paper aims to introduce a new model capable of incorporating information from the most visible areas of each image extracted by a visual attention module (VAM), which is a result of a segmentation strategy. In particular, the model, based on a deep learning architecture, utilizes the extracted areas to produce the final HDR image. The results demonstrate that our method outperforms most of the state-of-the-art algorithms.

MIM-OOD: Generative Masked Image Modelling for Out-of-Distribution Detection in Medical Images

  • paper_url: http://arxiv.org/abs/2307.14701
  • repo_url: None
  • paper_authors: Sergio Naval Marimont, Vasilis Siomos, Giacomo Tarroni
  • for: Unsupervised out-of-distribution (OOD) detection: identifying anomalous regions in images using only models trained on images of healthy anatomy.
  • methods: Images are tokenized and the token distribution modeled; instead of a slow, error-accumulating auto-regressive (AR) model, two task-specific networks are used: a transformer that identifies anomalous tokens and a transformer that in-paints them via masked image modelling (MIM).
  • results: On brain MRI anomalies, MIM-OOD substantially outperforms AR models (DICE 0.458 vs 0.301) while achieving a nearly 25x speedup (9.5s vs 244s).
    Abstract Unsupervised Out-of-Distribution (OOD) detection consists in identifying anomalous regions in images leveraging only models trained on images of healthy anatomy. An established approach is to tokenize images and model the distribution of tokens with Auto-Regressive (AR) models. AR models are used to 1) identify anomalous tokens and 2) in-paint anomalous representations with in-distribution tokens. However, AR models are slow at inference time and prone to error accumulation issues which negatively affect OOD detection performance. Our novel method, MIM-OOD, overcomes both speed and error accumulation issues by replacing the AR model with two task-specific networks: 1) a transformer optimized to identify anomalous tokens and 2) a transformer optimized to in-paint anomalous tokens using masked image modelling (MIM). Our experiments with brain MRI anomalies show that MIM-OOD substantially outperforms AR models (DICE 0.458 vs 0.301) while achieving a nearly 25x speedup (9.5s vs 244s).

Unified Adversarial Patch for Visible-Infrared Cross-modal Attacks in the Physical World

  • paper_url: http://arxiv.org/abs/2307.14682
  • repo_url: https://github.com/aries-iai/cross-modal_patch_attack
  • paper_authors: Xingxing Wei, Yao Huang, Yitong Sun, Jie Yu
  • for: Demonstrating the physical-attack risks that remain for DNN-based object detectors even when visible and infrared sensors are combined for security.
  • methods: A unified adversarial patch that performs cross-modal physical attacks, evading both modalities simultaneously by manipulating the patch's shape features; a boundary-limited shape optimization method yields compact, smooth shapes that are easy to realize physically, a score-aware iterative evaluation balances the fooling degree between visible and infrared detectors, and an affine-transformation-based enhancement makes the learned shape robust to varying shooting angles.
  • results: An Attack Success Rate above 80% against several state-of-the-art detectors, with effectiveness confirmed in physical-world scenarios across different angles, distances, postures, and scenes for both visible and infrared sensors.
    Abstract Physical adversarial attacks have put a severe threat to DNN-based object detectors. To enhance security, a combination of visible and infrared sensors is deployed in various scenarios, which has proven effective in disabling existing single-modal physical attacks. To further demonstrate the potential risks in such cases, we design a unified adversarial patch that can perform cross-modal physical attacks, achieving evasion in both modalities simultaneously with a single patch. Given the different imaging mechanisms of visible and infrared sensors, our work manipulates patches' shape features, which can be captured in different modalities when they undergo changes. To deal with challenges, we propose a novel boundary-limited shape optimization approach that aims to achieve compact and smooth shapes for the adversarial patch, making it easy to implement in the physical world. And a score-aware iterative evaluation method is also introduced to balance the fooling degree between visible and infrared detectors during optimization, which guides the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. Furthermore, we propose an Affine-Transformation-based enhancement strategy that makes the learnable shape robust to various angles, thus mitigating the issue of shape deformation caused by different shooting angles in the real world. Our method is evaluated against several state-of-the-art object detectors, achieving an Attack Success Rate (ASR) of over 80%. We also demonstrate the effectiveness of our approach in physical-world scenarios under various settings, including different angles, distances, postures, and scenes for both visible and infrared sensors.

LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.14659
  • repo_url: https://github.com/taowangzj/lldiffusion
  • paper_authors: Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li
  • for: This paper proposes a degradation-aware learning scheme for low-light image enhancement (LLIE) using diffusion models, which effectively integrates degradation and image priors into the diffusion process to improve image enhancement.
  • methods: The proposed method includes a joint learning framework for both image generation and image enhancement to learn degradation representations, as well as a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module that takes into account both the color map and the latent degradation representations to guide the diffusion process.
  • results: Extensive experiments on public benchmarks demonstrate that the proposed LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively, with improved image enhancement and color fidelity.
    Abstract Current deep learning methods for low-light image enhancement (LLIE) typically rely on pixel-wise mapping learned from paired data. However, these methods often overlook the importance of considering degradation representations, which can lead to sub-optimal outcomes. In this paper, we address this limitation by proposing a degradation-aware learning scheme for LLIE using diffusion models, which effectively integrates degradation and image priors into the diffusion process, resulting in improved image enhancement. Our proposed degradation-aware learning scheme is based on the understanding that degradation representations play a crucial role in accurately modeling and capturing the specific degradation patterns present in low-light images. To this end, First, a joint learning framework for both image generation and image enhancement is presented to learn the degradation representations. Second, to leverage the learned degradation representations, we develop a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module. This module takes into account both the color map and the latent degradation representations to guide the diffusion process. By incorporating these conditioning factors, the proposed LLDiffusion can effectively enhance low-light images, considering both the inherent degradation patterns and the desired color fidelity. Finally, we evaluate our proposed method on several well-known benchmark datasets, including synthetic and real-world unpaired datasets. Extensive experiments on public benchmarks demonstrate that our LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively. The source code and pre-trained models are available at https://github.com/TaoWangzj/LLDiffusion.

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2307.14648
  • repo_url: None
  • paper_authors: Xin Yuan, Linjie Li, Jianfeng Wang, Zhengyuan Yang, Kevin Lin, Zicheng Liu, Lijuan Wang
  • for: Studying denoising diffusion probabilistic models (DDPM) in wavelet space, instead of pixel space, for visual synthesis.
  • methods: A novel architecture, SFUNet, that captures correlations in both the spatial and frequency domains of the wavelet transform: the 2D convolutions and spatial-only attention layers of the standard denoising U-Net are supplemented with spatial frequency-aware convolution and attention modules (a DWT sketch follows the abstract).
  • results: With SFUNet, the model generates higher-quality images on CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church than its pixel-space counterpart.
    Abstract In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis. Considering the wavelet transform represents the image in spatial and frequency domains, we carefully design a novel architecture SFUNet to effectively capture the correlation for both domains. Specifically, in the standard denoising U-Net for pixel data, we supplement the 2D convolutions and spatial-only attention layers with our spatial frequency-aware convolution and attention modules to jointly model the complementary information from spatial and frequency domains in wavelet data. Our new architecture can be used as a drop-in replacement to the pixel-based network and is compatible with the vanilla DDPM training process. By explicitly modeling the wavelet signals, we find our model is able to generate images with higher quality on CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church datasets, than the pixel-based counterpart.

EqGAN: Feature Equalization Fusion for Few-shot Image Generation

  • paper_url: http://arxiv.org/abs/2307.14638
  • repo_url: None
  • paper_authors: Yingbo Zhou, Zhihao Yue, Yutong Ye, Pengyu Zhang, Xian Wei, Mingsong Chen
  • for: Improving the generation quality and diversity of few-shot image generation, where existing fusion strategies suffer from missing fine structure and texture information.
  • methods: A feature Equalization fusion Generative Adversarial Network (EqGAN) that disentangles encoded features into two separate branches to fuse structures and textures, equalizes the fused structure and texture semantics at different scales, and supplements the decoder with richer information via skip connections; a consistent equalization loss further aligns the semantics.
  • results: On three public datasets, EqGAN significantly improves generation performance (FID score by up to 32.7% and LPIPS score by up to 4.19%) and outperforms the state of the art in downstream classification accuracy (by up to 1.97%).
    Abstract Due to the absence of fine structure and texture information, existing fusion-based few-shot image generation methods suffer from unsatisfactory generation quality and diversity. To address this problem, we propose a novel feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot image generation. Unlike existing fusion strategies that rely on either deep features or local representations, we design two separate branches to fuse structures and textures by disentangling encoded features into shallow and deep contents. To refine image contents at all feature levels, we equalize the fused structure and texture semantics at different scales and supplement the decoder with richer information by skip connections. Since the fused structures and textures may be inconsistent with each other, we devise a consistent equalization loss between the equalized features and the intermediate output of the decoder to further align the semantics. Comprehensive experiments on three public datasets demonstrate that, EqGAN not only significantly improves generation performance with FID score (by up to 32.7%) and LPIPS score (by up to 4.19%), but also outperforms the state-of-the-arts in terms of accuracy (by up to 1.97%) for downstream classification tasks.

HTNet for micro-expression recognition

  • paper_url: http://arxiv.org/abs/2307.14637
  • repo_url: https://github.com/wangzhifengharrison/htnet
  • paper_authors: Zhifeng Wang, Kaihao Zhang, Wenhan Luo, Ramesh Sankaranarayana
  • for: Improving micro-expression recognition, where facial muscle movements are subtle and degrade the performance of current facial emotion recognition algorithms.
  • methods: A Hierarchical Transformer Network (HTNet) with two major components: a transformer layer that uses local self-attention to represent minor muscle movements in four facial areas (left lip, left eye, right eye, and right lip), and an aggregation layer that learns interactions between the eye and lip areas to extract local and global semantic facial features.
  • results: On four publicly available micro-expression datasets, the approach outperforms previous methods by a large margin. Code and models: https://github.com/wangzhifengharrison/HTNet
    Abstract Facial expression is related to facial muscle contractions and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages the local temporal features and an aggregation layer that extracts local and global semantical facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer is used to focus on representing local minor muscle movement with local self-attention in each area. The aggregation layer is used to learn the interactions between eye areas and lip areas. The experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: \url{https://github.com/wangzhifengharrison/HTNet}

360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking

  • paper_url: http://arxiv.org/abs/2307.14630
  • repo_url: https://github.com/huajianup/360vot
  • paper_authors: Huajian Huang, Yinzhe Xu, Yingshu Chen, Sai-Kit Yeung
  • for: Exploring 360° images for visual object tracking and characterizing the new challenges they introduce, such as large distortion and stitching artifacts.
  • methods: A general 360 tracking framework based on novel target localization representations, such as the bounding field-of-view (a conversion sketch follows the abstract), that lets typical trackers be adopted for omnidirectional tracking.
  • results: A new large-scale omnidirectional tracking benchmark, 360VOT, with 120 sequences and up to 113K high-resolution frames, 32 target categories, 4 types of unbiased ground truth, and new metrics tailored to 360° images; 20 state-of-the-art trackers are evaluated to establish a baseline.
    Abstract 360° images can provide an omnidirectional field of view which is important for stable and long-term scene perception. In this paper, we explore 360° images for visual object tracking and perceive new challenges caused by large distortion, stitching artifacts, and other unique attributes of 360° images. To alleviate these problems, we take advantage of novel representations of target localization, i.e., bounding field-of-view, and then introduce a general 360 tracking framework that can adopt typical trackers for omnidirectional tracking. More importantly, we propose a new large-scale omnidirectional tracking benchmark dataset, 360VOT, in order to facilitate future research. 360VOT contains 120 sequences with up to 113K high-resolution frames in equirectangular projection. The tracking targets cover 32 categories in diverse scenarios. Moreover, we provide 4 types of unbiased ground truth, including (rotated) bounding boxes and (rotated) bounding field-of-views, as well as new metrics tailored for 360° images which allow for the accurate evaluation of omnidirectional tracking performance. Finally, we extensively evaluated 20 state-of-the-art visual trackers and provided a new baseline for future comparisons. Homepage: https://360vot.hkustvgd.com

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

  • paper_url: http://arxiv.org/abs/2307.14624
  • repo_url: None
  • paper_authors: Chengrui Wei, Meng Yang, Lei He, Nanning Zheng
  • for: The paper is written for predicting absolute depth maps from single images in real (unseen) indoor scenes, which is an ill-posed problem due to the scale-ambiguous and focal-ambiguous issues.
  • methods: The paper proposes a focal-and-scale depth estimation model that combines a relative depth estimation network and an absolute depth estimation network, with multi-scale features generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales.
  • results: The paper reports that the proposed model improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation compared with five recent SOTAs, alleviates the deformation problem in 3D reconstruction, and maintains the accuracy of depth estimation on the original NYUDv2.
    Abstract It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes. We observe that it is essentially due to not only the scale-ambiguous problem but also the focal-ambiguous problem that decreases the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to well learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales/semantics. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, a new pipeline is developed to augment the diversity of focal lengths of public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. Our model considerably improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation compared with five recent SOTAs and well alleviates the deformation problem in 3D reconstruction. Notably, our model well maintains the accuracy of depth estimation on original NYUDv2.

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.14620
  • repo_url: https://github.com/facebookresearch/nerf-det
  • paper_authors: Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
  • for: Indoor 3D object detection with posed RGB images as input, improving detection by explicitly estimating 3D geometry.
  • methods: NeRF is used in an end-to-end manner to estimate scene geometry; sufficient geometry priors enhance the generalizability of the NeRF-MLP, and a shared MLP connects the detection and NeRF branches, letting NeRF adapt efficiently to detection and yield geometry-aware volumetric representations.
  • results: Outperforms state-of-the-art methods by 3.9 mAP on ScanNet and 3.1 mAP on ARKITScenes; thanks to the joint-training design, NeRF-Det generalizes to unseen scenes for object detection, view synthesis, and depth estimation without per-scene optimization. Code: https://github.com/facebookresearch/NeRF-Det.
    Abstract We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-arts by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det is able to generalize well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at \url{https://github.com/facebookresearch/NeRF-Det}.

Multiscale Dynamic Graph Representation for Biometric Recognition with Occlusions

  • paper_url: http://arxiv.org/abs/2307.14617
  • repo_url: https://github.com/renmin1991/dyamic-graph-representation
  • paper_authors: Min Ren, Yunlong Wang, Yuhao Zhu, Kunbo Zhang, Zhenan Sun
  • for: Overcoming occlusion, a common problem in biometric recognition in the wild.
  • methods: A unified framework integrating CNNs and graph models, multiscale dynamic graph representation (MS-DGR): deep features over subregions are recrafted into a feature graph, and dynamic graph matching with a multiscale strategy discards the nodes corresponding to occluded parts.
  • results: The method boosts recognition accuracy by a large margin over baseline methods in both natural and occlusion-simulated cases.
    Abstract Occlusion is a common problem with biometric recognition in the wild. The generalization ability of CNNs greatly decreases due to the adverse effects of various occlusions. To this end, we propose a novel unified framework integrating the merits of both CNNs and graph models to overcome occlusion problems in biometric recognition, called multiscale dynamic graph representation (MS-DGR). More specifically, a group of deep features reflected on certain subregions is recrafted into a feature graph (FG). Each node inside the FG is deemed to characterize a specific local region of the input sample, and the edges imply the co-occurrence of non-occluded regions. By analyzing the similarities of the node representations and measuring the topological structures stored in the adjacent matrix, the proposed framework leverages dynamic graph matching to judiciously discard the nodes corresponding to the occluded parts. The multiscale strategy is further incorporated to attain more diverse nodes representing regions of various sizes. Furthermore, the proposed framework exhibits a more illustrative and reasonable inference by showing the paired nodes. Extensive experiments demonstrate the superiority of the proposed framework, which boosts the accuracy in both natural and occlusion-simulated cases by a large margin compared with that of baseline methods.

GenCo: An Auxiliary Generator from Contrastive Learning for Enhanced Few-Shot Learning in Remote Sensing

  • paper_url: http://arxiv.org/abs/2307.14612
  • repo_url: None
  • paper_authors: Jing Wu, Naira Hovakimyan, Jennifer Hobbs
  • for: Improving few-shot learning performance for classification and semantic segmentation tasks in remote sensing and earth observation.
  • methods: A generator-based contrastive learning framework (GenCo) that pre-trains backbones while simultaneously exploring variants of feature samples; during fine-tuning, the auxiliary generator enriches limited labeled data samples in feature space.
  • results: On the Agriculture-Vision and EuroSAT datasets, the approach outperforms purely supervised training for both classification and semantic segmentation.
    Abstract Classifying and segmenting patterns from a limited number of examples is a significant challenge in remote sensing and earth observation due to the difficulty in acquiring accurately labeled data in large quantities. Previous studies have shown that meta-learning, which involves episodic training on query and support sets, is a promising approach. However, there has been little attention paid to direct fine-tuning techniques. This paper repurposes contrastive learning as a pre-training method for few-shot learning for classification and semantic segmentation tasks. Specifically, we introduce a generator-based contrastive learning framework (GenCo) that pre-trains backbones and simultaneously explores variants of feature samples. In fine-tuning, the auxiliary generator can be used to enrich limited labeled data samples in feature space. We demonstrate the effectiveness of our method in improving few-shot learning performance on two key remote sensing datasets: Agriculture-Vision and EuroSAT. Empirically, our approach outperforms purely supervised training on the nearly 95,000 images in Agriculture-Vision for both classification and semantic segmentation tasks. Similarly, the proposed few-shot method achieves better results on the land-cover classification task on EuroSAT compared to the results obtained from fully supervised model training on the dataset.

TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

  • paper_url: http://arxiv.org/abs/2307.14611
  • repo_url: None
  • paper_authors: Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh
  • for: Improving generalization on scarce data, even under highly skewed class distributions.
  • methods: A text-driven manifold augmentation method that semantically enriches the visual feature space with intra-class perturbations by exploiting easy-to-understand, visually mimetic words (attributes), bridging the text representation and the target visual feature space with an efficient vector augmentation (an approximate sketch follows the abstract).
  • results: Experiments show TextManiA is powerful for scarce samples with class imbalance as well as even distributions, and is compatible with label mix-based approaches on evenly distributed scarce data.
    Abstract Recent label mix-based augmentation methods have shown their effectiveness in generalization despite their simplicity, and their favorable effects are often attributed to semantic-level augmentation. However, we found that they are vulnerable to highly skewed class distribution, because scarce data classes are rarely sampled for inter-class perturbation. We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of data distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words, i.e., attributes. To this end, we bridge between the text representation and a target visual feature space, and propose an efficient vector augmentation. To empirically support the validity of our design, we devise two visualization-based analyses and show the plausibility of the bridge between two different modality spaces. Our experiments demonstrate that TextManiA is powerful in scarce samples with class imbalance as well as even distribution. We also show compatibility with the label mix-based approaches in evenly distributed scarce data.

A Weakly Supervised Segmentation Network Embedding Cross-scale Attention Guidance and Noise-sensitive Constraint for Detecting Tertiary Lymphoid Structures of Pancreatic Tumors

  • paper_url: http://arxiv.org/abs/2307.14603
  • repo_url: None
  • paper_authors: Bingxue Wang, Liwen Zou, Jun Chen, Yingying Cao, Zhenghua Cai, Yudong Qiu, Liang Mao, Zhongqiu Wang, Jingya Chen, Luying Gui, Xiaoping Yang
  • for: Detecting tertiary lymphoid structures (TLSs) on pancreatic pathological images, an important prognostic indicator for the diagnosis and treatment of patients with pancreatic tumors.
  • methods: A weakly supervised, few-shot segmentation approach: lymphocyte density maps are obtained by combining a pretrained nuclei-segmentation model with a domain adversarial network for lymphocyte nuclei recognition; a cross-scale attention guidance mechanism jointly learns coarse-scale features from the original histopathology images and fine-scale features from the designed lymphocyte density attention; and a noise-sensitive constraint, implemented as an embedding signed distance function loss, reduces tiny prediction errors during training.
  • results: On two collected datasets, the method significantly outperforms state-of-the-art segmentation-based deep learning algorithms in TLS detection accuracy; it is also applied to study the relationship between TLS density and peripancreatic vascular invasion, yielding clinically relevant statistics.
    Abstract The presence of tertiary lymphoid structures (TLSs) on pancreatic pathological images is an important prognostic indicator of pancreatic tumors. Therefore, TLSs detection on pancreatic pathological images plays a crucial role in diagnosis and treatment for patients with pancreatic tumors. However, fully supervised detection algorithms based on deep learning usually require a large number of manual annotations, which is time-consuming and labor-intensive. In this paper, we aim to detect the TLSs in a manner of few-shot learning by proposing a weakly supervised segmentation network. We firstly obtain the lymphocyte density maps by combining a pretrained model for nuclei segmentation and a domain adversarial network for lymphocyte nuclei recognition. Then, we establish a cross-scale attention guidance mechanism by jointly learning the coarse-scale features from the original histopathology images and fine-scale features from our designed lymphocyte density attention. A noise-sensitive constraint is introduced by an embedding signed distance function loss in the training procedure to reduce tiny prediction errors. Experimental results on two collected datasets demonstrate that our proposed method significantly outperforms the state-of-the-art segmentation-based algorithms in terms of TLSs detection accuracy. Additionally, we apply our method to study the congruent relationship between the density of TLSs and peripancreatic vascular invasion and obtain some clinically statistical results.

FakeTracer: Proactively Defending Against Face-swap DeepFakes via Implanting Traces in Training

  • paper_url: http://arxiv.org/abs/2307.14593
  • repo_url: None
  • paper_authors: Pu Sun, Honggang Qi, Yuezun Li, Siwei Lyu
  • for: Proactively defending against the misuse of face-swap DeepFakes and protecting personal privacy.
  • methods: Implanting traces into the training process so that the face-swap DeepFake model learns to generate faces that carry designed, detectable features; two types of traces are added to training faces, a sustainable trace (STrace) and an erasable trace (ETrace).
  • results: Extensive experiments on the Celeb-DF dataset, compared with recent passive and proactive defenses, show the method effectively exposes face-swap DeepFakes.
    Abstract Face-swap DeepFake is an emerging AI-based face forgery technique that can replace the original face in a video with a generated face of the target identity while retaining consistent facial attributes such as expression and orientation. Due to the high privacy of faces, the misuse of this technique can raise severe social concerns, drawing tremendous attention to defend against DeepFakes recently. In this paper, we describe a new proactive defense method called FakeTracer to expose face-swap DeepFakes via implanting traces in training. Compared to general face-synthesis DeepFake, the face-swap DeepFake is more complex as it involves identity change, is subjected to the encoding-decoding process, and is trained unsupervised, increasing the difficulty of implanting traces into the training phase. To effectively defend against face-swap DeepFake, we design two types of traces, sustainable trace (STrace) and erasable trace (ETrace), to be added to training faces. During the training, these manipulated faces affect the learning of the face-swap DeepFake model, enabling it to generate faces that only contain sustainable traces. In light of these two traces, our method can effectively expose DeepFakes by identifying them. Extensive experiments are conducted on the Celeb-DF dataset, compared with recent passive and proactive defense methods, and are studied thoroughly regarding various factors, corroborating the efficacy of our method on defending against face-swap DeepFake.

MCPA: Multi-scale Cross Perceptron Attention Network for 2D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.14588
  • repo_url: https://github.com/simonustc/mcpa-for-2d-medical-image-segmentation
  • paper_authors: Liang Xu, Mingxiao Chen, Yi Cheng, Pengfei Shao, Shuwei Shen, Peng Yao, Ronald X. Xu
  • for: A CNN-based dual-branch model that improves performance on 2D medical image segmentation tasks.
  • methods: Enhances the UNet architecture with Transformer-style modules to better capture global dependencies in medical images; Multi-scale Cross Perceptron modules fuse features across scales, a Global Perceptron models global dependencies, and a progressive dual-branch structure gradually shifts the training focus from large-scale structural features to pixel-level features.
  • results: Achieves state-of-the-art performance on several publicly available medical image datasets. Code: https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation.
    Abstract The UNet architecture, based on Convolutional Neural Networks (CNN), has demonstrated its remarkable performance in medical image analysis. However, it faces challenges in capturing long-range dependencies due to the limited receptive fields and inherent bias of convolutional operations. Recently, numerous transformer-based techniques have been incorporated into the UNet architecture to overcome this limitation by effectively capturing global feature correlations. However, the integration of the Transformer modules may result in the loss of local contextual information during the global feature fusion process. To overcome these challenges, we propose a 2D medical image segmentation model called Multi-scale Cross Perceptron Attention Network (MCPA). The MCPA consists of three main components: an encoder, a decoder, and a Cross Perceptron. The Cross Perceptron first captures the local correlations using multiple Multi-scale Cross Perceptron modules, facilitating the fusion of features across scales. The resulting multi-scale feature vectors are then spatially unfolded, concatenated, and fed through a Global Perceptron module to model global dependencies. Furthermore, we introduce a Progressive Dual-branch Structure to address the semantic segmentation of the image involving finer tissue structures. This structure gradually shifts the segmentation focus of MCPA network training from large-scale structural features to more sophisticated pixel-level features. We evaluate our proposed MCPA model on several publicly available medical image datasets from different tasks and devices, including the open large-scale dataset of CT (Synapse), MRI (ACDC), fundus camera (DRIVE, CHASE_DB1, HRF), and OCTA (ROSE). The experimental results show that our MCPA model achieves state-of-the-art performance. The code is available at https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation.
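A rough PyTorch sketch of the token-fusion idea described above: multi-scale feature maps are flattened into tokens and self-attention models dependencies across all of them. Module name, dimensions, and the single-attention design are assumptions; the paper's Cross Perceptron and Global Perceptron are more elaborate.

```python
import torch
import torch.nn as nn

class GlobalPerceptronSketch(nn.Module):
    """Flatten multi-scale feature maps into tokens, concatenate, and model
    global dependencies with self-attention (a simplification of MCPA)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, scales):
        # scales: list of (B, C, H_i, W_i) maps sharing the channel dim C
        tokens = torch.cat([s.flatten(2).transpose(1, 2) for s in scales], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)       # (B, sum_i H_i*W_i, C)

feats = [torch.randn(2, 64, 32, 32), torch.randn(2, 64, 16, 16)]
fused = GlobalPerceptronSketch()(feats)
```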

Neural Representation-Based Method for Metal-induced Artifact Reduction in Dental CBCT Imaging

  • paper_url: http://arxiv.org/abs/2307.14579
  • repo_url: None
  • paper_authors: Hyoung Suk Park, Kiwan Jeon, Jin Keun Seo
  • for: 这个研究旨在提出一种新的 dental cone-beam computed tomography (CBCT) 重建方法,以减少常见的金属引起的artefacts,特别是在存在多种金属设备的情况下。
  • methods: 该研究提议使用 implicit neural network 生成两个不同的 tomographic 图像,其中一个表示特定能量层的灰度分布,另一个表示各色 X-ray 束在不同能量层的非线性强化因素。与传统的 CT 重建技术不同,该方法仅基于 Beer–Lambert 定律,从而有效避免在传统方法中通常实现的背投过程中产生金属引起的artefacts。
  • results: 广泛的实验评估表明,提议的方法可以有效减少金属引起的artefacts,并提供高质量的图像重建,从而强调第二个图像在捕捉非线性强化因素方面的重要性。
    Abstract This study introduces a novel reconstruction method for dental cone-beam computed tomography (CBCT), focusing on effectively reducing metal-induced artifacts commonly encountered in the presence of prevalent metallic implants. Despite significant progress in metal artifact reduction techniques, challenges persist owing to the intricate physical interactions between polychromatic X-ray beams and metal objects, which are further compounded by the additional effects associated with metal-tooth interactions and factors specific to the dental CBCT data environment. To overcome these limitations, we propose an implicit neural network that generates two distinct and informative tomographic images. One image represents the monochromatic attenuation distribution at a specific energy level, whereas the other captures the nonlinear beam-hardening factor resulting from the polychromatic nature of X-ray beams. In contrast to existing CT reconstruction techniques, the proposed method relies exclusively on the Beer--Lambert law, effectively preventing the generation of metal-induced artifacts during the backprojection process commonly implemented in conventional methods. Extensive experimental evaluations demonstrate that the proposed method effectively reduces metal artifacts while providing high-quality image reconstructions, thus emphasizing the significance of the second image in capturing the nonlinear beam-hardening factor.
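The role of the second image is easier to see with a discretized polychromatic Beer–Lambert model along one ray; the spectrum and attenuation numbers below are illustrative, not calibrated dental CBCT values.

```python
import numpy as np

# Discretized polychromatic Beer-Lambert measurement along one ray:
#   I = sum_E S(E) * exp(-sum_x mu(x, E) * dx)
energies = np.array([40.0, 60.0, 80.0, 100.0])   # keV bins (illustrative)
spectrum = np.array([0.2, 0.4, 0.3, 0.1])        # S(E), normalized to sum to 1
mu = np.array([[0.5, 0.35, 0.25, 0.2],           # a tissue voxel, mu(E)
               [8.0, 4.0, 2.5, 1.8]])            # a metal voxel, mu(E)
dx = 0.1                                         # voxel length (cm)

path_integral = mu.sum(axis=0) * dx              # line integral per energy bin
I = np.sum(spectrum * np.exp(-path_integral))    # measured polychromatic intensity

# A single-energy (monochromatic) model at 60 keV underestimates attenuation;
# the ratio below is the nonlinear beam-hardening factor the second image
# is meant to capture.
mono_integral = mu[:, 1].sum() * dx
beam_hardening = -np.log(I) / mono_integral
```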

GADER: GAit DEtection and Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2307.14578
  • repo_url: None
  • paper_authors: Yuxiang Guo, Cheng Peng, Ram Prabhakar, Chun Pong Lau, Rama Chellappa
  • for: Human authentication via gait, targeting recognition and verification in unconstrained outdoor scenes
  • methods: Uses a Double Helical Signature to detect fragments of human movement, and learns representations by distilling from an auxiliary RGB recognition model; at inference only the silhouette modality is used
  • results: Extensive experiments on indoor and outdoor datasets achieve state-of-the-art performance, including a 20.6% improvement on unconstrained, long-distance scenes
    Abstract Gait recognition holds the promise of robustly identifying subjects based on their walking patterns instead of color information. While previous approaches have performed well for curated indoor scenes, they have significantly impeded applicability in unconstrained situations, e.g. outdoor, long distance scenes. We propose an end-to-end GAit DEtection and Recognition (GADER) algorithm for human authentication in challenging outdoor scenarios. Specifically, GADER leverages a Double Helical Signature to detect the fragment of human movement and incorporates a novel gait recognition method, which learns representations by distilling from an auxiliary RGB recognition model. At inference time, GADER only uses the silhouette modality but benefits from a more robust representation. Extensive experiments on indoor and outdoor datasets demonstrate that the proposed method outperforms the State-of-The-Arts for gait recognition and verification, with a significant 20.6% improvement on unconstrained, long distance scenes.

Robust Detection, Association, and Localization of Vehicle Lights: A Context-Based Cascaded CNN Approach and Evaluations

  • paper_url: http://arxiv.org/abs/2307.14571
  • repo_url: None
  • paper_authors: Akshay Gopalkrishnan, Ross Greer, Maitrayee Keskar, Mohan Trivedi
  • for: Detecting and localizing vehicle lights to support downstream safe autonomous driving tasks, such as predicting from light state whether a vehicle is changing lanes or turning.
  • methods: A context-based cascaded CNN that, given an upstream vehicle detection and an approximate light center, predicts four approximate corners for each vehicle light, aided by data augmentation and contextual preprocessing designed to reduce surrounding-vehicle confusion.
  • results: Experiments on the LISA Lights Dataset show an average distance error of 4.77 pixels from the ground-truth corner, about 16.33% of the size of the vehicle light on average.
    Abstract Vehicle light detection, association, and localization are required for important downstream safe autonomous driving tasks, such as predicting a vehicle's light state to determine if the vehicle is making a lane change or turning. Currently, many vehicle light detectors use single-stage detectors which predict bounding boxes to identify a vehicle light, in a manner decoupled from vehicle instances. In this paper, we present a method for detecting a vehicle light given an upstream vehicle detection and approximation of a visible light's center. Our method predicts four approximate corners associated with each vehicle light. We experiment with CNN architectures, data augmentation, and contextual preprocessing methods designed to reduce surrounding-vehicle confusion. We achieve an average distance error from the ground truth corner of 4.77 pixels, about 16.33% of the size of the vehicle light on average. We train and evaluate our model on the LISA Lights Dataset, allowing us to thoroughly evaluate our vehicle light corner detection model on a large variety of vehicle light shapes and lighting conditions. We propose that this model can be integrated into a pipeline with vehicle detection and vehicle light center detection to make a fully-formed vehicle light detection network, valuable to identifying trajectory-informative signals in driving scenes.
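A sketch of the evaluation quoted above: mean Euclidean corner error in pixels, plus the same error relative to the light's size. The size is taken here as the ground-truth bounding-box diagonal, which is an assumption; the paper may define it differently.

```python
import numpy as np

def corner_error(pred, gt):
    """pred, gt: (..., 4, 2) arrays of four corner coordinates per light.
    Returns mean pixel error and mean error relative to the light's size."""
    dists = np.linalg.norm(pred - gt, axis=-1)                    # (..., 4)
    light_size = np.linalg.norm(gt.max(axis=-2) - gt.min(axis=-2), axis=-1)
    return dists.mean(), (dists.mean(axis=-1) / light_size).mean()

pred = np.array([[[10.0, 12.0], [40.0, 11.0], [41.0, 30.0], [9.0, 31.0]]])
gt   = np.array([[[ 8.0, 10.0], [42.0, 10.0], [42.0, 32.0], [8.0, 32.0]]])
mean_px, relative = corner_error(pred, gt)
```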

Physically Plausible 3D Human-Scene Reconstruction from Monocular RGB Image using an Adversarial Learning Approach

  • paper_url: http://arxiv.org/abs/2307.14570
  • repo_url: None
  • paper_authors: Sandika Biswas, Kejie Li, Biplab Banerjee, Subhasis Chaudhuri, Hamid Rezatofighi
  • for: Holistic 3D human-scene reconstruction from a single monocular RGB image that is physically plausible.
  • methods: A graph-based holistic scene representation with an encoded physical representation; the model is trained adversarially to learn feasible human-object and object-object alignments from the training data itself, without explicitly defining physical laws and constraints.
  • results: Achieves 3D reconstruction quality comparable to optimization-based methods without any inference-time optimization, making it better suited to robotic applications such as navigation.
    Abstract Holistic 3D human-scene reconstruction is a crucial and emerging research area in robot perception. A key challenge in holistic 3D human-scene reconstruction is to generate a physically plausible 3D scene from a single monocular RGB image. The existing research mainly proposes optimization-based approaches for reconstructing the scene from a sequence of RGB frames with explicitly defined physical laws and constraints between different scene elements (humans and objects). However, it is hard to explicitly define and model every physical law in every scenario. This paper proposes using an implicit feature representation of the scene elements to distinguish a physically plausible alignment of humans and objects from an implausible one. We propose using a graph-based holistic representation with an encoded physical representation of the scene to analyze the human-object and object-object interactions within the scene. Using this graphical representation, we adversarially train our model to learn the feasible alignments of the scene elements from the training data itself without explicitly defining the laws and constraints between them. Unlike the existing inference-time optimization-based approaches, we use this adversarially trained model to produce a per-frame 3D reconstruction of the scene that abides by the physical laws and constraints. Our learning-based method achieves comparable 3D reconstruction quality to existing optimization-based holistic human-scene reconstruction methods and does not need inference time optimization. This makes it better suited when compared to existing methods, for potential use in robotic applications, such as robot navigation, etc.

Towards multi-modal anatomical landmark detection for ultrasound-guided brain tumor resection with contrastive learning

  • paper_url: http://arxiv.org/abs/2307.14523
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: A novel contrastive learning framework for detecting corresponding anatomical landmarks between MRI and intra-operative ultrasound (US) scans in neurosurgery, to support registration quality assessment.
  • methods: Two convolutional neural networks are trained jointly to encode MRI and US image features, so that the US patch containing a landmark can be matched to its counterpart in the MRI.
  • results: On the public RESECT database, the method achieves a mean landmark detection accuracy of 5.88±4.79 mm, versus 18.78±4.77 mm with SIFT features.
    Abstract Homologous anatomical landmarks between medical scans are instrumental in quantitative assessment of image registration quality in various clinical applications, such as MRI-ultrasound registration for tissue shift correction in ultrasound-guided brain tumor resection. While manually identified landmark pairs between MRI and ultrasound (US) have greatly facilitated the validation of different registration algorithms for the task, the procedure requires significant expertise, labor, and time, and can be prone to inter- and intra-rater inconsistency. So far, many traditional and machine learning approaches have been presented for anatomical landmark detection, but they primarily focus on mono-modal applications. Unfortunately, despite the clinical needs, inter-modal/contrast landmark detection has very rarely been attempted. Therefore, we propose a novel contrastive learning framework to detect corresponding landmarks between MRI and intra-operative US scans in neurosurgery. Specifically, two convolutional neural networks were trained jointly to encode image features in MRI and US scans to help match the US image patch that contain the corresponding landmarks in the MRI. We developed and validated the technique using the public RESECT database. With a mean landmark detection accuracy of 5.88+-4.79 mm against 18.78+-4.77 mm with SIFT features, the proposed method offers promising results for MRI-US landmark detection in neurosurgical applications for the first time.
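One plausible instantiation of the joint training is an InfoNCE-style matching loss between MRI and US patch embeddings, where row i of each batch embeds the two views of the same landmark and other rows act as negatives; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(mri_emb, us_emb, tau=0.07):
    """Symmetric InfoNCE: pull matched MRI/US patch embeddings together and
    push apart mismatched pairs within the batch."""
    mri = F.normalize(mri_emb, dim=1)
    us = F.normalize(us_emb, dim=1)
    logits = mri @ us.t() / tau                      # (N, N) similarities
    targets = torch.arange(mri.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```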

FocalErrorNet: Uncertainty-aware focal modulation network for inter-modal registration error estimation in ultrasound-guided neurosurgery

  • paper_url: http://arxiv.org/abs/2307.14520
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: A deep learning technique for assessing the accuracy of MRI-iUS registration results during brain tumor surgery.
  • methods: A 3D focal-modulation deep learning method with uncertainty estimation to evaluate MRI-iUS registration errors.
  • results: Developed and validated on the public RESECT clinical database, the algorithm estimates MRI-iUS registration errors with an error of 0.59±0.57 mm.
    Abstract In brain tumor resection, accurate removal of cancerous tissues while preserving eloquent regions is crucial to the safety and outcomes of the treatment. However, intra-operative tissue deformation (called brain shift) can move the surgical target and render the pre-surgical plan invalid. Intra-operative ultrasound (iUS) has been adopted to provide real-time images to track brain shift, and inter-modal (i.e., MRI-iUS) registration is often required to update the pre-surgical plan. Quality control for the registration results during surgery is important to avoid adverse outcomes, but manual verification faces great challenges due to difficult 3D visualization and the low contrast of iUS. Automatic algorithms are urgently needed to address this issue, but the problem was rarely attempted. Therefore, we propose a novel deep learning technique based on 3D focal modulation in conjunction with uncertainty estimation to accurately assess MRI-iUS registration errors for brain tumor surgery. Developed and validated with the public RESECT clinical database, the resulting algorithm can achieve an estimation error of 0.59+-0.57 mm.
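A common way to realize the uncertainty estimation described above is a Gaussian negative log-likelihood head that predicts a mean error and a log-variance. The sketch below uses that generic recipe; FocalErrorNet's actual 3D focal-modulation architecture and loss are not reproduced here.

```python
import torch
import torch.nn as nn

class ErrorRegressor(nn.Module):
    """Uncertainty-aware regression head: predicts (mu, log_var) for the
    registration error, trained with the Gaussian NLL."""
    def __init__(self, in_dim=256):
        super().__init__()
        self.head = nn.Linear(in_dim, 2)

    def forward(self, feats):
        mu, log_var = self.head(feats).unbind(dim=-1)
        return mu, log_var

def gaussian_nll(mu, log_var, target):
    return (0.5 * log_var + 0.5 * (target - mu) ** 2 / log_var.exp()).mean()

model = ErrorRegressor()
mu, log_var = model(torch.randn(4, 256))
loss = gaussian_nll(mu, log_var, torch.rand(4))   # target error in mm
```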

SuperInpaint: Learning Detail-Enhanced Attentional Implicit Representation for Super-resolutional Image Inpainting

  • paper_url: http://arxiv.org/abs/2307.14489
  • repo_url: None
  • paper_authors: Canyu Zhang, Qing Guo, Xiaoguang Li, Renjie Wan, Hongkai Yu, Ivor Tsang, Song Wang
  • for: This paper aims to address the challenging task of SuperInpaint, which involves reconstructing missing regions in low-resolution images and generating completed images with higher resolutions.
  • methods: The proposed method, called DEAR, uses a deep convolutional network to extract the latent embedding of an input image, and then enhances the high-frequency components of the latent embedding via an adaptive high-pass filter. The method also uses an unmask-attentional module to suppress embeddings from ineffective masked pixels, and an implicit representation to generate the color of the specified pixel.
  • results: The proposed method outperforms all existing methods by a significant margin on four widely used metrics, demonstrating its effectiveness in addressing the SuperInpaint task.
    Abstract In this work, we introduce a challenging image restoration task, referred to as SuperInpaint, which aims to reconstruct missing regions in low-resolution images and generate completed images with arbitrarily higher resolutions. We have found that this task cannot be effectively addressed by stacking state-of-the-art super-resolution and image inpainting methods as they amplify each other's flaws, leading to noticeable artifacts. To overcome these limitations, we propose the detail-enhanced attentional implicit representation (DEAR) that can achieve SuperInpaint with a single model, resulting in high-quality completed images with arbitrary resolutions. Specifically, we use a deep convolutional network to extract the latent embedding of an input image and then enhance the high-frequency components of the latent embedding via an adaptive high-pass filter. This leads to detail-enhanced semantic embedding. We further feed the semantic embedding into an unmask-attentional module that suppresses embeddings from ineffective masked pixels. Additionally, we extract a pixel-wise importance map that indicates which pixels should be used for image reconstruction. Given the coordinates of a pixel we want to reconstruct, we first collect its neighboring pixels in the input image and extract their detail-enhanced semantic embeddings, unmask-attentional semantic embeddings, importance values, and spatial distances to the desired pixel. Then, we feed all the above terms into an implicit representation and generate the color of the specified pixel. To evaluate our method, we extend three existing datasets for this new task and build 18 meaningful baselines using SOTA inpainting and super-resolution methods. Extensive experimental results demonstrate that our method outperforms all existing methods by a significant margin on four widely used metrics.
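The final implicit-representation step can be sketched as an MLP over a query pixel's neighbor embeddings, importance values, and spatial offsets. The feature construction below is simplified relative to DEAR, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ImplicitColorSketch(nn.Module):
    """Predict the RGB color of a query pixel from its k neighbors'
    (detail-enhanced) embeddings, importance scores, and spatial offsets."""
    def __init__(self, feat_dim=64, k=9):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k * (feat_dim + 1 + 2), 256), nn.ReLU(),
            nn.Linear(256, 3))

    def forward(self, neigh_feats, importance, offsets):
        # neigh_feats: (B, k, feat_dim); importance: (B, k, 1); offsets: (B, k, 2)
        x = torch.cat([neigh_feats, importance, offsets], dim=-1)
        return torch.sigmoid(self.mlp(x.flatten(1)))   # (B, 3) in [0, 1]

rgb = ImplicitColorSketch()(torch.randn(2, 9, 64),
                            torch.rand(2, 9, 1),
                            torch.randn(2, 9, 2))
```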

Role of Image Acquisition and Patient Phenotype Variations in Automatic Segmentation Model Generalization

  • paper_url: http://arxiv.org/abs/2307.14482
  • repo_url: None
  • paper_authors: Timothy L. Kline, Sumana Ramanathan, Harrison C. Gottlich, Panagiotis Korfiatis, Adriana V. Gregory
  • for: This study aimed to evaluate the out-of-domain performance and generalization capabilities of automated medical image segmentation models.
  • methods: The study used datasets from non-contrast and contrast-enhanced abdominal CT scans of healthy patients and those with polycystic kidney disease (PKD) to train and validate the models.
  • results: Models trained on a diverse range of data performed no worse than models trained exclusively on in-domain data when tested on in-domain data, and broader training examples significantly enhanced model generalization and out-of-domain performance.
    Abstract Purpose: This study evaluated the out-of-domain performance and generalization capabilities of automated medical image segmentation models, with a particular focus on adaptation to new image acquisitions and disease type. Materials: Datasets from both non-contrast and contrast-enhanced abdominal CT scans of healthy patients and those with polycystic kidney disease (PKD) were used. A total of 400 images (100 non-contrast controls, 100 contrast controls, 100 non-contrast PKD, 100 contrast PKD) were utilized for training/validation of models to segment kidneys, livers, and spleens, and the final models were then tested on 100 non-contrast CT images of patients affected by PKD. Performance was evaluated using Dice, Jaccard, TPR, and Precision. Results: Models trained on a diverse range of data showed no worse performance than models trained exclusively on in-domain data when tested on in-domain data. For instance, the Dice similarity of the model trained on 25% from each dataset was found to be non-inferior to the model trained purely on in-domain data. Conclusions: The results indicate that broader training examples significantly enhances model generalization and out-of-domain performance, thereby improving automated segmentation tools' applicability in clinical settings. The study's findings provide a roadmap for future research to adopt a data-centric approach in medical image AI model development.
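For reference, the four reported overlap metrics for binary segmentation masks:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Dice, Jaccard, TPR (sensitivity), and precision for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum() + 1e-8)
    jaccard = inter / (np.logical_or(pred, gt).sum() + 1e-8)
    tpr = inter / (gt.sum() + 1e-8)
    precision = inter / (pred.sum() + 1e-8)
    return dice, jaccard, tpr, precision
```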

MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.14460
  • repo_url: https://github.com/isl-org/MiDaS
  • paper_authors: Reiner Birkl, Diana Wofk, Matthias Müller
  • for: Improving the quality and runtime of monocular relative depth estimation by releasing a model zoo of new backbones.
  • methods: Explores different encoder backbones — BEiT, Swin, SwinV2, Next-ViT, and LeViT — offering a range of performance-runtime tradeoffs.
  • results: The best model improves depth estimation quality by 28%, while efficient variants enable downstream tasks requiring high frame rates. Code is available at https://github.com/isl-org/MiDaS.
    Abstract We release MiDaS v3.1 for monocular depth estimation, offering a variety of new models based on different encoder backbones. This release is motivated by the success of transformers in computer vision, with a large variety of pretrained vision transformers now available. We explore how using the most promising vision transformers as image encoders impacts depth estimation quality and runtime of the MiDaS architecture. Our investigation also includes recent convolutional approaches that achieve comparable quality to vision transformers in image classification tasks. While the previous release MiDaS v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models offer different performance-runtime tradeoffs. The best model improves the depth estimation quality by 28% while efficient models enable downstream tasks requiring high frame rates. We also describe the general process for integrating new backbones. A video summarizing the work can be found at https://youtu.be/UjaeNNFf9sE and the code is available at https://github.com/isl-org/MiDaS.
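A usage sketch based on the torch.hub interface documented in the MiDaS repository; the model name below is one of the v3.1 backbones, but the available names may change, so consult https://github.com/isl-org/MiDaS for the current list.

```python
import cv2
import torch

model_type = "DPT_BEiT_L_512"   # a MiDaS v3.1 backbone; "MiDaS_small" trades accuracy for speed
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()

transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.dpt_transform            # matches DPT-style models

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
batch = transform(img)                          # (1, 3, H', W')

with torch.no_grad():
    prediction = midas(batch)
    depth = torch.nn.functional.interpolate(    # upsample to the input resolution
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()
```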

Self-supervised Few-shot Learning for Semantic Segmentation: An Annotation-free Approach

  • paper_url: http://arxiv.org/abs/2307.14446
  • repo_url: https://github.com/mindflow-institue/annotation_free_fewshot
  • paper_authors: Sanaz Karimijafarbigloo, Reza Azad, Dorit Merhof
  • for: Addressing few-shot semantic segmentation (FSS) in medical image analysis, where annotated data is scarce, with a method that requires no annotated semantic classes.
  • methods: Reframes image decomposition as a graph partitioning task: eigenvectors of the Laplacian of a self-supervised feature affinity matrix estimate the distribution of objects in the support images; a self-supervised FSS framework then adaptively estimates the query mask from these eigenvectors without manual annotation, and a multi-scale large kernel attention module selectively emphasizes relevant features and details for better object delineation.
  • results: Evaluations on both natural and medical image datasets demonstrate the method's efficiency and effectiveness; the approach is general and model-agnostic, integrating seamlessly with various deep architectures.
    Abstract Few-shot semantic segmentation (FSS) offers immense potential in the field of medical image analysis, enabling accurate object segmentation with limited training data. However, existing FSS techniques heavily rely on annotated semantic classes, rendering them unsuitable for medical images due to the scarcity of annotations. To address this challenge, multiple contributions are proposed: First, inspired by spectral decomposition methods, the problem of image decomposition is reframed as a graph partitioning task. The eigenvectors of the Laplacian matrix, derived from the feature affinity matrix of self-supervised networks, are analyzed to estimate the distribution of the objects of interest from the support images. Secondly, we propose a novel self-supervised FSS framework that does not rely on any annotation. Instead, it adaptively estimates the query mask by leveraging the eigenvectors obtained from the support images. This approach eliminates the need for manual annotation, making it particularly suitable for medical images with limited annotated data. Thirdly, to further enhance the decoding of the query image based on the information provided by the support image, we introduce a multi-scale large kernel attention module. By selectively emphasizing relevant features and details, this module improves the segmentation process and contributes to better object delineation. Evaluations on both natural and medical image datasets demonstrate the efficiency and effectiveness of our method. Moreover, the proposed approach is characterized by its generality and model-agnostic nature, allowing for seamless integration with various deep architectures. The code is publicly available at \href{https://github.com/mindflow-institue/annotation_free_fewshot}{\textcolor{magenta}{GitHub}.
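The graph-partitioning step can be sketched in a few lines: build a cosine affinity matrix over self-supervised patch features, form the graph Laplacian, and read off its low-frequency eigenvectors. This is a generic spectral recipe; the paper's normalization and thresholding choices may differ.

```python
import numpy as np

def object_eigenvectors(feats, k=3):
    """feats: (N, D) patch features from a self-supervised backbone.
    Returns the k smallest non-trivial eigenvectors of the graph Laplacian
    built from the feature affinity matrix."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    W = np.clip(f @ f.T, 0, None)     # cosine affinity, negatives clipped
    D = np.diag(W.sum(axis=1))
    L = D - W                         # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)    # ascending eigenvalues
    return vecs[:, 1:k + 1]           # skip the constant eigenvector

eigs = object_eigenvectors(np.random.randn(196, 384))
```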

Phenotype-preserving metric design for high-content image reconstruction by generative inpainting

  • paper_url: http://arxiv.org/abs/2307.14436
  • repo_url: None
  • paper_authors: Vaibhav Sharma, Artur Yakimovich
  • for: Evaluating generative inpainting for restoring imaging and sample-preparation artefacts in high-content microscopy images, without introducing undesired image manipulation.
  • methods: Fine-tunes state-of-the-art inpainting architectures such as DeepFill V2 and Edge Connect, which can faithfully restore high-content fluorescence microscopy images with relatively little data.
  • results: Restoration quality depends more on the size of the region to be restored than on its shape; a novel phenotype-preserving metric design strategy quantifies the size and count of restored biological phenotypes (e.g., cell nuclei) to penalize undesirable manipulation, and the design principles may generalize to other applications.
    Abstract In the past decades, automated high-content microscopy demonstrated its ability to deliver large quantities of image-based data powering the versatility of phenotypic drug screening and systems biology applications. However, as the sizes of image-based datasets grew, it became infeasible for humans to control, avoid and overcome the presence of imaging and sample preparation artefacts in the images. While novel techniques like machine learning and deep learning may address these shortcomings through generative image inpainting, when applied to sensitive research data this may come at the cost of undesired image manipulation. Undesired manipulation may be caused by phenomena such as neural hallucinations, to which some artificial neural networks are prone. To address this, here we evaluate the state-of-the-art inpainting methods for image restoration in a high-content fluorescence microscopy dataset of cultured cells with labelled nuclei. We show that architectures like DeepFill V2 and Edge Connect can faithfully restore microscopy images upon fine-tuning with relatively little data. Our results demonstrate that the area of the region to be restored is of higher importance than shape. Furthermore, to control for the quality of restoration, we propose a novel phenotype-preserving metric design strategy. In this strategy, the size and count of the restored biological phenotypes like cell nuclei are quantified to penalise undesirable manipulation. We argue that the design principles of our approach may also generalise to other applications.
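Our reading of the phenotype-preserving idea, as a sketch: segment nuclei before and after restoration and penalize changes in their count and total area. The exact quantities and weighting used in the paper are assumptions here.

```python
import numpy as np
from skimage.measure import label

def phenotype_penalty(mask_orig, mask_restored):
    """Compare segmented nuclei before/after restoration: large deviations in
    object count or total area suggest hallucinated or erased phenotypes."""
    n_orig = label(mask_orig).max()          # connected-component count
    n_rest = label(mask_restored).max()
    count_term = abs(n_orig - n_rest) / max(n_orig, 1)
    area_term = abs(int(mask_orig.sum()) - int(mask_restored.sum())) \
        / max(int(mask_orig.sum()), 1)
    return count_term + area_term            # 0 means phenotype preserved
```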

ProtoASNet: Dynamic Prototypes for Inherently Interpretable and Uncertainty-Aware Aortic Stenosis Classification in Echocardiography

  • paper_url: http://arxiv.org/abs/2307.14433
  • repo_url: https://github.com/hooman007/protoasnet
  • paper_authors: Hooman Vaseli, Ang Nan Gu, S. Neda Ahmadi Amiri, Michael Y. Tsang, Andrea Fung, Nima Kondori, Armin Saadat, Purang Abolmaesumi, Teresa S. M. Tsang
  • for: Reliable, trustworthy automatic detection of aortic stenosis (AS) severity, enabling timely treatment.
  • methods: A prototypical network that detects AS directly from B-mode echocardiography videos, making interpretable predictions based on similarity to learned spatio-temporal prototypes; an abstention loss estimates aleatoric uncertainty via prototypes that capture ambiguity and insufficient information in the data.
  • results: ProtoASNet outperforms state-of-the-art methods with 80.0% accuracy on a private dataset and 79.7% on the public TMED-2 dataset, while providing interpretability and an uncertainty measure for each prediction to aid clinical decision-making.
    Abstract Aortic stenosis (AS) is a common heart valve disease that requires accurate and timely diagnosis for appropriate treatment. Most current automatic AS severity detection methods rely on black-box models with a low level of trustworthiness, which hinders clinical adoption. To address this issue, we propose ProtoASNet, a prototypical network that directly detects AS from B-mode echocardiography videos, while making interpretable predictions based on the similarity between the input and learned spatio-temporal prototypes. This approach provides supporting evidence that is clinically relevant, as the prototypes typically highlight markers such as calcification and restricted movement of aortic valve leaflets. Moreover, ProtoASNet utilizes abstention loss to estimate aleatoric uncertainty by defining a set of prototypes that capture ambiguity and insufficient information in the observed data. This provides a reliable system that can detect and explain when it may fail. We evaluate ProtoASNet on a private dataset and the publicly available TMED-2 dataset, where it outperforms existing state-of-the-art methods with an accuracy of 80.0% and 79.7%, respectively. Furthermore, ProtoASNet provides interpretability and an uncertainty measure for each prediction, which can improve transparency and facilitate the interactive usage of deep networks to aid clinical decision-making. Our source code is available at: https://github.com/hooman007/ProtoASNet.
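A sketch of the prototype-based classification idea: class evidence is the summed similarity between a video embedding and that class's learned prototypes. The abstention/uncertainty prototypes and losses that ProtoASNet adds on top are omitted, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHeadSketch(nn.Module):
    """Compare a clip embedding against learned prototypes; per-class
    evidence is the summed cosine similarity to that class's prototypes."""
    def __init__(self, dim=256, protos_per_class=5, n_classes=3):
        super().__init__()
        self.protos = nn.Parameter(torch.randn(n_classes, protos_per_class, dim))

    def forward(self, z):                       # z: (B, dim) video embedding
        z = F.normalize(z, dim=-1)
        p = F.normalize(self.protos, dim=-1)    # (C, P, dim)
        sim = torch.einsum('bd,cpd->bcp', z, p)
        return sim.sum(dim=-1)                  # (B, C) class evidence

logits = PrototypeHeadSketch()(torch.randn(4, 256))
```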

Virtual Mirrors: Non-Line-of-Sight Imaging Beyond the Third Bounce

  • paper_url: http://arxiv.org/abs/2307.14341
  • repo_url: None
  • paper_authors: Diego Royo, Talha Sultan, Adolfo Muñoz, Khadijeh Masumnia-Bisheh, Eric Brandt, Diego Gutierrez, Andreas Velten, Julio Marco
  • for: Expanding the capabilities of non-line-of-sight (NLOS) imaging methods, which reconstruct scenes not directly visible to the observer.
  • methods: Observes that planar diffuse surfaces behave specularly at the wavelengths used in computational wave-based NLOS imaging ("virtual mirrors"), and exploits illumination beyond the third bounce, computationally building secondary apertures on other surfaces that observe the target with direct visibility.
  • results: The approach images single-corner objects at limited visibility angles, and images objects hidden behind two corners by imaging the space behind virtual mirrors, where the mirror image of such objects is formed.
    Abstract Non-line-of-sight (NLOS) imaging methods are capable of reconstructing complex scenes that are not visible to an observer using indirect illumination. However, they assume only third-bounce illumination, so they are currently limited to single-corner configurations, and present limited visibility when imaging surfaces at certain orientations. To reason about and tackle these limitations, we make the key observation that planar diffuse surfaces behave specularly at wavelengths used in the computational wave-based NLOS imaging domain. We call such surfaces virtual mirrors. We leverage this observation to expand the capabilities of NLOS imaging using illumination beyond the third bounce, addressing two problems: imaging single-corner objects at limited visibility angles, and imaging objects hidden behind two corners. To image objects at limited visibility angles, we first analyze the reflections of the known illuminated point on surfaces of the scene as an estimator of the position and orientation of objects with limited visibility. We then image those limited visibility objects by computationally building secondary apertures at other surfaces that observe the target object from a direct visibility perspective. Beyond single-corner NLOS imaging, we exploit the specular behavior of virtual mirrors to image objects hidden behind a second corner by imaging the space behind such virtual mirrors, where the mirror image of objects hidden around two corners is formed. No specular surfaces were involved in the making of this paper.

MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.14336
  • repo_url: None
  • paper_authors: Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, Fatih Porikli
  • for: Improving the accuracy of monocular video depth estimation.
  • methods: A memory-and-attention framework that turns any single-image depth network into a video depth model: a memory stores learned visual and displacement tokens from previous frames, self-attention models their spatio-temporal relations, and cross-attention aggregates them with the current frame's features before decoding depth.
  • results: Sets new state-of-the-art (SOTA) accuracy on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, with higher accuracy and lower latency than SOTA cost-volume-based video depth models.
    Abstract We propose MAMo, a novel memory and attention frame-work for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation networks into video depth estimation models, enabling them to take advantage of the temporal information to predict more accurate depth. In MAMo, we augment model with memory which aids the depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens of the previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt attention-based approach to process memory features where we first learn the spatio-temporal relation among the resultant visual and displacement memory tokens using self-attention module. Further, the output features of self-attention are aggregated with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency, when omparing to SOTA cost-volume-based video depth models.
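The memory-attention flow can be sketched with two standard attention layers: self-attention over stored tokens from past frames, then cross-attention that aggregates them with the current frame's features. The memory-update scheduling and token construction are omitted, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    """Self-attention over memory tokens, then cross-attention with the
    current frame's tokens (query = current frame)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_tokens, memory_tokens):
        mem, _ = self.self_attn(memory_tokens, memory_tokens, memory_tokens)
        fused, _ = self.cross_attn(cur_tokens, mem, mem)
        return cur_tokens + fused               # features for the depth decoder

out = MemoryAttentionSketch()(torch.randn(1, 196, 128), torch.randn(1, 392, 128))
```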

Visual Instruction Inversion: Image Editing via Visual Prompting

  • paper_url: http://arxiv.org/abs/2307.14331
  • repo_url: https://github.com/thaoshibe/visii
  • paper_authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee
  • for: Image editing via visual prompting, where a "before"/"after" example pair conveys the desired edit.
  • methods: Inverts the visual prompt into a text-based editing instruction, leveraging the pretrained editing capabilities of text-to-image diffusion models, so the same edit can be applied to new images.
  • results: With just one example pair, achieves results competitive with state-of-the-art text-conditioned image editing frameworks.
    Abstract Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given pairs of example that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.

US & MR Image-Fusion Based on Skin Co-Registration

  • paper_url: http://arxiv.org/abs/2307.14288
  • repo_url: None
  • paper_authors: Martina Paccini, Giacomo Paschina, Stefano De Beni, Giuseppe Patanè
  • for: Developing a portable image-fusion system that registers CT and MRI images with real-time US acquisition, enabling real-time tracking during interventions.
  • methods: Leverages a 3D camera sensor for skin co-registration, fusing CT/MRI images with real-time US.
  • results: A portable fusion system applicable to different anatomical districts.
    Abstract The study and development of innovative solutions for the advanced visualisation, representation and analysis of medical images offer different research directions. Current practice in medical imaging consists in combining real-time US with imaging modalities that allow internal anatomy acquisitions, such as CT, MRI, PET or similar. Application of image-fusion approaches can be found in tracking surgical tools and/or needles, in real-time during interventions. Thus, this work proposes a fusion imaging system for the registration of CT and MRI images with real-time US acquisition leveraging a 3D camera sensor. The main focus of the work is the portability of the system and its applicability to different anatomical districts.

Large-scale Fully-Unsupervised Re-Identification

  • paper_url: http://arxiv.org/abs/2307.14278
  • repo_url: None
  • paper_authors: Gabriel Bertocco, Fernanda Andaló, Terrance E. Boult, Anderson Rocha
  • for: This paper focuses on fully-unsupervised person and vehicle re-identification, with the goal of improving the robustness and efficiency of the re-identification process in large-scale scenarios.
  • methods: The proposed methodology consists of two strategies: local neighborhood sampling and a novel Re-Ranking technique. The first strategy reduces the dataset size in each iteration without violating neighborhood relationships, while the second strategy reduces the time and memory complexity of the Re-Ranking process. Additionally, the paper introduces a novel scheduling algorithm that adjusts the density parameter during training to leverage the diversity of samples and keep the learning robust to noisy labeling.
  • results: The proposed method outperforms state-of-the-art methods on well-known benchmarks and on the challenging large-scale Veri-Wild dataset, with a faster, memory-efficient Re-Ranking strategy and a large-scale, noise-robust, ensemble-based learning approach.
    Abstract Fully-unsupervised Person and Vehicle Re-Identification have received increasing attention due to their broad applicability in surveillance, forensics, event understanding, and smart cities, without requiring any manual annotation. However, most of the prior art has been evaluated in datasets that have just a couple thousand samples. Such small-data setups often allow the use of costly techniques in time and memory footprints, such as Re-Ranking, to improve clustering results. Moreover, some previous work even pre-selects the best clustering hyper-parameters for each dataset, which is unrealistic in a large-scale fully-unsupervised scenario. In this context, this work tackles a more realistic scenario and proposes two strategies to learn from large-scale unlabeled data. The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships. A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n. To avoid the pre-selection of specific hyper-parameter values for the clustering algorithm, we also present a novel scheduling algorithm that adjusts the density parameter during training, to leverage the diversity of samples and keep the learning robust to noisy labeling. Finally, due to the complementary knowledge learned by different models, we also introduce a co-training strategy that relies upon the permutation of predicted pseudo-labels, among the backbones, with no need for any hyper-parameters or weighting optimization. The proposed methodology outperforms the state-of-the-art methods in well-known benchmarks and in the challenging large-scale Veri-Wild dataset, with a faster and memory-efficient Re-Ranking strategy, and a large-scale, noisy-robust, and ensemble-based learning approach.

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

  • paper_url: http://arxiv.org/abs/2307.14277
  • repo_url: None
  • paper_authors: Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou
  • for: Video grounding, i.e., localizing moments in video from language, where vanilla contrastive learning fails due to semantic overlap between moments and sparse annotation.
  • methods: A semantically aligned and uniform grounding framework (G2L) via geodesic distance and game theory: geodesic distance quantifies semantic correlations among moments to guide cross-modal representation learning, and a semantic Shapley interaction based on geodesic-distance sampling learns fine-grained alignment among similar moments.
  • results: Experiments on three benchmarks demonstrate the effectiveness of the method.
    Abstract The recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) \emph{alignment} of features of similar samples, and (2) \emph{uniformity} of the induced distribution of the normalized features on the hypersphere. Due to two annoying issues in video grounding: (1) the co-existence of some visual entities in both ground truth and other moments, \ie semantic overlapping; (2) only a few moments in the video are annotated, \ie sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learned inconsistent video representations. Both characteristics lead to vanilla contrastive learning being unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments leveraging the geodesic distance that guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment in similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.

Deepfake Image Generation for Improved Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.14273
  • repo_url: None
  • paper_authors: Roa’a Al-Emaryeen, Sara Al-Nahhas, Fatima Himour, Waleed Mahafza, Omar Al-Kadi
  • for: Improving brain tumor segmentation, where training is hampered by limited data and corresponding labels.
  • methods: A Generative Adversarial Network performs image-to-image translation to enlarge the dataset, followed by segmentation with a U-Net-based convolutional neural network trained on the deepfake images.
  • results: Compared against ground truth on four publicly available datasets, the approach improves image segmentation quality metrics and could assist when training with limited data.
    Abstract As the world progresses in technology and health, awareness of disease by revealing asymptomatic signs improves. It is important to detect and treat tumors in early stage as it can be life-threatening. Computer-aided technologies are used to overcome lingering limitations facing disease diagnosis, while brain tumor segmentation remains a difficult process, especially when multi-modality data is involved. This is mainly attributed to ineffective training due to lack of data and corresponding labelling. This work investigates the feasibility of employing deep-fake image generation for effective brain tumor segmentation. To this end, a Generative Adversarial Network was used for image-to-image translation for increasing dataset size, followed by image segmentation using a U-Net-based convolutional neural network trained with deepfake images. Performance of the proposed approach is compared with ground truth of four publicly available datasets. Results show improved performance in terms of image segmentation quality metrics, and could potentially assist when training with limited data.

cs.AI - 2023-07-27

Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space

  • paper_url: http://arxiv.org/abs/2307.14953
  • repo_url: https://github.com/eddardd/demo-dadil
  • paper_authors: Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac
  • for: Multi-Source Domain Adaptation (MSDA): transferring knowledge from multiple labeled source domains to an unlabeled target domain while mitigating data distribution shift.
  • methods: A framework based on dictionary learning and optimal transport that interprets each domain as an empirical distribution and expresses it as a Wasserstein barycenter of dictionary atoms; the DaDiL algorithm learns the atom distributions and a matrix of barycentric coordinates via mini-batches, yielding two MSDA methods: DaDiL-R (reconstruction of labeled target samples) and DaDiL-E (ensembling classifiers learned on atom distributions).
  • results: Improves previous state-of-the-art classification performance by 3.15%, 2.29%, and 7.71% on the Caltech-Office, Office 31, and CRWU benchmarks, respectively; interpolations in the Wasserstein hull of learned atoms provide data that generalizes to the target domain.
    Abstract This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDil-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods in 3 benchmarks: Caltech-Office, Office 31, and CRWU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
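A toy illustration of expressing a domain as a Wasserstein barycenter of dictionary atoms, using the POT (Python Optimal Transport) library on 1D histograms. This only shows the barycentric-reconstruction idea; the DaDiL learning algorithm, which optimizes both atoms and coordinates, is not reproduced, and all numbers are placeholders.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Three 'atom' distributions on a shared 1D support, stored column-wise.
x = np.linspace(0, 1, 50)
atoms = np.stack([np.exp(-((x - c) ** 2) / 0.01) for c in (0.2, 0.5, 0.8)], axis=1)
atoms /= atoms.sum(axis=0, keepdims=True)        # (dim, n_atoms) histograms
weights = np.array([0.6, 0.3, 0.1])              # barycentric coordinates

M = ot.dist(x.reshape(-1, 1), x.reshape(-1, 1))  # squared-Euclidean cost matrix
M /= M.max()

# Entropic-regularized Wasserstein barycenter of the atoms under these
# weights: the 'expression' of one domain in the learned dictionary.
domain = ot.bregman.barycenter(atoms, M, reg=1e-2, weights=weights)
```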

Designing Fiduciary Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2308.02435
  • repo_url: None
  • paper_authors: Sebastian Benthall, David Shekman
  • for: Providing a procedure for designing and auditing Fiduciary AI: AI that complies with a fiduciary's legal duties of loyalty and care toward a principal.
  • methods: Synthesizes recent work in computer science and law: the designer should understand the system's context, identify its principals, assess those principals' best interests, and then be loyal and contextually careful with respect to them; the steps are connected to dimensions of Trustworthy AI such as privacy and alignment.
  • results: Argues that Fiduciary AI is a promising means to address the incompleteness of data subjects' consent when interacting with complex technical systems.
    Abstract A fiduciary is a trusted agent that has the legal duty to act with loyalty and care towards a principal that employs them. When fiduciary organizations interact with users through a digital interface, or otherwise automate their operations with artificial intelligence, they will need to design these AI systems to be compliant with their duties. This article synthesizes recent work in computer science and law to develop a procedure for designing and auditing Fiduciary AI. The designer of a Fiduciary AI should understand the context of the system, identify its principals, and assess the best interests of those principals. Then the designer must be loyal with respect to those interests, and careful in a contextually appropriate way. We connect the steps in this procedure to dimensions of Trustworthy AI, such as privacy and alignment. Fiduciary AI is a promising means to address the incompleteness of data subjects' consent when interacting with complex technical systems.

PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback

  • paper_url: http://arxiv.org/abs/2307.14936
  • repo_url: None
  • paper_authors: Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang
  • for: How to boost the code-generation performance of pre-trained Code LLMs.
  • methods: Proposes a novel RRTF (Rank Responses to align Test & Teacher Feedback) framework that effectively and efficiently boosts pre-trained large language models for code generation.
  • results: With PanGu-Coder2, achieves a 62.20% pass@1 score on the OpenAI HumanEval benchmark and consistently outperforms all previous Code LLMs on the CoderEval and LeetCode benchmarks.
    Abstract Large Language Models for Code (Code LLM) are flourishing. New and powerful models are released on a weekly basis, demonstrating remarkable performance on the code generation task. Various approaches have been proposed to boost the code generation performance of pre-trained Code LLMs, such as supervised fine-tuning, instruction tuning, reinforcement learning, etc. In this paper, we propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation. Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark. Furthermore, through an extensive evaluation on CoderEval and LeetCode benchmarks, we show that PanGu-Coder2 consistently outperforms all previous Code LLMs.
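The abstract leaves the exact training objective unspecified, so the sketch below is only a hypothetical reading of "rank responses": candidate programs are ordered by test and teacher feedback, and a pairwise ranking term on sequence log-likelihoods nudges the model to prefer better-ranked responses. Nothing here is PanGu-Coder2's published loss.

```python
# Hypothetical pairwise "rank responses" signal for a code LLM (PyTorch).
import torch
import torch.nn.functional as F

def rank_responses_loss(seq_logprobs: torch.Tensor) -> torch.Tensor:
    """seq_logprobs: model log-likelihoods of candidate programs, already
    sorted from best-ranked (index 0) to worst-ranked by feedback."""
    n = seq_logprobs.shape[0]
    loss = seq_logprobs.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            # push the better-ranked candidate above the worse-ranked one
            loss = loss - F.logsigmoid(seq_logprobs[i] - seq_logprobs[j])
    return loss / (n * (n - 1) / 2)

# Example: three candidates ranked by how many unit tests they pass
lp = torch.tensor([-12.3, -11.0, -15.8], requires_grad=True)
rank_responses_loss(lp).backward()
print(lp.grad)  # gradients reorder likelihoods toward the feedback ranking
```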

Solving Data Quality Problems with Desbordante: a Demo

  • paper_url: http://arxiv.org/abs/2307.14935
  • repo_url: None
  • paper_authors: George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov, Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh, Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia Barutkin, Vladislav Shalnev, Kirill Muraviev, Anna Rakhmukova, Dmitriy Shcheka, Anton Chernikov, Mikhail Vyrodov, Yaroslav Kurbatov, Maxim Fofanov, Sergei Belokonnyi, Pavel Anosov, Arthur Saliou, Eduard Gaisin, Kirill Smirnov
  • for: Improving the efficiency and quality of data analysis in modern data-driven industries.
  • methods: Uses complex statistics such as functional dependencies, data constraints, and association rules, and provides descriptive explanations for them.
  • results: An efficient, scalable, crash-resilient data profiling system with explanations and seamless Python integration.
    Abstract Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data. Because of that, these patterns effectively rest in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.

Approximate Model-Based Shielding for Safe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.00707
  • repo_url: https://github.com/sacktock/ambs
  • paper_authors: Alexander W. Goodall, Francesco Belardinelli
  • for: Addresses the difficulty of applying reinforcement learning (RL) to safety-critical systems in the real world.
  • methods: Proposes approximate model-based shielding (AMBS), a principled look-ahead shielding algorithm for verifying the performance of learned RL policies with respect to a set of given safety constraints. AMBS requires no prior knowledge of the safety-relevant dynamics of the system.
  • results: Experiments show that AMBS outperforms other safety-aware approaches on a set of Atari games with state-dependent safety labels.
    Abstract Reinforcement learning (RL) has shown great potential for solving complex tasks in a variety of domains. However, applying RL to safety-critical systems in the real-world is not easy as many algorithms are sample-inefficient and maximising the standard RL objective comes with no guarantees on worst-case performance. In this paper we propose approximate model-based shielding (AMBS), a principled look-ahead shielding algorithm for verifying the performance of learned RL policies w.r.t. a set of given safety constraints. Our algorithm differs from other shielding approaches in that it does not require prior knowledge of the safety-relevant dynamics of the system. We provide a strong theoretical justification for AMBS and demonstrate superior performance to other safety-aware approaches on a set of Atari games with state-dependent safety-labels.

Scaling Session-Based Transformer Recommendations using Optimized Negative Sampling and Loss Functions

  • paper_url: http://arxiv.org/abs/2307.14906
  • repo_url: https://github.com/otto-de/tron
  • paper_authors: Timo Wilm, Philipp Normann, Sophie Baumeister, Paul-Vincent Kobow
  • for: Improving recommendation quality while keeping training time low.
  • methods: Uses optimized top-k negative sampling and listwise loss functions to enhance recommendation accuracy.
  • results: Improves recommendation quality over prior methods on large-scale e-commerce datasets while maintaining training speeds similar to SASRec; a live A/B test showed an 18.14% increase in click-through rate over SASRec.
    Abstract This work introduces TRON, a scalable session-based Transformer Recommender using Optimized Negative-sampling. Motivated by the scalability and performance limitations of prevailing models such as SASRec and GRU4Rec+, TRON integrates top-k negative sampling and listwise loss functions to enhance its recommendation accuracy. Evaluations on relevant large-scale e-commerce datasets show that TRON improves upon the recommendation quality of current methods while maintaining training speeds similar to SASRec. A live A/B test yielded an 18.14% increase in click-through rate over SASRec, highlighting the potential of TRON in practical settings. For further research, we provide access to our source code at https://github.com/otto-de/TRON and an anonymized dataset at https://github.com/otto-de/recsys-dataset.
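The pairing of top-k negative sampling with a listwise loss can be sketched as follows: uniformly sampled candidate items are scored, only the k hardest negatives are kept, and a softmax cross-entropy is taken over the positive plus those negatives. Shapes and the uniform sampler are simplifying assumptions (collisions with the positive are ignored); TRON's real implementation is in the linked repository.

```python
# Illustrative top-k negative sampling + listwise loss (not TRON's code).
import torch
import torch.nn.functional as F

def listwise_topk_loss(session_emb, pos_item_emb, item_table, k=32, n_cand=512):
    """session_emb: (B, d); pos_item_emb: (B, d); item_table: (N, d)."""
    cand = item_table[torch.randint(item_table.size(0), (n_cand,))]  # (n_cand, d)
    neg_scores = session_emb @ cand.T                 # (B, n_cand)
    hard_negs, _ = neg_scores.topk(k, dim=1)          # keep the k hardest negatives
    pos_score = (session_emb * pos_item_emb).sum(-1, keepdim=True)   # (B, 1)
    logits = torch.cat([pos_score, hard_negs], dim=1)                # (B, 1+k)
    labels = torch.zeros(logits.size(0), dtype=torch.long)           # positive = index 0
    return F.cross_entropy(logits, labels)

loss = listwise_topk_loss(torch.randn(8, 64), torch.randn(8, 64),
                          torch.randn(10_000, 64))
print(loss)
```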

CodeLens: An Interactive Tool for Visualizing Code Representations

  • paper_url: http://arxiv.org/abs/2307.14902
  • repo_url: None
  • paper_authors: Yuejun Guo, Seifeddine Bettaieb, Qiang Hu, Yves Le Traon, Qiang Tang
  • for: Provides a tool for visualizing code representations, helping developers understand and explore them.
  • methods: Supports multiple programming languages (Java, Python, JavaScript) and four types of code representations: sequences of tokens, abstract syntax trees (AST), data flow graphs (DFG), and control flow graphs (CFG).
  • results: Introduces CodeLens, with which developers can quickly visualize a specific code representation and obtain the represented inputs for models of code.
    Abstract Representing source code in a generic input format is crucial to automate software engineering tasks, e.g., applying machine learning algorithms to extract information. Visualizing code representations can further enable human experts to gain an intuitive insight into the code. Unfortunately, as of today, there is no universal tool that can simultaneously visualise different types of code representations. In this paper, we introduce a tool, CodeLens, which provides a visual interaction environment that supports various representation methods and helps developers understand and explore them. CodeLens is designed to support multiple programming languages, such as Java, Python, and JavaScript, and four types of code representations, including sequence of tokens, abstract syntax tree (AST), data flow graph (DFG), and control flow graph (CFG). By using CodeLens, developers can quickly visualize the specific code representation and also obtain the represented inputs for models of code. The Web-based interface of CodeLens is available at http://www.codelens.org. The demonstration video can be found at http://www.codelens.org/demo.
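Two of the four representations that CodeLens visualizes, token sequences and ASTs, can be produced for Python code with the standard library alone. The snippet below is independent of CodeLens itself and only shows what those representations contain.

```python
# Producing two code representations for a Python snippet with the
# standard library only (independent of CodeLens itself).
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Abstract syntax tree (AST)
tree = ast.parse(src)
print(ast.dump(tree, indent=2))

# Sequence of tokens
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(src).readline)
          if tok.string.strip()]
print(tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```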

Text-guided Foundation Model Adaptation for Pathological Image Classification

  • paper_url: http://arxiv.org/abs/2307.14901
  • repo_url: https://github.com/yunkun-zhang/cite
  • paper_authors: Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, Dequan Wang
  • for: Enhancing data-efficient pathological image classification
  • methods: Utilizing language models pre-trained with a broad range of biomedical texts to connect image and text embeddings and enhance pathological image understanding
  • results: Leading performance compared with various baselines, especially when training data is scarce, demonstrated through extensive experiments on the PatchGastric stomach tumor pathological image dataset.
    Abstract The recent surge of foundation models in computer vision and natural language processing opens up perspectives in utilizing multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Guiding data-efficient image diagnosis from the use of biomedical text knowledge becomes a substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained with a broad range of biomedical texts, leading to adapt foundation models towards pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification. Code is available at https://github.com/Yunkun-Zhang/CITE.
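A minimal sketch of the underlying idea of connecting image and text embeddings: class descriptions embedded by a biomedical language model act as classifier weights against which image features are scored. The random tensors below stand in for real encoder outputs; the actual projection and training procedure are in the official repository.

```python
# Sketch: text embeddings of class descriptions act as classifier weights.
import torch
import torch.nn.functional as F

def text_guided_logits(image_feat, class_text_feat, temperature=0.07):
    """image_feat: (B, d) from an image encoder; class_text_feat: (C, d)
    from a biomedical language model, projected to the same space."""
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(class_text_feat, dim=-1)
    return img @ txt.T / temperature        # (B, C) similarity logits

# Random stand-ins for real encoder outputs (4 images, 3 classes)
logits = text_guided_logits(torch.randn(4, 512), torch.randn(3, 512))
print(logits.argmax(dim=1))                 # predicted class per image
```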

Base-based Model Checking for Multi-Agent Only Believing (long version)

  • paper_url: http://arxiv.org/abs/2307.14893
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Tiago de Lima, Emiliano Lorini, François Schwarzentruber
  • for: Presents a semantics for a multi-agent language of only believing based on belief bases, and shows how to automatically check formulas of this language and of its dynamic extension with private belief expansion operators.
  • methods: A PSPACE model-checking algorithm relying on a reduction to QBF, and an alternative dedicated algorithm relying on exploration of the state space.
  • results: An implementation of the QBF-based algorithm and experimental results on computation time in a concrete example.
    Abstract We present a novel semantics for the language of multi-agent only believing exploiting belief bases, and show how to use it for automatically checking formulas of this language and of its dynamic extension with private belief expansion operators. We provide a PSPACE algorithm for model checking relying on a reduction to QBF and alternative dedicated algorithm relying on the exploration of the state space. We present an implementation of the QBF-based algorithm and some experimental results on computation time in a concrete example.

Weakly Supervised Multi-Modal 3D Human Body Pose Estimation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.14889
  • repo_url: None
  • paper_authors: Peter Bauer, Arij Bouazizi, Ulrich Kressel, Fabian B. Flohr
  • for: Proposes a simple yet efficient weakly supervised method for 3D human pose estimation (3D HPE) on autonomous vehicles (AVs).
  • methods: Uses a high-level sensor fusion of camera and LiDAR data; training relies on an off-the-shelf 2D joint extractor and pseudo labels generated from LiDAR-to-image projections, without any 2D/3D keypoint labels.
  • results: Outperforms state-of-the-art results by roughly 13% on the Waymo Open Dataset in the weakly supervised setting, and achieves state-of-the-art results in the supervised setting.
    Abstract Accurate 3D human pose estimation (3D HPE) is crucial for enabling autonomous vehicles (AVs) to make informed decisions and respond proactively in critical road scenarios. Promising results of 3D HPE have been gained in several domains such as human-computer interaction, robotics, sports and medical analytics, often based on data collected in well-controlled laboratory environments. Nevertheless, the transfer of 3D HPE methods to AVs has received limited research attention, due to the challenges posed by obtaining accurate 3D pose annotations and the limited suitability of data from other domains. We present a simple yet efficient weakly supervised approach for 3D HPE in the AV context by employing a high-level sensor fusion between camera and LiDAR data. The weakly supervised setting enables training on the target datasets without any 2D/3D keypoint labels by using an off-the-shelf 2D joint extractor and pseudo labels generated from LiDAR to image projections. Our approach outperforms state-of-the-art results by up to $\sim$ 13% on the Waymo Open Dataset in the weakly supervised setting and achieves state-of-the-art results in the supervised setting.
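The pseudo labels come from projecting LiDAR points into the camera image so they can be matched with 2D joints from an off-the-shelf detector. A generic sketch of that projection is given below; matrix names and conventions are assumptions rather than Waymo-specific APIs.

```python
# Generic pinhole projection of LiDAR points into a camera image.
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """points_lidar: (N, 3); T_cam_from_lidar: (4, 4) extrinsics;
    K: (3, 3) camera intrinsics. Returns pixel coords of visible points."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]   # camera frame
    in_front = pts_cam[:, 2] > 0                      # keep points ahead
    uv = (K @ pts_cam[in_front].T).T
    return uv[:, :2] / uv[:, 2:3], pts_cam[in_front]

pts = np.random.rand(100, 3) * 10.0 + np.array([0.0, 0.0, 1.0])
K = np.array([[720.0, 0.0, 640.0], [0.0, 720.0, 360.0], [0.0, 0.0, 1.0]])
uv, pts_cam = project_lidar_to_image(pts, np.eye(4), K)
print(uv.shape)  # matching uv to detected 2D joints yields 3D pseudo labels
```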

Exploiting the Potential of Seq2Seq Models as Robust Few-Shot Learners

  • paper_url: http://arxiv.org/abs/2307.14856
  • repo_url: None
  • paper_authors: Jihyeon Lee, Dain Kim, Doohae Jung, Boseop Kim, Kyoung-Woon On
  • for: Studies the in-context few-shot learning capabilities of seq2seq models and how to better elicit them.
  • methods: Compares decoder-only and encoder-decoder models on a broad range of tasks, and proposes two methods to more effectively elicit in-context learning in seq2seq models: objective-aligned prompting and a fusion-based approach.
  • results: The approach outperforms a decoder-only model six times larger and shows significant performance improvements over conventional seq2seq models across a variety of settings.
    Abstract In-context learning, which offers substantial advantages over fine-tuning, is predominantly observed in decoder-only models, while encoder-decoder (i.e., seq2seq) models excel in methods that rely on weight updates. Recently, a few studies have demonstrated the feasibility of few-shot learning with seq2seq models; however, this has been limited to tasks that align well with the seq2seq architecture, such as summarization and translation. Inspired by these initial studies, we provide a first-ever extensive experiment comparing the in-context few-shot learning capabilities of decoder-only and encoder-decoder models on a broad range of tasks. Furthermore, we propose two methods to more effectively elicit in-context learning ability in seq2seq models: objective-aligned prompting and a fusion-based approach. Remarkably, our approach outperforms a decoder-only model that is six times larger and exhibits significant performance improvements compared to conventional seq2seq models across a variety of settings. We posit that, with the right configuration and prompt design, seq2seq models can be highly effective few-shot learners for a wide spectrum of applications.

Counterfactual Explanations for Graph Classification Through the Lenses of Density

  • paper_url: http://arxiv.org/abs/2307.14849
  • repo_url: https://github.com/carlo-abrate/Counterfactual-Explanations-for-Graph-Classification-Through-the-Lenses-of-Density
  • paper_authors: Carlo Abrate, Giulia Preti, Francesco Bonchi
  • for: Provides a density-based framework for generating instance-level counterfactual explanations for graph classifiers.
  • methods: Generates counterfactuals by manipulating dense substructures, with two instantiations: a method that opens or closes triangles, and a method driven by maximal cliques.
  • results: Evaluated on 7 brain network datasets and compared using several widely used metrics; results confirm that adopting a semantically relevant unit of change such as density is essential for versatile and interpretable counterfactual explanations.
    Abstract Counterfactual examples have emerged as an effective approach to produce simple and understandable post-hoc explanations. In the context of graph classification, previous work has focused on generating counterfactual explanations by manipulating the most elementary units of a graph, i.e., removing an existing edge, or adding a non-existing one. In this paper, we claim that such language of explanation might be too fine-grained, and turn our attention to some of the main characterizing features of real-world complex networks, such as the tendency to close triangles, the existence of recurring motifs, and the organization into dense modules. We thus define a general density-based counterfactual search framework to generate instance-level counterfactual explanations for graph classifiers, which can be instantiated with different notions of dense substructures. In particular, we show two specific instantiations of this general framework: a method that searches for counterfactual graphs by opening or closing triangles, and a method driven by maximal cliques. We also discuss how the general method can be instantiated to exploit any other notion of dense substructures, including, for instance, a given taxonomy of nodes. We evaluate the effectiveness of our approaches in 7 brain network datasets and compare the counterfactual statements generated according to several widely-used metrics. Results confirm that adopting a semantic-relevant unit of change like density is essential to define versatile and interpretable counterfactual explanation methods.
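One elementary move of the triangle-based instantiation, closing an open triangle, is easy to show with networkx. A full counterfactual search would repeat such moves until the graph classifier flips its prediction; the sketch below applies a single move.

```python
# One "close a triangle" move from the density-based search (networkx).
import itertools
import networkx as nx

def close_one_triangle(G):
    """Return a copy of G with one open triangle (wedge) closed, if any."""
    H = G.copy()
    for v in H.nodes:
        for u, w in itertools.combinations(list(H.neighbors(v)), 2):
            if not H.has_edge(u, w):
                H.add_edge(u, w)      # close the wedge u-v-w into a triangle
                return H
    return H                           # no open triangle found

G = nx.path_graph(4)                   # 0-1-2-3
H = close_one_triangle(G)
print(sorted(H.edges))                 # edge (0, 2) closes the wedge at node 1
```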

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

  • paper_url: http://arxiv.org/abs/2308.02510
  • repo_url: None
  • paper_authors: Yu-Ting Lan, Kan Ren, Yansen Wang, Wei-Long Zheng, Dongsheng Li, Bao-Liang Lu, Lili Qiu
  • for: Research on human visual perception, using advances in neuroscience and artificial intelligence to record visually evoked brain activity and mimic visual perception computationally.
  • methods: Reconstructs observed images from electroencephalography (EEG) signals via a comprehensive pipeline, NeuroImagen, which extracts multi-level perceptual information from the EEG and feeds it to a latent diffusion model for high-resolution image reconstruction.
  • results: Experiments demonstrate effective image reconstruction and superior quantitative performance over baseline methods.
    Abstract Seeing is believing; however, the underlying mechanism of how human visual perceptions are intertwined with our cognitions is still a mystery. Thanks to the recent advances in both neuroscience and artificial intelligence, we have been able to record the visually evoked brain activities and mimic the visual perception ability through computational approaches. In this paper, we pay attention to visual stimuli reconstruction by reconstructing the observed images based on portably accessible brain signals, i.e., electroencephalography (EEG) data. Since EEG signals are dynamic in the time-series format and are notorious to be noisy, processing and extracting useful information requires more dedicated efforts; In this paper, we propose a comprehensive pipeline, named NeuroImagen, for reconstructing visual stimuli images from EEG signals. Specifically, we incorporate a novel multi-level perceptual information decoding to draw multi-grained outputs from the given EEG data. A latent diffusion model will then leverage the extracted information to reconstruct the high-resolution visual stimuli images. The experimental results have illustrated the effectiveness of image reconstruction and superior quantitative performance of our proposed method.

Hybrid ASP-based multi-objective scheduling of semiconductor manufacturing processes (Extended version)

  • paper_url: http://arxiv.org/abs/2307.14799
  • repo_url: None
  • paper_authors: Mohammed M. S. El-Kholany, Ramsha Ali, Martin Gebser
  • for: Scheduling modern semiconductor manufacturing processes, which involve intricate production flows and diverse high-tech machines.
  • methods: Models the specific requirements of semiconductor manufacturing using hybrid Answer Set Programming with difference logic, incorporating flexible machine processing, setup, batching, and maintenance operations.
  • results: Examines the potential of large-scale scheduling subject to multiple optimization objectives, rather than locally optimizing single machines or specific stages.
    Abstract Modern semiconductor manufacturing involves intricate production processes consisting of hundreds of operations, which can take several months from lot release to completion. The high-tech machines used in these processes are diverse, operate on individual wafers, lots, or batches in multiple stages, and necessitate product-specific setups and specialized maintenance procedures. This situation is different from traditional job-shop scheduling scenarios, which have less complex production processes and machines, and mainly focus on solving highly combinatorial but abstract scheduling problems. In this work, we address the scheduling of realistic semiconductor manufacturing processes by modeling their specific requirements using hybrid Answer Set Programming with difference logic, incorporating flexible machine processing, setup, batching and maintenance operations. Unlike existing methods that schedule semiconductor manufacturing processes locally with greedy heuristics or by independently optimizing specific machine group allocations, we examine the potentials of large-scale scheduling subject to multiple optimization objectives.

Emotion4MIDI: a Lyrics-based Emotion-Labeled Symbolic Music Dataset

  • paper_url: http://arxiv.org/abs/2307.14783
  • repo_url: https://github.com/serkansulun/lyricsemotions
  • paper_authors: Serkan Sulun, Pedro Oliveira, Paula Viana
  • for: Creating a large-scale emotion-labeled symbolic music dataset of 12k MIDI songs, to explore the connection between music and emotions and to develop models that generate music based on specific emotions.
  • methods: First trains emotion classification models on the GoEmotions dataset, achieving state-of-the-art results with a model half the size of the baseline, then applies these models to the lyrics of two large-scale MIDI datasets.
  • results: A dataset covering a wide range of fine-grained emotions, providing a valuable resource for studying the relationship between music and emotions and for developing emotion-conditioned music generation models.
    Abstract We present a new large-scale emotion-labeled symbolic music dataset consisting of 12k MIDI songs. To create this dataset, we first trained emotion classification models on the GoEmotions dataset, achieving state-of-the-art results with a model half the size of the baseline. We then applied these models to lyrics from two large-scale MIDI datasets. Our dataset covers a wide range of fine-grained emotions, providing a valuable resource to explore the connection between music and emotions and, especially, to develop models that can generate music based on specific emotions. Our code for inference, trained models, and datasets are available online.

Car-Driver Drowsiness Assessment through 1D Temporal Convolutional Networks

  • paper_url: http://arxiv.org/abs/2308.02415
  • repo_url: None
  • paper_authors: Francesco Rundo, Concetto Spampinato, Michael Rundo
  • for: Improving driving safety by analyzing the driver's attention level.
  • methods: Uses a novel bio-sensor comprising near-infrared LED emitters and photo-detectors (a Silicon PhotoMultiplier device) to assess the driver's physiological state via the PhotoPlethysmography (PPG) signal, together with an embedded time-domain hyper-filtering technique and a 1D temporal convolutional architecture with a progressive dilation setup.
  • results: The system classifies driver drowsiness in near real time with an accuracy of approximately 96%.
    Abstract Recently, the scientific progress of Advanced Driver Assistance System solutions (ADAS) has played a key role in enhancing the overall safety of driving. ADAS technology enables active control of vehicles to prevent potentially risky situations. An important aspect that researchers have focused on is the analysis of the driver attention level, as recent reports confirmed a rising number of accidents caused by drowsiness or lack of attentiveness. To address this issue, various studies have suggested monitoring the driver physiological state, as there exists a well-established connection between the Autonomic Nervous System (ANS) and the level of attention. For our study, we designed an innovative bio-sensor comprising near-infrared LED emitters and photo-detectors, specifically a Silicon PhotoMultiplier device. This allowed us to assess the driver physiological status by analyzing the associated PhotoPlethysmography (PPG) signal. Furthermore, we developed an embedded time-domain hyper-filtering technique in conjunction with a 1D Temporal Convolutional architecture that embeds a progressive dilation setup. This integrated system enables near real-time classification of driver drowsiness, yielding remarkable accuracy levels of approximately 96%.
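A sketch of a 1D temporal convolutional classifier with a progressive dilation setup is given below. Channel widths, depth, and the hyper-filtering front-end are assumptions; only the growing-dilation structure follows the description above.

```python
# Sketch of a 1D TCN with progressive dilation (PyTorch); sizes assumed.
import torch
import torch.nn as nn

class DrowsinessTCN(nn.Module):
    def __init__(self, in_ch=1, hidden=32, n_classes=2, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:                      # dilation grows per block
            layers += [nn.Conv1d(ch, hidden, kernel_size=3,
                                 dilation=d, padding=d),
                       nn.BatchNorm1d(hidden),
                       nn.ReLU()]
            ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                        # x: (B, 1, T) filtered PPG
        h = self.tcn(x).mean(dim=-1)             # global average over time
        return self.head(h)                      # drowsy vs. alert logits

model = DrowsinessTCN()
print(model(torch.randn(2, 1, 1024)).shape)      # torch.Size([2, 2])
```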

Fair Machine Unlearning: Data Removal while Mitigating Disparities

  • paper_url: http://arxiv.org/abs/2307.14754
  • repo_url: None
  • paper_authors: Alex Oesterling, Jiaqi Ma, Flavio P. Calmon, Hima Lakkaraju
  • for: Proposes a machine unlearning method that reliably forgets data instances while preserving group fairness.
  • methods: A gradient-based approach in which the model learns a new weighting function in order to forget data instances.
  • results: Experiments show the method effectively unlearns data instances while preserving group fairness.
    Abstract As public consciousness regarding the collection and use of personal information by corporations grows, it is of increasing importance that consumers be active participants in the curation of corporate datasets. In light of this, data governance frameworks such as the General Data Protection Regulation (GDPR) have outlined the right to be forgotten as a key principle allowing individuals to request that their personal data be deleted from the databases and models used by organizations. To achieve forgetting in practice, several machine unlearning methods have been proposed to address the computational inefficiencies of retraining a model from scratch with each unlearning request. While efficient online alternatives to retraining, it is unclear how these methods impact other properties critical to real-world applications, such as fairness. In this work, we propose the first fair machine unlearning method that can provably and efficiently unlearn data instances while preserving group fairness. We derive theoretical results which demonstrate that our method can provably unlearn data instances while maintaining fairness objectives. Extensive experimentation with real-world datasets highlight the efficacy of our method at unlearning data instances while preserving fairness.

LLMediator: GPT-4 Assisted Online Dispute Resolution

  • paper_url: http://arxiv.org/abs/2307.16732
  • repo_url: None
  • paper_authors: Hannes Westermann, Jaromir Savelka, Karim Benyekhlef
  • for: The paper explores the potential of large language models (LLMs) to enhance online dispute resolution (ODR) processes, specifically in the context of high-volume, low-intensity legal disputes.
  • methods: The paper proposes LLMediator, an experimental platform that leverages GPT-4 to reformulate user messages, draft mediator responses, and potentially engage in discussions autonomously.
  • results: The initial qualitative evaluations presented in the paper demonstrate the potential for LLMs to support ODR and facilitate amicable settlements, with promising results for the proof of concept.
    Abstract In this article, we introduce LLMediator, an experimental platform designed to enhance online dispute resolution (ODR) by utilizing capabilities of state-of-the-art large language models (LLMs) such as GPT-4. In the context of high-volume, low-intensity legal disputes, alternative dispute resolution methods such as negotiation and mediation offer accessible and cooperative solutions for laypeople. These approaches can be carried out online on ODR platforms. LLMediator aims to improve the efficacy of such processes by leveraging GPT-4 to reformulate user messages, draft mediator responses, and potentially autonomously engage in the discussions. We present and discuss several features of LLMediator and conduct initial qualitative evaluations, demonstrating the potential for LLMs to support ODR and facilitate amicable settlements. The initial proof of concept is promising and opens up avenues for further research in AI-assisted negotiation and mediation.

Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

  • paper_url: http://arxiv.org/abs/2307.14750
  • repo_url: https://github.com/zhiyuan-li-john/rapsg
  • paper_authors: Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai
  • for: This paper proposes a new strategy for training an image captioner without annotated image-sentence pairs, which is to leverage prior knowledge from large pre-trained models (LPMs) and integrate a retrieval process to generate high-quality pseudo sentences.
  • methods: The proposed method, called LPM + retrieval-augmented learning, consists of two main components: (1) Retrieval-augmented Pseudo Sentence Generation (RaPSG), which retrieves highly relevant short region descriptions from mismatching corpora and uses them to generate a variety of pseudo sentences with distinct representations and high quality, and (2) a fluency filter and a CLIP-guided training objective to facilitate model optimization.
  • results: The proposed method achieves a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of the trainable parameters of the SOTA pre-training model (Flamingo3B), and outperforms the 1% semi-supervised image caption benchmark with a score of 93.4 CIDEr (+8.9) with a simple extension.
    Abstract Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches can be categorized into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner using external image-text pairs. However, the aligning setting seems to reach its performance limit due to the quality problem of pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy "LPM + retrieval-augmented learning" where the prior knowledge from large pre-trained models (LPMs) is leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which adopts an efficient approach to retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of pseudo sentences with distinct representations as well as high quality via LPMs. In addition, a fluency filter and a CLIP-guided training objective are further introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B) by achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters (1.3B VS 33M). Importantly, our approach eliminates the need for computationally expensive pre-training processes on external datasets (e.g., the requirement of 312M image-text pairs for Flamingo3B). We further show that with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1% semi-supervised image caption benchmark up to 93.4 CIDEr score (+8.9) which showcases the versatility and effectiveness of our approach.

JusticeBot: A Methodology for Building Augmented Intelligence Tools for Laypeople to Increase Access to Justice

  • paper_url: http://arxiv.org/abs/2308.02032
  • repo_url: None
  • paper_authors: Hannes Westermann, Karim Benyekhlef
  • for: Helping laypeople (individuals without legal training) resolve their legal problems.
  • methods: A methodology for building legal decision support tools based on hybrid case-based and rule-based reasoning: the system asks users questions about their situation and provides legal information, references to similar previous cases, and possible next steps.
  • results: A tool that can potentially help users resolve their issues, e.g., by settling their case or enforcing their rights in court; the first deployed JusticeBot version, focused on landlord-tenant disputes, has been used by thousands of individuals.
    Abstract Laypeople (i.e. individuals without legal training) may often have trouble resolving their legal problems. In this work, we present the JusticeBot methodology. This methodology can be used to build legal decision support tools, that support laypeople in exploring their legal rights in certain situations, using a hybrid case-based and rule-based reasoning approach. The system ask the user questions regarding their situation and provides them with legal information, references to previous similar cases and possible next steps. This information could potentially help the user resolve their issue, e.g. by settling their case or enforcing their rights in court. We present the methodology for building such tools, which consists of discovering typically applied legal rules from legislation and case law, and encoding previous cases to support the user. We also present an interface to build tools using this methodology and a case study of the first deployed JusticeBot version, focused on landlord-tenant disputes, which has been used by thousands of individuals.

New Interaction Paradigm for Complex EDA Software Leveraging GPT

  • paper_url: http://arxiv.org/abs/2307.14740
  • repo_url: https://github.com/smarton-empower/smarton-ai
  • paper_authors: Boyu Han, Xinyu Wang, Yifan Wang, Junyu Yan, Yidong Tian
  • for: Helping novice Printed Circuit Board (PCB) designers make better use of professional electronic design automation (EDA) software such as KiCad, through an artificial intelligence (AI) interaction assist plugin that improves design efficiency and user experience.
  • methods: Based on the HuggingGPT framework, employs large language models such as GPT and BERT for task planning and execution, including analyzing help documentation paragraphs and executing different plugins, while leveraging KiCad's built-in schematic and PCB manipulation functions.
  • results: Preliminary tests show that SmartonAI can effectively streamline the PCB design process by turning complex commands into intuitive language-based interactions. Bridging the gap between complex EDA software and user-friendly interaction in this way can help novice designers, and the paradigm also extends to other complex software systems, illustrating the potential of AI-assisted user interfaces across domains.
    Abstract In the rapidly growing field of electronic design automation (EDA), professional software such as KiCad, Cadence, and Altium Designer provide increasingly extensive design functionalities. However, the intricate command structure and high learning curve create a barrier, particularly for novice printed circuit board (PCB) designers. This results in difficulties in selecting appropriate functions or plugins for varying design purposes, compounded by the lack of intuitive learning methods beyond traditional documentation, videos, and online forums. To address this challenge, an artificial intelligence (AI) interaction assist plugin for EDA software named SmartonAI is developed here, with KiCad taken as the first example. SmartonAI is inspired by the HuggingGPT framework and employs large language models, such as GPT and BERT, to facilitate task planning and execution. On receiving a designer request, SmartonAI conducts a task breakdown and efficiently executes relevant subtasks, such as analysis of help documentation paragraphs and execution of different plugins, along with leveraging the built-in schematic and PCB manipulation functions in both SmartonAI itself and the software. Our preliminary results demonstrate that SmartonAI can significantly streamline the PCB design process by simplifying complex commands into intuitive language-based interactions. By harnessing the powerful language capabilities of ChatGPT and the rich design functions of KiCad, the plugin effectively bridges the gap between complex EDA software and user-friendly interaction. Meanwhile, the new paradigm behind SmartonAI can also extend to other complex software systems, illustrating the immense potential of AI-assisted user interfaces in advancing digital interactions across various domains.

Cortex Inspired Learning to Recover Damaged Signal Modality with ReD-SOM Model

  • paper_url: http://arxiv.org/abs/2307.15095
  • repo_url: None
  • paper_authors: Artem Muliukov, Laurent Rodriguez, Benoit Miramond
  • for: Recovering lost data of one modality by using the data from another modality.
  • methods: Combines Variational Auto-Encoders, Self-Organizing Maps, and Hebbian connections in a unified ReD-SOM (Reentering Deep Self-Organizing Map) model, simulating the interaction of modalities observed in the human brain.
  • results: Experiments on a multimodal dataset show improved quality of signal reconstruction; the effect is remarkable both visually and quantitatively, especially in the presence of significant signal distortion.
    Abstract Recent progress in the fields of AI and cognitive sciences opens up new challenges that were previously inaccessible to study. One such modern task is recovering lost data of one modality by using the data from another one. A similar effect (called the McGurk Effect) has been found in the functioning of the human brain. Observing this effect, one modality of information interferes with another, changing its perception. In this paper, we propose a way to simulate such an effect and use it to reconstruct lost data modalities by combining Variational Auto-Encoders, Self-Organizing Maps, and Hebb connections in a unified ReD-SOM (Reentering Deep Self-organizing Map) model. We are inspired by humans' capability to use different zones of the brain in different modalities when information in one of the modalities is lacking. This new approach not only improves the analysis of ambiguous data but also restores the intended signal! The results obtained on the multimodal dataset demonstrate an increase in the quality of the signal reconstruction. The effect is remarkable both visually and quantitatively, specifically in the presence of a significant degree of signal distortion.

Evaluating Generative Models for Graph-to-Text Generation

  • paper_url: http://arxiv.org/abs/2307.14712
  • repo_url: https://github.com/shuzhouyuan/eval_g2t_genmodels
  • paper_authors: Shuzhou Yuan, Michael Färber
  • for: Investigates the capability of generative models to generate descriptive text from graph data in a zero-shot setting.
  • methods: Evaluates GPT-3 and ChatGPT on two graph-to-text datasets and compares their performance with finetuned LLMs such as T5 and BART.
  • results: Generative models produce fluent and coherent text, achieving BLEU scores of 10.57 and 11.08 on the AGENDA and WebNLG datasets, respectively. Error analysis reveals that they still struggle with semantic relations between entities and sometimes generate hallucinations or irrelevant information.
    Abstract Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate GPT-3 and ChatGPT on two graph-to-text datasets and compare their performance with that of finetuned LLM models such as T5 and BART. Our results demonstrate that generative models are capable of generating fluent and coherent text, achieving BLEU scores of 10.57 and 11.08 for the AGENDA and WebNLG datasets, respectively. However, our error analysis reveals that generative models still struggle with understanding the semantic relations between entities, and they also tend to generate text with hallucinations or irrelevant information. As a part of error analysis, we utilize BERT to detect machine-generated text and achieve high macro-F1 scores. We have made the text generated by generative models publicly available.
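A zero-shot setup of the kind evaluated here can be as simple as serializing the graph into triples and prompting the model. The template below is an assumption, not the paper's exact prompt, and `generate` stands in for any GPT-3/ChatGPT completion call.

```python
# Illustrative zero-shot prompting from graph triples; the template is an
# assumption and `generate` stands in for any LLM completion call.
def graph_to_prompt(triples):
    lines = [f"({s} | {p} | {o})" for s, p, o in triples]
    return ("Convert the following knowledge graph triples into a fluent "
            "English description:\n" + "\n".join(lines) + "\nText:")

triples = [("Alan_Turing", "field", "computer_science"),
           ("Alan_Turing", "birthPlace", "London")]
prompt = graph_to_prompt(triples)
print(prompt)
# text = generate(prompt)  # call GPT-3/ChatGPT here, then score with BLEU
```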

A Multimodal Supervised Machine Learning Approach for Satellite-based Wildfire Identification in Europe

  • paper_url: http://arxiv.org/abs/2308.02508
  • repo_url: None
  • paper_authors: Angelica Urbanelli, Luca Barco, Edoardo Arnaudo, Claudio Rossi
  • for: Improving the accuracy of automated satellite-based hotspot detection systems through a purpose-built wildfire identification solution.
  • methods: Cross-references MODIS and VIIRS hotspot detections with the EFFIS database to build a large-scale hotspot dataset, then applies a multimodal supervised machine learning approach using sources such as the ESRI annual Land Use Land Cover (LULC) and Copernicus Sentinel-3 data.
  • results: Experimental results show the approach effectively disambiguates hotspot detections, distinguishing wildfires from other events.
    Abstract The increasing frequency of catastrophic natural events, such as wildfires, calls for the development of rapid and automated wildfire detection systems. In this paper, we propose a wildfire identification solution to improve the accuracy of automated satellite-based hotspot detection systems by leveraging multiple information sources. We cross-reference the thermal anomalies detected by the Moderate-resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS) hotspot services with the European Forest Fire Information System (EFFIS) database to construct a large-scale hotspot dataset for wildfire-related studies in Europe. Then, we propose a novel multimodal supervised machine learning approach to disambiguate hotspot detections, distinguishing between wildfires and other events. Our methodology includes the use of multimodal data sources, such as the ESRI annual Land Use Land Cover (LULC) and the Copernicus Sentinel-3 data. Experimental results demonstrate the effectiveness of our approach in the task of wildfire identification.

Prediction of wind turbines power with physics-informed neural networks and evidential uncertainty quantification

  • paper_url: http://arxiv.org/abs/2307.14675
  • repo_url: None
  • paper_authors: Alfonso Gijón, Ainhoa Pujana-Goitia, Eugenio Perea, Miguel Molina-Solana, Juan Gómez-Romero
  • for: Optimizing wind turbine operation and maintenance through pitch angle controllers and early fault detection, improving power generation efficiency and reliability.
  • methods: Uses data-driven, physics-informed neural networks to model large datasets of turbine behavior accurately and efficiently, imposing physical constraints so the models remain physically consistent.
  • results: Models for power, torque, and power coefficient achieve great accuracy on both real data and the physical equations governing the system, and an efficient evidential layer provides uncertainty estimates consistent with the absolute error.
    Abstract The ever-growing use of wind energy makes necessary the optimization of turbine operations through pitch angle controllers and their maintenance with early fault detection. It is crucial to have accurate and robust models imitating the behavior of wind turbines, especially to predict the generated power as a function of the wind speed. Existing empirical and physics-based models have limitations in capturing the complex relations between the input variables and the power, aggravated by wind variability. Data-driven methods offer new opportunities to enhance wind turbine modeling of large datasets by improving accuracy and efficiency. In this study, we used physics-informed neural networks to reproduce historical data coming from 4 turbines in a wind farm, while imposing certain physical constraints to the model. The developed models for regression of the power, torque, and power coefficient as output variables showed great accuracy for both real data and physical equations governing the system. Lastly, introducing an efficient evidential layer provided uncertainty estimations of the predictions, proved to be consistent with the absolute error, and made possible the definition of a confidence interval in the power curve.
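The physics-informed idea can be sketched by augmenting the data loss with the residual of the governing power equation P = 0.5 * rho * A * Cp * v^3, so that the predicted power and power coefficient stay mutually consistent. Network size, the weighting factor, and the raw (unnormalized) units are assumptions for illustration.

```python
# Sketch of a physics-informed loss: data fit + residual of the power
# equation P = 0.5 * rho * A * Cp * v^3 (sizes and lam are assumptions;
# real inputs/targets would be normalized first).
import torch
import torch.nn as nn

RHO, AREA = 1.225, 5000.0        # air density (kg/m^3), rotor swept area (m^2)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 2))  # outputs: [power P, power coefficient Cp]

def pinn_loss(v, p_true, lam=0.1):
    out = net(v)
    p_pred, cp_pred = out[:, :1], out[:, 1:]
    data_term = ((p_pred - p_true) ** 2).mean()
    physics_term = ((p_pred - 0.5 * RHO * AREA * cp_pred * v ** 3) ** 2).mean()
    return data_term + lam * physics_term   # physics keeps P and Cp consistent

v = torch.rand(128, 1) * 20.0               # wind speeds in m/s
p = 0.5 * RHO * AREA * 0.4 * v ** 3         # synthetic targets with Cp = 0.4
print(pinn_loss(v, p))
```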

Fuzzy order-sorted feature logic

  • paper_url: http://arxiv.org/abs/2307.14669
  • repo_url: None
  • paper_authors: Gian Carlo Milanese, Gabriella Pasi
  • for: Studies an extension of Order-Sorted Feature (OSF) logic, a knowledge representation and reasoning language based on function-denoting feature symbols and set-denoting sort symbols, to a fuzzy setting.
  • methods: Defines a flexible fuzzy subsumption relation that generalizes Zadeh's inclusion between fuzzy sets; in this fuzzy setting, sort symbols and OSF terms denote fuzzy sets.
  • results: A fuzzy semantics for OSF logic in which subsumption between OSF terms is a fuzzy partial order, together with algorithms for computing the greatest lower bound of two OSF terms by unification and the subsumption degree between two OSF terms, along with the complexity of these operations.
    Abstract Order-Sorted Feature (OSF) logic is a knowledge representation and reasoning language based on function-denoting feature symbols and set-denoting sort symbols ordered in a subsumption lattice. OSF logic allows the construction of record-like terms that represent classes of entities and that are themselves ordered in a subsumption relation. The unification algorithm for such structures provides an efficient calculus of type subsumption, which has been applied in computational linguistics and implemented in constraint logic programming languages such as LOGIN and LIFE and automated reasoners such as CEDAR. This work generalizes OSF logic to a fuzzy setting. We give a flexible definition of a fuzzy subsumption relation which generalizes Zadeh's inclusion between fuzzy sets. Based on this definition we define a fuzzy semantics of OSF logic where sort symbols and OSF terms denote fuzzy sets. We extend the subsumption relation to OSF terms and prove that it constitutes a fuzzy partial order with the property that two OSF terms are subsumed by one another in the crisp sense if and only if their subsumption degree is greater than 0. We show how to find the greatest lower bound of two OSF terms by unifying them and how to compute the subsumption degree between two OSF terms, and we provide the complexity of these operations.
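As a toy illustration of graded inclusion between finite fuzzy sets: Zadeh's crisp condition (A is included in B iff A(x) <= B(x) for every x) can be softened to the minimum of an implication over the domain. The Gödel implication below is just one standard choice; the paper's fuzzy subsumption is more general and is defined over sort symbols and OSF terms rather than plain sets.

```python
# Toy graded inclusion between finite fuzzy sets, softening Zadeh's
# condition (A ⊆ B iff A(x) <= B(x) for all x) with the Gödel implication.
# The paper's fuzzy subsumption is more general; this is one instance.
def inclusion_degree(A, B, implication=lambda a, b: 1.0 if a <= b else b):
    """A, B: dicts mapping elements to membership degrees in [0, 1]."""
    domain = set(A) | set(B)
    return min(implication(A.get(x, 0.0), B.get(x, 0.0)) for x in domain)

small_animal = {"mouse": 1.0, "cat": 0.6, "dog": 0.3}
animal = {"mouse": 1.0, "cat": 1.0, "dog": 0.8}
print(inclusion_degree(small_animal, animal))   # 1.0: crisp inclusion holds
print(inclusion_degree(animal, small_animal))   # 0.3: partial inclusion only
```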

Multi-Valued Partial Order Plans in Numeric Planning

  • paper_url: http://arxiv.org/abs/2307.14660
  • repo_url: None
  • paper_authors: Hayyan Helal, Gerhard Lakemeyer
  • for: Analyzes possible causes of undecidability in numeric planning by studying the number of different occurrences of actions.
  • methods: Reformulates a numeric planning problem known as restricted tasks as a search problem; an NP-complete fragment of numeric planning can then be found by using heuristics.
  • results: Develops multi-valued partial order plans, a least committing compact representation for (sequential and parallel) plans, and studies optimization techniques for this representation to incorporate soft preconditions.
    Abstract Many planning formalisms allow for mixing numeric with Boolean effects. However, most of these formalisms are undecidable. In this paper, we will analyze possible causes for this undecidability by studying the number of different occurrences of actions, an approach that proved useful for metric fluents before. We will start by reformulating a numeric planning problem known as restricted tasks as a search problem. We will then show how an NP-complete fragment of numeric planning can be found by using heuristics. To achieve this, we will develop the idea of multi-valued partial order plans, a least committing compact representation for (sequential and parallel) plans. Finally, we will study optimization techniques for this representation to incorporate soft preconditions.

MVMR-FS : Non-parametric feature selection algorithm based on Maximum inter-class Variation and Minimum Redundancy

  • paper_url: http://arxiv.org/abs/2307.14643
  • repo_url: None
  • paper_authors: Haitao Nie, Shengbo Zhang, Bin Xie
  • for: Addresses the problem that filter-based feature selection methods cannot directly measure redundancy for continuous data.
  • methods: MVMR-FS, based on maximum inter-class variation and minimum redundancy. Supervised and unsupervised kernel density estimation capture feature similarities and differences in inter-class and overall distributions; the MVMR criterion uses inter-class probability distributions to reflect feature relevance and the distances between overall probability distributions to quantify redundancy. An AGA then searches for the feature subset that minimizes MVMR.
  • results: Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy, improving accuracy by 5% to 11%.
    Abstract How to accurately measure the relevance and redundancy of features is an age-old challenge in the field of feature selection. However, existing filter-based feature selection methods cannot directly measure redundancy for continuous data. In addition, most methods rely on manually specifying the number of features, which may introduce errors in the absence of expert knowledge. In this paper, we propose a non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy, abbreviated as MVMR-FS. We first introduce supervised and unsupervised kernel density estimation on the features to capture their similarities and differences in inter-class and overall distributions. Subsequently, we present the criteria for maximum inter-class variation and minimum redundancy (MVMR), wherein the inter-class probability distributions are employed to reflect feature relevance and the distances between overall probability distributions are used to quantify redundancy. Finally, we employ an AGA to search for the feature subset that minimizes the MVMR. Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy and improves the accuracy by 5% to 11%.
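The two ingredients of the MVMR criterion can be roughed out with scipy's kernel density estimation: per-class densities of a feature indicate inter-class variation (relevance), and closeness of two features' overall densities indicates redundancy. The grid-based density distance below is a simplification of the paper's measures, and the features are assumed standardized.

```python
# Rough sketch of the MVMR ingredients with scipy KDE (a simplification,
# not the paper's exact measures; features assumed standardized).
import numpy as np
from scipy.stats import gaussian_kde

def inter_class_variation(x, y, grid):
    """Relevance: variation across per-class densities of feature x."""
    densities = [gaussian_kde(x[y == c])(grid) for c in np.unique(y)]
    return np.mean(np.var(np.stack(densities), axis=0))

def density_distance(xi, xj, grid):
    """Redundancy: small distance between overall densities of two features."""
    return np.mean(np.abs(gaussian_kde(xi)(grid) - gaussian_kde(xj)(grid)))

rng = np.random.default_rng(0)
x1 = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
x2 = x1 + rng.normal(0, 0.1, 200)              # nearly a duplicate of x1
y = np.array([0] * 100 + [1] * 100)
grid = np.linspace(-4, 7, 200)

print(inter_class_variation(x1, y, grid))       # high: x1 separates classes
print(density_distance(x1, x2, grid))           # near 0: x1 and x2 redundant
# A search (the paper uses an AGA) would pick subsets maximizing the first
# quantity while keeping pairwise density distances large.
```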

Fact-Checking of AI-Generated Reports

  • paper_url: http://arxiv.org/abs/2307.14634
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Razi Mahmood, Ge Wang, Mannudeep Kalra, Pingkun Yan
  • for: Improving the accuracy and responsible use of automatically generated reports for radiology images.
  • methods: Proposes a new fact-checking method that exploits the association between an image and sentences describing real or potentially fake findings to detect false sentences in generated reports.
  • results: The resulting examiner can verify automatically generated reports by detecting and removing fake sentences, improving report accuracy and supporting responsible use.
    Abstract With advances in generative artificial intelligence (AI), it is now possible to produce realistic-looking automated reports for preliminary reads of radiology images. This can expedite clinical workflows, improve accuracy and reduce overall costs. However, it is also well-known that such models often hallucinate, leading to false findings in the generated reports. In this paper, we propose a new method of fact-checking of AI-generated reports using their associated images. Specifically, the developed examiner differentiates real and fake sentences in reports by learning the association between an image and sentences describing real or potentially fake findings. To train such an examiner, we first created a new dataset of fake reports by perturbing the findings in the original ground truth radiology reports associated with images. Text encodings of real and fake sentences drawn from these reports are then paired with image encodings to learn the mapping to real/fake labels. The utility of such an examiner is demonstrated for verifying automatically generated reports by detecting and removing fake sentences. Future generative AI approaches can use the resulting tool to validate their reports leading to a more responsible use of AI in expediting clinical workflows.

Metric-Based In-context Learning: A Case Study in Text Simplification

  • paper_url: http://arxiv.org/abs/2307.14632
  • repo_url: https://github.com/nlp-ku/metric-based-in-context-learning
  • paper_authors: Subha Vadlamannati, Gözde Gül Şahin
  • for: investigate the best method for selecting examples for in-context learning (ICL) in text simplification (TS) tasks.
  • methods: propose a Metric-Based in-context Learning (MBL) method that uses commonly used TS metrics such as SARI, compression ratio, and BERT-Precision for selection.
  • results: show that examples selected by the top SARI scores perform the best on larger models, while the compression ratio generally performs better on smaller models. MBL is robust to example orderings and out-of-domain test sets, and outperforms strong baselines and state-of-the-art finetuned language models. Additionally, the chosen metric can implicitly control the behavior of large GPT models.
    Abstract In-context learning (ICL) for large language models has proven to be a powerful approach for many natural language processing tasks. However, determining the best method to select examples for ICL is nontrivial as the results can vary greatly depending on the quality, quantity, and order of examples used. In this paper, we conduct a case study on text simplification (TS) to investigate how to select the best and most robust examples for ICL. We propose Metric-Based in-context Learning (MBL) method that utilizes commonly used TS metrics such as SARI, compression ratio, and BERT-Precision for selection. Through an extensive set of experiments with various-sized GPT models on standard TS benchmarks such as TurkCorpus and ASSET, we show that examples selected by the top SARI scores perform the best on larger models such as GPT-175B, while the compression ratio generally performs better on smaller models such as GPT-13B and GPT-6.7B. Furthermore, we demonstrate that MBL is generally robust to example orderings and out-of-domain test sets, and outperforms strong baselines and state-of-the-art finetuned language models. Finally, we show that the behaviour of large GPT models can be implicitly controlled by the chosen metric. Our research provides a new framework for selecting examples in ICL, and demonstrates its effectiveness in text simplification tasks, breaking new ground for more accurate and efficient NLG systems.
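A small sketch of metric-based example selection: score each candidate (complex, simple) demonstration with a TS metric, keep the top-k, and assemble the prompt. Compression ratio is shown because it is self-contained; a SARI implementation can be plugged in via `metric`. The prompt template is an illustrative assumption:

```python
def compression_ratio(source, simplification):
    """Lower = more compression; one of the paper's selection metrics."""
    return len(simplification.split()) / max(1, len(source.split()))

def select_examples(candidates, metric=compression_ratio, k=4, reverse=False):
    """candidates: list of (source, simplification) pairs; keep the top-k.
    Use reverse=True for metrics where higher is better (e.g., SARI)."""
    return sorted(candidates, key=lambda p: metric(*p), reverse=reverse)[:k]

def build_prompt(examples, test_sentence):
    shots = "\n".join(f"Complex: {s}\nSimple: {t}" for s, t in examples)
    return f"{shots}\nComplex: {test_sentence}\nSimple:"
```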

A Survey on Reservoir Computing and its Interdisciplinary Applications Beyond Traditional Machine Learning

  • paper_url: http://arxiv.org/abs/2307.15092
  • repo_url: None
  • paper_authors: Heng Zhang, Danilo Vasconcellos Vargas
  • for: Surveys recent developments in reservoir computing (RC), spanning machine learning, physics, biology, and neuroscience.
  • methods: RC uses randomly connected neurons whose connection strengths remain fixed after initialization, turning the network into a nonlinear dynamical system suited to temporal signal processing.
  • results: RC maps low-dimensional inputs into a high-dimensional space with rich nonlinear dynamics and memory capacity, enabling applications across many fields, including physical hardware and biological implementations.
    Abstract Reservoir computing (RC), first applied to temporal signal processing, is a recurrent neural network in which neurons are randomly connected. Once initialized, the connection strengths remain unchanged. Such a simple structure turns RC into a non-linear dynamical system that maps low-dimensional inputs into a high-dimensional space. The model's rich dynamics, linear separability, and memory capacity then enable a simple linear readout to generate adequate responses for various applications. RC spans areas far beyond machine learning, since it has been shown that the complex dynamics can be realized in various physical hardware implementations and biological devices. This yields greater flexibility and shorter computation time. Moreover, the neuronal responses triggered by the model's dynamics shed light on understanding brain mechanisms that also exploit similar dynamical processes. While the literature on RC is vast and fragmented, here we conduct a unified review of RC's recent developments from machine learning to physics, biology, and neuroscience. We first review the early RC models, and then survey the state-of-the-art models and their applications. We further introduce studies on modeling the brain's mechanisms by RC. Finally, we offer new perspectives on RC development, including reservoir design, coding frameworks unification, physical RC implementations, and interaction between RC, cognitive neuroscience and evolution.
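A minimal echo-state-style sketch of the RC recipe described above: fixed random recurrent weights, a driven nonlinear state, and a ridge-regression linear readout as the only trained part. Reservoir size, spectral radius, and the toy next-step prediction task are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, steps = 300, 500                        # reservoir size, sequence length
W_in = rng.normal(size=(N, 1)) * 0.5       # input weights (fixed)
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

u = np.sin(np.linspace(0, 20, steps))[:, None]   # toy input signal
target = np.roll(u, -1, axis=0)                  # predict the next value

x = np.zeros(N)
states = np.empty((steps, N))
for t in range(steps):
    x = np.tanh(W @ x + W_in @ u[t])   # reservoir dynamics, weights unchanged
    states[t] = x

# Only the linear readout is trained (ridge regression):
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(N),
                        states.T @ target)
prediction = states @ W_out
```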

BubbleML: A Multi-Physics Dataset and Benchmarks for Machine Learning

  • paper_url: http://arxiv.org/abs/2307.14623
  • repo_url: https://github.com/hpcforge/bubbleml
  • paper_authors: Sheikh Md Shakeel Hassan, Arthur Feeney, Akash Dhruv, Jihoon Kim, Youngjoon Suh, Jaiyoung Ryu, Yoonjin Won, Aparna Chandramowlishwaran
  • for: To provide an ML-ready dataset that helps researchers study the complexity of multi-physics phase change phenomena.
  • methods: Physics-driven numerical simulations provide accurate ground truth for a range of boiling scenarios (nucleate pool boiling, flow boiling, and sub-cooled boiling), covering varying gravity conditions, flow rates, sub-cooling levels, and wall superheat, across 51 simulations.
  • results: The dataset is validated against experimental observations and trends, and enables exploration of downstream tasks such as optical flow analysis of bubble dynamics and operator networks for learning temperature dynamics.
    Abstract In the field of phase change phenomena, the lack of accessible and diverse datasets suitable for machine learning (ML) training poses a significant challenge. Existing experimental datasets are often restricted, with limited availability and sparse ground truth data, impeding our understanding of this complex multi-physics phenomena. To bridge this gap, we present the BubbleML Dataset(https://github.com/HPCForge/BubbleML) which leverages physics-driven simulations to provide accurate ground truth information for various boiling scenarios, encompassing nucleate pool boiling, flow boiling, and sub-cooled boiling. This extensive dataset covers a wide range of parameters, including varying gravity conditions, flow rates, sub-cooling levels, and wall superheat, comprising 51 simulations. BubbleML is validated against experimental observations and trends, establishing it as an invaluable resource for ML research. Furthermore, we showcase its potential to facilitate exploration of diverse downstream tasks by introducing two benchmarks: (a) optical flow analysis to capture bubble dynamics, and (b) operator networks for learning temperature dynamics. The BubbleML dataset and its benchmarks serve as a catalyst for advancements in ML-driven research on multi-physics phase change phenomena, enabling the development and comparison of state-of-the-art techniques and models.

Self-Contrastive Graph Diffusion Network

  • paper_url: http://arxiv.org/abs/2307.14613
  • repo_url: https://github.com/kunzhan/SCDGN
  • paper_authors: Yixian Ma, Kun Zhan
  • for: Proposes a new framework, the Self-Contrastive Graph Diffusion Network (SCGDN), for graph self-contrastive learning that addresses several limitations of existing methods.
  • methods: The framework has two main components: an Attentional Module (AttM), which aggregates higher-order structure and feature information to obtain a strong embedding, and a Diffusion Module (DiFM), which balances the state of each node through Laplacian diffusion learning and lets adjacency and feature information co-evolve over the graph.
  • results: SCGDN avoids "sampling bias" and semantic drift without pre-training. With its high-quality sampling strategy, it preserves high-order structure information while avoiding overfitting, and it consistently outperforms both contrastive and classical baselines.
    Abstract Augmentation techniques and sampling strategies are crucial in contrastive learning, but in most existing works, augmentation techniques require careful design, and their sampling strategies can only capture a small amount of intrinsic supervision information. Additionally, the existing methods require complex designs to obtain two different representations of the data. To overcome these limitations, we propose a novel framework called the Self-Contrastive Graph Diffusion Network (SCGDN). Our framework consists of two main components: the Attentional Module (AttM) and the Diffusion Module (DiFM). AttM aggregates higher-order structure and feature information to get an excellent embedding, while DiFM balances the state of each node in the graph through Laplacian diffusion learning and allows the cooperative evolution of adjacency and feature information in the graph. Unlike existing methodologies, SCGDN is an augmentation-free approach that avoids "sampling bias" and semantic drift, without the need for pre-training. We conduct a high-quality sampling of samples based on structure and feature information. If two nodes are neighbors, they are considered positive samples of each other. If two disconnected nodes are also unrelated on $k$NN graph, they are considered negative samples for each other. The contrastive objective reasonably uses our proposed sampling strategies, and the redundancy reduction term minimizes redundant information in the embedding and can well retain more discriminative information. In this novel framework, the graph self-contrastive learning paradigm gives expression to a powerful force. SCGDN effectively balances between preserving high-order structure information and avoiding overfitting. The results manifest that SCGDN can consistently generate outperformance over both the contrastive methods and the classical methods.
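A small sketch of the sampling rule quoted above, assuming dense 0/1 numpy arrays purely for illustration: `A` is the graph adjacency matrix and `K` a kNN graph built on features. Adjacent nodes form positive pairs; pairs that are neither adjacent nor related on the kNN graph form negative pairs:

```python
import numpy as np

def sample_pairs(A, K):
    """Return (positives, negatives) index pairs under the SCGDN rule."""
    n = A.shape[0]
    positives, negatives = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j]:                          # graph neighbors
                positives.append((i, j))
            elif not K[i, j] and not K[j, i]:    # unrelated on the kNN graph too
                negatives.append((i, j))
    return positives, negatives
```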

Clustering based Point Cloud Representation Learning for 3D Analysis

  • paper_url: http://arxiv.org/abs/2307.14605
  • repo_url: https://github.com/fengzicai/cluster3dseg
  • paper_authors: Tuo Feng, Wenguan Wang, Xiaohan Wang, Yi Yang, Qinghua Zheng
  • for: To propose a clustering-based supervised learning scheme that automatically discovers subclass patterns within each class, improving the robustness of point cloud analysis to intra-class variations.
  • methods: Within-class clustering in the point embedding space mines latent yet representative patterns across scenes; the mined patterns are then used to repaint the embedding space so that it better respects the underlying distribution of the whole training set and is more robust to variations.
  • results: The algorithm yields notable gains across 3D architectures (voxel-based, point-based, and Transformer-based): 2.0-2.6% and 1.8-1.9% mIoU on SemanticKITTI and S3DIS, respectively, and 2.0-3.4% mAP gains on KITTI 3D detection.
    Abstract Point cloud analysis (such as 3D segmentation and detection) is a challenging task, because of not only the irregular geometries of many millions of unordered points, but also the great variations caused by depth, viewpoint, occlusion, etc. Current studies put much focus on the adaption of neural networks to the complex geometries of point clouds, but are blind to a fundamental question: how to learn an appropriate point embedding space that is aware of both discriminative semantics and challenging variations? As a response, we propose a clustering based supervised learning scheme for point cloud analysis. Unlike current de-facto, scene-wise training paradigm, our algorithm conducts within-class clustering on the point embedding space for automatically discovering subclass patterns which are latent yet representative across scenes. The mined patterns are, in turn, used to repaint the embedding space, so as to respect the underlying distribution of the entire training dataset and improve the robustness to the variations. Our algorithm is principled and readily pluggable to modern point cloud segmentation networks during training, without extra overhead during testing. With various 3D network architectures (i.e., voxel-based, point-based, Transformer-based, automatically searched), our algorithm shows notable improvements on famous point cloud segmentation datasets (i.e.,2.0-2.6% on single-scan and 2.0-2.2% multi-scan of SemanticKITTI, 1.8-1.9% on S3DIS, in terms of mIoU). Our algorithm also demonstrates utility in 3D detection, showing 2.0-3.4% mAP gains on KITTI.
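A rough sketch of the within-class clustering step, assuming k-means and a fixed number of subclasses per class (both illustrative choices): each semantic class's point embeddings are clustered to expose latent subclass patterns, which can then act as finer-grained targets when repainting the embedding space:

```python
import numpy as np
from sklearn.cluster import KMeans

def mine_subclasses(embeddings, labels, n_sub=4):
    """embeddings: (P, D) point features; labels: (P,) semantic class ids.
    Returns per-point subclass ids, with each class owning its own id range."""
    sub_labels = np.zeros_like(labels)
    for c in np.unique(labels):
        mask = labels == c
        k = min(n_sub, int(mask.sum()))           # guard small classes
        km = KMeans(n_clusters=k, n_init=10).fit(embeddings[mask])
        sub_labels[mask] = c * n_sub + km.labels_  # ids c*n_sub .. c*n_sub+k-1
    return sub_labels
```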

The detection and rectification for identity-switch based on unfalsified control

  • paper_url: http://arxiv.org/abs/2307.14591
  • repo_url: None
  • paper_authors: Junchao Huang, Xiaoqi He, Sheng Zhao
  • for: To address the ID-switch problem in multi-object tracking for video, using an approach based on unfalsified control.
  • methods: A dedicated detection-and-rectification module detects and recovers from ID switches, together with a simple and effective strategy for resolving ambiguous appearance matches during data association.
  • results: Experiments show the tracker is highly effective and robust against tracking errors caused by occlusions and rapid motion.
    Abstract The purpose of multi-object tracking (MOT) is to continuously track and identify objects detected in videos. Currently, most methods for multi-object tracking model the motion information and combine it with appearance information to determine and track objects. In this paper, unfalsified control is employed to address the ID-switch problem in multi-object tracking. We establish sequences of appearance information variations for the trajectories during the tracking process and design a detection and rectification module specifically for ID-switch detection and recovery. We also propose a simple and effective strategy to address the issue of ambiguous matching of appearance information during the data association process. Experimental results on publicly available MOT datasets demonstrate that the tracker exhibits excellent effectiveness and robustness in handling tracking errors caused by occlusions and rapid movements.

Explainable Techniques for Analyzing Flow Cytometry Cell Transformers

  • paper_url: http://arxiv.org/abs/2307.14581
  • repo_url: None
  • paper_authors: Florian Kowarsch, Lisa Weijler, FLorian Kleber, Matthias Wödlinger, Michael Reiter, Margarita Maurer-Granofszky, Michael Dworzak
  • for: To improve explainability for deep learning models in clinical applications, specifically for Flow CytoMetry (FCM) data.
  • methods: Proposes and evaluates two visualization techniques for cell classification and polygon regression on pediatric Acute Lymphoblastic Leukemia (ALL) FCM samples: gradient-based visualization and attention visualization, both tailored for FCM data and built on a transformer architecture called ReluFormer.
  • results: The visualizations outline the model's decision process: the gradient-based technique identifies the cells most significant for a particular prediction, while the attention visualization shows that different attention heads specialize by attending to different biologically meaningful sub-populations in the data.
    Abstract Explainability for Deep Learning Models is especially important for clinical applications, where decisions of automated systems have far-reaching consequences. While various post-hoc explainable methods, such as attention visualization and saliency maps, already exist for common data modalities, including natural language and images, little work has been done to adapt them to the modality of Flow CytoMetry (FCM) data. In this work, we evaluate the usage of a transformer architecture called ReluFormer that ease attention visualization as well as we propose a gradient- and an attention-based visualization technique tailored for FCM. We qualitatively evaluate the visualization techniques for cell classification and polygon regression on pediatric Acute Lymphoblastic Leukemia (ALL) FCM samples. The results outline the model's decision process and demonstrate how to utilize the proposed techniques to inspect the trained model. The gradient-based visualization not only identifies cells that are most significant for a particular prediction but also indicates the directions in the FCM feature space in which changes have the most impact on the prediction. The attention visualization provides insights on the transformer's decision process when handling FCM data. We show that different attention heads specialize by attending to different biologically meaningful sub-populations in the data, even though the model retrieved solely supervised binary classification signals during training.

A Memory-Augmented Multi-Task Collaborative Framework for Unsupervised Traffic Accident Detection in Driving Videos

  • paper_url: http://arxiv.org/abs/2307.14575
  • repo_url: None
  • paper_authors: Rongqin Liang, Yuanman Li, Yingxin Yi, Jiantao Zhou, Xia Li
  • for: To improve the safety of autonomous driving and driver assistance systems by detecting traffic accidents in driving videos.
  • methods: Proposes a Memory-Augmented Multi-Task Collaborative Framework (MAMTCF) that detects accidents more accurately by jointly modeling appearance changes and object motions in video frames, through the collaboration of optical flow reconstruction and future object localization tasks. A memory-augmented motion representation mechanism exploits the interrelation between different motion representations and uses high-level features of normal traffic patterns stored in memory to enlarge the difference from anomalies.
  • results: Experiments on a recently published large-scale dataset show better accident detection than previous state-of-the-art approaches.
    Abstract Identifying traffic accidents in driving videos is crucial to ensuring the safety of autonomous driving and driver assistance systems. To address the potential danger caused by the long-tailed distribution of driving events, existing traffic accident detection (TAD) methods mainly rely on unsupervised learning. However, TAD is still challenging due to the rapid movement of cameras and dynamic scenes in driving scenarios. Existing unsupervised TAD methods mainly rely on a single pretext task, i.e., an appearance-based or future object localization task, to detect accidents. However, appearance-based approaches are easily disturbed by the rapid movement of the camera and changes in illumination, which significantly reduce the performance of traffic accident detection. Methods based on future object localization may fail to capture appearance changes in video frames, making it difficult to detect ego-involved accidents (e.g., out of control of the ego-vehicle). In this paper, we propose a novel memory-augmented multi-task collaborative framework (MAMTCF) for unsupervised traffic accident detection in driving videos. Different from previous approaches, our method can more accurately detect both ego-involved and non-ego accidents by simultaneously modeling appearance changes and object motions in video frames through the collaboration of optical flow reconstruction and future object localization tasks. Further, we introduce a memory-augmented motion representation mechanism to fully explore the interrelation between different types of motion representations and exploit the high-level features of normal traffic patterns stored in memory to augment motion representations, thus enlarging the difference from anomalies. Experimental results on recently published large-scale dataset demonstrate that our method achieves better performance compared to previous state-of-the-art approaches.

Evaluation of Safety Constraints in Autonomous Navigation with Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.14568
  • repo_url: None
  • paper_authors: Brian Angulo, Gregory Gorbov, Aleksandr Panov, Konstantin Yakovlev
  • for: To highlight the importance of safety constraints in autonomous navigation systems by comparing two learnable navigation policies.
  • methods: Compares a "safe" policy, which takes safety constraints into account, against an "unsafe" policy that does not.
  • results: The safe policy generates trajectories with more clearance (distance to obstacles) and causes fewer collisions during training, without sacrificing overall performance.
    Abstract While reinforcement learning algorithms have had great success in the field of autonomous navigation, they cannot be straightforwardly applied to the real autonomous systems without considering the safety constraints. The later are crucial to avoid unsafe behaviors of the autonomous vehicle on the road. To highlight the importance of these constraints, in this study, we compare two learnable navigation policies: safe and unsafe. The safe policy takes the constraints into account, while the other does not. We show that the safe policy is able to generate trajectories with more clearance (distance to the obstacles) and makes less collisions while training without sacrificing the overall performance.

Understanding Forward Process of Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2307.15090
  • repo_url: https://github.com/himanshub1007/Alzhimers-Disease-Prediction-Using-Deep-learning
  • paper_authors: Peixin Tian
  • for: To describe the selective rotation that occurs in a CNN's forward processing.
  • methods: Treats the activation function as a discerning mechanism that unifies and quantizes the rotational aspects of the input data, and analyzes the statistical indicators of inputs with structured mathematical tools.
  • results: The defined methodology shows how the network distinguishes inputs based on statistical indicators, and reveals a consistency between artificial neural networks and the human brain in their data processing patterns.
    Abstract This paper reveals the selective rotation in CNNs' forward processing. It elucidates the activation function as a discerning mechanism that unifies and quantizes the rotational aspects of the input data. Experiments show how this defined methodology reflects the way the network distinguishes inputs based on statistical indicators, which can be comprehended or analyzed by applying structured mathematical tools. Our findings also unveil the consistency between artificial neural networks and the human brain in their data processing pattern.

Reinforcement learning guided fuzz testing for a browser’s HTML rendering engine

  • paper_url: http://arxiv.org/abs/2307.14556
  • repo_url: None
  • paper_authors: Martin Sablotny, Bjørn Sand Jensen, Jeremy Singer
  • for: To uncover bugs and security vulnerabilities through generation-based fuzz testing.
  • methods: Combines a trained deep learning test case generator with a double deep Q-network (DDQN) that guides test case creation based on a code coverage signal.
  • results: Improves code coverage by up to 18.5% on the Firefox HTML rendering engine compared to a baseline grammar-based fuzzer.
    Abstract Generation-based fuzz testing can uncover various bugs and security vulnerabilities. However, compared to mutation-based fuzz testing, it takes much longer to develop a well-balanced generator that produces good test cases and decides where to break the underlying structure to exercise new code paths. We propose a novel approach to combine a trained test case generator deep learning model with a double deep Q-network (DDQN) for the first time. The DDQN guides test case creation based on a code coverage signal. Our approach improves the code coverage performance of the underlying generator model by up to 18.5\% for the Firefox HTML rendering engine compared to the baseline grammar based fuzzer.
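A condensed sketch of the coverage-guided loop: the reward is the new coverage a test case unlocks, and the double-DQN target decouples action selection (online net) from evaluation (target net). The network shapes and the edge-set coverage representation are illustrative assumptions:

```python
import torch

def coverage_reward(edges_before, edges_after):
    """Reward = number of newly covered coverage edges (sets of edge ids)."""
    return float(len(edges_after - edges_before))

def ddqn_target(reward, next_state, online_net, target_net, gamma=0.99):
    """Double-DQN bootstrap target for a batch of transitions."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=-1, keepdim=True)
        next_q = target_net(next_state).gather(-1, best_action).squeeze(-1)
    return reward + gamma * next_q
```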

Adversarial Sleeping Bandit Problems with Multiple Plays: Algorithm and Ranking Application

  • paper_url: http://arxiv.org/abs/2307.14549
  • repo_url: None
  • paper_authors: Jianjun Yuan, Wei Lee Woon, Ludovik Coba
  • for: To solve the sleeping bandit with multiple plays problem arising in online recommendation systems.
  • methods: Proposes an efficient algorithm that extends the sleeping bandit algorithm for single-arm selection, with a theoretical guarantee of regret upper bounded by $\mathcal{O}(kN^2\sqrt{T\log T})$.
  • results: The algorithm achieves efficient arm selection in the sleeping bandit setting, handling bounded adversarial losses and unknown i.i.d. arm availability without degrading in extreme cases.
    Abstract This paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system. The problem involves bounded, adversarial loss and unknown i.i.d. distributions for arm availability. The proposed algorithm extends the sleeping bandit algorithm for single arm selection and is guaranteed to achieve theoretical performance with regret upper bounded by $\mathcal{O}(kN^2\sqrt{T\log T})$, where $k$ is the number of arms selected per time step, $N$ is the total number of arms, and $T$ is the time horizon.
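An EXP3-style illustration of the sleeping setting with multiple plays, as a generic stand-in rather than the paper's exact algorithm: each round only the currently available arms compete, k of them are drawn, and only observed losses update the weights via importance weighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(weights, available, k, eta, loss_fn):
    """One round: only the available (awake) arms compete; k are played."""
    idx = np.flatnonzero(available)                  # sleeping arms excluded
    p = weights[idx] / weights[idx].sum()
    chosen = rng.choice(idx, size=min(k, len(idx)), replace=False, p=p)
    for a in chosen:
        loss = loss_fn(a)                            # adversarial loss in [0, 1]
        p_a = p[np.searchsorted(idx, a)]
        weights[a] *= np.exp(-eta * loss / p_a)      # importance-weighted update
    return chosen
```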

Speed Reading Tool Powered by Artificial Intelligence for Students with ADHD, Dyslexia, or Short Attention Span

  • paper_url: http://arxiv.org/abs/2307.14544
  • repo_url: None
  • paper_authors: Megat Irfan Zackry Bin Ismail Ahmad Nazran bin Yusri Muhammad Hafizzul Bin Abdul Manap Muhammad Muizzuddin Bin Kamarozaman
  • for: To help students with dyslexia, ADHD, or a short attention span digest text-based information more efficiently.
  • methods: Uses the Multilayer Perceptron (MLP) algorithm for complex text processing and summarization, fine-tunes Hugging Face's T5 (Text-to-Text Transfer Transformer) model on specific tasks with a smaller dataset, and splits text into a list of sentences with NLTK's Punkt Sentence Tokenizer.
  • results: Applying principles from Bionic Reading (a bolding function plus adjustments to line, word, and character spacing) improves readability and reading speed.
    Abstract This paper presents a novel approach to assist students with dyslexia, ADHD, and short attention span in digesting any text-based information more efficiently. The proposed solution utilizes the Multilayer Perceptron (MLP) algorithm for complex text processing and summarization tasks. The tool leverages the T5 (Text-to-Text Transfer Transformer) model from Hugging Face, which treats every NLP task as a text generation task. The model is fine-tuned on specific tasks using a smaller dataset. The NLTK's Punkt Sentence Tokenizer is used to divide a text into a list of sentences. The application is served using Flask, a lightweight web server and framework. The tool also applies principles from Bionic Reading to enhance readability, which includes a bolding function and adjustments to line, word, and character spacing. The paper discusses the methodology, implementation, and results of the AI-based speed reading tool.
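A small sketch of the described pipeline: Punkt sentence splitting, T5 summarization through Hugging Face, and a Bionic-Reading-style bolding pass. The model checkpoint and the "bold the first half of each word" rule are assumptions for illustration:

```python
import nltk
from transformers import pipeline

nltk.download("punkt", quiet=True)
summarizer = pipeline("summarization", model="t5-small")

def bionic(text):
    """Bold roughly the first half of every word (Markdown-style **bold**)."""
    out = []
    for w in text.split():
        cut = max(1, len(w) // 2)
        out.append(f"**{w[:cut]}**{w[cut:]}")
    return " ".join(out)

def speed_read(text):
    sentences = nltk.sent_tokenize(text)             # Punkt tokenizer
    summary = summarizer(" ".join(sentences),
                         max_length=60, min_length=10)[0]["summary_text"]
    return bionic(summary)
```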

Open Problems in Computer Vision for Wilderness SAR and The Search for Patricia Wu-Murad

  • paper_url: http://arxiv.org/abs/2307.14527
  • repo_url: https://github.com/crasar/wisar
  • paper_authors: Thomas Manzini, Robin Murphy
  • for: To apply two computer vision systems, a supervised EfficientDET model and the unsupervised RX spectral classifier, to 98.9 GB of drone imagery from the Wu-Murad wilderness search and rescue (WSAR) effort in Japan.
  • methods: Of at least 19 proposed approaches and 3 datasets for locating missing persons in drone imagery, only 3 approaches (2 unsupervised and 1 of unknown structure) are reported in the literature as having been used in an actual WSAR operation.
  • results: The EfficientDET architecture and the unsupervised spectral RX classifier were selected as most appropriate for this setting. Although EfficientDET achieved performance statistically equivalent to the state of the art on the HERIDAL dataset, it failed to translate to the real world, producing false positives (e.g., identifying tree limbs and rocks as people) and false negatives (e.g., failing to identify members of the search team). This gap motivates three directions for future research: more realistic wilderness SAR datasets, computer vision models that can handle the varied imagery collected during real WSAR operations, and better alignment on performance measures.
    Abstract This paper details the challenges in applying two computer vision systems, an EfficientDET supervised learning model and the unsupervised RX spectral classifier, to 98.9 GB of drone imagery from the Wu-Murad wilderness search and rescue (WSAR) effort in Japan and identifies 3 directions for future research. There have been at least 19 proposed approaches and 3 datasets aimed at locating missing persons in drone imagery, but only 3 approaches (2 unsupervised and 1 of an unknown structure) are referenced in the literature as having been used in an actual WSAR operation. Of these proposed approaches, the EfficientDET architecture and the unsupervised spectral RX classifier were selected as the most appropriate for this setting. The EfficientDET model was applied to the HERIDAL dataset and despite achieving performance that is statistically equivalent to the state-of-the-art, the model fails to translate to the real world in terms of false positives (e.g., identifying tree limbs and rocks as people), and false negatives (e.g., failing to identify members of the search team). The poor results in practice for algorithms that showed good results on datasets suggest 3 areas of future research: more realistic datasets for wilderness SAR, computer vision models that are capable of seamlessly handling the variety of imagery that can be collected during actual WSAR operations, and better alignment on performance measures.

Patterns of Vehicle Lights: Addressing Complexities in Curation and Annotation of Camera-Based Vehicle Light Datasets and Metrics

  • paper_url: http://arxiv.org/abs/2307.14521
  • repo_url: None
  • paper_authors: Ross Greer, Akshay Gopalkrishnan, Maitrayee Keskar, Mohan Trivedi
  • for: To examine how vehicle lights are represented in computer vision and what that implies for various tasks in autonomous driving.
  • methods: Compares different representations of vehicle lights, including bounding boxes, center points, corner points, and segmentation masks, discussing their strengths and weaknesses.
  • results: Accurate vehicle light detection benefits nighttime vehicle detection, 3D vehicle orientation estimation, and dynamic trajectory cues, with each task potentially requiring a different light representation. The paper introduces the LISA Vehicle Lights Dataset and an associated Light Visibility Model, providing light annotations designed for downstream vehicle detection, intent and trajectory prediction, and safe path planning.
    Abstract This paper explores the representation of vehicle lights in computer vision and its implications for various tasks in the field of autonomous driving. Different specifications for representing vehicle lights, including bounding boxes, center points, corner points, and segmentation masks, are discussed in terms of their strengths and weaknesses. Three important tasks in autonomous driving that can benefit from vehicle light detection are identified: nighttime vehicle detection, 3D vehicle orientation estimation, and dynamic trajectory cues. Each task may require a different representation of the light. The challenges of collecting and annotating large datasets for training data-driven models are also addressed, leading to introduction of the LISA Vehicle Lights Dataset and associated Light Visibility Model, which provides light annotations specifically designed for downstream applications in vehicle detection, intent and trajectory prediction, and safe path planning. A comparison of existing vehicle light datasets is provided, highlighting the unique features and limitations of each dataset. Overall, this paper provides insights into the representation of vehicle lights and the importance of accurate annotations for training effective detection models in autonomous driving applications. Our dataset and model are made available at https://cvrr.ucsd.edu/vehicle-lights-dataset

A new algorithm for Subgroup Set Discovery based on Information Gain

  • paper_url: http://arxiv.org/abs/2307.15089
  • repo_url: None
  • paper_authors: Daniel Gómez-Bravo, Aaron García, Guillermo Vigueras, Belén Ríos, Alejandro Rodríguez-González
  • for: To propose a new pattern discovery algorithm, Information Gained Subgroup Discovery (IGSD), that addresses limitations of existing subgroup discovery (SD) algorithms.
  • methods: IGSD combines Information Gain (IG) and Odds Ratio (OR) as a multi-criteria measure for pattern selection.
  • results: Across eleven datasets, IGSD provides more reliable patterns and smaller pattern sets than the state-of-the-art FSSD and SSD++ algorithms, with higher OR values indicating stronger dependence between patterns and targets. Patterns found by IGSD also agree better with the assessments of domain experts.
    Abstract Pattern discovery is a machine learning technique that aims to find sets of items, subsequences, or substructures that are present in a dataset with a higher frequency value than a manually set threshold. This process helps to identify recurring patterns or relationships within the data, allowing for valuable insights and knowledge extraction. In this work, we propose Information Gained Subgroup Discovery (IGSD), a new SD algorithm for pattern discovery that combines Information Gain (IG) and Odds Ratio (OR) as a multi-criteria for pattern selection. The algorithm tries to tackle some limitations of state-of-the-art SD algorithms like the need for fine-tuning of key parameters for each dataset, usage of a single pattern search criteria set by hand, usage of non-overlapping data structures for subgroup space exploration, and the impossibility to search for patterns by fixing some relevant dataset variables. Thus, we compare the performance of IGSD with two state-of-the-art SD algorithms: FSSD and SSD++. Eleven datasets are assessed using these algorithms. For the performance evaluation, we also propose to complement standard SD measures with IG, OR, and p-value. Obtained results show that FSSD and SSD++ algorithms provide less reliable patterns and reduced sets of patterns than IGSD algorithm for all datasets considered. Additionally, IGSD provides better OR values than FSSD and SSD++, stating a higher dependence between patterns and targets. Moreover, patterns obtained for one of the datasets used, have been validated by a group of domain experts. Thus, patterns provided by IGSD show better agreement with experts than patterns obtained by FSSD and SSD++ algorithms. These results demonstrate the suitability of the IGSD as a method for pattern discovery and suggest that the inclusion of non-standard SD metrics allows to better evaluate discovered patterns.
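A compact sketch of the two selection criteria, computed for one candidate pattern over a binary target; `covered` is a boolean mask of the rows the pattern matches (assumed non-trivial, i.e., it covers some rows and misses others), and the +0.5 Haldane correction in the odds ratio is an illustrative choice to avoid division by zero:

```python
import numpy as np

def information_gain(y, covered):
    """IG of splitting a binary target y by whether the pattern covers a row."""
    def entropy(v):
        p = np.bincount(v, minlength=2) / len(v)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    n = len(y)
    split = (covered.sum() / n) * entropy(y[covered]) + \
            ((~covered).sum() / n) * entropy(y[~covered])
    return entropy(y) - split

def odds_ratio(y, covered):
    """OR of the 2x2 table (pattern coverage vs. positive target)."""
    a = (covered & (y == 1)).sum() + 0.5
    b = (covered & (y == 0)).sum() + 0.5
    c = (~covered & (y == 1)).sum() + 0.5
    d = (~covered & (y == 0)).sum() + 0.5
    return (a * d) / (b * c)
```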

The Co-12 Recipe for Evaluating Interpretable Part-Prototype Image Classifiers

  • paper_url: http://arxiv.org/abs/2307.14517
  • repo_url: None
  • paper_authors: Meike Nauta, Christin Seifert
  • for: To provide an overview of how to evaluate the explanation quality of interpretable part-prototype models.
  • methods: Uses the Co-12 properties for explanation quality (e.g., correctness, completeness, compactness) to review existing work that evaluates part-prototype models.
  • results: The review reveals research gaps, outlines future approaches for evaluating explanation quality, and contributes a "Co-12 cheat sheet" that concisely summarizes the findings.
    Abstract Interpretable part-prototype models are computer vision models that are explainable by design. The models learn prototypical parts and recognise these components in an image, thereby combining classification and explanation. Despite the recent attention for intrinsically interpretable models, there is no comprehensive overview on evaluating the explanation quality of interpretable part-prototype models. Based on the Co-12 properties for explanation quality as introduced in arXiv:2201.08164 (e.g., correctness, completeness, compactness), we review existing work that evaluates part-prototype models, reveal research gaps and outline future approaches for evaluation of the explanation quality of part-prototype models. This paper, therefore, contributes to the progression and maturity of this relatively new research field on interpretable part-prototype models. We additionally provide a ``Co-12 cheat sheet'' that acts as a concise summary of our findings on evaluating part-prototype models.

Words That Stick: Predicting Decision Making and Synonym Engagement Using Cognitive Biases and Computational Linguistics

  • paper_url: http://arxiv.org/abs/2307.14511
  • repo_url: None
  • paper_authors: Nimrod Dvir, Elaine Friedman, Suraj Commuri, Fan Yang, Jennifer Romano
  • for: To anticipate user engagement and decision-making on digital platforms by combining cognitive psychology and information systems research.
  • methods: Employs natural language processing (NLP) techniques and insights from cognitive bias research to analyze how users interact with synonyms in digital content, synthesizing four cognitive biases (Representativeness, Ease-of-use, Affect, and Distribution) into the READ model.
  • results: A comprehensive user survey shows that synonyms that accurately represent core ideas, are easy to understand, elicit emotional responses, and are commonly encountered promote greater user engagement. The results offer a fresh perspective on human-computer interaction, digital behaviors, and decision-making, and highlight the significance of cognitive biases for designing effective digital content in fields like education and marketing.
    Abstract This research draws upon cognitive psychology and information systems studies to anticipate user engagement and decision-making on digital platforms. By employing natural language processing (NLP) techniques and insights from cognitive bias research, we delve into user interactions with synonyms within digital content. Our methodology synthesizes four cognitive biases (Representativeness, Ease-of-use, Affect, and Distribution) into the READ model. Through a comprehensive user survey, we assess the model's ability to predict user engagement, discovering that synonyms that accurately represent core ideas, are easy to understand, elicit emotional responses, and are commonly encountered, promote greater user engagement. Crucially, our work offers a fresh lens on human-computer interaction, digital behaviors, and decision-making processes. Our results highlight the promise of cognitive biases as potent indicators of user engagement, underscoring their significance in designing effective digital content across fields like education and marketing.

Attention for Robot Touch: Tactile Saliency Prediction for Robust Sim-to-Real Tactile Control

  • paper_url: http://arxiv.org/abs/2307.14510
  • repo_url: None
  • paper_authors: Yijiong Lin, Mauro Comi, Alex Church, Dandan Zhang, Nathan F. Lepora
  • for: To improve the robustness of tactile robot control in unstructured environments.
  • methods: Proposes a new concept, tactile saliency, inspired by the human touch attention mechanism in neuroscience and visual saliency prediction in computer vision; it aims to identify key information in tactile images. Because manually labelling tactile images is difficult due to their counterintuitive patterns, the approach comprises three interrelated networks: 1) a Contact Depth Network (ConDepNet) that generates a contact depth map to localize deformation in a real tactile image containing target and noise features; 2) a Tactile Saliency Network (TacSalNet) that predicts a tactile saliency map describing the target areas of an input contact depth map; and 3) a Tactile Noise Generator (TacNGen) that generates noise features to train the TacSalNet.
  • results: Experiments on contact pose estimation and edge-following in the presence of distractors show accurate prediction of target features from real tactile images, yielding robust sim-to-real tactile control under unknown distractors. Project page: https://sites.google.com/view/tactile-saliency/.
    Abstract High-resolution tactile sensing can provide accurate information about local contact in contact-rich robotic tasks. However, the deployment of such tasks in unstructured environments remains under-investigated. To improve the robustness of tactile robot control in unstructured environments, we propose and study a new concept: \textit{tactile saliency} for robot touch, inspired by the human touch attention mechanism from neuroscience and the visual saliency prediction problem from computer vision. In analogy to visual saliency, this concept involves identifying key information in tactile images captured by a tactile sensor. While visual saliency datasets are commonly annotated by humans, manually labelling tactile images is challenging due to their counterintuitive patterns. To address this challenge, we propose a novel approach comprised of three interrelated networks: 1) a Contact Depth Network (ConDepNet), which generates a contact depth map to localize deformation in a real tactile image that contains target and noise features; 2) a Tactile Saliency Network (TacSalNet), which predicts a tactile saliency map to describe the target areas for an input contact depth map; 3) and a Tactile Noise Generator (TacNGen), which generates noise features to train the TacSalNet. Experimental results in contact pose estimation and edge-following in the presence of distractors showcase the accurate prediction of target features from real tactile images. Overall, our tactile saliency prediction approach gives robust sim-to-real tactile control in environments with unknown distractors. Project page: https://sites.google.com/view/tactile-saliency/.

Improving Reliable Navigation under Uncertainty via Predictions Informed by Non-Local Information

  • paper_url: http://arxiv.org/abs/2307.14501
  • repo_url: None
  • paper_authors: Raihan Islam Arnob, Gregory J. Stein
  • for: To improve reliable, long-horizon, goal-directed navigation in partially-mapped environments.
  • methods: Uses non-locally available information, learned with a graph neural network, to predict the goodness of temporally-extended actions that enter unseen space; the planner is reliable by design and always reaches its goal even when the learned predictions are inaccurate.
  • results: In three simulated environments where non-local information is needed, including a large-scale university building generated from real-world floorplans, the approach reduces cost-to-go by 9.3% versus a non-learned baseline and by 14.9% versus a learning-informed planner limited to local information.
    Abstract We improve reliable, long-horizon, goal-directed navigation in partially-mapped environments by using non-locally available information to predict the goodness of temporally-extended actions that enter unseen space. Making predictions about where to navigate in general requires non-local information: any observations the robot has seen so far may provide information about the goodness of a particular direction of travel. Building on recent work in learning-augmented model-based planning under uncertainty, we present an approach that can both rely on non-local information to make predictions (via a graph neural network) and is reliable by design: it will always reach its goal, even when learning does not provide accurate predictions. We conduct experiments in three simulated environments in which non-local information is needed to perform well. In our large scale university building environment, generated from real-world floorplans to the scale, we demonstrate a 9.3\% reduction in cost-to-go compared to a non-learned baseline and a 14.9\% reduction compared to a learning-informed planner that can only use local information to inform its predictions.

Technical note: ShinyAnimalCV: open-source cloud-based web application for object detection, segmentation, and three-dimensional visualization of animals using computer vision

  • paper_url: http://arxiv.org/abs/2307.14487
  • repo_url: https://github.com/uf-aiaos/shinyanimalcv
  • paper_authors: Jin Wang, Yu Hu, Lirong Xiang, Gota Morota, Samantha A. Brooks, Carissa L. Wickens, Emily K. Miller-Cushon, Haipeng Yu
  • for: To develop an open-source, cloud-based web application with a user-friendly interface for computer vision tasks on animal data, including object segmentation, detection, three-dimensional surface visualization, and extraction of two- and three-dimensional morphological features.
  • methods: Integrates a range of computer vision and deep learning algorithms, including nine pre-trained CV models built on top-view animal data.
  • results: The resulting application, ShinyAnimalCV, is deployed online on cloud computing platforms, with source code, detailed documentation, and example data on GitHub so that users can train CV models on custom data and deploy the application locally.
    Abstract Computer vision (CV), a non-intrusive and cost-effective technology, has furthered the development of precision livestock farming by enabling optimized decision-making through timely and individualized animal care. The availability of affordable two- and three-dimensional camera sensors, combined with various machine learning and deep learning algorithms, has provided a valuable opportunity to improve livestock production systems. However, despite the availability of various CV tools in the public domain, applying these tools to animal data can be challenging, often requiring users to have programming and data analysis skills, as well as access to computing resources. Moreover, the rapid expansion of precision livestock farming is creating a growing need to educate and train animal science students in CV. This presents educators with the challenge of efficiently demonstrating the complex algorithms involved in CV. Thus, the objective of this study was to develop ShinyAnimalCV, an open-source cloud-based web application. This application provides a user-friendly interface for performing CV tasks, including object segmentation, detection, three-dimensional surface visualization, and extraction of two- and three-dimensional morphological features. Nine pre-trained CV models using top-view animal data are included in the application. ShinyAnimalCV has been deployed online using cloud computing platforms. The source code of ShinyAnimalCV is available on GitHub, along with detailed documentation on training CV models using custom data and deploying ShinyAnimalCV locally to allow users to fully leverage the capabilities of the application. ShinyAnimalCV can contribute to CV research and teaching in the animal science community.

Single Channel Speech Enhancement Using U-Net Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2307.14464
  • repo_url: None
  • paper_authors: Abir Riahi, Éric Plourde
  • for: To improve the reliability of communication devices and robust speech recognition systems through energy-efficient speech enhancement.
  • methods: Uses a spiking neural network (SNN) based on a U-Net architecture; SNNs suit data with a temporal dimension, such as speech, and admit energy-efficient implementations on neuromorphic hardware. The deep SNN is trained with surrogate-gradient-based optimization.
  • results: The energy-efficient SNN model outperforms the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge) baseline solution and achieves acceptable performance compared to an equivalent artificial neural network model.
    Abstract Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems. Although conventional artificial neural networks (ANN) have demonstrated remarkable performance in SE, they require significant computational power, along with high energy costs. In this paper, we propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture. SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware. As such, SNNs are thus interesting candidates for real-time applications on devices with limited resources. The primary objective of the current work is to develop an SNN-based model with comparable performance to a state-of-the-art ANN model for SE. We train a deep SNN using surrogate-gradient-based optimization and evaluate its performance using perceptual objective tests under different signal-to-noise ratios and real-world noise conditions. Our results demonstrate that the proposed energy-efficient SNN model outperforms the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge) baseline solution and achieves acceptable performance compared to an equivalent ANN model.
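A minimal surrogate-gradient spiking unit in PyTorch, the training trick named above: the forward pass applies a hard threshold, while the backward pass substitutes a smooth derivative so gradients can flow through the network. The sigmoid surrogate and its slope are common illustrative choices, not necessarily the paper's:

```python
import torch

class SpikeFn(torch.autograd.Function):
    THRESHOLD, SLOPE = 1.0, 10.0

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= SpikeFn.THRESHOLD).float()    # binary spike

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Smooth stand-in for the non-differentiable threshold:
        s = torch.sigmoid(SpikeFn.SLOPE * (v - SpikeFn.THRESHOLD))
        return grad_out * SpikeFn.SLOPE * s * (1 - s)

spike = SpikeFn.apply
v = torch.randn(4, requires_grad=True)
spike(v).sum().backward()   # gradients exist despite the hard threshold
```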

VISPUR: Visual Aids for Identifying and Interpreting Spurious Associations in Data-Driven Decisions

  • paper_url: http://arxiv.org/abs/2307.14448
  • repo_url: https://github.com/picsolab/vispur
  • paper_authors: Xian Teng, Yongsu Ahn, Yu-Ru Lin
  • for: To help users identify and understand spurious associations in big data and machine learning, and make accountable causal decisions.
  • methods: Proposes a "de-paradox" workflow and a visual analytic system comprising a CONFOUNDER DASHBOARD, a SUBGROUP VIEWER, a REASONING STORYBOARD, and a DECISION DIAGNOSIS panel for tackling spurious associations.
  • results: Qualitative and quantitative results from an expert interview and a controlled user experiment demonstrate that the system effectively helps users identify and understand spurious associations and make accountable causal decisions.
    Abstract Big data and machine learning tools have jointly empowered humans in making data-driven decisions. However, many of them capture empirical associations that might be spurious due to confounding factors and subgroup heterogeneity. The famous Simpson's paradox is such a phenomenon where aggregated and subgroup-level associations contradict with each other, causing cognitive confusions and difficulty in making adequate interpretations and decisions. Existing tools provide little insights for humans to locate, reason about, and prevent pitfalls of spurious association in practice. We propose VISPUR, a visual analytic system that provides a causal analysis framework and a human-centric workflow for tackling spurious associations. These include a CONFOUNDER DASHBOARD, which can automatically identify possible confounding factors, and a SUBGROUP VIEWER, which allows for the visualization and comparison of diverse subgroup patterns that likely or potentially result in a misinterpretation of causality. Additionally, we propose a REASONING STORYBOARD, which uses a flow-based approach to illustrate paradoxical phenomena, as well as an interactive DECISION DIAGNOSIS panel that helps ensure accountable decision-making. Through an expert interview and a controlled user experiment, our qualitative and quantitative results demonstrate that the proposed "de-paradox" workflow and the designed visual analytic system are effective in helping human users to identify and understand spurious associations, as well as to make accountable causal decisions.

Three Bricks to Consolidate Watermarks for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.00113
  • repo_url: https://github.com/facebookresearch/three_bricks
  • paper_authors: Pierre Fernandez, Antoine Chaffin, Karim Tit, Vivien Chappelier, Teddy Furon
  • for: This work aims to provide a stronger theoretical foundation and practical grounding for watermarking techniques for large language models.
  • methods: The study rests on three theoretical and practical considerations, including new statistical tests, comparisons on classical NLP benchmarks, and multi-bit watermarking techniques.
  • results: Using the new statistical tests and practical evaluations, the authors show that watermarking makes it possible to reliably determine whether a generated text originates from a specific language model.
    Abstract The task of discerning between generated and natural texts is increasingly challenging. In this context, watermarking emerges as a promising technique for ascribing generated text to a specific model. It alters the sampling generation process so as to leave an invisible trace in the generated output, facilitating later detection. This research consolidates watermarks for large language models based on three theoretical and empirical considerations. First, we introduce new statistical tests that offer robust theoretical guarantees which remain valid even at low false-positive rates (less than $10^{-6}$). Second, we compare the effectiveness of watermarks using classical benchmarks in the field of natural language processing, gaining insights into their real-world applicability. Third, we develop advanced detection schemes for scenarios where access to the LLM is available, as well as multi-bit watermarking.
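To see why detection can remain meaningful at very low false-positive rates, here is a minimal sketch of the kind of statistical test used by greenlist-style watermarks: count how many generated tokens fall in a secret greenlist and compute an exact binomial tail probability in log space. This is a generic sketch, not the authors' specific test; the greenlist fraction gamma=0.5 is an illustrative assumption.

```python
from math import exp, lgamma, log

def log_binom_pvalue(n_green, n_total, gamma=0.5):
    """log P(X >= n_green) for X ~ Binomial(n_total, gamma), computed in log
    space so the test remains meaningful at false-positive rates below 1e-6.
    gamma is the greenlist fraction (0.5 is an illustrative default)."""
    def log_pmf(k):
        return (lgamma(n_total + 1) - lgamma(k + 1) - lgamma(n_total - k + 1)
                + k * log(gamma) + (n_total - k) * log(1 - gamma))
    logs = [log_pmf(k) for k in range(n_green, n_total + 1)]
    m = max(logs)                        # log-sum-exp over the upper tail
    return m + log(sum(exp(v - m) for v in logs))

# Example: 180 greenlist tokens out of 256 yields a very small log p-value,
# i.e., strong evidence the text carries the watermark.
print(log_binom_pvalue(180, 256))
```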

WavJourney: Compositional Audio Creation with Large Language Models

  • paper_url: http://arxiv.org/abs/2307.14335
  • repo_url: https://github.com/audio-agi/wavjourney
  • paper_authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
  • for: This paper explores how large language models (LLMs) can be used to create intelligent audio content.
  • methods: The paper presents WavJourney, a system that uses LLMs to connect various audio models and generate auditory content comprising speech, music, and sound effects.
  • results: By applying WavJourney to a range of real-world scenarios, the paper demonstrates the system's practicality and creative potential.
    Abstract Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
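To make the script-compiler idea concrete, here is a toy sketch: a structured audio script of the kind an LLM might emit is compiled by recursively dispatching each node to a task-specific generator or a computational operation such as concatenate or mix. The generator functions below are hypothetical stand-ins that return noise; a real system would call actual TTS, music, and sound-effect models.

```python
import numpy as np

SR = 16000  # sample rate, an illustrative assumption

# Hypothetical stand-ins for task-specific audio models (TTS, music, sfx).
def gen_speech(text, dur=2.0): return 0.1 * np.random.randn(int(SR * dur))
def gen_music(desc, dur=4.0):  return 0.05 * np.random.randn(int(SR * dur))
def gen_sfx(desc, dur=1.0):    return 0.2 * np.random.randn(int(SR * dur))

def mix(a, b):
    out = np.zeros(max(len(a), len(b)))
    out[:len(a)] += a
    out[:len(b)] += b
    return out

# A structured "audio script" of the kind the LLM might emit.
script = ("mix",
          ("music", "calm sci-fi pad"),
          ("concat",
           ("speech", "Captain's log, stardate 4523."),
           ("sfx", "door hiss")))

def compile_node(node):
    """Recursively turn a script node into a waveform."""
    op, *args = node
    if op == "speech": return gen_speech(args[0])
    if op == "music":  return gen_music(args[0])
    if op == "sfx":    return gen_sfx(args[0])
    if op == "concat": return np.concatenate([compile_node(a) for a in args])
    if op == "mix":    return mix(*[compile_node(a) for a in args])
    raise ValueError(f"unknown op: {op}")

audio = compile_node(script)  # 4 s of mixed audio at SR samples/s
```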

Event-based Vision for Early Prediction of Manipulation Actions

  • paper_url: http://arxiv.org/abs/2307.14332
  • repo_url: https://github.com/danideniz/davishanddataset-events
  • paper_authors: Daniel Deniz, Cornelia Fermuller, Eduardo Ros, Manuel Rodriguez-Alvarez, Francisco Barranco
  • for: The goal of this work is to predict manipulation actions early using event-based transformer networks.
  • methods: The study uses an event-based transformer network that predicts manipulation actions through online inference.
  • results: The experiments show that the transformer network predicts manipulation actions accurately, capturing action dynamics better than video-based approaches. The code will be released on GitHub.
    Abstract Neuromorphic visual sensors are artificial retinas that output sequences of asynchronous events when brightness changes occur in the scene. These sensors offer many advantages including very high temporal resolution, no motion blur and smart data compression ideal for real-time processing. In this study, we introduce an event-based dataset on fine-grained manipulation actions and perform an experimental study on the use of transformers for action prediction with events. There is enormous interest in the fields of cognitive robotics and human-robot interaction on understanding and predicting human actions as early as possible. Early prediction allows anticipating complex stages for planning, enabling effective and real-time interaction. Our Transformer network uses events to predict manipulation actions as they occur, using online inference. The model succeeds at predicting actions early on, building up confidence over time and achieving state-of-the-art classification. Moreover, the attention-based transformer architecture allows us to study the role of the spatio-temporal patterns selected by the model. Our experiments show that the Transformer network captures action dynamic features outperforming video-based approaches and succeeding with scenarios where the differences between actions lie in very subtle cues. Finally, we release the new event dataset, which is the first in the literature for manipulation action recognition. Code will be available at https://github.com/DaniDeniz/EventVisionTransformer.

Utilizing Large Language Models for Natural Interface to Pharmacology Databases

  • paper_url: http://arxiv.org/abs/2307.15717
  • repo_url: None
  • paper_authors: Hong Lu, Chuan Li, Yinheng Li, Jie Zhao
  • for: The goal of this paper is to develop a natural-language interface for querying databases, so that pharmacologists can access and retrieve large amounts of data during drug development.
  • methods: The paper uses a large language model (LLM) to query structured data stored in databases, and validates the feasibility and effectiveness of the framework through experiments.
  • results: The experimental results show that the framework generalizes to a wide range of pharmaceutical data and knowledge bases and supports efficient data querying and retrieval.
    Abstract The drug development process necessitates that pharmacologists undertake various tasks, such as reviewing literature, formulating hypotheses, designing experiments, and interpreting results. Each stage requires accessing and querying vast amounts of information. In this abstract, we introduce a Large Language Model (LLM)-based Natural Language Interface designed to interact with structured information stored in databases. Our experiments demonstrate the feasibility and effectiveness of the proposed framework. This framework can generalize to query a wide range of pharmaceutical data and knowledge bases.

Building and Testing a General Intelligence Embodied in a Humanoid Robot

  • paper_url: http://arxiv.org/abs/2307.16770
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Suzanne Gildert, Geordie Rose
  • for: The goal of this work is to build a machine with human-level intelligence that can perform most economically valuable work.
  • methods: The approach comprises a physical humanoid robotic system, a software-based control system, a performance metric called "g+", and an evolutionary algorithm for incrementally increasing scores on this metric.
  • results: The authors describe the current status of each of these components and report current and historical measurements of the g+ metric.
    Abstract Machines with human-level intelligence should be able to do most economically valuable work. This aligns a major economic incentive with the scientific grand challenge of building a human-like mind. Here we describe our approach to building and testing such a system. Our approach comprises a physical humanoid robotic system; a software based control system for robots of this type; a performance metric, which we call g+, designed to be a measure of human-like intelligence in humanoid robots; and an evolutionary algorithm for incrementally increasing scores on this performance metric. We introduce and describe the current status of each of these. We report on current and historical measurements of the g+ metric on the systems described here.

Waypoint-Based Imitation Learning for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2307.14326
  • repo_url: https://github.com/lucys0/awe
  • paper_authors: Lucy Xiaoyang Shi, Archit Sharma, Tony Z. Zhao, Chelsea Finn
  • for: This paper proposes a method for automatically generating waypoints to improve the accuracy and efficiency of imitation learning.
  • methods: The method exploits the accuracy of linear motion, decomposing each demonstration into a minimal set of waypoints.
  • results: Experiments show that the method increases the success rate of state-of-the-art algorithms by up to 25% in simulation and by 4-28% on real-world bimanual manipulation tasks, while reducing the decision-making horizon.
    Abstract While imitation learning methods have seen a resurgent interest for robotic manipulation, the well-known problem of compounding errors continues to afflict behavioral cloning (BC). Waypoints can help address this problem by reducing the horizon of the learning problem for BC, and thus, the errors compounded over time. However, waypoint labeling is underspecified, and requires additional human supervision. Can we generate waypoints automatically without any additional human supervision? Our key insight is that if a trajectory segment can be approximated by linear motion, the endpoints can be used as waypoints. We propose Automatic Waypoint Extraction (AWE) for imitation learning, a preprocessing module to decompose a demonstration into a minimal set of waypoints which when interpolated linearly can approximate the trajectory up to a specified error threshold. AWE can be combined with any BC algorithm, and we find that AWE can increase the success rate of state-of-the-art algorithms by up to 25% in simulation and by 4-28% on real-world bimanual manipulation tasks, reducing the decision making horizon by up to a factor of 10. Videos and code are available at https://lucys0.github.io/awe/
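A minimal sketch of the waypoint-extraction idea follows: recursively split a trajectory at its point of maximum deviation until linear interpolation between the selected waypoints stays within an error threshold, in the spirit of the Douglas-Peucker algorithm. The paper's exact selection procedure may differ; the threshold and toy trajectory below are assumptions.

```python
import numpy as np

def awe(traj, eps):
    """Pick waypoints so that linear interpolation between them stays within
    eps of the demonstration: recursively split at the point of maximum
    deviation (Douglas-Peucker style). Returns waypoint indices."""
    def max_dev(lo, hi):
        if hi - lo < 2:
            return 0.0, lo
        t = np.linspace(0.0, 1.0, hi - lo + 1)[1:-1, None]
        interp = (1 - t) * traj[lo] + t * traj[hi]       # linear segment
        d = np.linalg.norm(traj[lo + 1:hi] - interp, axis=1)
        k = int(np.argmax(d))
        return float(d[k]), lo + 1 + k

    def split(lo, hi):
        dev, k = max_dev(lo, hi)
        if dev <= eps:
            return [lo, hi]
        left, right = split(lo, k), split(k, hi)
        return left[:-1] + right                         # drop duplicate k

    return split(0, len(traj) - 1)

# Toy demo: a noisy two-segment trajectory collapses to about 3 waypoints.
xs = np.concatenate([np.linspace(0, 1, 50), np.linspace(1, 0, 50)])
traj = np.stack([xs, 2 * xs], axis=1) + 0.001 * np.random.randn(100, 2)
print(awe(traj, eps=0.05))
```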

Evaluating the Moral Beliefs Encoded in LLMs

  • paper_url: http://arxiv.org/abs/2307.14324
  • repo_url: https://github.com/ninodimontalcino/moralchoice
  • paper_authors: Nino Scherrer, Claudia Shi, Amir Feder, David M. Blei
  • for: This work presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs).
  • methods: The study uses a statistical method for eliciting the beliefs encoded in LLMs, introducing statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice.
  • results: The study finds that (1) in unambiguous scenarios, most models choose actions that align with common sense, whereas in ambiguous scenarios most models express uncertainty; (2) some models are sensitive to question wording when choosing the common-sense action; and (3) some models show clear preferences in ambiguous scenarios, with closed-source models in particular tending to agree with each other.
    Abstract This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.
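To illustrate the flavor of such measures, the sketch below aggregates sampled model responses (e.g., across question paraphrases and decodings) into a choice probability, an entropy-based uncertainty, and the majority choice. These mirror the kind of quantities the paper introduces, not their exact definitions.

```python
from collections import Counter
import math

def choice_stats(responses):
    """Aggregate sampled responses ('A' or 'B') across paraphrases/decodings
    into a choice probability and an entropy-based uncertainty. This is a
    simplified stand-in for the paper's measures, not their definitions."""
    counts = Counter(responses)
    n = len(responses)
    probs = {k: v / n for k, v in counts.items()}
    entropy = -sum(p * math.log2(p) for p in probs.values())
    choice, top = counts.most_common(1)[0]
    return {"choice": choice, "p_choice": top / n, "uncertainty_bits": entropy}

print(choice_stats(["A", "A", "B", "A", "A", "B", "A", "A"]))
# -> choice 'A' with p_choice 0.75 and about 0.81 bits of uncertainty
```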

Reinforcement Learning by Guided Safe Exploration

  • paper_url: http://arxiv.org/abs/2307.14316
  • repo_url: None
  • paper_authors: Qisong Yang, Thiago D. Simão, Nils Jansen, Simon H. Tindemans, Matthijs T. J. Spaan
  • for: This work trains an agent (the guide) with reward-free reinforcement learning (RL) so that it can adapt quickly once an unknown target task is revealed.
  • methods: The guide is trained with constrained reward-free RL in a controlled environment that permits unsafe interactions while still providing a safety signal; once the target task is revealed, safety violations are no longer allowed. Drawing on transfer learning, a target policy (the student) is regularized towards the guide, with the guide's influence gradually eliminated as training progresses.
  • results: The empirical results show that this method achieves safe transfer learning and helps the student solve the target task faster.
    Abstract Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.

Unsupervised Deep Learning-based Pansharpening with Jointly-Enhanced Spectral and Spatial Fidelity

  • paper_url: http://arxiv.org/abs/2307.14403
  • repo_url: https://github.com/matciotola/lambda-pnn
  • paper_authors: Matteo Ciotola, Giovanni Poggi, Giuseppe Scarpa
  • for: The main goal of this paper is to propose a deep-learning-based pansharpening method that improves pansharpening performance.
  • methods: The method uses full-resolution training and a novel loss function to improve the spectral and spatial quality of the pansharpened images.
  • results: Experiments show that the proposed method compares favorably with the state of the art on challenging test images, both in terms of numerical results and visual output.
    Abstract In recent years, deep learning has gained a leading role in the pansharpening of multiresolution images. Given the lack of ground truth data, most deep learning-based methods carry out supervised training in a reduced-resolution domain. However, models trained on downsized images tend to perform poorly on high-resolution target images. For this reason, several research groups are now turning to unsupervised training in the full-resolution domain, through the definition of appropriate loss functions and training paradigms. In this context, we have recently proposed a full-resolution training framework which can be applied to many existing architectures. Here, we propose a new deep learning-based pansharpening model that fully exploits the potential of this approach and provides cutting-edge performance. Besides architectural improvements with respect to previous work, such as the use of residual attention modules, the proposed model features a novel loss function that jointly promotes the spectral and spatial quality of the pansharpened data. In addition, thanks to a new fine-tuning strategy, it improves inference-time adaptation to target images. Experiments on a large variety of test images, performed in challenging scenarios, demonstrate that the proposed method compares favorably with the state of the art both in terms of numerical results and visual output. Code is available online at https://github.com/matciotola/Lambda-PNN.

ChatGPT and Persuasive Technologies for the Management and Delivery of Personalized Recommendations in Hotel Hospitality

  • paper_url: http://arxiv.org/abs/2307.14298
  • repo_url: None
  • paper_authors: Manolis Remountakis, Konstantinos Kotis, Babis Kourtzis, George E. Tsekouras
  • for: Automating and improving recommender systems in hotel hospitality.
  • methods: Integrating large language models (ChatGPT) and persuasive technologies into hotel recommender systems.
  • results: A pilot experiment indicates that these technologies can improve user satisfaction and hotel revenue.
    Abstract Recommender systems have become indispensable tools in the hotel hospitality industry, enabling personalized and tailored experiences for guests. Recent advancements in large language models (LLMs), such as ChatGPT, and persuasive technologies, have opened new avenues for enhancing the effectiveness of those systems. This paper explores the potential of integrating ChatGPT and persuasive technologies for automating and improving hotel hospitality recommender systems. First, we delve into the capabilities of ChatGPT, which can understand and generate human-like text, enabling more accurate and context-aware recommendations. We discuss the integration of ChatGPT into recommender systems, highlighting the ability to analyze user preferences, extract valuable insights from online reviews, and generate personalized recommendations based on guest profiles. Second, we investigate the role of persuasive technology in influencing user behavior and enhancing the persuasive impact of hotel recommendations. By incorporating persuasive techniques, such as social proof, scarcity and personalization, recommender systems can effectively influence user decision-making and encourage desired actions, such as booking a specific hotel or upgrading their room. To investigate the efficacy of ChatGPT and persuasive technologies, we present a pilot experiment with a case study involving a hotel recommender system. We aim to study the impact of integrating ChatGPT and persuasive techniques on user engagement, satisfaction, and conversion rates. The preliminary results demonstrate the potential of these technologies in enhancing the overall guest experience and business performance. Overall, this paper contributes to the field of hotel hospitality by exploring the synergistic relationship between LLMs and persuasive technology in recommender systems, ultimately influencing guest satisfaction and hotel revenue.

Unraveling the Complexity of Splitting Sequential Data: Tackling Challenges in Video and Time Series Analysis

  • paper_url: http://arxiv.org/abs/2307.14294
  • repo_url: None
  • paper_authors: Diego Botache, Kristina Dingel, Rico Huhnstock, Arno Ehresmann, Bernhard Sick
  • for: This work examines the challenges of splitting sequential data, including data acquisition, data representation, split ratio selection, setting up quality criteria, and choosing suitable selection strategies.
  • methods: The study explores these challenges through two real-world examples: motor test benches and particle tracking in liquids.
  • results: The study finds that splitting sequential data requires attention to data acquisition, data representation, split ratio selection, quality criteria, and selection strategy to ensure accurate and reliable analyses.
    Abstract Splitting of sequential data, such as videos and time series, is an essential step in various data analysis tasks, including object tracking and anomaly detection. However, splitting sequential data presents a variety of challenges that can impact the accuracy and reliability of subsequent analyses. This concept article examines the challenges associated with splitting sequential data, including data acquisition, data representation, split ratio selection, setting up quality criteria, and choosing suitable selection strategies. We explore these challenges through two real-world examples: motor test benches and particle tracking in liquids.

General Purpose Artificial Intelligence Systems (GPAIS): Properties, Definition, Taxonomy, Open Challenges and Implications

  • paper_url: http://arxiv.org/abs/2307.14283
  • repo_url: None
  • paper_authors: Isaac Triguero, Daniel Molina, Javier Poyatos, Javier Del Ser, Francisco Herrera
  • for: The paper discusses and proposes a new definition for General-Purpose Artificial Intelligence Systems (GPAIS) and its differentiation based on various factors.
  • methods: The paper uses existing definitions of GPAIS and proposes a new definition, and also discusses a taxonomy of approaches to realise GPAIS.
  • results: The paper aims to facilitate research collaboration across different areas that are tackling general-purpose tasks, and provides a holistic view of GPAIS, including its challenges and prospects, implications for society, and the need for responsible and trustworthy AI systems and regulation.
    Abstract Most applications of Artificial Intelligence (AI) are designed for a confined and specific task. However, there are many scenarios that call for a more general AI, capable of solving a wide array of tasks without being specifically designed for them. The term General-Purpose Artificial Intelligence Systems (GPAIS) has been defined to refer to these AI systems. To date, the possibility of an Artificial General Intelligence, powerful enough to perform any intellectual task as if it were human, or even improve it, has remained an aspiration, a fiction, and a perceived risk for our society. Whilst we might still be far from achieving that, GPAIS is a reality, sitting at the forefront of AI research. This work discusses existing definitions for GPAIS and proposes a new definition that allows for a gradual differentiation among types of GPAIS according to their properties and limitations. We distinguish between closed-world and open-world GPAIS, characterising their degree of autonomy and ability based on several factors such as adaptation to new tasks, competence in domains not intentionally trained for, ability to learn from few data, or proactive acknowledgment of their own limitations. We then propose a taxonomy of approaches to realise GPAIS, describing research trends such as the use of AI techniques to improve another AI or foundation models. As a prime example, we delve into generative AI, aligning them with the terms and concepts presented in the taxonomy. Through the proposed definition and taxonomy, our aim is to facilitate research collaboration across different areas that are tackling general-purpose tasks, as they share many common aspects. Finally, we discuss the current state of GPAIS, its challenges and prospects, implications for our society, and the need for responsible and trustworthy AI systems and regulation, with the goal of providing a holistic view of GPAIS.

cs.CL - 2023-07-27

ARC-NLP at PAN 2023: Transition-Focused Natural Language Inference for Writing Style Detection

  • paper_url: http://arxiv.org/abs/2307.14913
  • repo_url: None
  • paper_authors: Izzet Emre Kucukkaya, Umitcan Sahin, Cagri Toraman
  • for: The task addressed in this paper is detecting positions of writing style change in multi-author text documents.
  • methods: The task is formulated as a natural language inference problem over pairs of consecutive paragraphs. Different Transformer-based models are trained, with a warmup phase added during training.
  • results: In the experiments, the submitted models outperform the baselines and other proposed model versions. For the easy and medium setups, the submission is a transition-focused natural language inference model based on DeBERTa with warmup training; for the hard setup, the same model without transitions is submitted.
    Abstract The task of multi-author writing style detection aims at finding any positions of writing style change in a given text document. We formulate the task as a natural language inference problem where two consecutive paragraphs are paired. Our approach focuses on transitions between paragraphs while truncating input tokens for the task. As backbone models, we employ different Transformer-based encoders with a warmup phase during training. We submit the model version that outperforms baselines and other proposed model versions in our experiments. For the easy and medium setups, we submit transition-focused natural language inference based on DeBERTa with warmup training, and the same model without transition for the hard setup.

ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection

  • paper_url: http://arxiv.org/abs/2307.14912
  • repo_url: None
  • paper_authors: Umitcan Sahin, Izzet Emre Kucukkaya, Cagri Toraman
  • for: This paper describes an approach for detecting multiple types of triggering content in the Trigger Detection shared task at PAN CLEF 2023.
  • methods: A hierarchical model is used: long documents are first split into smaller segments, which are used to fine-tune a Transformer-based language model; feature embeddings extracted from the fine-tuned Transformer then serve as input for training multiple LSTM models.
  • results: The model achieves an F1-macro score of 0.372 and an F1-micro score of 0.736 on the validation set, both higher than the PAN CLEF 2023 baseline results.
    Abstract Fanfiction, a popular form of creative writing set within established fictional universes, has gained a substantial online following. However, ensuring the well-being and safety of participants has become a critical concern in this community. The detection of triggering content, material that may cause emotional distress or trauma to readers, poses a significant challenge. In this paper, we describe our approach for the Trigger Detection shared task at PAN CLEF 2023, where we want to detect multiple triggering content in a given Fanfiction document. For this, we build a hierarchical model that uses recurrence over Transformer-based language models. In our approach, we first split long documents into smaller sized segments and use them to fine-tune a Transformer model. Then, we extract feature embeddings from the fine-tuned Transformer model, which are used as input in the training of multiple LSTM models for trigger detection in a multi-label setting. Our model achieves an F1-macro score of 0.372 and F1-micro score of 0.736 on the validation set, which are higher than the baseline results shared at PAN CLEF 2023.
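A minimal sketch of this hierarchical setup is shown below: segment embeddings (produced by the fine-tuned Transformer in the paper; random tensors here) are aggregated by an LSTM whose final hidden state feeds a sigmoid head for multi-label trigger prediction. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierTriggerModel(nn.Module):
    """Hierarchical sketch: per-segment embeddings are aggregated by an LSTM,
    and a sigmoid head produces multi-label trigger probabilities."""
    def __init__(self, emb_dim=768, hidden=256, n_labels=32):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, seg_emb):                  # [batch, n_segments, emb_dim]
        _, (h, _) = self.lstm(seg_emb)           # h: [num_layers, batch, hidden]
        return torch.sigmoid(self.head(h[-1]))   # per-label probabilities

# Toy usage: 2 documents, 10 segments each, 768-dim segment embeddings.
model = HierTriggerModel()
probs = model(torch.randn(2, 10, 768))           # shape [2, 32]
```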

Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification

  • paper_url: http://arxiv.org/abs/2307.14899
  • repo_url: None
  • paper_authors: Sareh Ahmadi, Aditya Shah, Edward Fox
  • for: This paper addresses the problem of selecting texts for annotation in text classification when limited human resources cap the number of annotations, as well as the problem of rare binary categories with a small number of positive instances.
  • methods: The paper proposes using SHAP to construct high-quality queries for Elasticsearch and semantic search to select the texts to annotate; since annotation proceeds in batches, previous annotations guide the choice of the next set of texts.
  • results: Experiments show that this approach improves F1 scores for the minority classes, helping to address class imbalance.
    Abstract This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when there are limits on the number of annotations due to constraints on human resources. An additional challenge addressed is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our situation, where annotation occurs over a long time period, the selection of texts to be annotated can be made in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, to try to identify optimal sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants involved in studies aimed to help with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers. We integrate vector search, semantic search, and machine learning classifiers to yield a good solution. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
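As a sketch of how SHAP scores might drive retrieval, the snippet below ranks tokens by absolute SHAP value and assembles an Elasticsearch-style match query from the top terms. The index field name, result size, and token scores are assumptions, and sending the query to a cluster is left out.

```python
import numpy as np

def shap_query(tokens, shap_values, top_k=10):
    """Build a retrieval query from the highest-|SHAP| tokens of a trained
    classifier, to fetch candidate texts for the next annotation batch.
    The body follows the standard Elasticsearch match-query DSL; the 'text'
    field and the size of 50 are illustrative assumptions."""
    order = np.argsort(-np.abs(shap_values))[:top_k]
    terms = [tokens[i] for i in order]
    return {
        "query": {"match": {"text": {"query": " ".join(terms)}}},
        "size": 50,  # candidates for the next annotation round
    }

# Toy usage with hypothetical tokens and SHAP scores.
tokens = ["weight", "loss", "future", "exercise", "the", "plan"]
shap_vals = np.array([0.42, 0.31, 0.25, 0.18, 0.01, 0.15])
print(shap_query(tokens, shap_vals, top_k=4))
```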

MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities

  • paper_url: http://arxiv.org/abs/2307.14878
  • repo_url: https://github.com/thukelab/mesed
  • paper_authors: Yangning Li, Tingwei Lu, Yinghui Li, Tianyu Yu, Shulin Huang, Hai-Tao Zheng, Rui Zhang, Jun Yuan
  • for: This work aims to improve entity expansion accuracy in the Entity Set Expansion (ESE) task by integrating information from multiple modalities to represent entities.
  • methods: The paper proposes Multi-modal Entity Set Expansion (MESE), which represents entities with multi-modal information: (1) different modalities provide complementary information; (2) multi-modal information provides a unified signal via common visual properties for the same semantic class or entity; and (3) multi-modal information offers a robust alignment signal for synonymous entities.
  • results: On the MESED dataset, the authors propose a powerful multi-modal model, MultiExpan, pre-trained on four multi-modal pre-training tasks. Experiments and analyses demonstrate the high quality of the dataset and the effectiveness of MultiExpan, and point out directions for future research.
    Abstract The Entity Set Expansion (ESE) task aims to expand a handful of seed entities with new entities belonging to the same semantic class. Conventional ESE methods are based on mono-modality (i.e., literal modality), which struggle to deal with complex entities in the real world such as: (1) Negative entities with fine-grained semantic differences. (2) Synonymous entities. (3) Polysemous entities. (4) Long-tailed entities. These challenges prompt us to propose Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities. Intuitively, the benefits of multi-modal information for ESE are threefold: (1) Different modalities can provide complementary information. (2) Multi-modal information provides a unified signal via common visual properties for the same semantic class or entity. (3) Multi-modal information offers robust alignment signal for synonymous entities. To assess the performance of model in MESE and facilitate further research, we constructed the MESED dataset which is the first multi-modal dataset for ESE with large-scale and elaborate manual calibration. A powerful multi-modal model MultiExpan is proposed which is pre-trained on four multimodal pre-training tasks. The extensive experiments and analyses on MESED demonstrate the high quality of the dataset and the effectiveness of our MultiExpan, as well as pointing the direction for future research.

Cascaded Cross-Modal Transformer for Request and Complaint Detection

  • paper_url: http://arxiv.org/abs/2307.15097
  • repo_url: https://github.com/ristea/ccmt
  • paper_authors: Nicolae-Catalin Ristea, Radu Tudor Ionescu
  • for: The goal of this work is to develop a novel multi-modal transformer model for detecting customer requests and complaints in phone conversations.
  • methods: The study transcribes speech into text with automatic speech recognition (ASR) models and translates the transcripts into different languages, then combines language-specific BERT models with Wav2Vec2.0 audio features in a novel cascaded cross-attention transformer model.
  • results: Applied to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, the system reaches unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
    Abstract We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our approach leverages a multimodal paradigm by transcribing the speech using automatic speech recognition (ASR) models and translating the transcripts into different languages. Subsequently, we combine language-specific BERT-based models with Wav2Vec2.0 audio features in a novel cascaded cross-attention transformer model. We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.

ArcGPT: A Large Language Model Tailored for Real-world Archival Applications

  • paper_url: http://arxiv.org/abs/2307.14852
  • repo_url: None
  • paper_authors: Shitou Zhang, Jingrui Hou, Siyuan Peng, Zuchao Li, Qibiao Hu, Ping Wang
  • for: To better manage and utilize archival information resources and to improve the efficiency and quality of archivists' work.
  • methods: Pre-training on massive and extensive archival-domain data to improve model performance on real-world archival tasks.
  • results: ArcGPT outperforms existing state-of-the-art models on four real-world archival tasks, marking a substantial step forward in effective archival data management.
    Abstract Archives play a crucial role in preserving information and knowledge, and the exponential growth of such data necessitates efficient and automated tools for managing and utilizing archive information resources. Archival applications involve managing massive data that are challenging to process and analyze. Although LLMs have made remarkable progress in diverse domains, there is no publicly available LLM tailored to the archival field. Addressing this gap, we introduce ArcGPT, to our knowledge, the first general-purpose LLM tailored to the archival field. To enhance model performance on real-world archival tasks, ArcGPT has been pre-trained on massive and extensive archival domain data. Alongside ArcGPT, we release AMBLE, a benchmark comprising four real-world archival tasks. Evaluation on AMBLE shows that ArcGPT outperforms existing state-of-the-art models, marking a substantial step forward in effective archival data management. Ultimately, ArcGPT aims to better serve the archival community, aiding archivists in their crucial role of preserving and harnessing our collective information and knowledge.

Turkish Native Language Identification

  • paper_url: http://arxiv.org/abs/2307.14850
  • repo_url: None
  • paper_authors: Ahmet Yavuz Uluslu, Gerold Schneider
  • for: This paper presents the first application of Native Language Identification (NLI) to the Turkish language.
  • methods: The study uses the recently constructed Turkish Learner Corpus and combines three syntactic features (CFG production rules, part-of-speech n-grams, and function words) with L2 texts to demonstrate their effectiveness for this task.
  • results: The study finds that these syntactic features effectively predict the first language of authors writing in Turkish.
    Abstract In this paper, we present the first application of Native Language Identification (NLI) for the Turkish language. NLI involves predicting the writer's first language by analysing their writing in different languages. While most NLI research has focused on English, our study extends its scope to Turkish. We used the recently constructed Turkish Learner Corpus and employed a combination of three syntactic features (CFG production rules, part-of-speech n-grams, and function words) with L2 texts to demonstrate their effectiveness in this task.

What Makes a Good Paraphrase: Do Automated Evaluations Work?

  • paper_url: http://arxiv.org/abs/2307.14818
  • repo_url: None
  • paper_authors: Anna Moskvina, Bhushan Kotnis, Chris Catacata, Michael Janz, Nasrin Saef
  • for: This work investigates what makes a paraphrase acceptable and how paraphrase quality can be evaluated.
  • methods: The study conducts experiments on a German data set, combining automatic metrics with expert linguistic evaluation.
  • results: The study finds that different automatic paraphrasing strategies yield paraphrases of different quality.
    Abstract Paraphrasing is the task of expressing an essential idea or meaning in different words. But how different should the words be in order to be considered an acceptable paraphrase? And can we exclusively use automated metrics to evaluate the quality of a paraphrase? We attempt to answer these questions by conducting experiments on a German data set and performing automatic and expert linguistic evaluation.

Models of reference production: How do they withstand the test of time?

  • paper_url: http://arxiv.org/abs/2307.14817
  • repo_url: https://github.com/fsame/reg_grec-wsj
  • paper_authors: Fahime Same, Guanyi Chen, Kees van Deemter
  • for: The goal of this work is to study the linguistic and scientific aspects of natural language processing (NLP), rather than focusing solely on performance improvement.
  • methods: The study uses generating referring expressions in context (REG-in-context) as a case study, starting its analysis from GREC (a comprehensive set of English shared tasks released over a decade ago). Models are assessed on more realistic datasets and with more advanced methods, using different evaluation metrics and feature selection experiments.
  • results: The study concludes that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. The results also suggest that pre-trained language models are less dependent on the choice of corpus than classic machine learning models, and therefore make more robust class predictions.
    Abstract In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic Machine Learning models, and therefore make more robust class predictions.

Improving Aspect-Based Sentiment with End-to-End Semantic Role Labeling Model

  • paper_url: http://arxiv.org/abs/2307.14785
  • repo_url: https://github.com/pauli31/srl-aspect-based-sentiment
  • paper_authors: Pavel Přibáň, Ondřej Pražák
  • for: Improving the performance of Aspect-Based Sentiment Analysis (ABSA).
  • methods: Extracting semantic information with a Semantic Role Labeling (SRL) model and proposing a novel end-to-end SRL model.
  • results: The proposed models, evaluated with ELECTRA-small models in English and Czech, improve ABSA performance in both languages and achieve new state-of-the-art results on Czech ABSA.
    Abstract This paper presents a series of approaches aimed at enhancing the performance of Aspect-Based Sentiment Analysis (ABSA) by utilizing semantic information extracted from a Semantic Role Labeling (SRL) model. We propose a novel end-to-end Semantic Role Labeling model that effectively captures most of the structured semantic information within the Transformer hidden state. We believe that this end-to-end model is well-suited for our newly proposed models that incorporate semantic information. We evaluate the proposed models in two languages, English and Czech, employing ELECTRA-small models. Our combined models improve ABSA performance in both languages. Moreover, we achieved new state-of-the-art results on the Czech ABSA.

Turning Whisper into Real-Time Transcription System

  • paper_url: http://arxiv.org/abs/2307.14743
  • repo_url: https://github.com/ufal/whisper_streaming
  • paper_authors: Dominik Macháček, Raj Dabre, Ondřej Bojar
  • for: This paper presents Whisper-Streaming, an implementation of real-time speech transcription and translation for Whisper-like models.
  • methods: Building on the Whisper model, Whisper-Streaming uses a local agreement policy with self-adaptive latency to enable streaming transcription.
  • results: The paper reports high quality and 3.3 seconds of latency on an unsegmented long-form speech transcription test set, and demonstrates robustness and practical usability as a component of a live transcription service at a multilingual conference.
    Abstract Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation for Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
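The local agreement policy can be sketched in a few lines: only words on the longest common prefix of two consecutive (growing) hypotheses are committed and emitted, which keeps the output stream stable while new audio keeps revising the tail. The real implementation operates on timestamped words from Whisper; plain word lists are used here for illustration.

```python
def local_agreement(prev_hyp, new_hyp, committed):
    """Commit only the words on the longest common prefix of two consecutive
    hypotheses, beyond what was already emitted. Returns (newly_emitted,
    total_committed)."""
    agreed = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break
        agreed.append(a)
    return agreed[len(committed):], agreed

committed, prev = [], "the quick brown".split()
for hyp in ["the quick brown fox".split(),
            "the quick brown fox jumps over".split()]:
    out, committed = local_agreement(prev, hyp, committed)
    prev = hyp
    if out:
        print("emit:", " ".join(out))  # -> "the quick brown", then "fox"
```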

Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training

  • paper_url: http://arxiv.org/abs/2307.14666
  • repo_url: https://github.com/fraunhofer-iais/arabic_nlp
  • paper_authors: Mohammad Majd Saad Al Deen, Maren Pielka, Jörn Hees, Bouthaina Soulef Abdou, Rafet Sifa
  • for: This work addresses the classification of Arabic text data with natural language processing (NLP) techniques, with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is a resource-poor language with few available data sets, which limits the availability of NLP methods.
  • methods: The authors mitigate this limitation by creating a dedicated data set from publicly available resources, then train and evaluate Transformer-based machine learning models, applying linguistically informed pre-training methods such as Named Entity Recognition (NER).
  • results: The study finds that with linguistically informed pre-training, a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches. This is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.
    Abstract This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that there are few data sets available, which leads to limited availability of NLP methods. To overcome this limitation, we create a dedicated data set from publicly available resources. Subsequently, transformer-based machine learning models are being trained and evaluated. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches, when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.

Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models

  • paper_url: http://arxiv.org/abs/2307.14539
  • repo_url: None
  • paper_authors: Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh
  • for: This paper aims to alert developers to the security risks of incorporating additional modalities (e.g., vision) into large language models (LLMs).
  • methods: The paper attacks the embedding space of pre-trained encoders, which are typically publicly available and used in multi-modal systems in a plug-and-play manner.
  • results: The paper finds that seeking input images that land in targeted regions of a pre-trained encoder's embedding space can induce abnormal model behavior, including two threats, "Context Contamination" and "Hidden Prompt Injection", which can substantially change a model's behavior and thus threaten the security of multi-modal systems.
    Abstract The rapid growth and increasing popularity of incorporating additional modalities (e.g., vision) into large language models (LLMs) has raised significant security concerns. This expansion of modality, akin to adding more doors to a house, unintentionally creates multiple access points for adversarial attacks. In this paper, by introducing adversarial embedding space attacks, we emphasize the vulnerabilities present in multi-modal systems that originate from incorporating off-the-shelf components like public pre-trained encoders in a plug-and-play manner into these systems. In contrast to existing work, our approach does not require access to the multi-modal system's weights or parameters but instead relies on the huge under-explored embedding space of such pre-trained encoders. Our proposed embedding space attacks involve seeking input images that reside within the dangerous or targeted regions of the extensive embedding space of these pre-trained components. These crafted adversarial images pose two major threats: 'Context Contamination' and 'Hidden Prompt Injection'-both of which can compromise multi-modal models like LLaVA and fully change the behavior of the associated language model. Our findings emphasize the need for a comprehensive examination of the underlying components, particularly pre-trained encoders, before incorporating them into systems in a plug-and-play manner to ensure robust security.

CliniDigest: A Case Study in Large Language Model Based Large-Scale Summarization of Clinical Trial Descriptions

  • paper_url: http://arxiv.org/abs/2307.14522
  • repo_url: None
  • paper_authors: Renee D. White, Tristan Peng, Pann Sripitak, Alexander Rosenberg Johansen, Michael Snyder
  • for: This paper aims to provide a tool for summarizing clinical trials in real-time, with the goal of helping researchers keep up-to-date with the latest trials in their field.
  • methods: The tool used for summarization is called CliniDigest, which is based on GPT-3.5. It can reduce a large amount of text (up to 85 clinical trial descriptions) into a concise 200-word summary with references and limited hallucinations.
  • results: The paper reports the results of testing CliniDigest on 457 trials across 27 medical subdomains. The summaries generated by CliniDigest have a mean length of 153 words and utilize an average of 54% of the sources.
    Abstract A clinical trial is a study that evaluates new biomedical interventions. To design new trials, researchers draw inspiration from those current and completed. In 2022, there were on average more than 100 clinical trials submitted to ClinicalTrials.gov every day, with each trial having a mean of approximately 1500 words [1]. This makes it nearly impossible to keep up to date. To mitigate this issue, we have created a batch clinical trial summarizer called CliniDigest using GPT-3.5. CliniDigest is, to our knowledge, the first tool able to provide real-time, truthful, and comprehensive summaries of clinical trials. CliniDigest can reduce up to 85 clinical trial descriptions (approximately 10,500 words) into a concise 200-word summary with references and limited hallucinations. We have tested CliniDigest on its ability to summarize 457 trials divided across 27 medical subdomains. For each field, CliniDigest generates summaries of $\mu=153,\ \sigma=69 $ words, each of which utilizes $\mu=54\%,\ \sigma=30\% $ of the sources. A more comprehensive evaluation is planned and outlined in this paper.
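A minimal sketch of such a batch summarizer follows, assuming a generic prompt-in/text-out `llm` callable (e.g., a GPT-3.5 wrapper): descriptions are packed into prompt-sized chunks, each chunk is summarized with source tags to limit hallucination, and the partial summaries are fused into one referenced digest. The prompt wording and word budgets are illustrative assumptions, not CliniDigest's actual prompts.

```python
def clinidigest(descriptions, llm, max_words=3000):
    """Pack trial descriptions into prompt-sized chunks, summarize each chunk
    with [id] source tags, then fuse the partial summaries into one digest.
    `llm` is any prompt-in/text-out callable; the 3000-word chunk budget and
    prompt wording are assumptions."""
    chunks, cur, words = [], [], 0
    for i, desc in enumerate(descriptions):
        n = len(desc.split())
        if cur and words + n > max_words:
            chunks.append(cur)
            cur, words = [], 0
        cur.append((i, desc))
        words += n
    if cur:
        chunks.append(cur)

    partials = []
    for chunk in chunks:
        body = "\n\n".join(f"[{i}] {d}" for i, d in chunk)
        partials.append(llm(
            "Summarize the shared goals and methods of these clinical trials "
            f"in at most 100 words, citing sources as [id]:\n\n{body}"))
    return llm("Fuse these partial summaries into one 200-word summary, "
               "keeping the [id] citations:\n\n" + "\n\n".join(partials))
```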

A Predictive Model of Digital Information Engagement: Forecasting User Engagement With English Words by Incorporating Cognitive Biases, Computational Linguistics and Natural Language Processing

  • paper_url: http://arxiv.org/abs/2307.14500
  • repo_url: None
  • paper_authors: Nimrod Dvir, Elaine Friedman, Suraj Commuri, Fan yang, Jennifer Romano
  • for: This research develops a predictive model of digital information engagement (IE), called the READ model, which integrates cognitive biases with computational linguistics to offer a multidimensional perspective on information engagement.
  • methods: The study uses 50 randomly selected pairs of synonymous words (100 words in total) from the WordNet database, measures their engagement levels through a large-scale online survey (n = 80,500), and computes the READ attributes for each word.
  • results: The study finds that the READ model accurately predicts a word's engagement level and distinguishes the more engaging word in a pair of synonyms. The model's potential extends across domains such as business, education, government, and healthcare, where it could enhance content engagement and inform the development of AI language models.
    Abstract This study introduces and empirically tests a novel predictive model for digital information engagement (IE) - the READ model, an acronym for the four pivotal attributes of engaging information: Representativeness, Ease-of-use, Affect, and Distribution. Conceptualized within the theoretical framework of Cumulative Prospect Theory, the model integrates key cognitive biases with computational linguistics and natural language processing to develop a multidimensional perspective on information engagement. A rigorous testing protocol was implemented, involving 50 randomly selected pairs of synonymous words (100 words in total) from the WordNet database. These words' engagement levels were evaluated through a large-scale online survey (n = 80,500) to derive empirical IE metrics. The READ attributes for each word were then computed and their predictive efficacy examined. The findings affirm the READ model's robustness, accurately predicting a word's IE level and distinguishing the more engaging word from a pair of synonyms with an 84% accuracy rate. The READ model's potential extends across various domains, including business, education, government, and healthcare, where it could enhance content engagement and inform AI language model development and generative text work. Future research should address the model's scalability and adaptability across different domains and languages, thereby broadening its applicability and efficacy.

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Human Language?

  • paper_url: http://arxiv.org/abs/2308.00109
  • repo_url: None
  • paper_authors: Gary Marcus, Evelina Leivada, Elliot Murphy
  • for: This paper examines the potential of large language models for language-related tasks and their role in the development of artificial intelligence.
  • methods: It analyzes large language models, which are built around next-word prediction, and assesses their contributions and shortcomings.
  • results: It identifies key abilities that are still missing from the current state of development and exploitation of these models, and points to directions for future work.
    Abstract Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of large language models has been linked to claims about human-like linguistic performance, and their applications are hailed both as a key step towards Artificial General Intelligence and as a major advance in understanding the cognitive, and even neural, basis of human language. We analyze the contribution of large language models as theoretically informative representations of a target system vs. atheoretical powerful mechanistic tools, and we identify the key abilities that are still missing from the current state of development and exploitation of these models.

Controllable Generation of Dialogue Acts for Dialogue Systems via Few-Shot Response Generation and Ranking

  • paper_url: http://arxiv.org/abs/2307.14440
  • repo_url: https://github.com/aramir62/da-nlg
  • paper_authors: Angela Ramirez, Karik Agarwal, Juraj Juraska, Utkarsh Garg, Marilyn A. Walker
  • for: This paper aims to develop a novel few-shot overgenerate-and-rank approach for controlled generation of dialogue acts (DAs).
  • methods: The proposed approach uses pretrained language models (LLMs) with prompt-based learning, and includes eight few-shot prompt styles and six automatic ranking functions to identify outputs with both correct DA and high semantic accuracy.
  • results: The approach achieves perfect DA accuracy and near perfect semantic accuracy (99.81%) on three domains and four LLMs, outperforming fine-tuned few-shot models trained with 5 to 100 instances per DA.
    Abstract Dialogue systems need to produce responses that realize multiple types of dialogue acts (DAs) with high semantic fidelity. In the past, natural language generators (NLGs) for dialogue were trained on large parallel corpora that map from a domain-specific DA and its semantic attributes to an output utterance. Recent work shows that pretrained language models (LLMs) offer new possibilities for controllable NLG using prompt-based learning. Here we develop a novel few-shot overgenerate-and-rank approach that achieves the controlled generation of DAs. We compare eight few-shot prompt styles that include a novel method of generating from textual pseudo-references using a textual style transfer approach. We develop six automatic ranking functions that identify outputs with both the correct DA and high semantic accuracy at generation time. We test our approach on three domains and four LLMs. To our knowledge, this is the first work on NLG for dialogue that automatically ranks outputs using both DA and attribute accuracy. For completeness, we compare our results to fine-tuned few-shot models trained with 5 to 100 instances per DA. Our results show that several prompt settings achieve perfect DA accuracy, and near perfect semantic accuracy (99.81%) and perform better than few-shot fine-tuning.
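
A minimal sketch of the overgenerate-and-rank loop described above: sample several candidates from a few-shot prompt, then keep the one that a ranking function judges to realize the target dialogue act with the highest semantic accuracy. `generate`, `da_score`, and `semantic_score` are placeholders for an LLM client and the paper's ranking functions.

```python
from typing import Callable, List

def overgenerate_and_rank(prompt: str, target_da: str,
                          generate: Callable[[str], str],
                          da_score: Callable[[str, str], float],
                          semantic_score: Callable[[str], float],
                          n: int = 10) -> str:
    # Overgenerate: sample n candidate utterances for the same prompt.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Rank: prefer correct dialogue act first, then semantic accuracy.
    return max(candidates,
               key=lambda c: (da_score(c, target_da), semantic_score(c)))
```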

Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models

  • paper_url: http://arxiv.org/abs/2307.14430
  • repo_url: None
  • paper_authors: Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré
  • for: This paper aims to study how to select data to improve the performance of pre-trained large language models (LMs) by understanding the natural order of skills learned by LMs from their training data.
  • methods: The authors develop a new framework based on the intuition that LMs learn skills in a natural order, and they formalize the notion of a skill and an ordered set of skills in terms of the associated data. They also propose an online data sampling algorithm called Skill-It to efficiently learn multiple skills in a continual pre-training setting and an individual skill in a fine-tuning setting.
  • results: The authors demonstrate the effectiveness of their framework and algorithm on several datasets, including the LEGO synthetic dataset and the Natural Instructions dataset. On the LEGO dataset, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. Additionally, they apply their framework to the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
    Abstract The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
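
The online sampling idea can be caricatured as a multiplicative-weights update over a mixture of skills, in which skills whose validation loss remains high receive more training data. This is a simplified sketch of the Skill-It intuition, assuming per-skill validation losses are available each round; it is not the authors' exact algorithm.

```python
import numpy as np

def update_mixture(weights: np.ndarray, skill_losses: np.ndarray,
                   eta: float = 0.5) -> np.ndarray:
    """One round of a Skill-It-style update: upweight skills with high loss."""
    w = weights * np.exp(eta * skill_losses)   # multiplicative weights
    return w / w.sum()                          # renormalize to a mixture

# Example: three skills; the second is lagging, so it gets more data.
p = np.ones(3) / 3
p = update_mixture(p, np.array([0.2, 1.1, 0.4]))
print(p)  # mixture shifts toward skill 2
```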

Towards Generalist Biomedical AI

  • paper_url: http://arxiv.org/abs/2307.14334
  • repo_url: https://github.com/kyegomez/Med-PaLM
  • paper_authors: Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S Sara Mahdavi, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Karan Singhal, Pete Florence, Alan Karthikesalingam, Vivek Natarajan
  • for: This paper aims to develop a generalist multimodal biomedical AI system that could enable applications ranging from scientific discovery to care delivery.
  • methods: The authors first curate MultiMedBench, a new multimodal biomedical benchmark, and then introduce Med-PaLM Multimodal (Med-PaLM M), a proof-of-concept generalist biomedical AI system that flexibly encodes and interprets multimodal biomedical data, including clinical language, imaging, and genomics, with a single set of model weights.
  • results: Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. The paper also reports zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and a radiologist evaluation in which clinicians preferred Med-PaLM M chest X-ray reports over radiologist-produced ones in up to 40.50% of cases.
    Abstract Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery. To enable the development of these models, we first curate MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. We also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning. To further probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales. In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.

Comparative Analysis of Libraries for the Sentimental Analysis

  • paper_url: http://arxiv.org/abs/2307.14311
  • repo_url: https://github.com/RAJHARINI-KRISHNASAMY/sentimental-analysis-of-news-feeds
  • paper_authors: Wendy Ccoya, Edson Pinto
  • for: The main goal of this study is to provide a comparative analysis of sentiment analysis libraries using machine learning methods.
  • methods: The study applies sentiment analysis techniques with five Python and R libraries (NLTK, TextBlob, Vader, Transformers with GPT and BERT pretrained models, and Tidytext) and four machine learning models: decision tree (DT), support vector machine (SVM), naive Bayes (NB), and k-nearest neighbor (KNN).
  • results: Based on the experiments, the BERT transformer method reaches an accuracy of 0.973 and is recommended for sentiment analysis.
    Abstract This study's main goal is to provide a comparative analysis of libraries using machine learning methods. Experts in natural language processing (NLP) are increasingly interested in sentiment analysis (SA) of text. The objective of employing NLP text analysis techniques is to recognize and categorize feelings expressed in Twitter users' utterances. This examination also looks at issues with SA and the libraries used, and surveys a number of methods to classify emotional polarity. According to recent research, the Naive Bayes Classifier, Decision Tree Classifier, Maxent Classifier, Sklearn Classifier, Sklearn Classifier MultinomialNB, and other conjoint learning algorithms are very effective. The study applies sentiment analysis techniques with five Python and R libraries (NLTK, TextBlob, Vader, Transformers with GPT and BERT pretrained models, and Tidytext) and four machine learning models: decision tree (DT), support vector machine (SVM), naive Bayes (NB), and k-nearest neighbor (KNN). A comparative study was also carried out to evaluate how well SA libraries operate in the social network environment. The measures used to assess the best algorithms in this experiment, which used a single dataset for each method, were precision, recall, and F1 score. We conclude that the BERT transformer method, with an accuracy of 0.973, is recommended for sentiment analysis.
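
For readers who want to reproduce a small piece of the comparison, the snippet below scores one sentence with three of the surveyed Python libraries; the example sentence is ours, and the `transformers` pipeline downloads a default English sentiment model.

```python
# Quick comparison of three of the surveyed Python libraries on one sentence.
# Requires: pip install nltk textblob transformers (plus a torch backend).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from transformers import pipeline

nltk.download("vader_lexicon", quiet=True)
text = "The new update is surprisingly good, though the app still crashes."

vader = SentimentIntensityAnalyzer().polarity_scores(text)   # dict of scores
blob = TextBlob(text).sentiment.polarity                     # float in [-1, 1]
bert = pipeline("sentiment-analysis")(text)[0]               # label + score

print(f"VADER compound: {vader['compound']:.3f}")
print(f"TextBlob polarity: {blob:.3f}")
print(f"BERT: {bert['label']} ({bert['score']:.3f})")
```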

Automatically Evaluating Opinion Prevalence in Opinion Summarization

  • paper_url: http://arxiv.org/abs/2307.14305
  • repo_url: None
  • paper_authors: Christopher Malon
  • for: This paper studies how to automatically evaluate summaries of product reviews, so that a summary's opinions represent those of the source reviews.
  • methods: It proposes an automatic metric for the prevalence of the opinions a summary expresses, based on counting the number of source reviews that are consistent with each statement in the summary while discrediting trivial or redundant statements; several existing methods for scoring the factual consistency of a summary statement against individual reviews are considered.
  • results: On a corpus of Amazon product reviews, a human-authored summary has only slightly better opinion prevalence than randomly selected extracts, and previous extractive and abstractive unsupervised opinion summarization methods perform worse than humans; preprocessing source reviews by simplification raises the opinion prevalence of existing abstractive systems to the level of human performance.
    Abstract When faced with a large number of product reviews, it is not clear that a human can remember all of them and weight opinions representatively to write a good reference summary. We propose an automatic metric to test the prevalence of the opinions that a summary expresses, based on counting the number of reviews that are consistent with each statement in the summary, while discrediting trivial or redundant statements. To formulate this opinion prevalence metric, we consider several existing methods to score the factual consistency of a summary statement with respect to each individual source review. On a corpus of Amazon product reviews, we gather multiple human judgments of the opinion consistency, to determine which automatic metric best expresses consistency in product reviews. Using the resulting opinion prevalence metric, we show that a human authored summary has only slightly better opinion prevalence than randomly selected extracts from the source reviews, and previous extractive and abstractive unsupervised opinion summarization methods perform worse than humans. We demonstrate room for improvement with a greedy construction of extractive summaries with twice the opinion prevalence achieved by humans. Finally, we show that preprocessing source reviews by simplification can raise the opinion prevalence achieved by existing abstractive opinion summarization systems to the level of human performance.
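
The proposed metric can be sketched as follows, with `consistent(statement, review)` standing in for whichever factual-consistency scorer is chosen and only a crude exact-duplicate filter for redundancy; both simplifications are ours.

```python
from typing import Callable, List

def opinion_prevalence(summary_statements: List[str], reviews: List[str],
                       consistent: Callable[[str, str], bool]) -> float:
    """Average fraction of reviews supporting each non-redundant statement."""
    seen, scores = [], []
    for s in summary_statements:
        if s in seen:                      # discard exact duplicates
            continue
        seen.append(s)
        support = sum(consistent(s, r) for r in reviews)
        scores.append(support / len(reviews))
    return sum(scores) / len(scores) if scores else 0.0
```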

Founding a mathematical diffusion model in linguistics. The case study of German syntactic features in the North-Eastern Italian dialects

  • paper_url: http://arxiv.org/abs/2307.14291
  • repo_url: None
  • paper_authors: I. Lazzizzera
  • for: To study the spread of Germanic syntactic features into the Romance dialects of North-Eastern Italy.
  • methods: Tools from Geographic Data Science are used to produce an interactive map representing the local distribution of the German language features.
  • results: The fraction of territory using a given feature can be represented by a smooth two-dimensional surface, which is well described by the same diffusion-convection equation used in physics for phenomena such as heat diffusion; solutions of this equation evaluated at the present time fit the interpolated data well. Moreover, Schmidt's 'waves' can be counted among the solutions of the diffusion equation, and superimposing them on a 'tidal flooding' can reproduce the complexities of real linguistic diffusion events.
    Abstract We take as a case study the spread of Germanic syntactic features into Romance dialects of North-Eastern Italy, which occurred after the immigration of German people in the Tyrol during the High Middle Ages. An interactive map is produced using tools of what is called Geographic Data Science. A smooth two-dimensional surface $\mathcal{G}$ expresses locally which fraction of territory uses a given German language feature: it is obtained by interpolating a discrete function that says if at any surveyed locality that feature is used or not. This surface $\mathcal{G}$ is thought of as the value at the present time of a function describing a diffusion-convection phenomenon in two dimensions (here said \emph{tidal} mode), which is subjected in a very natural way to the same equation, suitably contextualized, used in physics for a number of phenomenological facts like the heat diffusion. It is shown that solutions of this equation, evaluated at the present time, fit well with the data as interpolated by $\mathcal{G}$, thus providing convincing pictures of diffusion-convection of the linguistic features of the case study, albeit with simplifications and approximations. Very importantly, it is shown that Schmidt's 'waves' can be counted among the solutions of the diffusion equation: superimposing Schmidt 'waves' to a 'tidal flooding' can reproduce complexities of real linguistic diffusion events.
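
For experimentation, a minimal explicit finite-difference step for a two-dimensional diffusion-convection equation of the form $\partial_t \mathcal{G} = D\,\nabla^2\mathcal{G} - \mathbf{v}\cdot\nabla\mathcal{G}$ is sketched below; the periodic grid, coefficients, and initial condition are illustrative choices, not the paper's setup.

```python
import numpy as np

def step(G: np.ndarray, D: float, vx: float, vy: float,
         dx: float, dt: float) -> np.ndarray:
    """One explicit Euler step of 2D diffusion-convection on a periodic grid."""
    lap = (np.roll(G, 1, 0) + np.roll(G, -1, 0) +
           np.roll(G, 1, 1) + np.roll(G, -1, 1) - 4 * G) / dx**2
    gx = (np.roll(G, -1, 1) - np.roll(G, 1, 1)) / (2 * dx)   # d/dx
    gy = (np.roll(G, -1, 0) - np.roll(G, 1, 0)) / (2 * dx)   # d/dy
    return G + dt * (D * lap - vx * gx - vy * gy)

G = np.zeros((64, 64)); G[32, 32] = 1.0     # point source of a feature
for _ in range(200):
    G = step(G, D=0.1, vx=0.05, vy=0.0, dx=1.0, dt=0.5)
```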

cs.LG - 2023-07-27

Federated Model Aggregation via Self-Supervised Priors for Highly Imbalanced Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14959
  • repo_url: https://github.com/xmed-lab/fed-mas
  • paper_authors: Marawan Elbatel, Hualiang Wang, Robert Martí, Huazhu Fu, Xiaomeng Li
  • for: This paper focuses on addressing the challenges of federated learning in highly imbalanced medical datasets, specifically skin lesions and gastrointestinal images.
  • methods: The authors use publicly available self-supervised auxiliary networks to study inter-client intra-class variations and derive a dynamic balanced model aggregation method called Fed-MAS, which can be used with different local learning methods to optimize a highly robust and unbiased global model.
  • results: The authors demonstrate the effectiveness of Fed-MAS in improving the robustness and accuracy of the global model, and provide code for implementing the method at \url{https://github.com/xmed-lab/Fed-MAS}.
    Abstract In the medical field, federated learning commonly deals with highly imbalanced datasets, including skin lesions and gastrointestinal images. Existing federated methods under highly imbalanced datasets primarily focus on optimizing a global model without incorporating the intra-class variations that can arise in medical imaging due to different populations, findings, and scanners. In this paper, we study the inter-client intra-class variations with publicly available self-supervised auxiliary networks. Specifically, we find that employing a shared auxiliary pre-trained model, like MoCo-V2, locally on every client yields consistent divergence measurements. Based on these findings, we derive a dynamic balanced model aggregation via self-supervised priors (MAS) to guide the global model optimization. Fed-MAS can be utilized with different local learning methods for effective model aggregation toward a highly robust and unbiased global model. Our code is available at \url{https://github.com/xmed-lab/Fed-MAS}.
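
A highly simplified sketch of divergence-aware aggregation: each client reports a scalar divergence computed locally (for instance, from features of a shared pre-trained encoder such as MoCo-V2), and the server down-weights the most divergent clients when averaging parameters. The exponential weighting below is our illustration of the general idea, not the authors' exact MAS rule.

```python
from typing import List
import numpy as np

def aggregate(client_params: List[np.ndarray],
              divergences: List[float], tau: float = 1.0) -> np.ndarray:
    """Weighted FedAvg: clients with larger divergence get smaller weight."""
    w = np.exp(-np.array(divergences) / tau)
    w = w / w.sum()                      # normalized aggregation weights
    return sum(wi * p for wi, p in zip(w, client_params))
```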

Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space

  • paper_url: http://arxiv.org/abs/2307.14953
  • repo_url: https://github.com/eddardd/demo-dadil
  • paper_authors: Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac
  • for: To solve Multi-Source Domain Adaptation (MSDA), i.e., to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain.
  • methods: A novel MSDA framework based on dictionary learning and optimal transport: each domain is interpreted as an empirical distribution and expressed as a Wasserstein barycenter of dictionary atoms. A new algorithm, DaDiL, learns via mini-batches (i) the atom distributions and (ii) a matrix of barycentric coordinates. Two MSDA methods are built on the dictionary: DaDiL-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on ensembling classifiers learned on atom distributions.
  • results: The methods are evaluated on three benchmarks, Caltech-Office, Office 31, and CRWU, improving the previous state of the art by 3.15%, 2.29%, and 7.71% in classification performance. Interpolations in the Wasserstein hull of the learned atoms provide data that generalizes to the target domain.
    Abstract This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDiL-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods in 3 benchmarks: Caltech-Office, Office 31, and CRWU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
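
A toy sketch of the DaDiL-E idea: classifiers trained on each atom distribution are ensembled with the target domain's barycentric coordinates as weights. The scikit-learn-style `predict_proba` interface and the coordinates themselves are placeholders.

```python
from typing import List
import numpy as np

def dadil_e_predict(x: np.ndarray, atom_classifiers: List,
                    barycentric_coords: np.ndarray) -> np.ndarray:
    """Ensemble per-atom classifiers, weighted by the target's coordinates.

    x: (N, d) samples; barycentric_coords: (K,) nonnegative, summing to 1.
    """
    probs = np.stack([clf.predict_proba(x) for clf in atom_classifiers])
    # Contract the K atom axis against the coordinates: (K,) x (K, N, C).
    return np.tensordot(barycentric_coords, probs, axes=1).argmax(axis=-1)
```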

Network Fault-tolerant and Byzantine-resilient Social Learning via Collaborative Hierarchical Non-Bayesian Learning

  • paper_url: http://arxiv.org/abs/2307.14952
  • repo_url: None
  • paper_authors: Connor Mclaughlin, Matthew Ding, Denis Edogmus, Lili Su
  • for: Addresses the problem of non-Bayesian learning over vulnerable networks with communication failures and external adversarial attacks.
  • methods: Proposes a hierarchical robust push-sum algorithm with sparse information fusion and dual averaging update to achieve average consensus despite packet-dropping link failures. Also, uses a novel Byzantine-resilient gossiping-type rule to facilitate resilient information propagation across sub-networks.
  • results: Obtains provable convergence guarantees for the packet-dropping fault-tolerant non-Bayesian learning algorithm and solves the non-Bayesian learning problem via running multiple dynamics, each of which only involves Byzantine consensus with scalar inputs.
    Abstract As the network scale increases, existing fully distributed solutions start to lag behind the real-world challenges such as (1) slow information propagation, (2) network communication failures, and (3) external adversarial attacks. In this paper, we focus on hierarchical system architecture and address the problem of non-Bayesian learning over networks that are vulnerable to communication failures and adversarial attacks. On network communication, we consider packet-dropping link failures. We first propose a hierarchical robust push-sum algorithm that can achieve average consensus despite frequent packet-dropping link failures. We provide a sparse information fusion rule between the parameter server and arbitrarily selected network representatives. Then, interleaving the consensus update step with a dual averaging update with Kullback-Leibler (KL) divergence as the proximal function, we obtain a packet-dropping fault-tolerant non-Bayesian learning algorithm with provable convergence guarantees. On external adversarial attacks, we consider Byzantine attacks in which the compromised agents can send maliciously calibrated messages to others (including both the agents and the parameter server). To avoid the curse of dimensionality of Byzantine consensus, we solve the non-Bayesian learning problem via running multiple dynamics, each of which only involves Byzantine consensus with scalar inputs. To facilitate resilient information propagation across sub-networks, we use a novel Byzantine-resilient gossiping-type rule at the parameter server.
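
To give a flavor of Byzantine-resilient aggregation with scalar (or coordinate-wise) inputs, the sketch below implements the standard trimmed-mean primitive; it is a generic illustration, not the paper's hierarchical robust push-sum algorithm.

```python
import numpy as np

def trimmed_mean(values: np.ndarray, f: int) -> np.ndarray:
    """Coordinate-wise mean after dropping the f largest and f smallest
    entries per coordinate; tolerates up to f Byzantine inputs."""
    v = np.sort(values, axis=0)             # values: (n_agents, dim)
    return v[f: values.shape[0] - f].mean(axis=0)

reports = np.array([[1.0], [1.1], [0.9], [50.0]])  # one malicious agent
print(trimmed_mean(reports, f=1))                  # robust estimate near 1.0
```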

A Self-Adaptive Penalty Method for Integrating Prior Knowledge Constraints into Neural ODEs

  • paper_url: http://arxiv.org/abs/2307.14940
  • repo_url: None
  • paper_authors: C. Coelho, M. Fernanda P. Costa, L. L. Ferrás
  • for: Modelling constrained natural systems, such as population growth, chemical reaction evolution, and damped harmonic oscillator motion.
  • methods: A self-adaptive penalty method that dynamically adjusts the penalty parameters so that Neural ODE models follow the underlying rules or laws governing the systems.
  • results: Compared with other penalty Neural ODE approaches and the vanilla Neural ODE, the proposed self-adaptive penalty algorithm is effective for modelling constrained natural systems and provides more accurate and robust models with reliable and meaningful predictions.
    Abstract The continuous dynamics of natural systems has been effectively modelled using Neural Ordinary Differential Equations (Neural ODEs). However, for accurate and meaningful predictions, it is crucial that the models follow the underlying rules or laws that govern these systems. In this work, we propose a self-adaptive penalty algorithm for Neural ODEs to enable modelling of constrained natural systems. The proposed self-adaptive penalty function can dynamically adjust the penalty parameters. The explicit introduction of prior knowledge helps to increase the interpretability of Neural ODE-based models. We validate the proposed approach by modelling three natural systems with prior knowledge constraints: population growth, chemical reaction evolution, and damped harmonic oscillator motion. The numerical experiments and a comparison with other penalty Neural ODE approaches and \emph{vanilla} Neural ODE, demonstrate the effectiveness of the proposed self-adaptive penalty algorithm for Neural ODEs in modelling constrained natural systems. Moreover, the self-adaptive penalty approach provides more accurate and robust models with reliable and meaningful predictions.
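
The training loop can be sketched generically as a data-fit loss plus a penalty on constraint violation, with the penalty weight increased whenever the violation fails to shrink; the specific update rule, growth factor, and tolerance below are illustrative assumptions, not the paper's algorithm.

```python
def train_with_adaptive_penalty(fit_loss, violation, params, opt_step,
                                epochs=100, lam=1.0, grow=2.0, tol=0.9):
    """Generic self-adaptive penalty loop.

    fit_loss(params)  -> data-fit loss
    violation(params) -> nonnegative constraint-violation measure
    opt_step(loss_fn, params) -> params after one optimizer step
    """
    prev_v = float("inf")
    for _ in range(epochs):
        params = opt_step(lambda p: fit_loss(p) + lam * violation(p), params)
        v = violation(params)
        if v > tol * prev_v:   # violation not decreasing fast enough
            lam *= grow        # tighten the penalty
        prev_v = v
    return params
```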

Efficient Interaction-Aware Interval Analysis of Neural Network Feedback Loops

  • paper_url: http://arxiv.org/abs/2307.14938
  • repo_url: None
  • paper_authors: Saber Jafarpour, Akash Harapanahalli, Samuel Coogan
  • for: This paper proposes a computationally efficient framework for interval reachability of systems with neural network controllers.
  • methods: The approach leverages inclusion functions for the open-loop system and the neural network controller to embed the closed-loop system into a larger-dimensional embedding system, where a single trajectory over-approximates the original system's behavior under uncertainty. Two methods are proposed for constructing closed-loop embedding systems, accounting for the interactions between the system and the controller in different ways: an interconnection-based approach and an interaction-based approach that uses novel Jacobian-based inclusion functions built on state-of-the-art neural network verifiers.
  • results: The approach is implemented in a Python framework called ReachMM, and its efficiency and scalability are demonstrated on benchmarks and examples with up to 200 state dimensions.
    Abstract In this paper, we propose a computationally efficient framework for interval reachability of systems with neural network controllers. Our approach leverages inclusion functions for the open-loop system and the neural network controller to embed the closed-loop system into a larger-dimensional embedding system, where a single trajectory over-approximates the original system's behavior under uncertainty. We propose two methods for constructing closed-loop embedding systems, which account for the interactions between the system and the controller in different ways. The interconnection-based approach considers the worst-case evolution of each coordinate separately by substituting the neural network inclusion function into the open-loop inclusion function. The interaction-based approach uses novel Jacobian-based inclusion functions to capture the first-order interactions between the open-loop system and the controller by leveraging state-of-the-art neural network verifiers. Finally, we implement our approach in a Python framework called ReachMM to demonstrate its efficiency and scalability on benchmarks and examples ranging to $200$ state dimensions.
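
To give a flavor of inclusion functions for neural networks, the sketch below propagates an interval [l, u] through one affine layer followed by a ReLU using the standard positive/negative weight split; this is textbook interval bound propagation, not ReachMM's implementation.

```python
import numpy as np

def affine_relu_bounds(l, u, W, b):
    """Interval bounds through x -> relu(W @ x + b) for x in [l, u]."""
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    lo = Wp @ l + Wn @ u + b      # lower bound: l on positive weights
    hi = Wp @ u + Wn @ l + b      # upper bound: u on positive weights
    return np.maximum(lo, 0), np.maximum(hi, 0)

l, u = np.array([-0.1, 0.2]), np.array([0.1, 0.3])
W, b = np.array([[1.0, -2.0], [0.5, 1.0]]), np.zeros(2)
print(affine_relu_bounds(l, u, W, b))
```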

PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback

  • paper_url: http://arxiv.org/abs/2307.14936
  • repo_url: None
  • paper_authors: Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang
  • for: This paper aims to boost the code generation performance of pre-trained code models.
  • methods: It proposes a novel RRTF (Rank Responses to align Test & Teacher Feedback) framework to effectively and efficiently boost pre-trained large language models for code generation.
  • results: PanGu-Coder2 achieves 62.20% pass@1 on the OpenAI HumanEval benchmark and consistently outperforms all previous Code LLMs on the CoderEval and LeetCode benchmarks.
    Abstract Large Language Models for Code (Code LLM) are flourishing. New and powerful models are released on a weekly basis, demonstrating remarkable performance on the code generation task. Various approaches have been proposed to boost the code generation performance of pre-trained Code LLMs, such as supervised fine-tuning, instruction tuning, reinforcement learning, etc. In this paper, we propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation. Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark. Furthermore, through an extensive evaluation on CoderEval and LeetCode benchmarks, we show that PanGu-Coder2 consistently outperforms all previous Code LLMs.
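
The ranking signal at the heart of such a framework can be illustrated by scoring sampled programs against unit tests and keeping a ranked list for preference-style training; the pass-rate scoring below is our simplification of the test-feedback component.

```python
from typing import Callable, List, Tuple

def rank_by_tests(candidates: List[str],
                  run_tests: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Rank generated programs by the fraction of unit tests they pass.

    run_tests(code) -> pass rate in [0, 1]; higher-ranked responses can then
    serve as preferred examples in a ranking/feedback training objective.
    """
    scored = [(run_tests(code), code) for code in candidates]
    return sorted(scored, key=lambda t: t[0], reverse=True)
```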

Solving Data Quality Problems with Desbordante: a Demo

  • paper_url: http://arxiv.org/abs/2307.14935
  • repo_url: None
  • paper_authors: George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov, Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh, Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia Barutkin, Vladislav Shalnev, Kirill Muraviev, Anna Rakhmukova, Dmitriy Shcheka, Anton Chernikov, Mikhail Vyrodov, Yaroslav Kurbatov, Maxim Fofanov, Sergei Belokonnyi, Pavel Anosov, Arthur Saliou, Eduard Gaisin, Kirill Smirnov
  • for: This paper aims to address the limitations of existing data profiling systems by providing a new open-source data profiler called Desbordante, which is efficient, scalable, and provides explanations for complex statistics.
  • methods: Desbordante uses C++ for costly operations and seamless Python integration to efficiently mine data for functional dependencies, data constraints, association rules, and other complex statistics.
  • results: The paper demonstrates several scenarios for using Desbordante to solve data quality problems, including typo detection, data deduplication, and data anomaly detection.
    Abstract Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data. Because of that, these patterns are effectively rest in thin air: their application scope is rather limited, they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
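
As a tiny illustration of one statistic such profilers mine, the snippet below naively validates an exact functional dependency A -> B in pandas; this is for intuition only and is not Desbordante's optimized C++ algorithm or its API.

```python
import pandas as pd

def holds_fd(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
    """True iff each value of `lhs` maps to exactly one value of `rhs`."""
    return bool((df.groupby(lhs)[rhs].nunique() <= 1).all())

df = pd.DataFrame({"zip": [10, 10, 20], "city": ["A", "A", "B"]})
print(holds_fd(df, "zip", "city"))   # True: zip -> city
print(holds_fd(df, "city", "zip"))   # True on this toy data as well
```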

Approximate Model-Based Shielding for Safe Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.00707
  • repo_url: https://github.com/sacktock/ambs
  • paper_authors: Alexander W. Goodall, Francesco Belardinelli
  • for: This paper aims to make reinforcement learning safer to apply to safety-critical systems by verifying that learned policies satisfy given safety requirements.
  • methods: It proposes approximate model-based shielding (AMBS), a principled look-ahead shielding algorithm for verifying the performance of learned RL policies with respect to a set of given safety constraints. AMBS does not require prior knowledge of the safety-relevant dynamics of the system and comes with a strong theoretical justification.
  • results: AMBS outperforms other safety-aware approaches on a set of Atari games with state-dependent safety labels.
    Abstract Reinforcement learning (RL) has shown great potential for solving complex tasks in a variety of domains. However, applying RL to safety-critical systems in the real-world is not easy as many algorithms are sample-inefficient and maximising the standard RL objective comes with no guarantees on worst-case performance. In this paper we propose approximate model-based shielding (AMBS), a principled look-ahead shielding algorithm for verifying the performance of learned RL policies w.r.t. a set of given safety constraints. Our algorithm differs from other shielding approaches in that it does not require prior knowledge of the safety-relevant dynamics of the system. We provide a strong theoretical justification for AMBS and demonstrate superior performance to other safety-aware approaches on a set of Atari games with state-dependent safety-labels.
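
The look-ahead shielding idea can be sketched as a runtime check: roll the candidate action forward in a learned model and fall back to a safe policy if a violation is predicted within the horizon. The interfaces, horizon, and fallback below are illustrative, not the paper's AMBS procedure.

```python
def shielded_action(state, policy, safe_policy, model, unsafe, horizon=5):
    """Return policy(state) unless a model rollout predicts a violation."""
    a = policy(state)
    s = state
    for t in range(horizon):
        # Imagine the next state: apply the candidate action first,
        # then follow the policy for the rest of the lookahead.
        s = model(s, a if t == 0 else policy(s))
        if unsafe(s):
            return safe_policy(state)     # shield: override the action
    return a
```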

Graph-based Polyphonic Multitrack Music Generation

  • paper_url: http://arxiv.org/abs/2307.14928
  • repo_url: https://github.com/emanuelecosenza/polyphemus
  • paper_authors: Emanuele Cosenza, Andrea Valenti, Davide Bacciu
  • for: This paper proposes a graph-based deep learning system for polyphonic multitrack music generation.
  • methods: It introduces a novel graph representation for music together with a deep Variational Autoencoder that generates the structure and the content of musical graphs separately, one after the other, with a hierarchical architecture that matches the structural priors of music.
  • results: After training on existing MIDI datasets, the model generates appealing short and long musical sequences and realistically interpolates between them, producing music that is tonally and rhythmically consistent; visualization of the embeddings shows that the model organizes its latent space in accordance with known musical concepts.
    Abstract Graphs can be leveraged to model polyphonic multitrack symbolic music, where notes, chords and entire sections may be linked at different levels of the musical hierarchy by tonal and rhythmic relationships. Nonetheless, there is a lack of works that consider graph representations in the context of deep learning systems for music generation. This paper bridges this gap by introducing a novel graph representation for music and a deep Variational Autoencoder that generates the structure and the content of musical graphs separately, one after the other, with a hierarchical architecture that matches the structural priors of music. By separating the structure and content of musical graphs, it is possible to condition generation by specifying which instruments are played at certain times. This opens the door to a new form of human-computer interaction in the context of music co-creation. After training the model on existing MIDI datasets, the experiments show that the model is able to generate appealing short and long musical sequences and to realistically interpolate between them, producing music that is tonally and rhythmically consistent. Finally, the visualization of the embeddings shows that the model is able to organize its latent space in accordance with known musical concepts.

Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems

  • paper_url: http://arxiv.org/abs/2307.14921
  • repo_url: None
  • paper_authors: Warren R. Williams, S. Ross Glandon, Luke L. Morris, Jing-Ru C. Cheng
  • for: The goal is to provide performance-benchmarking information for HPC systems, enabling increased performance and improved job schedulers.
  • methods: The paper develops a benchmarking tool built around a machine learning model that gathers performance data on GPU-accelerated nodes while they perform material segmentation analysis; the benchmark uses a model converted from Caffe to PyTorch with the MMdnn toolkit and the MINC-2500 dataset.
  • results: While Vulcanite has faster model times in a large number of benchmarks, it is also more subject to environmental factors that can make it slower than Onyx; in contrast, model times on Onyx are consistent across benchmarks.
    Abstract Performance benchmarking of HPC systems is an ongoing effort that seeks to provide information that will allow for increased performance and improve the job schedulers that manage these systems. We develop a benchmarking tool that utilizes machine learning models and gathers performance data on GPU-accelerated nodes while they perform material segmentation analysis. The benchmark uses an ML model that has been converted from Caffe to PyTorch using the MMdnn toolkit and the MINC-2500 dataset. Performance data is gathered on two ERDC DSRC systems, Onyx and Vulcanite. The data reveals that while Vulcanite has faster model times in a large number of benchmarks, it is also more subject to environmental factors that can make its performance slower than Onyx's. In contrast, the model times from Onyx are consistent across benchmarks.
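
Benchmarking GPU model times requires synchronizing the device before reading the clock, since CUDA kernels launch asynchronously; a minimal PyTorch timing harness in that spirit is sketched below, with an arbitrary stand-in model rather than the paper's segmentation network.

```python
import time
import torch

def time_forward(model, x, warmup=10, iters=100):
    """Mean forward-pass latency in milliseconds on the current device."""
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()          # drain queued kernels
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Conv2d(3, 16, 3).to(device)
print(f"{time_forward(net, torch.randn(8, 3, 224, 224, device=device)):.2f} ms")
```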

NSA: Naturalistic Support Artifact to Boost Network Confidence

  • paper_url: http://arxiv.org/abs/2307.14917
  • repo_url: None
  • paper_authors: Abhijith Sharma, Phil Munz, Apurva Narayan
  • for: The paper is written to address the vulnerability of visual AI systems to natural and synthetic physical corruptions in the real-world, and to propose a novel approach called naturalistic support artifacts (NSA) to improve the robustness of visual AI systems.
  • methods: The paper uses a combination of deep learning techniques, including convolutional neural networks (CNNs) and generative adversarial networks (GANs), to generate naturalistic support artifacts (NSAs) that can be added to the scene to improve the robustness of visual AI systems.
  • results: The paper demonstrates the effectiveness of NSAs in improving prediction confidence scores by four times against natural corruptions on the Imagenette dataset, and also shows an average improvement of 8% in adversarial accuracy. The paper also provides qualitative analysis of NSAs using saliency maps to understand how they help improve prediction confidence.
    Abstract Visual AI systems are vulnerable to natural and synthetic physical corruption in the real-world. Such corruption often arises unexpectedly and alters the model's performance. In recent years, the primary focus has been on adversarial attacks. However, natural corruptions (e.g., snow, fog, dust) are an omnipresent threat to visual AI systems and should be considered equally important. Many existing works propose interesting solutions to train robust models against natural corruption. These works either leverage image augmentations, which come with the additional cost of model training, or place suspicious patches in the scene to design unadversarial examples. In this work, we propose the idea of naturalistic support artifacts (NSA) for robust prediction. The NSAs are shown to be beneficial in scenarios where model parameters are inaccessible and adding artifacts in the scene is feasible. The NSAs are natural looking objects generated through artifact training using DC-GAN to have high visual fidelity in the scene. We test against natural corruptions on the Imagenette dataset and observe the improvement in prediction confidence score by four times. We also demonstrate NSA's capability to increase adversarial accuracy by 8\% on average. Lastly, we qualitatively analyze NSAs using saliency maps to understand how they help improve prediction confidence.

Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning

  • paper_url: http://arxiv.org/abs/2307.15099
  • repo_url: None
  • paper_authors: Keisuke Kubota, Masahiro Okuda
  • for: Illustrations distributed on social media such as Twitter and Pixiv have grown with the popularity of animation, games, and animated movies, and the "atmosphere" of an illustration plays an important role in user preferences; classifying illustrations by atmosphere can therefore help recommendations and searches.
  • methods: The paper combines supervised and unsupervised learning with pseudo-labels: feature vectors are obtained with a supervised method using pseudo-labels that capture the ambiguous atmosphere, and clustering is then performed on these feature vectors.
  • results: Experimental analyses show that the method outperforms conventional methods in human-like clustering on datasets manually classified by humans.
    Abstract The distribution of illustrations on social media, such as Twitter and Pixiv has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of illustrations plays an important role in user preferences. Classifying illustrations by atmosphere can be helpful for recommendations and searches. However, assigning clear labels to the elusive "atmosphere" and conventional supervised classification is not always practical. Furthermore, even images with similar colors, edges, and low-level features may not have similar atmospheres, making classification based on low-level features challenging. In this paper, this problem is solved using both supervised and unsupervised learning with pseudo-labels. The feature vectors are obtained using the supervised method with pseudo-labels that contribute to an ambiguous atmosphere. Further, clustering is performed based on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.
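
A minimal sketch of the two-stage recipe: take penultimate-layer features from a supervised network (standing in for the pseudo-label-trained model) and cluster them with k-means; the pretrained ResNet-18 backbone, random input batch, and cluster count are placeholders.

```python
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

# Stage 1: feature extractor (stand-in for the pseudo-label-trained model).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()            # expose 512-d features
backbone.eval()

images = torch.randn(32, 3, 224, 224)        # placeholder illustration batch
with torch.no_grad():
    feats = backbone(images).numpy()

# Stage 2: unsupervised clustering of the feature vectors.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(feats)
print(labels[:10])
```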

Scaling Session-Based Transformer Recommendations using Optimized Negative Sampling and Loss Functions

  • paper_url: http://arxiv.org/abs/2307.14906
  • repo_url: https://github.com/otto-de/tron
  • paper_authors: Timo Wilm, Philipp Normann, Sophie Baumeister, Paul-Vincent Kobow
  • for: This work introduces a scalable session-based Transformer recommender designed to improve recommendation quality while remaining scalable.
  • methods: It uses optimized top-k negative sampling and listwise loss functions to enhance recommendation accuracy.
  • results: Evaluations on large-scale e-commerce datasets show that the approach improves on the recommendation quality of current methods while maintaining training speeds similar to SASRec; a live A/B test yielded an 18.14% increase in click-through rate over SASRec, highlighting TRON's potential in practical settings.
    Abstract This work introduces TRON, a scalable session-based Transformer Recommender using Optimized Negative-sampling. Motivated by the scalability and performance limitations of prevailing models such as SASRec and GRU4Rec+, TRON integrates top-k negative sampling and listwise loss functions to enhance its recommendation accuracy. Evaluations on relevant large-scale e-commerce datasets show that TRON improves upon the recommendation quality of current methods while maintaining training speeds similar to SASRec. A live A/B test yielded an 18.14% increase in click-through rate over SASRec, highlighting the potential of TRON in practical settings. For further research, we provide access to our source code at https://github.com/otto-de/TRON and an anonymized dataset at https://github.com/otto-de/recsys-dataset.
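
The combination of negative sampling with a listwise objective can be illustrated as a sampled-softmax cross-entropy over one positive item and k sampled negatives per session; the uniform sampling below is for brevity, whereas an optimized strategy would bias toward harder negatives and avoid collisions with the positive.

```python
import torch
import torch.nn.functional as F

def sampled_listwise_loss(session_emb, item_emb, pos_ids, num_neg=128):
    """Softmax cross-entropy over [positive, k sampled negatives] per session."""
    B = session_emb.size(0)
    neg_ids = torch.randint(0, item_emb.size(0), (B, num_neg))
    cand = torch.cat([pos_ids.unsqueeze(1), neg_ids], dim=1)    # (B, 1+k)
    logits = torch.einsum("bd,bkd->bk", session_emb, item_emb[cand])
    target = torch.zeros(B, dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits, target)

items = torch.randn(10_000, 64)
loss = sampled_listwise_loss(torch.randn(32, 64), items,
                             torch.randint(0, 10_000, (32,)))
print(loss.item())
```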

CodeLens: An Interactive Tool for Visualizing Code Representations

  • paper_url: http://arxiv.org/abs/2307.14902
  • repo_url: None
  • paper_authors: Yuejun Guo, Seifeddine Bettaieb, Qiang Hu, Yves Le Traon, Qiang Tang
  • for: This paper presents a visual interaction environment for code representations, helping developers quickly understand and explore different types of code representations.
  • methods: CodeLens supports multiple programming languages, such as Java, Python, and JavaScript, and four types of code representations: sequence of tokens, abstract syntax tree (AST), data flow graph (DFG), and control flow graph (CFG).
  • results: Using CodeLens, developers can quickly visualize a specific code representation and obtain the represented inputs for models of code.
    Abstract Representing source code in a generic input format is crucial to automate software engineering tasks, e.g., applying machine learning algorithms to extract information. Visualizing code representations can further enable human experts to gain an intuitive insight into the code. Unfortunately, as of today, there is no universal tool that can simultaneously visualise different types of code representations. In this paper, we introduce a tool, CodeLens, which provides a visual interaction environment that supports various representation methods and helps developers understand and explore them. CodeLens is designed to support multiple programming languages, such as Java, Python, and JavaScript, and four types of code representations, including sequence of tokens, abstract syntax tree (AST), data flow graph (DFG), and control flow graph (CFG). By using CodeLens, developers can quickly visualize the specific code representation and also obtain the represented inputs for models of code. The Web-based interface of CodeLens is available at http://www.codelens.org. The demonstration video can be found at http://www.codelens.org/demo.
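
Two of the four representations are easy to produce for Python code with the standard library alone, as sketched below: a token sequence via `tokenize` and an AST via `ast` (the `indent` argument to `ast.dump` needs Python 3.9+).

```python
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Representation 1: sequence of tokens.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(src).readline)
          if tok.string.strip()]
print(tokens)            # ['def', 'add', '(', 'a', ',', 'b', ')', ':', ...]

# Representation 2: abstract syntax tree.
print(ast.dump(ast.parse(src), indent=2))
```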

Self-Supervised Learning for Improved Synthetic Aperture Sonar Target Recognition

  • paper_url: http://arxiv.org/abs/2307.15098
  • repo_url: None
  • paper_authors: BW Sheffield
  • for: This study explores the application of self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery.
  • methods: It evaluates two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised learning model ResNet18 on binary image classification tasks.
  • results: While both SSL models can outperform a fully supervised model with access to only a small number of labels in a few-shot scenario, they do not exceed it when all the labels are used.
    Abstract This study explores the application of self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery. The unique challenges of underwater environments make traditional computer vision techniques, which rely heavily on optical camera imagery, less effective. SAS, with its ability to generate high-resolution imagery, emerges as a preferred choice for underwater imaging. However, the voluminous high-resolution SAS data presents a significant challenge for labeling; a crucial step for training deep neural networks (DNNs). SSL, which enables models to learn features in data without the need for labels, is proposed as a potential solution to the data labeling challenge in SAS. The study evaluates the performance of two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised learning model, ResNet18, for binary image classification tasks. The findings suggest that while both SSL models can outperform a fully supervised model with access to a small number of labels in a few-shot scenario, they do not exceed it when all the labels are used. The results underscore the potential of SSL as a viable alternative to traditional supervised learning, capable of maintaining task performance while reducing the time and costs associated with data labeling. The study also contributes to the growing body of evidence supporting the use of SSL in remote sensing and could stimulate further research in this area.

Cascaded Cross-Modal Transformer for Request and Complaint Detection

  • paper_url: http://arxiv.org/abs/2307.15097
  • repo_url: https://github.com/ristea/ccmt
  • paper_authors: Nicolae-Catalin Ristea, Radu Tudor Ionescu
  • for: This paper proposes a novel cascaded cross-modal transformer (CCMT) for detecting customer requests and complaints in phone conversations.
  • methods: The approach leverages a multimodal paradigm: speech is transcribed with automatic speech recognition (ASR) models and the transcripts are translated into different languages; language-specific BERT-based models are then combined with Wav2Vec2.0 audio features in a cascaded cross-attention transformer model.
  • results: Applied to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, the system reaches unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
    Abstract We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our approach leverages a multimodal paradigm by transcribing the speech using automatic speech recognition (ASR) models and translating the transcripts into different languages. Subsequently, we combine language-specific BERT-based models with Wav2Vec2.0 audio features in a novel cascaded cross-attention transformer model. We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.
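
Cross-modal fusion of text and audio features can be sketched with PyTorch's built-in multi-head attention, letting text tokens attend over audio frames; the dimensions, pooling, and single block below are illustrative and do not reproduce the paper's cascaded architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text queries attend over audio keys/values, then pool for a classifier."""
    def __init__(self, d_model=256, n_heads=4, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, text_feats, audio_feats):
        fused, _ = self.attn(text_feats, audio_feats, audio_feats)
        fused = self.norm(fused + text_feats)       # residual connection
        return self.head(fused.mean(dim=1))         # mean-pool over tokens

logits = CrossModalBlock()(torch.randn(8, 40, 256), torch.randn(8, 300, 256))
print(logits.shape)   # torch.Size([8, 2])
```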

Generative convective parametrization of dry atmospheric boundary layer

  • paper_url: http://arxiv.org/abs/2307.14857
  • repo_url: None
  • paper_authors: Florian Heyder, Juan Pedro Mellado, Jörg Schumacher
  • for: The goal is a turbulence parametrization for the dry convective atmospheric boundary layer in kilometer-scale Earth system models, based on a generative machine learning algorithm.
  • methods: A generative adversarial network is used; the model incorporates the physics of self-similar layer growth following from the classical mixed-layer theory of Deardorff, which enhances the training database, and training is based on fully three-dimensional direct numerical simulation data.
  • results: The parametrization predicts the highly non-Gaussian transient statistics of buoyancy fluctuations, vertical velocity, and buoyancy flux at different heights, capturing the fastest thermals penetrating into the stabilized top region. Its results agree with standard two-equation or multi-plume stochastic mass-flux schemes, while additionally providing the granule-type horizontal organization of the turbulent convection, which other model closures cannot.
    Abstract Turbulence parametrizations will remain a necessary building block in kilometer-scale Earth system models. In convective boundary layers, where the mean vertical gradients of conserved properties such as potential temperature and moisture are approximately zero, the standard ansatz which relates turbulent fluxes to mean vertical gradients via an eddy diffusivity has to be extended by mass flux parametrizations for the typically asymmetric up- and downdrafts in the atmospheric boundary layer. In this work, we present a parametrization for a dry convective boundary layer based on a generative adversarial network. The model incorporates the physics of self-similar layer growth following from the classical mixed layer theory by Deardorff. This enhances the training data base of the generative machine learning algorithm and thus significantly improves the predicted statistics of the synthetically generated turbulence fields at different heights inside the boundary layer. The algorithm training is based on fully three-dimensional direct numerical simulation data. Differently to stochastic parametrizations, our model is able to predict the highly non-Gaussian transient statistics of buoyancy fluctuations, vertical velocity, and buoyancy flux at different heights thus also capturing the fastest thermals penetrating into the stabilized top region. The results of our generative algorithm agree with standard two-equation or multi-plume stochastic mass-flux schemes. The present parametrization provides additionally the granule-type horizontal organization of the turbulent convection which cannot be obtained in any of the other model closures. Our work paves the way to efficient data-driven convective parametrizations in other natural flows, such as moist convection, upper ocean mixing, or convection in stellar interiors.

Counterfactual Explanations for Graph Classification Through the Lenses of Density

  • paper_url: http://arxiv.org/abs/2307.14849
  • repo_url: https://github.com/carlo-abrate/Counterfactual-Explanations-for-Graph-Classification-Through-the-Lenses-of-Density
  • paper_authors: Carlo Abrate, Giulia Preti, Francesco Bonchi
  • for: Provide density-based counterfactual examples for graph classifiers in order to better explain classification results.
  • methods: A density-based counterfactual search framework that can be instantiated with different notions of dense substructures, such as triangles or maximal cliques.
  • results: Experiments show that adopting density as the unit of change is key to defining versatile and interpretable counterfactual explanations.
    Abstract Counterfactual examples have emerged as an effective approach to produce simple and understandable post-hoc explanations. In the context of graph classification, previous work has focused on generating counterfactual explanations by manipulating the most elementary units of a graph, i.e., removing an existing edge, or adding a non-existing one. In this paper, we claim that such language of explanation might be too fine-grained, and turn our attention to some of the main characterizing features of real-world complex networks, such as the tendency to close triangles, the existence of recurring motifs, and the organization into dense modules. We thus define a general density-based counterfactual search framework to generate instance-level counterfactual explanations for graph classifiers, which can be instantiated with different notions of dense substructures. In particular, we show two specific instantiations of this general framework: a method that searches for counterfactual graphs by opening or closing triangles, and a method driven by maximal cliques. We also discuss how the general method can be instantiated to exploit any other notion of dense substructures, including, for instance, a given taxonomy of nodes. We evaluate the effectiveness of our approaches in 7 brain network datasets and compare the counterfactual statements generated according to several widely-used metrics. Results confirm that adopting a semantic-relevant unit of change like density is essential to define versatile and interpretable counterfactual explanation methods.

Kernelised Normalising Flows

  • paper_url: http://arxiv.org/abs/2307.14839
  • repo_url: None
  • paper_authors: Eshant English, Matthias Kirchler, Christoph Lippert
  • for: Study normalising flows as generative models and propose a new kernelised normalising flow paradigm to improve their expressiveness.
  • methods: Ferumal flow, which integrates kernels into the normalising flow framework.
  • results: Kernelised flows yield competitive or superior results compared with neural-network-based flows while remaining parameter efficient; they excel especially in the low-data regime, enabling flexible non-parametric density estimation when data are sparse.
    Abstract Normalising Flows are generative models characterised by their invertible architecture. However, the requirement of invertibility imposes constraints on their expressiveness, necessitating a large number of parameters and innovative architectural designs to achieve satisfactory outcomes. Whilst flow-based models predominantly rely on neural-network-based transformations for expressive designs, alternative transformation methods have received limited attention. In this work, we present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework. Our results demonstrate that a kernelised flow can yield competitive or superior results compared to neural network-based flows whilst maintaining parameter efficiency. Kernelised flows excel especially in the low-data regime, enabling flexible non-parametric density estimation in applications with sparse data availability.

Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database

  • paper_url: http://arxiv.org/abs/2308.08517
  • repo_url: None
  • paper_authors: Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar
  • for: Automatically annotate a large medical radiology image database to support computer-aided diagnosis, improving efficiency and accuracy.
  • methods: An unsupervised, automated annotation approach using multimodal sources (images, DICOM metadata, and narrative diagnoses); several feature extractors are evaluated for each data source to select the best ones.
  • results: Fusing the embeddings of all data sources works best for unsupervised clustering of large-scale medical data, yielding the most concise clusters; this is a first step toward a much larger and more fine-grained annotated medical imaging database.
    Abstract Background and objective: The usage of machine learning in medical diagnosis and treatment has witnessed significant growth in recent years through the development of computer-aided diagnosis systems that are often relying on annotated medical radiology images. However, the availability of large annotated image datasets remains a major obstacle since the process of annotation is time-consuming and costly. This paper explores how to automatically annotate a database of medical radiology images with regard to their semantic similarity. Material and methods: An automated, unsupervised approach is used to construct a large annotated dataset of medical radiology images originating from Clinical Hospital Centre Rijeka, Croatia, utilising multimodal sources, including images, DICOM metadata, and narrative diagnoses. Several appropriate feature extractors are tested for each of the data sources, and their utility is evaluated using k-means and k-medoids clustering on a representative data subset. Results: The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation. Conclusion: The results suggest that fusing the embeddings of all three data sources together works best for the task of unsupervised clustering of large-scale medical data, resulting in the most concise clusters. Hence, this work is the first step towards building a much larger and more fine-grained annotated dataset of medical radiology images.

Fading memory as inductive bias in residual recurrent networks

  • paper_url: http://arxiv.org/abs/2307.14823
  • repo_url: None
  • paper_authors: Igor Dubinin, Felix Effenberger
  • for: Study how residual connections in recurrent neural networks (RNNs) influence task performance, network dynamics, and fading-memory properties.
  • methods: Introduce weakly coupled residual recurrent networks (WCRNNs) and investigate how their residual connections affect performance, dynamics, and memory.
  • results: Distinct forms of residual connections provide effective inductive biases that increase network expressivity, in particular those that place the dynamics near the edge of chaos, let the network exploit characteristic spectral properties of the data, or yield heterogeneous memory properties; the results extend to non-linear residuals and to a weakly coupled residual initialization scheme for Elman RNNs.
    Abstract Residual connections have been proposed as an architecture-based inductive bias to mitigate the problem of exploding and vanishing gradients and increase task performance in both feed-forward and recurrent networks (RNNs) when trained with the backpropagation algorithm. Yet, little is known about how residual connections in RNNs influence their dynamics and fading memory properties. Here, we introduce weakly coupled residual recurrent networks (WCRNNs) in which residual connections result in well-defined Lyapunov exponents and allow for studying properties of fading memory. We investigate how the residual connections of WCRNNs influence their performance, network dynamics, and memory properties on a set of benchmark tasks. We show that several distinct forms of residual connections yield effective inductive biases that result in increased network expressivity. In particular, residual connections that (i) result in network dynamics at the proximity of the edge of chaos, (ii) allow networks to capitalize on characteristic spectral properties of the data, and (iii) result in heterogeneous memory properties are shown to increase practical expressivity. In addition, we demonstrate how our results can be extended to non-linear residuals and introduce a weakly coupled residual initialization scheme that can be used for Elman RNNs.

Likely, Light, and Accurate Context-Free Clusters-based Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2307.14788
  • repo_url: None
  • paper_authors: Tiago Rodrigues de Almeida, Oscar Martinez Mozos
  • for: Propose a multi-stage probabilistic approach for trajectory forecasting in road transportation systems.
  • methods: Trajectory transformation to displacement space, clustering of displacement time series, trajectory proposals, and ranking of proposals; a new deep feature clustering method based on a self-conditioned GAN copes better with distribution shifts, and novel distance-based ranking proposals assign probabilities to generated trajectories.
  • results: The overall system surpasses context-free deep generative models on human and road-agent trajectory data, while being more efficient than, yet as accurate as, point estimators when ranking the most likely trajectory.
    Abstract Autonomous systems in the road transportation network require intelligent mechanisms that cope with uncertainty to foresee the future. In this paper, we propose a multi-stage probabilistic approach for trajectory forecasting: trajectory transformation to displacement space, clustering of displacement time series, trajectory proposals, and ranking proposals. We introduce a new deep feature clustering method, underlying self-conditioned GAN, which copes better with distribution shifts than traditional methods. Additionally, we propose novel distance-based ranking proposals to assign probabilities to the generated trajectories that are more efficient yet accurate than an auxiliary neural network. The overall system surpasses context-free deep generative models in human and road agents trajectory data while performing similarly to point estimators when comparing the most probable trajectory.

Emotion4MIDI: a Lyrics-based Emotion-Labeled Symbolic Music Dataset

  • paper_url: http://arxiv.org/abs/2307.14783
  • repo_url: https://github.com/serkansulun/lyricsemotions
  • paper_authors: Serkan Sulun, Pedro Oliveira, Paula Viana
  • for: Create a large-scale emotion-labeled symbolic music dataset.
  • methods: Train emotion classification models on the GoEmotions dataset, achieving state-of-the-art results with a model half the size of the baseline, and apply them to lyrics from two large-scale MIDI datasets.
  • results: A broad emotion-labeled music dataset covering fine-grained emotions, a valuable resource for exploring the connection between music and emotion and, especially, for developing models that generate music conditioned on specific emotions; inference code, trained models, and the dataset are available online.
    Abstract We present a new large-scale emotion-labeled symbolic music dataset consisting of 12k MIDI songs. To create this dataset, we first trained emotion classification models on the GoEmotions dataset, achieving state-of-the-art results with a model half the size of the baseline. We then applied these models to lyrics from two large-scale MIDI datasets. Our dataset covers a wide range of fine-grained emotions, providing a valuable resource to explore the connection between music and emotions and, especially, to develop models that can generate music based on specific emotions. Our code for inference, trained models, and datasets are available online.

MATNilm: Multi-appliance-task Non-intrusive Load Monitoring with Limited Labeled Data

  • paper_url: http://arxiv.org/abs/2307.14778
  • repo_url: https://github.com/jxiong22/matnilm
  • paper_authors: Jing Xiong, Tianqi Hong, Dongbo Zhao, Yu Zhang
  • for: Propose an efficient and accurate non-intrusive load monitoring (NILM) method to disaggregate the operating state and energy consumption of household appliances.
  • methods: A multi-appliance-task framework with a training-efficient sample augmentation (SA) scheme that reduces the amount of labeled data required; each appliance has a shared-hierarchical split structure for its regression and classification tasks, and a two-dimensional attention mechanism captures spatio-temporal correlations among all appliances.
  • results: Simulation results show that the proposed approach significantly outperforms many baseline models, reducing relative errors by more than 50% on average.
    Abstract Non-intrusive load monitoring (NILM) identifies the status and power consumption of various household appliances by disaggregating the total power usage signal of an entire house. Efficient and accurate load monitoring facilitates user profile establishment, intelligent household energy management, and peak load shifting. This is beneficial for both the end-users and utilities by improving the overall efficiency of a power distribution network. Existing approaches mainly focus on developing an individual model for each appliance. Those approaches typically rely on a large amount of household-labeled data which is hard to collect. In this paper, we propose a multi-appliance-task framework with a training-efficient sample augmentation (SA) scheme that boosts the disaggregation performance with limited labeled data. For each appliance, we develop a shared-hierarchical split structure for its regression and classification tasks. In addition, we also propose a two-dimensional attention mechanism in order to capture spatio-temporal correlations among all appliances. With only one-day training data and limited appliance operation profiles, the proposed SA algorithm can achieve comparable test performance to the case of training with the full dataset. Finally, simulation results show that our proposed approach features a significantly improved performance over many baseline models. The relative errors can be reduced by more than 50% on average. The codes of this work are available at https://github.com/jxiong22/MATNilm

Towards Practicable Sequential Shift Detectors

  • paper_url: http://arxiv.org/abs/2307.14758
  • repo_url: None
  • paper_authors: Oliver Cobb, Arnaud Van Looveren
  • for: Discuss the harmful effects of distribution shift on deployed models and what it takes to detect such shifts in practice.
  • methods: Review existing sequential shift-detection methods and assess their practicability.
  • results: Several requirements crucial to practicable deployment are overlooked by existing methods; the paper identifies three such desiderata and recommends impactful directions for future research.
    Abstract There is a growing awareness of the harmful effects of distribution shift on the performance of deployed machine learning models. Consequently, there is a growing interest in detecting these shifts before associated costs have time to accumulate. However, desiderata of crucial importance to the practicable deployment of sequential shift detectors are typically overlooked by existing works, precluding their widespread adoption. We identify three such desiderata, highlight existing works relevant to their satisfaction, and recommend impactful directions for future research.

Fair Machine Unlearning: Data Removal while Mitigating Disparities

  • paper_url: http://arxiv.org/abs/2307.14754
  • repo_url: None
  • paper_authors: Alex Oesterling, Jiaqi Ma, Flavio P. Calmon, Hima Lakkaraju
  • for: Provide a machine unlearning method that reliably removes data instances while protecting individual privacy and preserving fairness.
  • methods: The first fair machine unlearning method, which provably and efficiently unlearns data instances while preserving group fairness.
  • results: Experiments on real-world datasets show that the method unlearns data instances while preserving group fairness.
    Abstract As public consciousness regarding the collection and use of personal information by corporations grows, it is of increasing importance that consumers be active participants in the curation of corporate datasets. In light of this, data governance frameworks such as the General Data Protection Regulation (GDPR) have outlined the right to be forgotten as a key principle allowing individuals to request that their personal data be deleted from the databases and models used by organizations. To achieve forgetting in practice, several machine unlearning methods have been proposed to address the computational inefficiencies of retraining a model from scratch with each unlearning request. While efficient online alternatives to retraining, it is unclear how these methods impact other properties critical to real-world applications, such as fairness. In this work, we propose the first fair machine unlearning method that can provably and efficiently unlearn data instances while preserving group fairness. We derive theoretical results which demonstrate that our method can provably unlearn data instances while maintaining fairness objectives. Extensive experimentation with real-world datasets highlight the efficacy of our method at unlearning data instances while preserving fairness.

FLARE: Fingerprinting Deep Reinforcement Learning Agents using Universal Adversarial Masks

  • paper_url: http://arxiv.org/abs/2307.14751
  • repo_url: https://github.com/ssg-research/FLARE
  • paper_authors: Buse G. A. Tekgul, N. Asokan
  • for: Verify whether a suspected deep reinforcement learning (DRL) policy is an illegitimate copy of another (victim) policy.
  • methods: Find non-transferable universal adversarial masks whose adversarial examples transfer from the victim policy to its modified versions but not to independently trained policies, and use these masks as fingerprints to verify a policy's true ownership.
  • results: Verifies stolen policies with 100% action agreement while never falsely accusing independent policies; robust against model modification attacks and not easily evaded by more informed adversaries without negatively impacting agent performance.
    Abstract We propose FLARE, the first fingerprinting mechanism to verify whether a suspected Deep Reinforcement Learning (DRL) policy is an illegitimate copy of another (victim) policy. We first show that it is possible to find non-transferable, universal adversarial masks, i.e., perturbations, to generate adversarial examples that can successfully transfer from a victim policy to its modified versions but not to independently trained policies. FLARE employs these masks as fingerprints to verify the true ownership of stolen DRL policies by measuring an action agreement value over states perturbed via such masks. Our empirical evaluations show that FLARE is effective (100% action agreement on stolen copies) and does not falsely accuse independent policies (no false positives). FLARE is also robust to model modification attacks and cannot be easily evaded by more informed adversaries without negatively impacting agent performance. We also show that not all universal adversarial masks are suitable candidates for fingerprints due to the inherent characteristics of DRL policies. The spatio-temporal dynamics of DRL problems and sequential decision-making process make characterizing the decision boundary of DRL policies more difficult, as well as searching for universal masks that capture the geometry of it.

Semantic Image Completion and Enhancement using GANs

  • paper_url: http://arxiv.org/abs/2307.14748
  • repo_url: None
  • paper_authors: Priyansh Saxena, Raahat Gupta, Akshat Maheshwari, Saumil Maheshwari
  • for: Propose a generative adversarial network (GAN) based approach to image completion and enhancement.
  • methods: GAN architectures are used to perform image completion and enhancement tasks.
  • results: GANs can effectively complete and enhance images, improving the quality of the output image.
    Abstract Semantic inpainting or image completion alludes to the task of inferring arbitrary large missing regions in images based on image semantics. Since the prediction of image pixels requires an indication of high-level context, this makes it significantly tougher than image completion, which is often more concerned with correcting data corruption and removing entire objects from the input image. On the other hand, image enhancement attempts to eliminate unwanted noise and blur from the image, along with sustaining most of the image details. Efficient image completion and enhancement model should be able to recover the corrupted and masked regions in images and then refine the image further to increase the quality of the output image. Generative Adversarial Networks (GAN), have turned out to be helpful in picture completion tasks. In this chapter, we will discuss the underlying GAN architecture and how they can be used used for image completion tasks.

A Strategic Framework for Optimal Decisions in Football 1-vs-1 Shot-Taking Situations: An Integrated Approach of Machine Learning, Theory-Based Modeling, and Game Theory

  • paper_url: http://arxiv.org/abs/2307.14732
  • repo_url: https://github.com/calvinyeungck/analyzing-two-agents-interaction-in-football-shot-taking-situations
  • paper_authors: Calvin C. K. Yeung, Keisuke Fujii
  • for: The paper aims to analyze critical scenarios in football, specifically 1-vs-1 shot-taking situations, using game theory and machine learning.
  • methods: The proposed framework uses ML models to estimate expected payoffs and extracts additional features with a theory-based shot-block model; the xSOT metric is introduced to evaluate players' actions even if the shot results in no goal, allowing for effective differentiation, comparison, and counterfactual shot-situation analysis.
  • results: The framework is validated through experiments and shows a high correlation with existing metrics, indicating that xSOT provides valuable insights; it is used to analyze optimal strategies in the World Cup 2022 and a shot situation in EURO 2020.
    Abstract Complex interactions between two opposing agents frequently occur in domains of machine learning, game theory, and other application domains. Quantitatively analyzing the strategies involved can provide an objective basis for decision-making. One such critical scenario is shot-taking in football, where decisions, such as whether the attacker should shoot or pass the ball and whether the defender should attempt to block the shot, play a crucial role in the outcome of the game. However, there are currently no effective data-driven and/or theory-based approaches to analyzing such situations. To address this issue, we proposed a novel framework to analyze such scenarios based on game theory, where we estimate the expected payoff with machine learning (ML) models, and additional features for ML models were extracted with a theory-based shot block model. Conventionally, successes or failures (1 or 0) are used as payoffs, while a success shot (goal) is extremely rare in football. Therefore, we proposed the Expected Probability of Shot On Target (xSOT) metric to evaluate players' actions even if the shot results in no goal; this allows for effective differentiation and comparison between different shots and even enables counterfactual shot situation analysis. In our experiments, we have validated the framework by comparing it with baseline and ablated models. Furthermore, we have observed a high correlation between the xSOT and existing metrics. This alignment of information suggests that xSOT provides valuable insights. Lastly, as an illustration, we studied optimal strategies in the World Cup 2022 and analyzed a shot situation in EURO 2020.

Understanding Silent Failures in Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14729
  • repo_url: https://github.com/iml-dkfz/sf-visuals
  • paper_authors: Till J. Bungert, Levin Kobelke, Paul F. Jaeger
  • for: Prevent silent failures to ensure the reliable use of classification systems in medical applications.
  • methods: Either design classifiers robust enough to avoid failures in the first place, or detect remaining failures with confidence scoring functions (CSFs).
  • results: A first comprehensive analysis across four biomedical tasks and diverse distribution shifts finds that none of the benchmarked CSFs can reliably prevent silent failures.
    Abstract To ensure the reliable use of classification systems in medical applications, it is crucial to prevent silent failures. This can be achieved by either designing classifiers that are robust enough to avoid failures in the first place, or by detecting remaining failures using confidence scoring functions (CSFs). A predominant source of failures in image classification is distribution shifts between training data and deployment data. To understand the current state of silent failure prevention in medical imaging, we conduct the first comprehensive analysis comparing various CSFs in four biomedical tasks and a diverse range of distribution shifts. Based on the result that none of the benchmarked CSFs can reliably prevent silent failures, we conclude that a deeper understanding of the root causes of failures in the data is required. To facilitate this, we introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures. On the basis of various examples, we demonstrate how this tool can help researchers gain insight into the requirements for safe application of classification systems in the medical domain. The open-source benchmark and tool are at: https://github.com/IML-DKFZ/sf-visuals.

The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions

  • paper_url: http://arxiv.org/abs/2307.14502
  • repo_url: https://github.com/leto19/commonvoice-demand
  • paper_authors: George Close, Thomas Hain, Stefan Goetze
  • for: Study the effect of spoken language when self-supervised speech representations (SSSRs) are used as feature transformations in loss functions for speech enhancement (SE).
  • methods: Train SE systems with self-supervised representations built from different language combinations and network structures, and test them on unseen languages.
  • results: The training language of the self-supervised representation has only a minor effect on enhancement performance, whereas the amount of training data in a given language matters greatly.
    Abstract Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, in prior work, very little attention has been paid to the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system. Enhancement models trained using a loss function which incorporates a self-supervised representation that shares exactly the language of the noisy data used to train the SE system show better performance than those which do not match exactly. This may lead to enhancement systems which are language specific and as such do not generalise well to unseen languages, unlike models trained using traditional spectrogram or time domain loss functions. In this work, SE models are trained and tested on a number of different languages, with self-supervised representations which themselves are trained using different language combinations and with differing network structures as loss function representations. These models are then tested across unseen languages and their performances are analysed. It is found that the training language of the self-supervised representation appears to have a minor effect on enhancement performance, the amount of training data of a particular language, however, greatly affects performance.

Robust vertebra identification using simultaneous node and edge predicting Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.02509
  • repo_url: https://github.com/imfusiongmbh/vid-vertebra-identification-dataset
  • paper_authors: Vincent Bürgin, Raphael Prevost, Marijn F. Stollenga
  • for: Propose a simple pipeline for automatic vertebra localization and identification in CT scans, including full orientation.
  • methods: A standard U-Net prediction followed by a single graph neural network (GNN) that associates and classifies vertebrae with full orientation; a new vertebra dataset with pedicle detections associated with vertebra bodies is introduced, creating a more challenging landmark prediction, association, and classification task.
  • results: Accurately associates vertebra bodies with the corresponding pedicle landmarks, ignores false positives, and classifies vertebrae in a simple, fully trainable pipeline; outperforms traditional approaches such as Hungarian matching and hidden Markov models, and shows competitive performance on the standard VerSe body identification task.
    Abstract Automatic vertebra localization and identification in CT scans is important for numerous clinical applications. Much progress has been made on this topic, but it mostly targets positional localization of vertebrae, ignoring their orientation. Additionally, most methods employ heuristics in their pipeline that can be sensitive in real clinical images which tend to contain abnormalities. We introduce a simple pipeline that employs a standard prediction with a U-Net, followed by a single graph neural network to associate and classify vertebrae with full orientation. To test our method, we introduce a new vertebra dataset that also contains pedicle detections that are associated with vertebra bodies, creating a more challenging landmark prediction, association and classification task. Our method is able to accurately associate the correct body and pedicle landmarks, ignore false positives and classify vertebrae in a simple, fully trainable pipeline avoiding application-specific heuristics. We show our method outperforms traditional approaches such as Hungarian Matching and Hidden Markov Models. We also show competitive performance on the standard VerSe challenge body identification task.

TimeGNN: Temporal Dynamic Graph Learning for Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2307.14680
  • repo_url: None
  • paper_authors: Nancy Xu, Chrysoula Kosma, Michalis Vazirgiannis
  • for: Propose a method that captures the relations among multivariate time series while forecasting them.
  • methods: A graph neural network architecture that jointly learns dynamic temporal graph representations and the correlations of multiple series.
  • results: Achieves inference times 4 to 80 times faster than other state-of-the-art graph-based methods with comparable forecasting performance.
    Abstract Time series forecasting lies at the core of important real-world applications in many fields of science and engineering. The abundance of large time series datasets that consist of complex patterns and long-term dependencies has led to the development of various neural network architectures. Graph neural network approaches, which jointly learn a graph structure based on the correlation of raw values of multivariate time series while forecasting, have recently seen great success. However, such solutions are often costly to train and difficult to scale. In this paper, we propose TimeGNN, a method that learns dynamic temporal graph representations that can capture the evolution of inter-series patterns along with the correlations of multiple series. TimeGNN achieves inference times 4 to 80 times faster than other state-of-the-art graph-based methods while achieving comparable forecasting performance.

Prediction of wind turbines power with physics-informed neural networks and evidential uncertainty quantification

  • paper_url: http://arxiv.org/abs/2307.14675
  • repo_url: None
  • paper_authors: Alfonso Gijón, Ainhoa Pujana-Goitia, Eugenio Perea, Miguel Molina-Solana, Juan Gómez-Romero
  • for: Optimize wind turbine operation and maintenance, including early fault detection.
  • methods: Physics-informed neural networks that reproduce historical data from four turbines in a wind farm while imposing physical constraints on the model.
  • results: The models accurately predict power, torque, and power coefficient, and an evidential layer provides reliable uncertainty estimates consistent with the absolute error, enabling confidence intervals on the power curve.
    Abstract The ever-growing use of wind energy makes necessary the optimization of turbine operations through pitch angle controllers and their maintenance with early fault detection. It is crucial to have accurate and robust models imitating the behavior of wind turbines, especially to predict the generated power as a function of the wind speed. Existing empirical and physics-based models have limitations in capturing the complex relations between the input variables and the power, aggravated by wind variability. Data-driven methods offer new opportunities to enhance wind turbine modeling of large datasets by improving accuracy and efficiency. In this study, we used physics-informed neural networks to reproduce historical data coming from 4 turbines in a wind farm, while imposing certain physical constraints to the model. The developed models for regression of the power, torque, and power coefficient as output variables showed great accuracy for both real data and physical equations governing the system. Lastly, introducing an efficient evidential layer provided uncertainty estimations of the predictions, proved to be consistent with the absolute error, and made possible the definition of a confidence interval in the power curve.

Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment

  • paper_url: http://arxiv.org/abs/2307.14668
  • repo_url: https://github.com/cuis15/xorder
  • paper_authors: Sen Cui, Weishen Pan, Changshui Zhang, Fei Wang
  • for: Achieve fairness in machine learning for the bipartite ranking scenario, where a ranking function should rank positive-class instances above negative ones.
  • methods: A model-agnostic post-processing framework, xOrder, which optimizes a weighted sum of the utility while identifying an optimal warping path across protected groups, solved via dynamic programming; it is compatible with various classification models and ranking fairness metrics, supervised or unsupervised, and applies to multiple protected groups.
  • results: On four benchmark datasets and two real-world electronic health record repositories, xOrder achieves a better balance between algorithm utility and ranking fairness; visualizations of the calibrated ranking scores show that it mitigates score-distribution shifts across groups compared with baselines, and additional analyses show robust performance with fewer samples and larger gaps between training and testing score distributions.
    Abstract Algorithmic fairness has been a serious concern and received lots of interest in machine learning community. In this paper, we focus on the bipartite ranking scenario, where the instances come from either the positive or negative class and the goal is to learn a ranking function that ranks positive instances higher than negative ones. While there could be a trade-off between fairness and performance, we propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking and maintaining the algorithm classification performance. In particular, we optimize a weighted sum of the utility as identifying an optimal warping path across different protected groups and solve it through a dynamic programming process. xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics. In addition to binary groups, xOrder can be applied to multiple protected groups. We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories. xOrder consistently achieves a better balance between the algorithm utility and ranking fairness on a variety of datasets with different metrics. From the visualization of the calibrated ranking scores, xOrder mitigates the score distribution shifts of different groups compared with baselines. Moreover, additional analytical results verify that xOrder achieves a robust performance when faced with fewer samples and a bigger difference between training and testing ranking score distributions.

Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

  • paper_url: http://arxiv.org/abs/2307.14657
  • repo_url: https://github.com/eurecom-s3/decodingmlsecretsofwindowsmalwareclassification
  • paper_authors: Savino Dambra, Yufei Han, Simone Aonzo, Platon Kotzias, Antonino Vitale, Juan Caballero, Davide Balzarotti, Leyla Bilge
  • for: Investigate open questions in ML-based malware detection and classification, including how dataset collection, feature extraction, and training choices influence results.
  • methods: Train state-of-the-art models for malware detection and family classification on the largest balanced malware dataset to date (67K samples from 670 families, 100 samples each), with features from both static and dynamic analysis.
  • results: Static features outperform dynamic features, and combining both yields only marginal improvement; packing shows no correlation with classification accuracy, while behaviors missing from dynamically extracted features heavily penalize performance; more families make classification harder, more samples per family increase accuracy, and models trained on a uniform distribution of samples per family generalize better to unseen data.
    Abstract Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other. This work sheds light on those open questions. by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalize their performance. We also demonstrate how a larger number of families to classify make the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family better generalize on unseen data.

Machine Learning based Parameter Sensitivity of Regional Climate Models – A Case Study of the WRF Model for Heat Extremes over Southeast Australia

  • paper_url: http://arxiv.org/abs/2307.14654
  • repo_url: None
  • paper_authors: P. Jyoteeshkumar Reddy, Sandeep Chinta, Richard Matear, John Taylor, Harish Baki, Marcus Thatcher, Jatin Kala, Jason Sharples
  • for: Estimate the parameter sensitivity of a regional climate model for heat extremes over southeast Australia.
  • methods: A machine learning (ML) surrogate-based Sobol sensitivity analysis of the parameters of the Weather Research and Forecasting (WRF) model.
  • results: Of 24 adjustable parameters across seven physics schemes, only three (the scattering tuning parameter, the multiplier of saturated soil water content, and the profile shape exponent in the momentum diffusivity coefficient) are important for the surface meteorological variables (temperature, relative humidity, and wind speed); the results are consistent across two extreme heat events.
    Abstract Heatwaves and bushfires cause substantial impacts on society and ecosystems across the globe. Accurate information of heat extremes is needed to support the development of actionable mitigation and adaptation strategies. Regional climate models are commonly used to better understand the dynamics of these events. These models have very large input parameter sets, and the parameters within the physics schemes substantially influence the model's performance. However, parameter sensitivity analysis (SA) of regional models for heat extremes is largely unexplored. Here, we focus on the southeast Australian region, one of the global hotspots of heat extremes. In southeast Australia Weather Research and Forecasting (WRF) model is the widely used regional model to simulate extreme weather events across the region. Hence in this study, we focus on the sensitivity of WRF model parameters to surface meteorological variables such as temperature, relative humidity, and wind speed during two extreme heat events over southeast Australia. Due to the presence of multiple parameters and their complex relationship with output variables, a machine learning (ML) surrogate-based global sensitivity analysis method is considered for the SA. The ML surrogate-based Sobol SA is used to identify the sensitivity of 24 adjustable parameters in seven different physics schemes of the WRF model. Results show that out of these 24, only three parameters, namely the scattering tuning parameter, multiplier of saturated soil water content, and profile shape exponent in the momentum diffusivity coefficient, are important for the considered meteorological variables. These SA results are consistent for the two different extreme heat events. Further, we investigated the physical significance of sensitive parameters. This study's results will help in further optimising WRF parameters to improve model simulation.

Speed Limits for Deep Learning

  • paper_url: http://arxiv.org/abs/2307.14653
  • repo_url: https://github.com/RishabhP19/Traffic-Surveillance
  • paper_authors: Inbar Seroussi, Alexander A. Alemi, Moritz Helias, Zohar Ringel
  • for: This paper asks whether modern neural networks are trained optimally.
  • methods: Recent advances in stochastic thermodynamics are used to bound the speed at which training can move from the initial weight distribution to the fully trained distribution, based on the ratio of their Wasserstein-2 distance to the entropy production rate of the connecting dynamical process, for both gradient-flow and Langevin training dynamics.
  • results: For linear and linearizable networks (e.g., the Neural Tangent Kernel, NTK), under plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense, consistent with small-scale experiments.
    Abstract State-of-the-art neural networks require extreme computational power to train. It is therefore natural to wonder whether they are optimally trained. Here we apply a recent advancement in stochastic thermodynamics which allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network, based on the ratio of their Wasserstein-2 distance and the entropy production rate of the dynamical process connecting them. Considering both gradient-flow and Langevin training dynamics, we provide analytical expressions for these speed limits for linear and linearizable neural networks e.g. Neural Tangent Kernel (NTK). Remarkably, given some plausible scaling assumptions on the NTK spectra and spectral decomposition of the labels -- learning is optimal in a scaling sense. Our results are consistent with small-scale experiments with Convolutional Neural Networks (CNNs) and Fully Connected Neural networks (FCNs) on CIFAR-10, showing a short highly non-optimal regime followed by a longer optimal regime.

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2307.14648
  • repo_url: None
  • paper_authors: Xin Yuan, Linjie Li, Jianfeng Wang, Zhengyuan Yang, Kevin Lin, Zicheng Liu, Lijuan Wang
  • for: Study denoising diffusion probabilistic models (DDPMs) for visual synthesis in wavelet space rather than pixel space.
  • methods: A new architecture, SFUNet, with spatial frequency-aware convolution and attention modules that jointly model spatial- and frequency-domain information in the wavelet data.
  • results: Explicitly modeling wavelet signals lets the model generate higher-quality images than its pixel-based counterpart on CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church.
    Abstract In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis. Considering the wavelet transform represents the image in spatial and frequency domains, we carefully design a novel architecture SFUNet to effectively capture the correlation for both domains. Specifically, in the standard denoising U-Net for pixel data, we supplement the 2D convolutions and spatial-only attention layers with our spatial frequency-aware convolution and attention modules to jointly model the complementary information from spatial and frequency domains in wavelet data. Our new architecture can be used as a drop-in replacement to the pixel-based network and is compatible with the vanilla DDPM training process. By explicitly modeling the wavelet signals, we find our model is able to generate images with higher quality on CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church datasets, than the pixel-based counterpart.

MVMR-FS : Non-parametric feature selection algorithm based on Maximum inter-class Variation and Minimum Redundancy

  • paper_url: http://arxiv.org/abs/2307.14643
  • repo_url: None
  • paper_authors: Haitao Nie, Shengbo Zhang, Bin Xie
  • for: Address the long-standing challenge in feature selection of accurately measuring the relevance and redundancy of features.
  • methods: A non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy (MVMR-FS): supervised and unsupervised kernel density estimation captures the inter-class and overall distributions of the features; inter-class probability distributions reflect relevance, distances between overall distributions quantify redundancy, and an AGA searches for the feature subset minimizing the MVMR criterion.
  • results: Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy, improving accuracy by 5% to 11%.
    Abstract How to accurately measure the relevance and redundancy of features is an age-old challenge in the field of feature selection. However, existing filter-based feature selection methods cannot directly measure redundancy for continuous data. In addition, most methods rely on manually specifying the number of features, which may introduce errors in the absence of expert knowledge. In this paper, we propose a non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy, abbreviated as MVMR-FS. We first introduce supervised and unsupervised kernel density estimation on the features to capture their similarities and differences in inter-class and overall distributions. Subsequently, we present the criteria for maximum inter-class variation and minimum redundancy (MVMR), wherein the inter-class probability distributions are employed to reflect feature relevance and the distances between overall probability distributions are used to quantify redundancy. Finally, we employ an AGA to search for the feature subset that minimizes the MVMR. Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy and improves the accuracy by 5% to 11%.

Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?

  • paper_url: http://arxiv.org/abs/2307.14642
  • repo_url: None
  • paper_authors: Kyurae Kim, Yian Ma, Jacob R. Gardner
  • for: Prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification.
  • methods: BBVI with control variates, in particular the STL estimator, combined with previous results on projected stochastic gradient descent.
  • results: A quadratic bound on the gradient variance of the STL estimator, one that also covers misspecified variational families, directly implying convergence of BBVI; improved analysis of the closed-form entropy gradient estimators enables comparison against the STL estimator and provides explicit non-asymptotic complexity guarantees for both.
    Abstract We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator and provides explicit non-asymptotic complexity guarantees for both.

Fact-Checking of AI-Generated Reports

  • paper_url: http://arxiv.org/abs/2307.14634
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Razi Mahmood, Ge Wang, Mannudeep Kalra, Pingkun Yan
  • for: Make AI-generated preliminary radiology reports trustworthy, so that they can expedite clinical workflows, improve accuracy, and reduce costs.
  • methods: Fact-check AI-generated reports against their associated images: an examiner learns the association between an image and sentences describing real or potentially fake findings, trained on a new dataset of fake reports created by perturbing the findings in ground-truth radiology reports.
  • results: The examiner can detect and remove fake sentences from automatically generated reports, providing a tool that future generative AI approaches can use to validate their output.
    Abstract With advances in generative artificial intelligence (AI), it is now possible to produce realistic-looking automated reports for preliminary reads of radiology images. This can expedite clinical workflows, improve accuracy and reduce overall costs. However, it is also well-known that such models often hallucinate, leading to false findings in the generated reports. In this paper, we propose a new method of fact-checking of AI-generated reports using their associated images. Specifically, the developed examiner differentiates real and fake sentences in reports by learning the association between an image and sentences describing real or potentially fake findings. To train such an examiner, we first created a new dataset of fake reports by perturbing the findings in the original ground truth radiology reports associated with images. Text encodings of real and fake sentences drawn from these reports are then paired with image encodings to learn the mapping to real/fake labels. The utility of such an examiner is demonstrated for verifying automatically generated reports by detecting and removing fake sentences. Future generative AI approaches can use the resulting tool to validate their reports leading to a more responsible use of AI in expediting clinical workflows.

A Survey on Reservoir Computing and its Interdisciplinary Applications Beyond Traditional Machine Learning

  • paper_url: http://arxiv.org/abs/2307.15092
  • repo_url: None
  • paper_authors: Heng Zhang, Danilo Vasconcellos Vargas
  • for: Provide a unified review of reservoir computing (RC), spanning applications from machine learning to physics, biology, and neuroscience.
  • methods: RC uses a randomly connected recurrent neural network whose connection strengths stay fixed after initialization, with a simple linear readout generating the responses for various applications.
  • results: RC offers efficient solutions across a wide range of applications, can be realized in physical hardware and biological devices, and sheds light on brain mechanisms that exploit similar dynamical processes.
    Abstract Reservoir computing (RC), first applied to temporal signal processing, is a recurrent neural network in which neurons are randomly connected. Once initialized, the connection strengths remain unchanged. Such a simple structure turns RC into a non-linear dynamical system that maps low-dimensional inputs into a high-dimensional space. The model's rich dynamics, linear separability, and memory capacity then enable a simple linear readout to generate adequate responses for various applications. RC spans areas far beyond machine learning, since it has been shown that the complex dynamics can be realized in various physical hardware implementations and biological devices. This yields greater flexibility and shorter computation time. Moreover, the neuronal responses triggered by the model's dynamics shed light on understanding brain mechanisms that also exploit similar dynamical processes. While the literature on RC is vast and fragmented, here we conduct a unified review of RC's recent developments from machine learning to physics, biology, and neuroscience. We first review the early RC models, and then survey the state-of-the-art models and their applications. We further introduce studies on modeling the brain's mechanisms by RC. Finally, we offer new perspectives on RC development, including reservoir design, coding frameworks unification, physical RC implementations, and interaction between RC, cognitive neuroscience and evolution.

Rapid and Scalable Bayesian AB Testing

  • paper_url: http://arxiv.org/abs/2307.14628
  • repo_url: None
  • paper_authors: Srivas Chennu, Andrew Maher, Christian Pangerl, Subash Prabanantham, Jae Hyeon Bae, Jamie Martin, Bud Goswami
  • for: The method helps business decision makers better exploit data to improve digital user experiences.
  • methods: Hierarchical Bayesian estimation addresses the limitations of existing methods, including correlations between factors in multivariate designs, sequential testing with early stopping, and pooling global learnings from past tests.
  • results: On real-world data, the method increases statistical power without incurring excessive false-positive risk, and sequential testing with progressive early stopping can accelerate future tests.
    Abstract AB testing aids business operators with their decision making, and is considered the gold standard method for learning from data to improve digital user experiences. However, there is usually a gap between the requirements of practitioners, and the constraints imposed by the statistical hypothesis testing methodologies commonly used for analysis of AB tests. These include the lack of statistical power in multivariate designs with many factors, correlations between these factors, the need of sequential testing for early stopping, and the inability to pool knowledge from past tests. Here, we propose a solution that applies hierarchical Bayesian estimation to address the above limitations. In comparison to current sequential AB testing methodology, we increase statistical power by exploiting correlations between factors, enabling sequential testing and progressive early stopping, without incurring excessive false positive risk. We also demonstrate how this methodology can be extended to enable the extraction of composite global learnings from past AB tests, to accelerate future tests. We underpin our work with a solid theoretical framework that articulates the value of hierarchical estimation. We demonstrate its utility using both numerical simulations and a large set of real-world AB tests. Together, these results highlight the practical value of our approach for statistical inference in the technology industry.
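To illustrate the kind of hierarchical pooling the paper builds on, the following sketch uses an empirical-Bayes Beta-Binomial model: per-variant conversion rates are shrunk toward a shared prior, and the probability that each variant is best is estimated by Monte Carlo, a quantity usable for early stopping. This is a simplified stand-in with made-up data, not the paper's estimator.

```python
import numpy as np

# Hypothetical per-variant AB test data: conversions and trials.
conversions = np.array([120, 135, 128, 150])
trials = np.array([1000, 1000, 1000, 1000])

# Empirical-Bayes sketch of hierarchical pooling: fit a shared Beta prior
# by moment matching across variants, then shrink each variant toward it.
rates = conversions / trials
mu, var = rates.mean(), rates.var(ddof=1)
strength = mu * (1 - mu) / var - 1        # prior "strength" alpha + beta
alpha0, beta0 = mu * strength, (1 - mu) * strength

# Conjugate posterior per variant: Beta(alpha0 + x, beta0 + n - x).
post_alpha = alpha0 + conversions
post_beta = beta0 + trials - conversions

# Monte Carlo estimate of P(variant i is best), usable for early stopping.
samples = np.random.beta(post_alpha, post_beta, size=(100_000, len(trials)))
p_best = (samples.argmax(axis=1)[:, None] == np.arange(len(trials))).mean(axis=0)
```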

BubbleML: A Multi-Physics Dataset and Benchmarks for Machine Learning

  • paper_url: http://arxiv.org/abs/2307.14623
  • repo_url: https://github.com/hpcforge/bubbleml
  • paper_authors: Sheikh Md Shakeel Hassan, Arthur Feeney, Akash Dhruv, Jihoon Kim, Youngjoon Suh, Jaiyoung Ryu, Yoonjin Won, Aparna Chandramowlishwaran
  • for: This work aims to provide an accessible and diverse dataset suitable for machine learning (ML) training, to better understand multi-physics phase change phenomena.
  • methods: Physics-driven simulations provide accurate ground truth for a variety of boiling scenarios, including pool boiling, flow boiling, and sub-cooled boiling, covering parameters such as gravity conditions, flow rates, sub-cooling levels, and wall superheat across 51 simulations.
  • results: The BubbleML dataset is validated against experimental observations and trends, establishing it as a valuable resource for ML research; its potential for diverse downstream tasks is demonstrated with benchmarks such as optical flow for bubble dynamics and operator networks for temperature dynamics.
    Abstract In the field of phase change phenomena, the lack of accessible and diverse datasets suitable for machine learning (ML) training poses a significant challenge. Existing experimental datasets are often restricted, with limited availability and sparse ground truth data, impeding our understanding of this complex multi-physics phenomena. To bridge this gap, we present the BubbleML Dataset(https://github.com/HPCForge/BubbleML) which leverages physics-driven simulations to provide accurate ground truth information for various boiling scenarios, encompassing nucleate pool boiling, flow boiling, and sub-cooled boiling. This extensive dataset covers a wide range of parameters, including varying gravity conditions, flow rates, sub-cooling levels, and wall superheat, comprising 51 simulations. BubbleML is validated against experimental observations and trends, establishing it as an invaluable resource for ML research. Furthermore, we showcase its potential to facilitate exploration of diverse downstream tasks by introducing two benchmarks: (a) optical flow analysis to capture bubble dynamics, and (b) operator networks for learning temperature dynamics. The BubbleML dataset and its benchmarks serve as a catalyst for advancements in ML-driven research on multi-physics phase change phenomena, enabling the development and comparison of state-of-the-art techniques and models.

Imitating Complex Trajectories: Bridging Low-Level Stability and High-Level Behavior

  • paper_url: http://arxiv.org/abs/2307.14619
  • repo_url: None
  • paper_authors: Adam Block, Daniel Pfrommer, Max Simchowitz
  • for: Studies the imitation of complex, non-Markovian, stochastic expert demonstrations in nonlinear dynamical systems.
  • methods: Uses low-level controllers (learned or implicit) to stabilize imitation policies around expert demonstrations, combined with data augmentation and a novel algorithmic trick to ensure policy stability.
  • results: Proves that if the learner accurately estimates the score of the expert policy, the imitator's trajectory distribution is close to the demonstrator's in a natural optimal transport distance.
    Abstract We propose a theoretical framework for studying the imitation of stochastic, non-Markovian, potentially multi-modal (i.e. "complex" ) expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation policies around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy we call "total variation continuity" (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations.
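The following numpy sketch illustrates the augmentation trick as we read it: train on noise-perturbed states, and also inject the same augmentation noise at execution time so the policy is queried on the distribution it was trained on. All constants and the linear toy policy are illustrative assumptions.

```python
import numpy as np

def augment_dataset(states, actions, sigma=0.05, copies=4, rng=None):
    """Standard noise augmentation: train on (state + noise, action) pairs."""
    rng = rng or np.random.default_rng(0)
    noisy = states + sigma * rng.normal(size=(copies, *states.shape))
    return noisy.reshape(-1, states.shape[-1]), np.tile(actions, (copies, 1))

def execute(policy, state, sigma=0.05, rng=None):
    """The paper's trick, as we read it: inject the *same* augmentation
    noise at execution time, so the policy sees the training distribution
    (supporting the total variation continuity property)."""
    rng = rng or np.random.default_rng(1)
    return policy(state + sigma * rng.normal(size=state.shape))

# Toy usage with a linear "policy".
policy = lambda s: -0.5 * s
states = np.random.default_rng(2).normal(size=(100, 3))
actions = -0.5 * states
X, y = augment_dataset(states, actions)
a = execute(policy, states[0])
```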

Self-Contrastive Graph Diffusion Network

  • paper_url: http://arxiv.org/abs/2307.14613
  • repo_url: https://github.com/kunzhan/SCDGN
  • paper_authors: Yixian Ma, Kun Zhan
  • for: This paper proposes a new framework, the Self-Contrastive Graph Diffusion Network (SCGDN), for graph self-contrastive learning (GSC).
  • methods: The framework uses two main components: an Attentional Module (AttM) and a Diffusion Module (DiFM). AttM aggregates higher-order structure and feature information to obtain a strong embedding, while DiFM balances the state of each node through Laplacian diffusion learning and lets adjacency and feature information co-evolve on the graph.
  • results: The method avoids "sampling bias" and semantic drift without pre-training. Compared with existing methods, SCGDN preserves high-order structure information while avoiding overfitting, and experiments show it consistently outperforms both contrastive and classical baselines on GSC.
    Abstract Augmentation techniques and sampling strategies are crucial in contrastive learning, but in most existing works, augmentation techniques require careful design, and their sampling strategies can only capture a small amount of intrinsic supervision information. Additionally, the existing methods require complex designs to obtain two different representations of the data. To overcome these limitations, we propose a novel framework called the Self-Contrastive Graph Diffusion Network (SCGDN). Our framework consists of two main components: the Attentional Module (AttM) and the Diffusion Module (DiFM). AttM aggregates higher-order structure and feature information to get an excellent embedding, while DiFM balances the state of each node in the graph through Laplacian diffusion learning and allows the cooperative evolution of adjacency and feature information in the graph. Unlike existing methodologies, SCGDN is an augmentation-free approach that avoids "sampling bias" and semantic drift, without the need for pre-training. We conduct a high-quality sampling of samples based on structure and feature information. If two nodes are neighbors, they are considered positive samples of each other. If two disconnected nodes are also unrelated on $k$NN graph, they are considered negative samples for each other. The contrastive objective reasonably uses our proposed sampling strategies, and the redundancy reduction term minimizes redundant information in the embedding and can well retain more discriminative information. In this novel framework, the graph self-contrastive learning paradigm gives expression to a powerful force. SCGDN effectively balances between preserving high-order structure information and avoiding overfitting. The results manifest that SCGDN can consistently generate outperformance over both the contrastive methods and the classical methods.
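The sampling rule in the abstract translates naturally into code: adjacent nodes are positives, while pairs that are disconnected in the graph and also unrelated on the feature kNN graph are negatives. A minimal sketch using scikit-learn, with all sizes illustrative and everything else left unlabeled:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def contrastive_pairs(adj, features, k=5):
    """Positives: adjacent node pairs. Negatives: pairs disconnected in the
    graph AND not neighbors on the feature kNN graph."""
    knn = kneighbors_graph(features, k, mode="connectivity").toarray()
    knn = np.maximum(knn, knn.T)  # symmetrize
    positives, negatives = [], []
    n = adj.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j] > 0:
                positives.append((i, j))
            elif knn[i, j] == 0:
                negatives.append((i, j))
    return positives, negatives

# Toy usage on a random graph with random node features.
rng = np.random.default_rng(0)
adj = (rng.random((30, 30)) > 0.9).astype(float)
adj = np.maximum(adj, adj.T); np.fill_diagonal(adj, 0)
feats = rng.normal(size=(30, 8))
pos, neg = contrastive_pairs(adj, feats)
```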

Complete and separate: Conditional separation with missing target source attribute completion

  • paper_url: http://arxiv.org/abs/2307.14609
  • repo_url: None
  • paper_authors: Dimitrios Bralios, Efthymios Tzinis, Paris Smaragdis
  • for: To improve the separation performance of a multi-conditional separation network.
  • methods: Using a pre-trained model to extract additional semantic data and combining it with different input mixtures.
  • results: Separation performance approaching the level of an oracle model, and comparable to the best performing specialized single conditional models, providing an easier-to-use alternative.
    Abstract Recent approaches in source separation leverage semantic information about their input mixtures and constituent sources that when used in conditional separation models can achieve impressive performance. Most approaches along these lines have focused on simple descriptions, which are not always useful for varying types of input mixtures. In this work, we present an approach in which a model, given an input mixture and partial semantic information about a target source, is trained to extract additional semantic data. We then leverage this pre-trained model to improve the separation performance of an uncoupled multi-conditional separation network. Our experiments demonstrate that the separation performance of this multi-conditional model is significantly improved, approaching the performance of an oracle model with complete semantic information. Furthermore, our approach achieves performance levels that are comparable to those of the best performing specialized single conditional models, thus providing an easier to use alternative.

HUTFormer: Hierarchical U-Net Transformer for Long-Term Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2307.14596
  • repo_url: None
  • paper_authors: Zezhi Shao, Fei Wang, Zhao Zhang, Yuchen Fang, Guangyin Jin, Yongjun Xu
  • for: The goal is to improve the accuracy of long-term traffic forecasting (e.g., 1-day forecasting) by addressing its unique challenges, such as exploiting multi-scale representations.
  • methods: Proposes a novel Hierarchical U-net Transformer (HUTFormer) consisting of a hierarchical encoder and decoder, which uses window self-attention and segment merging to extract multi-scale representations.
  • results: Extensive experiments on four traffic datasets show that HUTFormer significantly outperforms state-of-the-art traffic forecasting and long time-series forecasting baselines.
    Abstract Traffic forecasting, which aims to predict traffic conditions based on historical observations, has been an enduring research topic and is widely recognized as an essential component of intelligent transportation. Recent proposals on Spatial-Temporal Graph Neural Networks (STGNNs) have made significant progress by combining sequential models with graph convolution networks. However, due to high complexity issues, STGNNs only focus on short-term traffic forecasting, e.g., 1-hour forecasting, while ignoring more practical long-term forecasting. In this paper, we make the first attempt to explore long-term traffic forecasting, e.g., 1-day forecasting. To this end, we first reveal its unique challenges in exploiting multi-scale representations. Then, we propose a novel Hierarchical U-net TransFormer (HUTFormer) to address the issues of long-term traffic forecasting. HUTFormer consists of a hierarchical encoder and decoder to jointly generate and utilize multi-scale representations of traffic data. Specifically, for the encoder, we propose window self-attention and segment merging to extract multi-scale representations from long-term traffic data. For the decoder, we design a cross-scale attention mechanism to effectively incorporate multi-scale representations. In addition, HUTFormer employs an efficient input embedding strategy to address the complexity issues. Extensive experiments on four traffic datasets show that the proposed HUTFormer significantly outperforms state-of-the-art traffic forecasting and long time series forecasting baselines.
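Below is a minimal PyTorch sketch of one hierarchical encoder stage, pairing an attention layer with a segment-merging step that halves the sequence length. The merge-two-neighbors design and the use of a standard (non-windowed) attention layer are simplifying assumptions for illustration, not HUTFormer's exact modules.

```python
import torch
import torch.nn as nn

class SegmentMerging(nn.Module):
    """Halve the temporal length by merging adjacent segments, one simple
    way to realize a "segment merging" step in a hierarchical encoder."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                  # x: (batch, time, d_model), time even
        b, t, d = x.shape
        x = x.reshape(b, t // 2, 2 * d)    # concatenate neighboring steps
        return self.proj(x)                # (batch, time // 2, d_model)

# One encoder stage: attention, then merge to a coarser scale.
x = torch.randn(4, 288, 64)                # e.g., 288 five-minute steps = 1 day
attn = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
merge = SegmentMerging(64)
scale1 = attn(x)            # fine-scale representation
scale2 = merge(scale1)      # coarser scale, half the length
```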

MCPA: Multi-scale Cross Perceptron Attention Network for 2D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.14588
  • repo_url: https://github.com/simonustc/mcpa-for-2d-medical-image-segmentation
  • paper_authors: Liang Xu, Mingxiao Chen, Yi Cheng, Pengfei Shao, Shuwei Shen, Peng Yao, Ronald X. Xu
  • for: To propose a Convolutional Neural Network (CNN)-based medical image segmentation model that better handles medical image segmentation tasks.
  • methods: Enhances the UNet architecture with Transformer modules to better capture global feature correlations, and introduces the Multi-scale Cross Perceptron Attention Network (MCPA), which captures multiple local correlations across the image and aggregates them into global features.
  • results: Achieves state-of-the-art performance on several publicly available medical image datasets.
    Abstract The UNet architecture, based on Convolutional Neural Networks (CNN), has demonstrated its remarkable performance in medical image analysis. However, it faces challenges in capturing long-range dependencies due to the limited receptive fields and inherent bias of convolutional operations. Recently, numerous transformer-based techniques have been incorporated into the UNet architecture to overcome this limitation by effectively capturing global feature correlations. However, the integration of the Transformer modules may result in the loss of local contextual information during the global feature fusion process. To overcome these challenges, we propose a 2D medical image segmentation model called Multi-scale Cross Perceptron Attention Network (MCPA). The MCPA consists of three main components: an encoder, a decoder, and a Cross Perceptron. The Cross Perceptron first captures the local correlations using multiple Multi-scale Cross Perceptron modules, facilitating the fusion of features across scales. The resulting multi-scale feature vectors are then spatially unfolded, concatenated, and fed through a Global Perceptron module to model global dependencies. Furthermore, we introduce a Progressive Dual-branch Structure to address the semantic segmentation of the image involving finer tissue structures. This structure gradually shifts the segmentation focus of MCPA network training from large-scale structural features to more sophisticated pixel-level features. We evaluate our proposed MCPA model on several publicly available medical image datasets from different tasks and devices, including the open large-scale dataset of CT (Synapse), MRI (ACDC), fundus camera (DRIVE, CHASE_DB1, HRF), and OCTA (ROSE). The experimental results show that our MCPA model achieves state-of-the-art performance. The code is available at https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation.

Evaluation of Safety Constraints in Autonomous Navigation with Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.14568
  • repo_url: None
  • paper_authors: Brian Angulo, Gregory Gorbov, Aleksandr Panov, Konstantin Yakovlev
  • for: This study aims to demonstrate the importance of safety constraints in autonomous navigation and to propose a learned policy that satisfies them.
  • methods: Two learnable navigation policies are compared: a safe policy that takes the safety constraints into account and an unsafe one that does not.
  • results: The safe policy generates trajectories with more clearance (distance to obstacles) and makes fewer collisions during training, without sacrificing overall performance.
    Abstract While reinforcement learning algorithms have had great success in the field of autonomous navigation, they cannot be straightforwardly applied to the real autonomous systems without considering the safety constraints. The later are crucial to avoid unsafe behaviors of the autonomous vehicle on the road. To highlight the importance of these constraints, in this study, we compare two learnable navigation policies: safe and unsafe. The safe policy takes the constraints into account, while the other does not. We show that the safe policy is able to generate trajectories with more clearance (distance to the obstacles) and makes less collisions while training without sacrificing the overall performance.
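To make the safe/unsafe distinction concrete, here is a toy reward sketch in which the "safe" variant adds a clearance-shaping penalty that the "unsafe" variant omits; all constants and the reward structure are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def navigation_reward(pos, goal, obstacles, safe_dist=1.0,
                      collision_dist=0.3, safety_on=True):
    """Toy navigation reward: progress toward the goal, a collision penalty
    shared by both policies, and a clearance term used only by the safe one."""
    clearance = min(np.linalg.norm(pos - o) for o in obstacles)
    reward = -np.linalg.norm(pos - goal)          # progress term
    if clearance < collision_dist:
        return reward - 100.0                     # collision penalty (both)
    if safety_on and clearance < safe_dist:
        reward -= (safe_dist - clearance) ** 2    # clearance shaping term
    return reward

pos = np.array([0.5, 0.5])
goal = np.array([5.0, 5.0])
obstacles = [np.array([1.0, 1.0]), np.array([3.0, 4.0])]
r_safe = navigation_reward(pos, goal, obstacles, safety_on=True)
r_unsafe = navigation_reward(pos, goal, obstacles, safety_on=False)
```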

Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples

  • paper_url: http://arxiv.org/abs/2307.14565
  • repo_url: https://github.com/lipengcs/auto-tables-benchmark
  • paper_authors: Peng Li, Yeye He, Cong Yan, Yue Wang, Surajit Chaudhuri
  • for: This paper addresses the problem that tables encountered in practice often do not conform to the relational standard.
  • methods: The Auto-Tables system automatically synthesizes multi-step transformation pipelines that convert non-relational tables into forms that downstream SQL-based analytics tools can query.
  • results: Experiments show that Auto-Tables successfully synthesizes multi-step transformation pipelines, turning non-relational tables into standard relational ones for over 70% of test cases at interactive speeds, without any user input.
    Abstract Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based analytics tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Power-BI/Tableau forums. We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
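As a concrete example of the kind of transformation Auto-Tables aims to synthesize automatically, here is a hand-written pandas pipeline that relationalizes a common non-relational layout (one column per year); the data and column names are made up for illustration.

```python
import pandas as pd

# A common non-relational layout: one column per year (a "crosstab" table).
wide = pd.DataFrame({
    "country": ["US", "JP"],
    "2021": [10, 7],
    "2022": [12, 8],
    "2023": [15, 9],
})

# A multi-step transformation of the sort Auto-Tables synthesizes; here we
# write the pipeline by hand for illustration.
relational = (
    wide.melt(id_vars="country", var_name="year", value_name="value")
        .astype({"year": int})
        .sort_values(["country", "year"])
        .reset_index(drop=True)
)
# Now each row is one (country, year) entity and SQL-style analytics apply.
```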

Understanding Forward Process of Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2307.15090
  • repo_url: https://github.com/himanshub1007/Alzhimers-Disease-Prediction-Using-Deep-learning
  • paper_authors: Peixin Tian
  • for: This paper reveals the selective rotation in deep networks' forward processing, explaining the activation function as the mechanism that determines the rotational treatment of input data, and analyzes this definition with mathematical tools.
  • methods: Experiments with deep networks demonstrate their rotational treatment of input data, which is further analyzed and interpreted using mathematical tools.
  • results: The experiments show that the rotational handling of inputs by activation functions can be analyzed and understood, and they reveal consistency between artificial neural networks and the human brain in their data processing patterns.
    Abstract This paper reveals the selective rotation in the CNNs' forward processing. It elucidates the activation function as a discerning mechanism that unifies and quantizes the rotational aspects of the input data. Experiments show how this methodology reflects the process by which the network distinguishes inputs based on statistical indicators, which can be comprehended and analyzed by applying structured mathematical tools. Our findings also unveil the consistency between artificial neural networks and the human brain in their data processing patterns.

Adversarial Sleeping Bandit Problems with Multiple Plays: Algorithm and Ranking Application

  • paper_url: http://arxiv.org/abs/2307.14549
  • repo_url: None
  • paper_authors: Jianjun Yuan, Wei Lee Woon, Ludovik Coba
  • for: This paper addresses the sleeping bandit problem with multiple plays in the context of an online recommendation system.
  • methods: The algorithm extends the single-arm sleeping bandit algorithm to select multiple arms per round.
  • results: The algorithm is guaranteed to achieve theoretical performance, with regret upper bounded by $O(kN^2\sqrt{T\log T})$, where $k$ is the number of arms selected per time step, $N$ is the total number of arms, and $T$ is the time horizon.
    Abstract This paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system. The problem involves bounded, adversarial loss and unknown i.i.d. distributions for arm availability. The proposed algorithm extends the sleeping bandit algorithm for single arm selection and is guaranteed to achieve theoretical performance with regret upper bounded by $O(kN^2\sqrt{T\log T})$, where $k$ is the number of arms selected per time step, $N$ is the total number of arms, and $T$ is the time horizon.
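For intuition about the problem setting, here is an EXP3-style sleeping-bandit sketch with multiple plays: each round only a subset of arms is available, k of them are sampled, and weights are updated with importance-weighted losses. This illustrates the setting only; the paper's actual algorithm and its $O(kN^2\sqrt{T\log T})$ analysis are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, eta = 10, 3, 0.05           # arms, plays per round, learning rate
log_w = np.zeros(N)               # log-weights over arms

def play_round(available, losses):
    """One round: sample k of the currently available (awake) arms from the
    exponential-weights distribution, then apply importance-weighted updates."""
    w = np.exp(log_w[available] - log_w[available].max())
    p = w / w.sum()
    chosen = rng.choice(available, size=min(k, len(available)),
                        replace=False, p=p)
    for a in chosen:
        p_a = p[list(available).index(a)]
        log_w[a] -= eta * losses[a] / p_a   # importance-weighted update
    return chosen

available = np.array([0, 2, 3, 5, 7])        # arms awake this round
losses = rng.random(N)                       # adversarial losses, bounded [0, 1]
picked = play_round(available, losses)
```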

Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel’s Spectrum

  • paper_url: http://arxiv.org/abs/2307.14531
  • repo_url: None
  • paper_authors: Amnon Geifman, Daniel Barzilai, Ronen Basri, Meirav Galun
  • for: This work proposes a method to modify the inductive bias of wide neural networks to suit the task at hand.
  • methods: Introduces Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that approximate kernels with desired eigenvalues for which no closed form is known, and proposes a preconditioned gradient descent method that alters the trajectory of gradient descent to achieve a polynomial and, in some cases, exponential training speedup without changing the final solution.
  • results: The method improves computational and training efficiency in a fast and simple way, and achieves good performance across a variety of tasks.
    Abstract Wide neural networks are biased towards learning certain functions, influencing both the rate of convergence of gradient descent (GD) and the functions that are reachable with GD in finite training time. As such, there is a great need for methods that can modify this bias according to the task at hand. To that end, we introduce Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that can be used to approximate kernels with desired eigenvalues for which no closed form is known. We leverage the duality between wide neural networks and Neural Tangent Kernels and propose a preconditioned gradient descent method, which alters the trajectory of GD. As a result, this allows for a polynomial and, in some cases, exponential training speedup without changing the final solution. Our method is both computationally efficient and simple to implement.
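A numpy sketch of the underlying idea: preconditioned gradient descent on a kernel least-squares problem, where the preconditioner reshapes the kernel's eigenvalues. The RBF kernel, the specific spectrum modification, and all constants are illustrative choices, not the paper's MSK construction.

```python
import numpy as np

def preconditioned_gd(K, y, precond_eigvals, steps=200):
    """Gradient descent on the kernel objective (1/2) a^T K a - y^T a,
    preconditioned by a matrix sharing K's eigenvectors but with modified
    eigenvalues, which reshapes per-mode convergence speeds."""
    eigvals, U = np.linalg.eigh(K)
    P = U @ np.diag(precond_eigvals(eigvals)) @ U.T   # the preconditioner
    lr = 1.0 / np.linalg.norm(P @ K, 2)               # stable step size
    alpha = np.zeros_like(y)
    for _ in range(steps):
        grad = K @ alpha - y          # gradient of (1/2) a^T K a - y^T a
        alpha -= lr * (P @ grad)
    return alpha

# Toy kernel matrix from an RBF kernel on 1-D points.
x = np.linspace(0, 1, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
y = np.sin(2 * np.pi * x)

# Flattening the spectrum (approximate inverse) accelerates the slow,
# small-eigenvalue directions relative to the identity preconditioner.
alpha_fast = preconditioned_gd(K, y, lambda lam: 1.0 / (lam + 1e-3))
alpha_slow = preconditioned_gd(K, y, lambda lam: np.ones_like(lam))
```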

Optimal Estimation in Mixed-Membership Stochastic Block Models

  • paper_url: http://arxiv.org/abs/2307.14530
  • repo_url: None
  • paper_authors: Fedor Noskov, Maxim Panov
  • for: This paper studies the problem of reconstructing relations between communities in the Mixed-Membership Stochastic Block Model (MMSB).
  • methods: Compares different approaches to reconstructing community relations and establishes a minimax lower bound on the estimation error, then proposes a new estimator that matches this bound.
  • results: The new estimator reconstructs community relations in the MMSB model, with its guarantees proved under fairly general conditions.
    Abstract Community detection is one of the most critical problems in modern network science. Its applications can be found in various fields, from protein modeling to social network analysis. Recently, many papers appeared studying the problem of overlapping community detection, where each node of a network may belong to several communities. In this work, we consider Mixed-Membership Stochastic Block Model (MMSB) first proposed by Airoldi et al. (2008). MMSB provides quite a general setting for modeling overlapping community structure in graphs. The central question of this paper is to reconstruct relations between communities given an observed network. We compare different approaches and establish the minimax lower bound on the estimation error. Then, we propose a new estimator that matches this lower bound. Theoretical results are proved under fairly general conditions on the considered model. Finally, we illustrate the theory in a series of experiments.

Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM

  • paper_url: http://arxiv.org/abs/2307.14528
  • repo_url: None
  • paper_authors: Guillaume Garrigos, Robert M. Gower, Fabian Schaipp
  • for: This work targets solving the finite sum-of-terms problem (empirical risk minimization) using stochastic gradient descent (SGD) with an adaptive step size.
  • methods: Proposes two variants of SGD: $\texttt{SPS}_+$, a minor modification of the Stochastic Polyak Stepsize (SPS) method that uses sampled loss values, enforces a positive step size, and assumes knowledge of the sampled loss at optimality; and $\texttt{FUVAL}$, which gradually learns the loss values at optimality instead.
  • results: Shows that $\texttt{SPS}_+$ achieves the best known convergence rates for SGD in the Lipschitz non-smooth setting. The convergence analysis of $\texttt{FUVAL}$, however, shows no advantage over SGD, and its stochastic version shows no clear practical advantage either; experimental results are also reported.
    Abstract Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. This $\texttt{SPS}_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of $\texttt{FUVAL}$: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of $\texttt{FUVAL}$ and experimental results. One shortcoming of our work is that the convergence analysis of $\texttt{FUVAL}$ shows no advantage over SGD. Another shortcoming is that currently only the full-batch version of $\texttt{FUVAL}$ shows a minor advantage over GD (Gradient Descent) in terms of sensitivity to the step size; the stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make $\texttt{FUVAL}$ competitive. Currently the new $\texttt{FUVAL}$ method studied in this paper does not offer any clear theoretical or practical advantage. We have chosen to make this draft available online nonetheless because of some of the analysis techniques we use, such as the non-smooth analysis of $\texttt{SPS}_+$, and also to show an apparently interesting approach that currently does not work.
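A minimal sketch of the $\texttt{SPS}_+$ step: the Polyak stepsize computed from the sampled loss, clipped to be non-negative. The slack constant, the step-size cap, and the toy least-squares term are our own illustrative choices.

```python
import numpy as np

def sps_plus_step(x, grad_i, f_i_x, f_i_star, c=0.5, gamma_max=10.0):
    """One SGD step with the SPS_+ step size: the Polyak stepsize from the
    sampled loss, clipped to be non-negative (and capped, a common
    safeguard). f_i_star is the sampled loss at optimality, which SPS_+
    assumes known; FUVAL instead learns these values during training."""
    g2 = np.dot(grad_i, grad_i)
    step = max((f_i_x - f_i_star) / (2 * c * g2 + 1e-12), 0.0)
    return x - min(step, gamma_max) * grad_i

# Toy usage on a single least-squares term f_i(x) = 0.5 * (a.x - b)^2.
a, b = np.array([1.0, 2.0]), 3.0
x = np.zeros(2)
for _ in range(100):
    r = a @ x - b
    x = sps_plus_step(x, r * a, 0.5 * r * r, f_i_star=0.0)
```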

Open Problems in Computer Vision for Wilderness SAR and The Search for Patricia Wu-Murad

  • paper_url: http://arxiv.org/abs/2307.14527
  • repo_url: https://github.com/crasar/wisar
  • paper_authors: Thomas Manzini, Robin Murphy
  • for: This study applies two computer vision systems (a supervised EfficientDET model and an unsupervised RX spectral classifier) to 98.9 GB of drone imagery from the Wu-Murad wilderness search and rescue (WSAR) effort in Japan, and identifies three directions for future research.
  • methods: The EfficientDET model was applied to the HERIDAL dataset; despite achieving performance statistically equivalent to the state of the art, it failed to translate to the field, mistakenly identifying tree limbs and rocks as people and failing to identify members of the search team.
  • results: The gap between benchmark and field performance suggests three future research directions: more realistic datasets for wilderness SAR, computer vision models capable of handling the variety of imagery collected in real WSAR operations, and better alignment on performance measures.
    Abstract This paper details the challenges in applying two computer vision systems, an EfficientDET supervised learning model and the unsupervised RX spectral classifier, to 98.9 GB of drone imagery from the Wu-Murad wilderness search and rescue (WSAR) effort in Japan and identifies 3 directions for future research. There have been at least 19 proposed approaches and 3 datasets aimed at locating missing persons in drone imagery, but only 3 approaches (2 unsupervised and 1 of an unknown structure) are referenced in the literature as having been used in an actual WSAR operation. Of these proposed approaches, the EfficientDET architecture and the unsupervised spectral RX classifier were selected as the most appropriate for this setting. The EfficientDET model was applied to the HERIDAL dataset and despite achieving performance that is statistically equivalent to the state-of-the-art, the model fails to translate to the real world in terms of false positives (e.g., identifying tree limbs and rocks as people), and false negatives (e.g., failing to identify members of the search team). The poor results in practice for algorithms that showed good results on datasets suggest 3 areas of future research: more realistic datasets for wilderness SAR, computer vision models that are capable of seamlessly handling the variety of imagery that can be collected during actual WSAR operations, and better alignment on performance measures.

A new algorithm for Subgroup Set Discovery based on Information Gain

  • paper_url: http://arxiv.org/abs/2307.15089
  • repo_url: None
  • paper_authors: Daniel Gómez-Bravo, Aaron García, Guillermo Vigueras, Belén Ríos, Alejandro Rodríguez-González
  • for: This work proposes a new pattern discovery (PD) algorithm for finding itemsets, subsequences, or substructures that appear in a dataset with a frequency above a manually set threshold.
  • methods: The algorithm combines Information Gain (IG) and Odds Ratio (OR) as multi-criteria for pattern selection.
  • results: Across eleven datasets, IGSD provides more reliable patterns and smaller pattern sets than the FSSD and SSD++ algorithms, and achieves higher OR values, indicating stronger dependence between patterns and targets. Patterns from IGSD were also validated by domain experts and agreed with them better than those from FSSD and SSD++. These results suggest that IGSD is a suitable PD method and that non-standard PD metrics allow better evaluation of discovered patterns.
    Abstract Pattern discovery is a machine learning technique that aims to find sets of items, subsequences, or substructures that are present in a dataset with a higher frequency value than a manually set threshold. This process helps to identify recurring patterns or relationships within the data, allowing for valuable insights and knowledge extraction. In this work, we propose Information Gained Subgroup Discovery (IGSD), a new SD algorithm for pattern discovery that combines Information Gain (IG) and Odds Ratio (OR) as a multi-criteria for pattern selection. The algorithm tries to tackle some limitations of state-of-the-art SD algorithms like the need for fine-tuning of key parameters for each dataset, usage of a single pattern search criteria set by hand, usage of non-overlapping data structures for subgroup space exploration, and the impossibility to search for patterns by fixing some relevant dataset variables. Thus, we compare the performance of IGSD with two state-of-the-art SD algorithms: FSSD and SSD++. Eleven datasets are assessed using these algorithms. For the performance evaluation, we also propose to complement standard SD measures with IG, OR, and p-value. Obtained results show that FSSD and SSD++ algorithms provide less reliable patterns and reduced sets of patterns than IGSD algorithm for all datasets considered. Additionally, IGSD provides better OR values than FSSD and SSD++, stating a higher dependence between patterns and targets. Moreover, patterns obtained for one of the datasets used, have been validated by a group of domain experts. Thus, patterns provided by IGSD show better agreement with experts than patterns obtained by FSSD and SSD++ algorithms. These results demonstrate the suitability of the IGSD as a method for pattern discovery and suggest that the inclusion of non-standard SD metrics allows to better evaluate discovered patterns.
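The two selection criteria IGSD combines are standard quantities; the sketch below computes information gain and the odds ratio for a candidate pattern's coverage against a binary target. The Haldane correction and the toy data are illustrative choices, not details from the paper.

```python
import numpy as np

def pattern_scores(covered, target):
    """Information gain and odds ratio of a candidate pattern.
    `covered` marks rows matching the pattern; `target` is the class label."""
    def entropy(y):
        p = y.mean()
        if p in (0.0, 1.0):
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    ig = entropy(target) - (
        covered.mean() * entropy(target[covered])
        + (~covered).mean() * entropy(target[~covered])
    )
    a = np.sum(covered & (target == 1)) + 0.5   # Haldane correction
    b = np.sum(covered & (target == 0)) + 0.5
    c = np.sum(~covered & (target == 1)) + 0.5
    d = np.sum(~covered & (target == 0)) + 0.5
    return ig, (a * d) / (b * c)

rng = np.random.default_rng(0)
target = rng.integers(0, 2, 500).astype(bool)
covered = target ^ (rng.random(500) < 0.2)      # pattern correlated with target
ig, odds = pattern_scores(covered, target.astype(int))
```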

Bug Characterization in Machine Learning-based Systems

  • paper_url: http://arxiv.org/abs/2307.14512
  • repo_url: https://github.com/ml-bugs-2022/replication-package
  • paper_authors: Mohammad Mehdi Morovati, Amin Nikanjam, Florian Tambon, Foutse Khomh, Zhen Ming, Jiang
  • for: This paper investigates the characteristics of bugs in Machine Learning (ML)-based software systems and the differences between ML and non-ML bugs from the maintenance viewpoint.
  • methods: The paper uses a dataset of 447,948 GitHub repositories that use one of the three most popular ML frameworks (TensorFlow, Keras, and PyTorch) and manually inspects 386 sampled reported issues to identify ML bugs, then examines 109 identified ML bugs for their root causes, symptoms, and required fixing time.
  • results: Nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components. Fixing ML bugs is also more costly and complex, requiring more commits, changed files, and changed lines of code, which highlights the importance of paying significant attention to the reliability of ML components.
    Abstract Rapid growth of applying Machine Learning (ML) in different domains, especially in safety-critical areas, increases the need for reliable ML components, i.e., a software component operating based on ML. Understanding the bugs characteristics and maintenance challenges in ML-based systems can help developers of these systems to identify where to focus maintenance and testing efforts, by giving insights into the most error-prone components, most common bugs, etc. In this paper, we investigate the characteristics of bugs in ML-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint. We extracted 447,948 GitHub repositories that used one of the three most popular ML frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple filtering steps, we select the top 300 repositories with the highest number of closed issues. We manually investigate the extracted repositories to exclude non-ML-based systems. Our investigation involved a manual inspection of 386 sampled reported issues in the identified ML-based systems to indicate whether they affect ML components or not. Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components. Next, we thoroughly examined 109 identified ML bugs to identify their root causes, symptoms, and calculate their required fixing time. The results also revealed that ML bugs have significantly different characteristics compared to non-ML bugs, in terms of the complexity of bug-fixing (number of commits, changed files, and changed lines of code). Based on our results, fixing ML bugs are more costly and ML components are more error-prone, compared to non-ML bugs and non-ML components respectively. Hence, paying a significant attention to the reliability of the ML components is crucial in ML-based systems.

A Predictive Model of Digital Information Engagement: Forecasting User Engagement With English Words by Incorporating Cognitive Biases, Computational Linguistics and Natural Language Processing

  • paper_url: http://arxiv.org/abs/2307.14500
  • repo_url: None
  • paper_authors: Nimrod Dvir, Elaine Friedman, Suraj Commuri, Fan yang, Jennifer Romano
  • for: This paper introduces and tests a novel predictive model for digital information engagement, the READ model, which integrates cognitive biases and natural language processing to predict the engagement levels of information.
  • methods: The study uses a rigorous testing protocol involving a large-scale online survey (n = 80,500) to evaluate the engagement levels of 100 words selected from the WordNet database, computing the READ attributes for each word to predict its engagement level.
  • results: The READ model predicts a word's engagement level with an 84% accuracy rate, and has the potential to enhance content engagement and inform AI language model development and generative text work across various domains.
    Abstract This study introduces and empirically tests a novel predictive model for digital information engagement (IE) - the READ model, an acronym for the four pivotal attributes of engaging information: Representativeness, Ease-of-use, Affect, and Distribution. Conceptualized within the theoretical framework of Cumulative Prospect Theory, the model integrates key cognitive biases with computational linguistics and natural language processing to develop a multidimensional perspective on information engagement. A rigorous testing protocol was implemented, involving 50 randomly selected pairs of synonymous words (100 words in total) from the WordNet database. These words' engagement levels were evaluated through a large-scale online survey (n = 80,500) to derive empirical IE metrics. The READ attributes for each word were then computed and their predictive efficacy examined. The findings affirm the READ model's robustness, accurately predicting a word's IE level and distinguishing the more engaging word from a pair of synonyms with an 84% accuracy rate. The READ model's potential extends across various domains, including business, education, government, and healthcare, where it could enhance content engagement and inform AI language model development and generative text work. Future research should address the model's scalability and adaptability across different domains and languages, thereby broadening its applicability and efficacy.

HUGE: Huge Unsupervised Graph Embeddings with TPUs

  • paper_url: http://arxiv.org/abs/2307.14490
  • repo_url: None
  • paper_authors: Brandon Mayer, Anton Tsitsulin, Hendrik Fichtenberger, Jonathan Halcrow, Bryan Perozzi
  • for: This work targets rapid analysis of large-scale graph data.
  • methods: The high-performance graph embedding architecture leverages Tensor Processing Units (TPUs) with configurable amounts of high-bandwidth memory to simplify the graph embedding problem and scale to graphs with billions of nodes and trillions of edges.
  • results: Embedding space quality is verified on real and synthetic large-scale datasets, achieving high quality and performance.
    Abstract Graphs are a representation of structured data that captures the relationships between sets of objects. With the ubiquity of available network data, there is increasing industrial and academic need to quickly analyze graphs with billions of nodes and trillions of edges. A common first step for network understanding is Graph Embedding, the process of creating a continuous representation of nodes in a graph. A continuous representation is often more amenable, especially at scale, for solving downstream machine learning tasks such as classification, link prediction, and clustering. A high-performance graph embedding architecture leveraging Tensor Processing Units (TPUs) with configurable amounts of high-bandwidth memory is presented that simplifies the graph embedding problem and can scale to graphs with billions of nodes and trillions of edges. We verify the embedding space quality on real and synthetic large-scale datasets.

Role of Image Acquisition and Patient Phenotype Variations in Automatic Segmentation Model Generalization

  • paper_url: http://arxiv.org/abs/2307.14482
  • repo_url: None
  • paper_authors: Timothy L. Kline, Sumana Ramanathan, Harrison C. Gottlich, Panagiotis Korfiatis, Adriana V. Gregory
  • for: This study evaluates the out-of-domain performance and generalization capabilities of automated medical image segmentation models, with a particular focus on adaptation to new image acquisitions and disease types.
  • methods: The study uses non-contrast and contrast-enhanced abdominal CT scans of healthy patients and patients with polycystic kidney disease, training and validating models to segment kidneys, livers, and spleens.
  • results: Training on a more diverse dataset improves generalization and out-of-domain performance without degrading in-domain performance; for example, the Dice similarity of a model trained on 25% from each dataset was non-inferior to a model trained purely on in-domain data.
    Abstract Purpose: This study evaluated the out-of-domain performance and generalization capabilities of automated medical image segmentation models, with a particular focus on adaptation to new image acquisitions and disease type. Materials: Datasets from both non-contrast and contrast-enhanced abdominal CT scans of healthy patients and those with polycystic kidney disease (PKD) were used. A total of 400 images (100 non-contrast controls, 100 contrast controls, 100 non-contrast PKD, 100 contrast PKD) were utilized for training/validation of models to segment kidneys, livers, and spleens, and the final models were then tested on 100 non-contrast CT images of patients affected by PKD. Performance was evaluated using Dice, Jaccard, TPR, and Precision. Results: Models trained on a diverse range of data showed no worse performance than models trained exclusively on in-domain data when tested on in-domain data. For instance, the Dice similarity of the model trained on 25% from each dataset was found to be non-inferior to the model trained purely on in-domain data. Conclusions: The results indicate that broader training examples significantly enhances model generalization and out-of-domain performance, thereby improving automated segmentation tools' applicability in clinical settings. The study's findings provide a roadmap for future research to adopt a data-centric approach in medical image AI model development.
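For reference, the evaluation metrics used in this study are straightforward to compute; a minimal numpy sketch of Dice and Jaccard on binary masks, with toy masks standing in for a kidney segmentation and its prediction:

```python
import numpy as np

def dice(pred, truth, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return (2 * inter + eps) / (pred.sum() + truth.sum() + eps)

def jaccard(pred, truth, eps=1e-8):
    """Jaccard index (intersection over union) between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return (inter + eps) / (union + eps)

truth = np.zeros((64, 64), dtype=bool); truth[20:40, 20:40] = True
pred = np.zeros((64, 64), dtype=bool);  pred[22:42, 22:42] = True
print(f"Dice={dice(pred, truth):.3f}, Jaccard={jaccard(pred, truth):.3f}")
```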

Equitable Time-Varying Pricing Tariff Design: A Joint Learning and Optimization Approach

  • paper_url: http://arxiv.org/abs/2307.15088
  • repo_url: None
  • paper_authors: Liudong Chen, Bolun Xu
  • for: The paper aims to design equitable time-varying tariffs that balance affordability and response incentives for consumers with limited response capability.
  • methods: The paper proposes a joint learning-based identification and optimization method that uses a recurrent neural network (RNN) to capture high-dimensional and non-linear consumer price response behaviors, and embeds the RNN into the tariff design optimization as a non-linear optimization problem with a quadratic objective.
  • results: The proposed method achieves fast and scalable computation, and simulation using real-world consumer data shows that the equitable tariffs protect low-income consumers from price surges while effectively motivating consumers to reduce peak demand, ensure revenue recovery for the utility company, and achieve robust performance against demand response uncertainties and prediction errors.
    Abstract Time-varying pricing tariffs incentivize consumers to shift their electricity demand and reduce costs, but may increase the energy burden for consumers with limited response capability. The utility must thus balance affordability and response incentives when designing these tariffs by considering consumers' response expectations. This paper proposes a joint learning-based identification and optimization method to design equitable time-varying tariffs. Our proposed method encodes historical prices and demand response data into a recurrent neural network (RNN) to capture high-dimensional and non-linear consumer price response behaviors. We then embed the RNN into the tariff design optimization, formulating a non-linear optimization problem with a quadratic objective. We propose a gradient-based solution method that achieves fast and scalable computation. Simulation using real-world consumer data shows that our equitable tariffs protect low-income consumers from price surges while effectively motivating consumers to reduce peak demand. The method also ensures revenue recovery for the utility company and achieves robust performance against demand response uncertainties and prediction errors.

Limits to Reservoir Learning

  • paper_url: http://arxiv.org/abs/2307.14474
  • repo_url: None
  • paper_authors: Anthony M. Polloreno
  • for: This work bounds a machine's ability to learn based on computational limitations implied by physicality.
  • methods: The information processing capacity (IPC) is used to measure the degradation under noise of the performance of reservoir computers.
  • results: The IPC is at most polynomial in the system size $n$, and learning in the presence of the reservoir's noise requires an exponential number of samples.
    Abstract In this work, we bound a machine's ability to learn based on computational limitations implied by physicality. We start by considering the information processing capacity (IPC), a normalized measure of the expected squared error of a collection of signals to a complete basis of functions. We use the IPC to measure the degradation under noise of the performance of reservoir computers, a particular kind of recurrent network, when constrained by physical considerations. First, we show that the IPC is at most a polynomial in the system size $n$, even when considering the collection of $2^n$ possible pointwise products of the $n$ output signals. Next, we argue that this degradation implies that the family of functions represented by the reservoir requires an exponential number of samples to learn in the presence of the reservoir's noise. Finally, we conclude with a discussion of the performance of the same collection of $2^n$ functions without noise when being used for binary classification.
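A rough numpy sketch of an IPC-style measurement: for each target function in a basis (here simply powers of delayed inputs, rather than the complete orthogonal basis a full IPC computation uses), compute the fraction of its variance captured by the best linear readout of the reservoir states, and sum. Reservoir sizes, delays, and degrees are illustrative simplifications.

```python
import numpy as np

def ipc(states, input_seq, max_degree=3, ridge=1e-6):
    """Sum over basis targets of the normalized squared fit quality of the
    best linear readout; each term lies in [0, 1] and the sum is bounded
    by the number of state variables."""
    total = 0.0
    T, n = states.shape
    G = states.T @ states + ridge * np.eye(n)
    for delay in range(1, 10):
        for degree in range(1, max_degree + 1):
            z = np.roll(input_seq, delay) ** degree
            z = (z - z.mean()) / z.std()
            w = np.linalg.solve(G, states.T @ z)     # least-squares readout
            pred = states @ w
            total += (pred @ z) ** 2 / ((pred @ pred) * (z @ z) + 1e-12)
    return total

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 2000)
W = rng.normal(0, 1.0 / np.sqrt(100), (100, 100))   # random recurrent weights
w_in = rng.normal(0, 1, 100)
x, X = np.zeros(100), []
for u_t in u:
    x = np.tanh(W @ x + w_in * u_t)
    X.append(x.copy())
capacity = ipc(np.array(X), u)
```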

What Kinds of Contracts Do ML APIs Need?

  • paper_url: http://arxiv.org/abs/2307.14465
  • repo_url: None
  • paper_authors: Samantha Syeda Khairunnesa, Shibbir Ahmed, Sayem Mohammad Imtiaz, Hridesh Rajan, Gary T. Leavens
  • for: This paper aims to identify the most useful contracts for ML API users to catch errors early in the ML pipeline.
  • methods: The study uses empirical data from Stack Overflow to extract 413 informal API specifications for four popular ML libraries (TensorFlow, Scikit-learn, Keras, and PyTorch).
  • results: The key findings are that the most commonly needed contracts for ML APIs are checking constraints on single arguments of an API or on the order of API calls. The study also suggests a need to combine behavioral and temporal contract mining approaches.
    Abstract Recent work has shown that Machine Learning (ML) programs are error-prone and called for contracts for ML code. Contracts, as in the design by contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at earlier stages in the ML pipeline. We describe an empirical study of posts on Stack Overflow of the four most often-discussed ML libraries: TensorFlow, Scikit-learn, Keras, and PyTorch. For these libraries, our study extracted 413 informal (English) API specifications. We used these specifications to understand the following questions. What are the root causes and effects behind ML contract violations? Are there common patterns of ML contract violations? When does understanding ML contracts require an advanced level of ML software expertise? Could checking contracts at the API level help detect the violations in early ML pipeline stages? Our key findings are that the most commonly needed contracts for ML APIs are either checking constraints on single arguments of an API or on the order of API calls. The software engineering community could employ existing contract mining approaches to mine these contracts to promote an increased understanding of ML APIs. We also noted a need to combine behavioral and temporal contract mining approaches. We report on categories of required ML contracts, which may help designers of contract languages.
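The most commonly needed contract the study identifies, a constraint on a single argument checked before the call, maps naturally onto a design-by-contract wrapper. A minimal sketch (the decorator and the specific checks are hypothetical, not taken from the paper):

```python
import functools
import numpy as np

def requires(check, message):
    """Single-argument contract: validate inputs before the call runs,
    so violations surface early rather than deep inside the ML pipeline."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not check(*args, **kwargs):
                raise ValueError(f"contract violated in {fn.__name__}: {message}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires(lambda X, y: len(X) == len(y), "X and y must have the same number of rows")
@requires(lambda X, y: not np.isnan(X).any(), "X must not contain NaNs")
def fit(X, y):
    # stand-in for a library call such as model.fit(X, y)
    return np.linalg.lstsq(X, y, rcond=None)[0]

fit(np.ones((4, 2)), np.zeros(4))        # passes both contracts
# fit(np.ones((4, 2)), np.zeros(3))      # raises: shape contract violated
```

The other common kind, temporal contracts on the order of API calls (e.g., fit before predict), would instead be enforced with a small state machine inside the wrapped object.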

Training Quantum Boltzmann Machines with Coresets

  • paper_url: http://arxiv.org/abs/2307.14459
  • repo_url: None
  • paper_authors: Joshua Viszlai, Teague Tomesh, Pranav Gokhale, Eric Anschuetz, Frederic T. Chong
  • for: Accelerating the training of Quantum Boltzmann Machines (QBMs) on near-term quantum devices by using coreset techniques to reduce computation time.
  • methods: A coreset stands in for the full data set, reducing the number of Gibbs-state sampling steps required by gradient-based updates and speeding up overall training.
  • results: On 6x6 binary images from an augmented bars and stripes data set, coresets reduce QBM training time; an Inception-score-inspired metric is used to compare training with and without coresets.
    Abstract Recent work has proposed and explored using coreset techniques for quantum algorithms that operate on classical data sets to accelerate the applicability of these algorithms on near-term quantum devices. We apply these ideas to Quantum Boltzmann Machines (QBM) where gradient-based steps which require Gibbs state sampling are the main computational bottleneck during training. By using a coreset in place of the full data set, we try to minimize the number of steps needed and accelerate the overall training time. In a regime where computational time on quantum computers is a precious resource, we propose this might lead to substantial practical savings. We evaluate this approach on 6x6 binary images from an augmented bars and stripes data set using a QBM with 36 visible units and 8 hidden units. Using an Inception score inspired metric, we compare QBM training times with and without using coresets.
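The summary does not pin down the exact coreset construction, so as an illustration here is the standard lightweight-coreset sampler: it draws a weighted subset whose weighted sums approximate full-data sums, which is precisely the property that lets each gradient step get by with far fewer Gibbs-sampling calls.

```python
import numpy as np

def lightweight_coreset(X, m, rng):
    """Sample a weighted coreset of m points (lightweight-coreset scheme).

    Points are drawn with probability proportional to a mixture of the
    uniform distribution and their squared distance to the data mean;
    the importance weights make weighted sums over the coreset unbiased
    estimates of sums over the full data set.
    """
    dists = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 / len(X) + 0.5 * dists / dists.sum()
    idx = rng.choice(len(X), size=m, p=p)
    weights = 1.0 / (m * p[idx])
    return X[idx], weights

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1024, 36)).astype(float)  # 6x6 binary images, flattened
C, w = lightweight_coreset(X, m=64, rng=rng)

# A gradient step over the coreset replaces a sum over all 1024 images with
# a weighted sum over 64, cutting the number of Gibbs-sampling calls needed.
print(X.mean(axis=0)[:4])
print((w[:, None] * C).sum(axis=0)[:4] / w.sum())
```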

Predictive Maintenance of Armoured Vehicles using Machine Learning Approaches

  • paper_url: http://arxiv.org/abs/2307.14453
  • repo_url: None
  • paper_authors: Prajit Sengupta, Anant Mehta, Prashant Singh Rana
  • for: Predicting the maintenance needs of armoured vehicles from sensor data.
  • methods: An ensemble of models — Light Gradient Boosting, Random Forest, Decision Tree, Extra Tree Classifier and Gradient Boosting — predicts maintenance requirements from the collected sensor data.
  • results: Evaluated with K-fold cross-validation and TOPSIS analysis, the proposed ensemble reaches 98.93% accuracy, 99.80% precision and 99.03% recall, effectively predicting maintenance needs, reducing vehicle downtime and improving operational efficiency.
    Abstract Armoured vehicles are specialized and complex pieces of machinery designed to operate in high-stress environments, often in combat or tactical situations. This study proposes a predictive maintenance-based ensemble system that aids in predicting potential maintenance needs based on sensor data collected from these vehicles. The proposed model's architecture involves various models such as Light Gradient Boosting, Random Forest, Decision Tree, Extra Tree Classifier and Gradient Boosting to predict the maintenance requirements of the vehicles accurately. In addition, K-fold cross validation, along with TOPSIS analysis, is employed to evaluate the proposed ensemble model's stability. The results indicate that the proposed system achieves an accuracy of 98.93%, precision of 99.80% and recall of 99.03%. The algorithm can effectively predict maintenance needs, thereby reducing vehicle downtime and improving operational efficiency. Through comparisons between various algorithms and the suggested ensemble, this study highlights the potential of machine learning-based predictive maintenance solutions.
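A sketch of the ensemble-plus-validation recipe using scikit-learn. The data is a synthetic stand-in for sensor readings, LightGBM is left out to keep the sketch dependency-free, and all hyperparameters are defaults rather than the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in sensor data: 1000 readings, 20 features, binary "needs maintenance".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across members
)

# K-fold cross-validation of the ensemble, as in the paper's evaluation.
scores = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```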

VISPUR: Visual Aids for Identifying and Interpreting Spurious Associations in Data-Driven Decisions

  • paper_url: http://arxiv.org/abs/2307.14448
  • repo_url: https://github.com/picsolab/vispur
  • paper_authors: Xian Teng, Yongsu Ahn, Yu-Ru Lin
  • for: Helping people make data-driven decisions without being misled: existing tools easily capture spurious associations, leading to false conclusions and decisions.
  • methods: A visual analytics framework and human-centric workflow for tackling spurious associations, comprising a CONFOUNDER DASHBOARD that automatically identifies possible confounding factors, a SUBGROUP VIEWER for visualizing and comparing diverse subgroup patterns, a flow-based REASONING STORYBOARD that illustrates paradoxical phenomena, and an interactive DECISION DIAGNOSIS panel for accountable decision-making.
  • results: An expert interview and a controlled user experiment show that the proposed "de-paradox" workflow and the designed visual analytics system effectively help users identify and understand spurious associations and make accountable causal decisions.
    Abstract Big data and machine learning tools have jointly empowered humans in making data-driven decisions. However, many of them capture empirical associations that might be spurious due to confounding factors and subgroup heterogeneity. The famous Simpson's paradox is such a phenomenon where aggregated and subgroup-level associations contradict with each other, causing cognitive confusions and difficulty in making adequate interpretations and decisions. Existing tools provide little insights for humans to locate, reason about, and prevent pitfalls of spurious association in practice. We propose VISPUR, a visual analytic system that provides a causal analysis framework and a human-centric workflow for tackling spurious associations. These include a CONFOUNDER DASHBOARD, which can automatically identify possible confounding factors, and a SUBGROUP VIEWER, which allows for the visualization and comparison of diverse subgroup patterns that likely or potentially result in a misinterpretation of causality. Additionally, we propose a REASONING STORYBOARD, which uses a flow-based approach to illustrate paradoxical phenomena, as well as an interactive DECISION DIAGNOSIS panel that helps ensure accountable decision-making. Through an expert interview and a controlled user experiment, our qualitative and quantitative results demonstrate that the proposed "de-paradox" workflow and the designed visual analytic system are effective in helping human users to identify and understand spurious associations, as well as to make accountable causal decisions.
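The pitfall VISPUR is built to surface is easy to reproduce. A toy instance of Simpson's paradox (numbers follow the classic kidney-stone study): the aggregate association contradicts every subgroup.

```python
import pandas as pd

# Treatment B looks better overall, yet A is better within each subgroup --
# exactly the aggregate-vs-subgroup contradiction VISPUR helps users spot.
treatment = ["A"] * 350 + ["B"] * 350
severity = (["small"] * 87 + ["large"] * 263      # treatment A cases
            + ["small"] * 270 + ["large"] * 80)   # treatment B cases
recovered = ([1] * 81 + [0] * 6 + [1] * 192 + [0] * 71      # A: small, large
             + [1] * 234 + [0] * 36 + [1] * 55 + [0] * 25)  # B: small, large
df = pd.DataFrame({"treatment": treatment, "severity": severity,
                   "recovered": recovered})

print(df.groupby("treatment")["recovered"].mean())                # B wins: 0.78 vs 0.83
print(df.groupby(["severity", "treatment"])["recovered"].mean())  # A wins both subgroups
```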

Neural Schrödinger Bridge with Sinkhorn Losses: Application to Data-driven Minimum Effort Control of Colloidal Self-assembly

  • paper_url: http://arxiv.org/abs/2307.14442
  • repo_url: None
  • paper_authors: Iman Nodozi, Charlie Yan, Mira Khare, Abhishek Halder, Ali Mesbah
  • for: The minimum-effort control of colloidal self-assembly, formulated in order-parameter space as a generalized Schrödinger bridge problem.
  • methods: A data-driven learning and control framework, the "neural Schrödinger bridge", solves this class of problems, whose controlled drift and diffusion coefficients are non-affine in control and hard to obtain from physics-based modeling.
  • results: The controlled drift and diffusion coefficients are learned as two neural networks from molecular dynamics simulation data; these then train a third network with Sinkhorn losses designed for distributional endpoint constraints, and a numerical case study on colloidal self-assembly demonstrates the framework's effectiveness.
    Abstract We show that the minimum effort control of colloidal self-assembly can be naturally formulated in the order-parameter space as a generalized Schr\"odinger bridge problem -- a class of fixed-horizon stochastic optimal control problems that originated in the works of Erwin Schr\"odinger in the early 1930s. In recent years, this class of problems has seen a resurgence of research activities in control and machine learning communities. Different from the existing literature on the theory and computation for such problems, the controlled drift and diffusion coefficients for colloidal self-assembly are typically non-affine in control, and are difficult to obtain from physics-based modeling. We deduce the conditions of optimality for such generalized problems, and show that the resulting system of equations is structurally very different from the existing results in a way that standard computational approaches no longer apply. Thus motivated, we propose a data-driven learning and control framework, named `neural Schr\"odinger bridge', to solve such generalized Schr\"odinger bridge problems by innovating on recent advances in neural networks. We illustrate the effectiveness of the proposed framework using a numerical case study of colloidal self-assembly. We learn the controlled drift and diffusion coefficients as two neural networks using molecular dynamics simulation data, and then use these two to train a third network with Sinkhorn losses designed for distributional endpoint constraints, specific for this class of control problems.
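To illustrate the endpoint-matching role of the Sinkhorn losses, here is a minimal numpy sketch of the entropic optimal-transport cost between two point clouds. This is the plain entropic cost, not the paper's exact debiased loss, and all parameters are illustrative.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.5, n_iters=100):
    """Entropic optimal-transport cost between two point clouds.

    A minimal Sinkhorn fixed-point iteration with uniform weights; losses
    of this family let a network match a *distribution* at the endpoint
    rather than pointwise targets, which is what the bridge's terminal
    constraint requires.
    """
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                  # optimal coupling
    return (plan * C).sum()

rng = np.random.default_rng(0)
terminal = rng.normal(0.0, 1.0, (256, 2))   # e.g. simulated terminal states
target = rng.normal(2.0, 0.5, (256, 2))     # desired terminal distribution
print(f"Sinkhorn endpoint cost: {sinkhorn_cost(terminal, target):.3f}")
```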

Fixed Integral Neural Networks

  • paper_url: http://arxiv.org/abs/2307.14439
  • repo_url: None
  • paper_authors: Ryan Kortvelesy
  • for: Computing integrals of learned functions such as neural networks, which is usually done only numerically because analytical integration is viewed as intractable.
  • methods: A representation of the analytical integral of a learned function $f$, allowing the exact integral of a neural network to be computed and constrained networks to be parametrised by applying constraints directly to the integral.
  • results: The method computes exact integrals of learned functions and includes a construction that constrains $f$ to be positive, a necessary condition for applications such as probability distributions and distance metrics; several applications of the resulting fixed-integral neural network (FINN) are presented.
    Abstract It is often useful to perform integration over learned functions represented by neural networks. However, this integration is usually performed numerically, as analytical integration over learned functions (especially neural networks) is generally viewed as intractable. In this work, we present a method for representing the analytical integral of a learned function $f$. This allows the exact integral of a neural network to be computed, and enables constrained neural networks to be parametrised by applying constraints directly to the integral. Crucially, we also introduce a method to constrain $f$ to be positive, a necessary condition for many applications (e.g. probability distributions, distance metrics, etc). Finally, we introduce several applications where our fixed-integral neural network (FINN) can be utilised.
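A minimal 1D sketch of the idea, under assumed architectural choices: parametrise the antiderivative $F$ directly, recover $f = dF/dx$ by autograd, and read exact integrals off as $F(b) - F(a)$. Non-negative weights make $F$ monotone and hence $f$ positive.

```python
import torch
import torch.nn as nn
from torch.nn.functional import softplus

class FixedIntegralNet(nn.Module):
    """1D sketch of the fixed-integral idea (assumed architecture).

    The antiderivative F is parametrised directly; f = dF/dx comes from
    autograd, and the exact integral of f over [a, b] is F(b) - F(a).
    Constraining hidden/output weights to be non-negative keeps F monotone
    non-decreasing, one simple way to guarantee f >= 0 everywhere.
    """

    def __init__(self, hidden=64):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, 1))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(1, hidden))

    def F(self, x):
        h = torch.tanh(x @ softplus(self.w1).T + self.b1)  # monotone in x
        return h @ softplus(self.w2).T                     # positive mixture

    def f(self, x):
        x = x.clone().requires_grad_(True)
        return torch.autograd.grad(self.F(x).sum(), x, create_graph=True)[0]

    def integral(self, a, b):
        return self.F(b) - self.F(a)   # exact, by the fundamental theorem

net = FixedIntegralNet()
x = torch.linspace(-2.0, 2.0, 101).reshape(-1, 1)
a, b = torch.tensor([[-2.0]]), torch.tensor([[2.0]])
print(bool((net.f(x) >= 0).all()), net.integral(a, b).item())
```

Dividing $f$ by $F(b) - F(a)$ then pins the integral to one, turning a positive $f$ into a normalized density, which is one way to realise the "fixed integral" constraint.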

Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models

  • paper_url: http://arxiv.org/abs/2307.14430
  • repo_url: None
  • paper_authors: Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré
  • for: Studying how best to select training data, under a fixed token budget, so that large language models (LMs) perform well downstream across tasks.
  • methods: A framework built on the hypothesis that, like humans, LMs acquire interdependent skills in a natural order; it formalizes skills and ordered skill sets in terms of their associated data and introduces Skill-It, an online data-sampling algorithm over mixtures of skills for both continual pre-training and fine-tuning.
  • results: Ordered skill sets are shown to exist on synthetic and real data, and training on prerequisite skills lets more advanced skills be learned with less data. Skill-It achieves 36.5 points higher accuracy than random sampling on the LEGO synthetic in continual pre-training and reduces validation loss on the target skill by 13.6% versus training only on that skill's data in fine-tuning; applied to the RedPajama dataset to continually pre-train a 3B-parameter LM, it reaches higher accuracy on the LM Evaluation Harness with 1B tokens than uniform sampling over data sources with 3B tokens.
    Abstract The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
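The flavor of the online sampler is easy to sketch: keep a distribution over skills and upweight the ones whose validation loss is still high. This is an exponentiated-gradient caricature with hypothetical hooks, not the paper's exact update (which also exploits the learned skills graph).

```python
import numpy as np

def online_skill_mixture(train_round, n_skills, rounds, eta=0.5):
    """Online data-mixture sampler in the spirit of Skill-It (sketch).

    `train_round(weights)` stands in for: train on data mixed according to
    `weights`, then return per-skill validation losses (hypothetical hook).
    After each round, skills with high remaining loss are sampled more.
    """
    w = np.full(n_skills, 1.0 / n_skills)
    for _ in range(rounds):
        losses = train_round(w)           # per-skill validation losses
        w = w * np.exp(eta * losses)      # favor skills not yet learned
        w /= w.sum()
    return w

# Toy stand-in: skill 2 is "advanced" and also improves when its
# prerequisite, skill 0, is sampled; losses shrink with sampling weight.
state = {"loss": np.array([1.0, 1.0, 2.0])}
def fake_round(weights):
    gain = 0.3 * weights + 0.2 * weights[0] * np.array([0.0, 0.0, 1.0])
    state["loss"] = np.maximum(state["loss"] - gain, 0.05)
    return state["loss"]

print(online_skill_mixture(fake_round, n_skills=3, rounds=10).round(3))
```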

TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning

  • paper_url: http://arxiv.org/abs/2307.14338
  • repo_url: https://github.com/yandex-research/tabular-dl-tabr
  • paper_authors: Yury Gorishniy, Ivan Rubachev, Nikolay Kartashev, Daniil Shlenskii, Akim Kotelnikov, Artem Babenko
  • for: This paper is focused on developing a retrieval-based approach for tabular data problems using deep learning (DL) models.
  • methods: The authors propose a simple feed-forward architecture with an attention-like retrieval component, which is incrementally augmented to improve the performance on tabular data problems. The attention mechanism is designed to retrieve relevant objects from the available training data to make better predictions.
  • results: The proposed TabR model achieves the best average performance among tabular DL models on a set of public benchmarks, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed “GBDT-friendly” benchmark.
    Abstract Deep learning (DL) models for tabular data problems are receiving increasingly more attention, while the algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution. Following the recent trends in other domains, such as natural language processing and computer vision, several retrieval-augmented tabular DL models have been recently proposed. For a given target object, a retrieval-based model retrieves other relevant objects, such as the nearest neighbors, from the available (training) data and uses their features or even labels to make a better prediction. However, we show that the existing retrieval-based tabular DL solutions provide only minor, if any, benefits over the properly tuned simple retrieval-free baselines. Thus, it remains unclear whether the retrieval-based approach is a worthy direction for tabular DL. In this work, we give a strong positive answer to this question. We start by incrementally augmenting a simple feed-forward architecture with an attention-like retrieval component similar to those of many (tabular) retrieval-based models. Then, we highlight several details of the attention mechanism that turn out to have a massive impact on the performance on tabular data problems, but that were not explored in prior work. As a result, we design TabR -- a simple retrieval-based tabular DL model which, on a set of public benchmarks, demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed ``GBDT-friendly'' benchmark (see the first figure).
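The core retrieval step can be sketched as generic kNN attention over encoded training rows; this is the augmentation TabR builds on, not its exact similarity and value modules.

```python
import torch
import torch.nn.functional as F

def retrieval_step(q, keys, values, k=32):
    """Attention-like retrieval component for tabular prediction (sketch).

    q:      (d,) encoding of the target row.
    keys:   (N, d) encodings of candidate training rows.
    values: (N, d) value vectors derived from those rows and their labels.
    Retrieves the k most similar candidates, aggregates their values with
    similarity-based weights, and adds the result residually to the query.
    """
    sims = -((keys - q) ** 2).sum(dim=1)      # negative squared L2 similarity
    top, idx = sims.topk(k)
    attn = F.softmax(top, dim=0)              # weights over retrieved rows
    context = (attn.unsqueeze(1) * values[idx]).sum(dim=0)
    return q + context

d, N = 16, 1000
q, keys, values = torch.randn(d), torch.randn(N, d), torch.randn(N, d)
print(retrieval_step(q, keys, values).shape)  # torch.Size([16])
```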

Waypoint-Based Imitation Learning for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2307.14326
  • repo_url: https://github.com/lucys0/awe
  • paper_authors: Lucy Xiaoyang Shi, Archit Sharma, Tony Z. Zhao, Chelsea Finn
  • for: Addressing the compounding errors that afflict behavioral cloning (BC) in robotic manipulation.
  • methods: An Automatic Waypoint Extraction (AWE) preprocessing module decomposes each demonstration into a minimal set of waypoints such that linear interpolation between them approximates the trajectory up to a specified error threshold, shortening the horizon of the learning problem without extra human supervision.
  • results: Combined with state-of-the-art BC algorithms, AWE raises success rates by up to 25% in simulation and by 4-28% on real-world bimanual manipulation tasks, while reducing the decision-making horizon by up to a factor of 10.
    Abstract While imitation learning methods have seen a resurgent interest for robotic manipulation, the well-known problem of compounding errors continues to afflict behavioral cloning (BC). Waypoints can help address this problem by reducing the horizon of the learning problem for BC, and thus, the errors compounded over time. However, waypoint labeling is underspecified, and requires additional human supervision. Can we generate waypoints automatically without any additional human supervision? Our key insight is that if a trajectory segment can be approximated by linear motion, the endpoints can be used as waypoints. We propose Automatic Waypoint Extraction (AWE) for imitation learning, a preprocessing module to decompose a demonstration into a minimal set of waypoints which when interpolated linearly can approximate the trajectory up to a specified error threshold. AWE can be combined with any BC algorithm, and we find that AWE can increase the success rate of state-of-the-art algorithms by up to 25% in simulation and by 4-28% on real-world bimanual manipulation tasks, reducing the decision making horizon by up to a factor of 10. Videos and code are available at https://lucys0.github.io/awe/
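AWE's key insight, that endpoints suffice as waypoints wherever the motion is nearly linear, is captured by a Ramer-Douglas-Peucker-style recursive split. AWE itself solves for the minimal waypoint set; the greedy sketch below is a simpler approximation of the same idea.

```python
import numpy as np

def extract_waypoints(traj, err_thresh):
    """Automatic waypoint extraction by recursive linear fitting (sketch).

    If the straight line between a segment's endpoints approximates every
    intermediate state within `err_thresh`, the endpoints suffice as
    waypoints; otherwise split at the worst point and recurse.
    traj: (T, d) array of states; returns indices of selected waypoints.
    """
    def deviations(seg):
        a, b = seg[0], seg[-1]
        t = np.linspace(0, 1, len(seg))[:, None]
        interp = (1 - t) * a + t * b          # time-parametrized line
        return np.linalg.norm(seg - interp, axis=1)

    def recurse(lo, hi):
        dev = deviations(traj[lo:hi + 1])
        if dev.max() <= err_thresh:
            return [lo, hi]
        split = lo + int(dev.argmax())
        left, right = recurse(lo, split), recurse(split, hi)
        return left[:-1] + right              # merge, dropping duplicate index

    return recurse(0, len(traj) - 1)

t = np.linspace(0, 1, 200)[:, None]
traj = np.hstack([t, np.sin(3 * t)])          # toy 2D end-effector path
wp = extract_waypoints(traj, err_thresh=0.05)
print(f"{len(traj)} states -> {len(wp)} waypoints at indices {wp}")
```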

Evaluating the Moral Beliefs Encoded in LLMs

  • paper_url: http://arxiv.org/abs/2307.14324
  • repo_url: https://github.com/ninodimontalcino/moralchoice
  • paper_authors: Nino Scherrer, Claudia Shi, Amir Feder, David M. Blei
  • for: This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs) to understand their moral beliefs.
  • methods: The paper introduces a statistical method for eliciting beliefs encoded in LLMs, which includes statistical measures and evaluation metrics to quantify the probability of an LLM “making a choice”, the associated uncertainty, and the consistency of that choice.
  • results: The study finds that in unambiguous scenarios, most models “choose” actions that align with commonsense, while in ambiguous cases, most models express uncertainty. Additionally, some models reflect clear preferences in ambiguous scenarios, and closed-source models tend to agree with each other.
    Abstract This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.
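The flavor of the elicitation statistics is easy to sketch; the quantities and numbers below are illustrative, not the paper's exact estimators.

```python
import numpy as np

# Hypothetical measurements for one scenario: the model's probability of
# picking action 1 under several semantically equivalent question wordings
# (orderings flipped, different templates). Illustrative numbers only.
p_action1 = np.array([0.91, 0.88, 0.95, 0.62, 0.90])

choice_prob = p_action1.mean()                  # P(model "chooses" action 1)
uncertainty = -(choice_prob * np.log2(choice_prob)
                + (1 - choice_prob) * np.log2(1 - choice_prob))
decisions = p_action1 > 0.5
consistency = max(decisions.mean(), 1 - decisions.mean())  # wording-robustness

print(f"choice prob {choice_prob:.2f}, uncertainty {uncertainty:.2f} bits, "
      f"consistency across wordings {consistency:.2f}")
```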

Reinforcement Learning by Guided Safe Exploration

  • paper_url: http://arxiv.org/abs/2307.14316
  • repo_url: None
  • paper_authors: Qisong Yang, Thiago D. Simão, Nils Jansen, Simon H. Tindemans, Matthijs T. J. Spaan
  • for: trains an RL agent to adapt quickly to a target task in a constrained and unsafe environment
  • methods: uses a guide agent to explore safely without a reward signal, and regularizes a target policy towards the guide using transfer learning
  • results: achieves safe transfer learning and helps the target policy solve the task faster
    Abstract Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.
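A sketch of the transfer step: regularize the student toward the guide while the student is unreliable, then anneal the guide's influence away. This is an illustrative KL scheme; the coefficients and schedule are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def student_loss(student_logits, guide_logits, rl_loss, step, decay_steps=10_000):
    """Regularize the student policy toward the safe guide, then let go.

    A KL penalty pulls the student's action distribution toward the guide's
    while the student is still unreliable; the coefficient is annealed to
    zero so the guide's influence is gradually eliminated during training.
    """
    beta = max(0.0, 1.0 - step / decay_steps)   # guide influence -> 0
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(guide_logits, dim=-1),
        reduction="batchmean",
    )
    return rl_loss + beta * kl

logits_s, logits_g = torch.randn(32, 4), torch.randn(32, 4)
print(student_loss(logits_s, logits_g, rl_loss=torch.tensor(1.3), step=2500))
```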

Unsupervised Deep Learning-based Pansharpening with Jointly-Enhanced Spectral and Spatial Fidelity

  • paper_url: http://arxiv.org/abs/2307.14403
  • repo_url: https://github.com/matciotola/lambda-pnn
  • paper_authors: Matteo Ciotola, Giovanni Poggi, Giuseppe Scarpa
  • for: Improving deep-learning pansharpening by training at full resolution, since models trained on downsized images tend to perform poorly on high-resolution targets.
  • methods: The model adds residual attention modules over previous work and a novel loss function that jointly promotes the spectral and spatial quality of the pansharpened data; a new fine-tuning strategy improves inference-time adaptation to target images.
  • results: Experiments on a large variety of test images in challenging scenarios show the method compares favorably with the state of the art in both numerical results and visual output. Code is available at https://github.com/matciotola/Lambda-PNN.
    Abstract In latest years, deep learning has gained a leading role in the pansharpening of multiresolution images. Given the lack of ground truth data, most deep learning-based methods carry out supervised training in a reduced-resolution domain. However, models trained on downsized images tend to perform poorly on high-resolution target images. For this reason, several research groups are now turning to unsupervised training in the full-resolution domain, through the definition of appropriate loss functions and training paradigms. In this context, we have recently proposed a full-resolution training framework which can be applied to many existing architectures. Here, we propose a new deep learning-based pansharpening model that fully exploits the potential of this approach and provides cutting-edge performance. Besides architectural improvements with respect to previous work, such as the use of residual attention modules, the proposed model features a novel loss function that jointly promotes the spectral and spatial quality of the pansharpened data. In addition, thanks to a new fine-tuning strategy, it improves inference-time adaptation to target images. Experiments on a large variety of test images, performed in challenging scenarios, demonstrate that the proposed method compares favorably with the state of the art both in terms of numerical results and visual output. Code is available online at https://github.com/matciotola/Lambda-PNN.

A Constraint Enforcement Deep Reinforcement Learning Framework for Optimal Energy Storage Systems Dispatch

  • paper_url: http://arxiv.org/abs/2307.14304
  • repo_url: https://github.com/ShengrenHou/Energy-management-MIP-Deep-Reinforcement-Learning
  • paper_authors: Shengren Hou, Edgar Mauricio Salazar Duque, Peter Palensky, Pedro P. Vergara
  • for: optimize the dispatch of energy storage systems (ESSs) in the presence of uncertainty
  • methods: deep reinforcement learning (DRL) algorithms with mixed-integer programming (MIP) formulation to enforce operational constraints
  • results: superior performance compared to state-of-the-art DRL algorithms and the optimal solution with perfect forecast of stochastic variables, effectively enforcing all constraints while delivering high-quality dispatch decisions.
    Abstract The optimal dispatch of energy storage systems (ESSs) presents formidable challenges due to the uncertainty introduced by fluctuations in dynamic prices, demand consumption, and renewable-based energy generation. By exploiting the generalization capabilities of deep neural networks (DNNs), deep reinforcement learning (DRL) algorithms can learn good-quality control models that adaptively respond to distribution networks' stochastic nature. However, current DRL algorithms lack the capabilities to enforce operational constraints strictly, often even providing unfeasible control actions. To address this issue, we propose a DRL framework that effectively handles continuous action spaces while strictly enforcing the environments and action space operational constraints during online operation. Firstly, the proposed framework trains an action-value function modeled using DNNs. Subsequently, this action-value function is formulated as a mixed-integer programming (MIP) formulation enabling the consideration of the environment's operational constraints. Comprehensive numerical simulations show the superior performance of the proposed MIP-DRL framework, effectively enforcing all constraints while delivering high-quality dispatch decisions when compared with state-of-the-art DRL algorithms and the optimal solution obtained with a perfect forecast of the stochastic variables.
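The paper embeds the trained action-value function in a MIP to select the best feasible action; as a lightweight stand-in, the sketch below simply projects a raw continuous action onto the same kind of operational constraints (power rating and state-of-charge bounds). All parameters are assumptions.

```python
import numpy as np

def project_action(p_raw, soc, dt=1.0, cap=100.0, p_max=25.0, eta=0.95):
    """Clip a raw DRL dispatch action into the ESS feasible set (sketch).

    p_raw > 0 means charging; soc and cap in kWh, powers in kW, dt in hours.
    Enforces the converter's power rating and keeps the state of charge
    within [0, cap] over the next interval.
    """
    p = np.clip(p_raw, -p_max, p_max)               # power rating
    p_charge_max = (cap - soc) / (eta * dt)         # cannot overfill
    p_discharge_max = soc * eta / dt                # cannot deplete
    return float(np.clip(p, -p_discharge_max, p_charge_max))

print(project_action(p_raw=40.0, soc=95.0))   # limited by rating and headroom
print(project_action(p_raw=-30.0, soc=5.0))   # limited by remaining energy
```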

ChatGPT and Persuasive Technologies for the Management and Delivery of Personalized Recommendations in Hotel Hospitality

  • paper_url: http://arxiv.org/abs/2307.14298
  • repo_url: None
  • paper_authors: Manolis Remountakis, Konstantinos Kotis, Babis Kourtzis, George E. Tsekouras
  • for: Exploring how ChatGPT and persuasive technologies can automate and improve hotel hospitality recommender systems, enhancing personalization and hotel performance.
  • methods: The paper first examines ChatGPT's ability to understand and generate human-like text for more accurate, context-aware recommendations, covering its integration into recommender systems to analyze user preferences, extract insights from online reviews, and generate personalized recommendations from guest profiles; it then investigates persuasive techniques such as social proof, scarcity and personalization for influencing user decisions, e.g. booking a specific hotel or upgrading a room.
  • results: A pilot experiment with a hotel recommender system studies the impact of integrating ChatGPT and persuasive techniques on user engagement, satisfaction, and conversion rates; preliminary results indicate these technologies can enhance the overall guest experience and business performance.
    Abstract Recommender systems have become indispensable tools in the hotel hospitality industry, enabling personalized and tailored experiences for guests. Recent advancements in large language models (LLMs), such as ChatGPT, and persuasive technologies, have opened new avenues for enhancing the effectiveness of those systems. This paper explores the potential of integrating ChatGPT and persuasive technologies for automating and improving hotel hospitality recommender systems. First, we delve into the capabilities of ChatGPT, which can understand and generate human-like text, enabling more accurate and context-aware recommendations. We discuss the integration of ChatGPT into recommender systems, highlighting the ability to analyze user preferences, extract valuable insights from online reviews, and generate personalized recommendations based on guest profiles. Second, we investigate the role of persuasive technology in influencing user behavior and enhancing the persuasive impact of hotel recommendations. By incorporating persuasive techniques, such as social proof, scarcity and personalization, recommender systems can effectively influence user decision-making and encourage desired actions, such as booking a specific hotel or upgrading their room. To investigate the efficacy of ChatGPT and persuasive technologies, we present a pilot experi-ment with a case study involving a hotel recommender system. We aim to study the impact of integrating ChatGPT and persua-sive techniques on user engagement, satisfaction, and conversion rates. The preliminary results demonstrate the potential of these technologies in enhancing the overall guest experience and business performance. Overall, this paper contributes to the field of hotel hospitality by exploring the synergistic relationship between LLMs and persuasive technology in recommender systems, ultimately influencing guest satisfaction and hotel revenue.

Unraveling the Complexity of Splitting Sequential Data: Tackling Challenges in Video and Time Series Analysis

  • paper_url: http://arxiv.org/abs/2307.14294
  • repo_url: None
  • paper_authors: Diego Botache, Kristina Dingel, Rico Huhnstock, Arno Ehresmann, Bernhard Sick
  • for: Examining the challenges of splitting sequential data, such as videos and time series, an essential step in data analysis tasks including object tracking and anomaly detection.
  • methods: This concept article works through the challenges associated with splitting sequential data: data acquisition, data representation, split-ratio selection, setting up quality criteria, and choosing suitable selection strategies.
  • results: The challenges are explored through two real-world examples: motor test benches and particle tracking in liquids.
    Abstract Splitting of sequential data, such as videos and time series, is an essential step in various data analysis tasks, including object tracking and anomaly detection. However, splitting sequential data presents a variety of challenges that can impact the accuracy and reliability of subsequent analyses. This concept article examines the challenges associated with splitting sequential data, including data acquisition, data representation, split ratio selection, setting up quality criteria, and choosing suitable selection strategies. We explore these challenges through two real-world examples: motor test benches and particle tracking in liquids.

General Purpose Artificial Intelligence Systems (GPAIS): Properties, Definition, Taxonomy, Open Challenges and Implications

  • paper_url: http://arxiv.org/abs/2307.14283
  • repo_url: None
  • paper_authors: Isaac Triguero, Daniel Molina, Javier Poyatos, Javier Del Ser, Francisco Herrera
  • for: Defining and categorizing General-Purpose Artificial Intelligence Systems (GPAIS) so that research on general-purpose tasks can be coordinated across fields.
  • methods: A new definition of GPAIS with a gradual taxonomy that distinguishes closed-world from open-world GPAIS according to properties such as autonomy, adaptation to new tasks, competence in domains not intentionally trained for, learning from few data, and acknowledgment of their own limitations; approaches to realising GPAIS are also surveyed, such as using AI techniques to improve another AI, foundation models, and generative AI.
  • results: The proposed definition and taxonomy are intended to facilitate research collaboration across areas tackling general-purpose tasks; the paper also discusses the current state, challenges and prospects of GPAIS, its societal implications, and the need for responsible, trustworthy AI and regulation.
    Abstract Most applications of Artificial Intelligence (AI) are designed for a confined and specific task. However, there are many scenarios that call for a more general AI, capable of solving a wide array of tasks without being specifically designed for them. The term General-Purpose Artificial Intelligence Systems (GPAIS) has been defined to refer to these AI systems. To date, the possibility of an Artificial General Intelligence, powerful enough to perform any intellectual task as if it were human, or even improve it, has remained an aspiration, fiction, and considered a risk for our society. Whilst we might still be far from achieving that, GPAIS is a reality and sitting at the forefront of AI research. This work discusses existing definitions for GPAIS and proposes a new definition that allows for a gradual differentiation among types of GPAIS according to their properties and limitations. We distinguish between closed-world and open-world GPAIS, characterising their degree of autonomy and ability based on several factors such as adaptation to new tasks, competence in domains not intentionally trained for, ability to learn from few data, or proactive acknowledgment of their own limitations. We then propose a taxonomy of approaches to realise GPAIS, describing research trends such as the use of AI techniques to improve another AI or foundation models. As a prime example, we delve into generative AI, aligning them with the terms and concepts presented in the taxonomy. Through the proposed definition and taxonomy, our aim is to facilitate research collaboration across different areas that are tackling general-purpose tasks, as they share many common aspects. Finally, we discuss the current state of GPAIS, its challenges and prospects, implications for our society, and the need for responsible and trustworthy AI systems and regulation, with the goal of providing a holistic view of GPAIS.

Deepfake Image Generation for Improved Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.14273
  • repo_url: None
  • paper_authors: Roa’a Al-Emaryeen, Sara Al-Nahhas, Fatima Himour, Waleed Mahafza, Omar Al-Kadi
  • for: Using deepfake image generation to improve brain tumor segmentation accuracy.
  • methods: A generative adversarial network performs image-to-image translation to increase dataset size, followed by segmentation with a U-Net-based convolutional neural network trained on the deepfake images.
  • results: Compared against the ground truth of four publicly available datasets, the approach improves image segmentation quality metrics and could assist when training with limited data.
    Abstract As the world progresses in technology and health, awareness of disease by revealing asymptomatic signs improves. It is important to detect and treat tumors in early stage as it can be life-threatening. Computer-aided technologies are used to overcome lingering limitations facing disease diagnosis, while brain tumor segmentation remains a difficult process, especially when multi-modality data is involved. This is mainly attributed to ineffective training due to lack of data and corresponding labelling. This work investigates the feasibility of employing deep-fake image generation for effective brain tumor segmentation. To this end, a Generative Adversarial Network was used for image-to-image translation for increasing dataset size, followed by image segmentation using a U-Net-based convolutional neural network trained with deepfake images. Performance of the proposed approach is compared with ground truth of four publicly available datasets. Results show improved performance in terms of image segmentation quality metrics, and could potentially assist when training with limited data.

eess.IV - 2023-07-27

Weakly Supervised AI for Efficient Analysis of 3D Pathology Samples

  • paper_url: http://arxiv.org/abs/2307.14907
  • repo_url: https://github.com/mahmoodlab/mamba
  • paper_authors: Andrew H. Song, Mane Williams, Drew F. K. Williamson, Guillaume Jaume, Andrew Zhang, Bowen Chen, Robert Serafin, Jonathan T. C. Liu, Alex Baras, Anil V. Parwani, Faisal Mahmood
  • for: Proposing a deep-learning platform for analyzing 3D pathology images of tissue samples and predicting patient outcomes such as cancer risk stratification.
  • methods: 3D pathology datasets are acquired with diverse imaging modalities, including open-top light-sheet microscopy and microcomputed tomography, and processed with Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), which trains risk-stratification networks on 5-year biochemical recurrence outcomes.
  • results: The 3D block-based approach reaches AUCs of 0.86 and 0.74 on archived prostate cancer specimens, outperforming 2D single-slice-based prognostication (AUCs of 0.79 and 0.57); incorporating greater tissue volume further improves prognostic performance and mitigates the prediction variability caused by sampling bias.
    Abstract Human tissue and its constituent cells form a microenvironment that is fundamentally three-dimensional (3D). However, the standard-of-care in pathologic diagnosis involves selecting a few two-dimensional (2D) sections for microscopic evaluation, risking sampling bias and misdiagnosis. Diverse methods for capturing 3D tissue morphologies have been developed, but they have yet had little translation to clinical practice; manual and computational evaluations of such large 3D data have so far been impractical and/or unable to provide patient-level clinical insights. Here we present Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), a deep-learning-based platform for processing 3D tissue images from diverse imaging modalities and predicting patient outcomes. Archived prostate cancer specimens were imaged with open-top light-sheet microscopy or microcomputed tomography and the resulting 3D datasets were used to train risk-stratification networks based on 5-year biochemical recurrence outcomes via MAMBA. With the 3D block-based approach, MAMBA achieves an area under the receiver operating characteristic curve (AUC) of 0.86 and 0.74, superior to 2D traditional single-slice-based prognostication (AUC of 0.79 and 0.57), suggesting superior prognostication with 3D morphological features. Further analyses reveal that the incorporation of greater tissue volume improves prognostic performance and mitigates risk prediction variability from sampling bias, suggesting the value of capturing larger extents of heterogeneous 3D morphology. With the rapid growth and adoption of 3D spatial biology and pathology techniques by researchers and clinicians, MAMBA provides a general and efficient framework for 3D weakly supervised learning for clinical decision support and can help to reveal novel 3D morphological biomarkers for prognosis and therapeutic response.
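The weak supervision here is classic multiple-instance learning: a volume is a bag of 3D blocks carrying a single patient-level label. A standard attention-MIL sketch (not MAMBA's exact architecture):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Attention-based multiple-instance learning over 3D blocks (sketch).

    Each tissue volume is a bag of embedded 3D blocks with one weakly
    supervised patient-level label; attention pools the block embeddings
    into a single bag representation for outcome prediction.
    """

    def __init__(self, d=512, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(d, 1)

    def forward(self, blocks):                        # blocks: (n_blocks, d)
        a = torch.softmax(self.attn(blocks), dim=0)   # (n_blocks, 1) weights
        bag = (a * blocks).sum(dim=0)                 # attention-pooled bag
        return self.head(bag), a                      # risk logit + weights

model = AttentionMIL()
blocks = torch.randn(200, 512)               # embeddings of 200 3D blocks
logit, weights = model(blocks)
print(logit.shape, weights.shape)            # torch.Size([1]) torch.Size([200, 1])
```

The attention weights double as a readout of which blocks drove the prediction, which is useful when searching for 3D morphological biomarkers.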

A full-resolution training framework for Sentinel-2 image fusion

  • paper_url: http://arxiv.org/abs/2307.14864
  • repo_url: https://github.com/matciotola/FR-FUSE
  • paper_authors: Matteo Ciotola, Mario Ragosta, Giovanni Poggi, Giuseppe Scarpa
  • for: Proposing a new unsupervised deep-learning framework for super-resolving Sentinel-2 images by fusing their 10-m and 20-m bands.
  • methods: The scheme trains in the full-resolution domain, avoiding the resolution-downgrade process needed to generate supervised training data, and uses a loss that accounts for cycle-consistency between the network prediction and the input components being fused.
  • results: Despite its unsupervised nature, preliminary experiments show results comparable to the supervised approach; by construction of the loss, the trained network can be ascribed to the class of multi-resolution analysis methods.
    Abstract This work presents a new unsupervised framework for training deep learning models for super-resolution of Sentinel-2 images by fusion of its 10-m and 20-m bands. The proposed scheme avoids the resolution downgrade process needed to generate training data in the supervised case. On the other hand, a proper loss that accounts for cycle-consistency between the network prediction and the input components to be fused is proposed. Despite its unsupervised nature, in our preliminary experiments the proposed scheme has shown promising results in comparison to the supervised approach. Besides, by construction of the proposed loss, the resulting trained network can be ascribed to the class of multi-resolution analysis methods.
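One way to picture a full-resolution, cycle-consistent fusion loss; the terms and weights below are illustrative assumptions, and the paper's exact loss differs.

```python
import torch
import torch.nn.functional as F

def highpass(x, scale=2):
    """Detail component: subtract an upsampled low-pass version of x."""
    low = F.avg_pool2d(x, scale)
    low = F.interpolate(low, scale_factor=scale, mode="bilinear",
                        align_corners=False)
    return x - low

def full_resolution_loss(fused, bands20, ref10):
    """Unsupervised full-resolution fusion loss (illustrative sketch).

    Spectral (cycle-consistency) term: re-degrading the 10 m fusion output
    must recover the original 20 m bands. Spatial term: the fusion's
    high-frequency detail should follow that of a native 10 m band.
    """
    spectral = F.l1_loss(F.avg_pool2d(fused, 2), bands20)
    spatial = F.l1_loss(highpass(fused).mean(1, keepdim=True), highpass(ref10))
    return spectral + 0.5 * spatial

fused = torch.rand(1, 6, 128, 128)     # six 20 m bands predicted at 10 m
bands20 = torch.rand(1, 6, 64, 64)     # original 20 m bands
ref10 = torch.rand(1, 1, 128, 128)     # a 10 m band used as spatial guide
print(full_resolution_loss(fused, bands20, ref10))
```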

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

  • paper_url: http://arxiv.org/abs/2308.02510
  • repo_url: None
  • paper_authors: Yu-Ting Lan, Kan Ren, Yansen Wang, Wei-Long Zheng, Dongsheng Li, Bao-Liang Lu, Lili Qiu
  • for: Reconstructing the images a person observes from their brain signals, namely electroencephalography (EEG) data.
  • methods: A comprehensive pipeline, NeuroImagen, featuring a novel multi-level perceptual information decoder that extracts multi-grained outputs from the EEG data, followed by a latent diffusion model that uses the extracted information to reconstruct high-resolution images.
  • results: Experiments demonstrate effective image reconstruction and superior quantitative performance compared with traditional methods.
    Abstract Seeing is believing, however, the underlying mechanism of how human visual perceptions are intertwined with our cognitions is still a mystery. Thanks to the recent advances in both neuroscience and artificial intelligence, we have been able to record the visually evoked brain activities and mimic the visual perception ability through computational approaches. In this paper, we pay attention to visual stimuli reconstruction by reconstructing the observed images based on portably accessible brain signals, i.e., electroencephalography (EEG) data. Since EEG signals are dynamic in the time-series format and are notorious to be noisy, processing and extracting useful information requires more dedicated efforts; In this paper, we propose a comprehensive pipeline, named NeuroImagen, for reconstructing visual stimuli images from EEG signals. Specifically, we incorporate a novel multi-level perceptual information decoding to draw multi-grained outputs from the given EEG data. A latent diffusion model will then leverage the extracted information to reconstruct the high-resolution visual stimuli images. The experimental results have illustrated the effectiveness of image reconstruction and superior quantitative performance of our proposed method.

Test Time Adaptation for Blind Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.14735
  • repo_url: https://github.com/shankhanil006/tta-iqa
  • paper_authors: Subhadeep Roy, Shankhanil Mitra, Soma Biswas, Rajiv Soundararajan
  • for: Improving the test-time performance of blind image quality assessment (IQA) algorithms, whose designs rarely account for the distribution shift between training and testing scenarios.
  • methods: Two novel quality-relevant auxiliary tasks enable test-time adaptation: a group contrastive loss at the batch level and a relative rank loss at the sample level, which make the model quality-aware and adapt it to the target data.
  • results: Even a small batch of images from the test distribution yields significant performance improvements by updating the batch normalization statistics of the source model.
    Abstract While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.
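A sketch of the sample-level rank idea: within a test batch, require the model's scores to preserve the ordering given by some quality proxy, updating only lightweight statistics of the source model. The proxy and margin are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def relative_rank_loss(scores, pseudo_quality, margin=0.5):
    """Sample-level relative rank loss for test-time adaptation (sketch).

    Picks the pair judged best/worst by a quality proxy (`pseudo_quality`,
    an assumption here, e.g. a no-reference sharpness measure) and requires
    the model's predicted scores to preserve that ordering by a margin.
    """
    hi, lo = pseudo_quality.argmax(), pseudo_quality.argmin()
    return F.relu(margin - (scores[hi] - scores[lo]))

scores = torch.tensor([0.3, 0.8, 0.5], requires_grad=True)  # model outputs
proxy = torch.tensor([0.9, 0.2, 0.6])                       # pseudo labels
loss = relative_rank_loss(scores, proxy)
loss.backward()                                             # drives adaptation
print(loss, scores.grad)
```

In practice such gradients would update only the batch normalization statistics and affine parameters of the source model, which is what makes the adaptation cheap and stable.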

A Multimodal Supervised Machine Learning Approach for Satellite-based Wildfire Identification in Europe

  • paper_url: http://arxiv.org/abs/2308.02508
  • repo_url: None
  • paper_authors: Angelica Urbanelli, Luca Barco, Edoardo Arnaudo, Claudio Rossi
  • for: Improving the accuracy of automated satellite-based hotspot detection systems for the early identification and monitoring of wildfires.
  • methods: Thermal anomalies detected by the MODIS and VIIRS hotspot services are cross-referenced with the European Forest Fire Information System (EFFIS) database to build a large-scale hotspot dataset for Europe; a multimodal supervised machine learning approach then disambiguates detections, distinguishing wildfires from other events using sources such as the ERSI annual Land Use Land Cover (LULC) and Copernicus Sentinel-3 data.
  • results: Experimental results demonstrate the effectiveness of the approach on the wildfire identification task.
    Abstract The increasing frequency of catastrophic natural events, such as wildfires, calls for the development of rapid and automated wildfire detection systems. In this paper, we propose a wildfire identification solution to improve the accuracy of automated satellite-based hotspot detection systems by leveraging multiple information sources. We cross-reference the thermal anomalies detected by the Moderate-resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS) hotspot services with the European Forest Fire Information System (EFFIS) database to construct a large-scale hotspot dataset for wildfire-related studies in Europe. Then, we propose a novel multimodal supervised machine learning approach to disambiguate hotspot detections, distinguishing between wildfires and other events. Our methodology includes the use of multimodal data sources, such as the ERSI annual Land Use Land Cover (LULC) and the Copernicus Sentinel-3 data. Experimental results demonstrate the effectiveness of our approach in the task of wildfire identification.

LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.14659
  • repo_url: https://github.com/taowangzj/lldiffusion
  • paper_authors: Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li
  • for: Improving low-light image enhancement (LLIE) with diffusion models.
  • methods: A degradation-aware learning scheme that integrates degradation and image priors into the diffusion process: a joint learning framework for image generation and enhancement learns degradation representations, and a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module conditions on both the color map and the latent degradation representations.
  • results: Extensive experiments on several well-known synthetic and real-world unpaired benchmarks show that LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively while preserving color fidelity.
    Abstract Current deep learning methods for low-light image enhancement (LLIE) typically rely on pixel-wise mapping learned from paired data. However, these methods often overlook the importance of considering degradation representations, which can lead to sub-optimal outcomes. In this paper, we address this limitation by proposing a degradation-aware learning scheme for LLIE using diffusion models, which effectively integrates degradation and image priors into the diffusion process, resulting in improved image enhancement. Our proposed degradation-aware learning scheme is based on the understanding that degradation representations play a crucial role in accurately modeling and capturing the specific degradation patterns present in low-light images. To this end, First, a joint learning framework for both image generation and image enhancement is presented to learn the degradation representations. Second, to leverage the learned degradation representations, we develop a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module. This module takes into account both the color map and the latent degradation representations to guide the diffusion process. By incorporating these conditioning factors, the proposed LLDiffusion can effectively enhance low-light images, considering both the inherent degradation patterns and the desired color fidelity. Finally, we evaluate our proposed method on several well-known benchmark datasets, including synthetic and real-world unpaired datasets. Extensive experiments on public benchmarks demonstrate that our LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively. The source code and pre-trained models are available at https://github.com/TaoWangzj/LLDiffusion.

A Weakly Supervised Segmentation Network Embedding Cross-scale Attention Guidance and Noise-sensitive Constraint for Detecting Tertiary Lymphoid Structures of Pancreatic Tumors

  • paper_url: http://arxiv.org/abs/2307.14603
  • repo_url: None
  • paper_authors: Bingxue Wang, Liwen Zou, Jun Chen, Yingying Cao, Zhenghua Cai, Yudong Qiu, Liang Mao, Zhongqiu Wang, Jingya Chen, Luying Gui, Xiaoping Yang
  • for: Detecting tertiary lymphoid structures (TLSs) on pancreatic pathology images, an important prognostic indicator for the diagnosis and treatment of patients with pancreatic tumors.
  • methods: A weakly supervised segmentation network trained in a few-shot manner: lymphocyte density maps are obtained by combining a pretrained nuclei-segmentation model with a domain adversarial network for lymphocyte nuclei recognition; a cross-scale attention guidance mechanism jointly learns coarse-scale features from the original histopathology images and fine-scale features from the designed lymphocyte density attention, and an embedding signed-distance-function loss adds a noise-sensitive constraint that reduces tiny prediction errors during training.
  • results: On two collected datasets, the method significantly outperforms state-of-the-art segmentation-based algorithms in TLS detection accuracy; applying it to study the relationship between TLS density and peripancreatic vascular invasion yields clinically significant statistical results.
    Abstract The presence of tertiary lymphoid structures (TLSs) on pancreatic pathological images is an important prognostic indicator of pancreatic tumors. Therefore, TLSs detection on pancreatic pathological images plays a crucial role in diagnosis and treatment for patients with pancreatic tumors. However, fully supervised detection algorithms based on deep learning usually require a large number of manual annotations, which is time-consuming and labor-intensive. In this paper, we aim to detect the TLSs in a manner of few-shot learning by proposing a weakly supervised segmentation network. We firstly obtain the lymphocyte density maps by combining a pretrained model for nuclei segmentation and a domain adversarial network for lymphocyte nuclei recognition. Then, we establish a cross-scale attention guidance mechanism by jointly learning the coarse-scale features from the original histopathology images and fine-scale features from our designed lymphocyte density attention. A noise-sensitive constraint is introduced by an embedding signed distance function loss in the training procedure to reduce tiny prediction errors. Experimental results on two collected datasets demonstrate that our proposed method significantly outperforms the state-of-the-art segmentation-based algorithms in terms of TLSs detection accuracy. Additionally, we apply our method to study the congruent relationship between the density of TLSs and peripancreatic vascular invasion and obtain some clinically statistical results.
    摘要 胰腺病理图像中三级淋巴结构(TLSs)的存在是胰腺肿瘤的重要预后指标,因此在胰腺病理图像上检测 TLSs 对胰腺肿瘤患者的诊断与治疗具有关键作用。然而,基于深度学习的全监督检测算法通常需要大量人工标注,费时费力。本文提出一种弱监督分割网络,以少样本学习的方式检测 TLSs。我们首先将预训练的细胞核分割模型与用于淋巴细胞核识别的域对抗网络相结合,得到淋巴细胞密度图;然后建立跨尺度注意力引导机制,联合学习原始组织病理图像的粗尺度特征与我们设计的淋巴细胞密度注意力所提供的细尺度特征;在训练过程中,通过嵌入符号距离函数损失引入噪声敏感约束,以减少微小的预测误差。在两个自行收集的数据集上的实验结果表明,所提方法在 TLSs 检测精度上显著优于现有的基于分割的算法。此外,我们还将该方法用于研究 TLSs 密度与胰周血管浸润之间的关系,并得到了一些具有临床统计意义的结果。
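The noise-sensitive constraint above can be illustrated with a small signed-distance-function (SDF) loss: embedding the ground-truth mask as an SDF grades boundary errors by their distance, so tiny mislocalizations are penalized smoothly. The product-style loss and normalization below are illustrative assumptions; the paper's exact formulation may differ.

```python
# Sketch of an SDF-embedding loss for boundary-sensitive segmentation training.
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """SDF of a binary mask: negative inside, positive outside, zero on the boundary."""
    mask = mask.astype(bool)
    if not mask.any() or mask.all():
        return np.zeros(mask.shape, dtype=np.float32)
    dist_out = distance_transform_edt(~mask)  # distance to the object, outside it
    dist_in = distance_transform_edt(mask)    # distance to the background, inside it
    return (dist_out - dist_in).astype(np.float32)

def sdf_loss(pred_prob, gt_mask):
    """Penalize foreground probability against the ground-truth SDF.

    The product is negative where prediction and SDF agree in sign, so
    minimizing it pushes high probabilities inside the object (SDF < 0)
    with a penalty that grows smoothly with distance from the boundary.
    """
    sdf = signed_distance(gt_mask)
    sdf = sdf / (np.abs(sdf).max() + 1e-8)  # normalize to [-1, 1]
    return float(np.mean(pred_prob * sdf))

gt = np.zeros((64, 64), dtype=np.uint8)
gt[20:40, 20:40] = 1
pred = gt.astype(np.float32) * 0.9 + 0.05   # near-perfect prediction
print(sdf_loss(pred, gt))                   # small (negative) loss
```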

FocalErrorNet: Uncertainty-aware focal modulation network for inter-modal registration error estimation in ultrasound-guided neurosurgery

  • paper_url: http://arxiv.org/abs/2307.14520
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: 这篇论文旨在提高脑肿瘤切除手术中对脑移位(brain shift)的跟踪精度,以保证治疗的安全性和效果。
  • methods: 该论文使用术中超声(iUS)提供实时图像来跟踪脑移位,并通过 MRI-iUS 多模态配准来更新术前手术方案。
  • results: 该论文提出了一种基于 3D 焦点调制(focal modulation)并结合不确定性估计的深度学习技术,用于准确评估 MRI-iUS 配准误差,以避免手术中的不良后果。
    Abstract In brain tumor resection, accurate removal of cancerous tissues while preserving eloquent regions is crucial to the safety and outcomes of the treatment. However, intra-operative tissue deformation (called brain shift) can move the surgical target and render the pre-surgical plan invalid. Intra-operative ultrasound (iUS) has been adopted to provide real-time images to track brain shift, and inter-modal (i.e., MRI-iUS) registration is often required to update the pre-surgical plan. Quality control for the registration results during surgery is important to avoid adverse outcomes, but manual verification faces great challenges due to difficult 3D visualization and the low contrast of iUS. Automatic algorithms are urgently needed to address this issue, but the problem was rarely attempted. Therefore, we propose a novel deep learning technique based on 3D focal modulation in conjunction with uncertainty estimation to accurately assess MRI-iUS registration errors for brain tumor surgery. Developed and validated with the public RESECT clinical database, the resulting algorithm can achieve an estimation error of 0.59±0.57 mm.
    摘要 在脑肿瘤切除手术中,在保留功能区的同时准确切除癌变组织,对治疗的安全性和效果至关重要。然而,术中组织形变(即脑移位)会使手术目标发生移动,导致术前方案失效。术中超声(iUS)已被用于提供实时图像以跟踪脑移位,并且通常需要多模态(即 MRI-iUS)配准来更新术前方案。手术过程中对配准结果进行质量控制十分重要,以避免不良后果;但由于 3D 可视化困难以及 iUS 对比度低,人工验证面临很大挑战,亟需自动算法来解决这一此前很少被研究的问题。因此,我们提出一种基于 3D 焦点调制并结合不确定性估计的新型深度学习技术,用于准确评估脑肿瘤手术中的 MRI-iUS 配准误差。该算法在公开的 RESECT 临床数据库上开发并验证,估计误差为 0.59±0.57 毫米。

Phenotype-preserving metric design for high-content image reconstruction by generative inpainting

  • paper_url: http://arxiv.org/abs/2307.14436
  • repo_url: None
  • paper_authors: Vaibhav Sharma, Artur Yakimovich
  • for: This paper focuses on the problem of image restoration in high-content microscopy datasets, specifically the issue of imaging and sample preparation artefacts in fluorescence microscopy images of cultured cells with labelled nuclei.
  • methods: The authors evaluate state-of-the-art inpainting methods for image restoration, including DeepFill V2 and Edge Connect, and fine-tune these models with relatively little data to faithfully restore microscopy images.
  • results: The authors demonstrate that the area of the region to be restored is more important than shape, and propose a novel phenotype-preserving metric design strategy that quantifies the size and count of restored biological phenotypes like cell nuclei to penalize undesirable manipulation.
    Abstract In the past decades, automated high-content microscopy demonstrated its ability to deliver large quantities of image-based data powering the versatility of phenotypic drug screening and systems biology applications. However, as the sizes of image-based datasets grew, it became infeasible for humans to control, avoid and overcome the presence of imaging and sample preparation artefacts in the images. While novel techniques like machine learning and deep learning may address these shortcomings through generative image inpainting, when applied to sensitive research data this may come at the cost of undesired image manipulation. Undesired manipulation may be caused by phenomena such as neural hallucinations, to which some artificial neural networks are prone. To address this, here we evaluate the state-of-the-art inpainting methods for image restoration in a high-content fluorescence microscopy dataset of cultured cells with labelled nuclei. We show that architectures like DeepFill V2 and Edge Connect can faithfully restore microscopy images upon fine-tuning with relatively little data. Our results demonstrate that the area of the region to be restored is of higher importance than shape. Furthermore, to control for the quality of restoration, we propose a novel phenotype-preserving metric design strategy. In this strategy, the size and count of the restored biological phenotypes like cell nuclei are quantified to penalise undesirable manipulation. We argue that the design principles of our approach may also generalise to other applications.
    摘要 在过去几十年中,自动化高内涵显微成像展示了其提供海量图像数据的能力,支撑了表型药物筛选和系统生物学应用的多样性。然而,随着图像数据集规模的增长,人工控制、规避和克服图像中的成像与样本制备伪影已不再可行。机器学习和深度学习等新技术可以通过生成式图像修复(inpainting)来弥补这些缺陷,但将其应用于敏感的科研数据时,可能付出不期望的图像篡改的代价,例如某些人工神经网络易产生的神经幻觉。为此,我们在带有细胞核标记的培养细胞高内涵荧光显微数据集上评估了最先进的图像修复方法。我们发现,DeepFill V2 和 Edge Connect 等架构在用相对少量数据微调后即可忠实地修复显微图像;结果还表明,待修复区域的面积比形状更为重要。此外,为了控制修复质量,我们提出了一种新颖的表型保持度量设计策略,通过量化修复后细胞核等生物表型的大小和数量来惩罚不期望的篡改。我们认为该方法的设计原则也可推广到其他应用。
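The phenotype-preserving metric idea lends itself to a short sketch: segment nuclei in the original and restored images, then penalize differences in object count and mean size. The threshold and penalty weights below are illustrative assumptions, not the authors' exact design.

```python
# Sketch of a phenotype-preserving penalty for inpainted microscopy images.
import numpy as np
from skimage.measure import label, regionprops

def nucleus_stats(image, thresh=0.5):
    """Count connected bright objects and report their mean area."""
    labeled = label(image > thresh)
    areas = [r.area for r in regionprops(labeled)]
    return len(areas), (np.mean(areas) if areas else 0.0)

def phenotype_penalty(original, restored, w_count=1.0, w_area=0.01):
    n0, a0 = nucleus_stats(original)
    n1, a1 = nucleus_stats(restored)
    # penalize hallucinated/deleted nuclei and systematic size changes
    return w_count * abs(n1 - n0) + w_area * abs(a1 - a0)

rng = np.random.default_rng(0)
orig = np.zeros((128, 128))
orig[10:20, 10:20] = 1.0
orig[60:72, 60:72] = 1.0
good = orig + rng.normal(0, 0.01, orig.shape)   # faithful restoration
bad = good.copy()
bad[100:110, 100:110] = 1.0                     # hallucinated extra nucleus
print(phenotype_penalty(orig, good), phenotype_penalty(orig, bad))
```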

Optimization of Image Acquisition for Earth Observation Satellites via Quantum Computing

  • paper_url: http://arxiv.org/abs/2307.14419
  • repo_url: None
  • paper_authors: Antón Makarov, Márcio M. Taddei, Eneko Osaba, Giacomo Franceschetto, Esther Villar-Rodriguez, Izaskun Oregi
  • for: 这项研究旨在在一系列约束条件下,为卫星在给定轨道过境期间寻找最优的图像采集子集(即采集调度)。
  • methods: 本研究提出了该问题的两种 QUBO 形式化,并采用不同的辅助变量(ancilla)处理方法来应对非平凡的约束。
  • results: 实验结果显示,选择合适的形式化与辅助变量处理方法是成功求解该问题的关键。此外,本研究还给出了当前量子计算机可实际求解的问题规模上限的实用指引。
    Abstract Satellite image acquisition scheduling is a problem that is omnipresent in the earth observation field; its goal is to find the optimal subset of images to be taken during a given orbit pass under a set of constraints. This problem, which can be modeled via combinatorial optimization, has been dealt with many times by the artificial intelligence and operations research communities. However, despite its inherent interest, it has been scarcely studied through the quantum computing paradigm. Taking this situation as motivation, we present in this paper two QUBO formulations for the problem, using different approaches to handle the non-trivial constraints. We compare the formulations experimentally over 20 problem instances using three quantum annealers currently available from D-Wave, as well as one of its hybrid solvers. Fourteen of the tested instances have been obtained from the well-known SPOT5 benchmark, while the remaining six have been generated ad-hoc for this study. Our results show that the formulation and the ancilla handling technique is crucial to solve the problem successfully. Finally, we also provide practical guidelines on the size limits of problem instances that can be realistically solved on current quantum computers.
    摘要 卫星图像采集调度是地球观测领域中普遍存在的问题,其目标是在一系列约束下,找到给定轨道过境期间应采集的最优图像子集。该问题可以通过组合优化来建模,已被人工智能和运筹学领域多次研究。然而,尽管其本身很有价值,它在量子计算范式下却鲜有研究。以此为动机,我们在本文中提出了该问题的两种 QUBO 形式化,采用不同的方法来处理非平凡的约束。我们使用 D-Wave 目前提供的三台量子退火机及其一种混合求解器,在 20 个问题实例上对这两种形式化进行了实验比较;其中 14 个实例来自著名的 SPOT5 基准测试集,其余 6 个为本研究专门生成。结果表明,形式化方式与辅助变量(ancilla)处理技术是成功求解该问题的关键。最后,我们还给出了当前量子计算机可实际求解的问题实例规模的实用指引。
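A toy QUBO makes the formulation above concrete: one binary variable per candidate image, rewards on the diagonal, and a penalty on conflicting pairs. The instance values and penalty weight are illustrative; a real solver (e.g. a D-Wave annealer) would replace the brute-force search.

```python
# Toy QUBO for image-acquisition scheduling (illustrative instance).
import itertools
import numpy as np

values = np.array([5.0, 3.0, 4.0, 2.0])  # reward for acquiring each image
conflicts = [(0, 1), (2, 3)]             # pairs that cannot both be acquired
P = 10.0                                 # penalty weight >> max reward

n = len(values)
Q = np.zeros((n, n))
Q[np.diag_indices(n)] = -values          # minimize -reward on the diagonal
for i, j in conflicts:
    Q[i, j] += P                         # x_i * x_j penalized when both are 1

def qubo_energy(x, Q):
    return x @ Q @ x

# brute force over all 2^n assignments (fine for a toy instance; an annealer
# samples low-energy states instead of enumerating)
best = min((tuple(bits) for bits in itertools.product([0, 1], repeat=n)),
           key=lambda b: qubo_energy(np.array(b), Q))
print("best schedule:", best)            # expect (1, 0, 1, 0)
```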

cs.SD - 2023-07-26

Say Goodbye to RNN-T Loss: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.14132
  • repo_url: None
  • paper_authors: Tian-Hao Zhang, Dinghao Zhou, Guiping Zhong, Baoxiang Li
  • for: 这篇论文是为了提出一种新的语音识别模型,即CIF-Transducer(CIF-T),以实现高效的Alignment。
  • methods: 这篇论文使用了Continuous Integrate-and-Fire(CIF)机制和RNN-T模型,并废除了RNN-T损失,从而实现计算减少和预测网络更加重要的角色。另外,文章还提出了Funnel-CIF、Context Blocks、Unified Gating和Bilinear Pooling共同网络以及辅助训练策略等技术来进一步提高性能。
  • results: 实验结果表明,CIF-T在178小时AISHELL-1和10000小时WenetSpeech datasets上达到了当前最佳性能,同时具有较低的计算开销。
    Abstract RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, thus bringing a computational reduction and allowing the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, Unified Gating and Bilinear Pooling joint network, and auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.
    摘要 RNN-T 模型在语音识别中应用广泛,它依靠 RNN-T 损失来实现输入音频与目标序列之间的长度对齐。然而,RNN-T 损失的实现复杂性及其基于对齐的优化目标,分别导致了计算冗余和预测网络作用的弱化。在本文中,我们提出一种名为 CIF-Transducer(CIF-T)的新模型,将连续积分发放(Continuous Integrate-and-Fire,CIF)机制与 RNN-T 模型相结合,以实现高效的对齐。这样便可以舍弃 RNN-T 损失,从而减少计算量,并让预测网络发挥更重要的作用。我们还引入了 Funnel-CIF、Context Blocks、Unified Gating 与 Bilinear Pooling 联合网络以及辅助训练策略,以进一步提升性能。在 178 小时的 AISHELL-1 和 10000 小时的 WenetSpeech 数据集上的实验表明,CIF-T 以更低的计算开销取得了最先进的结果。
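The Continuous Integrate-and-Fire mechanism that CIF-T builds on can be sketched in a few lines: per-frame weights are accumulated left to right, and a token boundary "fires" whenever the integral crosses a threshold, splitting the crossing frame's weight between adjacent tokens. This is a generic CIF illustration, not the paper's implementation.

```python
# Minimal Continuous Integrate-and-Fire (CIF) sketch for frame-to-token alignment.
import numpy as np

def cif(frames, weights, threshold=1.0):
    """frames: (T, D) encoder outputs; weights: (T,) nonnegative firing weights."""
    fired, acc = [], 0.0
    integrated = np.zeros(frames.shape[1])
    for h_t, a_t in zip(frames, weights):
        if acc + a_t < threshold:        # keep integrating the current token
            acc += a_t
            integrated += a_t * h_t
        else:                            # fire: split this frame's weight at the boundary
            used = threshold - acc
            fired.append(integrated + used * h_t)
            acc = a_t - used             # leftover weight starts the next token
            integrated = acc * h_t
    return np.stack(fired) if fired else np.zeros((0, frames.shape[1]))

T, D = 8, 4
frames = np.random.randn(T, D)
weights = np.array([0.3, 0.4, 0.5, 0.2, 0.6, 0.3, 0.4, 0.5])  # total weight 3.2
tokens = cif(frames, weights)
print(tokens.shape)   # 3 fired tokens for a total weight of 3.2
```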

  • paper_url: http://arxiv.org/abs/2307.13953
  • repo_url: None
  • paper_authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj
  • for: 这项研究揭示了音位与面部特征之间的潜在联系。传统的声音-面部相关性研究通常使用较长的语音输入,包括从声音生成人脸图像和从声音重建 3D 人脸网格。但在基于语音的犯罪调查等实际应用中,可用的语音证据可能很短且有限。此外,从生理学角度看,语音的每个片段——音位——对应着不同类型的气流和面部运动。因此,发掘音位与面部属性之间的隐藏联系,有助于推进语音-人脸多模态学习。
  • methods: 我们提出了一条细粒度的分析流程来探索声音与面部的关系:为每个“音位-人体测量指标(AM)”对建立估计器,并通过假设检验评估其相关性。
  • results: 我们的结果表明,与辅音(尤其是爆破音)相比,元音更容易预测 AM;并且若某个 AM 在音位发音过程中运动幅度更大,则它更容易被预测。这些发现支持生理学中关于相关性的结论,并为未来的语音-人脸多模态学习研究奠定了基础。
    Abstract This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes v.s. facial anthropometric measurements (AM). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels compared to consonants, particularly with plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.
    摘要 这项工作揭示了音位与面部特征之间的神秘联系。传统的声音-面部相关性研究通常使用较长的语音输入,包括从声音生成人脸图像和从声音重建 3D 人脸网格。然而,在基于语音的犯罪调查等场景中,可用的语音证据可能很短且有限。此外,从生理学角度看,语音的每个片段——音位——对应着面部不同类型的气流和运动,因此发掘音位与面部属性之间的隐藏联系是有益的。在本文中,我们提出了一条分析流程,以细粒度的方式(即音位相对于面部人体测量指标 AM)探索声音与面部的关系。我们为每个音位-AM 对建立估计器,并通过假设检验评估相关性。结果表明,与辅音(尤其是爆破音)相比,元音更容易预测 AM;并且若某个 AM 在音位发音过程中运动幅度更大,则它更容易被预测。我们的发现支持了生理学中关于相关性的结论,并为未来的语音-人脸多模态学习研究奠定了基础。
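The per-pair analysis pipeline can be sketched as follows: fit a simple estimator of one anthropometric measurement (AM) from one phoneme's acoustic features, then test the correlation between cross-validated predictions and ground truth. The synthetic data, Ridge estimator, and feature dimensions are stand-in assumptions.

```python
# Sketch of one phoneme-AM correlation test in the spirit of the paper's pipeline.
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_speakers, feat_dim = 200, 20

# synthetic stand-ins: per-speaker acoustic features of one phoneme (e.g. /a/)
# and one facial AM (e.g. nose width); real data would come from aligned speech
phoneme_feats = rng.normal(size=(n_speakers, feat_dim))
am = phoneme_feats[:, 0] * 0.8 + rng.normal(scale=0.6, size=n_speakers)

# cross-validated predictions avoid optimistic in-sample correlations
pred = cross_val_predict(Ridge(alpha=1.0), phoneme_feats, am, cv=5)

r, p_value = stats.pearsonr(pred, am)
print(f"r = {r:.3f}, p = {p_value:.2e}")
# reject H0 (no correlation) at alpha = 0.05 -> this phoneme predicts this AM
print("significant" if p_value < 0.05 else "not significant")
```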

Rethinking Voice-Face Correlation: A Geometry View

  • paper_url: http://arxiv.org/abs/2307.13948
  • repo_url: https://github.com/lxa9867/VAF
  • paper_authors: Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj
  • for: investigate the capability of reconstructing 3D facial shape from voice from a geometry perspective without any semantic information
  • methods: propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction
  • results: significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium
    Abstract Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.
    摘要 以往关于声音-人脸匹配和声音引导人脸合成的研究表明声音与人脸之间存在很强的相关性,但这些研究主要依赖性别、年龄和情绪等粗粒度语义线索。在本文中,我们旨在从几何角度、在不使用任何语义信息的情况下,研究从声音重建 3D 面部形状的能力。我们提出了“声音-人体测量指标(AM)-人脸”范式:先从声音中识别可预测的面部 AM,再用这些 AM 引导 3D 人脸重建。通过将 AM 作为连接声音与人脸几何的中介,我们可以消除不可预测 AM 的影响,使人脸几何变得可处理。我们在自建的数据集(包含真实 3D 人脸扫描及对应的语音录音)上评估了该方法,发现声音与鼻腔、颅骨等特定面部几何部位之间存在显著相关性。我们的工作为声音-人脸相关性提供了新视角,并可作为人体测量学的一项有价值的实证研究。

Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching

  • paper_url: http://arxiv.org/abs/2307.13941
  • repo_url: None
  • paper_authors: Keisuke Kimura, Shoichi Koyama, Hiroshi Saruwatari
  • for: 提出一种提升感知质量的声场合成方法
  • methods: 使用多个扬声器进行声场合成,并将压力匹配与幅度匹配相结合,以减少高频合成误差(空间混叠伪影)
  • results: 数值模拟与实际系统的听音测试表明,所提方法相比传统的压力匹配方法,能够提升合成声场的感知质量
    Abstract A sound field synthesis method enhancing perceptual quality is proposed. Sound field synthesis using multiple loudspeakers enables spatial audio reproduction with a broad listening area; however, synthesis errors at high frequencies called spatial aliasing artifacts are unavoidable. To minimize these artifacts, we propose a method based on the combination of pressure and amplitude matching. On the basis of the human's auditory properties, synthesizing the amplitude distribution will be sufficient for horizontal sound localization. Furthermore, a flat amplitude response should be synthesized as much as possible to avoid coloration. Therefore, we apply amplitude matching, which is a method to synthesize the desired amplitude distribution with arbitrary phase distribution, for high frequencies and conventional pressure matching for low frequencies. Experimental results of numerical simulations and listening tests using a practical system indicated that the perceptual quality of the sound field synthesized by the proposed method was improved from that synthesized by pressure matching.
    摘要 本文提出一种提升感知质量的声场合成方法。使用多个扬声器进行声场合成可以在较大的听音区域内重现空间音频;然而,高频段的合成误差(即空间混叠伪影)不可避免。为了最小化这些伪影,我们提出一种结合压力匹配与幅度匹配的方法。基于人类听觉特性,对水平方向的声像定位而言,合成幅度分布已经足够;此外,为避免音色失真,应尽可能合成平坦的幅度响应。因此,我们在高频段采用幅度匹配(即在相位分布任意的情况下合成期望的幅度分布),在低频段采用传统的压力匹配。数值模拟与基于实际系统的听音测试结果表明,所提方法合成的声场在感知质量上优于压力匹配方法。
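The two criteria can be contrasted in code. Pressure matching is a linear least-squares problem over the loudspeaker driving signals, while amplitude matching fits only the magnitudes and is non-convex; the crude gradient loop below is only one way to attack it. Geometry, regularization, and step sizes are illustrative assumptions.

```python
# Pressure matching vs. amplitude matching on a random toy transfer matrix.
import numpy as np

rng = np.random.default_rng(1)
n_points, n_speakers = 32, 8
G = rng.normal(size=(n_points, n_speakers)) + 1j * rng.normal(size=(n_points, n_speakers))
p_des = rng.normal(size=n_points) + 1j * rng.normal(size=n_points)

# --- pressure matching: closed-form regularized least squares min ||G d - p_des||^2 ---
lam = 1e-2
d_pm = np.linalg.solve(G.conj().T @ G + lam * np.eye(n_speakers), G.conj().T @ p_des)

# --- amplitude matching: minimize || |G d| - |p_des| ||^2 by gradient descent ---
d_am = d_pm.copy()                      # warm start from the pressure solution
for _ in range(500):
    p = G @ d_am
    mag = np.abs(p) + 1e-12
    grad = G.conj().T @ ((mag - np.abs(p_des)) * p / mag)  # Wirtinger gradient
    d_am -= 0.01 * grad

print("pressure residual (PM): ", np.linalg.norm(G @ d_pm - p_des))
print("amplitude residual (PM):", np.linalg.norm(np.abs(G @ d_pm) - np.abs(p_des)))
print("amplitude residual (AM):", np.linalg.norm(np.abs(G @ d_am) - np.abs(p_des)))
```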

Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation

  • paper_url: http://arxiv.org/abs/2307.13888
  • repo_url: None
  • paper_authors: Chang Han, Xinmeng Xu, Weiping Tu, Yuhong Yang, Yajie Liu
  • for: 这篇论文旨在提升声学回声消除(AEC)的性能,使其在尽量不损伤近端语音的前提下去除干扰信号。
  • methods: 论文采用一种新的编码器-解码器架构,其中包含一个协作模块(CM),以可学习的方式建立目标正信息与负信息之间的关联。
  • results: 实验结果表明,所提出的 CMNet 能够更好地区分近端语音与干扰信号的模式,在 AEC 任务上取得了优于近期方法的性能。
    Abstract Acoustic echo cancellation (AEC) aims to remove interference signals while leaving near-end speech least distorted. Because the patterns of near-end speech and interference signals are hard to distinguish, near-end speech cannot be separated completely, causing speech distortion and residual interference signals. We observe that besides target positive information, e.g., ground-truth speech and features, the target negative information, such as interference signals and features, helps make the patterns of target speech and interference signals more discriminative. Therefore, we present a novel AEC encoder-decoder architecture guided by negative information, termed CMNet. A collaboration module (CM) is designed to establish the correlation between the target positive and negative information in a learnable manner via three blocks: a target positive block, a target negative block, and an interactive block. Experimental results demonstrate that our CMNet achieves superior performance over recent methods.
    摘要 声学回声消除(AEC)的目标是在尽量不损伤近端语音的前提下去除干扰信号。由于近端语音与干扰信号的模式难以区分,近端语音无法被完全分离,从而导致语音失真和干扰信号残留。我们发现,除了目标正信息(如真实语音及其特征)之外,目标负信息(如干扰信号及其特征)也有助于使目标语音与干扰信号的模式更具区分性。因此,我们提出了一种以负信息为引导的新型 AEC 编码器-解码器架构,称为 CMNet。其中的协作模块(CM)通过目标正信息块、目标负信息块和交互块三个模块,以可学习的方式建立目标正负信息之间的关联。实验结果表明,我们的 CMNet 取得了优于近期方法的性能。

eess.AS - 2023-07-26

Sound Field Estimation around a Rigid Sphere with Physics-informed Neural Network

  • paper_url: http://arxiv.org/abs/2307.14013
  • repo_url: None
  • paper_authors: Xingyu Chen, Fei Ma, Amy Bastine, Prasanga Samarasinghe, Huiyuan Sun
  • for: 准确估计刚性球体周围的声场需要在球面上进行足够的采样,但这并不总是可行。本文提出一种基于物理信息神经网络的声场估计方法,将物理知识融入网络架构和训练过程。与其他基于学习的方法不同,所提方法能生成符合物理规律的估计结果,且无需大量数据。
  • methods: 物理信息神经网络(physics-informed neural network)
  • results: 与球谐函数方法和平面波分解方法相比,所提方法拟合能力更强,并避免了截断带来的病态问题。仿真结果表明,该方法能够从有限的测量数据中获得更精确的声场估计,优于球谐函数方法和平面波分解方法。
    Abstract Accurate estimation of the sound field around a rigid sphere necessitates adequate sampling on the sphere, which may not always be possible. To overcome this challenge, this paper proposes a method for sound field estimation based on a physics-informed neural network. This approach integrates physical knowledge into the architecture and training process of the network. In contrast to other learning-based methods, the proposed method incorporates additional constraints derived from the Helmholtz equation and the zero radial velocity condition on the rigid sphere. Consequently, it can generate physically feasible estimations without requiring a large dataset. In contrast to the spherical harmonic-based method, the proposed approach has better fitting abilities and circumvents the ill condition caused by truncation. Simulation results demonstrate the effectiveness of the proposed method in achieving accurate sound field estimations from limited measurements, outperforming the spherical harmonic method and plane-wave decomposition method.
    摘要 准确估计刚性球体周围的声场需要在球面上进行充分采样,但这并不总是可行。为应对这一挑战,本文提出一种基于物理信息神经网络的声场估计方法,将物理知识融入网络架构和训练过程。与其他基于学习的方法不同,该方法引入了由亥姆霍兹方程和刚性球面零径向速度条件推导出的附加约束,因此无需大量数据即可生成符合物理规律的估计结果。与基于球谐函数的方法相比,所提方法拟合能力更强,并避免了截断导致的病态问题。仿真结果表明,该方法能够从有限的测量数据中获得准确的声场估计,优于球谐函数方法和平面波分解方法。
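A minimal 2D sketch of the physics-informed loss: the network maps coordinates to sound pressure and is trained with a Helmholtz-equation residual (∇²p + k²p = 0) plus a data-fit term at measured points. The rigid-sphere boundary condition and 3D geometry are omitted here; the architecture and loss weights are illustrative assumptions.

```python
# Sketch of a physics-informed loss with a Helmholtz residual (2D toy version).
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
k = 2.0 * torch.pi * 500.0 / 343.0   # wavenumber at 500 Hz, c = 343 m/s

def helmholtz_residual(xy):
    xy = xy.requires_grad_(True)
    p = net(xy)
    grad = torch.autograd.grad(p.sum(), xy, create_graph=True)[0]
    lap = 0.0
    for i in range(2):               # Laplacian via second derivatives
        lap = lap + torch.autograd.grad(grad[:, i].sum(), xy, create_graph=True)[0][:, i]
    return lap + (k ** 2) * p.squeeze(-1)

mic_xy = torch.rand(16, 2)           # measurement positions (stand-ins)
mic_p = torch.sin(k * mic_xy[:, 0])  # synthetic "measured" pressure

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    coll = torch.rand(256, 2)        # collocation points for the PDE residual
    loss_pde = helmholtz_residual(coll).pow(2).mean()
    loss_data = (net(mic_xy).squeeze(-1) - mic_p).pow(2).mean()
    loss = loss_data + 0.1 * loss_pde
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```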

Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods

  • paper_url: http://arxiv.org/abs/2308.00129
  • repo_url: None
  • paper_authors: Qingming Tang
  • for: 本论文主要针对sequence数据上的时间或空间学习 representation learning,以提高下游序列预测任务的性能。
  • methods: 本论文使用supervised learning和多种不同的学习方法,包括auxiliary loss学习、无监督学习、半监督学习和多视图学习。
  • results: 本论文通过多种学习设置和方法,对speech数据进行了广泛的研究,并获得了一些有价值的结果。这些结果可以应用于其他领域中。
    Abstract This thesis focuses on representation learning for sequence data over time or space, aiming to improve downstream sequence prediction tasks by using the learned representations. Supervised learning has been the most dominant approach for training deep neural networks for learning good sequential representations. However, one limiting factor to scale supervised learning is the lack of enough annotated data. Motivated by this challenge, it is natural to explore representation learning methods that can utilize large amounts of unlabeled and weakly labeled data, as well as an additional data modality. I describe my broad study of representation learning for speech data. Unlike most other works that focus on a single learning setting, this thesis studies multiple settings: supervised learning with auxiliary losses, unsupervised learning, semi-supervised learning, and multi-view learning. Besides different learning problems, I also explore multiple approaches for representation learning. Though I focus on speech data, the methods described in this thesis can also be applied to other domains. Overall, the field of representation learning is developing rapidly. State-of-the-art results on speech related tasks are typically based on Transformers pre-trained with large-scale self-supervised learning, which aims to learn generic representations that can benefit multiple downstream tasks. Since 2020, large-scale pre-training has been the de facto choice to achieve good performance. This delayed thesis does not attempt to summarize and compare with the latest results on speech representation learning; instead, it presents a unique study on speech representation learning before the Transformer era, that covers multiple learning settings. Some of the findings in this thesis can still be useful today.
    摘要 本论文关注时间或空间上的序列数据表示学习,旨在利用所学表示提升下游序列预测任务的性能。监督学习一直是训练深度神经网络以学习良好序列表示的最主要方法,然而,缺乏足够的标注数据限制了监督学习的规模化。受此挑战的启发,很自然地会去探索能够利用大量无标注和弱标注数据以及额外数据模态的表示学习方法。我描述了自己对语音数据表示学习的广泛研究。与大多数只关注单一学习设定的工作不同,本论文研究了多种设定:带辅助损失的监督学习、无监督学习、半监督学习和多视角学习。除了不同的学习问题之外,我还探索了多种表示学习方法。尽管我专注于语音数据,本论文所述方法同样可以应用于其他领域。总体而言,表示学习领域发展迅速。语音相关任务的最先进结果通常基于经过大规模自监督学习预训练的 Transformer,其目标是学习能惠及多个下游任务的通用表示。自 2020 年以来,大规模预训练已成为取得良好性能的事实标准。这篇延迟完成的论文并不试图总结并比较语音表示学习的最新结果,而是呈现了 Transformer 时代之前一项覆盖多种学习设定的独特研究。论文中的一些发现至今仍有参考价值。

cs.CV - 2023-07-26

Artifact Restoration in Histology Images with Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2307.14262
  • repo_url: https://github.com/zhenqi-he/artifusion
  • paper_authors: Zhenqi He, Junjun He, Jin Ye, Yiqing Shen
  • for: histological whole slide images (WSIs) restoration
  • methods: denoising diffusion probabilistic model (ArtiFusion) with a novel Swin-Transformer denoising architecture and time token scheme
  • results: effective restoration of artifact-free regions with preserved tissue structures and stain style, as demonstrated through extensive evaluations.
    Abstract Histological whole slide images (WSIs) can often be compromised by artifacts, such as tissue folding and bubbles, which increase the examination difficulty for both pathologists and Computer-Aided Diagnosis (CAD) systems. Existing approaches to restoring artifact images are confined to Generative Adversarial Networks (GANs), where the restoration process is formulated as an image-to-image transfer. Those methods are prone to suffer from mode collapse and unexpected mistransfer in the stain style, leading to unsatisfactory and unrealistic restored images. Innovatively, we make the first attempt at a denoising diffusion probabilistic model for histological artifact restoration, namely ArtiFusion. Specifically, ArtiFusion formulates the artifact region restoration as a gradual denoising process, and its training relies solely on artifact-free images to reduce the training complexity. Furthermore, to capture local-global correlations in the regional artifact restoration, a novel Swin-Transformer denoising architecture is designed, along with a time token scheme. Our extensive evaluations demonstrate the effectiveness of ArtiFusion as a pre-processing method for histology analysis, which can successfully preserve the tissue structures and stain style in artifact-free regions during the restoration. Code is available at https://github.com/zhenqi-he/ArtiFusion.
    摘要 组织病理全切片图像(WSIs)经常受到组织折叠、气泡等伪影的影响,这会增加病理医生和计算机辅助诊断(CAD)系统的检查难度。现有的伪影图像修复方法局限于生成对抗网络(GANs),将修复过程表述为图像到图像的转换。这类方法容易出现模式坍塌和染色风格的意外错误迁移,导致修复图像不尽如人意且不真实。我们创新性地首次尝试将去噪扩散概率模型用于组织病理伪影修复,即 ArtiFusion。具体而言,ArtiFusion 将伪影区域修复表述为一个渐进的去噪过程,其训练仅依赖无伪影图像,以降低训练复杂度。此外,为了在区域伪影修复中捕获局部-全局关联,我们设计了一种新颖的 Swin-Transformer 去噪架构以及时间 token 方案。大量评估表明,ArtiFusion 作为组织病理分析的预处理方法十分有效,能够在修复过程中成功保留无伪影区域的组织结构和染色风格。代码见 https://github.com/zhenqi-he/ArtiFusion。

Sparse Double Descent in Vision Transformers: real or phantom threat?

  • paper_url: http://arxiv.org/abs/2307.14253
  • repo_url: https://github.com/vgcq/sdd_vit
  • paper_authors: Victor Quétu, Marta Milovanovic, Enzo Tartaglione
  • for: 这篇论文旨在研究视觉 Transformer(ViT)是否会受到“稀疏双下降”(sparse double descent)现象的影响,并寻找避免该现象的方法。
  • methods: 该论文针对基于注意力机制的 ViT,研究经过适当调节的 $\ell_2$ 正则化对稀疏双下降现象的作用。
  • results: 研究发现,对 ViT 而言,优化调节 $\ell_2$ 正则化强度可以缓解稀疏双下降现象,但代价是牺牲 ViT 潜在的压缩能力。
    Abstract Vision transformers (ViT) have been of broad interest in recent theoretical and empirical works. They are state-of-the-art thanks to their attention-based approach, which boosts the identification of key features and patterns within images while avoiding inductive bias, resulting in highly accurate image analysis. Meanwhile, recent studies have reported a ``sparse double descent'' phenomenon that can occur in modern deep-learning models, where extremely over-parametrized models can generalize well. This raises practical questions about the optimal size of the model, launching the quest for the best trade-off between sparsity and performance: are Vision Transformers also prone to sparse double descent? Can we find a way to avoid such a phenomenon? Our work tackles the occurrence of sparse double descent on ViTs. Despite some works showing that traditional architectures, like ResNet, are condemned to the sparse double descent phenomenon, for ViTs we observe that an optimally-tuned $\ell_2$ regularization relieves such a phenomenon. However, everything comes at a cost: the optimal lambda will sacrifice the potential compression of the ViT.
    摘要 视觉 Transformer(ViT)在近期的理论与实证研究中受到广泛关注。得益于基于注意力的方法,它能够在避免归纳偏置的同时更好地识别图像中的关键特征与模式,从而实现高精度的图像分析,成为当前最先进的模型。与此同时,最新研究报告了现代深度学习模型中可能出现的“稀疏双下降”现象,即极度过参数化的模型仍能具有良好的泛化能力。这引出了关于模型最优规模的实际问题,并开启了在稀疏性与性能之间寻找最佳权衡的探索:视觉 Transformer 是否也会出现稀疏双下降?我们能否找到避免该现象的方法?我们的工作研究了 ViT 上稀疏双下降现象的发生。尽管一些研究表明 ResNet 等传统架构难逃稀疏双下降现象,但对 ViT 而言,我们观察到经过优化调节的 $\ell_2$ 正则化可以缓解该现象。然而,一切都有代价:最优的 lambda 会牺牲 ViT 潜在的压缩能力。

Fluorescent Neuronal Cells v2: Multi-Task, Multi-Format Annotations for Deep Learning in Microscopy

  • paper_url: http://arxiv.org/abs/2307.14243
  • repo_url: None
  • paper_authors: Luca Clissa, Antonio Macaluso, Roberto Morelli, Alessandra Occhinegro, Emiliana Piscitiello, Ludovico Taddei, Marco Luppi, Roberto Amici, Matteo Cerri, Timna Hitrec, Lorenzo Rinaldi, Antonio Zoccoli
  • for: 本研究旨在推动生命科学领域计算机视觉技术的发展,提供带多任务标注的数据集,支持语义分割、目标检测和计数等学习任务。
  • methods: 数据集使用多种标记物对啮齿动物神经元细胞的细胞核和细胞质进行染色,以突出其解剖或功能特征,便于研究计算机视觉技术。
  • results: 本研究提供了一个多样化的数据集,包含啮齿动物神经元细胞核与细胞质的多种荧光标注,可推动计算机视觉技术的发展,并支持多种生命科学研究。
    Abstract Fluorescent Neuronal Cells v2 is a collection of fluorescence microscopy images and the corresponding ground-truth annotations, designed to foster innovative research in the domains of Life Sciences and Deep Learning. This dataset encompasses three image collections in which rodent neuronal cells' nuclei and cytoplasm are stained with diverse markers to highlight their anatomical or functional characteristics. Alongside the images, we provide ground-truth annotations for several learning tasks, including semantic segmentation, object detection, and counting. The contribution is two-fold. First, given the variety of annotations and their accessible formats, we envision our work facilitating methodological advancements in computer vision approaches for segmentation, detection, feature learning, unsupervised and self-supervised learning, transfer learning, and related areas. Second, by enabling extensive exploration and benchmarking, we hope Fluorescent Neuronal Cells v2 will catalyze breakthroughs in fluorescence microscopy analysis and promote cutting-edge discoveries in life sciences. The data are available at: https://amsacta.unibo.it/id/eprint/7347
    摘要 Fluorescent Neuronal Cells v2 是一个包含荧光显微图像及对应真值标注的数据集,旨在推动生命科学与深度学习领域的创新研究。该数据集包括三个图像集合,其中啮齿动物神经元细胞的细胞核和细胞质被不同的标记物染色,以突出其解剖或功能特征。除图像外,我们还提供了适用于多种学习任务的真值标注,包括语义分割、目标检测和计数。我们的贡献有两方面:其一,鉴于标注的多样性及其易用的格式,我们期望这项工作能促进计算机视觉方法在分割、检测、特征学习、无监督与自监督学习、迁移学习等相关领域的进步;其二,通过支持广泛的探索与基准测试,我们希望 Fluorescent Neuronal Cells v2 能够推动荧光显微分析的突破,并促进生命科学的前沿发现。数据可在以下链接获取:https://amsacta.unibo.it/id/eprint/7347。

Defending Adversarial Patches via Joint Region Localizing and Inpainting

  • paper_url: http://arxiv.org/abs/2307.14242
  • repo_url: None
  • paper_authors: Junwen Chen, Xingxing Wei
  • for: 防御图像分类与检测任务中的对抗贴片(adversarial patch)攻击;此类贴片会造成目标对象外观或上下文的不一致,亟需有效的防御手段。
  • methods: 提出一种基于“定位-修复”(localizing and inpainting)机制的预处理防御方法:其中“定位”子网络采用双分支结构以精确检测图像中的对抗贴片区域,“修复”子网络利用周围上下文信息恢复被贴片覆盖的原始内容,两者通过迭代优化方式联合训练、紧密交互。
  • results: 在交通标志分类与检测任务上的一系列实验表明,该方法能够有效防御多种对抗贴片攻击。
    Abstract Deep neural networks are successfully used in various applications, but show their vulnerability to adversarial examples. With the development of adversarial patches, the feasibility of attacks in physical scenes increases, and the defenses against patch attacks are urgently needed. However, defending such adversarial patch attacks is still an unsolved problem. In this paper, we analyse the properties of adversarial patches, and find that: on the one hand, adversarial patches will lead to the appearance or contextual inconsistency in the target objects; on the other hand, the patch region will show abnormal changes on the high-level feature maps of the objects extracted by a backbone network. Considering the above two points, we propose a novel defense method based on a ``localizing and inpainting" mechanism to pre-process the input examples. Specifically, we design an unified framework, where the ``localizing" sub-network utilizes a two-branch structure to represent the above two aspects to accurately detect the adversarial patch region in the image. For the ``inpainting" sub-network, it utilizes the surrounding contextual cues to recover the original content covered by the adversarial patch. The quality of inpainted images is also evaluated by measuring the appearance consistency and the effects of adversarial attacks. These two sub-networks are then jointly trained via an iterative optimization manner. In this way, the ``localizing" and ``inpainting" modules can interact closely with each other, and thus learn a better solution. A series of experiments versus traffic sign classification and detection tasks are conducted to defend against various adversarial patch attacks.
    摘要 深度神经网络在不同应用中得到了成功,但它们受到了针对性攻击的漏洞。随着物理场景中的攻击可能性的提高,防御针对贴图攻击的需求也日益增加。然而,防御针对贴图攻击仍然是一个未解决的问题。在这篇论文中,我们分析了针对贴图攻击的特性,并发现:一方面,贴图会导致目标对象的外观或上下文不一致;另一方面,贴图区域在对象提取后的高级特征图中会出现异常变化。基于以上两点,我们提出了一种基于“局部化和填充”机制的防御方法。具体来说,我们设计了一个统一框架,其中“局部化”子网络采用两棵树结构来准确检测贴图区域在图像中。为“填充”子网络,它利用周围的上下文征化来恢复贴图覆盖的原始内容。我们对填充图像的质量也进行了评估,包括外观一致性和针对攻击的影响。这两个子网络然后通过迭代优化方式进行联合培训,以便“局部化”和“填充”模块可以更好地互动。通过这种方式,我们可以更好地防御针对贴图攻击。我们在交通标识和检测任务上进行了多个实验,以防御不同类型的贴图攻击。

DisguisOR: Holistic Face Anonymization for the Operating Room

  • paper_url: http://arxiv.org/abs/2307.14241
  • repo_url: https://github.com/wngtn/disguisor
  • paper_authors: Lennart Bastian, Tony Danjun Wang, Tobias Czempiel, Benjamin Busam, Nassir Navab
  • for: 这项研究旨在提升医疗数据科学(SDS)中的隐私保护,特别是手术室(OR)内录制视频中的人脸匿名化。
  • methods: 本研究将多路相机的 RGB 图像与深度图像融合为场景的 3D 点云表示;随后将参数化人体网格模型回归到检测出的 3D 人体关键点上,并将人脸网格与融合后的 3D 点云对齐,从而在 3D 空间中检测每个人的面部;最后将该网格模型渲染到每一路采集到的相机视角中,替换每个人的人脸。
  • results: 该方法能以比现有方法更高的比例定位人脸;DisguisOR 为每个相机视角生成几何一致的匿名化结果,实现了更真实、对下游任务影响更小的匿名化。DisguisOR 在场景层面处理隐私问题,有望推动 SDS 领域的进一步研究。
    Abstract Purpose: Recent advances in Surgical Data Science (SDS) have contributed to an increase in video recordings from hospital environments. While methods such as surgical workflow recognition show potential in increasing the quality of patient care, the quantity of video data has surpassed the scale at which images can be manually anonymized. Existing automated 2D anonymization methods under-perform in Operating Rooms (OR), due to occlusions and obstructions. We propose to anonymize multi-view OR recordings using 3D data from multiple camera streams. Methods: RGB and depth images from multiple cameras are fused into a 3D point cloud representation of the scene. We then detect each individual's face in 3D by regressing a parametric human mesh model onto detected 3D human keypoints and aligning the face mesh with the fused 3D point cloud. The mesh model is rendered into every acquired camera view, replacing each individual's face. Results: Our method shows promise in locating faces at a higher rate than existing approaches. DisguisOR produces geometrically consistent anonymizations for each camera view, enabling more realistic anonymization that is less detrimental to downstream tasks. Conclusion: Frequent obstructions and crowding in operating rooms leaves significant room for improvement for off-the-shelf anonymization methods. DisguisOR addresses privacy on a scene level and has the potential to facilitate further research in SDS.
    摘要 目的:医疗数据科学(SDS)的最新进展使来自医院环境的视频记录不断增加。尽管手术流程识别等方法有望提升患者护理质量,但视频数据量已超出可人工匿名化的规模,且现有的自动 2D 匿名化方法由于遮挡和阻碍,在手术室(OR)中表现不佳。我们提出利用多路相机流的 3D 数据对多视角 OR 录像进行匿名化。方法:将多路相机的 RGB 与深度图像融合为场景的 3D 点云表示;随后将参数化人体网格模型回归到检测出的 3D 人体关键点上,并将人脸网格与融合后的 3D 点云对齐,从而在 3D 空间中检测每个人的面部;再将该网格模型渲染到每一路采集到的相机视角中,替换每个人的人脸。结果:该方法能以比现有方法更高的比例定位人脸。DisguisOR 为每个相机视角生成几何一致的匿名化结果,实现了更真实、对下游任务影响更小的匿名化。结论:手术室中频繁的遮挡与人员密集使现成的匿名化方法仍有很大提升空间;DisguisOR 在场景层面处理隐私问题,有望推动 SDS 领域的进一步研究。

Computational Approaches for Traditional Chinese Painting: From the “Six Principles of Painting” Perspective

  • paper_url: http://arxiv.org/abs/2307.14227
  • repo_url: None
  • paper_authors: Wei Zhang, Jian-Wei Zhang, Kam Kwai Wong, Yifang Wang, Yingchaojie Feng, Luwei Wang, Wei Chen
  • for: 本研究旨在探讨计算机技术在中国传统绘画(TCP)中的应用,以保护和普及这一独特的艺术形式。
  • methods: 本研究从三个视角分析计算机技术在 TCP 中的应用:以“绘画六法”理论为基础按艺术元素对文献分类;构建四阶段框架以说明 TCP 应用的目的;总结应用于 TCP 的常用计算技术。
  • results: 本研究分析了 92 篇文献并结合专家访谈,提出了描述 TCP 应用目的的四阶段框架,概述了常用计算技术,并给出了潜在应用与未来展望。
    Abstract Traditional Chinese Painting (TCP) is an invaluable cultural heritage resource and a unique visual art style. In recent years, increasing interest has been placed on digitalizing TCPs to preserve and revive the culture. The resulting digital copies have enabled the advancement of computational methods for structured and systematic understanding of TCPs. To explore this topic, we conducted an in-depth analysis of 92 pieces of literature. We examined the current use of computer technologies on TCPs from three perspectives, based on numerous conversations with specialists. First, in light of the "Six Principles of Painting" theory, we categorized the articles according to their research focus on artistic elements. Second, we created a four-stage framework to illustrate the purposes of TCP applications. Third, we summarized the popular computational techniques applied to TCPs. The framework also provides insights into potential applications and future prospects, with professional opinion. The list of surveyed publications and related information is available online at https://ca4tcp.com.
    摘要 中国传统绘画(TCP)是无价的文化遗产资源,也是一种独特的视觉艺术风格。近年来,为保护和振兴这一文化,对 TCP 数字化的关注日益增长,所产生的数字副本推动了对 TCP 进行结构化、系统化理解的计算方法的发展。为探讨这一主题,我们深入分析了 92 篇文献,并在与专家的多次交流基础上,从三个视角考察了计算机技术在 TCP 上的应用现状。首先,依据“绘画六法”理论,我们按文章对艺术元素的研究侧重进行了分类;其次,我们构建了一个四阶段框架来说明 TCP 应用的目的;最后,我们总结了应用于 TCP 的常用计算技术。该框架还结合专业意见,对潜在应用和未来前景给出了见解。所调研的文献列表及相关信息可在 https://ca4tcp.com 查阅。

ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation

  • paper_url: http://arxiv.org/abs/2307.14187
  • repo_url: https://github.com/KUIS-AI/adapt
  • paper_authors: Görkay Aydemir, Adil Kaan Akan, Fatma Güney
  • for: 在复杂交通场景中预测各交通参与者(智能体)的未来轨迹,需要兼顾可靠性与高效性;现有预测方法要么效率低下,要么牺牲精度。
  • methods: 我们提出 ADAPT 方法,通过动态权重学习一次性联合预测场景中所有智能体的轨迹。该方法在单智能体与多智能体设定下,在 Argoverse 和 Interaction 数据集上均超越了最先进方法,且计算开销仅为其一小部分。
  • results: 我们的分析表明,ADAPT 能够通过自适应预测聚焦于每个智能体,从而高效地给出准确的轨迹预测。
    Abstract Forecasting future trajectories of agents in complex traffic scenes requires reliable and efficient predictions for all agents in the scene. However, existing methods for trajectory prediction are either inefficient or sacrifice accuracy. To address this challenge, we propose ADAPT, a novel approach for jointly predicting the trajectories of all agents in the scene with dynamic weight learning. Our approach outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets, with a fraction of their computational overhead. We attribute the improvement in our performance: first, to the adaptive head augmenting the model capacity without increasing the model size; second, to our design choices in the endpoint-conditioned prediction, reinforced by gradient stopping. Our analyses show that ADAPT can focus on each agent with adaptive prediction, allowing for accurate predictions efficiently. https://KUIS-AI.github.io/adapt
    摘要 在复杂交通场景中预测智能体的未来轨迹,需要对场景中所有智能体给出可靠且高效的预测。然而,现有的轨迹预测方法要么效率低下,要么牺牲精度。为应对这一挑战,我们提出 ADAPT——一种通过动态权重学习联合预测场景中所有智能体轨迹的新方法。该方法在单智能体与多智能体设定下,在 Argoverse 和 Interaction 数据集上均超越了最先进方法,且计算开销仅为其一小部分。性能的提升归因于两点:其一,自适应头在不增大模型规模的情况下扩充了模型容量;其二,我们在以终点为条件的预测中的设计选择,并辅以梯度截断。我们的分析表明,ADAPT 能够通过自适应预测聚焦于每个智能体,从而高效地给出准确的预测。详见 https://KUIS-AI.github.io/adapt。

Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks

  • paper_url: http://arxiv.org/abs/2307.14179
  • repo_url: None
  • paper_authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim
  • for: 本研究旨在为语义分割网络提供优化指南,以提高分割结果的准确性。
  • methods: 本研究以 DeepLab 网络为对象,提出一套关于空洞空间金字塔池化(ASPP)模块的实用指南,依据输入图像尺寸来选取 ASPP 模块中的空洞率(atrous rate)。
  • results: 与其他取值相比,使用最优空洞率在 STARE、CHASE_DB1、HRF、Cityscapes 和 iSAID 等多个数据集上一致地提升了分割结果。
    Abstract DeepLab is a widely used deep neural network for semantic segmentation, whose success is attributed to its parallel architecture called atrous spatial pyramid pooling (ASPP). ASPP uses multiple atrous convolutions with different atrous rates to extract both local and global information. However, fixed values of atrous rates are used for the ASPP module, which restricts the size of its field of view. In principle, atrous rate should be a hyperparameter to change the field of view size according to the target task or dataset. However, the manipulation of atrous rate is not governed by any guidelines. This study proposes practical guidelines for obtaining an optimal atrous rate. First, an effective receptive field for semantic segmentation is introduced to analyze the inner behavior of segmentation networks. We observed that the use of ASPP module yielded a specific pattern in the effective receptive field, which was traced to reveal the module's underlying mechanism. Accordingly, we derive practical guidelines for obtaining the optimal atrous rate, which should be controlled based on the size of input image. Compared to other values, using the optimal atrous rate consistently improved the segmentation results across multiple datasets, including the STARE, CHASE_DB1, HRF, Cityscapes, and iSAID datasets.
    摘要 DeepLab 是一种广泛用于语义分割的深度神经网络,其成功归功于名为空洞空间金字塔池化(ASPP)的并行架构。ASPP 使用多个具有不同空洞率的空洞卷积来同时提取局部和全局信息。然而,ASPP 模块采用固定的空洞率取值,这限制了其感受野的大小。原则上,空洞率应当作为一个超参数,根据目标任务或数据集来调整感受野的大小;但空洞率的调整此前缺乏任何指导。本研究提出了获取最优空洞率的实用指南。首先,我们引入语义分割的有效感受野来分析分割网络的内部行为。我们观察到,使用 ASPP 模块会在有效感受野中产生一种特定模式,追踪该模式揭示了该模块的内在机制。据此,我们推导出获取最优空洞率的实用指南:空洞率应当依据输入图像的尺寸来控制。与其他取值相比,使用最优空洞率在 STARE、CHASE_DB1、HRF、Cityscapes 和 iSAID 等多个数据集上一致地提升了分割结果。
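The guideline can be sketched as an ASPP head whose atrous rates are derived from the input resolution instead of being fixed. The linear scaling rule below (rates proportional to input width, relative to the canonical 513-pixel DeepLab setting) is an illustrative assumption, not the paper's exact formula.

```python
# Sketch of an ASPP head with resolution-aware atrous rates.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates):
        super().__init__()
        # one dilated 3x3 branch per rate, plus a 1x1 branch
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
            + [nn.Conv2d(in_ch, out_ch, 1)]
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

def rates_for_input(width, base_width=513, base_rates=(6, 12, 18)):
    """Scale the canonical DeepLab rates linearly with input width."""
    scale = width / base_width
    return tuple(max(1, round(r * scale)) for r in base_rates)

for w in (257, 513, 1025):
    print(w, "->", rates_for_input(w))

feat = torch.randn(1, 256, 33, 33)          # backbone features for a 513px input
aspp = ASPP(256, 64, rates_for_input(513))
print(aspp(feat).shape)                      # torch.Size([1, 64, 33, 33])
```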

High-definition event frame generation using SoC FPGA devices

  • paper_url: http://arxiv.org/abs/2307.14177
  • repo_url: None
  • paper_authors: Krzysztof Blachut, Tomasz Kryjak
  • for: 这篇论文旨在在 FPGA 器件上实现高分辨率事件数据流(HD,1280 x 720 像素)向图像平面的累积与投影。
  • methods: 在 FPGA 器件上比较了若干事件数据表示(如二值帧、事件帧、指数衰减时间表面和事件频率)所需的硬件资源,并与 AMD Xilinx 的多款主流平台进行对照。
  • results: 结果证实了该方案的可行性,但也需要权衡若干挑战、限制与折中。所得到的事件帧可用于典型的视觉算法,例如基于传统方法和深度神经网络方法的目标分类与检测。
    Abstract In this paper we have addressed the implementation of the accumulation and projection of high-resolution event data stream (HD -1280 x 720 pixels) onto the image plane in FPGA devices. The results confirm the feasibility of this approach, but there are a number of challenges, limitations and trade-offs to be considered. The required hardware resources of selected data representations, such as binary frame, event frame, exponentially decaying time surface and event frequency, were compared with those available on several popular platforms from AMD Xilinx. The resulting event frames can be used for typical vision algorithms, such as object classification and detection, using both classical and deep neural network methods.
    摘要 本文研究了在 FPGA 器件上将高分辨率事件数据流(HD,1280 x 720 像素)累积并投影到图像平面的实现。结果证实了该方案的可行性,但也存在若干需要权衡的挑战、限制与折中。我们将二值帧、事件帧、指数衰减时间表面和事件频率等数据表示所需的硬件资源,与 AMD Xilinx 多款主流平台可提供的资源进行了比较。所得到的事件帧可用于典型的视觉算法,例如基于传统方法和深度神经网络方法的目标分类与检测。
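The compared representations are easy to prototype in software before committing them to FPGA logic: a binary frame, a per-pixel event count (frequency), and an exponentially decaying time surface. Resolution matches the HD stream above; the synthetic event stream and decay constant are illustrative assumptions.

```python
# Software prototypes of the event-frame representations compared in the paper.
import numpy as np

H, W = 720, 1280
rng = np.random.default_rng(0)
n = 10_000
events = {                                   # (x, y, t, polarity) stream
    "x": rng.integers(0, W, n),
    "y": rng.integers(0, H, n),
    "t": np.sort(rng.uniform(0.0, 0.01, n)), # seconds, monotonically increasing
    "p": rng.integers(0, 2, n),
}

binary = np.zeros((H, W), dtype=np.uint8)    # 1 where any event occurred
freq = np.zeros((H, W), dtype=np.uint16)     # event count per pixel
surface = np.zeros((H, W), dtype=np.float32) # exponentially decaying time surface
tau = 3e-3                                   # decay constant [s]
t_ref = events["t"][-1]

binary[events["y"], events["x"]] = 1
np.add.at(freq, (events["y"], events["x"]), 1)   # unbuffered add handles repeats

# keep the most recent timestamp per pixel, then decay relative to t_ref
last_t = np.full((H, W), -np.inf, dtype=np.float64)
np.maximum.at(last_t, (events["y"], events["x"]), events["t"])
mask = np.isfinite(last_t)
surface[mask] = np.exp((last_t[mask] - t_ref) / tau)

print(binary.sum(), freq.max(), surface.max())
```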

A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

  • paper_url: http://arxiv.org/abs/2307.14397
  • repo_url: https://github.com/sutd-visual-computing-group/awesome-generative-modeling-under-data-constraints
  • paper_authors: Milad Abdollahzadeh, Touba Malekzadeh, Christopher T. H. Teo, Keshigeyan Chandrasegaran, Guimeng Liu, Ngai-Man Cheung
  • for: 本研究旨在综述数据受限条件下的生成模型学习,涵盖有限数据、少样本和零样本等情形(GM-DC)。
  • methods: 本研究提出了两种分类体系:一种针对 GM-DC 任务,另一种针对 GM-DC 方法;同时,研究者还分析了不同任务与方法之间的相互作用。
  • results: 研究者指出了若干研究空白、研究趋势以及未来值得探索的方向,例如如何在数据受限条件下提升生成模型的性能,以及如何将生成模型应用于医疗等数据获取困难的领域。
    Abstract In machine learning, generative modeling aims to learn to generate new data statistically similar to the training data distribution. In this paper, we survey learning generative models under limited data, few shots and zero shot, referred to as Generative Modeling under Data Constraint (GM-DC). This is an important topic when data acquisition is challenging, e.g. healthcare applications. We discuss background, challenges, and propose two taxonomies: one on GM-DC tasks and another on GM-DC approaches. Importantly, we study interactions between different GM-DC tasks and approaches. Furthermore, we highlight research gaps, research trends, and potential avenues for future exploration. Project website: https://gmdc-survey.github.io.
    摘要 在机器学习中,生成建模旨在学习生成与训练数据分布统计相似的新数据。本文综述了在有限数据、少样本和零样本条件下学习生成模型的研究,即数据受限条件下的生成建模(GM-DC)。当数据获取困难时(例如医疗应用),这是一个重要的课题。我们讨论了相关背景与挑战,并提出两种分类体系:一种针对 GM-DC 任务,另一种针对 GM-DC 方法。重要的是,我们研究了不同 GM-DC 任务与方法之间的相互作用。此外,我们还指出了研究空白、研究趋势以及未来值得探索的方向。项目网站:https://gmdc-survey.github.io。

Creative Birds: Self-Supervised Single-View 3D Style Transfer

  • paper_url: http://arxiv.org/abs/2307.14127
  • repo_url: https://github.com/wrk226/creative_birds
  • paper_authors: Renke Wang, Guimin Que, Shuo Chen, Xiang Li, Jun Li, Jian Yang
  • for: 本研究主要针对单视角 3D 风格迁移问题,以 3D 重建中的热门对象——鸟类为例,提出一种能够从两张单视角图像中同时迁移形状与纹理、生成独特 3D 网格的新方法。
  • methods: 该方法包含一个新颖的形状迁移生成器,由双残差门控网络(DRGNet)和多层感知机(MLP)组成:DRGNet 通过共享坐标门控单元提取源图像与目标图像的特征,MLP 生成用于构建 3D 网格的空间坐标。此外,还提出一个语义 UV 纹理迁移模块,利用语义 UV 分割实现纹理风格迁移,保证迁移区域语义含义的一致性,且可与许多现有方法结合使用。最后,该方法借助可微渲染器构建新的 3D 鸟类模型。
  • results: 在 CUB 数据集上的实验结果表明,该方法在单视角 3D 风格迁移任务上达到了最先进的性能。
    Abstract In this paper, we propose a novel method for single-view 3D style transfer that generates a unique 3D object with both shape and texture transfer. Our focus lies primarily on birds, a popular subject in 3D reconstruction, for which no existing single-view 3D transfer methods have been developed.The method we propose seeks to generate a 3D mesh shape and texture of a bird from two single-view images. To achieve this, we introduce a novel shape transfer generator that comprises a dual residual gated network (DRGNet), and a multi-layer perceptron (MLP). DRGNet extracts the features of source and target images using a shared coordinate gate unit, while the MLP generates spatial coordinates for building a 3D mesh. We also introduce a semantic UV texture transfer module that implements textural style transfer using semantic UV segmentation, which ensures consistency in the semantic meaning of the transferred regions. This module can be widely adapted to many existing approaches. Finally, our method constructs a novel 3D bird using a differentiable renderer. Experimental results on the CUB dataset verify that our method achieves state-of-the-art performance on the single-view 3D style transfer task. Code is available in https://github.com/wrk226/creative_birds.
    摘要 在这篇论文中,我们提出了一种新的单视角 3D 风格迁移方法,能够生成形状与纹理均发生迁移的独特 3D 对象。我们主要关注鸟类这一 3D 重建中的热门主题,目前尚无针对它的单视角 3D 迁移方法。所提方法旨在从两张单视角图像生成鸟类的 3D 网格形状与纹理。为此,我们提出了一种新颖的形状迁移生成器,由双残差门控网络(DRGNet)和多层感知机(MLP)组成:DRGNet 通过共享坐标门控单元提取源图像与目标图像的特征,MLP 则生成用于构建 3D 网格的空间坐标。我们还提出了一个语义 UV 纹理迁移模块,利用语义 UV 分割实现纹理风格迁移,以保证迁移区域语义含义的一致性;该模块可以与许多现有方法结合使用。最后,我们的方法借助可微渲染器构建出新的 3D 鸟类。在 CUB 数据集上的实验结果表明,我们的方法在单视角 3D 风格迁移任务上达到了最先进的性能。代码见 https://github.com/wrk226/creative_birds。

Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

  • paper_url: http://arxiv.org/abs/2307.14126
  • repo_url: None
  • paper_authors: Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, Gustavo Carneiro
  • for: 这篇论文旨在解决多模态任务中的模态缺失问题,并为此提出一种更简单、更有效的处理方法。
  • methods: 所提出的方法名为共享-特定特征建模(Shared-Specific Feature Modelling,ShaSpec),它借助基于分布对齐与域分类的辅助任务以及残差特征融合过程,从所有可用的输入模态中学习共享特征与特定特征。
  • results: 论文报告称,ShaSpec 在医学图像分割与计算机视觉分类任务上均大幅超越竞争方法,例如在 BraTS2018 上,增强肿瘤提升超过 3%,肿瘤核心提升 5%,整体肿瘤提升 3%。
    Abstract The missing modality issue is critical but non-trivial for multi-modal models to solve. Current methods aiming to handle the missing modality problem in multi-modal tasks either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so, for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved with a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour.
    摘要 模态缺失问题对多模态模型而言至关重要且难以解决。当前处理多模态任务中模态缺失问题的方法,要么仅在评估阶段处理缺失模态,要么针对特定的缺失模态设定训练单独的模型。此外,这些模型通常是为特定任务设计的,例如分类模型难以适配到分割任务,反之亦然。本文提出共享-特定特征建模(ShaSpec)方法,它比解决上述问题的竞争方法更简单、更有效。ShaSpec 在训练和评估阶段利用所有可用的输入模态,通过学习共享特征与特定特征来更好地表示输入数据。这一点通过基于分布对齐与域分类的辅助任务以及残差特征融合过程来实现。此外,ShaSpec 的设计简洁,便于适配分类与分割等多种任务。在医学图像分割与计算机视觉分类上的实验结果表明,ShaSpec 大幅超越了竞争方法。例如,在 BraTS2018 上,ShaSpec 将最先进水平在增强肿瘤上提升超过 3%,在肿瘤核心上提升 5%,在整体肿瘤上提升 3%。

Memory-Efficient Graph Convolutional Networks for Object Classification and Detection with Event Cameras

  • paper_url: http://arxiv.org/abs/2307.14124
  • repo_url: None
  • paper_authors: Kamil Jeziorek, Andrea Pinna, Tomasz Kryjak
  • for: 本研究旨在提高事件摄像头数据处理的效率和准确率,并且考虑了数据存储和计算成本。
  • methods: 本研究使用图 convolutional neural networks (GCNs) 进行事件数据分析,并对不同的图 convolution 操作进行比较分析,以选择最佳的操作。
  • results: 研究结果显示,使用提出的方法可以实现52.3%的分类精度,同时减少了特征提取模块中的参数数量450倍,并将数据表示形式的大小减少4.5倍。此外,对 N-Caltech101 数据集进行 object detection 预测,实现了53.7%的 mAP@0.5 精度和82个图像每秒的执行速率。
    Abstract Recent advances in event camera research emphasize processing data in its original sparse form, which allows the use of its unique features such as high temporal resolution, high dynamic range, low latency, and resistance to image blur. One promising approach for analyzing event data is through graph convolutional networks (GCNs). However, current research in this domain primarily focuses on optimizing computational costs, neglecting the associated memory costs. In this paper, we consider both factors together in order to achieve satisfying results and relatively low model complexity. For this purpose, we performed a comparative analysis of different graph convolution operations, considering factors such as execution time, the number of trainable model parameters, data format requirements, and training outcomes. Our results show a 450-fold reduction in the number of parameters for the feature extraction module and a 4.5-fold reduction in the size of the data representation while maintaining a classification accuracy of 52.3%, which is 6.3% higher compared to the operation used in state-of-the-art approaches. To further evaluate performance, we implemented the object detection architecture and evaluated its performance on the N-Caltech101 dataset. The results showed an accuracy of 53.7 % mAP@0.5 and reached an execution rate of 82 graphs per second.
    摘要 最近的事件摄像头研究发展强调处理原始稀疏数据,这使得可以利用高时间分辨率、高动态范围、低延迟和图像模糊鲁棒性的独特特点。一种有前途的方法是使用图像会议网络(GCN)来分析事件数据。然而,当前研究主要关注计算成本优化,忽略了相关的内存成本。在这篇论文中,我们同时考虑这两个因素,以达到满意的结果和相对较低的模型复杂度。为此,我们进行了不同图像会议操作的比较分析,考虑因素包括执行时间、可训练模型参数数量、数据格式要求和训练结果。我们的结果显示了特征提取模块中参数数量的450倍减少和数据表示形式的4.5倍减小,同时保持52.3%的分类精度,与现有方法相比增加6.3%。为了进一步评估性能,我们实现了对象检测架构并在N-Caltech101数据集上评估其性能。结果显示了53.7%的mAP@0.5精度和82个图像每秒执行速度。

Periocular biometrics: databases, algorithms and directions

  • paper_url: http://arxiv.org/abs/2307.14111
  • repo_url: None
  • paper_authors: Fernando Alonso-Fernandez, Josef Bigun
  • for: 本文综述眼周(periocular)生物识别研究,概述现有文献,探讨主要问题,并简要介绍未来研究趋势。
  • methods: 本文涉及的方法包括眼周特征提取及其在性别分类和种族分类中的应用,以及性别转变或整形手术对识别性能影响的研究。
  • results: 本文总结了眼周生物识别研究的最新进展,剖析了最相关的问题,并对现有文献进行了全面覆盖。
    Abstract Periocular biometrics has been established as an independent modality due to concerns on the performance of iris or face systems in uncontrolled conditions. Periocular refers to the facial region in the eye vicinity, including eyelids, lashes and eyebrows. It is available over a wide range of acquisition distances, representing a trade-off between the whole face (which can be occluded at close distances) and the iris texture (which do not have enough resolution at long distances). Since the periocular region appears in face or iris images, it can be used also in conjunction with these modalities. Features extracted from the periocular region have been also used successfully for gender classification and ethnicity classification, and to study the impact of gender transformation or plastic surgery in the recognition performance. This paper presents a review of the state of the art in periocular biometric research, providing an insight of the most relevant issues and giving a thorough coverage of the existing literature. Future research trends are also briefly discussed.
    摘要 由于虹膜或人脸系统在非受控条件下的性能令人担忧,眼周(periocular)生物识别已被确立为一种独立的模态。眼周指眼睛附近的面部区域,包括眼睑、睫毛和眉毛。它可在很宽的采集距离范围内获得,是整张人脸(近距离可能被遮挡)与虹膜纹理(远距离分辨率不足)之间的一种折中。由于眼周区域会出现在人脸或虹膜图像中,它也可以与这些模态结合使用。从眼周区域提取的特征还被成功地用于性别分类和种族分类,以及研究性别转变或整形手术对识别性能的影响。本文综述了眼周生物识别研究的最新进展,剖析了最相关的问题,并对现有文献进行了全面覆盖,同时简要讨论了未来的研究趋势。

VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

  • paper_url: http://arxiv.org/abs/2307.14073
  • repo_url: https://github.com/ZhihaoHu/VideoControlNet
  • paper_authors: Zhihao Hu, Dong Xu
  • for: 本研究提出一种新的运动引导视频到视频转换框架 VideoControlNet,能够依据给定提示和输入视频的条件生成各种视频。
  • methods: 我们将扩散模型与 ControlNet 相结合,并提出运动引导的 P 帧生成方法(MgPG)与运动引导的 B 帧插值模块(MgBI)来生成视频。
  • results: 实验表明,VideoControlNet 继承了预训练大型扩散模型的生成能力,并借助运动信息将图像扩散模型扩展为视频扩散模型。更多结果请见我们的项目页面。
    Abstract Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we proposed a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frame) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the rest frames (i.e., the B-frame) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page.
    摘要 近些年,扩散模型如StableDiffusion在图像生成方面取得了卓越的成绩。然而,扩散模型的生成过程不可控,这使得生成视频时难以保持内容连续和一致。在这项工作中,我们通过将扩散模型与ControlNet结合使用,提出了一种基于提示和输入视频的动作指导的视频到视频翻译框架——VideoControlNet。我们受到视频编码器使用运动信息减少 temporal 重复的启发,我们的框架使用运动信息来避免重新生成冗余区域以保持内容一致。具体来说,我们使用扩散模型与ControlNet生成首帧(i.e., I-frame),然后使用我们新提出的运动指导 P-frame 生成方法(MgPG)生成后续的其他关键帧(i.e., P-frame),并使用扩散模型填充 occlusion 区域。最后,我们使用我们的运动指导 B-frame interpolate 模块(MgBI)生成剩余帧(i.e., B-frame)。我们的实验表明,我们提posed VideoControlNet 继承了预训练的大扩散模型的生成能力,并将图像扩散模型扩展到视频扩散模型,并且使用运动信息。更多结果请参考我们项目页面。

PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

  • paper_url: http://arxiv.org/abs/2307.14070
  • repo_url: None
  • paper_authors: Wenjie Xuan, Shanshan Zhao, Yu Yao, Juhua Liu, Tongliang Liu, Yixin Chen, Bo Du, Dacheng Tao
  • for: This paper addresses the label-noise problem, especially in large-scale training data, to improve edge-detection performance.
  • methods: It proposes a Pixel-level Noise Transitions (PNT) model that estimates the label-corruption process; a Pixel-wise Shift Learning (PSL) module estimates the transition from clean to noisy labels as a displacement field.
  • results: Experiments show that the model effectively relieves the impact of label noise while maintaining strong edge-detection performance.
    Abstract Relying on large-scale training data with pixel-level labels, previous edge detection methods have achieved high performance. However, it is hard to manually label edges accurately, especially for large datasets, and thus the datasets inevitably contain noisy labels. This label-noise issue has been studied extensively for classification, while still remaining under-explored for edge detection. To address the label-noise issue for edge detection, this paper proposes to learn Pixel-level Noise Transitions to model the label-corruption process. To achieve it, we develop a novel Pixel-wise Shift Learning (PSL) module to estimate the transition from clean to noisy labels as a displacement field. Exploiting the estimated noise transitions, our model, named PNT-Edge, is able to fit the prediction to clean labels. In addition, a local edge density regularization term is devised to exploit local structure information for better transition learning. This term encourages learning large shifts for the edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of our method in relieving the impact of label noise. Codes will be available at github.

Pre-Training with Diffusion models for Dental Radiography segmentation

  • paper_url: http://arxiv.org/abs/2307.14066
  • repo_url: None
  • paper_authors: Jérémy Rousseau, Christian Alaka, Emma Covili, Hippolyte Mayard, Laura Misrachi, Willy Au
  • for: Targets medical radiography segmentation, specifically dental radiography, where progress is limited by the high cost of labeling.
  • methods: Proposes a simple pre-training method for semantic segmentation using Denoising Diffusion Probabilistic Models (DDPM).
  • results: Experiments show remarkable label efficiency without requiring architectural modifications between pre-training and downstream tasks.
    Abstract Medical radiography segmentation, and specifically dental radiography, is highly limited by the cost of labeling which requires specific expertise and labor-intensive annotations. In this work, we propose a straightforward pre-training method for semantic segmentation leveraging Denoising Diffusion Probabilistic Models (DDPM), which have shown impressive results for generative modeling. Our straightforward approach achieves remarkable performance in terms of label efficiency and does not require architectural modifications between pre-training and downstream tasks. We propose to first pre-train a Unet by exploiting the DDPM training objective, and then fine-tune the resulting model on a segmentation task. Our experimental results on the segmentation of dental radiographs demonstrate that the proposed method is competitive with state-of-the-art pre-training methods.
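As a rough illustration of the pre-training objective, the sketch below trains a network to predict the noise injected by the DDPM forward process. The linear beta schedule is a common default, not necessarily the paper's, and `unet` is a toy stand-in for the segmentation backbone to be fine-tuned later.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(unet, x0, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    abar = torch.cumprod(1.0 - betas, dim=0)        # cumulative alphas
    t = torch.randint(0, T, (x0.shape[0],))         # random timesteps
    a = abar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * eps     # forward diffusion
    return F.mse_loss(unet(xt, t), eps)             # noise-prediction loss

unet = lambda x, t: x                               # toy stand-in network
print(ddpm_loss(unet, torch.randn(2, 1, 64, 64)))
```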

ECO: Ensembling Context Optimization for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.14063
  • repo_url: None
  • paper_authors: Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo
  • for: This paper studies how textual prompts can drive image classification, using the CLIP model for zero-shot transfer.
  • methods: An ensemble approach learns multiple diverse text prompts to improve classification performance.
  • results: Learning multiple, possibly shorter prompts considerably and consistently improves few-shot classification with no additional cost at inference time, as demonstrated on 11 benchmarks.
    Abstract Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
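A minimal sketch of the prompt-ensembling idea at inference time: average the per-prompt class scores. The (P, K, D) tensor below is a random stand-in for class embeddings a frozen CLIP text encoder would produce from P learned contexts over K classes.

```python
import torch
import torch.nn.functional as F

def ensemble_classify(image_feat, class_embeds):   # class_embeds: (P, K, D)
    img = F.normalize(image_feat, dim=-1)          # (D,)
    txt = F.normalize(class_embeds, dim=-1)        # (P, K, D)
    logits = torch.einsum('d,pkd->pk', img, txt)   # per-prompt class scores
    return logits.mean(dim=0).argmax().item()      # ensemble by averaging

print(ensemble_classify(torch.randn(512), torch.randn(4, 10, 512)))
```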

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models

  • paper_url: http://arxiv.org/abs/2307.14061
  • repo_url: https://github.com/Zoky-2020/Set-level_Guidance_Attack
  • paper_authors: Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, Feng Zheng
  • for: This work investigates the adversarial transferability of recent VLP models.
  • methods: It proposes a highly transferable Set-level Guidance Attack (SGA) that fully exploits cross-modal interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
  • results: SGA generates adversarial examples that transfer strongly across different VLP models on multiple downstream vision-language tasks; on image-text retrieval, it raises the success rate of transfer attacks from ALBEF to TCL well beyond the state of the art (by at least 9.78% and up to 30.21%).
    Abstract Vision-language pre-training (VLP) models have shown vulnerability to adversarial examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately transferred to attack other black-box models. However, existing work has mainly focused on investigating white-box attacks. In this paper, we present the first study to investigate the adversarial transferability of recent VLP models. We observe that existing methods exhibit much lower transferability, compared to the strong attack performance in white-box settings. The transferability degradation is partly caused by the under-utilization of cross-modal interactions. Particularly, unlike unimodal learning, VLP models rely heavily on cross-modal interactions and the multimodal alignments are many-to-many, e.g., an image can be described in various natural languages. To this end, we propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Experimental results demonstrate that SGA could generate adversarial examples that can strongly transfer across different VLP models on multiple downstream vision-language tasks. On image-text retrieval, SGA significantly enhances the attack success rate for transfer attacks from ALBEF to TCL by a large margin (at least 9.78% and up to 30.21%), compared to the state-of-the-art.
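The sketch below illustrates only the set-level flavor of the objective with a toy PGD loop that pushes an image away from all of its paired caption embeddings at once under an L_inf budget; the encoder, embeddings, and step sizes are made-up stand-ins, and the real SGA additionally relies on alignment-preserving augmentation and cross-modal guidance.

```python
import torch
import torch.nn.functional as F

def set_level_attack(img, img_enc, txt_embs, eps=8/255, alpha=2/255, steps=10):
    adv = img.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        sim = F.cosine_similarity(img_enc(adv), txt_embs).mean()  # to the set
        grad, = torch.autograd.grad(sim, adv)
        with torch.no_grad():
            step = adv - alpha * grad.sign()              # descend similarity
            adv = (img + (step - img).clamp(-eps, eps)).clamp(0, 1)
    return adv.detach()

enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
x = torch.rand(1, 3, 32, 32)
adv = set_level_attack(x, enc, torch.randn(5, 64))
print((adv - x).abs().max().item())  # bounded by eps
```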

Towards Establishing Systematic Classification Requirements for Automated Driving

  • paper_url: http://arxiv.org/abs/2307.14058
  • repo_url: None
  • paper_authors: Ken T. Mori, Trent Brown, Steven Peters
  • for: This paper defines a method for deriving consistent classification requirements for perception in the automated driving domain.
  • methods: Legal categories are first identified from behavioral requirements for the vehicle; the structure is then substantiated by considering collision safety of objects and perceptual categories, yielding a classification hierarchy.
  • results: Applying the method to an exemplary legal text and comparing the result with benchmark dataset categories shows limited agreement, indicating the need to explicitly consider legal requirements regarding perception.
    Abstract Despite the presence of the classification task in many different benchmark datasets for perception in the automotive domain, few efforts have been undertaken to define consistent classification requirements. This work addresses the topic by proposing a structured method to generate a classification structure. First, legal categories are identified based on behavioral requirements for the vehicle. This structure is further substantiated by considering the two aspects of collision safety for objects as well as perceptual categories. A classification hierarchy is obtained by applying the method to an exemplary legal text. A comparison of the results with benchmark dataset categories shows limited agreement. This indicates the necessity for explicit consideration of legal requirements regarding perception.

Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.14052
  • repo_url: https://github.com/pjlallen/udun
  • paper_authors: Jialun Pei, Zhangjun Zhou, Yueming Jin, He Tang, Pheng-Ann Heng
  • for: High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects in natural scenes.
  • methods: We propose a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to jointly boost trunk and structure identification.
  • results: UDUN outperforms state-of-the-art competitors on all six evaluation metrics and supports real-time inference at 65.3 fps with 1024*1024 input and a ResNet-18 backbone.
    Abstract High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN offers several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using 1024*1024 input, our model enables real-time inference at 65.3 fps with ResNet-18.

3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing Capability

  • paper_url: http://arxiv.org/abs/2307.14051
  • repo_url: https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser
  • paper_authors: Ruowei Wang, Yu Liu, Pei Su, Jianwei Zhang, Qijun Zhao
  • for: This work provides a semantic-attribute-driven generative model for category-specific 3D shapes that preserves the semantic consistency of shape structure and enables editing of semantic attributes during 3D content creation.
  • methods: The model represents 3D shapes with implicit functions and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes; each dimension corresponds to a particular semantic attribute, which can be edited by traversing its coefficient.
  • results: Experiments show that the method produces plausible shapes with complex structures and enables editing of semantic attributes. Code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser.
    Abstract Shape generation is the practice of producing 3D shapes as various representations for 3D content creation. Previous studies on 3D shape generation have focused on shape quality and structure, without or less considering the importance of semantic information. Consequently, such generative models often fail to preserve the semantic consistency of shape structure or enable manipulation of the semantic attributes of shapes during generation. In this paper, we proposed a novel semantic generative model named 3D Semantic Subspace Traverser that utilizes semantic attributes for category-specific 3D shape generation and editing. Our method utilizes implicit functions as the 3D shape representation and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes. Each dimension of the subspace corresponds to a particular semantic attribute, and we can edit the attributes of generated shapes by traversing the coefficients of those dimensions. Experimental results demonstrate that our method can produce plausible shapes with complex structures and enable the editing of semantic attributes. The code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser
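The traversal step itself is simple to sketch: move a latent code along one basis direction of a linear subspace and decode. In the toy Python below the subspace is a random orthonormal basis and `decode` a stand-in; the paper instead discovers the subspace in the local latent space of a latent-space GAN over implicit shape representations.

```python
import torch

def traverse(z, subspace, dim, alpha):
    return z + alpha * subspace[:, dim]      # edit one semantic dimension

decode = lambda z: z.tanh()                  # stand-in implicit decoder
subspace, _ = torch.linalg.qr(torch.randn(128, 8))  # orthonormal basis
z = torch.randn(128)
edited = traverse(z, subspace, dim=2, alpha=3.0)
print((decode(z) - decode(edited)).abs().mean())  # the decoded shape changes
```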

Controllable Guide-Space for Generalizable Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.14039
  • repo_url: None
  • paper_authors: Ying Guo, Cheng Zhen, Pengfei Yan
  • for: Improving the generalization of face forgery detection.
  • methods: Proposes a controllable guide-space (GS) method that enhances the discrimination of different forgery domains and enlarges the distance between real and forgery domains; a decoupling module weakens the interference of forgery-irrelevant correlations, and the decision boundary manifold is adjusted according to the clustering degree of same-domain features in the neighborhood.
  • results: Extensive experiments in multiple in-domain and cross-domain settings show state-of-the-art generalization.
    Abstract Recent studies on face forgery detection have shown satisfactory performance for methods involved in training datasets, but are not ideal enough for unknown domains. This motivates many works to improve the generalization, but forgery-irrelevant information, such as image background and identity, still exists in different domain features and causes unexpected clustering, limiting the generalization. In this paper, we propose a controllable guide-space (GS) method to enhance the discrimination of different forgery domains, so as to increase the forgery relevance of features and thereby improve the generalization. The well-designed guide-space can simultaneously achieve both the proper separation of forgery domains and the large distance between real-forgery domains in an explicit and controllable manner. Moreover, for better discrimination, we use a decoupling module to weaken the interference of forgery-irrelevant correlations between domains. Furthermore, we make adjustments to the decision boundary manifold according to the clustering degree of the same domain features within the neighborhood. Extensive experiments in multiple in-domain and cross-domain settings confirm that our method can achieve state-of-the-art generalization.

Human-centric Scene Understanding for 3D Large-scale Scenarios

  • paper_url: http://arxiv.org/abs/2307.14392
  • repo_url: https://github.com/4dvlab/hucenlife
  • paper_authors: Yiteng Xu, Peishan Cong, Yichen Yao, Runnan Chen, Yuenan Hou, Xinge Zhu, Xuming He, Jingyi Yu, Yuexin Ma
  • for: The paper provides a large-scale multi-modal dataset (HuCenLife) for human-centric scene understanding, aimed at improving 3D perception.
  • methods: Data is captured with LiDAR in diverse daily-life scenarios with rich, fine-grained annotations; novel LiDAR-based segmentation and action-recognition modules are designed for large-scale human-centric scenarios.
  • results: The paper achieves state-of-the-art performance on human-centric scene understanding tasks and provides benchmarks to facilitate related research.
    Abstract Human-centric scene understanding is significant for real-world applications, but it is extremely challenging due to the existence of diverse human poses and actions, complex human-environment interactions, severe occlusions in crowds, etc. In this paper, we present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife, which is collected in diverse daily-life scenarios with rich and fine-grained annotations. Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc., and we also provide benchmarks for these tasks to facilitate related research. In addition, we design novel modules for LiDAR-based segmentation and action recognition, which are more applicable for large-scale human-centric scenarios and achieve state-of-the-art performance.

Consensus-Adaptive RANSAC

  • paper_url: http://arxiv.org/abs/2307.14030
  • repo_url: https://github.com/cavalli1234/ca-ransac
  • paper_authors: Luca Cavalli, Daniel Barath, Marc Pollefeys, Viktor Larsson
  • for: Improving the accuracy of robust estimation so that RANSAC adapts better across datasets and tasks.
  • methods: An attention layer operating on batches of point-to-model residuals, together with a lightweight one-step transformer, lets RANSAC explore the parameter space while accounting for the residuals seen so far.
  • results: The proposed method outperforms state-of-the-art estimators on essential and fundamental matrix estimation with only a small runtime overhead, and the trained model generalizes well across datasets and tasks.
    Abstract RANSAC and its variants are widely used for robust estimation, however, they commonly follow a greedy approach to finding the highest scoring model while ignoring other model hypotheses. In contrast, Iteratively Reweighted Least Squares (IRLS) techniques gradually approach the model by iteratively updating the weight of each correspondence based on the residuals from previous iterations. Inspired by these methods, we propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer. The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer. This rich state then guides the minimal sampling between iterations as well as the model refinement. We evaluate the proposed approach on essential and fundamental matrix estimation on a number of indoor and outdoor datasets. It outperforms state-of-the-art estimators by a significant margin adding only a small runtime overhead. Moreover, we demonstrate good generalization properties of our trained model, indicating its effectiveness across different datasets and tasks. The proposed attention mechanism and one-step transformer provide an adaptive behavior that enhances the performance of RANSAC, making it a more effective tool for robust estimation. Code is available at https://github.com/cavalli1234/CA-RANSAC.
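For context, the fixed IRLS reweighting rule that CA-RANSAC's attention layer can be read as generalizing fits in a few lines. This is a toy robust line fit, not the paper's estimator:

```python
import numpy as np

def irls_line_fit(x, y, iters=20, eps=1e-6):
    """Fit y = a*x + b, reweighting each point by its previous residual."""
    w = np.ones_like(x)
    for _ in range(iters):
        A = np.stack([x, np.ones_like(x)], axis=1)
        AtWA = A.T @ (A * w[:, None])
        AtWy = A.T @ (w * y)
        a, b = np.linalg.solve(AtWA, AtWy)       # weighted least squares
        r = np.abs(y - (a * x + b))
        w = 1.0 / np.maximum(r, eps)             # downweight large residuals
    return a, b

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, 100)
y[:20] += rng.normal(0, 5, 20)                   # gross outliers
print(irls_line_fit(x, y))                       # close to (2.0, 1.0)
```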

Topologically-Regularized Multiple Instance Learning for Red Blood Cell Disease Classification

  • paper_url: http://arxiv.org/abs/2307.14025
  • repo_url: None
  • paper_authors: Salome Kazeminia, Ario Sadafi, Asya Makhro, Anna Bogdanova, Carsten Marr, Bastian Rieck
  • for: Automated recognition of rare anemia disorders from single red blood cell images.
  • methods: A topology-based approach extracts multi-scale topological features from bags of single-cell images and uses them to regularize the model, preserving characteristic topological properties of the data.
  • results: Experiments show that topological regularization improves the automated classification of rare anemia disorders by more than 3% over conventional multiple-instance learning; this is the first approach to regularize the MIL process with topological properties.
    Abstract Diagnosing rare anemia disorders using microscopic images is challenging for skilled specialists and machine-learning methods alike. Due to thousands of disease-relevant cells in a single blood sample, this constitutes a complex multiple-instance learning (MIL) problem. While the spatial neighborhood of red blood cells is not meaningful per se, the topology, i.e., the geometry of blood samples as a whole, contains informative features to remedy typical MIL issues, such as vanishing gradients and overfitting when training on limited data. We thus develop a topology-based approach that extracts multi-scale topological features from bags of single red blood cell images. The topological features are used to regularize the model, enforcing the preservation of characteristic topological properties of the data. Applied to a dataset of 71 patients suffering from rare anemia disorders with 521 microscopic images of red blood cells, our experiments show that topological regularization is an effective method that leads to more than 3% performance improvements for the automated classification of rare anemia disorders based on single-cell images. This is the first approach that uses topological properties for regularizing the MIL process.
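One concrete way to build such a topological penalty, sketched under the assumption of a Vietoris-Rips filtration (where 0-dimensional persistence reduces to the minimum-spanning-tree edge lengths of the point cloud); the paper's multi-scale features are richer than this, and the synthetic vectors below merely stand in for bag and embedding features:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def persistence0(points):
    # 0-dim persistence of a Rips filtration = MST edge lengths.
    mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
    return np.sort(mst[mst > 0])          # n-1 component death times

rng = np.random.default_rng(0)
bag, emb = rng.normal(size=(64, 32)), rng.normal(size=(64, 8))
penalty = np.abs(persistence0(bag) - persistence0(emb)).mean()
print(penalty)                             # topological mismatch to regularize
```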

Retinotopy Inspired Brain Encoding Model and the All-for-One Training Recipe

  • paper_url: http://arxiv.org/abs/2307.14021
  • repo_url: None
  • paper_authors: Huzheng Yang, Jianbo Shi, James Gee
  • for: Predicting voxel-wise brain responses to stimulus images, replicating signals captured by neuroimaging techniques.
  • methods: The All-for-One training recipe divides the one-big-model problem into multiple small models that aggregate knowledge while preserving the distinction between functional regions; biological knowledge of retinotopy introduces an inductive bias for learning a 3D brain-to-image mapping.
  • results: A comprehensive brain encoding model is pre-trained on over one million data points from five public datasets spanning three imaging modalities; it serves as a drop-in replacement for commonly used vision backbones and is further applied to brain decoding.
    Abstract Brain encoding models aim to predict brain voxel-wise responses to stimuli images, replicating brain signals captured by neuroimaging techniques. There is a large volume of publicly available data, but training a comprehensive brain encoding model is challenging. The main difficulties stem from a) diversity within individual brain, with functional heterogeneous brain regions; b) diversity of brains from different subjects, due to genetic and developmental differences; c) diversity of imaging modalities and processing pipelines. We use this diversity to our advantage by introducing the All-for-One training recipe, which divides the challenging one-big-model problem into multiple small models, with the small models aggregating the knowledge while preserving the distinction between the different functional regions. Agnostic of the training recipe, we use biological knowledge of the brain, specifically retinotopy, to introduce inductive bias to learn a 3D brain-to-image mapping that ensures a) each neuron knows which image regions and semantic levels to gather information, and b) no neurons are left behind in the model. We pre-trained a brain encoding model using over one million data points from five public datasets spanning three imaging modalities. To the best of our knowledge, this is the most comprehensive brain encoding model to the date. We demonstrate the effectiveness of the pre-trained model as a drop-in replacement for commonly used vision backbone models. Furthermore, we demonstrate the application of the model to brain decoding. Code and the model checkpoint will be made available.

RPG-Palm: Realistic Pseudo-data Generation for Palmprint Recognition

  • paper_url: http://arxiv.org/abs/2307.14016
  • repo_url: None
  • paper_authors: Lei Shen, Jianlong Jin, Ruixin Zhang, Huaen Li, Kai Zhao, Yingyi Zhang, Jingyun Zhang, Shouhong Ding, Yang Zhao, Wei Jia
  • for: Improving palmprint recognition performance, addressing the lack of large-scale public palmprint datasets.
  • methods: Proposes a novel realistic pseudo-palmprint generation (RPG) model with a conditional modulation generator to improve intra-class diversity and an identity-aware loss to ensure identity consistency, plus an improved B\'ezier palm-crease generation strategy for identity independence.
  • results: Synthetic pretraining significantly boosts recognition performance; the model improves on the state-of-the-art B\'ezierPalm by more than 5% and 14% in terms of TAR@FAR=1e-6 under the 1:1 and 1:3 open-set protocols, and with only 10% of the real training data it still outperforms ArcFace trained on 100% real data.
    Abstract Palmprint recently shows great potential in recognition applications as it is a privacy-friendly and stable biometric. However, the lack of large-scale public palmprint datasets limits further research and development of palmprint recognition. In this paper, we propose a novel realistic pseudo-palmprint generation (RPG) model to synthesize palmprints with massive identities. We first introduce a conditional modulation generator to improve the intra-class diversity. Then an identity-aware loss is proposed to ensure identity consistency against unpaired training. We further improve the B\'ezier palm creases generation strategy to guarantee identity independence. Extensive experimental results demonstrate that synthetic pretraining significantly boosts the recognition model performance. For example, our model improves the state-of-the-art B\'ezierPalm by more than $5\%$ and $14\%$ in terms of TAR@FAR=1e-6 under the $1:1$ and $1:3$ Open-set protocol. When accessing only $10\%$ of the real training data, our method still outperforms ArcFace with $100\%$ real training data, indicating that we are closer to real-data-free palmprint recognition.

Car-Studio: Learning Car Radiance Fields from Single-View and Endless In-the-wild Images

  • paper_url: http://arxiv.org/abs/2307.14009
  • repo_url: https://github.com/lty2226262/Car_studio
  • paper_authors: Tianyu Liu, Hao Zhao, Yang Yu, Guyue Zhou, Ming Liu
  • for: Researchers aim to improve editable autonomous-driving simulators through learned car radiance fields.
  • methods: The authors propose a pipeline for learning from unconstrained in-the-wild images and building a dataset from the processed images, and design a radiance field for vehicles, a crucial part of the urban scene foreground, to meet the simulator's requirements.
  • results: Experiments show competitive performance against baselines, and the method gradually provides a controllable appearance editing function.
    Abstract Compositional neural scene graph studies have shown that radiance fields can be an efficient tool in an editable autonomous driving simulator. However, previous studies learned within a sequence of autonomous driving datasets, resulting in unsatisfactory blurring when rotating the car in the simulator. In this letter, we propose a pipeline for learning unconstrained images and building a dataset from processed images. To meet the requirements of the simulator, which demands that the vehicle maintain clarity when the perspective changes and that the contour remains sharp from the background to avoid artifacts when editing, we design a radiation field of the vehicle, a crucial part of the urban scene foreground. Through experiments, we demonstrate that our model achieves competitive performance compared to baselines. Using the datasets built from in-the-wild images, our method gradually presents a controllable appearance editing function. We will release the dataset and code on https://lty2226262.github.io/car-studio/ to facilitate further research in the field.

Adaptive Frequency Filters As Efficient Global Token Mixers

  • paper_url: http://arxiv.org/abs/2307.14008
  • repo_url: https://github.com/microsoft/TokenMixers
  • paper_authors: Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Zheng-Jun Zha, Yan Lu, Baining Guo
  • for: Targets the efficiency/accuracy trade-off in broad vision tasks, aiming to cut the computational cost of global token mixing for deployment on mobile devices.
  • methods: Applies the classical convolution theorem to deep learning and shows that adaptive frequency filters can serve as efficient global token mixers.
  • results: Experiments show that the proposed AFF token mixer reduces computational cost while preserving or improving accuracy, and the resulting AFFNet achieves superior accuracy-efficiency trade-offs across broad vision tasks, including visual recognition and dense prediction.
    Abstract Recent vision transformers, large-kernel CNNs and MLPs have attained remarkable successes in broad vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployments, especially on mobile devices, still suffer from noteworthy challenges due to the heavy computational costs of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply conventional convolution theorem to deep learning for addressing this and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which mathematically equals to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieve superior accuracy and efficiency trade-offs compared to other lightweight network designs on broad visual tasks, including visual recognition and dense prediction tasks.
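The convolution-theorem view is easy to make concrete: an elementwise product in the frequency domain equals a circular convolution whose kernel spans the whole feature map. The sketch below uses a static learned filter rather than the semantic-adaptive one described in the paper, so it illustrates the mechanism only:

```python
import torch
import torch.nn as nn

class FreqTokenMixer(nn.Module):
    def __init__(self, h, w, dim):
        super().__init__()
        # Complex filter over the rFFT spectrum, stored as (..., 2) reals.
        self.filt = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):                               # x: (B, H, W, C)
        spec = torch.fft.rfft2(x, dim=(1, 2))           # to frequency domain
        spec = spec * torch.view_as_complex(self.filt)  # global token mixing
        return torch.fft.irfft2(spec, s=x.shape[1:3], dim=(1, 2))

x = torch.randn(2, 16, 16, 64)
print(FreqTokenMixer(16, 16, 64)(x).shape)  # torch.Size([2, 16, 16, 64])
```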

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2307.14006
  • repo_url: None
  • paper_authors: Xinshun Wang, Qiongjie Cui, Chen Chen, Shen Zhao, Mengyuan Liu
  • for: This paper proposes a multi-stage framework for better human motion prediction.
  • methods: Motion prediction is broken into transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction, implemented over a unified graph model that enables direct and effective feature propagation.
  • results: The method achieves state-of-the-art performance on the Human 3.6M, CMU Mocap, and 3DPW datasets, outperforming previous approaches.
    Abstract Existing Graph Convolutional Networks to achieve human motion prediction largely adopt a one-step scheme, which output the prediction straight from history input, failing to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets representative of each transition. Each snippet can be reconstructed from its starting and ending poses referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we propose to first predict only the transitional poses. Then we use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally we refine them to produce the final prediction output. To implement the network, we propose a novel unified graph modeling, which allows for direct and effective feature propagation compared to existing approaches which rely on separate space-time modeling. Extensive experiments on Human 3.6M, CMU Mocap and 3DPW datasets verify the effectiveness of our method which achieves state-of-the-art performance.

Causal reasoning in typical computer vision tasks

  • paper_url: http://arxiv.org/abs/2307.13992
  • repo_url: None
  • paper_authors: Kexuan Zhang, Qiyu Sun, Chaoqiang Zhao, Yang Tang
  • for: This paper aims to comprehensively review existing causal methods in typical vision and vision-language tasks, and provide future roadmaps for the development and application of causal theory in computer vision.
  • methods: The paper uses a causal paradigm to model the intrinsic causal structure of vision and vision-language tasks, and reviews existing causal methods in semantic segmentation, object detection, and image captioning.
  • results: The paper discusses the advantages of using causality in deep learning-based computer vision tasks and proposes future roadmaps for the development and application of causal theory in other complex scenes and systems.
    Abstract Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision has contributed to tremendous growth in areas like autonomous driving and robotics. Despite being the basis of deep learning, such correlation is not stable and is susceptible to uncontrolled factors. In the absence of the guidance of prior knowledge, statistical correlations can easily turn into spurious correlations and cause confounders. As a result, researchers are now trying to enhance deep learning methods with causal theory. Causal theory models the intrinsic causal structure unaffected by data bias and is effective in avoiding spurious correlations. This paper aims to comprehensively review the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms will be summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenes and systems.

METAVerse: Meta-Learning Traversability Cost Map for Off-Road Navigation

  • paper_url: http://arxiv.org/abs/2307.13991
  • repo_url: None
  • paper_authors: Junwon Seo, Taekyung Kim, Seongyong Ahn, Kiho Kwak
  • for: The goal is accurate terrain-traversability estimation for autonomous navigation in off-road environments.
  • methods: A meta-learning framework trains a global traversability model from driving data collected in multiple environments, generating dense continuous-valued cost maps from sparse LiDAR point clouds in a self-supervised manner, with online adaptation during deployment.
  • results: Driving data collected on various terrains shows that the global model minimizes estimation uncertainty; integrated with a model predictive controller, the reduced uncertainty yields safe and stable navigation on unstructured, unknown terrain.
    Abstract Autonomous navigation in off-road conditions requires an accurate estimation of terrain traversability. However, traversability estimation in unstructured environments is subject to high uncertainty due to the variability of numerous factors that influence vehicle-terrain interaction. Consequently, it is challenging to obtain a generalizable model that can accurately predict traversability in a variety of environments. This paper presents METAVerse, a meta-learning framework for learning a global model that accurately and reliably predicts terrain traversability across diverse environments. We train the traversability prediction network to generate a dense and continuous-valued cost map from a sparse LiDAR point cloud, leveraging vehicle-terrain interaction feedback in a self-supervised manner. Meta-learning is utilized to train a global model with driving data collected from multiple environments, effectively minimizing estimation uncertainty. During deployment, online adaptation is performed to rapidly adapt the network to the local environment by exploiting recent interaction experiences. To conduct a comprehensive evaluation, we collect driving data from various terrains and demonstrate that our method can obtain a global model that minimizes uncertainty. Moreover, by integrating our model with a model predictive controller, we demonstrate that the reduced uncertainty results in safe and stable navigation in unstructured and unknown terrains.

Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in Musculoskeletal Segmentation of Lower Extremities

  • paper_url: http://arxiv.org/abs/2307.13986
  • repo_url: None
  • paper_authors: Ganping Li, Yoshito Otake, Mazen Soufi, Masashi Taniguchi, Masahide Yagi, Noriaki Ichihashi, Keisuke Uemura, Masaki Takao, Nobuhiko Sugano, Yoshinobu Sato
  • for: To reduce the time and effort of manual annotation in medical image segmentation, Bayesian active learning (BAL) is used to select the most informative samples for labeling.
  • methods: A hybrid representation-enhanced sampling strategy selects uncertain samples with high density and diversity for manual revision, maximizing similarity to unlabeled instances while minimizing similarity to existing training data.
  • results: On two lower extremity (LE) datasets of MRI and CT images, the proposed sampling shows superiority or non-inferiority to other methods under two acquisition rules, and an ablation study in volume-wise acquisition shows that combining density and diversity criteria outperforms either alone in musculoskeletal segmentation.
    Abstract Purpose: Obtaining manual annotations to train deep learning (DL) models for auto-segmentation is often time-consuming. Uncertainty-based Bayesian active learning (BAL) is a widely-adopted method to reduce annotation efforts. Based on BAL, this study introduces a hybrid representation-enhanced sampling strategy that integrates density and diversity criteria to save manual annotation costs by efficiently selecting the most informative samples. Methods: The experiments are performed on two lower extremity (LE) datasets of MRI and CT images by a BAL framework based on Bayesian U-net. Our method selects uncertain samples with high density and diversity for manual revision, optimizing for maximal similarity to unlabeled instances and minimal similarity to existing training data. We assess the accuracy and efficiency using Dice and a proposed metric called reduced annotation cost (RAC), respectively. We further evaluate the impact of various acquisition rules on BAL performance and design an ablation study for effectiveness estimation. Results: The proposed method showed superiority or non-inferiority to other methods on both datasets across two acquisition rules, and quantitative results reveal the pros and cons of the acquisition rules. Our ablation study in volume-wise acquisition shows that the combination of density and diversity criteria outperforms solely using either of them in musculoskeletal segmentation. Conclusion: Our sampling method is proven efficient in reducing annotation costs in image segmentation tasks. The combination of the proposed method and our BAL framework provides a semi-automatic way for efficient annotation of medical image datasets.
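A toy rendering of the density/diversity acquisition rule follows: among the most uncertain candidates, prefer samples dense in the unlabeled pool and far from the labeled set. The random vectors stand in for network embeddings, and the unweighted score is a deliberate simplification of the paper's criteria.

```python
import numpy as np

def select(unlabeled, labeled, uncertainty, k, cand=50):
    top = np.argsort(-uncertainty)[:cand]     # most uncertain candidates
    d_pool = np.linalg.norm(unlabeled[top, None] - unlabeled[None], axis=-1).mean(1)
    d_lab = np.linalg.norm(unlabeled[top, None] - labeled[None], axis=-1).min(1)
    score = d_lab - d_pool                    # far from labels, dense in pool
    return top[np.argsort(-score)[:k]]

rng = np.random.default_rng(0)
picked = select(rng.normal(size=(200, 8)), rng.normal(size=(20, 8)),
                rng.random(200), k=5)
print(picked)                                 # indices to send for annotation
```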

Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models

  • paper_url: http://arxiv.org/abs/2307.13985
  • repo_url: None
  • paper_authors: Ryota Iijima, Miki Tanaka, Sayaka Shiota, Hitoshi Kiya
  • for: Defending deep neural networks (DNNs) against adversarial examples (AEs).
  • methods: Proposes a random ensemble of encrypted ViT models for a more robust defense.
  • results: Experiments verify that the proposed scheme is more robust than conventional methods against both black-box and white-box attacks.
    Abstract Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means AEs generated for a source model can fool another black-box model (target model) with a non-trivial probability. In previous studies, it was confirmed that the vision transformer (ViT) is more robust against the property of adversarial transferability than convolutional neural network (CNN) models such as ConvMixer, and moreover encrypted ViT is more robust than ViT without any encryption. In this article, we propose a random ensemble of encrypted ViT models to achieve much more robust models. In experiments, the proposed scheme is verified to be more robust against not only black-box attacks but also white-box ones than convention methods.
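Block-wise encryption of the kind commonly paired with ViTs (the block size matching the patch grid) can be sketched as a keyed permutation of 16x16 patches; the ensemble idea is then to keep several secret keys, each with its own adapted model, and sample one at random per query. This is an assumption-laden illustration, not the paper's exact scheme:

```python
import numpy as np

def encrypt_blocks(img, key, block=16):
    h, w, c = img.shape
    gh, gw = h // block, w // block
    p = img.reshape(gh, block, gw, block, c).transpose(0, 2, 1, 3, 4)
    p = p.reshape(gh * gw, block, block, c)
    p = p[np.random.default_rng(key).permutation(gh * gw)]  # secret shuffle
    p = p.reshape(gh, gw, block, block, c).transpose(0, 2, 1, 3, 4)
    return p.reshape(h, w, c)

keys = [7, 42, 1337]                        # one model per key in the ensemble
img = np.random.rand(224, 224, 3)
enc = encrypt_blocks(img, key=int(np.random.default_rng().choice(keys)))
print(enc.shape)                            # (224, 224, 3)
```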

Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models

  • paper_url: http://arxiv.org/abs/2307.13981
  • repo_url: None
  • paper_authors: Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, Kede Ma
  • for: To assess progress in blind video quality assessment (BVQA) and to analyze existing video quality assessment datasets.
  • methods: A family of minimalistic BVQA models, built from a video preprocessor, a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, is compared across eight VQA datasets with realistic distortions.
  • results: Most datasets suffer from the easy-dataset problem to varying degrees, and some even admit blind image quality assessment (BIQA) solutions; different BVQA design choices matter differently across datasets. These results cast doubt on current progress in BVQA while suggesting good practices for constructing next-generation datasets and models.
    Abstract Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by contrasting our model generalizability on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models.

Tracking Anything in High Quality

  • paper_url: http://arxiv.org/abs/2307.13974
  • repo_url: https://github.com/jiawen-zhu/hqtrack
  • paper_authors: Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li
  • for: This paper presents HQTrack, a framework for high-quality tracking of arbitrary objects in videos.
  • methods: The framework consists of a video multi-object segmenter (VMOS) that propagates object masks to the current frame and a mask refiner (MR), a pretrained model that refines the tracking results.
  • results: Without tricks such as test-time data augmentation or model ensembling, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge.
    Abstract Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms allows the unification of single/multiobject and box/mask-based tracking. Among them, the Segment Anything Model (SAM) attracts much attention. In this report, we propose HQTrack, a framework for High Quality Tracking anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the object to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough since VMOS is trained on several closeset video object segmentation (VOS) datasets, which has limited ability to generalize to complex and corner scenes. To further improve the quality of tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, without employing any tricks such as test-time data augmentations and model ensemble, HQTrack ranks the 2nd place in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code and models are available at https://github.com/jiawen-zhu/HQTrack.

Visual Prompt Flexible-Modal Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2307.13958
  • repo_url: None
  • paper_authors: Zitong Yu, Rizhao Cai, Yawen Cui, Ajian Liu, Changsheng Chen
  • for: Improving the robustness of face anti-spoofing (FAS) systems with vision-transformer-based multimodal learning.
  • methods: Proposes Visual Prompt flexible-modal FAS (VP-FAS), which learns modal-relevant prompts that adapt a frozen pre-trained foundation model to handle modalities missing at test time.
  • results: Extensive experiments on two multimodal FAS benchmarks demonstrate effectiveness under various missing-modality cases while alleviating the need for heavy model retraining.
    Abstract Recently, vision transformer based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors. Recently, flexible-modal FAS~\cite{yu2023flexible} has attracted more attention, which aims to develop a unified multimodal FAS model using complete multimodal face data but is insensitive to test-time missing modalities. In this paper, we tackle one main challenge in flexible-modal FAS, i.e., when missing modality occurs either during training or testing in real-world situations. Inspired by the recent success of the prompt learning in language models, we propose \textbf{V}isual \textbf{P}rompt flexible-modal \textbf{FAS} (VP-FAS), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to downstream flexible-modal FAS task. Specifically, both vanilla visual prompts and residual contextual prompts are plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 4\% learnable parameters compared to training the entire model. Furthermore, missing-modality regularization is proposed to force models to learn consistent multimodal feature embeddings when missing partial modalities. Extensive experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework that improves the performance under various missing-modality cases while alleviating the requirement of heavy model re-training.
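A minimal sketch of the prompt-tuning mechanism, assuming a generic frozen transformer over patch tokens: prepend a few learnable tokens so that only the prompts (a tiny fraction of the parameters, echoing the under-4% figure) are trained. The paper's residual contextual prompts and missing-modality regularization are not reproduced here.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder, dim=256, n_prompts=8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                      # frozen backbone
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, tokens):                           # tokens: (B, N, D)
        p = self.prompts.expand(tokens.shape[0], -1, -1)
        return self.encoder(torch.cat([p, tokens], dim=1))

layer = nn.TransformerEncoderLayer(256, 4, batch_first=True)
model = PromptedEncoder(nn.TransformerEncoder(layer, 2))
print(model(torch.randn(2, 49, 256)).shape)  # (2, 57, 256)
```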

Heterogeneous Embodied Multi-Agent Collaboration

  • paper_url: http://arxiv.org/abs/2307.13957
  • repo_url: None
  • paper_authors: Xinzhu Liu, Di Guo, Huaping Liu
  • for: This paper studies collaboration among heterogeneous embodied agents completing multi-agent tasks in complex indoor visual environments.
  • methods: It proposes a hierarchical decision model based on misplaced-object detection and reasonable receptacle prediction, together with a handshake-based group communication mechanism.
  • results: Extensive experiments demonstrate the effectiveness of the proposed model. The project website and experiment videos are at https://hetercol.github.io/.
    Abstract Multi-agent embodied tasks have recently been studied in complex indoor visual environments. Collaboration among multiple agents can improve work efficiency and has significant practical value. However, most of the existing research focuses on homogeneous multi-agent tasks. Compared with homogeneous agents, heterogeneous agents can leverage their different capabilities to allocate corresponding sub-tasks and cooperate to complete complex tasks. Heterogeneous multi-agent tasks are common in real-world scenarios, and the collaboration strategy among heterogeneous agents is a challenging and important problem to be solved. To study collaboration among heterogeneous agents, we propose the heterogeneous multi-agent tidying-up task, in which multiple heterogeneous agents with different capabilities collaborate with each other to detect misplaced objects and place them in reasonable locations. This is a demanding task since it requires agents to make the best use of their different capabilities to conduct reasonable task planning and complete the whole task. To solve this task, we build a heterogeneous multi-agent tidying-up benchmark dataset in a large number of houses with multiple rooms based on ProcTHOR-10K. We propose the hierarchical decision model based on misplaced object detection, reasonable receptacle prediction, as well as the handshake-based group communication mechanism. Extensive experiments are conducted to demonstrate the effectiveness of the proposed model. The project's website and videos of experiments can be found at https://hetercol.github.io/.

  • paper_url: http://arxiv.org/abs/2307.13953
  • repo_url: None
  • paper_authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj
  • for: This paper explores the link between phonemes and facial features. Traditional voice-face correlation studies usually require long voice inputs, e.g., generating face images or reconstructing 3D face meshes from voice, yet in voice-based criminal investigations the available voice evidence may be short and limited; physiologically, each segment of speech (phoneme) corresponds to different airflow and facial movements, making the hidden link between phonemes and facial attributes worth uncovering.
  • methods: An analysis pipeline explores the voice-face relationship at a fine granularity: an estimator is built for each phoneme-anthropometric-measurement (AM) pair, and the correlation is evaluated through hypothesis testing.
  • results: Facial AMs are more predictable from vowels than from consonants, particularly plosives, and AMs that move more during phoneme pronunciation are more predictable; the findings support physiological results on correlation and lay the groundwork for speech-face multimodal learning.
    Abstract This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes v.s. facial anthropometric measurements (AM). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels compared to consonants, particularly with plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.
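The per-(phoneme, AM) testing recipe can be sketched as fitting an estimator on a train split and testing the correlation of held-out predictions. Everything below is synthetic and only illustrates the statistical procedure:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
feat = rng.normal(size=(100, 16))              # per-speaker features of one phoneme
am = 0.5 * feat[:, 0] + rng.normal(size=100)   # one facial measurement (AM)

tr, te = slice(0, 70), slice(70, 100)
w, *_ = np.linalg.lstsq(feat[tr], am[tr], rcond=None)  # linear estimator
r, p = pearsonr(feat[te] @ w, am[te])
print(f"r={r:.2f}  p={p:.2e}")                 # small p -> AM predictable
```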

Rethinking Voice-Face Correlation: A Geometry View

  • paper_url: http://arxiv.org/abs/2307.13948
  • repo_url: https://github.com/lxa9867/VAF
  • paper_authors: Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj
  • for: This work investigates the voice-face relationship from a geometry perspective, reconstructing 3D facial shape from voice without any semantic information.
  • methods: It proposes a voice-anthropometric measurement (AM)-face paradigm that identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction, eliminating the influence of unpredictable AMs and making the face geometry tractable.
  • results: Significant correlations are found between voice and specific parts of the face geometry, such as the nasal cavity and cranium, offering a new perspective for anthropometry science.
    Abstract Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.

Centroid-aware feature recalibration for cancer grading in pathology images

  • paper_url: http://arxiv.org/abs/2307.13947
  • repo_url: https://github.com/colin19950703/cafenet
  • paper_authors: Jaeung Lee, Keunho Byeon, Jin Tae Kwak
  • for: 癌症分级是病理学影像分析中的一项关键任务;人工神经网络的最新进展表明,这类方法在提高癌症诊断的准确性和质量方面具有巨大潜力。
  • methods: 提出一种具有中心点感知能力的特征重校准网络:将输入病理图像映射到嵌入空间,并利用不同癌症等级的中心点嵌入向量,通过注意力机制对嵌入进行重校准(示意代码见本条目末尾)。
  • results: 在不同环境下采集的结直肠癌数据集上的实验证实,所提网络能够准确地对病理图像进行癌症分级,并对数据集间的环境变化保持鲁棒。
    Abstract Cancer grading is an essential task in pathology. The recent developments of artificial neural networks in computational pathology have shown that these methods hold great potential for improving the accuracy and quality of cancer diagnosis. However, the issues with the robustness and reliability of such methods have not been fully resolved yet. Herein, we propose a centroid-aware feature recalibration network that can conduct cancer grading in an accurate and robust manner. The proposed network maps an input pathology image into an embedding space and adjusts it by using centroids embedding vectors of different cancer grades via attention mechanism. Equipped with the recalibrated embedding vector, the proposed network classifiers the input pathology image into a pertinent class label, i.e., cancer grade. We evaluate the proposed network using colorectal cancer datasets that were collected under different environments. The experimental results confirm that the proposed network is able to conduct cancer grading in pathology images with high accuracy regardless of the environmental changes in the datasets.
    摘要 癌症分级是病理学中的一项重要任务。人工神经网络在计算病理学中的最新进展表明,这类方法在提高癌症诊断的准确性和质量方面具有巨大潜力,但其鲁棒性与可靠性问题尚未完全解决。为此,我们提出一种具有中心点感知能力的特征重校准网络,能够准确且鲁棒地进行癌症分级:该网络将输入病理图像映射到嵌入空间,并利用不同癌症等级的中心点嵌入向量、通过注意力机制对其进行调整;基于重校准后的嵌入向量,网络将输入病理图像分类到相应的癌症等级。我们使用在不同环境下采集的结直肠癌数据集对所提网络进行评估,实验结果证实,无论数据集的环境如何变化,该网络都能以高准确率对病理图像进行癌症分级。
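
As a rough illustration of the centroid-based recalibration idea above, here is a minimal PyTorch sketch. The module name, the attention form, and the residual combination are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentroidRecalibration(nn.Module):
    def __init__(self, embed_dim: int, num_grades: int):
        super().__init__()
        # One learnable centroid embedding per cancer grade.
        self.centroids = nn.Parameter(torch.randn(num_grades, embed_dim))
        self.classifier = nn.Linear(embed_dim, num_grades)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, D) image embeddings from any backbone.
        # Attention weights: similarity of each embedding to each centroid.
        attn = F.softmax(z @ self.centroids.t() / z.shape[-1] ** 0.5, dim=-1)  # (B, C)
        context = attn @ self.centroids      # (B, D) centroid mixture
        z_recal = z + context                # recalibrated embedding
        return self.classifier(z_recal)      # grade logits

logits = CentroidRecalibration(512, 4)(torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 4])
```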

Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

  • paper_url: http://arxiv.org/abs/2307.13938
  • repo_url: https://github.com/kunzhan/DSSN
  • paper_authors: Zhibo Tain, Xiaolin Zhang, Peng Zhang, Kun Zhan
  • for: 提升semantic segmentation任务中无标注数据的利用效果,降低标注训练样本的成本。
  • methods: 提出dual-level Siamese structure network (DSSN)用于像素级对比学习:通过在低级图像空间和高级特征空间中以像素级对比损失对齐强增广视图下的正样本对,最大化利用可用的无标注数据。此外,引入一种新的类感知伪标签选择策略以实现弱到强监督,解决了多数现有方法不做选择或对所有类别套用统一预设阈值的局限:该策略为每个类别选取弱视图中置信度最高的预测作为伪标签,用于监督强增广视图,从而兼顾类别不平衡并提升长尾类别的性能(示意代码见本条目末尾)。
  • results: 在PASCAL VOC 2012和Cityscapes两个数据集上的实验取得了当前最优效果,显著优于其他SSS算法。
    Abstract Semi-supervised semantic segmentation (SSS) is an important task that utilizes both labeled and unlabeled data to reduce expenses on labeling training examples. However, the effectiveness of SSS algorithms is limited by the difficulty of fully exploiting the potential of unlabeled data. To address this, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. By aligning positive pairs with a pixel-wise contrastive loss using strong augmented views in both low-level image space and high-level feature space, the proposed DSSN is designed to maximize the utilization of available unlabeled data. Additionally, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that do not perform selection or apply a predefined threshold for all classes. Specifically, our strategy selects the top high-confidence prediction of the weak view for each class to generate pseudo labels that supervise the strong augmented views. This strategy is capable of taking into account the class imbalance and improving the performance of long-tailed classes. Our proposed method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, outperforming other SSS algorithms by a significant margin.
    摘要 半监督语义分割(SSS)是一项重要任务,它同时利用标注与无标注数据以降低标注训练样本的成本。然而,SSS算法的效果受限于难以充分挖掘无标注数据的潜力。为此,我们提出用于像素级对比学习的双层孪生结构网络(DSSN):通过在低级图像空间和高级特征空间中以像素级对比损失对齐强增广视图下的正样本对,最大化利用可用的无标注数据。此外,我们引入一种新的类感知伪标签选择策略以实现弱到强监督,解决了多数现有方法不做选择或对所有类别套用统一预设阈值的局限。具体而言,该策略为每个类别选取弱视图中置信度最高的预测来生成伪标签,用于监督强增广视图,从而兼顾类别不平衡并提升长尾类别的性能。我们的方法在PASCAL VOC 2012和Cityscapes两个数据集上取得了当前最优结果,显著优于其他SSS算法。
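
A minimal sketch of the class-aware pseudo-label selection idea. The per-class quantile rule and the ignore-index convention are assumptions; the paper's exact selection rule may differ:

```python
import torch

def class_aware_pseudo_labels(probs_weak: torch.Tensor, top_frac: float = 0.5):
    """probs_weak: (B, C, H, W) softmax predictions on the weak view.
    Keeps, per class, the most confident pixels as pseudo labels (255 = ignore)."""
    conf, labels = probs_weak.max(dim=1)          # (B, H, W)
    pseudo = torch.full_like(labels, 255)
    for c in range(probs_weak.shape[1]):
        mask = labels == c
        if mask.any():
            # Per-class threshold: keep the top fraction of this class's pixels,
            # so rare (long-tailed) classes are not wiped out by one global cutoff.
            thresh = torch.quantile(conf[mask].float(), 1.0 - top_frac)
            pseudo[mask & (conf >= thresh)] = c
    return pseudo  # supervises predictions on the strong augmented view

probs = torch.softmax(torch.randn(2, 21, 64, 64), dim=1)
print(class_aware_pseudo_labels(probs).unique())
```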

AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception

  • paper_url: http://arxiv.org/abs/2307.13933
  • repo_url: https://github.com/ydk122024/aide
  • paper_authors: Dingkang Yang, Shuai Huang, Zhi Xu, Zhenpeng Li, Shunli Wang, Mingcheng Li, Yuzheng Wang, Yang Liu, Kun Yang, Zhaoyu Chen, Yan Wang, Jing Liu, Peixuan Zhang, Peng Zhai, Lihua Zhang
  • for: 这篇论文旨在提供一个在自然驾驶场景下兼顾车内外上下文信息的驾驶员监测数据集,以提升道路交通安全。
  • methods: 本论文通过驾驶员与场景的多视角设置、面部/身体/姿态/手势的多模态标注以及四种面向驾驶理解的实用任务,实现全面的驾驶员监测;同时提供了三类基线框架的实验基准,并引入两种融合策略以学习有效的多流/多模态表示。
  • results: 研究人员对AIDE数据集及基准进行了广泛的实验与分析,系统地验证了数据集关键组成部分与基线框架的重要性和合理性。
    Abstract Driver distraction has become a significant cause of severe traffic accidents over the past decade. Despite the growing development of vision-driven driver monitoring systems, the lack of comprehensive perception datasets restricts road safety and traffic security. In this paper, we present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle in naturalistic scenarios. AIDE facilitates holistic driver monitoring through three distinctive characteristics, including multi-view settings of driver and scene, multi-modal annotations of face, body, posture, and gesture, and four pragmatic task designs for driving understanding. To thoroughly explore AIDE, we provide experimental benchmarks on three kinds of baseline frameworks via extensive methods. Moreover, two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations. We also systematically investigate the importance and rationality of the key components in AIDE and benchmarks. The project link is https://github.com/ydk122024/AIDE.
    摘要 驾驶员分心在过去十年间已成为严重交通事故的重要原因之一。尽管视觉驱动的驾驶员监测系统不断发展,但全面感知数据集的缺乏仍制约着道路安全与交通安全。本文提出辅助驾驶感知数据集AIDE,它在自然驾驶场景下同时考虑车内与车外的上下文信息。AIDE通过三个独特特性实现全面的驾驶员监测:驾驶员与场景的多视角设置、面部/身体/姿态/手势的多模态标注,以及四种面向驾驶理解的实用任务设计。为深入探索AIDE,我们通过大量方法在三类基线框架上给出了实验基准;此外,我们引入两种融合策略,为学习有效的多流/多模态表示提供新的见解。我们还系统地考察了AIDE与基准中关键组成部分的重要性与合理性。项目链接:https://github.com/ydk122024/AIDE。

Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception

  • paper_url: http://arxiv.org/abs/2307.13929
  • repo_url: https://github.com/ydk122024/SCOPE
  • paper_authors: Kun Yang, Dingkang Yang, Jingyu Zhang, Mingcheng Li, Yang Liu, Jing Liu, Hanqi Wang, Peng Sun, Liang Song
  • for: 提高自动驾驶车辆的感知性能
  • methods: 提出了一种新的协同感知框架(SCOPE),通过综合考虑多个agent的空间和时间特征来提高目标agent的感知
  • results: 对实际和模拟的协同3D物体检测任务进行了广泛的实验,证明了我们的方法的优越性和必要性
    Abstract Multi-agent collaborative perception as a potential application for vehicle-to-everything communication could significantly improve the perception performance of autonomous vehicles over single-agent perception. However, several challenges remain in achieving pragmatic information sharing in this emerging research. In this paper, we propose SCOPE, a novel collaborative perception framework that aggregates the spatio-temporal awareness characteristics across on-road agents in an end-to-end manner. Specifically, SCOPE has three distinct strengths: i) it considers effective semantic cues of the temporal context to enhance current representations of the target agent; ii) it aggregates perceptually critical spatial information from heterogeneous agents and overcomes localization errors via multi-scale feature interactions; iii) it integrates multi-source representations of the target agent based on their complementary contributions by an adaptive fusion paradigm. To thoroughly evaluate SCOPE, we consider both real-world and simulated scenarios of collaborative 3D object detection tasks on three datasets. Extensive experiments demonstrate the superiority of our approach and the necessity of the proposed components.
    摘要 多智能体协同感知作为车联网(V2X)通信的一项潜在应用,可显著提升自动驾驶车辆相对单体感知的感知性能。然而,要在这一新兴研究方向中实现务实的信息共享,仍存在诸多挑战。本文提出SCOPE,一种新颖的协同感知框架,以端到端的方式聚合路上各智能体的时空感知特征。具体而言,SCOPE具有三大优势:i) 考虑时间上下文中的有效语义线索,以增强目标智能体的当前表示;ii) 聚合来自异构智能体的感知关键空间信息,并通过多尺度特征交互克服定位误差;iii) 通过自适应融合范式,依据互补贡献整合目标智能体的多源表示。为全面评估SCOPE,我们在三个数据集上同时考虑了协同3D目标检测任务的真实与仿真场景。大量实验证明了我们方法的优越性以及所提各组件的必要性。

DFR-Net: Density Feature Refinement Network for Image Dehazing Utilizing Haze Density Difference

  • paper_url: http://arxiv.org/abs/2307.13927
  • repo_url: None
  • paper_authors: Zhongze Wang, Haitao Zhao, Lujian Yao, Jingchao Peng, Kaijie Zhao
  • for: 该论文旨在提升图像去雾方法的性能,核心思路是利用雾度差异来细化雾度特征。
  • methods: 方法采用全局分支(GB)与局部分支(LB)的双分支结构:GB使用孪生网络对有雾输入与候选图像进行特征提取,并提出全局密度特征细化模块(GDFR)来更新全局特征;LB利用局部雾度差异更新局部特征,并引入中间去雾残差前馈(IDRF)模块。
  • results: 该方法在多个数据集上超越了现有方法的最佳性能,并能更好地处理具有不同雾度差异的图像。
    Abstract In image dehazing task, haze density is a key feature and affects the performance of dehazing methods. However, some of the existing methods lack a comparative image to measure densities, and others create intermediate results but lack the exploitation of their density differences, which can facilitate perception of density. To address these deficiencies, we propose a density-aware dehazing method named Density Feature Refinement Network (DFR-Net) that extracts haze density features from density differences and leverages density differences to refine density features. In DFR-Net, we first generate a proposal image that has lower overall density than the hazy input, bringing in global density differences. Additionally, the dehazing residual of the proposal image reflects the level of dehazing performance and provides local density differences that indicate localized hard dehazing or high density areas. Subsequently, we introduce a Global Branch (GB) and a Local Branch (LB) to achieve density-awareness. In GB, we use Siamese networks for feature extraction of hazy inputs and proposal images, and we propose a Global Density Feature Refinement (GDFR) module that can refine features by pushing features with different global densities further away. In LB, we explore local density features from the dehazing residuals between hazy inputs and proposal images and introduce an Intermediate Dehazing Residual Feedforward (IDRF) module to update local features and pull them closer to clear image features. Sufficient experiments demonstrate that the proposed method achieves results beyond the state-of-the-art methods on various datasets.
    摘要 在图像去雾任务中,雾霾密度是关键特征,直接影响去雾方法的性能。然而,现有方法中,一些缺乏用于衡量密度的对比图像,另一些虽然产生了中间结果,却未利用其密度差异来促进对密度的感知。针对这些不足,我们提出一种密度感知的去雾方法——密度特征细化网络(DFR-Net),它从密度差异中提取雾霾密度特征,并利用密度差异来细化密度特征。在DFR-Net中,我们首先生成一幅整体密度低于有雾输入的候选图像,从而引入全局密度差异;此外,候选图像的去雾残差反映了去雾性能的高低,提供了指示局部难去雾或高密度区域的局部密度差异。随后,我们引入全局分支(GB)和局部分支(LB)来实现密度感知:在GB中,我们使用孪生网络对有雾输入和候选图像进行特征提取,并提出全局密度特征细化(GDFR)模块,通过将全局密度不同的特征推得更远来细化特征;在LB中,我们从有雾输入与候选图像之间的去雾残差中挖掘局部密度特征,并引入中间去雾残差前馈(IDRF)模块来更新局部特征,使其更接近清晰图像特征。充分的实验表明,所提方法在多个数据集上取得了超越最先进方法的结果。

EasyNet: An Easy Network for 3D Industrial Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.13925
  • repo_url: None
  • paper_authors: Ruitao Chen, Guoyang Xie, Jiaqi Liu, Jinbao Wang, Ziqi Luo, Jinfan Wang, Feng Zheng
  • for: 该研究旨在改进工业制造中的3D异常检测,以克服现有方法的不足。
  • methods: 提出一个简单且易于部署的网络(称为EasyNet),不依赖预训练模型和内存库:设计多尺度多模态特征编码器-解码器,以精准重建异常区域的分割图并促进RGB图像与深度图像的交互;采用多模态异常分割网络获得精确的异常图;最后提出一个基于注意力的信息熵融合模块用于推理阶段的特征融合,使其适合实时部署(示意代码见本条目末尾)。
  • results: 实验结果显示,EasyNet在不使用预训练模型和内存库的情况下达到92.6%的异常检测AUROC,且比现有方法更快,在Tesla V100 GPU上达到94.55 FPS的高帧率。
    Abstract 3D anomaly detection is an emerging and vital computer vision task in industrial manufacturing (IM). Recently many advanced algorithms have been published, but most of them cannot meet the needs of IM. There are several disadvantages: i) difficult to deploy on production lines since their algorithms heavily rely on large pre-trained models; ii) hugely increase storage overhead due to overuse of memory banks; iii) the inference speed cannot be achieved in real-time. To overcome these issues, we propose an easy and deployment-friendly network (called EasyNet) without using pre-trained models and memory banks: firstly, we design a multi-scale multi-modality feature encoder-decoder to accurately reconstruct the segmentation maps of anomalous regions and encourage the interaction between RGB images and depth images; secondly, we adopt a multi-modality anomaly segmentation network to achieve a precise anomaly map; thirdly, we propose an attention-based information entropy fusion module for feature fusion during inference, making it suitable for real-time deployment. Extensive experiments show that EasyNet achieves an anomaly detection AUROC of 92.6% without using pre-trained models and memory banks. In addition, EasyNet is faster than existing methods, with a high frame rate of 94.55 FPS on a Tesla V100 GPU.
    摘要 三维异常检测是工业制造(IM)中一项新兴且关键的计算机视觉任务。近年来已有许多先进算法发表,但大多数无法满足工业制造的需求,主要存在以下不足:i) 算法严重依赖大型预训练模型,难以部署到生产线上;ii) 过度使用内存库,大幅增加存储开销;iii) 推理速度无法达到实时。为克服这些问题,我们提出了一种无需预训练模型和内存库、易于部署的网络(称为EasyNet):首先,我们设计了一个多尺度多模态特征编码器-解码器,以准确重建异常区域的分割图,并促进RGB图像与深度图像之间的交互;其次,我们采用多模态异常分割网络以获得精确的异常图;第三,我们提出了一种基于注意力的信息熵融合模块用于推理阶段的特征融合,使其适合实时部署。大量实验表明,EasyNet在不使用预训练模型和内存库的情况下取得了92.6%的异常检测AUROC。此外,EasyNet比现有方法更快,在Tesla V100 GPU上达到94.55 FPS的高帧率。
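
A toy sketch of attention-style fusion driven by information entropy. The exact weighting used by EasyNet is not specified here, so this form (inverse-entropy weights over per-modality anomaly maps) is an assumption:

```python
import torch

def entropy_fusion(maps):
    """maps: list of (B, H, W) per-modality anomaly maps in [0, 1].
    Fuses them with weights derived from information entropy: low-entropy
    (more decisive) maps receive larger attention weights."""
    eps = 1e-6
    weights = []
    for m in maps:
        p = m.clamp(eps, 1 - eps)
        ent = -(p * p.log() + (1 - p) * (1 - p).log())   # per-pixel binary entropy
        weights.append((-ent.mean(dim=(1, 2))).exp())    # (B,) scalar per map
    w = torch.stack(weights, dim=0)                      # (K, B)
    w = w / w.sum(dim=0, keepdim=True)                   # normalize across modalities
    fused = sum(wi.view(-1, 1, 1) * m for wi, m in zip(w, maps))
    return fused

fused = entropy_fusion([torch.rand(2, 32, 32), torch.rand(2, 32, 32)])
print(fused.shape)  # torch.Size([2, 32, 32])
```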

trajdata: A Unified Interface to Multiple Human Trajectory Datasets

  • paper_url: http://arxiv.org/abs/2307.13924
  • repo_url: https://github.com/nvlabs/trajdata
  • paper_authors: Boris Ivanovic, Guanyu Song, Igor Gilitschenski, Marco Pavone
  • for: 本研究旨在提供一个统一的人类轨迹数据接口,以便开展行人轨迹预测与自动驾驶车辆运动预测研究。
  • methods: 本研究整合了多个大规模真实世界人类轨迹数据集,提供了简单、统一且高效的轨迹与地图数据表示及API(用法示例见本条目末尾)。
  • results: 本研究对现有轨迹数据集进行了广泛的实证评估,帮助研究者深入理解支撑当前行人与自动驾驶运动预测研究的数据,并据此对未来数据集提出建议。
    Abstract The field of trajectory forecasting has grown significantly in recent years, partially owing to the release of numerous large-scale, real-world human trajectory datasets for autonomous vehicles (AVs) and pedestrian motion tracking. While such datasets have been a boon for the community, they each use custom and unique data formats and APIs, making it cumbersome for researchers to train and evaluate methods across multiple datasets. To remedy this, we present trajdata: a unified interface to multiple human trajectory datasets. At its core, trajdata provides a simple, uniform, and efficient representation and API for trajectory and map data. As a demonstration of its capabilities, in this work we conduct a comprehensive empirical evaluation of existing trajectory datasets, providing users with a rich understanding of the data underpinning much of current pedestrian and AV motion forecasting research, and proposing suggestions for future datasets from these insights. trajdata is permissively licensed (Apache 2.0) and can be accessed online at https://github.com/NVlabs/trajdata
    摘要 轨迹预测领域在最近几年内发展迅速,这在一定程度上得益于许多大规模真实世界人类轨迹数据集(面向自动驾驶车辆与行人运动跟踪)的发布。然而,这些数据集各自使用自定义的数据格式和API,使研究人员在多个数据集上训练和评估方法变得繁琐。为了解决这个问题,我们提出了trajdata:一个面向多个人类轨迹数据集的统一接口。trajdata的核心是为轨迹和地图数据提供简单、统一且高效的表示和API。在这项工作中,我们对现有轨迹数据集进行了全面的实证评估,帮助用户深入理解支撑当前行人与自动驾驶运动预测研究的数据,并据此对未来数据集提出建议。trajdata采用宽松的Apache 2.0许可证,可在 https://github.com/NVlabs/trajdata 获取。
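
A usage sketch following the repository README. The keyword names (desired_data, desired_dt, history_sec, future_sec) are taken from the README and may vary across trajdata versions; running it also requires the referenced dataset split to be installed locally:

```python
# Assumed usage based on the trajdata README; requires the nuScenes mini
# split to be set up on disk.
from trajdata import UnifiedDataset

dataset = UnifiedDataset(
    desired_data=["nusc_mini-mini_train"],  # dataset/split tags
    centric="agent",                        # one element per agent
    desired_dt=0.1,                         # resample trajectories to 10 Hz
    history_sec=(3.2, 3.2),                 # seconds of history per agent
    future_sec=(4.8, 4.8),                  # prediction horizon
)
print(f"{len(dataset):,} agent-centric samples")
```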

Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2307.13908
  • repo_url: None
  • paper_authors: Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, Fan Wang
  • for: 提出一种基于稀疏3D点云的文本到3D生成框架,以缓解现有方法视角不一致与形状不可控的问题。
  • methods: 以Point-E生成的稀疏3D点云为几何先验(以单张参考图像为条件),并通过点云引导损失自适应地驱动NeRF的几何与稀疏点的形状对齐(损失的示意代码见本条目末尾);同时对公开的2D图像扩散模型ControlNet做分数蒸馏(以文本和所学几何的深度图为条件),以获得更具视角一致性的外观。
  • results: 定性与定量比较表明,Points-to-3D可提升视角一致性并实现良好的形状可控性,为文本到3D生成提供了一种新的控制方式。
    Abstract Text-to-3D generation has recently garnered significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation to leverage the 2D diffusion priors to supervise the generation of 3D models, e.g., NeRF. However, score distillation is prone to suffer the view inconsistency problem, and implicit NeRF modeling can also lead to an arbitrary shape, thus leading to less realistic and uncontrollable 3D generation. In this work, we propose a flexible framework of Points-to-3D to bridge the gap between sparse yet freely available 3D points and realistic shape-controllable 3D generation by distilling the knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide the text-to-3D generation. Specifically, we use the sparse point cloud generated from the 3D diffusion model, Point-E, as the geometric prior, conditioned on a single reference image. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss to adaptively drive the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we propose to optimize the NeRF for a more view-consistent appearance. To be specific, we perform score distillation to the publicly available 2D image diffusion model ControlNet, conditioned on text as well as depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D provides users with a new way to improve and control text-to-3D generation.
    摘要 受益于在数十亿图文对上训练的2D扩散模型,文本到3D生成近来受到了广泛关注。现有方法主要依靠分数蒸馏,利用2D扩散先验来监督3D模型(如NeRF)的生成。然而,分数蒸馏容易出现视角不一致问题,而隐式的NeRF建模也可能产生任意形状,导致生成的3D内容不够真实且难以控制。在这项工作中,我们提出灵活的Points-to-3D框架,通过同时蒸馏2D与3D扩散模型的知识,在稀疏且易于获得的3D点与真实、形状可控的3D生成之间架起桥梁。其核心思想是引入可控的稀疏3D点来指导文本到3D生成:我们以3D扩散模型Point-E在单张参考图像条件下生成的稀疏点云作为几何先验;为更好地利用稀疏3D点,我们提出一种高效的点云引导损失,自适应地驱动NeRF的几何与稀疏3D点的形状对齐。除了控制几何外,我们还优化NeRF以获得更具视角一致性的外观:具体而言,我们对公开的2D图像扩散模型ControlNet做分数蒸馏,以文本及所学紧凑几何的深度图为条件。定性与定量比较表明,Points-to-3D提升了视角一致性,并为文本到3D生成实现了良好的形状可控性,为用户改进和控制文本到3D生成提供了新途径。
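
A Chamfer-style sketch of a point cloud guidance loss. The paper's precise formulation may differ; this simply pulls geometry samples from the NeRF toward the sparse Point-E prior and vice versa:

```python
import torch

def point_cloud_guidance_loss(nerf_pts: torch.Tensor, prior_pts: torch.Tensor):
    """Symmetric Chamfer distance (an assumption about the exact loss form).
    nerf_pts: (N, 3) surface samples from the NeRF geometry,
    prior_pts: (M, 3) sparse points from the Point-E prior."""
    d = torch.cdist(nerf_pts, prior_pts)              # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

loss = point_cloud_guidance_loss(torch.rand(2048, 3), torch.rand(1024, 3))
print(loss.item())
```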

YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems

  • paper_url: http://arxiv.org/abs/2307.13901
  • repo_url: https://github.com/deeplite/deeplite-torch-zoo
  • paper_authors: Ivan Lazarevich, Matteo Grimaldi, Ravish Kumar, Saptarshi Mitra, Shahrukh Khan, Sudhakar Sah
  • for: 本研究提供YOLOBench:一个由550多个YOLO系列目标检测模型构成的基准,覆盖4个数据集与4种嵌入式硬件平台(x86 CPU、ARM CPU、Nvidia GPU、NPU)。
  • methods: 本研究在固定的训练环境(代码与训练超参数)下,对不同模型规模的YOLO系列单阶段检测器进行公平、受控的比较,收集其准确率与延迟数据。
  • results: 对所收集数据的帕累托最优分析表明(分析思路的示意代码见本条目末尾),若在学习过程中引入现代检测头与训练技术,包括YOLOv3和YOLOv4等较早模型在内的多个YOLO系列架构均可取得良好的准确率-延迟权衡。此外,研究还在YOLOBench上评估了神经结构搜索中常用的免训练准确率估计器,发现尽管多数最先进的零成本估计器不及MAC计数这样的简单基线,但其中一些仍可有效预测帕累托最优的检测模型。
    Abstract We present YOLOBench, a benchmark comprised of 550+ YOLO-based object detection models on 4 different datasets and 4 different embedded hardware platforms (x86 CPU, ARM CPU, Nvidia GPU, NPU). We collect accuracy and latency numbers for a variety of YOLO-based one-stage detectors at different model scales by performing a fair, controlled comparison of these detectors with a fixed training environment (code and training hyperparameters). Pareto-optimality analysis of the collected data reveals that, if modern detection heads and training techniques are incorporated into the learning process, multiple architectures of the YOLO series achieve a good accuracy-latency trade-off, including older models like YOLOv3 and YOLOv4. We also evaluate training-free accuracy estimators used in neural architecture search on YOLOBench and demonstrate that, while most state-of-the-art zero-cost accuracy estimators are outperformed by a simple baseline like MAC count, some of them can be effectively used to predict Pareto-optimal detection models. We showcase that by using a zero-cost proxy to identify a YOLO architecture competitive against a state-of-the-art YOLOv8 model on a Raspberry Pi 4 CPU. The code and data are available at https://github.com/Deeplite/deeplite-torch-zoo
    摘要 我们提出YOLOBench:一个由550多个YOLO系列目标检测模型构成的基准,覆盖4个数据集与4种嵌入式硬件平台(x86 CPU、ARM CPU、Nvidia GPU、NPU)。我们在固定的训练环境(代码与训练超参数)下,对不同模型规模的YOLO系列单阶段检测器进行公平、受控的比较,收集其准确率与延迟数据。对所收集数据的帕累托最优分析表明,若在学习过程中引入现代检测头与训练技术,包括YOLOv3和YOLOv4等较早模型在内的多个YOLO系列架构均可取得良好的准确率-延迟权衡。我们还在YOLOBench上评估了神经结构搜索中使用的免训练准确率估计器,发现尽管多数最先进的零成本估计器不及MAC计数这样的简单基线,但其中一些仍可有效用于预测帕累托最优的检测模型。我们进一步展示了利用零成本代理找到的YOLO架构,可在Raspberry Pi 4 CPU上与最先进的YOLOv8模型相竞争。代码与数据见 https://github.com/Deeplite/deeplite-torch-zoo。
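
A small self-contained sketch of the Pareto-optimality analysis over (latency, accuracy) pairs. Model names and numbers below are made up for illustration:

```python
def pareto_front(models):
    """models: list of (name, latency_ms, mAP). Returns the models for which
    no other model is simultaneously faster and more accurate."""
    front = []
    for name, lat, acc in models:
        dominated = any(l2 <= lat and a2 >= acc and (l2, a2) != (lat, acc)
                        for _, l2, a2 in models)
        if not dominated:
            front.append((name, lat, acc))
    return sorted(front, key=lambda m: m[1])  # fastest first

print(pareto_front([("yolov3", 30, 0.42), ("yolov5n", 12, 0.38), ("big", 35, 0.41)]))
# [('yolov5n', 12, 0.38), ('yolov3', 30, 0.42)] -- "big" is dominated by yolov3
```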

AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets

  • paper_url: http://arxiv.org/abs/2307.13897
  • repo_url: https://github.com/siyi-wind/avit
  • paper_authors: Siyi Du, Nourhan Bayasi, Ghassan Harmarneh, Rafeef Garbi
  • for: 这个论文主要针对皮肤损伤分割(SLS)问题,旨在提高皮肤损伤分割的精度和效率。
  • methods: 该论文提出了一种新策略:将预训练的视觉Transformer(ViT)迁移到SLS任务上,通过在Transformer层内嵌入轻量级模块(adapters)来调制特征表示而不更新预训练权重(adapter的示意代码见本条目末尾);此外,使用一个浅层卷积神经网络(CNN)作为提示生成器,从输入图像生成提示嵌入,借助细粒度信息与CNN的归纳偏置来引导小数据集上的分割任务。
  • results: 实验结果表明,AViT在4个皮肤病变数据集上取得了与当前最优方法相当乃至更优的性能,且可训练参数显著更少。
    Abstract Skin lesion segmentation (SLS) plays an important role in skin lesion analysis. Vision transformers (ViTs) are considered an auspicious solution for SLS, but they require more training data compared to convolutional neural networks (CNNs) due to their inherent parameter-heavy structure and lack of some inductive biases. To alleviate this issue, current approaches fine-tune pre-trained ViT backbones on SLS datasets, aiming to leverage the knowledge learned from a larger set of natural images to lower the amount of skin training data needed. However, fully fine-tuning all parameters of large backbones is computationally expensive and memory intensive. In this paper, we propose AViT, a novel efficient strategy to mitigate ViTs' data-hunger by transferring any pre-trained ViTs to the SLS task. Specifically, we integrate lightweight modules (adapters) within the transformer layers, which modulate the feature representation of a ViT without updating its pre-trained weights. In addition, we employ a shallow CNN as a prompt generator to create a prompt embedding from the input image, which grasps fine-grained information and CNN's inductive biases to guide the segmentation task on small datasets. Our quantitative experiments on 4 skin lesion datasets demonstrate that AViT achieves competitive, and at times superior, performance to SOTA but with significantly fewer trainable parameters. Our code is available at https://github.com/siyi-wind/AViT.
    摘要 皮肤病变分割(SLS)在皮肤病变分析中扮演着重要角色。视觉Transformer(ViT)被认为是SLS的一种有前景的方案,但由于其固有的参数量大且缺乏某些归纳偏置,相比卷积神经网络(CNN)需要更多的训练数据。为缓解这一问题,现有方法在SLS数据集上微调预训练的ViT骨干网络,期望利用从大量自然图像中学到的知识来降低所需的皮肤训练数据量;然而,对大型骨干网络的全部参数进行完整微调既耗费计算又占用内存。在本文中,我们提出AViT,一种新颖而高效的策略,通过将任意预训练ViT迁移到SLS任务来缓解其对数据的渴求。具体而言,我们在Transformer层内集成轻量级模块(adapters),在不更新预训练权重的情况下调制ViT的特征表示;此外,我们采用一个浅层CNN作为提示生成器,从输入图像生成提示嵌入,借助细粒度信息与CNN的归纳偏置来引导小数据集上的分割任务。我们在4个皮肤病变数据集上的定量实验表明,AViT取得了与SOTA相当乃至更优的性能,而可训练参数显著更少。代码见 https://github.com/siyi-wind/AViT。
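
A generic bottleneck-adapter sketch. The bottleneck width and the zero-initialized residual are common conventions for adapters in general, not necessarily AViT's exact module:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter trained while the pre-trained ViT weights stay frozen."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual modulation

x = torch.randn(2, 197, 768)  # (batch, tokens, ViT embed dim)
print(Adapter(768)(x).shape)  # torch.Size([2, 197, 768])
```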

Pretrained Deep 2.5D Models for Efficient Predictive Modeling from Retinal OCT

  • paper_url: http://arxiv.org/abs/2307.13865
  • repo_url: None
  • paper_authors: Taha Emre, Marzieh Oghbaie, Arunava Chakravarty, Antoine Rivail, Sophie Riedl, Julia Mai, Hendrik P. N. Scholl, Sobha Sivaprasad, Daniel Rueckert, Andrew Lotery, Ursula Schmidt-Erfurth, Hrvoje Bogunović
  • for: 这篇论文旨在探讨如何使用2.5D架构来优化医疗影像处理中的深度学习模型,以提高模型的性能和数据效率。
  • methods: 本论文探索了结合2D与3D技术的2.5D架构,基于卷积神经网络(CNN)、长短期记忆网络(LSTM)和Transformer构建(结构示意代码见本条目末尾),并将近期的2D非对比(non-contrastive)预训练方法引入2.5D架构,进一步提升模型的性能和数据效率。
  • results: 本论文在两个大规模纵向OCT数据集上,以预测6个月内进展为湿性年龄相关性黄斑变性(AMD)的任务,验证了所提架构及相应预训练的有效性。
    Abstract In the field of medical imaging, 3D deep learning models play a crucial role in building powerful predictive models of disease progression. However, the size of these models presents significant challenges, both in terms of computational resources and data requirements. Moreover, achieving high-quality pretraining of 3D models proves to be even more challenging. To address these issues, hybrid 2.5D approaches provide an effective solution for utilizing 3D volumetric data efficiently using 2D models. Combining 2D and 3D techniques offers a promising avenue for optimizing performance while minimizing memory requirements. In this paper, we explore 2.5D architectures based on a combination of convolutional neural networks (CNNs), long short-term memory (LSTM), and Transformers. In addition, leveraging the benefits of recent non-contrastive pretraining approaches in 2D, we enhanced the performance and data efficiency of 2.5D techniques even further. We demonstrate the effectiveness of architectures and associated pretraining on a task of predicting progression to wet age-related macular degeneration (AMD) within a six-month period on two large longitudinal OCT datasets.
    摘要 在医学影像领域,3D深度学习模型在构建强大的疾病进展预测模型方面发挥着关键作用。然而,这类模型的规模在计算资源和数据需求两方面都带来了巨大挑战,而对3D模型进行高质量预训练更是难上加难。为解决这些问题,混合式2.5D方法提供了一种有效方案,能够利用2D模型高效地处理3D体数据;结合2D与3D技术为在优化性能的同时尽量减少内存需求提供了一条有前景的途径。在本文中,我们探索了基于卷积神经网络(CNN)、长短期记忆网络(LSTM)与Transformer组合的2.5D架构,并借助近期2D非对比预训练方法的优势,进一步提升了2.5D技术的性能和数据效率。我们在两个大规模纵向OCT数据集上,以预测6个月内进展为湿性年龄相关性黄斑变性(AMD)的任务,验证了所提架构及相应预训练的有效性。
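
A minimal 2.5D sketch in PyTorch: a shared 2D CNN encodes each B-scan of an OCT volume and an LSTM aggregates across slices. All layer sizes are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class Hybrid25D(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(               # shared 2D slice encoder
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, vol):                          # vol: (B, S, 1, H, W)
        B, S = vol.shape[:2]
        f = self.encoder(vol.flatten(0, 1)).view(B, S, -1)  # per-slice features
        _, (h, _) = self.lstm(f)                     # aggregate across slices
        return self.head(h[-1])                      # risk-of-progression logits

print(Hybrid25D()(torch.randn(2, 16, 1, 64, 64)).shape)  # torch.Size([2, 2])
```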

On the unreasonable vulnerability of transformers for image restoration – and an easy fix

  • paper_url: http://arxiv.org/abs/2307.13856
  • repo_url: None
  • paper_authors: Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper
  • for: 该论文研究视觉Transformer(ViT)在图像恢复任务中的对抗鲁棒性。
  • methods: 他们使用Projected Gradient Descent(PGD)和CosPGD(一种面向像素级预测任务的对抗攻击)来评估模型的鲁棒性(PGD的示意代码见本条目末尾)。
  • results: 他们发现这些模型在真实图像去模糊任务中极易受对抗攻击影响;对抗训练能显著提升Restormer的鲁棒性,但在其他网络上的效果不甚理想。
    Abstract Following their success in visual recognition tasks, Vision Transformers(ViTs) are being increasingly employed for image restoration. As a few recent works claim that ViTs for image classification also have better robustness properties, we investigate whether the improved adversarial robustness of ViTs extends to image restoration. We consider the recently proposed Restormer model, as well as NAFNet and the "Baseline network" which are both simplified versions of a Restormer. We use Projected Gradient Descent (PGD) and CosPGD, a recently proposed adversarial attack tailored to pixel-wise prediction tasks for our robustness evaluation. Our experiments are performed on real-world images from the GoPro dataset for image deblurring. Our analysis indicates that contrary to as advocated by ViTs in image classification works, these models are highly susceptible to adversarial attacks. We attempt to improve their robustness through adversarial training. While this yields a significant increase in robustness for Restormer, results on other networks are less promising. Interestingly, the design choices in NAFNet and Baselines, which were based on iid performance, and not on robust generalization, seem to be at odds with the model robustness. Thus, we investigate this further and find a fix.
    摘要 继在视觉识别任务中取得成功之后,视觉Transformer(ViT)正被越来越多地应用于图像恢复。由于一些近期工作声称用于图像分类的ViT也具有更好的鲁棒性,我们研究这种对抗鲁棒性的提升是否同样适用于图像恢复。我们考察了近期提出的Restormer模型,以及其简化版本NAFNet和"Baseline network"。我们使用Projected Gradient Descent(PGD)和CosPGD(一种专为像素级预测任务设计的对抗攻击)进行鲁棒性评估,实验在GoPro数据集的真实图像去模糊任务上进行。分析表明,与图像分类工作中的论断相反,这些模型极易受到对抗攻击。我们尝试通过对抗训练提升其鲁棒性:这为Restormer带来了显著的鲁棒性提升,但在其他网络上的效果不甚理想。有趣的是,NAFNet与Baseline中基于独立同分布性能(而非鲁棒泛化)做出的设计选择似乎与模型鲁棒性相冲突。我们对此进行了进一步研究,并找到了一种修正方法。
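
A standard L-infinity PGD sketch adapted to restoration: the attack perturbs the degraded input to maximize reconstruction error rather than a classification loss. Step sizes and budget are illustrative:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, target, eps=8/255, alpha=2/255, steps=10):
    """L-inf PGD against a restoration network. x: degraded input in [0, 1],
    target: clean reference image the restoration is compared against."""
    delta = torch.zeros_like(x).uniform_(-eps, eps)
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = F.mse_loss(model(x + delta), target)  # error we want to maximize
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # gradient ascent step
            delta.clamp_(-eps, eps)                  # project back to the eps-ball
            delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()

x, y = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
x_adv = pgd_attack(torch.nn.Identity(), x, y)        # Identity stands in for a model
print((x_adv - x).abs().max().item())                # bounded by eps
```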

Exploring the Sharpened Cosine Similarity

  • paper_url: http://arxiv.org/abs/2307.13855
  • repo_url: None
  • paper_authors: Skyler Wu, Fred Lu, Edward Raff, James Holt
  • for: 考察SCS能否取代卷积层以提升图像分类器的性能。
  • methods: 研究SCS参数的行为及其作为卷积层替代方案的潜力,并在CIFAR-10上对多个CNN架构进行基准测试(SCS的示意代码见本条目末尾)。
  • results: SCS可能不会带来明显的精度提升,但可能学到更易于解释的特征;在某些情况下,SCS还可能略微提升对抗鲁棒性。
    Abstract Convolutional layers have long served as the primary workhorse for image classification. Recently, an alternative to convolution was proposed using the Sharpened Cosine Similarity (SCS), which in theory may serve as a better feature detector. While multiple sources report promising results, there has not been to date a full-scale empirical analysis of neural network performance using these new layers. In our work, we explore SCS's parameter behavior and potential as a drop-in replacement for convolutions in multiple CNN architectures benchmarked on CIFAR-10. We find that while SCS may not yield significant increases in accuracy, it may learn more interpretable representations. We also find that, in some circumstances, SCS may confer a slight increase in adversarial robustness.
    摘要 卷积层长期以来一直是图像分类的主要工具。最近,有人提出了一种卷积的替代方案——锐化余弦相似度(Sharpened Cosine Similarity, SCS),理论上它可能是更好的特征检测器。虽然多个来源报告了有希望的结果,但迄今尚无对使用这种新层的神经网络性能的全面实证分析。在我们的工作中,我们探索SCS的参数行为及其在CIFAR-10上多个CNN架构中替代卷积层的潜力。我们发现,虽然SCS可能不会带来显著的准确率提升,但它可能学到更易于解释的表示;在某些情况下,SCS还可能带来轻微的对抗鲁棒性提升。
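
The commonly cited SCS formulation, scs(s, k) = sign(s·k) · (|s·k| / ((‖s‖+q)(‖k‖+q)))^p, as a small sketch; the p and q values are illustrative defaults:

```python
import torch

def sharpened_cosine_similarity(s, k, p=2.0, q=1e-3):
    """s: (..., D) flattened image patches, k: (D,) one kernel.
    p sharpens the response; q keeps tiny-norm patches from dominating."""
    dot = (s * k).sum(dim=-1)
    cos = dot / ((s.norm(dim=-1) + q) * (k.norm(dim=-1) + q))
    return torch.sign(cos) * cos.abs().pow(p)

s = torch.randn(4, 9)   # four 3x3 patches, flattened
k = torch.randn(9)      # one kernel
print(sharpened_cosine_similarity(s, k))
```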

SplitFed resilience to packet loss: Where to split, that is the question

  • paper_url: http://arxiv.org/abs/2307.13851
  • repo_url: None
  • paper_authors: Chamani Shiranthika, Zahra Hafezi Kafshgari, Parvaneh Saeedi, Ivan V. Bajić
  • for: 本文研究了Split Federated Learning(SplitFed或SFL)的可靠性问题,具体来说是在通信链路上 packet loss 的影响下对 SFL 的性能的研究。
  • methods: 本文考察了多种SFL聚合策略,在两个拆分点(浅层拆分与深层拆分)上拆分模型进行测试,以判断拆分点位置是否对最终模型的准确率产生具有统计显著性的影响。
  • results: 实验结果表明,使用 deeper split point 可以获得更高的准确率。
    Abstract Decentralized machine learning has broadened its scope recently with the invention of Federated Learning (FL), Split Learning (SL), and their hybrids like Split Federated Learning (SplitFed or SFL). The goal of SFL is to reduce the computational power required by each client in FL and parallelize SL while maintaining privacy. This paper investigates the robustness of SFL against packet loss on communication links. The performance of various SFL aggregation strategies is examined by splitting the model at two points -- shallow split and deep split -- and testing whether the split point makes a statistically significant difference to the accuracy of the final model. Experiments are carried out on a segmentation model for human embryo images and indicate the statistically significant advantage of a deeper split point.
    摘要 随着联邦学习(FL)、拆分学习(SL)及其混合形式(如拆分联邦学习,SplitFed或SFL)的提出,去中心化机器学习的应用范围近年来不断扩大。SFL的目标是在保持隐私的同时,降低FL中每个客户端所需的计算能力并使SL并行化。本文研究SFL在通信链路发生丢包时的鲁棒性:通过在两个拆分点(浅层拆分与深层拆分)上拆分模型,考察多种SFL聚合策略的性能,并检验拆分点是否对最终模型的准确率产生具有统计显著性的影响。实验基于人类胚胎图像分割模型进行,结果表明更深的拆分点具有统计显著的优势。

CosSIF: Cosine similarity-based image filtering to overcome low inter-class variation in synthetic medical image datasets

  • paper_url: http://arxiv.org/abs/2307.13842
  • repo_url: https://github.com/mominul-ssv/cossif
  • paper_authors: Mominul Islam, Hasib Zunair, Nabeel Mohammed
  • for: 该研究旨在提升医疗图像分析中深度学习模型的性能,尤其针对类间差异较小的医疗图像数据集。
  • methods: 研究提出了一种新的筛选算法Cosine Similarity-based Image Filtering (CosSIF),并在此基础上开发了两种筛选方法:GAN训练前筛选(FBGT)和GAN训练后筛选(FAGT)。相关思路的示意代码见本条目末尾。
  • results: 实验结果显示,将CosSIF筛选方法与现代Transformer及卷积网络结合使用,可在多项评估指标上取得显著的性能提升。在ISIC-2016数据集上,FAGT较基线方法的灵敏度提升1.59%、AUC提升1.88%;在HAM10000数据集上,应用FBGT筛选合成图像可使召回率提升13.75%,而仅使用FAGT即可达到94.44%的最高准确率。
    Abstract Crafting effective deep learning models for medical image analysis is a complex task, particularly in cases where the medical image dataset lacks significant inter-class variation. This challenge is further aggravated when employing such datasets to generate synthetic images using generative adversarial networks (GANs), as the output of GANs heavily relies on the input data. In this research, we propose a novel filtering algorithm called Cosine Similarity-based Image Filtering (CosSIF). We leverage CosSIF to develop two distinct filtering methods: Filtering Before GAN Training (FBGT) and Filtering After GAN Training (FAGT). FBGT involves the removal of real images that exhibit similarities to images of other classes before utilizing them as the training dataset for a GAN. On the other hand, FAGT focuses on eliminating synthetic images with less discriminative features compared to real images used for training the GAN. Experimental results reveal that employing either the FAGT or FBGT method with modern transformer and convolutional-based networks leads to substantial performance gains in various evaluation metrics. FAGT implementation on the ISIC-2016 dataset surpasses the baseline method in terms of sensitivity by 1.59\% and AUC by 1.88\%. Furthermore, for the HAM10000 dataset, applying FABT outperforms the baseline approach in terms of recall by 13.75\%, and with the sole implementation of FAGT, achieves a maximum accuracy of 94.44\%.
    摘要 为医疗图像分析构建有效的深度学习模型是一项复杂的任务,当医疗图像数据集缺乏明显的类间差异时尤其如此。当利用这类数据集通过生成对抗网络(GAN)合成图像时,这一挑战会进一步加剧,因为GAN的输出严重依赖于输入数据。在本研究中,我们提出了一种名为基于余弦相似度的图像过滤(CosSIF)的新型过滤算法,并在此基础上开发了两种过滤方法:GAN训练前过滤(FBGT)和GAN训练后过滤(FAGT)。FBGT在将真实图像用作GAN的训练数据之前,剔除与其他类别图像相似的真实图像;FAGT则剔除相较于GAN训练所用真实图像判别性特征较弱的合成图像。实验结果表明,将FAGT或FBGT方法与现代Transformer及卷积网络结合使用,可在多项评估指标上取得显著的性能提升。在ISIC-2016数据集上,FAGT较基线方法的灵敏度提升1.59%、AUC提升1.88%;在HAM10000数据集上,应用FBGT较基线方法的召回率提升13.75%,而仅使用FAGT即可达到94.44%的最高准确率。
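
A simplified reading of the FBGT-style filtering step: drop the fraction of one class's samples most similar to another class. The embedding source and the keep fraction are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cossif_filter(feats_a, feats_b, keep_frac=0.9):
    """feats_a: (N, D) embeddings of the class being filtered,
    feats_b: (M, D) embeddings of the other class.
    Returns indices of class-A samples to keep (least cross-class similar)."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    max_sim = (a @ b.t()).max(dim=1).values   # closest cross-class match per sample
    k = int(keep_frac * len(a))
    return max_sim.argsort()[:k]              # keep the least similar samples

idx = cossif_filter(torch.randn(100, 256), torch.randn(80, 256))
print(idx.shape)  # torch.Size([90])
```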

A real-time material breakage detection for offshore wind turbines based on improved neural network algorithm

  • paper_url: http://arxiv.org/abs/2307.13765
  • repo_url: None
  • paper_authors: Yantong Liu
  • for: 保障海上风力发电机表面材料的完整性,为可持续能源生产做出贡献。
  • methods: 使用改进版YOLOv8目标检测模型,并配备卷积块注意力模块(CBAM)以进一步增强特征识别能力,同时对损失函数进行优化(CBAM的示意代码见本条目末尾)。
  • results: 使用来自Saemangeum海上风电场的5432张图像及一个公开数据集进行严格测试,缺陷检测的稳定性得到显著提升,为海上风电机维护做出了重要贡献。
    Abstract The integrity of offshore wind turbines, pivotal for sustainable energy generation, is often compromised by surface material defects. Despite the availability of various detection techniques, limitations persist regarding cost-effectiveness, efficiency, and applicability. Addressing these shortcomings, this study introduces a novel approach leveraging an advanced version of the YOLOv8 object detection model, supplemented with a Convolutional Block Attention Module (CBAM) for improved feature recognition. The optimized loss function further refines the learning process. Employing a dataset of 5,432 images from the Saemangeum offshore wind farm and a publicly available dataset, our method underwent rigorous testing. The findings reveal a substantial enhancement in defect detection stability, marking a significant stride towards efficient turbine maintenance. This study's contributions illuminate the path for future research, potentially revolutionizing sustainable energy practices.
    摘要 海上风力发电机的完整性是可持续能源生产的关键,但常因表面材料缺陷而受损。尽管已有多种检测技术,但在成本效益、效率和适用性方面仍存在局限。针对这些不足,本研究提出了一种新方法:采用改进版YOLOv8目标检测模型,并辅以卷积块注意力模块(CBAM)以增强特征识别能力,同时通过优化的损失函数进一步完善学习过程。我们使用来自Saemangeum海上风电场的5432张图像及一个公开数据集对该方法进行了严格测试。结果表明,缺陷检测的稳定性得到显著提升,向高效的风机维护迈出了重要一步。本研究的贡献为后续研究指明了方向,有望推动可持续能源实践的变革。
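
The standard CBAM formulation (channel attention followed by spatial attention), shown as a plug-in block; this is the textbook module, not necessarily the paper's exact variant or placement inside YOLOv8:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                 nn.Linear(ch // r, ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention from pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        # Spatial attention from channel-pooled maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```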

Implementing and Benchmarking the Locally Competitive Algorithm on the Loihi 2 Neuromorphic Processor

  • paper_url: http://arxiv.org/abs/2307.13762
  • repo_url: None
  • paper_authors: Gavin Parpart, Sumedh R. Risbud, Garrett T. Kenyon, Yijing Watkins
  • for: 这研究旨在证明 neuromorphic processor 可以实现高效、低功耗的数据处理,尤其是在小型 робот、卫星等具有 strict SWaP 要求的应用中。
  • methods: 这研究使用了 Locally Competitive Algorithm (LCA) 在 neuromorphic processor 上进行了实现,并对 Loihi 2 processor 进行了 optimize。
  • results: 研究发现,使用 Loihi 2 processor 实现 LCA 的效率和速度比 CPU 和 GPU 设备更高,特别是在大 sparse penalty 下。此外,调整 LCA 参数可以提高性能。这些结果表明 neuromorphic processor 可以在资源受限的设备上进行高效、高精度的数据处理。
    Abstract Neuromorphic processors have garnered considerable interest in recent years for their potential in energy-efficient and high-speed computing. The Locally Competitive Algorithm (LCA) has been utilized for power efficient sparse coding on neuromorphic processors, including the first Loihi processor. With the Loihi 2 processor enabling custom neuron models and graded spike communication, more complex implementations of LCA are possible. We present a new implementation of LCA designed for the Loihi 2 processor and perform an initial set of benchmarks comparing it to LCA on CPU and GPU devices. In these experiments LCA on Loihi 2 is orders of magnitude more efficient and faster for large sparsity penalties, while maintaining similar reconstruction quality. We find this performance improvement increases as the LCA parameters are tuned towards greater representation sparsity. Our study highlights the potential of neuromorphic processors, particularly Loihi 2, in enabling intelligent, autonomous, real-time processing on small robots, satellites where there are strict SWaP (small, lightweight, and low power) requirements. By demonstrating the superior performance of LCA on Loihi 2 compared to conventional computing device, our study suggests that Loihi 2 could be a valuable tool in advancing these types of applications. Overall, our study highlights the potential of neuromorphic processors for efficient and accurate data processing on resource-constrained devices.
    摘要 神经形态处理器因其在高能效与高速计算方面的潜力,近年来受到了广泛关注。局部竞争算法(LCA)已被用于在神经形态处理器(包括第一代Loihi处理器)上实现低功耗稀疏编码。随着Loihi 2处理器支持自定义神经元模型和分级脉冲通信,LCA的更复杂实现成为可能。我们提出了一种面向Loihi 2处理器的全新LCA实现,并进行了与CPU和GPU设备上LCA的初步基准比较。实验表明,在较大的稀疏惩罚下,Loihi 2上的LCA在效率和速度上高出数个数量级,同时保持相近的重建质量;并且当LCA参数向更高表示稀疏度调节时,这一性能优势进一步扩大。我们的研究凸显了神经形态处理器(尤其是Loihi 2)在小型机器人、卫星等具有严格SWaP(小型、轻量、低功耗)要求的平台上实现智能、自主、实时处理的潜力。通过展示LCA在Loihi 2上相对传统计算设备的优越性能,本研究表明Loihi 2有望成为推进此类应用的有力工具。总体而言,本研究凸显了神经形态处理器在资源受限设备上进行高效、准确数据处理的潜力。
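
The textbook LCA dynamics in NumPy: du/dt = b − u − (DᵀD − I)a with a = soft_threshold(u) and b = Dᵀx. The step size, sparsity penalty, and dictionary are illustrative, not the Loihi 2 implementation:

```python
import numpy as np

def lca(x, D, lam=0.1, step=0.05, iters=300):
    """x: (d,) input signal, D: (d, n) unit-norm dictionary.
    Returns a sparse code a such that D @ a approximates x."""
    G = D.T @ D - np.eye(D.shape[1])   # lateral inhibition weights
    b = D.T @ x                        # feedforward drive
    u = np.zeros(D.shape[1])
    soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    for _ in range(iters):
        u += step * (b - u - G @ soft(u))   # Euler step of the dynamics
    return soft(u)                          # thresholded sparse code

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128)); D /= np.linalg.norm(D, axis=0)
code = lca(D @ (rng.normal(size=128) * 0.1), D)
print(np.count_nonzero(code), "active elements of", code.size)
```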

PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View

  • paper_url: http://arxiv.org/abs/2307.13756
  • repo_url: https://github.com/sjingjia/planerectr
  • paper_authors: Jingjia Shi, Shuaifeng Zhi, Kai Xu
  • for: 本研究旨在提出一种能够独立从单张图像中恢复3D平面的全新框架,即PlaneRecTR,该框架可以同时处理多个相关的子任务,包括平面检测、分割、参数估计和深度估计。
  • methods: PlaneRecTR 基于Transformer架构,通过统一的查询学习(query learning)机制,首次将单视图平面恢复的所有相关子任务集成到一个紧凑的模型中。
  • results: 大量定量与定性实验表明,我们提出的统一学习方法在公开的 ScanNet 和 NYUv2-Plane 数据集上取得了新的最优性能。
    Abstract 3D plane recovery from a single image can usually be divided into several subtasks of plane detection, segmentation, parameter estimation and possibly depth estimation. Previous works tend to solve this task by either extending the RCNN-based segmentation network or the dense pixel embedding-based clustering framework. However, none of them tried to integrate above related subtasks into a unified framework but treat them separately and sequentially, which we suspect is potentially a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR, a Transformer-based architecture, which for the first time unifies all subtasks related to single-view plane recovery with a single compact model. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across subtasks, obtaining a new state-of-the-art performance on public ScanNet and NYUv2-Plane datasets. Codes are available at https://github.com/SJingjia/PlaneRecTR.
    摘要 从单张图像恢复3D平面通常可分解为平面检测、分割、参数估计以及可能的深度估计等多个子任务。以往工作多通过扩展基于RCNN的分割网络或基于密集像素嵌入的聚类框架来求解,但都未尝试将上述相关子任务整合进统一框架,而是将其分开、按序处理,我们认为这可能是现有方法性能受限的主要原因之一。受此发现以及基于查询的学习在丰富语义实体间推理方面的成功启发,本文提出PlaneRecTR,一种基于Transformer的架构,首次用单个紧凑模型统一了单视图平面恢复的所有子任务。大量定量与定性实验表明,所提出的统一学习在各子任务间实现了互惠,在公开的ScanNet和NYUv2-Plane数据集上取得了新的最优性能。代码见 https://github.com/SJingjia/PlaneRecTR。

ChildGAN: Large Scale Synthetic Child Facial Data Using Domain Adaptation in StyleGAN

  • paper_url: http://arxiv.org/abs/2307.13746
  • repo_url: None
  • paper_authors: Muhammad Ali Farooq, Wang Yao, Gabriel Costache, Peter Corcoran
  • for: 该论文旨在基于StyleGAN2构建新的ChildGAN网络,用于生成合成男孩与女孩的面部数据。
  • methods: 该论文采用基于迁移学习的平滑域迁移方法构建网络,并渲染出大规模数据集,涵盖多种智能的面部变换,如表情、年龄增长、眨眼效果、头部姿态、皮肤与头发颜色变化以及不同的光照条件等。
  • results: 这个论文通过多种计算机视觉应用测试,如CNN基于的儿童性别分类器、面部定位和表情特征检测测试、人脸认知度评估使用ArcFace,以及眼睛检测和眼球比例测试, validate了生成的儿童脸部数据的真实性和特点。
    Abstract In this research work, we proposed a novel ChildGAN, a pair of GAN networks for generating synthetic boys and girls facial data derived from StyleGAN2. ChildGAN is built by performing smooth domain transfer using transfer learning. It provides photo-realistic, high-quality data samples. A large-scale dataset is rendered with a variety of smart facial transformations: facial expressions, age progression, eye blink effects, head pose, skin and hair color variations, and variable lighting conditions. The dataset comprises more than 300k distinct data samples. Further, the uniqueness and characteristics of the rendered facial features are validated by running different computer vision application tests which include CNN-based child gender classifier, face localization and facial landmarks detection test, identity similarity evaluation using ArcFace, and lastly running eye detection and eye aspect ratio tests. The results demonstrate that synthetic child facial data of high quality offers an alternative to the cost and complexity of collecting a large-scale dataset from real children.
    摘要 在这项研究中,我们提出了新颖的ChildGAN:一对由StyleGAN2派生、用于生成合成男孩与女孩面部数据的GAN网络。ChildGAN通过迁移学习进行平滑的域迁移而构建,可提供照片级真实感的高质量数据样本。我们渲染出大规模数据集,涵盖多种智能面部变换:表情、年龄增长、眨眼效果、头部姿态、皮肤与头发颜色变化以及可变光照条件。数据集包含超过30万个不同的数据样本。此外,我们通过多项计算机视觉应用测试验证了所渲染面部特征的独特性与有效性,包括基于CNN的儿童性别分类器、人脸定位与面部关键点检测测试、使用ArcFace的身份相似度评估,以及眼睛检测与眼睛纵横比测试。结果表明,高质量的合成儿童面部数据为从真实儿童收集大规模数据集的成本与复杂性提供了一种可行的替代方案。

A Comprehensive Analysis on the Leakage of Fuzzy Matchers

  • paper_url: http://arxiv.org/abs/2307.13717
  • repo_url: None
  • paper_authors: Axel Durbet, Paul-Marie Grollemund, Kevin Thiry-Atighehchi
  • for: 本文对距离评估过程中的信息泄露进行了全面分析,重点关注基于阈值的模糊距离(即模糊匹配器,Fuzzy Matcher)。
  • methods: 本文给出了一个详尽的信息泄露场景目录,以及这些场景对数据隐私安全性的影响;每个场景都对应一种通用攻击,并以攻击的计算成本来确定安全水平的上界。
  • results: 分析表明,当使用弱隐私保护的匹配器时(例如存在侧信道攻击或仅部分模糊的设计),信息泄露可能使攻击者获取敏感数据。
    Abstract This paper provides a comprehensive analysis of information leakage during distance evaluation, with an emphasis on threshold-based obfuscated distance (i.e., Fuzzy Matcher). Leakage can occur due to a malware infection or the use of a weakly privacy-preserving matcher, exemplified by side channel attacks or partially obfuscated designs. We provide an exhaustive catalog of information leakage scenarios as well as their impacts on the security concerning data privacy. Each of the scenarios leads to generic attacks whose impacts are expressed in terms of computational costs, hence allowing the establishment of upper bounds on the security level.
    摘要 这篇论文提供了距离评估过程中信息泄露的全面分析,重点关注基于阈值的模糊距离(即模糊匹配器)。泄露可能源于恶意软件感染,或使用隐私保护较弱的匹配器(例如侧信道攻击或仅部分模糊的设计)。我们给出了一个详尽的信息泄露场景目录,以及这些场景对数据隐私安全性的影响。每个场景都对应一种通用攻击,其影响以计算成本来刻画,从而可以确定安全水平的上界。

Personal Protective Equipment Detection in Extreme Construction Conditions

  • paper_url: http://arxiv.org/abs/2307.13654
  • repo_url: None
  • paper_authors: Yuexiong Ding, Xiaowei Luo
  • for: 该研究旨在开发一个能够应对极端施工环境的鲁棒个人防护装备(PPE)检测模型。
  • methods: 该研究将神经风格迁移(NST)与YOLOv5相结合,构建了抗退化的检测模型NST-YOLOv5。
  • results: 实验结果显示,NST模块比传统图像处理算法更擅长模拟极端环境,帮助NST-YOLOv5在合成与真实世界极端数据上分别获得0.141和0.083的mAP_(05:95)提升。
    Abstract Object detection has been widely applied for construction safety management, especially personal protective equipment (PPE) detection. Though the existing PPE detection models trained on conventional datasets have achieved excellent results, their performance dramatically declines in extreme construction conditions. A robust detection model NST-YOLOv5 is developed by combining the neural style transfer (NST) and YOLOv5 technologies. Five extreme conditions are considered and simulated via the NST module to endow the detection model with excellent robustness, including low light, intense light, sand dust, fog, and rain. Experiments show that the NST has great potential as a tool for extreme data synthesis since it is better at simulating extreme conditions than other traditional image processing algorithms and helps the NST-YOLOv5 achieve 0.141 and 0.083 mAP_(05:95) improvements in synthesized and real-world extreme data. This study provides a new feasible way to obtain a more robust detection model for extreme construction conditions.
    摘要 目标检测已被广泛应用于施工安全管理,尤其是个人防护装备(PPE)检测。尽管在常规数据集上训练的现有PPE检测模型已取得出色的结果,但其性能在极端施工条件下会急剧下降。本研究将神经风格迁移(NST)与YOLOv5技术相结合,开发出鲁棒的检测模型NST-YOLOv5:通过NST模块模拟低光、强光、沙尘、雾和雨等五种极端条件,赋予检测模型出色的鲁棒性。实验表明,NST比其他传统图像处理算法更擅长模拟极端条件,是一种极具潜力的极端数据合成工具,并帮助NST-YOLOv5在合成与真实世界极端数据上分别获得0.141和0.083的mAP_(05:95)提升。本研究为获得适用于极端施工条件的更鲁棒检测模型提供了一条新的可行途径。

Learning Transferable Object-Centric Diffeomorphic Transformations for Data Augmentation in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.13645
  • repo_url: None
  • paper_authors: Nilesh Kumar, Prashnna K. Gyawali, Sandesh Ghimire, Linwei Wang
  • for: 该论文旨在缓解医疗图像分割中获取标注数据的挑战,尤其是需要专家逐像素标注的问题。
  • methods: 利用对感兴趣对象施加可微分形变来缓解这一挑战;但现有变换通常针对整幅图像全局学习,难以跨数据集迁移,也难以应用于图像对齐困难的问题。为此,作者提出一种以对象为中心的数据增强模型,能够学习目标对象的形状变化,并在不修改图像其余部分的情况下就地增强对象。
  • results: 该模型在肾脏肿瘤分割上得到验证:无论形状变化学习自同一数据集,还是从外部数据集迁移而来,均能提升分割效果。
    Abstract Obtaining labelled data in medical image segmentation is challenging due to the need for pixel-level annotations by experts. Recent works have shown that augmenting the object of interest with deformable transformations can help mitigate this challenge. However, these transformations have been learned globally for the image, limiting their transferability across datasets or applicability in problems where image alignment is difficult. While object-centric augmentations provide a great opportunity to overcome these issues, existing works are only focused on position and random transformations without considering shape variations of the objects. To this end, we propose a novel object-centric data augmentation model that is able to learn the shape variations for the objects of interest and augment the object in place without modifying the rest of the image. We demonstrated its effectiveness in improving kidney tumour segmentation when leveraging shape variations learned both from within the same dataset and transferred from external datasets.
    摘要 医疗图像分割中获取标注数据具有挑战性,因为需要专家进行像素级标注。最近的研究表明,对感兴趣对象施加可变形变换有助于缓解这一挑战。然而,这些变换通常是针对整幅图像全局学习的,导致其难以跨数据集迁移,或难以应用于图像对齐困难的问题。以对象为中心的增强为克服这些问题提供了良机,但现有工作仅关注位置和随机变换,而忽略了对象形状的变化。为此,我们提出一种新的以对象为中心的数据增强模型,能够学习感兴趣对象的形状变化,并在不修改图像其余部分的情况下就地增强对象。我们在肾脏肿瘤分割上验证了其有效性:无论形状变化学习自同一数据集,还是从外部数据集迁移而来,均能带来提升。

Optical Flow boosts Unsupervised Localization and Segmentation

  • paper_url: http://arxiv.org/abs/2307.13640
  • repo_url: https://github.com/mlzxy/flowdino
  • paper_authors: Xinyu Zhang, Abdeslam Boularias
  • for: 本研究旨在提出一种基于运动线索的无监督定位与分割方法,以应对自主机器人视觉中的长期挑战。
  • methods: 我们提出了一种新的损失函数形式:利用无标注视频中的光流,鼓励空间位置具有相似运动的自监督视觉Transformer(ViT)特征相互靠近,反之则相互远离(损失的示意代码见本条目末尾),并以该损失微调最初在静态图像上训练的ViT。
  • results: 我们的微调过程通过线性探测超越了无监督语义分割的最先进方法,且无需任何标注数据;同时在无监督目标定位和语义分割基准上均优于原始ViT网络。
    Abstract Unsupervised localization and segmentation are long-standing robot vision challenges that describe the critical ability for an autonomous robot to learn to decompose images into individual objects without labeled data. These tasks are important because of the limited availability of dense image manual annotation and the promising vision of adapting to an evolving set of object categories in lifelong learning. Most recent methods focus on using visual appearance continuity as object cues by spatially clustering features obtained from self-supervised vision transformers (ViT). In this work, we leverage motion cues, inspired by the common fate principle that pixels that share similar movements tend to belong to the same object. We propose a new loss term formulation that uses optical flow in unlabeled videos to encourage self-supervised ViT features to become closer to each other if their corresponding spatial locations share similar movements, and vice versa. We use the proposed loss function to finetune vision transformers that were originally trained on static images. Our fine-tuning procedure outperforms state-of-the-art techniques for unsupervised semantic segmentation through linear probing, without the use of any labeled data. This procedure also demonstrates increased performance over original ViT networks across unsupervised object localization and semantic segmentation benchmarks.
    摘要 无监督定位与分割是机器人视觉领域长期存在的挑战,它们刻画了自主机器人在没有标注数据的情况下学习将图像分解为单个对象的关键能力。由于密集的人工图像标注十分有限,且终身学习需要适应不断变化的对象类别,这些任务尤为重要。最新的方法大多利用视觉外观的连续性作为对象线索,对自监督视觉Transformer(ViT)提取的特征进行空间聚类。在这项工作中,我们利用运动线索,其灵感来自"共同命运"原则:运动相似的像素往往属于同一对象。我们提出了一种新的损失项:利用无标注视频中的光流,鼓励空间位置运动相似的自监督ViT特征相互靠近,反之则相互远离。我们用该损失函数微调最初在静态图像上训练的视觉Transformer。我们的微调过程通过线性探测超越了当前最先进的无监督语义分割技术,且无需任何标注数据;同时在无监督目标定位和语义分割基准上均优于原始ViT网络。
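
A sketch of a common-fate style loss: the pairing scheme, threshold, and squared-error form are assumptions for illustration; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def flow_guided_feature_loss(feats, flow, num_pairs=1024, sim_thresh=0.9):
    """feats: (B, C, H, W) ViT features upsampled to flow resolution,
    flow: (B, 2, H, W) optical flow. Feature similarity of random pixel
    pairs is pushed toward +1 when their flow directions agree, else -1."""
    B, C, H, W = feats.shape
    f = F.normalize(feats.flatten(2), dim=1)       # (B, C, HW)
    v = F.normalize(flow.flatten(2), dim=1)        # (B, 2, HW) flow directions
    idx = torch.randint(0, H * W, (2, num_pairs))  # random pixel pairs
    fi, fj = f[:, :, idx[0]], f[:, :, idx[1]]
    vi, vj = v[:, :, idx[0]], v[:, :, idx[1]]
    feat_sim = (fi * fj).sum(1)                    # (B, num_pairs)
    flow_sim = (vi * vj).sum(1)
    target = (flow_sim > sim_thresh).float() * 2 - 1   # +1 similar, -1 otherwise
    return ((feat_sim - target) ** 2).mean()

loss = flow_guided_feature_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 2, 32, 32))
print(loss.item())
```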

Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Shape Estimation

  • paper_url: http://arxiv.org/abs/2307.13639
  • repo_url: None
  • paper_authors: Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling
  • for: 弥合2D与3D人脸形状估计之间的差距。
  • methods: 提出一种条件化的stable diffusion人脸图像生成模型,利用丰富的2D人脸信息来指导3D空间:以从人脸3D形变模型(3DMM)采样的深度图为条件,生成多样且形状一致的图像。
  • results: 构建了包含25万张照片级真实图像及对应3DMM参数的大规模合成数据集SynthFace,并在其上训练了深度神经网络ControlFace,在NoW基准上取得有竞争力的性能,且无需3D监督或手动制作3D资产。
    Abstract Accurate 3D face shape estimation is an enabling technology with applications in healthcare, security, and creative industries, yet current state-of-the-art methods either rely on self-supervised training with 2D image data or supervised training with very limited 3D data. To bridge this gap, we present a novel approach which uses a conditioned stable diffusion model for face image generation, leveraging the abundance of 2D facial information to inform 3D space. By conditioning stable diffusion on depth maps sampled from a 3D Morphable Model (3DMM) of the human face, we generate diverse and shape-consistent images, forming the basis of SynthFace. We introduce this large-scale synthesised dataset of 250K photorealistic images and corresponding 3DMM parameters. We further propose ControlFace, a deep neural network, trained on SynthFace, which achieves competitive performance on the NoW benchmark, without requiring 3D supervision or manual 3D asset creation.
    摘要 精确的3D人脸形状估计是一项赋能技术,在医疗、安防和创意产业中均有应用。然而,当前最先进的方法要么依赖2D图像数据的自监督训练,要么依赖非常有限的3D数据进行监督训练。为弥合这一差距,我们提出一种新方法:使用条件化的stable diffusion模型生成人脸图像,利用丰富的2D人脸信息来指导3D空间。通过以人脸3D形变模型(3DMM)采样的深度图为条件,我们生成了多样且形状一致的图像,构成了SynthFace——一个包含25万张照片级真实图像及对应3DMM参数的大规模合成数据集。我们进一步提出在SynthFace上训练的深度神经网络ControlFace,其在NoW基准上取得有竞争力的表现,且无需3D监督或手动制作3D资产。

RecursiveDet: End-to-End Region-based Recursive Object Detection

  • paper_url: http://arxiv.org/abs/2307.13619
  • repo_url: https://github.com/bravezzzzzz/recursivedet
  • paper_authors: Jing Zhao, Li Sun, Qingli Li
  • for: 提高 End-to-end 区域基于对象检测器的性能和参数数量,使其更加高效。
  • methods: 提出一种基于重复解码的方法,通过共享参数并使用可 recursive 的解码器,提高检测器的性能。并在解码器中使用位置编码(PE),使其根据输入 bounding box 的具体位置和大小进行适应。
  • results: 通过对多个主流区域基于对象检测器进行减少,并在不同的 stage 进行 recursive 解码,实现了明显的性能提升,并且需要 fewer 参数和轻微增加计算成本。
    Abstract End-to-end region-based object detectors like Sparse R-CNN usually have multiple cascade bounding box decoding stages, which refine the current predictions according to their previous results. Model parameters within each stage are independent, evolving a huge cost. In this paper, we find the general setting of decoding stages is actually redundant. By simply sharing parameters and making a recursive decoder, the detector already obtains a significant improvement. The recursive decoder can be further enhanced by positional encoding (PE) of the proposal box, which makes it aware of the exact locations and sizes of input bounding boxes, thus becoming adaptive to proposals from different stages during the recursion. Moreover, we also design centerness-based PE to distinguish the RoI feature element and dynamic convolution kernels at different positions within the bounding box. To validate the effectiveness of the proposed method, we conduct intensive ablations and build the full model on three recent mainstream region-based detectors. The RecusiveDet is able to achieve obvious performance boosts with even fewer model parameters and slightly increased computation cost. Codes are available at https://github.com/bravezzzzzz/RecursiveDet.
    摘要 Sparse R-CNN等端到端区域检测器通常包含多个级联的边界框解码阶段,依据前一阶段的结果细化当前预测;各阶段的模型参数相互独立,带来巨大的开销。本文发现这种解码阶段的通用设置实际上是冗余的:只需共享参数并构建递归解码器,检测器即可获得显著提升。递归解码器还可通过提案框的位置编码(PE)进一步增强,使其感知输入边界框的确切位置与尺寸,从而在递归过程中自适应地处理来自不同阶段的提案。此外,我们还设计了基于中心度的PE,以区分边界框内不同位置的RoI特征元素与动态卷积核。为验证所提方法的有效性,我们进行了充分的消融实验,并在三个近期主流的区域检测器上构建了完整模型。RecursiveDet能够以更少的模型参数和略微增加的计算成本取得明显的性能提升。代码可在 https://github.com/bravezzzzzz/RecursiveDet 获取。
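
A sketch of a parameter-shared recursive decoder with box-conditioned positional encoding. The layer choices and the linear PE form are assumptions, not RecursiveDet's exact design:

```python
import torch
import torch.nn as nn

class RecursiveDecoder(nn.Module):
    def __init__(self, dim=256, stages=6):
        super().__init__()
        # One decoder stage reused across all iterations (shared parameters).
        self.stage = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.box_pe = nn.Linear(4, dim)      # PE from (cx, cy, w, h) proposals
        self.box_head = nn.Linear(dim, 4)
        self.stages = stages

    def forward(self, queries, memory, boxes):
        for _ in range(self.stages):          # same weights every iteration
            q = queries + self.box_pe(boxes)  # box-aware positional encoding
            queries = self.stage(q, memory)
            boxes = boxes + self.box_head(queries)  # refine proposals in place
        return boxes, queries

dec = RecursiveDecoder()
b, q = dec(torch.randn(1, 100, 256), torch.randn(1, 400, 256), torch.rand(1, 100, 4))
print(b.shape)  # torch.Size([1, 100, 4])
```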

Object-based Probabilistic Similarity Evidence of Sparse Latent Features from Fully Convolutional Networks

  • paper_url: http://arxiv.org/abs/2307.13606
  • repo_url: https://github.com/cjuliani/probabilistic-similarity-evidence-FCN
  • paper_authors: Cyril Juliani
  • for: Exploring neural-network-based similarity analysis to better understand and categorize complex patterns.
  • methods: Uses latent representations generated by a fully convolutional network (FCN) and applies fuzzy inference to estimate the visual resemblance of objects segmented in 2D images.
  • results: Weighting latent variables by their significance improves the identification of objects' visual features and the accuracy of the similarity analysis.
    Abstract Similarity analysis using neural networks has emerged as a powerful technique for understanding and categorizing complex patterns in various domains. By leveraging the latent representations learned by neural networks, data objects such as images can be compared effectively. This research explores the utilization of latent information generated by fully convolutional networks (FCNs) in similarity analysis, notably to estimate the visual resemblance of objects segmented in 2D pictures. To do this, the analytical scheme comprises two steps: (1) extracting and transforming feature patterns per 2D object from a trained FCN, and (2) identifying the most similar patterns through fuzzy inference. The step (2) can be further enhanced by incorporating a weighting scheme that considers the significance of latent variables in the analysis. The results provide valuable insights into the benefits and challenges of employing neural network-based similarity analysis for discerning data patterns effectively.
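
As a rough illustration of the two-step scheme, the sketch below scores per-object latent vectors (assumed already extracted from a trained FCN) with a fuzzy-style membership and a significance weighting over latent dimensions; the variance-based weights here are our assumption, standing in for the paper's weighting scheme.

```python
import numpy as np

def weighted_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Score candidate objects (M, D) against a query object's features (D,)."""
    w = candidates.var(axis=0)
    w = w / (w.sum() + 1e-12)                       # normalized significance weights
    diff = np.abs(candidates - query)               # per-dimension dissimilarity
    # Fuzzy-style membership: map dissimilarity into [0, 1] similarity evidence.
    evidence = 1.0 - diff / (diff.max(axis=0) + 1e-12)
    return (evidence * w).sum(axis=1)               # weighted aggregation per object

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 64))                   # 10 segmented objects, 64-d latents
scores = weighted_similarity(feats[0], feats)
print(scores.argsort()[::-1][:3])                   # most similar objects first
```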

Decisive Data using Multi-Modality Optical Sensors for Advanced Vehicular Systems

  • paper_url: http://arxiv.org/abs/2307.13600
  • repo_url: None
  • paper_authors: Muhammad Ali Farooq, Waseem Shariff, Mehdi Sefidgar Dilmaghani, Wang Yao, Moazam Soomro, Peter Corcoran
  • for: Designing and developing state-of-the-art optical technologies for out-cabin forward vision systems and in-cabin driver monitoring systems in modern vehicles.
  • methods: Surveys multiple optical sensing modalities, including longwave infrared (LWIR) thermal cameras, near-infrared (NIR) cameras, neuromorphic/event cameras, visible CMOS cameras, and depth cameras.
  • results: Describes how these optical technologies perform in real-world environments and the unique strengths each modality brings to different applications.
    Abstract Optical sensors have played a pivotal role in acquiring real world data for critical applications. This data, when integrated with advanced machine learning algorithms provides meaningful information thus enhancing human vision. This paper focuses on various optical technologies for design and development of state-of-the-art out-cabin forward vision systems and in-cabin driver monitoring systems. The focused optical sensors include Longwave Thermal Imaging (LWIR) cameras, Near Infrared (NIR), Neuromorphic/ event cameras, Visible CMOS cameras and Depth cameras. Further the paper discusses different potential applications which can be employed using the unique strengths of each these optical modalities in real time environment.

cs.AI - 2023-07-26

Improving International Climate Policy via Mutually Conditional Binding Commitments

  • paper_url: http://arxiv.org/abs/2307.14267
  • repo_url: None
  • paper_authors: Jobst Heitzig, Jörg Oechssler, Christoph Pröschel, Niranjana Ragavan, Yat Long Lo
  • for: Addressing a key challenge of the Paris Agreement: the unconditional nature of most Nationally Determined Contributions (NDCs), which enables free-riding among major emitters and leaves NDCs without concrete conditionality.
  • methods: Proposes a decentralized, bottom-up decision mechanism, the Conditional Commitment Mechanism, inspired by the National Popular Vote Interstate Compact, offering flexibility and incentives for early adopters.
  • results: Presents an overview of the mechanism, its performance in the AI4ClimateCooperation challenge, and aspects of potential real-world implementation.
    Abstract The Paris Agreement, considered a significant milestone in climate negotiations, has faced challenges in effectively addressing climate change due to the unconditional nature of most Nationally Determined Contributions (NDCs). This has resulted in a prevalence of free-riding behavior among major polluters and a lack of concrete conditionality in NDCs. To address this issue, we propose the implementation of a decentralized, bottom-up approach called the Conditional Commitment Mechanism. This mechanism, inspired by the National Popular Vote Interstate Compact, offers flexibility and incentives for early adopters, aiming to formalize conditional cooperation in international climate policy. In this paper, we provide an overview of the mechanism, its performance in the AI4ClimateCooperation challenge, and discuss potential real-world implementation aspects. Prior knowledge of the climate mitigation collective action problem, basic economic principles, and game theory concepts are assumed.

Improving International Climate Policy via Mutually Conditional Binding Commitments

  • paper_url: http://arxiv.org/abs/2307.14266
  • repo_url: None
  • paper_authors: Jobst Heitzig, Jörg Oechssler, Christoph Pröschel, Niranjana Ragavan, Richie YatLong Lo
  • for: Improving the realism of simulated international climate policy negotiations.
  • methods: Enhances the RICE-N simulation and multi-agent reinforcement learning framework, building on the Conditional Commitments Mechanism (CCF mechanism); proposes a recommender or planner agent for coordination, social factors and non-party stakeholder sub-agents to close the Real2Sim gap, and improvements to the underlying reinforcement learning algorithm.
  • results: Outlines ways to narrow the gap between simulation and reality and to strengthen coordination, aiming at more effective international climate policy decision-making in RICE-N; further experimentation is needed to assess these suggestions.
    Abstract This paper proposes enhancements to the RICE-N simulation and multi-agent reinforcement learning framework to improve the realism of international climate policy negotiations. Acknowledging the framework's value, we highlight the necessity of significant enhancements to address the diverse array of factors in modeling climate negotiations. Building upon our previous work on the "Conditional Commitments Mechanism" (CCF mechanism) we discuss ways to bridge the gap between simulation and reality. We suggest the inclusion of a recommender or planner agent to enhance coordination, address the Real2Sim gap by incorporating social factors and non-party stakeholder sub-agents, and propose enhancements to the underlying Reinforcement Learning solution algorithm. These proposed improvements aim to advance the evaluation and formulation of negotiation protocols for more effective international climate policy decision-making in Rice-N. However, further experimentation and testing are required to determine the implications and effectiveness of these suggestions.

The flow of ideas in word embeddings

  • paper_url: http://arxiv.org/abs/2307.16819
  • repo_url: None
  • paper_authors: Debayan Dasgupta
  • for: Investigating the similarity-based flow of ideas in language models.
  • methods: Adopts tools from microrheology and a random walker in word embeddings.
  • results: The walks show signatures of anomalous diffusion and a potential association with creativity.
    Abstract The flow of ideas has been extensively studied by physicists, psychologists, and machine learning engineers. This paper adopts specific tools from microrheology to investigate the similarity-based flow of ideas. We introduce a random walker in word embeddings and study its behavior. Such similarity-mediated random walks through the embedding space show signatures of anomalous diffusion commonly observed in complex structured systems such as biological cells and complex fluids. The paper concludes by proposing the application of popular tools employed in the study of random walks and diffusion of particles under Brownian motion to assess quantitatively the incorporation of diverse ideas in a document. Overall, this paper presents a self-referenced method combining microrheology and machine learning concepts to explore the meandering tendencies of language models and their potential association with creativity.
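
To make the microrheology analogy concrete, here is a toy version of the experiment: a walker that hops between nearest neighbors in an embedding space, with its mean squared displacement (MSD) fit for a diffusion exponent (MSD ∝ lag^α, with α = 1 for normal diffusion). The embeddings below are synthetic placeholders, and the neighborhood size k is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=(5000, 100))                  # stand-in for word embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def similarity_walk(start: int, steps: int, k: int = 20) -> np.ndarray:
    """Walk by jumping to a random one of the k most similar words."""
    path = [start]
    for _ in range(steps):
        sims = emb @ emb[path[-1]]
        sims[path[-1]] = -np.inf                    # never stay in place
        neighbors = np.argpartition(-sims, k)[:k]
        path.append(int(rng.choice(neighbors)))
    return emb[path]                                # trajectory in embedding space

traj = similarity_walk(start=0, steps=500)
lags = np.arange(1, 100)
msd = np.array([np.mean(np.sum((traj[l:] - traj[:-l]) ** 2, axis=1)) for l in lags])
alpha = np.polyfit(np.log(lags), np.log(msd), 1)[0]
print(f"diffusion exponent alpha ~= {alpha:.2f}")   # != 1 suggests anomalous diffusion
```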

Visual Saliency Detection in Advanced Driver Assistance Systems

  • paper_url: http://arxiv.org/abs/2308.03770
  • repo_url: None
  • paper_authors: Francesco Rundo, Michael Sebastian Rundo, Concetto Spampinato
  • for: Proposing an intelligent system that jointly assesses the driver's attention level and the saliency of the driving scene.
  • methods: Combines a 3D deep network for semantic segmentation of frames from an automotive-grade external camera, a dedicated 1D temporal deep convolutional network that classifies PhotoPlethysmoGraphy (PPG) signals from a steering-wheel biosensor, and a hardware accelerator on an embedded STA1295 platform.
  • results: Extensive experiments show the system reliably relates the driver's attentiveness to the saliency-based scene classification, improving overall driving safety.
    Abstract Visual Saliency refers to the innate human mechanism of focusing on and extracting important features from the observed environment. Recently, there has been a notable surge of interest in the field of automotive research regarding the estimation of visual saliency. While operating a vehicle, drivers naturally direct their attention towards specific objects, employing brain-driven saliency mechanisms that prioritize certain elements over others. In this investigation, we present an intelligent system that combines a drowsiness detection system for drivers with a scene comprehension pipeline based on saliency. To achieve this, we have implemented a specialized 3D deep network for semantic segmentation, which has been pretrained and tailored for processing the frames captured by an automotive-grade external camera. The proposed pipeline was hosted on an embedded platform utilizing the STA1295 core, featuring ARM A7 dual-cores, and embeds an hardware accelerator. Additionally, we employ an innovative biosensor embedded on the car steering wheel to monitor the driver drowsiness, gathering the PhotoPlethysmoGraphy (PPG) signal of the driver. A dedicated 1D temporal deep convolutional network has been devised to classify the collected PPG time-series, enabling us to assess the driver level of attentiveness. Ultimately, we compare the determined attention level of the driver with the corresponding saliency-based scene classification to evaluate the overall safety level. The efficacy of the proposed pipeline has been validated through extensive experimental results.

A New Perspective on Evaluation Methods for Explainable Artificial Intelligence (XAI)

  • paper_url: http://arxiv.org/abs/2307.14246
  • repo_url: None
  • paper_authors: Timo Speith, Markus Langer
  • for: Examining the growing importance of Explainable Artificial Intelligence (XAI) in Requirements Engineering (RE), where explainability is a non-functional requirement that impacts system quality.
  • methods: Critically examines the supposed trade-off between explainability and performance and argues for a nuanced approach that weighs resource availability, domain characteristics, and risk.
  • results: Concludes that, depending on resources and domain characteristics, explainability and performance can be balanced rather than traded off wholesale, providing a foundation for future research and best practices in RE for AI.
    Abstract Within the field of Requirements Engineering (RE), the increasing significance of Explainable Artificial Intelligence (XAI) in aligning AI-supported systems with user needs, societal expectations, and regulatory standards has garnered recognition. In general, explainability has emerged as an important non-functional requirement that impacts system quality. However, the supposed trade-off between explainability and performance challenges the presumed positive influence of explainability. If meeting the requirement of explainability entails a reduction in system performance, then careful consideration must be given to which of these quality aspects takes precedence and how to compromise between them. In this paper, we critically examine the alleged trade-off. We argue that it is best approached in a nuanced way that incorporates resource availability, domain characteristics, and considerations of risk. By providing a foundation for future research and best practices, this work aims to advance the field of RE for AI.

Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI)

  • paper_url: http://arxiv.org/abs/2307.14239
  • repo_url: None
  • paper_authors: Barnaby Crook, Maximilian Schlüter, Timo Speith
  • for: This paper aims to advance the field of Requirements Engineering (RE) for Artificial Intelligence (AI) by critically examining the supposed trade-off between explainability and performance.
  • methods: The paper argues that the trade-off between explainability and performance should be approached in a nuanced way that incorporates resource availability, domain characteristics, and considerations of risk.
  • results: The paper provides a foundation for future research and best practices in RE for AI, with the goal of advancing the field.
    Abstract Within the field of Requirements Engineering (RE), the increasing significance of Explainable Artificial Intelligence (XAI) in aligning AI-supported systems with user needs, societal expectations, and regulatory standards has garnered recognition. In general, explainability has emerged as an important non-functional requirement that impacts system quality. However, the supposed trade-off between explainability and performance challenges the presumed positive influence of explainability. If meeting the requirement of explainability entails a reduction in system performance, then careful consideration must be given to which of these quality aspects takes precedence and how to compromise between them. In this paper, we critically examine the alleged trade-off. We argue that it is best approached in a nuanced way that incorporates resource availability, domain characteristics, and considerations of risk. By providing a foundation for future research and best practices, this work aims to advance the field of RE for AI.

UnScientify: Detecting Scientific Uncertainty in Scholarly Full Text

  • paper_url: http://arxiv.org/abs/2307.14236
  • repo_url: None
  • paper_authors: Panggih Kusuma Ningrum, Philipp Mayr, Iana Atanassova
  • for: Developing an interactive system for detecting scientific uncertainty in scholarly full text.
  • methods: Uses a weakly supervised technique with a fine-grained annotation scheme to identify verbally formulated uncertainty at the sentence level; the pipeline combines pattern matching, complex sentence checking, and authorial reference checking.
  • results: Automates the labeling and annotation of uncertainty markers, accounting for different types of scientific uncertainty, which can serve information retrieval, text mining, and scholarly document processing; results are interpretable, aiding comprehension of the identified uncertainty instances.
    Abstract This demo paper presents UnScientify, an interactive system designed to detect scientific uncertainty in scholarly full text. The system utilizes a weakly supervised technique that employs a fine-grained annotation scheme to identify verbally formulated uncertainty at the sentence level in scientific texts. The pipeline for the system includes a combination of pattern matching, complex sentence checking, and authorial reference checking. Our approach automates labeling and annotation tasks for scientific uncertainty identification, taking into account different types of scientific uncertainty, that can serve various applications such as information retrieval, text mining, and scholarly document processing. Additionally, UnScientify provides interpretable results, aiding in the comprehension of identified instances of scientific uncertainty in text.
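
The pattern-matching stage of such a pipeline can be approximated with a simple hedge-cue matcher at the sentence level. The cue list below is our own illustrative assumption and is far coarser than the paper's fine-grained annotation scheme (which also checks complex sentences and authorial references).

```python
import re

UNCERTAINTY_CUES = [
    r"\bmay\b", r"\bmight\b", r"\bcould\b", r"\bpossibl[ey]\b", r"\bsuggests?\b",
    r"\bappears? to\b", r"\blikely\b", r"\bit is unclear\b", r"\bremains? unknown\b",
]
CUE_RE = re.compile("|".join(UNCERTAINTY_CUES), re.IGNORECASE)

def tag_uncertain_sentences(text: str):
    """Split into sentences and flag those containing an uncertainty cue."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [(s, bool(CUE_RE.search(s))) for s in sentences]

doc = ("Our results suggest a strong correlation. The mechanism remains unknown. "
       "We measured a 12% increase.")
for sentence, uncertain in tag_uncertain_sentences(doc):
    print(f"[{'UNCERTAIN' if uncertain else 'certain  '}] {sentence}")
```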

Non-Linear Self Augmentation Deep Pipeline for Cancer Treatment outcome Prediction

  • paper_url: http://arxiv.org/abs/2307.14398
  • repo_url: None
  • paper_authors: Francesco Rundo, Concetto Spampinato, Michael Rundo
  • for: Improving the prediction of immunotherapy treatment outcomes so that eligible patients can be selected more reliably.
  • methods: Uses a novel non-linear cellular architecture together with a deep downstream classifier to select and enhance 2D features extracted from chest-abdomen CT images.
  • results: Achieves an overall accuracy of approximately 93%, including on Metastatic Urothelial Carcinoma (mUC), a particularly aggressive cancer.
    Abstract Immunotherapy emerges as a promising approach for treating cancer. Encouraging findings have validated the efficacy of immunotherapy medications in addressing tumors, resulting in prolonged survival rates and notable reductions in toxicity compared to conventional chemotherapy methods. However, the pool of eligible patients for immunotherapy remains relatively small, indicating a lack of comprehensive understanding regarding the physiological mechanisms responsible for favorable treatment response in certain individuals while others experience limited benefits. To tackle this issue, the authors present an innovative strategy that harnesses a non-linear cellular architecture in conjunction with a deep downstream classifier. This approach aims to carefully select and enhance 2D features extracted from chest-abdomen CT images, thereby improving the prediction of treatment outcomes. The proposed pipeline has been meticulously designed to seamlessly integrate with an advanced embedded Point of Care system. In this context, the authors present a compelling case study focused on Metastatic Urothelial Carcinoma (mUC), a particularly aggressive form of cancer. Performance evaluation of the proposed approach underscores its effectiveness, with an impressive overall accuracy of approximately 93%.

Sources of Opacity in Computer Systems: Towards a Comprehensive Taxonomy

  • paper_url: http://arxiv.org/abs/2307.14232
  • repo_url: None
  • paper_authors: Sara Mann, Barnaby Crook, Lena Kästner, Astrid Schomäcker, Timo Speith
  • for: Improving the transparency of modern computer systems, which is crucial in domains where fairness or accountability matter.
  • methods: Proposes a taxonomy of eight sources of opacity grouped into three main categories, architectural, analytical, and socio-technical, with initial suggestions for addressing each source in practice.
  • results: Provides a context-dependent taxonomy that helps requirements engineers and other practitioners identify the prevalent sources of opacity in a given context and select or develop appropriate strategies to overcome them.
    Abstract Modern computer systems are ubiquitous in contemporary life yet many of them remain opaque. This poses significant challenges in domains where desiderata such as fairness or accountability are crucial. We suggest that the best strategy for achieving system transparency varies depending on the specific source of opacity prevalent in a given context. Synthesizing and extending existing discussions, we propose a taxonomy consisting of eight sources of opacity that fall into three main categories: architectural, analytical, and socio-technical. For each source, we provide initial suggestions as to how to address the resulting opacity in practice. The taxonomy provides a starting point for requirements engineers and other practitioners to understand contextually prevalent sources of opacity, and to select or develop appropriate strategies for overcoming them.

Explore the possibility of advancing climate negotiations on the basis of regional trade organizations: A study based on RICE-N

  • paper_url: http://arxiv.org/abs/2307.14226
  • repo_url: None
  • paper_authors: Wubo Dai
  • for: Providing new theoretical support for climate negotiations, given the currently unclear prospects for international cooperation.
  • methods: Uses deep learning to build an agent-based model (ABM) on top of RICE-N and simulates climate negotiations organized around existing trade groups.
  • results: Simulation results show that the proposed scheme has good prospects.
    Abstract Climate issues have become more and more important now. Although global governments have made some progress, we are still facing the truth that the prospect of international cooperation is not clear at present. Due to the limitations of the Integrated assessment models (IAMs) model, it is difficult to simulate the dynamic negotiation process. Therefore, using deep learning to build a new agents based model (ABM) might can provide new theoretical support for climate negotiations. Building on the RICE-N model, this work proposed an approach to climate negotiations based on existing trade groups. Simulation results show that the scheme has a good prospect.

AI and Education: An Investigation into the Use of ChatGPT for Systems Thinking

  • paper_url: http://arxiv.org/abs/2307.14206
  • repo_url: None
  • paper_authors: Holger Arndt
  • for: This exploratory study investigates whether the AI tool ChatGPT can support systems thinking (ST) across subjects.
  • methods: Uses both general and subject-specific prompts to assess the accuracy, helpfulness, and reliability of ChatGPT's responses across different versions of the tool.
  • results: ChatGPT provides largely correct and very helpful answers across subjects, suggesting it can enhance ST skills, though occasional inaccuracies mean users must remain critical. Despite some limitations, with careful use and attention to its idiosyncrasies, ChatGPT can be a valuable tool for teaching and learning ST.
    Abstract This exploratory study investigates the potential of the artificial intelligence tool, ChatGPT, to support systems thinking (ST) in various subjects. Using both general and subject specific prompts, the study assesses the accuracy, helpfulness, and reliability of ChatGPT's responses across different versions of the tool. The results indicate that ChatGPT can provide largely correct and very helpful responses in various subjects, demonstrating its potential as a tool for enhancing ST skills. However, occasional inaccuracies highlight the need for users to remain critical of ChatGPT's responses. Despite some limitations, this study suggests that with careful use and attention to its idiosyncrasies, ChatGPT can be a valuable tool for teaching and learning ST.

Unveiling Security, Privacy, and Ethical Concerns of ChatGPT

  • paper_url: http://arxiv.org/abs/2307.14192
  • repo_url: None
  • paper_authors: Xiaodong Wu, Ran Duan, Jianbing Ni
  • for: Examining how ChatGPT is applied across domains and its security, privacy, and ethical implications.
  • methods: Traces the upgrade path from GPT-1 to GPT-4 and discusses the model's features (including its use of topic modeling and reinforcement learning), limitations, and potential applications.
  • results: Identifies the potential risks of integrating ChatGPT into daily life, focusing on security, privacy, and ethics, and lays out open problems that call for concerted efforts toward secure and ethically sound large language models.
    Abstract This paper delves into the realm of ChatGPT, an AI-powered chatbot that utilizes topic modeling and reinforcement learning to generate natural responses. Although ChatGPT holds immense promise across various industries, such as customer service, education, mental health treatment, personal productivity, and content creation, it is essential to address its security, privacy, and ethical implications. By exploring the upgrade path from GPT-1 to GPT-4, discussing the model's features, limitations, and potential applications, this study aims to shed light on the potential risks of integrating ChatGPT into our daily lives. Focusing on security, privacy, and ethics issues, we highlight the challenges these concerns pose for widespread adoption. Finally, we analyze the open problems in these areas, calling for concerted efforts to ensure the development of secure and ethically sound large language models.

LOIS: Looking Out of Instance Semantics for Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.14142
  • repo_url: None
  • paper_authors: Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Haibo Shi, Haoran Wang
  • for: Improving the reasoning ability of visual question answering (VQA) models, particularly their understanding of relationships between object semantics in images.
  • methods: Proposes a bounding-box-free framework, Looking Out of Instance Semantics (LOIS), with finer-grained feature descriptions and two relation attention modules, 1) intra-modality and 2) inter-modality, to infer answers from multi-view features; a mutual relation attention module models deeper visual semantic relations between instance objects and background information.
  • results: Experiments on four benchmark VQA datasets show favorable performance in improving visual reasoning capability.
    Abstract Visual question answering (VQA) has been intensively studied as a multimodal task that requires effort in bridging vision and language to infer answers correctly. Recent attempts have developed various attention-based modules for solving VQA tasks. However, the performance of model inference is largely bottlenecked by visual processing for semantics understanding. Most existing detection methods rely on bounding boxes, remaining a serious challenge for VQA models to understand the causal nexus of object semantics in images and correctly infer contextual information. To this end, we propose a finer model framework without bounding boxes in this work, termed Looking Out of Instance Semantics (LOIS) to tackle this important issue. LOIS enables more fine-grained feature descriptions to produce visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules: 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from the different multi-view features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important word-related questions. Experimental results on four benchmark VQA datasets prove that our proposed method has favorable performance in improving visual reasoning capability.

Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards

  • paper_url: http://arxiv.org/abs/2307.14138
  • repo_url: None
  • paper_authors: Behzad Nourani-Koliji, Steven Bilaj, Amir Rezaei Balef, Setareh Maghsudi
  • for: Solving the piecewise stationary combinatorial semi-bandit problem with causally related rewards, where changes in base-arm distributions, causal relationships between rewards, or both alter the reward-generation process.
  • methods: Builds on the Upper Confidence Bound (UCB) algorithm with an adaptive change-point detector based on the Generalized Likelihood Ratio (GLR) test; introduces a group restart strategy for structured environments and a mechanism to trace variations of the underlying graph capturing the causal relationships between rewards.
  • results: Establishes a regret upper bound that reflects the effect of the number of structural and distributional changes, and outperforms state-of-the-art benchmarks in real-world scenarios.
    Abstract We study the piecewise stationary combinatorial semi-bandit problem with causally related rewards. In our nonstationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process. In such an environment, an optimal decision-maker must follow both sources of change and adapt accordingly. The problem becomes aggravated in the combinatorial semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms. The core of our proposed policy is the Upper Confidence Bound (UCB) algorithm. We assume the agent relies on an adaptive approach to overcome the challenge. More specifically, it employs a change-point detector based on the Generalized Likelihood Ratio (GLR) test. Besides, we introduce the notion of group restart as a new alternative restarting strategy in the decision making process in structured environments. Finally, our algorithm integrates a mechanism to trace the variations of the underlying graph structure, which captures the causal relationships between the rewards in the bandit setting. Theoretically, we establish a regret upper bound that reflects the effects of the number of structural- and distribution changes on the performance. The outcome of our numerical experiments in real-world scenarios exhibits applicability and superior performance of our proposal compared to the state-of-the-art benchmarks.
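
The change-point component is the most reusable piece of this recipe. Below is a hedged sketch of a Generalized Likelihood Ratio test for a single mean shift in a reward stream, in the unit-variance Gaussian case with an arbitrary fixed threshold; the paper's detector and its tuning are more involved.

```python
import numpy as np

def glr_change_detected(rewards: np.ndarray, threshold: float = 10.0) -> bool:
    """Test H0 (one constant mean) against H1 (one mean shift at some split s)."""
    n = len(rewards)
    if n < 4:
        return False
    total_mean = rewards.mean()
    best = 0.0
    for s in range(2, n - 1):                       # candidate change points
        left, right = rewards[:s], rewards[s:]
        # Log-likelihood gain of the two-mean model over the one-mean model
        # (unit-variance Gaussian) = half the reduction in squared residuals.
        gain = 0.5 * (((rewards - total_mean) ** 2).sum()
                      - ((left - left.mean()) ** 2).sum()
                      - ((right - right.mean()) ** 2).sum())
        best = max(best, gain)
    return best > threshold

rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0.2, 1, 200), rng.normal(1.5, 1, 200)])
print(glr_change_detected(stream))                  # True: the mean shifts at t=200
```

When the detector fires, a UCB-style policy would discard (restart) its statistics; the paper's group restart extends this by restarting related arms together.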

Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models

  • paper_url: http://arxiv.org/abs/2307.14134
  • repo_url: None
  • paper_authors: Himmet Toprak Kesgin, Muzaffer Kaan Yuce, Mehmet Fatih Amasyali
  • for: This paper aims to bridge the research gap in less-resourced languages by introducing and evaluating tiny, mini, small, and medium-sized uncased Turkish BERT models.
  • methods: The authors trained these models on a diverse dataset encompassing over 75GB of text from multiple sources and tested them on several tasks, including mask prediction, sentiment analysis, news classification, and zero-shot classification.
  • results: Despite their smaller size, the models exhibited robust performance, including on zero-shot tasks, while ensuring computational efficiency and faster execution times.
    Abstract This study introduces and evaluates tiny, mini, small, and medium-sized uncased Turkish BERT models, aiming to bridge the research gap in less-resourced languages. We trained these models on a diverse dataset encompassing over 75GB of text from multiple sources and tested them on several tasks, including mask prediction, sentiment analysis, news classification, and, zero-shot classification. Despite their smaller size, our models exhibited robust performance, including zero-shot task, while ensuring computational efficiency and faster execution times. Our findings provide valuable insights into the development and application of smaller language models, especially in the context of the Turkish language.

A semantics-driven methodology for high-quality image annotation

  • paper_url: http://arxiv.org/abs/2307.14119
  • repo_url: None
  • paper_authors: Fausto Giunchiglia, Mayukh Bagchi, Xiaolei Diao
  • for: Proposing a methodology that combines natural language processing, knowledge representation, and computer vision to reduce subjective choices in image annotation.
  • methods: Uses the WordNet lexico-semantic hierarchy to make the intended annotation semantics explicit, driving image annotation by the objects and visual properties the images depict.
  • results: Validated on images from a subset of the ImageNet hierarchy, showing that the approach reduces reliance on annotators' subjective judgment.
    Abstract Recent work in Machine Learning and Computer Vision has highlighted the presence of various types of systematic flaws inside ground truth object recognition benchmark datasets. Our basic tenet is that these flaws are rooted in the many-to-many mappings which exist between the visual information encoded in images and the intended semantics of the labels annotating them. The net consequence is that the current annotation process is largely under-specified, thus leaving too much freedom to the subjective judgment of annotators. In this paper, we propose vTelos, an integrated Natural Language Processing, Knowledge Representation, and Computer Vision methodology whose main goal is to make explicit the (otherwise implicit) intended annotation semantics, thus minimizing the number and role of subjective choices. A key element of vTelos is the exploitation of the WordNet lexico-semantic hierarchy as the main means for providing the meaning of natural language labels and, as a consequence, for driving the annotation of images based on the objects and the visual properties they depict. The methodology is validated on images populating a subset of the ImageNet hierarchy.
    摘要 In this paper, we propose vTelos, an integrated methodology that combines Natural Language Processing, Knowledge Representation, and Computer Vision to explicitly define the intended annotation semantics. This approach leverages the WordNet lexico-semantic hierarchy to provide meaning to natural language labels and drive the annotation of images based on the objects and visual properties they depict. We validate the methodology on a subset of the ImageNet hierarchy.

GraphRNN Revisited: An Ablation Study and Extensions for Directed Acyclic Graphs

  • paper_url: http://arxiv.org/abs/2307.14109
  • repo_url: None
  • paper_authors: Taniya Das, Mark Koch, Maya Ravichandran, Nikhil Khatri
  • for: Learning generative models for graphs.
  • methods: Reproduces the GraphRNN architecture of You et al., evaluates it against baseline models with new metrics, and runs an ablation study; extends GraphRNN to directed acyclic graphs by replacing the BFS traversal with a topological sort.
  • results: (1) The BFS traversal proposed by You et al., which collapses representations of isomorphic graphs, contributes significantly to model performance; (2) the topological-sort extension improves significantly over a directed-multiclass GraphRNN variant on a real-world dataset.
    Abstract GraphRNN is a deep learning-based architecture proposed by You et al. for learning generative models for graphs. We replicate the results of You et al. using a reproduced implementation of the GraphRNN architecture and evaluate this against baseline models using new metrics. Through an ablation study, we find that the BFS traversal suggested by You et al. to collapse representations of isomorphic graphs contributes significantly to model performance. Additionally, we extend GraphRNN to generate directed acyclic graphs by replacing the BFS traversal with a topological sort. We demonstrate that this method improves significantly over a directed-multiclass variant of GraphRNN on a real-world dataset.
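
The DAG extension hinges on one property of topological orderings: every edge points from an earlier node to a later one, so the autoregressive model only ever has to predict edges back to already-generated nodes, exactly as with BFS orderings on undirected graphs. A minimal sketch with Python's standard library (3.9+):

```python
from graphlib import TopologicalSorter

def dag_node_ordering(edges):
    """edges: list of (u, v) meaning u -> v; returns a topological order."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    return list(TopologicalSorter(preds).static_order())

edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
order = dag_node_ordering(edges)
index = {n: i for i, n in enumerate(order)}
# Every edge points "backwards" under this ordering, matching GraphRNN's
# edge-to-previous-nodes factorization.
assert all(index[u] < index[v] for u, v in edges)
print(order)
```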

Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks

  • paper_url: http://arxiv.org/abs/2307.14085
  • repo_url: None
  • paper_authors: Siyu Chen, Mengdi Wang, Zhuoran Yang
  • for: Learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure.
  • methods: Models the follower's quantal response via entropy-regularized policy optimization; proposes sample-efficient algorithms for both online and offline settings that combine maximum likelihood estimation of the quantal response model with model-free or model-based reinforcement learning.
  • results: Achieves sublinear regret upper bounds, quantifies estimator uncertainty to build optimistic (online) and pessimistic (offline) algorithms, and is computationally efficient when specialized to the linear and myopic setting.
    Abstract We study reinforcement learning (RL) for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure. In specific, at the outset of the game, the leader announces her policy to the follower and commits to it. The follower observes the leader's policy and, in turn, adopts a quantal response policy by solving an entropy-regularized policy optimization problem induced by leader's policy. The goal of the leader is to find her optimal policy, which yields the optimal expected total return, by interacting with the follower and learning from data. A key challenge of this problem is that the leader cannot observe the follower's reward, and needs to infer the follower's quantal response model from his actions against leader's policies. We propose sample-efficient algorithms for both the online and offline settings, in the context of function approximation. Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision making problem, and we show that they achieve sublinear regret upper bounds. Moreover, we quantify the uncertainty of these estimators and leverage the uncertainty to implement optimistic and pessimistic algorithms for online and offline settings. Besides, when specialized to the linear and myopic setting, our algorithms are also computationally efficient. Our theoretical analysis features a novel performance-difference lemma which incorporates the error of quantal response model, which might be of independent interest.
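
The follower's quantal response has a closed form worth spelling out: maximizing expected value plus an entropy bonus yields a softmax over action values. A one-shot numpy sketch (the payoff matrices are made up for illustration; the paper works with episodic Markov games, not matrix games):

```python
import numpy as np

def quantal_response(follower_payoff: np.ndarray, leader_policy: np.ndarray,
                     eta: float = 1.0) -> np.ndarray:
    """argmax_pi <pi, q> + eta * H(pi) = softmax(q / eta), where q holds the
    follower's expected action values under the announced leader policy."""
    q = follower_payoff @ leader_policy             # expected value per follower action
    z = q / eta
    z -= z.max()                                    # numerical stability
    pi = np.exp(z)
    return pi / pi.sum()

leader_payoff = np.array([[3.0, 0.0], [1.0, 2.0]])    # leader actions x follower actions
follower_payoff = np.array([[2.0, 1.0], [0.0, 3.0]])  # follower actions x leader actions
leader_policy = np.array([0.7, 0.3])                  # the committed, announced policy
pi_follower = quantal_response(follower_payoff, leader_policy)
leader_return = leader_policy @ leader_payoff @ pi_follower
print(pi_follower, leader_return)
```

This softmax form is also why the leader can estimate the follower's model from actions alone: it makes maximum likelihood estimation of the response well posed without ever observing the follower's rewards.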

Learning to simulate partially known spatio-temporal dynamics with trainable difference operators

  • paper_url: http://arxiv.org/abs/2307.14395
  • repo_url: None
  • paper_authors: Xiang Huang, Zhuoyuan Li, Hongsheng Liu, Zidong Wang, Hongye Zhou, Bin Dong, Bei Hua
  • for: Simulating spatio-temporal dynamics with neural networks; most existing approaches are purely data-driven black-box models with limited accuracy and interpretability.
  • methods: Proposes PDE-Net++, a hybrid architecture that combines trainable difference operators with black-box models and explicitly embeds partial prior knowledge of the underlying PDEs; two difference-layer variants are introduced, the trainable flipping difference layer (TFDL) and the trainable dynamic difference layer (TDDL).
  • results: Numerical experiments show that PDE-Net++ achieves higher prediction accuracy and better extrapolation than black-box models.
    Abstract Recently, using neural networks to simulate spatio-temporal dynamics has received a lot of attention. However, most existing methods adopt pure data-driven black-box models, which have limited accuracy and interpretability. By combining trainable difference operators with black-box models, we propose a new hybrid architecture explicitly embedded with partial prior knowledge of the underlying PDEs named PDE-Net++. Furthermore, we introduce two distinct options called the trainable flipping difference layer (TFDL) and the trainable dynamic difference layer (TDDL) for the difference operators. Numerous numerical experiments have demonstrated that PDE-Net++ has superior prediction accuracy and better extrapolation performance than black-box models.
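
The "trainable difference operator" idea can be illustrated in a few lines: a convolution whose kernel is initialized to a finite-difference stencil encodes the PDE prior, but remains free to adapt during training. This is a generic sketch of the concept, not the paper's exact TFDL/TDDL layers.

```python
import torch
import torch.nn as nn

class TrainableDiff2(nn.Module):
    """Approximates d2u/dx2 on a uniform 1D grid with spacing dx."""
    def __init__(self, dx: float):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
        stencil = torch.tensor([[[1.0, -2.0, 1.0]]]) / dx ** 2
        self.conv.weight = nn.Parameter(stencil)    # prior knowledge as initialization

    def forward(self, u):                           # u: (batch, 1, grid)
        return self.conv(u)

x = torch.linspace(0, 2 * torch.pi, 64).view(1, 1, -1)
u = torch.sin(x)
op = TrainableDiff2(dx=float(x[0, 0, 1] - x[0, 0, 0]))
# Sanity check away from the boundary: d2/dx2 sin(x) = -sin(x).
print(op(u)[0, 0, 10].item(), (-torch.sin(x))[0, 0, 10].item())
```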

Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching

  • paper_url: http://arxiv.org/abs/2307.14071
  • repo_url: None
  • paper_authors: Junpeng Jing, Jiankun Li, Pengfei Xiong, Jiangyu Liu, Shuaicheng Liu, Yichen Guo, Xin Deng, Mai Xu, Lai Jiang, Leonid Sigal
  • for: Making stereo matching robust and reliable enough for real-world deployment.
  • methods: Proposes an Uncertainty Guided Adaptive Correlation (UGAC) module: variance-based uncertainty estimation adaptively adjusts the sampling area during the warping operation, and the traditional non-parametric warping is replaced with learnable position-specific weights.
  • results: Achieves state-of-the-art performance on the ETH3D, KITTI, and Middlebury datasets with a single fixed model and no retraining; a lightweight UGAC-based model with only 0.6M parameters also outperforms other methods on KITTI, targeting real-time applications.
    Abstract Correlation based stereo matching has achieved outstanding performance, which pursues cost volume between two feature maps. Unfortunately, current methods with a fixed model do not work uniformly well across various datasets, greatly limiting their real-world applicability. To tackle this issue, this paper proposes a new perspective to dynamically calculate correlation for robust stereo matching. A novel Uncertainty Guided Adaptive Correlation (UGAC) module is introduced to robustly adapt the same model for different scenarios. Specifically, a variance-based uncertainty estimation is employed to adaptively adjust the sampling area during warping operation. Additionally, we improve the traditional non-parametric warping with learnable parameters, such that the position-specific weights can be learned. We show that by empowering the recurrent network with the UGAC module, stereo matching can be exploited more robustly and effectively. Extensive experiments demonstrate that our method achieves state-of-the-art performance over the ETH3D, KITTI, and Middlebury datasets when employing the same fixed model over these datasets without any retraining procedure. To target real-time applications, we further design a lightweight model based on UGAC, which also outperforms other methods over KITTI benchmarks with only 0.6 M parameters.

Hypergraph Isomorphism Computation

  • paper_url: http://arxiv.org/abs/2307.14394
  • repo_url: None
  • paper_authors: Yifan Feng, Jiashu Han, Shihui Ying, Yue Gao
  • for: Capturing high-order structural information by solving the hypergraph isomorphism test problem, generalizing the Weisfeiler-Lehman test from graphs to hypergraphs.
  • methods: Proposes the hypergraph Weisfeiler-Lehman test algorithm and, on top of it, a general hypergraph Weisfeiler-Lehman kernel framework with two instances: the Hypergraph Weisfeiler-Lehman Subtree Kernel and the Hypergraph Weisfeiler-Lehman Hyperedge Kernel.
  • results: On 12 hypergraph classification datasets (plus seven graph classification datasets), the proposed kernels significantly outperform other typical kernel-based methods while running over 80 times faster than the second-best method on complex hypergraph structures.
    Abstract The isomorphism problem is a fundamental problem in network analysis, which involves capturing both low-order and high-order structural information. In terms of extracting low-order structural information, graph isomorphism algorithms analyze the structural equivalence to reduce the solver space dimension, which demonstrates its power in many applications, such as protein design, chemical pathways, and community detection. For the more commonly occurring high-order relationships in real-life scenarios, the problem of hypergraph isomorphism, which effectively captures these high-order structural relationships, cannot be straightforwardly addressed using graph isomorphism methods. Besides, the existing hypergraph kernel methods may suffer from high memory consumption or inaccurate sub-structure identification, thus yielding sub-optimal performance. In this paper, to address the abovementioned problems, we first propose the hypergraph Weisfeiler-Lehman test algorithm for the hypergraph isomorphism test problem by generalizing the Weisfeiler-Lehman test algorithm from graphs to hypergraphs. Secondly, based on the presented algorithm, we propose a general hypergraph Weisfeiler-Lehman kernel framework and implement two instances, which are the Hypergraph Weisfeiler-Lehman Subtree Kernel and the Hypergraph Weisfeiler-Lehman Hyperedge Kernel. In order to fulfill our research objectives, a comprehensive set of experiments was meticulously designed, including seven graph classification datasets and 12 hypergraph classification datasets. Results on hypergraph classification datasets show significant improvements compared to other typical kernel-based methods, which demonstrates the effectiveness of the proposed methods. In our evaluation, we found that our proposed methods outperform the second-best method in terms of runtime, running over 80 times faster when handling complex hypergraph structures.
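
One color-refinement iteration of a WL-style test, generalized so that vertices aggregate label multisets over hyperedges rather than over pairwise neighbors, can be sketched as follows; this is a simplified illustration of the idea, not the authors' algorithm.

```python
def hypergraph_wl_iteration(vertex_labels, hyperedges):
    """vertex_labels: {v: label}; hyperedges: list of vertex sets."""
    # Step 1: each hyperedge gets the sorted multiset of its members' labels.
    edge_sigs = [tuple(sorted(vertex_labels[v] for v in e)) for e in hyperedges]
    # Step 2: each vertex hashes its label with the signatures of incident edges.
    new_labels = {}
    for v, lbl in vertex_labels.items():
        incident = sorted(sig for sig, e in zip(edge_sigs, hyperedges) if v in e)
        new_labels[v] = hash((lbl, tuple(incident)))
    return new_labels

H1 = [{0, 1, 2}, {2, 3}, {3, 4}]                    # a "path" of hyperedges
H2 = [{0, 1, 2}, {2, 3}, {2, 4}]                    # a "star" around vertex 2
labels = {v: 1 for v in range(5)}                   # uniform initial labels
l1 = hypergraph_wl_iteration(labels, H1)
l2 = hypergraph_wl_iteration(labels, H2)
# Differing label histograms after refinement certify non-isomorphism.
print(sorted(l1.values()) != sorted(l2.values()))   # True
```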

Acceptable risks in Europe’s proposed AI Act: Reasonableness and other principles for deciding how much risk management is enough

  • paper_url: http://arxiv.org/abs/2308.02047
  • repo_url: None
  • paper_authors: Henry Fraser, Jose-Miguel Bello y Villarino
  • for: Evaluating the risk management and risk acceptability provisions for high-risk AI systems in the European Commission's proposed AI Act.
  • methods: Critiques the Act's requirement that residual risks be reduced or eliminated "as far as possible", arguing that, especially if interpreted narrowly, it is unworkable and promotes neither a proportionate regulatory burden nor trustworthiness.
  • results: Argues that the European Parliament's draft amendments, which introduce "reasonableness" and cost-benefit analysis, are more workable and better balance proportionality and trustworthiness; risk acceptability judgments also need civic legitimacy, including detailed guidance or involvement from regulators and meaningful input from affected stakeholders.
    Abstract This paper critically evaluates the European Commission's proposed AI Act's approach to risk management and risk acceptability for high-risk AI systems that pose risks to fundamental rights and safety. The Act aims to promote "trustworthy" AI with a proportionate regulatory burden. Its provisions on risk acceptability require residual risks from high-risk systems to be reduced or eliminated "as far as possible", having regard to the "state of the art". This criterion, especially if interpreted narrowly, is unworkable and promotes neither proportionate regulatory burden, nor trustworthiness. By contrast the Parliament's most recent draft amendments to the risk management provisions introduce "reasonableness", cost-benefit analysis, and are more transparent about the value-laden and contextual nature of risk acceptability judgements. This paper argues that the Parliament's approach is more workable, and better balances the goals of proportionality and trustworthiness. It explains what reasonableness in risk acceptability judgments would entail, drawing on principles from negligence law and European medical devices regulation. And it contends that the approach to risk acceptability judgments need a firm foundation of civic legitimacy: including detailed guidance or involvement from regulators, and meaningful input from affected stakeholders.

Open Image Content Disarm And Reconstruction

  • paper_url: http://arxiv.org/abs/2307.14057
  • repo_url: None
  • paper_authors: Eli Belkind, Ran Dubin, Amit Dvir
  • for: Proposing an Image Content Disarm and Reconstruction (ICDR) system to stop malware that uses images to hide malicious scripts or sensitive data.
  • methods: Takes a zero-trust approach: the image data is extracted, separated from the rest of the file, and its pixels manipulated to disable or remove any hidden malware.
  • results: The system removes potential malware and steganographic payloads from image files while maintaining high image quality and file usability.
    Abstract With the advance in malware technology, attackers create new ways to hide their malicious code from antivirus services. One way to obfuscate an attack is to use common files as cover to hide the malicious scripts, so the malware will look like a legitimate file. Although cutting-edge Artificial Intelligence and content signature exist, evasive malware successfully bypasses next-generation malware detection using advanced methods like steganography. Some of the files commonly used to hide malware are image files (e.g., JPEG). In addition, some malware use steganography to hide malicious scripts or sensitive data in images. Steganography in images is difficult to detect even with specialized tools. Image-based attacks try to attack the user's device using malicious payloads or utilize image steganography to hide sensitive data inside legitimate images and leak it outside the user's device. Therefore in this paper, we present a novel Image Content Disarm and Reconstruction (ICDR). Our ICDR system removes potential malware, with a zero trust approach, while maintaining high image quality and file usability. By extracting the image data, removing it from the rest of the file, and manipulating the image pixels, it is possible to disable or remove the hidden malware inside the file.
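
The zero-trust core of image CDR can be approximated in a few lines: decode to raw pixels, drop everything else in the file (metadata, appended payloads, trailing bytes), and re-encode. Zeroing the least significant bits, shown below, is one assumed way to break naive LSB steganography; the paths are placeholders and the real ICDR pipeline is considerably more involved.

```python
from PIL import Image
import numpy as np

def disarm_image(src_path: str, dst_path: str) -> None:
    pixels = np.asarray(Image.open(src_path).convert("RGB")).copy()
    pixels &= 0b11111110                            # zero LSBs to break LSB stego
    # Re-encoding from raw pixels drops EXIF, embedded scripts, and any data
    # appended after the image's end-of-file marker.
    Image.fromarray(pixels).save(dst_path, format="PNG")

Image.new("RGB", (8, 8), "red").save("suspicious.jpg")   # stand-in input file
disarm_image("suspicious.jpg", "clean.png")
```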

One-Nearest Neighborhood Guides Inlier Estimation for Unsupervised Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.14019
  • repo_url: None
  • paper_authors: Yongzhe Yuan, Yue Wu, Maoguo Gong, Qiguang Miao, A. K. Qin
  • for: Improving the accuracy of unsupervised point cloud registration, particularly in partially overlapping scenarios.
  • methods: Generates a One-Nearest Neighborhood (1-NN) reference copy of the input point cloud, integrates dual neighborhood matching scores, and scores inlier confidence by capturing transformation-invariant geometric structure consistency between the source cloud and its reference copy; this also yields a reliable self-supervised signal, and the transformation is estimated with a weighted SVD.
  • results: Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of the method.
    Abstract The precision of unsupervised point cloud registration methods is typically limited by the lack of reliable inlier estimation and self-supervised signal, especially in partially overlapping scenarios. In this paper, we propose an effective inlier estimation method for unsupervised point cloud registration by capturing geometric structure consistency between the source point cloud and its corresponding reference point cloud copy. Specifically, to obtain a high quality reference point cloud copy, an One-Nearest Neighborhood (1-NN) point cloud is generated by input point cloud. This facilitates matching map construction and allows for integrating dual neighborhood matching scores of 1-NN point cloud and input point cloud to improve matching confidence. Benefiting from the high quality reference copy, we argue that the neighborhood graph formed by inlier and its neighborhood should have consistency between source point cloud and its corresponding reference copy. Based on this observation, we construct transformation-invariant geometric structure representations and capture geometric structure consistency to score the inlier confidence for estimated correspondences between source point cloud and its reference copy. This strategy can simultaneously provide the reliable self-supervised signal for model optimization. Finally, we further calculate transformation estimation by the weighted SVD algorithm with the estimated correspondences and corresponding inlier confidence. We train the proposed model in an unsupervised manner, and extensive experiments on synthetic and real-world datasets illustrate the effectiveness of the proposed method.

ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution

  • paper_url: http://arxiv.org/abs/2307.14010
  • repo_url: None
  • paper_authors: Mingjin Zhang, Chi Zhang, Qiming Zhang, Jie Guo, Xinbo Gao, Jing Zhang
  • for: Restoring a high-resolution hyperspectral image from a low-resolution observation (single-HSI-SR).
  • methods: Uses a Transformer network with an iterative refining structure and a new spectral similarity metric, the spectral correlation coefficient of the spectrum (SCC), which replaces the attention matrix and enables an efficient kernelizable self-attention (ESSA) with linear complexity.
  • results: Generates more natural high-resolution images and achieves strong visual and quantitative results without pretraining on large-scale datasets.
    Abstract Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation. However, the prevailing CNN-based approaches have shown limitations in building long-range dependencies and capturing interaction information between spectral features. This results in inadequate utilization of spectral information and artifacts after upsampling. To address this issue, we propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure. Specifically, we first introduce a robust and spectral-friendly similarity metric, \ie, the spectral correlation coefficient of the spectrum (SCC), to replace the original attention matrix and incorporates inductive biases into the model to facilitate training. Built upon it, we further utilize the kernelizable attention technique with theoretical support to form a novel efficient SCC-kernel-based self-attention (ESSA) and reduce attention computation to linear complexity. ESSA enlarges the receptive field for features after upsampling without bringing much computation and allows the model to effectively utilize spatial-spectral information from different scales, resulting in the generation of more natural high-resolution images. Without the need for pretraining on large-scale datasets, our experiments demonstrate ESSA's effectiveness in both visual quality and quantitative results.
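
The metric at the heart of ESSA, the spectral correlation coefficient, is a Pearson correlation computed along the spectral axis; a small numpy sketch of the metric itself (the attention kernelization built on top of it is not shown):

```python
import numpy as np

def scc(spec_a: np.ndarray, spec_b: np.ndarray) -> float:
    """Pearson correlation of two spectra of shape (C,)."""
    a = spec_a - spec_a.mean()
    b = spec_b - spec_b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
base = rng.random(31)                               # a 31-band spectrum
similar = base * 1.8 + 0.05                         # same material, different gain/offset
different = rng.random(31)
print(scc(base, similar), scc(base, different))     # ~1.0 vs. near 0
```

Invariance to per-pixel gain and offset is what makes the metric "spectral-friendly": two pixels of the same material under different illumination still correlate strongly.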

DPBERT: Efficient Inference for BERT based on Dynamic Planning

  • paper_url: http://arxiv.org/abs/2308.00108
  • repo_url: None
  • paper_authors: Weixin Wu, Hankz Hankui Zhuo
  • for: Making BERT practical on mobile devices with limited computing power, where existing input-adaptive inference methods fail to take full advantage of BERT's structure.
  • methods: Proposes Dynamic Planning in BERT, a fine-tuning strategy that adds a planning module to decide, per input sample, which transformer layers of the backbone to include or bypass, so that inference follows a selected subsequence of layers.
  • results: On the GLUE benchmark, reduces latency to 75% while maintaining 98% accuracy, a better accuracy-speed trade-off than state-of-the-art input-adaptive methods.
    Abstract Large-scale pre-trained language models such as BERT have contributed significantly to the development of NLP. However, those models require large computational resources, making it difficult to apply them to mobile devices where computing power is limited. In this paper we aim to address the weakness of existing input-adaptive inference methods, which fail to take full advantage of the structure of BERT. We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT through selecting a subsequence of the backbone's transformer layers as the computational path for an input sample. To do this, our approach adds a planning module to the original BERT model to determine whether a layer is included or bypassed during inference. Experimental results on the GLUE benchmark exhibit that our method reduces latency to 75% while maintaining 98% accuracy, yielding a better accuracy-speed trade-off compared to state-of-the-art input-adaptive methods.
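A minimal sketch of the layer-planning idea, under our own assumptions about the gating mechanism (the actual planning module may differ, and hard gating during training would need a straight-through or RL-style estimator):

```python
# Sketch: a tiny gate per transformer layer decides, from the pooled hidden
# state, whether to run that layer or bypass it with an identity connection.
import torch
import torch.nn as nn

class PlannedEncoder(nn.Module):
    def __init__(self, dim=128, n_layers=12, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_layers))

    def forward(self, h):
        for layer, gate in zip(self.layers, self.gates):
            p = torch.sigmoid(gate(h.mean(dim=1)))     # pooled state -> keep prob
            keep = (p > 0.5).float().view(-1, 1, 1)    # hard decision at inference
            h = keep * layer(h) + (1 - keep) * h       # bypass = identity
        return h

h = torch.randn(2, 16, 128)
print(PlannedEncoder()(h).shape)                       # torch.Size([2, 16, 128])
```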

How User Language Affects Conflict Fatality Estimates in ChatGPT

  • paper_url: http://arxiv.org/abs/2308.00072
  • repo_url: None
  • paper_authors: Daniel Kazenwadel, Christoph V. Steinert
  • for: examining whether OpenAI's ChatGPT reproduces biases present in its language-specific training data
  • methods: an automated GPT-3.5 query procedure asking about casualties in specific airstrikes, in Hebrew and Arabic for the Israeli-Palestinian conflict and in Turkish and Kurdish for the Turkish-Kurdish conflict
  • results: GPT-3.5 gives 27±11% lower fatality estimates when queried in the attacker's language than in the targeted group's language, and evasive answers denying the attacks widen the gap; this language bias can amplify existing media biases and information bubbles, ultimately reinforcing conflicts
    Abstract OpenAI's ChatGPT language model has gained popularity as a powerful tool for complex problem-solving and information retrieval. However, concerns arise about the reproduction of biases present in the language-specific training data. In this study, we address this issue in the context of the Israeli-Palestinian and Turkish-Kurdish conflicts. Using GPT-3.5, we employed an automated query procedure to inquire about casualties in specific airstrikes, in both Hebrew and Arabic for the former conflict and Turkish and Kurdish for the latter. Our analysis reveals that GPT-3.5 provides 27±11 percent lower fatality estimates when queried in the language of the attacker than in the language of the targeted group. Evasive answers denying the existence of such attacks further increase the discrepancy, creating a novel bias mechanism not present in regular search engines. This language bias has the potential to amplify existing media biases and contribute to information bubbles, ultimately reinforcing conflicts.

Dual-Space Attacks against Random-Walk-based Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.14387
  • repo_url: https://github.com/yuni-lai/dualattackrw
  • paper_authors: Yuni Lai, Marcin Waniek, Yulin Zhu, Liying Li, Jingwen Wu, Tomasz P. Michalak, Talal Rahwan, Kai Zhou
  • for: exploring the two attack surfaces of Random-Walk-based Anomaly Detection (RWAD): graph-space attacks and feature-space attacks
  • methods: practical dual-space attacks; the graph-space attack is formulated as a bi-level optimization problem solved by alternating iteration (alterI-attack) or via the closed-form solution of the random walk model (cf-attack), and its results guide stronger graph-guided feature-space attacks
  • results: experiments show the attacks let target nodes evade RWAD with a limited attack budget, and black-box transfer experiments show the feature attack significantly decreases the anomaly scores of target nodes
    Abstract Random Walks-based Anomaly Detection (RWAD) is commonly used to identify anomalous patterns in various applications. An intriguing characteristic of RWAD is that the input graph can either be pre-existing or constructed from raw features. Consequently, there are two potential attack surfaces against RWAD: graph-space attacks and feature-space attacks. In this paper, we explore this vulnerability by designing practical dual-space attacks, investigating the interplay between graph-space and feature-space attacks. To this end, we conduct a thorough complexity analysis, proving that attacking RWAD is NP-hard. Then, we proceed to formulate the graph-space attack as a bi-level optimization problem and propose two strategies to solve it: alternative iteration (alterI-attack) or utilizing the closed-form solution of the random walk model (cf-attack). Finally, we utilize the results from the graph-space attacks as guidance to design more powerful feature-space attacks (i.e., graph-guided attacks). Comprehensive experiments demonstrate that our proposed attacks are effective in enabling the target nodes to evade RWAD with a limited attack budget. In addition, we conduct transfer attack experiments in a black-box setting, which show that our feature attack significantly decreases the anomaly scores of target nodes. Our study opens the door to studying the dual-space attack against graph anomaly detection in which the graph space relies on the feature space.

Controlling the Latent Space of GANs through Reinforcement Learning: A Case Study on Task-based Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2307.13978
  • repo_url: None
  • paper_authors: Mahyar Abbasian, Taha Rajabzadeh, Ahmadreza Moradipari, Seyed Amir Hossein Aqajari, Hongsheng Lu, Amir Rahmani
  • for: This paper aims to address the challenge of exerting control over the generation process of Generative Adversarial Networks (GANs) by integrating a reinforcement learning (RL) agent with a latent-space GAN (l-GAN).
  • methods: The proposed methodology utilizes an actor-critic RL agent with a meticulously designed reward policy to acquire proficiency in navigating the latent space of the l-GAN and generating outputs based on specified tasks.
  • results: The authors conducted a series of experiments employing the MNIST dataset, including arithmetic addition as an illustrative task, and the outcomes serve to validate their methodology.
    Abstract Generative Adversarial Networks (GAN) have emerged as a formidable AI tool to generate realistic outputs based on training datasets. However, the challenge of exerting control over the generation process of GANs remains a significant hurdle. In this paper, we propose a novel methodology to address this issue by integrating a reinforcement learning (RL) agent with a latent-space GAN (l-GAN), thereby facilitating the generation of desired outputs. More specifically, we have developed an actor-critic RL agent with a meticulously designed reward policy, enabling it to acquire proficiency in navigating the latent space of the l-GAN and generating outputs based on specified tasks. To substantiate the efficacy of our approach, we have conducted a series of experiments employing the MNIST dataset, including arithmetic addition as an illustrative task. The outcomes of these experiments serve to validate our methodology. Our pioneering integration of an RL agent with a GAN model represents a novel advancement, holding great potential for enhancing generative networks in the future.

Understanding Deep Neural Networks via Linear Separability of Hidden Layers

  • paper_url: http://arxiv.org/abs/2307.13962
  • repo_url: None
  • paper_authors: Chao Zhang, Xinyu Chen, Wensheng Li, Lixue Liu, Wei Wu, Dacheng Tao
  • for: characterizing deep neural networks through the linear separability of their hidden layers
  • methods: Minkowski-difference-based linear separability measures (MD-LSMs) to quantify how linearly separable two point sets are; the authors show that weight updates which increase the linear separability of hidden-layer outputs coincide with better training performance, and study the effect of activation function and network size (width and depth)
  • results: experiments validate the findings on popular deep networks, including MLP, CNN, DBN, ResNet, VGGNet, AlexNet, ViT, and GoogLeNet
    Abstract In this paper, we measure the linear separability of hidden layer outputs to study the characteristics of deep neural networks. In particular, we first propose Minkowski difference based linear separability measures (MD-LSMs) to evaluate the linear separability degree of two points sets. Then, we demonstrate that there is a synchronicity between the linear separability degree of hidden layer outputs and the network training performance, i.e., if the updated weights can enhance the linear separability degree of hidden layer outputs, the updated network will achieve a better training performance, and vice versa. Moreover, we study the effect of activation function and network size (including width and depth) on the linear separability of hidden layers. Finally, we conduct the numerical experiments to validate our findings on some popular deep networks including multilayer perceptron (MLP), convolutional neural network (CNN), deep belief network (DBN), ResNet, VGGNet, AlexNet, vision transformer (ViT) and GoogLeNet.
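As a rough illustration of measuring hidden-layer linear separability (a proxy, not the paper's exact MD-LSM): two point sets are linearly separable exactly when some w satisfies w·(a−b) > 0 for every Minkowski-difference vector a−b, and a linear classifier's training accuracy on the two sets gives a cheap score of that property:

```python
# Proxy for linear separability: fit a linear classifier on two sets of
# hidden-layer outputs; training accuracy 1.0 means fully separable.
import numpy as np
from sklearn.linear_model import LogisticRegression

def separability_score(h_a, h_b):
    X = np.vstack([h_a, h_b])
    y = np.r_[np.ones(len(h_a)), np.zeros(len(h_b))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)

rng = np.random.default_rng(0)
h_a = rng.normal(+1.0, 1.0, size=(200, 16))   # stand-ins for hidden outputs
h_b = rng.normal(-1.0, 1.0, size=(200, 16))
print(separability_score(h_a, h_b))
```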

Flexible Differentially Private Vertical Federated Learning with Adaptive Feature Embeddings

  • paper_url: http://arxiv.org/abs/2308.02362
  • repo_url: None
  • paper_authors: Yuxi Mi, Hongquan Liu, Yewei Xia, Yiheng Sun, Jihong Guan, Shuigeng Zhou
  • for: addressing the privacy weakness of vertical federated learning (VFL), where shared feature embeddings can leak sensitive information under privacy attacks
  • methods: a flexible, generic approach under differential privacy (DP) that decouples the privacy and task-utility goals and addresses them successively: norm clipping on shared feature embeddings yields a rigorous privacy guarantee, and utility is then optimized by adaptively adjusting the scale and distribution of the embeddings
  • results: extensive experiments validate that the proposed VFL-AFE framework resists privacy attacks while retaining favorable task utility, without weakening the established DP mechanism
    Abstract The emergence of vertical federated learning (VFL) has stimulated concerns about the imperfection in privacy protection, as shared feature embeddings may reveal sensitive information under privacy attacks. This paper studies the delicate equilibrium between data privacy and task utility goals of VFL under differential privacy (DP). To address the generality issue of prior arts, this paper advocates a flexible and generic approach that decouples the two goals and addresses them successively. Specifically, we initially derive a rigorous privacy guarantee by applying norm clipping on shared feature embeddings, which is applicable across various datasets and models. Subsequently, we demonstrate that task utility can be optimized via adaptive adjustments on the scale and distribution of feature embeddings in an accuracy-appreciative way, without compromising established DP mechanisms. We concretize our observation into the proposed VFL-AFE framework, which exhibits effectiveness against privacy attacks and the capacity to retain favorable task utility, as substantiated by extensive experiments.
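A minimal sketch of the privacy half of the recipe as described: per-sample norm clipping of the shared embeddings followed by Gaussian noise scaled to the clipping bound (a standard Gaussian-mechanism construction; the noise calibration here is illustrative, not the paper's):

```python
# Sketch: clip each embedding's L2 norm, then add Gaussian noise whose
# scale is tied to the clipping bound (the L2 sensitivity).
import torch

def privatize_embeddings(h, clip=1.0, sigma=0.5):
    """h: (batch, dim) embeddings a party is about to share."""
    norms = h.norm(dim=1, keepdim=True).clamp(min=1e-12)
    h = h * torch.clamp(clip / norms, max=1.0)       # per-sample L2 clipping
    return h + sigma * clip * torch.randn_like(h)    # Gaussian mechanism

h = torch.randn(8, 32)
print(privatize_embeddings(h).shape)                 # torch.Size([8, 32])
```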

How Does Diffusion Influence Pretrained Language Models on Out-of-Distribution Data?

  • paper_url: http://arxiv.org/abs/2307.13949
  • repo_url: https://github.com/maybelizzy/diffusion_ood_robustness
  • paper_authors: Huazheng Wang, Daixuan Cheng, Haifeng Sun, Jingyu Wang, Qi Qi, Jianxin Liao, Jing Wang, Cong Liu
  • for: studying how diffusion affects the out-of-distribution (OOD) performance of pretrained language models (PLMs)
  • methods: analyses of the forward diffusion (noising) process and the reverse denoising process, measuring noised-input reconstruction across different training parameters and data statistics on eight datasets
  • results: finetuning PLMs with diffusion degrades reconstruction on OOD data, yet diffusion models detect OOD samples effectively, reaching state-of-the-art performance on most datasets with absolute accuracy gains of up to 18%; overall, diffusion reduces the OOD robustness of PLMs
    Abstract Transformer-based pretrained language models (PLMs) have achieved great success in modern NLP. An important advantage of PLMs is good out-of-distribution (OOD) robustness. Recently, diffusion models have attracted a lot of work to apply diffusion to PLMs. It remains under-explored how diffusion influences PLMs on OOD data. The core of diffusion models is a forward diffusion process which gradually applies Gaussian noise to inputs, and a reverse denoising process which removes noise. The noised input reconstruction is a fundamental ability of diffusion models. We directly analyze OOD robustness by measuring the reconstruction loss, including testing the abilities to reconstruct OOD data, and to detect OOD samples. Experiments are conducted by analyzing different training parameters and data statistical features on eight datasets. It shows that finetuning PLMs with diffusion degrades the reconstruction ability on OOD data. The comparison also shows that diffusion models can effectively detect OOD samples, achieving state-of-the-art performance in most of the datasets with an absolute accuracy improvement up to 18%. These results indicate that diffusion reduces OOD robustness of PLMs.
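A minimal sketch of the reconstruction probe as we read it: noise an input with the closed-form forward process x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, reconstruct it, and use the reconstruction error as an OOD score; `denoise` stands in for a diffusion-finetuned PLM's denoising step:

```python
# Sketch: reconstruction loss after one forward-diffusion step as an OOD
# score; a high loss flags a likely out-of-distribution input.
import torch

def ood_score(x0, denoise, abar_t=0.5):
    eps = torch.randn_like(x0)
    xt = abar_t**0.5 * x0 + (1 - abar_t)**0.5 * eps  # forward diffusion
    x0_hat = denoise(xt)                             # reverse / denoising step
    return ((x0 - x0_hat) ** 2).mean(dim=-1)

x0 = torch.randn(4, 768)                             # e.g. sentence embeddings
print(ood_score(x0, denoise=lambda x: x))            # identity as a dummy model
```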

Learning-based Control for PMSM Using Distributed Gaussian Processes with Optimal Aggregation Strategy

  • paper_url: http://arxiv.org/abs/2307.13945
  • repo_url: None
  • paper_authors: Zhenxiao Yin, Xiaobing Dai, Zewen Yang, Yang Shen, Georges Hattab, Hang Zhao
  • for: a control-aware optimal aggregation strategy for distributed Gaussian process regression (GPR), grounded in Lyapunov stability theory, for precise control of permanent magnet synchronous motors (PMSMs)
  • methods: distributed GPR models the unknown part of the system, and the aggregation relies exclusively on the posterior mean, avoiding the computationally intensive posterior variance
  • results: the strategy is validated in simulation, and its straightforward calculation lends itself to high-frequency PMSM control
    Abstract The growing demand for accurate control in varying and unknown environments has sparked a corresponding increase in the requirements for power supply components, including permanent magnet synchronous motors (PMSMs). To infer the unknown part of the system, machine learning techniques are widely employed, especially Gaussian process regression (GPR) due to its flexibility of continuous system modeling and its guaranteed performance. For practical implementation, distributed GPR is adopted to alleviate the high computational complexity. However, the study of distributed GPR from a control perspective remains an open problem. In this paper, a control-aware optimal aggregation strategy of distributed GPR for PMSMs is proposed based on the Lyapunov stability theory. This strategy exclusively leverages the posterior mean, thereby obviating the need for computationally intensive calculations associated with posterior variance in alternative approaches. Moreover, the straightforward calculation process of our proposed strategy lends itself to seamless implementation in high-frequency PMSM control. The effectiveness of the proposed strategy is demonstrated in the simulations.

Entropy Neural Estimation for Graph Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.13944
  • repo_url: https://github.com/kunzhan/M-ILBO
  • paper_authors: Yixuan Ma, Xiaolin Zhang, Peng Zhang, Kun Zhan
  • for: a contrastive graph representation learning method that extracts distinguishable high-level node representations
  • methods: the dataset's entropy is estimated by a neural network that maximizes a lower bound of the mutual information across views, combined with a simple yet effective subset sampling strategy (randomly sampled nodes and edges per view), cross-view-similarity-guided selection of positive and negative pairs, and a cross-view consistency constraint
  • results: competitive performance against state-of-the-art methods on seven graph benchmarks
    Abstract Contrastive learning on graphs aims at extracting distinguishable high-level representations of nodes. In this paper, we theoretically illustrate that the entropy of a dataset can be approximated by maximizing the lower bound of the mutual information across different views of a graph, i.e., entropy is estimated by a neural network. Based on this finding, we propose a simple yet effective subset sampling strategy to contrast pairwise representations between views of a dataset. In particular, we randomly sample nodes and edges from a given graph to build the input subset for a view. Two views are fed into a parameter-shared Siamese network to extract the high-dimensional embeddings and estimate the information entropy of the entire graph. For the learning process, we propose to optimize the network using two objectives, simultaneously. Concretely, the input of the contrastive loss function consists of positive and negative pairs. Our selection strategy of pairs is different from previous works and we present a novel strategy to enhance the representation ability of the graph encoder by selecting nodes based on cross-view similarities. We enrich the diversity of the positive and negative pairs by selecting highly similar samples and totally different data with the guidance of cross-view similarity scores, respectively. We also introduce a cross-view consistency constraint on the representations generated from the different views. This objective guarantees the learned representations are consistent across views from the perspective of the entire graph. We conduct extensive experiments on seven graph benchmarks, and the proposed approach achieves competitive performance compared to the current state-of-the-art methods. The source code will be publicly released once this paper is accepted.
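A minimal sketch of the two ingredients that are easy to pin down from the abstract, with our own simplifications (an InfoNCE-style bound and random node/edge dropping; the paper's cross-view pair selection and consistency constraint are richer than this):

```python
# Sketch: build two random subset views of a graph and contrast their
# encodings; minimizing InfoNCE tightens a lower bound on mutual information.
import torch
import torch.nn.functional as F

def sample_view(x, edge_index, node_keep=0.9, edge_keep=0.8):
    n_mask = (torch.rand(x.size(0)) < node_keep).float()
    x = x * n_mask.unsqueeze(1)                           # drop node features
    e_mask = torch.rand(edge_index.size(1)) < edge_keep   # drop edges
    return x, edge_index[:, e_mask]

def info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                # positives on the diagonal
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

x = torch.randn(32, 16)
edges = torch.randint(0, 32, (2, 100))
enc = torch.nn.Linear(16, 8)                  # stand-in for a GNN encoder
(x1, _), (x2, _) = sample_view(x, edges), sample_view(x, edges)
print(info_nce(enc(x1), enc(x2)).item())
```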

Stability of Multi-Agent Learning: Convergence in Network Games with Many Players

  • paper_url: http://arxiv.org/abs/2307.13922
  • repo_url: None
  • paper_authors: Aamal Hussain, Dan Leonte, Francesco Belardinelli, Georgios Piliouras
  • for: understanding the complex dynamics of multi-agent learning in many-player games
  • methods: analysis of Q-learning dynamics, deriving a sufficient condition for convergence to a unique equilibrium in any network game
  • results: the condition depends on the pairwise interactions and the network structure but not on the total number of agents, so under suitable network conditions stable learning dynamics can be achieved with arbitrarily many agents
    Abstract The behaviour of multi-agent learning in many player games has been shown to display complex dynamics outside of restrictive examples such as network zero-sum games. In addition, it has been shown that convergent behaviour is less likely to occur as the number of players increase. To make progress in resolving this problem, we study Q-Learning dynamics and determine a sufficient condition for the dynamics to converge to a unique equilibrium in any network game. We find that this condition depends on the nature of pairwise interactions and on the network structure, but is explicitly independent of the total number of agents in the game. We evaluate this result on a number of representative network games and show that, under suitable network conditions, stable learning dynamics can be achieved with an arbitrary number of agents.

HyperFed: Hyperbolic Prototypes Exploration with Consistent Aggregation for Non-IID Data in Federated Learning

  • paper_url: http://arxiv.org/abs/2307.14384
  • repo_url: None
  • paper_authors: Xinting Liao, Weiming Liu, Chaochao Chen, Pengyang Zhou, Huabin Zhu, Yanchao Tan, Jun Wang, Yue Qi
  • for: improving federated learning performance when client data are non-identically and independently distributed (non-IID)
  • methods: hyperbolic prototype Tammes initialization (HPTI), hyperbolic prototype learning (HPL), and consistent aggregation (CA)
  • results: extensive studies on four datasets show that HyperFed effectively improves federated learning under non-IID client data
    Abstract Federated learning (FL) collaboratively models user data in a decentralized way. However, in the real world, non-identical and independent data distributions (non-IID) among clients hinder the performance of FL due to three issues, i.e., (1) the class statistics shifting, (2) the insufficient hierarchical information utilization, and (3) the inconsistency in aggregating clients. To address the above issues, we propose HyperFed which contains three main modules, i.e., hyperbolic prototype Tammes initialization (HPTI), hyperbolic prototype learning (HPL), and consistent aggregation (CA). Firstly, HPTI in the server constructs uniformly distributed and fixed class prototypes, and shares them with clients to match class statistics, further guiding consistent feature representation for local clients. Secondly, HPL in each client captures the hierarchical information in local data with the supervision of shared class prototypes in the hyperbolic model space. Additionally, CA in the server mitigates the impact of the inconsistent deviations from clients to server. Extensive studies of four datasets prove that HyperFed is effective in enhancing the performance of FL under the non-IID set.

Embedding Democratic Values into Social Media AIs via Societal Objective Functions

  • paper_url: http://arxiv.org/abs/2307.13912
  • repo_url: None
  • paper_authors: Chenyan Jia, Michelle S. Lam, Minh Chau Mai, Jeff Hancock, Michael S. Bernstein
  • for: building AI systems, grounded in social-science theory and methods, that rank social media feeds to mitigate partisan animosity
  • methods: translating established, vetted social-scientific constructs into AI objective functions ("societal objective functions"), using existing survey instruments and qualitative codebooks to build detailed large-language-model prompts; demonstrated on the political-science construct of anti-democratic attitudes
  • results: in Study 1, manually annotated posts (alpha=.895) were used to test feed ranking conditions among US partisans (N=1,380); removal (d=.20) and downranking (d=.25) reduced partisan animosity without compromising experience and engagement. Study 2 scaled up the labels with a democratic attitude model that agreed strongly with manual annotation (rho=.75). Study 3 replicated Study 1 with the model in place of manual labels (N=558) and again found that downranking reduced partisan animosity (d=.25)
    Abstract Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes to use to train such models, however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. Removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media AIs.
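A minimal sketch of a societal objective function in this paper's sense: an LLM prompt derived from a validated construct scores each post, and the feed is downranked by that score. The prompt wording, weighting, and `dummy_llm` below are ours, not the authors' codebook:

```python
# Sketch: score posts for anti-democratic attitudes with an LLM prompt,
# then rank the feed by engagement minus a societal-objective penalty.
def score_post(post, llm):
    prompt = ("Rate from 0 (none) to 1 (strong) how much this social media "
              f"post promotes anti-democratic attitudes.\nPost: {post}\nScore:")
    return float(llm(prompt))

def downrank(feed, llm, weight=10.0):
    key = lambda p: p["engagement"] - weight * score_post(p["text"], llm)
    return sorted(feed, key=key, reverse=True)

feed = [{"text": "Remember to vote tomorrow!", "engagement": 3.0},
        {"text": "The other party are traitors.", "engagement": 9.0}]
dummy_llm = lambda p: "0.9" if "traitors" in p else "0.0"   # stand-in scorer
print([p["text"] for p in downrank(feed, dummy_llm)])
```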

Robustness Verification of Deep Neural Networks using Star-Based Reachability Analysis with Variable-Length Time Series Input

  • paper_url: http://arxiv.org/abs/2307.13907
  • repo_url: None
  • paper_authors: Neelanjana Pal, Diego Manzanas Lopez, Taylor T Johnson
  • for: validating and verifying the robustness of neural networks that analyze time-series data for prognostics and health management (PHM)
  • methods: star-based reachability analysis of time-series regression neural networks, using variable-length input data to streamline input manipulation and improve architecture generalizability
  • results: on two PHM case studies (state-of-charge estimation for a Lithium-ion battery and remaining-useful-life estimation for a turbine engine), several performance measures quantify how bounded input perturbations affect network outputs
    Abstract Data-driven, neural network (NN) based anomaly detection and predictive maintenance are emerging research areas. NN-based analytics of time-series data offer valuable insights into past behaviors and estimates of critical parameters like remaining useful life (RUL) of equipment and state-of-charge (SOC) of batteries. However, input time series data can be exposed to intentional or unintentional noise when passing through sensors, necessitating robust validation and verification of these NNs. This paper presents a case study of the robustness verification approach for time series regression NNs (TSRegNN) using set-based formal methods. It focuses on utilizing variable-length input data to streamline input manipulation and enhance network architecture generalizability. The method is applied to two data sets in the Prognostics and Health Management (PHM) application areas: (1) SOC estimation of a Lithium-ion battery and (2) RUL estimation of a turbine engine. The NNs' robustness is checked using star-based reachability analysis, and several performance measures evaluate the effect of bounded perturbations in the input on network outputs, i.e., future outcomes. Overall, the paper offers a comprehensive case study for validating and verifying NN-based analytics of time-series data in real-world applications, emphasizing the importance of robustness testing for accurate and reliable predictions, especially considering the impact of noise on future outcomes.

Data Augmentation for Neural Machine Translation using Generative Language Model

  • paper_url: http://arxiv.org/abs/2307.16833
  • repo_url: None
  • paper_authors: Seokjin Oh, Su ah Lee, Woohwan Jung
  • for: improving machine translation by addressing the scarcity of large parallel corpora in neural machine translation
  • methods: prompt-based data augmentation that uses large-scale language models such as ChatGPT to generate a synthetic parallel corpus, with no additional model training cost
  • results: improves the unaugmented baseline by 0.68 BLEU
    Abstract Despite the rapid growth in model architecture, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new ones. We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT. To create a synthetic parallel corpus, we compare 3 methods using different prompts. We employ two assessment metrics to measure the diversity of the generated synthetic data. This approach requires no further model training cost, which is mandatory in other augmentation methods like back-translation. The proposed method improves the unaugmented baseline by 0.68 BLEU score.
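One plausible shape of the prompt-based augmentation (the paper compares three prompt variants; the prompt text and the `dummy_chat` stand-in below are ours, not the authors' exact setup):

```python
# Sketch: turn monolingual sentences into synthetic parallel pairs by
# prompting a ChatGPT-style model for translations.
def make_synthetic_pair(src_sentence, chat, tgt_lang="German"):
    prompt = (f"Translate the following sentence into {tgt_lang}. "
              f"Reply with the translation only.\n{src_sentence}")
    return src_sentence, chat(prompt)

monolingual = ["The cat sat on the mat.", "It rained all day."]
dummy_chat = lambda p: "<synthetic target sentence>"   # stand-in for the API call
corpus = [make_synthetic_pair(s, dummy_chat) for s in monolingual]
print(corpus[0])
```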

FinTree: Financial Dataset Pretrain Transformer Encoder for Relation Extraction

  • paper_url: http://arxiv.org/abs/2307.13900
  • repo_url: None
  • paper_authors: Hyunjong Ok
  • for: FinTree targets financial relation extraction, specifically improving the accuracy of relation predictions between two given entities.
  • methods: FinTree uses a pre-trained encoder language model, with a novel structure that predicts a masked token instead of the conventional [CLS] token, inspired by the Pattern Exploiting Training methodology. The model is trained with a unique input pattern to provide contextual and positional information about the entities of interest, and a post-processing step ensures accurate predictions in line with the entity types.
  • results: FinTree outperforms on the REFinD, a large-scale financial relation extraction dataset.
    Abstract We present FinTree, Financial Dataset Pretrain Transformer Encoder for Relation Extraction. Utilizing an encoder language model, we further pretrain FinTree on the financial dataset, adapting the model in financial domain tasks. FinTree stands out with its novel structure that predicts a masked token instead of the conventional [CLS] token, inspired by the Pattern Exploiting Training methodology. This structure allows for more accurate relation predictions between two given entities. The model is trained with a unique input pattern to provide contextual and positional information about the entities of interest, and a post-processing step ensures accurate predictions in line with the entity types. Our experiments demonstrate that FinTree outperforms on the REFinD, a large-scale financial relation extraction dataset. The code and pretrained models are available at https://github.com/HJ-Ok/FinTree.
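A minimal sketch of predicting a relation through a masked token instead of [CLS], in the spirit of pattern-exploiting training; the input pattern, label words, and checkpoint are invented for illustration (FinTree's actual pattern and post-processing differ) and the example assumes the bert-base-uncased weights are downloadable:

```python
# Sketch: fill a [MASK] slot in a relation pattern and read the prediction
# off the mask-token logits restricted to a toy label set.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = ("Apple acquired Beats. The relation between Apple and Beats "
        f"is {tok.mask_token}.")
inputs = tok(text, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

label_words = ["acquisition", "employment", "ownership"]   # toy label set
ids = [tok.convert_tokens_to_ids(w) for w in label_words]
print(label_words[int(torch.stack([logits[i] for i in ids]).argmax())])
```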

Regularizing Neural Networks with Meta-Learning Generative Models

  • paper_url: http://arxiv.org/abs/2307.13899
  • repo_url: None
  • paper_authors: Shin’ya Yamaguchi, Daiki Chijiwa, Sekitoshi Kanai, Atsutoshi Kumagai, Hisashi Kashima
  • for: improving generative data augmentation for deep learning
  • methods: meta generative regularization (MGR), which uses synthetic samples in a regularization term for the feature extractor rather than in the loss function (e.g., cross-entropy), with the synthetic samples dynamically chosen via meta-learning to minimize validation loss
  • results: on six datasets, MGR avoids the degradation seen with naive generative data augmentation and stably outperforms the baselines, particularly when datasets are small
    Abstract This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small dataset settings. A key challenge of generative data augmentation is that the synthetic data contain uninformative samples that degrade accuracy. This is because the synthetic samples do not perfectly represent class categories in real data and uniform sampling does not necessarily provide useful samples for tasks. In this paper, we present a novel strategy for generative data augmentation called meta generative regularization (MGR). To avoid the degradation of generative data augmentation, MGR utilizes synthetic samples in the regularization term for feature extractors instead of in the loss function, e.g., cross-entropy. These synthetic samples are dynamically determined to minimize the validation losses through meta-learning. We observed that MGR can avoid the performance degradation of naïve generative data augmentation and boost the baselines. Experiments on six datasets showed that MGR is effective particularly when datasets are smaller and stably outperforms baselines.

AI4GCC - Team: Below Sea Level: Critiques and Improvements

  • paper_url: http://arxiv.org/abs/2307.13894
  • repo_url: None
  • paper_authors: Bram Renting, Phillip Wozny, Robert Loftin, Claudia Wieners, Erman Acar
  • for: evaluating how climate change impacts the economy
  • methods: a critical analysis of RICE-N, an integrated assessment model (IAM)
  • results: identifies key issues with RICE-N, including action masking and irrelevant actions; suggests improvements such as utilizing tariff revenue and penalizing overproduction; and critiques overly optimistic damage functions and unrealistic abatement cost functions in IAMs generally
    Abstract We present a critical analysis of the simulation framework RICE-N, an integrated assessment model (IAM) for evaluating the impacts of climate change on the economy. We identify key issues with RICE-N, including action masking and irrelevant actions, and suggest improvements such as utilizing tariff revenue and penalizing overproduction. We also critically engage with features of IAMs in general, namely overly optimistic damage functions and unrealistic abatement cost functions. Our findings contribute to the ongoing efforts to further develop the RICE-N framework in an effort to improve the simulation, making it more useful as an inspiration for policymakers.

Dynamic Grouping for Climate Change Negotiation: Facilitating Cooperation and Balancing Interests through Effective Strategies

  • paper_url: http://arxiv.org/abs/2307.13893
  • repo_url: None
  • paper_authors: Yu Qin, Duo Zhang, Yuren Pang
  • for: a dynamic grouping negotiation model for climate mitigation, based on real-world business and political negotiation protocols, that promotes effective cooperation among stakeholders toward global climate objectives
  • methods: a three-stage process of group formation and updates, intra-group negotiation, and inter-group negotiation; the group-forming and group-updating strategies address the complexity and imbalance of multi-region climate negotiations
  • results: demonstrated within the RICE-N framework, illustrating a promising approach to international cooperation on climate change mitigation
    Abstract In this paper, we propose a dynamic grouping negotiation model for climate mitigation based on real-world business and political negotiation protocols. Within the AI4GCC competition framework, we develop a three-stage process: group formation and updates, intra-group negotiation, and inter-group negotiation. Our model promotes efficient and effective cooperation between various stakeholders to achieve global climate change objectives. By implementing a group-forming method and group updating strategy, we address the complexities and imbalances in multi-region climate negotiations. Intra-group negotiations ensure that all members contribute to mitigation efforts, while inter-group negotiations use the proposal-evaluation framework to set mitigation and savings rates. We demonstrate our negotiation model within the RICE-N framework, illustrating a promising approach for facilitating international cooperation on climate change mitigation.

AI4GCC-Team – Below Sea Level: Score and Real World Relevance

  • paper_url: http://arxiv.org/abs/2307.13892
  • repo_url: None
  • paper_authors: Phillip Wozny, Bram Renting, Robert Loftin, Claudia Wieners, Erman Acar
  • for: The paper is written to address the challenges of carbon leakage in the context of the RICE-N climate-economic simulation, with the goal of achieving a comparable temperature rise to RCP 3.4/4.5 and SSP 2.
  • methods: The paper proposes a negotiation protocol inspired by the Carbon Border Adjustment Mechanism (CBAM) and Climate Clubs (CC), and demonstrates the effectiveness of this approach through simulations.
  • results: The paper’s proposed protocol results in a temperature rise comparable to RCP 3.4/4.5 and SSP 2, and provides an analysis of its World Trade Organization compliance, administrative and political feasibility, and ethical concerns. However, the paper also acknowledges the risk of hurting the least developed countries and suggests specific corrective measures to avoid exacerbating existing inequalities.
    Abstract As our submission for track three of the AI for Global Climate Cooperation (AI4GCC) competition, we propose a negotiation protocol for use in the RICE-N climate-economic simulation. Our proposal seeks to address the challenges of carbon leakage through methods inspired by the Carbon Border Adjustment Mechanism (CBAM) and Climate Clubs (CC). We demonstrate the effectiveness of our approach by comparing simulated outcomes to representative concentration pathways (RCP) and shared socioeconomic pathways (SSP). Our protocol results in a temperature rise comparable to RCP 3.4/4.5 and SSP 2. Furthermore, we provide an analysis of our protocol's World Trade Organization compliance, administrative and political feasibility, and ethical concerns. We recognize that our proposal risks hurting the least developing countries, and we suggest specific corrective measures to avoid exacerbating existing inequalities, such as technology sharing and wealth redistribution. Future research should improve the RICE-N tariff mechanism and implement actions allowing for the aforementioned corrective measures.

Dynamic Grouping for Climate Change Negotiation: Facilitating Cooperation and Balancing Interests through Effective Strategies

  • paper_url: http://arxiv.org/abs/2307.13886
  • repo_url: None
  • paper_authors: Duo Zhang, Yuren Pang, Yu Qin
  • for: This paper aims to improve the accuracy and effectiveness of climate change negotiation models by addressing limitations in the current framework.
  • methods: The paper explores five critical aspects of geographical impacts and refines the utility and rewards framework to better account for heterogeneity and historical/cultural factors.
  • results: By addressing these limitations, the paper hopes to enhance the accuracy and effectiveness of climate change negotiation models, enabling policymakers and stakeholders to devise targeted and appropriate strategies to tackle climate change at both regional and global levels.
    Abstract The current framework for climate change negotiation models presents several limitations that warrant further research and development. In this track, we discuss mainly two key areas for improvement, focusing on the geographical impacts and utility framework. In the aspects of geographical impacts, We explore five critical aspects: (1) the shift from local to global impact, (2) variability in climate change effects across regions, (3) heterogeneity in geographical location and political structures, and (4) collaborations between adjacent nations, (5) the importance of including historical and cultural factors influencing climate negotiations. Furthermore, we emphasize the need to refine the utility and rewards framework to reduce the homogeneity and the level of overestimating the climate mitigation by integrating the positive effects of saving rates into the reward function and heterogeneity among all regions. By addressing these limitations, we hope to enhance the accuracy and effectiveness of climate change negotiation models, enabling policymakers and stakeholders to devise targeted and appropriate strategies to tackle climate change at both regional and global levels.

WebArena: A Realistic Web Environment for Building Autonomous Agents

  • paper_url: http://arxiv.org/abs/2307.13854
  • repo_url: https://github.com/web-arena-x/webarena
  • paper_authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
  • for: a highly realistic and reproducible environment for command and control of autonomous agents that perform everyday tasks via natural language
  • methods: fully functional websites from four common domains (e-commerce, social forum discussions, collaborative software development, and content management), enriched with tools such as a map and external knowledge bases such as user manuals, plus a benchmark of diverse, long-horizon tasks; several agents are implemented, integrating recent techniques such as reasoning before acting
  • results: solving complex tasks remains challenging: the best GPT-4-based agent reaches only a 10.59% end-to-end task success rate, showing that current language models are far from solving these real-life tasks and that WebArena can measure such progress
    Abstract With generative AI advances, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, current agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on websites, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and are designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, show that current state-of-the-art LMs are far from perfect performance on these real-life tasks, and demonstrate that WebArena can be used to measure such progress. Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.

MAEA: Multimodal Attribution for Embodied AI

  • paper_url: http://arxiv.org/abs/2307.13850
  • repo_url: None
  • paper_authors: Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Yonatan Bisk
  • for: multimodal perception in embodied AI, where inputs can carry both highly complementary and redundant information for a task
  • methods: attribution analysis that disentangles the contributions of visual, language, and previous-action inputs across policies trained on the ALFRED dataset, usable to rank and group failure scenarios and to probe model and dataset biases
  • results: MAEA, a framework that computes global per-modality attributions for any differentiable policy, with attributions also enabling lower-level behavior analysis of language and visual inputs
    Abstract Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different policies trained on the ALFRED dataset. Attribution analysis can be utilized to rank and group the failure scenarios, investigate modeling and dataset biases, and critically analyze multimodal EAI policies for robustness and user trust before deployment. We present MAEA, a framework to compute global attributions per modality of any differentiable policy. In addition, we show how attributions enable lower-level behavior analysis in EAI policies for language and visual attributions.
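A minimal sketch (ours, not the MAEA release) of one simple global attribution, gradient × input magnitude, aggregated per modality for a differentiable policy that consumes visual, language, and previous-action features:

```python
# Sketch: per-modality attribution as the total |grad * input| mass each
# input stream contributes to the policy output.
import torch

def modality_attributions(policy, vis, lang, act):
    vis, lang, act = (t.clone().requires_grad_(True) for t in (vis, lang, act))
    policy(vis, lang, act).sum().backward()
    return {name: (t.grad * t).abs().sum().item()
            for name, t in [("vision", vis), ("language", lang), ("action", act)]}

# Dummy differentiable policy standing in for a trained ALFRED agent.
policy = lambda v, l, a: (v.mean() + 2 * l.mean() + 0.1 * a.mean()).unsqueeze(0)
print(modality_attributions(policy, torch.randn(1, 512),
                            torch.randn(1, 128), torch.randn(1, 8)))
```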

Scaling Integer Arithmetic in Probabilistic Programs

  • paper_url: http://arxiv.org/abs/2307.13837
  • repo_url: None
  • paper_authors: William X. Cao, Poorva Garg, Ryan Tjoa, Steven Holtzen, Todd Millstein, Guy Van den Broeck
  • for: scaling inference over integer-valued distributions, which remain challenging for today's probabilistic programming languages (PPLs)
  • methods: a binary encoding strategy for discrete distributions that exploits the logical structure of integer operations such as summation and comparison, combined with knowledge compilation for exact probabilistic inference
  • results: the approach scales exact inference to much larger integer distributions with arithmetic than enumeration-, sampling-, or differentiation-based strategies
    Abstract Distributions on integers are ubiquitous in probabilistic modeling but remain challenging for many of today's probabilistic programming languages (PPLs). The core challenge comes from discrete structure: many of today's PPL inference strategies rely on enumeration, sampling, or differentiation in order to scale, which fail for high-dimensional complex discrete distributions involving integers. Our insight is that there is structure in arithmetic that these approaches are not using. We present a binary encoding strategy for discrete distributions that exploits the rich logical structure of integer operations like summation and comparison. We leverage this structured encoding with knowledge compilation to perform exact probabilistic inference, and show that this approach scales to much larger integer distributions with arithmetic.

Offline Reinforcement Learning with On-Policy Q-Function Regularization

  • paper_url: http://arxiv.org/abs/2307.13824
  • repo_url: None
  • paper_authors: Laixi Shi, Robert Dadashi, Yuejie Chi, Pablo Samuel Castro, Matthieu Geist
  • for: mitigating the extrapolation error in offline reinforcement learning (RL) caused by the distribution shift between the history dataset and the desired policy
  • methods: regularizing toward the Q-function of the behavior policy rather than the behavior policy itself, since the Q-function can be estimated more reliably and easily by a SARSA-style estimate; two algorithms exploit the estimated Q-function through regularization
  • results: both algorithms exhibit strong performance on the D4RL benchmarks
    Abstract The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
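A minimal sketch of the SARSA-style estimate of the behavior policy's Q-function from offline transitions (s, a, r, s', a'), the quantity the paper regularizes toward; the actor and critic losses that consume it are omitted, and the network size is illustrative:

```python
# Sketch: one SARSA-style TD step that fits Q^beta using the dataset's own
# next action a', avoiding out-of-distribution action queries.
import torch
import torch.nn as nn

q = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(q.parameters(), lr=3e-4)
gamma = 0.99

def sarsa_step(s, a, r, s2, a2):
    with torch.no_grad():
        target = r + gamma * q(torch.cat([s2, a2], dim=1))  # uses logged a'
    loss = ((q(torch.cat([s, a], dim=1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

batch = [torch.randn(32, d) for d in (4, 2, 1, 4, 2)]   # s, a, r, s', a'
print(sarsa_step(*batch))
```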

Fitting Auditory Filterbanks with Multiresolution Neural Networks

  • paper_url: http://arxiv.org/abs/2307.13821
  • repo_url: https://github.com/lostanlen/lostanlen2023waspaa
  • paper_authors: Vincent Lostanlen, Daniel Haider, Han Han, Mathieu Lagrange, Peter Balazs, Martin Ehler
  • for: resolving the dilemma between nonparametric (convnet) and parametric (e.g., Gabor/LEAF) approaches in waveform-based deep learning
  • methods: a multiresolution neural network (MuReNN) that trains separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT), with receptive fields dilated to match the exponentially growing scale of DWT atoms; its magnitude response is fit via knowledge distillation to established auditory filterbanks (Gammatone for speech, CQT for music, third-octave for urban sounds)
  • results: compared with convnets and Gabor convolutions, MuReNN reaches state-of-the-art performance on all three optimization problems
    Abstract Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank ''teacher'' is engineered by domain knowledge while the neural network ''student'' is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.
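A minimal sketch of the core MuReNN idea: split the waveform into octave subbands with a DWT, then learn a separate 1-D convolution per subband. The db4 wavelet, decomposition level, and kernel size are our choices, and the example requires PyWavelets:

```python
# Sketch: DWT octave subbands, each processed by its own learnable Conv1d.
import numpy as np
import pywt
import torch
import torch.nn as nn

x = np.random.randn(4096)                      # a raw waveform
bands = pywt.wavedec(x, "db4", level=5)        # [approx, detail_5, ..., detail_1]

convs = nn.ModuleList(nn.Conv1d(1, 8, kernel_size=9, padding=4)
                      for _ in bands)          # one learnable op per subband
feats = [c(torch.tensor(b, dtype=torch.float32).view(1, 1, -1))
         for c, b in zip(convs, bands)]
print([f.shape[-1] for f in feats])            # lengths roughly double toward fine scales
```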

ForestMonkey: Toolkit for Reasoning with AI-based Defect Detection and Classification Models

  • paper_url: http://arxiv.org/abs/2307.13815
  • repo_url: None
  • paper_authors: Jiajun Zhang, Georgina Cosma, Sarah Bugby, Jason Watkins
  • for: Forest Monkey (FM), a toolkit that reasons about the outputs of any AI-based defect detection and/or classification model with data explainability, producing charts and a text file that illustrate the reasoning results and suggest possible improvements
  • methods: feature extraction from predictions to reasoning targets, feature extraction from images to defect characteristics, and a decision-tree-based AI-Reasoner
  • results: the toolkit's time performance is evaluated on four AI models with different datasets, and a tutorial guides users through reasoning tasks with FM
    Abstract Artificial intelligence (AI) reasoning and explainable AI (XAI) tasks have gained popularity recently, enabling users to explain the predictions or decision processes of AI models. This paper introduces Forest Monkey (FM), a toolkit designed to reason the outputs of any AI-based defect detection and/or classification model with data explainability. Implemented as a Python package, FM takes input in the form of dataset folder paths (including original images, ground truth labels, and predicted labels) and provides a set of charts and a text file to illustrate the reasoning results and suggest possible improvements. The FM toolkit consists of processes such as feature extraction from predictions to reasoning targets, feature extraction from images to defect characteristics, and a decision tree-based AI-Reasoner. Additionally, this paper investigates the time performance of the FM toolkit when applied to four AI models with different datasets. Lastly, a tutorial is provided to guide users in performing reasoning tasks using the FM toolkit.

Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods

  • paper_url: http://arxiv.org/abs/2308.00129
  • repo_url: None
  • paper_authors: Qingming Tang
  • for: This thesis aims to improve downstream prediction tasks on temporal or spatial sequence data by using learned representations.
  • methods: It trains deep neural networks to learn good sequential representations, with supervised learning as the dominant approach.
  • results: The thesis presents a broad study across multiple learning settings, including supervised, unsupervised, semi-supervised, and multi-view learning.
    Abstract This thesis focuses on representation learning for sequence data over time or space, aiming to improve downstream sequence prediction tasks by using the learned representations. Supervised learning has been the most dominant approach for training deep neural networks for learning good sequential representations. However, one limiting factor to scale supervised learning is the lack of enough annotated data. Motivated by this challenge, it is natural to explore representation learning methods that can utilize large amounts of unlabeled and weakly labeled data, as well as an additional data modality. I describe my broad study of representation learning for speech data. Unlike most other works that focus on a single learning setting, this thesis studies multiple settings: supervised learning with auxiliary losses, unsupervised learning, semi-supervised learning, and multi-view learning. Besides different learning problems, I also explore multiple approaches for representation learning. Though I focus on speech data, the methods described in this thesis can also be applied to other domains. Overall, the field of representation learning is developing rapidly. State-of-the-art results on speech related tasks are typically based on Transformers pre-trained with large-scale self-supervised learning, which aims to learn generic representations that can benefit multiple downstream tasks. Since 2020, large-scale pre-training has been the de facto choice to achieve good performance. This delayed thesis does not attempt to summarize and compare with the latest results on speech representation learning; instead, it presents a unique study on speech representation learning before the Transformer era, that covers multiple learning settings. Some of the findings in this thesis can still be useful today.

How to Scale Your EMA

  • paper_url: http://arxiv.org/abs/2307.13813
  • repo_url: None
  • paper_authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb
  • for: This paper addresses preserving training dynamics across batch sizes in practical machine learning, enabling the trade-off between batch size and wall-clock time.
  • methods: It uses a scaling rule that adjusts the learning rate linearly with the batch size, and extends it to the model Exponential Moving Average (EMA), which improves the robustness and generalization of supervised learning.
  • results: The resulting rule preserves training dynamics across architectures, optimizers, and data modalities, and enables training EMA-based pseudo-labeling and SSL methods at small and large batch sizes; BYOL trains at batch size 24,576 without sacrificing performance, a 6x wall-clock time reduction.
    Abstract Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important tool for practical machine learning is the model Exponential Moving Average (EMA), which is a model copy that does not receive gradient information, but instead follows its target model with some momentum. This model EMA can improve the robustness and generalization properties of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have treated the model EMA separately from optimization, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of model EMAs and demonstrate its validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, optimally a 6$\times$ wall-clock time reduction.
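A minimal sketch of the two pieces under SGD: the learning rate scales linearly with the batch-size factor kappa, while the EMA momentum is exponentiated, rho -> rho**kappa, which matches the scaling the paper derives for model EMAs. Function and variable names are ours.

```python
import torch

def scale_for_batch_size(base_lr, base_rho, kappa):
    # Linear LR scaling (SGD) plus the EMA Scaling Rule: rho -> rho ** kappa.
    return base_lr * kappa, base_rho ** kappa

@torch.no_grad()
def ema_update(ema_model, model, rho):
    # Standard model EMA: the copy receives no gradients and follows the
    # target model with momentum rho.
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(rho).add_(p, alpha=1.0 - rho)
```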

When Multi-Task Learning Meets Partial Supervision: A Computer Vision Review

  • paper_url: http://arxiv.org/abs/2307.14382
  • repo_url: None
  • paper_authors: Maxime Fontana, Michael Spratling, Miaojing Shi
  • for: This paper reviews Multi-Task Learning (MTL), i.e., learning multiple tasks simultaneously while exploiting their mutual relationships to lower memory requirements and computation time.
  • methods: It surveys traditional MTL approaches, including the different parameter sharing techniques used to transfer knowledge between tasks, and discusses the challenges introduced by the resulting multi-objective optimization problem.
  • results: It presents MTL methods based on partial supervision that address these challenges, along with the available datasets, tools, and benchmarking results for evaluating them.
    Abstract Multi-Task Learning (MTL) aims to learn multiple tasks simultaneously while exploiting their mutual relationships. By using shared resources to simultaneously calculate multiple outputs, this learning paradigm has the potential to have lower memory requirements and inference times compared to the traditional approach of using separate methods for each task. Previous work in MTL has mainly focused on fully-supervised methods, as task relationships can not only be leveraged to lower the level of data-dependency of those methods but they can also improve performance. However, MTL introduces a set of challenges due to a complex optimisation scheme and a higher labeling requirement. This review focuses on how MTL could be utilised under different partial supervision settings to address these challenges. First, this review analyses how MTL traditionally uses different parameter sharing techniques to transfer knowledge in between tasks. Second, it presents the different challenges arising from such a multi-objective optimisation scheme. Third, it introduces how task groupings can be achieved by analysing task relationships. Fourth, it focuses on how partially supervised methods applied to MTL can tackle the aforementioned challenges. Lastly, this review presents the available datasets, tools and benchmarking results of such methods.
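The parameter sharing the review surveys is easiest to see in its simplest, hard-sharing form: one encoder shared across tasks, one head per task. A minimal PyTorch sketch, with layer sizes and the two example tasks chosen for illustration:

```python
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=10):
        super().__init__()
        # Shared encoder: its parameters receive gradients from every task.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Task-specific heads are kept separate.
        self.heads = nn.ModuleDict({
            "classify": nn.Linear(hidden, n_classes),
            "regress": nn.Linear(hidden, 1),
        })

    def forward(self, x):
        z = self.encoder(x)
        return {name: head(z) for name, head in self.heads.items()}
```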

EdgeConvEns: Convolutional Ensemble Learning for Edge Intelligence

  • paper_url: http://arxiv.org/abs/2307.14381
  • repo_url: None
  • paper_authors: Ilkay Sikdokur, İnci M. Baytaş, Arda Yurdakul
  • for: This work targets deploying deep learning models in edge networks, improving the learning capability of edge devices and overall prediction performance.
  • methods: It proposes a convolutional ensemble learning approach, EdgeConvEns, that trains heterogeneous weak models on edge devices of varying computational capacity and ensembles them on a central server for better predictions.
  • results: Experiments show EdgeConvEns can surpass state-of-the-art performance in various training scenarios while requiring fewer network communications and less data transfer.
    Abstract Deep edge intelligence aims to deploy deep learning models that demand computationally expensive training in the edge network with limited computational power. Moreover, many deep edge intelligence applications require handling distributed data that cannot be transferred to a central server due to privacy concerns. Decentralized learning methods, such as federated learning, offer solutions where models are learned collectively by exchanging learned weights. However, they often require complex models that edge devices may not handle and multiple rounds of network communication to achieve state-of-the-art performances. This study proposes a convolutional ensemble learning approach, coined EdgeConvEns, that facilitates training heterogeneous weak models on edge and learning to ensemble them where data on edge are heterogeneously distributed. Edge models are implemented and trained independently on Field-Programmable Gate Array (FPGA) devices with various computational capacities. Learned data representations are transferred to a central server where the ensemble model is trained with the learned features received from the edge devices to boost the overall prediction performance. Extensive experiments demonstrate that the EdgeConvEns can outperform the state-of-the-art performance with fewer communications and less data in various training scenarios.
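A hedged sketch of the server-side step: feature representations received from heterogeneous edge models are concatenated and a fusion classifier is trained on top. The dimensions and the fusion MLP are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ServerEnsemble(nn.Module):
    def __init__(self, edge_feat_dims, n_classes):
        super().__init__()
        # Fuse whatever feature widths the edge devices produce.
        self.fuse = nn.Sequential(
            nn.Linear(sum(edge_feat_dims), 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, edge_features):  # list of (batch, d_i) tensors
        return self.fuse(torch.cat(edge_features, dim=1))

# e.g. two edge models producing 64- and 128-dim features:
logits = ServerEnsemble([64, 128], n_classes=10)([torch.randn(32, 64), torch.randn(32, 128)])
```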

A large language model-assisted education tool to provide feedback on open-ended responses

  • paper_url: http://arxiv.org/abs/2308.02439
  • repo_url: https://github.com/KordingLab/llm4teach-freetext-server
  • paper_authors: Jordan K. Matelsky, Felipe Parodi, Tony Liu, Richard D. Lange, Konrad P. Kording
  • for: This paper presents a tool for automatically responding to open-ended questions, helping instructors deliver rapid personalized feedback and thereby improving student learning and instructional methodology.
  • methods: The tool uses large language models (LLMs), guided by instructor-defined criteria, to respond to open-ended questions.
  • results: The tool delivers rapid personalized feedback, enabling students to quickly test their knowledge and identify areas for improvement.
    Abstract Open-ended questions are a favored tool among instructors for assessing student understanding and encouraging critical exploration of course material. Providing feedback for such responses is a time-consuming task that can lead to overwhelmed instructors and decreased feedback quality. Many instructors resort to simpler question formats, like multiple-choice questions, which provide immediate feedback but at the expense of personalized and insightful comments. Here, we present a tool that uses large language models (LLMs), guided by instructor-defined criteria, to automate responses to open-ended questions. Our tool delivers rapid personalized feedback, enabling students to quickly test their knowledge and identify areas for improvement. We provide open-source reference implementations both as a web application and as a Jupyter Notebook widget that can be used with instructional coding or math notebooks. With instructor guidance, LLMs hold promise to enhance student learning outcomes and elevate instructional methodologies.
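A minimal sketch of criterion-guided feedback using the OpenAI Python client; the model name, rubric format, and wrapper function are our own illustrations, not the tool's actual interface (the released implementation is linked in the repo above).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(question, criteria, student_answer):
    # The instructor-defined criteria steer what the LLM attends to.
    prompt = (
        f"Question: {question}\n"
        f"Instructor criteria: {criteria}\n"
        f"Student answer: {student_answer}\n"
        "Give brief, personalized feedback and note areas for improvement."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```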

Is GPT a Computational Model of Emotion? Detailed Analysis

  • paper_url: http://arxiv.org/abs/2307.13779
  • repo_url: None
  • paper_authors: Ala N. Tak, Jonathan Gratch
  • for: This paper investigates the emotional reasoning abilities of the GPT family of large language models.
  • methods: The paper first examines how GPT reasons about autobiographical memories, then systematically varies aspects of situations to probe their effect on emotion intensity and coping tendencies.
  • results: Without prompt engineering, GPT's predictions align significantly with human-provided appraisals and emotional labels, but GPT struggles to predict emotion intensity and coping responses. GPT-4 performed best in the initial study yet fell short in the second, although minor prompt engineering improved its results. The studies raise questions about how to exploit these models' strengths and address their weaknesses, particularly response variability.
    Abstract This paper investigates the emotional reasoning abilities of the GPT family of large language models via a component perspective. The paper first examines how the model reasons about autobiographical memories. Second, it systematically varies aspects of situations to impact emotion intensity and coping tendencies. Even without the use of prompt engineering, it is shown that GPT's predictions align significantly with human-provided appraisals and emotional labels. However, GPT faces difficulties predicting emotion intensity and coping responses. GPT-4 showed the highest performance in the initial study but fell short in the second, despite providing superior results after minor prompt engineering. This assessment brings up questions on how to effectively employ the strong points and address the weak areas of these models, particularly concerning response variability. These studies underscore the merits of evaluating models from a componential perspective.

An Empirical Study on Bugs Inside PyTorch: A Replication Study

  • paper_url: http://arxiv.org/abs/2307.13777
  • repo_url: None
  • paper_authors: Sharon Chee Yin Ho, Vahid Majdinasab, Mohayeminul Islam, Diego Elias Costa, Emad Shihab, Foutse Khomh, Sarah Nadi, Muhammad Raza
  • for: This study examines the bug identification and fixing process in the PyTorch library, to better understand the characteristics and impact of bugs in deep learning libraries.
  • methods: It analyzes bugs found during PyTorch's development, characterizing their causes and symptoms and their locality within the project, and extracting patterns of bug fixes.
  • results: PyTorch bugs resemble bugs in traditional software projects more than deep-learning-specific issues; the study also compares the bug identification and fixing process with TensorFlow's, highlighting similarities and differences between the two libraries.
    Abstract Software systems are increasingly relying on deep learning components, due to their remarkable capability of identifying complex data patterns and powering intelligent behaviour. A core enabler of this change in software development is the availability of easy-to-use deep learning libraries. Libraries like PyTorch and TensorFlow empower a large variety of intelligent systems, offering a multitude of algorithms and configuration options, applicable to numerous domains of systems. However, bugs in those popular deep learning libraries also may have dire consequences for the quality of systems they enable; thus, it is important to understand how bugs are identified and fixed in those libraries. Inspired by a study of Jia et al., which investigates the bug identification and fixing process at TensorFlow, we characterize bugs in the PyTorch library, a very popular deep learning framework. We investigate the causes and symptoms of bugs identified during PyTorch's development, and assess their locality within the project, and extract patterns of bug fixes. Our results highlight that PyTorch bugs are more like traditional software projects bugs, than related to deep learning characteristics. Finally, we also compare our results with the study on TensorFlow, highlighting similarities and differences across the bug identification and fixing process.

Combating the Curse of Multilinguality in Cross-Lingual WSD by Aligning Sparse Contextualized Word Representations

  • paper_url: http://arxiv.org/abs/2307.13776
  • repo_url: https://github.com/begab/sparsity_makes_sense
  • paper_authors: Gábor Berend
  • for: This work performs zero-shot cross-lingual word sense disambiguation (WSD) with large pre-trained monolingual language models coupled with a contextualized mapping mechanism.
  • methods: It obtains sparse contextualized word representations via a dictionary learning procedure and uses them with the large pre-trained monolingual models for zero-shot WSD.
  • results: Experiments over 17 typologically diverse target languages show that these modifications yield a significant improvement of nearly 6.5 points in average F-score, from 62.0 to 68.5.
    Abstract In this paper, we advocate for using large pre-trained monolingual language models in cross lingual zero-shot word sense disambiguation (WSD) coupled with a contextualized mapping mechanism. We also report rigorous experiments that illustrate the effectiveness of employing sparse contextualized word representations obtained via a dictionary learning procedure. Our experimental results demonstrate that the above modifications yield a significant improvement of nearly 6.5 points of increase in the average F-score (from 62.0 to 68.5) over a collection of 17 typologically diverse set of target languages. We release our source code for replicating our experiments at https://github.com/begab/sparsity_makes_sense.
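A hedged sketch of the dictionary-learning step with scikit-learn: contextualized word vectors are sparse-coded over a learned dictionary. The embedding source, dimensions, and sparsity level are illustrative; the released sparsity_makes_sense code is the authoritative implementation.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 128))  # stand-in for contextualized vectors

# Learn a dictionary and sparse-code each embedding over it.
dl = DictionaryLearning(n_components=96, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, max_iter=20, random_state=0)
sparse_codes = dl.fit_transform(embeddings)   # (200, 96), mostly zeros
print("fraction of nonzero coefficients:", (sparse_codes != 0).mean())
```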

E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning

  • paper_url: http://arxiv.org/abs/2307.13770
  • repo_url: https://github.com/chenghan111/e2vpt
  • paper_authors: Cheng Han, Qifan Wang, Yiming Cui, Zhiwen Cao, Wenguan Wang, Siyuan Qi, Dongfang Liu
  • for: This paper proposes an effective and efficient adaptation method for large-scale transformer models that reduces the number of tunable parameters during fine-tuning.
  • methods: It applies parameter-efficient learning techniques, introducing learnable key-value prompts and visual prompts to improve adaptation, and adds a prompt pruning procedure that systematically removes low-importance prompts to improve efficiency.
  • results: Experiments show the method outperforms several state-of-the-art baselines on two benchmarks while using only 0.32% of the model parameters.
    Abstract As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure to systematically prune low importance prompts while preserving model performance, which largely enhances the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks, with considerably low parameter usage (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
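The visual-prompt half of the approach amounts to prepending a few learnable tokens to frozen patch embeddings, as in the sketch below. Token count and dimension are illustrative; the key-value prompts inside self-attention and the pruning procedure are omitted.

```python
import torch
import torch.nn as nn

class PromptedEmbedding(nn.Module):
    def __init__(self, n_prompts=10, dim=768):
        super().__init__()
        # Only these prompt tokens (plus a task head) would be trained;
        # the transformer backbone stays frozen.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, patch_tokens):  # (batch, n_patches, dim)
        p = self.prompts.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        return torch.cat([p, patch_tokens], dim=1)
```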

ClusterSeq: Enhancing Sequential Recommender Systems with Clustering based Meta-Learning

  • paper_url: http://arxiv.org/abs/2307.13766
  • repo_url: None
  • paper_authors: Mohammmadmahdi Maheri, Reza Abdollahzadeh, Bardia Mohammadi, Mina Rafiei, Jafar Habibi, Hamid R. Rabiee
  • For: Addresses the user cold-start problem to improve the effectiveness of sequential recommender systems.
  • Methods: A meta-learning, clustering-based sequential recommender system that exploits dynamic information in user sequences to improve item prediction accuracy.
  • Results: Compared with several state-of-the-art meta-learning recommenders, ClusterSeq shows higher prediction accuracy, especially for "minor users".
    Abstract In practical scenarios, the effectiveness of sequential recommendation systems is hindered by the user cold-start problem, which arises due to limited interactions for accurately determining user preferences. Previous studies have attempted to address this issue by combining meta-learning with user and item-side information. However, these approaches face inherent challenges in modeling user preference dynamics, particularly for "minor users" who exhibit distinct preferences compared to more common or "major users." To overcome these limitations, we present a novel approach called ClusterSeq, a Meta-Learning Clustering-Based Sequential Recommender System. ClusterSeq leverages dynamic information in the user sequence to enhance item prediction accuracy, even in the absence of side information. This model preserves the preferences of minor users without being overshadowed by major users, and it capitalizes on the collective knowledge of users within the same cluster. Extensive experiments conducted on various benchmark datasets validate the effectiveness of ClusterSeq. Empirical results consistently demonstrate that ClusterSeq outperforms several state-of-the-art meta-learning recommenders. Notably, compared to existing meta-learning methods, our proposed approach achieves a substantial improvement of 16-39% in Mean Reciprocal Rank (MRR).

Implicitly Normalized Explicitly Regularized Density Estimation

  • paper_url: http://arxiv.org/abs/2307.13763
  • repo_url: None
  • paper_authors: Mark Kozdoba, Binyamin Perets, Shie Mannor
  • for: This paper proposes a new non-parametric density estimation method based on regularizing a Sobolev norm of the density.
  • methods: Unlike Kernel Density Estimation, the method makes the model's bias explicit and interpretable. The associated kernel has no closed analytic form, but it can be approximated via sampling. The optimization problem is non-convex and standard gradient methods perform poorly; however, with an appropriate initialization and natural gradients, well-performing solutions can be obtained.
  • results: The method yields unnormalized densities, which prevents using log-likelihood for cross-validation, so Fisher-divergence-based score matching is adapted for this task instead. On the recent ADBench anomaly detection benchmark, the method ranks second best among more than 15 algorithms.
    Abstract We propose a new approach to non-parametric density estimation, that is based on regularizing a Sobolev norm of the density. This method is provably different from Kernel Density Estimation, and makes the bias of the model clear and interpretable. While there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. The optimization problem needed to determine the density is non-convex, and standard gradient methods do not perform well. However, we show that with an appropriate initialization and using natural gradients, one can obtain well performing solutions. Finally, while the approach provides unnormalized densities, which prevents the use of log-likelihood for cross validation, we show that one can instead adapt Fisher Divergence based Score Matching methods for this task. We evaluate the resulting method on the comprehensive recent Anomaly Detection benchmark suite, ADBench, and find that it ranks second best, among more than 15 algorithms.
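The Fisher-divergence-based selection builds on score matching. For intuition, here is a 1-D Hyvarinen score matching objective, E[0.5*s(x)^2 + s'(x)], implemented with autograd; the toy score network is our own illustration, not the paper's estimator.

```python
import torch

def score_matching_loss(score_fn, x):
    # Hyvarinen score matching for 1-D data: E[0.5 * s(x)^2 + ds/dx].
    x = x.detach().requires_grad_(True)
    s = score_fn(x)
    ds = torch.autograd.grad(s.sum(), x, create_graph=True)[0]
    return (0.5 * s ** 2 + ds).mean()

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
loss = score_matching_loss(net, torch.randn(256, 1))
loss.backward()  # gradients flow to the score model's parameters
```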

Training-based Model Refinement and Representation Disagreement for Semi-Supervised Object Detection

  • paper_url: http://arxiv.org/abs/2307.13755
  • repo_url: None
  • paper_authors: Seyed Mojtaba Marvasti-Zadeh, Nilanjan Ray, Nadir Erbilgin
  • for: Improves the performance and generalization of existing object detectors through semi-supervised object detection, using limited labeled data together with extensive unlabeled data.
  • methods: Proposes a novel training-based model refinement (TMR) stage and a simple yet effective representation disagreement (RD) strategy, addressing the limitations of the classical EMA strategy and the late-training consensus of Teacher-Student models.
  • results: The proposed approach outperforms the baseline Unbiased-Teacher-v2 (and Unbiased-Teacher-v1) by average mAP margins of 2.23, 2.1, and 3.36 (and 2.07, 1.9, and 3.27) on the COCO-standard, COCO-additional, and Pascal VOC datasets, respectively.
    Abstract Semi-supervised object detection (SSOD) aims to improve the performance and generalization of existing object detectors by utilizing limited labeled data and extensive unlabeled data. Despite many advances, recent SSOD methods are still challenged by inadequate model refinement using the classical exponential moving average (EMA) strategy, the consensus of Teacher-Student models in the latter stages of training (i.e., losing their distinctiveness), and noisy/misleading pseudo-labels. This paper proposes a novel training-based model refinement (TMR) stage and a simple yet effective representation disagreement (RD) strategy to address the limitations of classical EMA and the consensus problem. The TMR stage of Teacher-Student models optimizes the lightweight scaling operation to refine the model's weights and prevent overfitting or forgetting learned patterns from unlabeled data. Meanwhile, the RD strategy helps keep these models diverged to encourage the student model to explore complementary representations. Our approach can be integrated into established SSOD methods and is empirically validated using two baseline methods, with and without cascade regression, to generate more reliable pseudo-labels. Extensive experiments demonstrate the superior performance of our approach over state-of-the-art SSOD methods. Specifically, the proposed approach outperforms the baseline Unbiased-Teacher-v2 (& Unbiased-Teacher-v1) method by an average mAP margin of 2.23, 2.1, and 3.36 (& 2.07, 1.9, and 3.27) on COCO-standard, COCO-additional, and Pascal VOC datasets, respectively.

Benchmarking and Analyzing Generative Data for Visual Recognition

  • paper_url: http://arxiv.org/abs/2307.13697
  • repo_url: https://github.com/Luodian/GenBench
  • paper_authors: Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu
  • for: This work examines the potential of large pre-trained generative models as data generators for visual recognition, chiefly comparing three data sources: generative, retrieved, and original.
  • methods: It introduces GenBench, a broad benchmark comprising 22 datasets with 2,548 categories for appraising generative data across visual recognition tasks, and proposes CLER, a training-free metric indicating the efficiency of generative data for recognition tasks prior to training.
  • results: Generative data shows distinctive strengths across many visual recognition tasks, and injecting external knowledge by fine-tuning per-category special token embeddings via Textual Inversion further improves performance.
    Abstract Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external data (i.e., generative vs. retrieval vs. original). Our key contributions are: 1) GenBench Construction: We devise GenBench, a broad benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks. 2) CLER Score: To address the insufficient correlation of existing metrics (e.g., FID, CLIP score) with downstream recognition performance, we propose CLER, a training-free metric indicating generative data's efficiency for recognition tasks prior to training. 3) New Baselines: Comparisons of generative data with retrieved data from the same external pool help to elucidate the unique traits of generative data. 4) External Knowledge Injection: By fine-tuning special token embeddings for each category via Textual Inversion, performance improves across 17 datasets, except when dealing with low-resolution reference images. Our exhaustive benchmark and analysis spotlight generative data's promise in visual recognition, while identifying key challenges for future investigation.

Foundational Models Defining a New Era in Vision: A Survey and Outlook

  • paper_url: http://arxiv.org/abs/2307.13721
  • repo_url: https://github.com/awaisrauf/awesome-cv-foundational-models
  • paper_authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan
  • for: Foundational models for computer vision tasks, such as segmentation, object detection, and image/video captioning, are reviewed in this paper.
  • methods: The paper discusses various architecture designs, training objectives, pre-training datasets, fine-tuning mechanisms, and prompting patterns used in foundational models.
  • results: The paper reviews recent developments in foundational models and their applications in computer vision tasks, including their ability to generalize to new scenes and tasks, their contextual understanding, and their limitations in real-world environments.
    Abstract Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at \url{https://github.com/awaisrauf/Awesome-CV-Foundational-Models}.

Composite Diffusion | whole >= Σparts

  • paper_url: http://arxiv.org/abs/2307.13720
  • repo_url: None
  • paper_authors: Vikram Jamwal, Ramaneswaran S
  • for: This paper provides a text-to-image diffusion method for generating high-quality images while giving artists and graphic designers better control over spatial layout and composition.
  • methods: Composite Diffusion lets artists compose an image from sub-scenes arranged through a flexible free-form segment layout; the content of each sub-scene is described primarily with natural text and optionally with reference images or control inputs such as line art, scribbles, human pose, and canny edges.
  • results: The method produces high-quality generations with greater spatial, semantic, and creative control; since existing image quality metrics lack a holistic evaluation of composites, the paper proposes new quality criteria aligned with both image quality and the artist's intent.
    Abstract For an artist or a graphic designer, the spatial layout of a scene is a critical design choice. However, existing text-to-image diffusion models provide limited support for incorporating spatial information. This paper introduces Composite Diffusion as a means for artists to generate high-quality images by composing from the sub-scenes. The artists can specify the arrangement of these sub-scenes through a flexible free-form segment layout. They can describe the content of each sub-scene primarily using natural text and additionally by utilizing reference images or control inputs such as line art, scribbles, human pose, canny edges, and more. We provide a comprehensive and modular method for Composite Diffusion that enables alternative ways of generating, composing, and harmonizing sub-scenes. Further, we wish to evaluate the composite image for effectiveness in both image quality and achieving the artist's intent. We argue that existing image quality metrics lack a holistic evaluation of image composites. To address this, we propose novel quality criteria especially relevant to composite generation. We believe that our approach provides an intuitive method of art creation. Through extensive user surveys, quantitative and qualitative analysis, we show how it achieves greater spatial, semantic, and creative control over image generation. In addition, our methods do not need to retrain or modify the architecture of the base diffusion models and can work in a plug-and-play manner with the fine-tuned models.

The Visual Language of Fabrics

  • paper_url: http://arxiv.org/abs/2307.13681
  • repo_url: None
  • paper_authors: Valentin Deschaintre, Julia Guerrero-Viu, Diego Gutierrez, Tamy Boubekeur, Belen Masia
  • for: This work introduces text2fabric, a new dataset that links free-form natural language descriptions to images of fabric materials.
  • methods: The authors study free text as a way to describe fabric appearance and, by analyzing the dataset, identify a compact lexicon, a set of attributes, and a key structure that emerge from the descriptions.
  • results: The dataset enables accurate understanding of fabric descriptions and can specialize large vision-language models such as CLIP, creating a meaningful latent space for fabric appearance and significantly improving applications such as fine-grained material retrieval and automatic captioning.
    Abstract We introduce text2fabric, a novel dataset that links free-text descriptions to various fabric materials. The dataset comprises 15,000 natural language descriptions associated to 3,000 corresponding images of fabric materials. Traditionally, material descriptions come in the form of tags/keywords, which limits their expressivity, induces pre-existing knowledge of the appropriate vocabulary, and ultimately leads to a chopped description system. Therefore, we study the use of free-text as a more appropriate way to describe material appearance, taking the use case of fabrics as a common item that non-experts may often deal with. Based on the analysis of the dataset, we identify a compact lexicon, set of attributes and key structure that emerge from the descriptions. This allows us to accurately understand how people describe fabrics and draw directions for generalization to other types of materials. We also show that our dataset enables specializing large vision-language models such as CLIP, creating a meaningful latent space for fabric appearance, and significantly improving applications such as fine-grained material retrieval and automatic captioning.

How Can Large Language Models Help Humans in Design and Manufacturing?

  • paper_url: http://arxiv.org/abs/2307.14377
  • repo_url: None
  • paper_authors: Liane Makatura, Michael Foshey, Bohan Wang, Felix HähnLein, Pingchuan Ma, Bolei Deng, Megan Tjandrasuwita, Andrew Spielberg, Crystal Elaine Owens, Peter Yichen Chen, Allan Zhao, Amy Zhu, Wil J Norton, Edward Gu, Joshua Jacob, Yifei Li, Adriana Schulz, Wojciech Matusik
  • for: investigate the application of Large Language Models (LLMs) in generative design across the entire design and manufacturing workflow.
  • methods: convert text-based prompts into design specifications, transform designs into manufacturing instructions, produce design spaces and variations, compute design performance, and search for designs based on performance.
  • results: highlight both the benefits and limitations of current LLMs through a series of examples, with the goal of catalyzing continued improvement and progression of these models.
    Abstract The advancement of Large Language Models (LLMs), including GPT-4, provides exciting new opportunities for generative design. We investigate the application of this tool across the entire design and manufacturing workflow. Specifically, we scrutinize the utility of LLMs in tasks such as: converting a text-based prompt into a design specification, transforming a design into manufacturing instructions, producing a design space and design variations, computing the performance of a design, and searching for designs predicated on performance. Through a series of examples, we highlight both the benefits and the limitations of the current LLMs. By exposing these limitations, we aspire to catalyze the continued improvement and progression of these models.

FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.13716
  • repo_url: None
  • paper_authors: Leiming Chen, Cihao Dong, Sibo Qiao, Ziling Huang, Kai Wang, Yuming Nie, Zhaoxiang Hou, Cheewei Tan
  • for: Addresses the problems in federated learning where uneven client model quality and maliciously uploaded models degrade global model accuracy.
  • methods: Proposes a reinforcement-learning-based model fusion method with two stages: the first filters out malicious models and selects trusted client models to participate in fusion; the second adaptively adjusts the weights of the trusted client models and aggregates the optimal global model.
  • results: Across five model fusion scenarios, the algorithm is more reliable than the baseline algorithms while maintaining accuracy.
    Abstract Traditional federated learning uses the number of samples to calculate the weights of each client model and uses this fixed weight value to fusion the global model. However, in practical scenarios, each client's device and data heterogeneity leads to differences in the quality of each client's model. Thus the contribution to the global model is not wholly determined by the sample size. In addition, if clients intentionally upload low-quality or malicious models, using these models for aggregation will lead to a severe decrease in global model accuracy. Traditional federated learning algorithms do not address these issues. To solve this problem, we propose FedDRL, a model fusion approach using reinforcement learning based on a two staged approach. In the first stage, Our method could filter out malicious models and selects trusted client models to participate in the model fusion. In the second stage, the FedDRL algorithm adaptively adjusts the weights of the trusted client models and aggregates the optimal global model. We also define five model fusion scenarios and compare our method with two baseline algorithms in those scenarios. The experimental results show that our algorithm has higher reliability than other algorithms while maintaining accuracy.
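A rough sketch of the second-stage fusion: client state_dicts are averaged under trust weights. In FedDRL the weights come from a staged reinforcement-learning policy and malicious clients are filtered beforehand; both parts are omitted here, so the weights are plain inputs.

```python
import torch

def trusted_weighted_fusion(client_states, trust_weights):
    # Normalize trust weights, then take a weighted average of each parameter.
    w = torch.tensor(trust_weights, dtype=torch.float32)
    w = w / w.sum()
    return {k: sum(wi * sd[k].float() for wi, sd in zip(w, client_states))
            for k in client_states[0]}
```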

Towards an AI Accountability Policy

  • paper_url: http://arxiv.org/abs/2307.13658
  • repo_url: None
  • paper_authors: Przemyslaw Grabowicz, Nicholas Perello, Yair Zick
  • for: This white paper responds to the "AI Accountability Policy Request for Comments" issued by the National Telecommunications and Information Administration (NTIA) of the United States.
  • methods: It offers a set of interconnected recommendations for an AI accountability policy.
  • results: The recommendations are intended to support a credible, trustworthy, and auditable AI accountability policy regime.
    Abstract This white paper is a response to the "AI Accountability Policy Request for Comments" by the National Telecommunications and Information Administration of the United States. The question numbers for which comments were requested are provided in superscripts at the end of key sentences answering the respective questions. The white paper offers a set of interconnected recommendations for an AI accountability policy.

QuickQual: Lightweight, convenient retinal image quality scoring with off-the-shelf pretrained models

  • paper_url: http://arxiv.org/abs/2307.13646
  • repo_url: https://github.com/justinengelmann/quickqual
  • paper_authors: Justin Engelmann, Amos Storkey, Miguel O. Bernabeu
  • for: This paper proposes a new approach to retinal image quality scoring (RIQS), addressing the cost of current deep learning (DL) pipelines for grading fundus image quality.
  • methods: The method combines a single off-the-shelf ImageNet-pretrained DenseNet-121 backbone with a Support Vector Machine (SVM) classifier.
  • results: It sets a new state of the art on EyeQ (accuracy: 88.50%, AUC: 0.9687), suggesting RIQS can be solved with generic perceptual features learned on natural images rather than DL models trained on large amounts of fundus images.
    Abstract Image quality remains a key problem for both traditional and deep learning (DL)-based approaches to retinal image analysis, but identifying poor quality images can be time consuming and subjective. Thus, automated methods for retinal image quality scoring (RIQS) are needed. The current state-of-the-art is MCFNet, composed of three Densenet121 backbones each operating in a different colour space. MCFNet, and the EyeQ dataset released by the same authors, was a huge step forward for RIQS. We present QuickQual, a simple approach to RIQS, consisting of a single off-the-shelf ImageNet-pretrained Densenet121 backbone plus a Support Vector Machine (SVM). QuickQual performs very well, setting a new state-of-the-art for EyeQ (Accuracy: 88.50% vs 88.00% for MCFNet; AUC: 0.9687 vs 0.9588). This suggests that RIQS can be solved with generic perceptual features learned on natural images, as opposed to requiring DL models trained on large amounts of fundus images. Additionally, we propose a Fixed Prior linearisation scheme, that converts EyeQ from a 3-way classification to a continuous logistic regression task. For this task, we present a second model, QuickQual MEga Minified Estimator (QuickQual-MEME), that consists of only 10 parameters on top of an off-the-shelf Densenet121 and can distinguish between gradable and ungradable images with an accuracy of 89.18% (AUC: 0.9537). Code and model are available on GitHub: https://github.com/justinengelmann/QuickQual . QuickQual is so lightweight, that the entire inference code (and even the parameters for QuickQual-MEME) is already contained in this paper.
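The recipe is simple enough to sketch directly: frozen ImageNet DenseNet-121 features plus an SVM. Preprocessing and hyperparameters below are illustrative; the released QuickQual code is the reference implementation.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

backbone = models.densenet121(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()  # expose the 1024-d penultimate features
backbone.eval()

@torch.no_grad()
def extract_features(images):  # images: (N, 3, 224, 224), ImageNet-normalized
    return backbone(images).numpy()

# With EyeQ data loaded as tensors and integer grades:
# clf = SVC().fit(extract_features(train_images), train_labels)
# preds = clf.predict(extract_features(test_images))
```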

Safety Margins for Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.13642
  • repo_url: None
  • paper_authors: Alexander Grushin, Walt Woods, Alvaro Velasquez, Simon Khan
  • for: This work aims to quantitatively identify when an autonomous controller is approaching unsafe situations, so that human oversight can be drawn in on time, e.g., in freight transportation applications.
  • methods: The true criticality of an agent's situation is defined as the mean reduction in reward given some number of random actions, and proxy criticality metrics computable in real time are leveraged to generate safety margins.
  • results: Evaluating policies learned with APE-X and A3C in an Atari environment, safety margins directly reflect how hazardous a situation is and decrease as agents approach failure states.
    Abstract Any autonomous controller will be unsafe in some situations. The ability to quantitatively identify when these unsafe situations are about to occur is crucial for drawing timely human oversight in, e.g., freight transportation applications. In this work, we demonstrate that the true criticality of an agent's situation can be robustly defined as the mean reduction in reward given some number of random actions. Proxy criticality metrics that are computable in real-time (i.e., without actually simulating the effects of random actions) can be compared to the true criticality, and we show how to leverage these proxy metrics to generate safety margins, which directly tie the consequences of potentially incorrect actions to an anticipated loss in overall performance. We evaluate our approach on learned policies from APE-X and A3C within an Atari environment, and demonstrate how safety margins decrease as agents approach failure states. The integration of safety margins into programs for monitoring deployed agents allows for the real-time identification of potentially catastrophic situations.
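The definition of true criticality admits a direct Monte Carlo estimate. The sketch below assumes the classic gym step API (obs, reward, done, info), a deep-copyable environment, and a rollout_fn helper that plays the learned policy to episode end; all three are our assumptions.

```python
import copy
import numpy as np

def true_criticality(env, rollout_fn, n_random, n_rollouts=32):
    # Mean return under the policy from the current state...
    baseline = np.mean([rollout_fn(copy.deepcopy(env)) for _ in range(n_rollouts)])
    # ...minus the mean return after n_random random actions.
    perturbed = []
    for _ in range(n_rollouts):
        e = copy.deepcopy(env)
        ret, done = 0.0, False
        for _ in range(n_random):
            _, reward, done, _ = e.step(e.action_space.sample())
            ret += reward
            if done:
                break
        if not done:
            ret += rollout_fn(e)
        perturbed.append(ret)
    return baseline - float(np.mean(perturbed))
```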

GPT-3 Models are Few-Shot Financial Reasoners

  • paper_url: http://arxiv.org/abs/2307.13617
  • repo_url: None
  • paper_authors: Raul Salles de Padua, Imran Qureshi, Mustafa U. Karakaplan
  • For: The paper evaluates the performance of pre-trained language models, specifically GPT-3, in answering financial questions.
  • Methods: The paper uses a combination of a retriever and a logic engine to answer financial questions, and the authors experiment with different prompting approaches.
  • Results: The authors find that a separate retrieval model and logic engine are essential to achieving state-of-the-art performance on financial question answering, and their refined prompt-engineering approach on GPT-3 achieves near state-of-the-art accuracy without any fine-tuning.
    Abstract Financial analysis is an important tool for evaluating company performance. Practitioners work to answer financial questions to make profitable investment decisions, and use advanced quantitative analyses to do so. As a result, Financial Question Answering (QA) is a question answering task that requires deep reasoning about numbers. Furthermore, it is unknown how well pre-trained language models can reason in the financial domain. The current state-of-the-art requires a retriever to collect relevant facts about the financial question from the text and a generator to produce a valid financial program and a final answer. However, recently large language models like GPT-3 have achieved state-of-the-art performance on wide variety of tasks with just a few shot examples. We run several experiments with GPT-3 and find that a separate retrieval model and logic engine continue to be essential components to achieving SOTA performance in this task, particularly due to the precise nature of financial questions and the complex information stored in financial documents. With this understanding, our refined prompt-engineering approach on GPT-3 achieves near SOTA accuracy without any fine-tuning.

Team Intro to AI team8 at CoachAI Badminton Challenge 2023: Advanced ShuttleNet for Shot Predictions

  • paper_url: http://arxiv.org/abs/2307.13715
  • repo_url: None
  • paper_authors: Shih-Hong Chen, Pin-Hsuan Chou, Yong-Fu Liu, Chien-An Han
  • for: Improves the performance of the existing ShuttleNet framework in predicting badminton shot types and locations by leveraging past strokes.
  • methods: Past strokes are leveraged to enhance ShuttleNet's predictive performance.
  • results: The approach achieved results significantly better than the baseline in the CoachAI Badminton Challenge and ultimately took first place in the competition.
    Abstract In this paper, our objective is to improve the performance of the existing framework ShuttleNet in predicting badminton shot types and locations by leveraging past strokes. We participated in the CoachAI Badminton Challenge at IJCAI 2023 and achieved significantly better results compared to the baseline. Ultimately, our team achieved the first position in the competition and we made our code available.

cs.CL - 2023-07-26

Say Goodbye to RNN-T Loss: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.14132
  • repo_url: None
  • paper_authors: Tian-Hao Zhang, Dinghao Zhou, Guiping Zhong, Baoxiang Li
  • for: Improves the efficiency and performance of ASR models.
  • methods: Proposes a novel CIF-Transducer model that incorporates the Continuous Integrate-and-Fire mechanism, abandoning the RNN-T loss and giving the prediction network a more significant role.
  • results: Achieves state-of-the-art results on the AISHELL-1 and WenetSpeech datasets with lower computational overhead than RNN-T models.
    Abstract RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, thus bringing a computational reduction and allowing the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, Unified Gating and Bilinear Pooling joint network, and auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.
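For intuition, a minimal unbatched rendition of the Continuous Integrate-and-Fire step: per-frame weights accumulate until they cross a threshold, an integrated embedding fires, and the remainder carries over. Streaming and batching details from the paper are omitted, and alpha is assumed to lie in [0, 1].

```python
import torch

def cif(hidden, alpha, threshold=1.0):
    # hidden: (T, D) encoder states; alpha: (T,) non-negative firing weights.
    fired, acc_w = [], 0.0
    acc_h = torch.zeros(hidden.size(1))
    for t in range(hidden.size(0)):
        w = float(alpha[t])
        if acc_w + w < threshold:
            acc_w += w
            acc_h = acc_h + w * hidden[t]
        else:
            spill = acc_w + w - threshold      # the part of this frame that overflows
            fired.append(acc_h + (w - spill) * hidden[t])
            acc_w, acc_h = spill, spill * hidden[t]
    return torch.stack(fired) if fired else torch.zeros(0, hidden.size(1))
```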

Leveraging Implicit Feedback from Deployment Data in Dialogue

  • paper_url: http://arxiv.org/abs/2307.14117
  • repo_url: None
  • paper_authors: Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
  • for: This work aims to improve social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations.
  • methods: It uses implicit signals in collected dialogue episodes, such as user response length, sentiment, and the reaction of future human utterances, as measures of machine-generated utterance quality for training new models.
  • results: Human evaluation indicates the new models' responses are higher quality than the baseline's, but some proxy signals can also yield more generations with undesirable properties, such as controversial or unfriendly responses.
    Abstract We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can lead to more generations with undesirable properties as well. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.

Decoding ChatGPT: A Taxonomy of Existing Research, Current Challenges, and Possible Future Directions

  • paper_url: http://arxiv.org/abs/2307.14107
  • repo_url: None
  • paper_authors: Shahab Saquib Sohail, Faiza Farhat, Yassine Himeur, Mohammad Nadeem, Dag Øivind Madsen, Yashbir Singh, Shadi Atalla, Wathiq Mansoor
  • for: Providing a comprehensive review of ChatGPT research, with a taxonomy of existing work and an exploration of applications and open problems across domains.
  • methods: Categorizes and critically analyzes over 100 Scopus-indexed publications, describing the approaches employed and the challenges in each application area.
  • results: Identifies applications of ChatGPT in healthcare, marketing and financial services, software engineering, academic and scientific writing, research and education, environmental science, and natural language processing, and proposes solutions to current challenges along with directions for future research.
    Abstract Chat Generative Pre-trained Transformer (ChatGPT) has gained significant interest and attention since its launch in November 2022. It has shown impressive performance in various domains, including passing exams and creative writing. However, challenges and concerns related to biases and trust persist. In this work, we present a comprehensive review of over 100 Scopus-indexed publications on ChatGPT, aiming to provide a taxonomy of ChatGPT research and explore its applications. We critically analyze the existing literature, identifying common approaches employed in the studies. Additionally, we investigate diverse application areas where ChatGPT has found utility, such as healthcare, marketing and financial services, software engineering, academic and scientific writing, research and education, environmental science, and natural language processing. Through examining these applications, we gain valuable insights into the potential of ChatGPT in addressing real-world challenges. We also discuss crucial issues related to ChatGPT, including biases and trustworthiness, emphasizing the need for further research and development in these areas. Furthermore, we identify potential future directions for ChatGPT research, proposing solutions to current challenges and speculating on expected advancements. By fully leveraging the capabilities of ChatGPT, we can unlock its potential across various domains, leading to advancements in conversational AI and transformative impacts in society.

Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

  • paper_url: http://arxiv.org/abs/2307.14031
  • repo_url: https://github.com/cambridgeltl/multi3woz
  • paper_authors: Songbo Hu, Han Zhou, Mete Hergul, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Ivan Vulić, Anna Korhonen
  • for: The paper aims to create a large-scale, culturally adapted, multi-domain task-oriented dialog (ToD) dataset for multiple languages.
  • methods: The paper introduces a novel dataset called Multi3WOZ, collected through a complex bottom-up process that includes human evaluation and cultural adaptation.
  • results: The paper presents the first sets of baseline scores across different ToD-related tasks for future reference, highlighting the challenging nature of the dataset.
    Abstract Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom-up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.

Unsupervised extraction of local and global keywords from a single text

  • paper_url: http://arxiv.org/abs/2307.14005
  • repo_url: None
  • paper_authors: Lida Aleksanyan, Armen E. Allahverdyan
  • for: Proposes an unsupervised, corpus-independent method for extracting keywords from a single text.
  • methods: Scores words by their spatial distribution in the text and by how that distribution responds to a random permutation of the words (a minimal sketch follows the abstract). Compared to existing methods such as YAKE, it is more effective on long texts, infers both local and global keywords, and uncovers basic themes; it is also language-independent and applies to short texts. Results are validated by human annotators with prior knowledge of the texts (classical literary works), and by annotator-independent arguments based on the average length of extracted content words and the average number of nouns among the extracted words.
  • results: Relates keywords to higher-order textual features and reveals a connection between keywords and chapter divisions.
    Abstract We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. As compared to existing methods (such as e.g. YAKE) our method has three advantages. First, it is significantly more effective at extracting keywords from long texts. Second, it allows inference of two types of keywords: local and global. Third, it uncovers basic themes in texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with previous knowledge of texts from our database of classical literary works (the agreement between annotators is from moderate to substantial). Our results are supported via human-independent arguments based on the average length of extracted content words and on the average number of nouns in extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.
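
A minimal sketch of the spatial-distribution idea: keywords cluster, so the normalized spread of their inter-occurrence gaps exceeds what the same counts produce under random permutation. The statistic and thresholds below are our simplifications; the paper additionally separates local from global keywords.

```python
import random
import re
from collections import defaultdict

def keyword_scores(text: str, n_shuffles: int = 10, min_count: int = 5):
    """Score each frequent word by the coefficient of variation of its
    positional gaps, normalized by the same statistic after shuffling.
    Scores well above 1 indicate clustered (keyword-like) words."""
    words = re.findall(r"[a-z']+", text.lower())

    def gap_cv(ws):
        pos = defaultdict(list)
        for i, w in enumerate(ws):
            pos[w].append(i)
        out = {}
        for w, p in pos.items():
            if len(p) >= min_count:
                gaps = [b - a for a, b in zip(p, p[1:])]
                mu = sum(gaps) / len(gaps)
                sd = (sum((g - mu) ** 2 for g in gaps) / len(gaps)) ** 0.5
                out[w] = sd / mu
        return out

    observed = gap_cv(words)
    null = defaultdict(float)
    shuffled = words[:]
    for _ in range(n_shuffles):
        random.shuffle(shuffled)               # destroys spatial structure
        for w, s in gap_cv(shuffled).items():
            null[w] += s / n_shuffles
    return {w: observed[w] / null[w] for w in observed if null.get(w)}
```

Being corpus-free, the whole computation needs nothing beyond the single input text, which is what makes the method language-independent.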

Affective Natural Language Generation of Event Descriptions through Fine-grained Appraisal Conditions

  • paper_url: http://arxiv.org/abs/2307.14004
  • repo_url: None
  • paper_authors: Yarik Menchaca Resendiz, Roman Klinger
  • for: Improving emotion expression in text generation by conditioning on fine-grained appraisal variables, enabling finer control over both content and affect than emotion categories alone.
  • methods: Fine-tunes Bart- and T5-based generation models with appraisal variables added as conditions during training (a conditioning sketch follows the abstract).
  • results: Adding appraisals during training improves the accuracy of the generated texts by 10 percentage points in F1, and texts generated with appraisal variables are longer and contain more details, demonstrating the finer-grained control available to users.
    Abstract Models for affective text generation have shown a remarkable progress, but they commonly rely only on basic emotion theories or valance/arousal values as conditions. This is appropriate when the goal is to create explicit emotion statements ("The kid is happy."). Emotions are, however, commonly communicated implicitly. For instance, the emotional interpretation of an event ("Their dog died.") does often not require an explicit emotion statement. In psychology, appraisal theories explain the link between a cognitive evaluation of an event and the potentially developed emotion. They put the assessment of the situation on the spot, for instance regarding the own control or the responsibility for what happens. We hypothesize and subsequently show that including appraisal variables as conditions in a generation framework comes with two advantages. (1) The generation model is informed in greater detail about what makes a specific emotion and what properties it has. This leads to text generation that better fulfills the condition. (2) The variables of appraisal allow a user to perform a more fine-grained control of the generated text, by stating properties of a situation instead of only providing the emotion category. Our Bart and T5-based experiments with 7 emotions (Anger, Disgust, Fear, Guilt, Joy, Sadness, Shame), and 7 appraisals (Attention, Responsibility, Control, Circumstance, Pleasantness, Effort, Certainty) show that (1) adding appraisals during training improves the accurateness of the generated texts by 10 pp in F1. Further, (2) the texts with appraisal variables are longer and contain more details. This exemplifies the greater control for users.
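
The conditioning can be as simple as serializing the emotion and appraisal values into the model input before standard seq2seq fine-tuning. A sketch with Hugging Face T5; the prompt format and field names are our assumptions, not the paper's exact scheme.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

def build_input(emotion: str, appraisals: dict) -> str:
    """Serialize the emotion category plus fine-grained appraisal
    conditions into a single conditioning prefix."""
    cond = " | ".join(f"{k}: {v}" for k, v in appraisals.items())
    return f"generate event description | emotion: {emotion} | {cond}"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Training pairs would be (conditioned input, event description); at inference:
text = build_input("guilt", {"responsibility": "high", "control": "high",
                             "pleasantness": "low"})
ids = tokenizer(text, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Exposing appraisal fields rather than a bare emotion label is what gives users the finer-grained control the results describe.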

Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG

  • paper_url: http://arxiv.org/abs/2307.14389
  • repo_url: https://github.com/yorgoon/diffe
  • paper_authors: Soowon Kim, Young-Eun Lee, Seo-Hyun Lee, Seong-Whan Lee
  • for: Decoding imagined speech from EEG signals, toward brain-computer interfaces that enable communication through imagined speech.
  • methods: Combines denoising diffusion probabilistic models (DDPMs) with a conditional autoencoder, named Diff-E, to cope with the high dimensionality and low signal-to-noise ratio of EEG (a minimal DDPM training sketch follows the abstract).
  • results: Significantly improves decoding accuracy over traditional machine learning techniques and baseline models, suggesting DDPMs can be an effective tool for EEG signal decoding.
    Abstract Decoding EEG signals for imagined speech is a challenging task due to the high-dimensional nature of the data and low signal-to-noise ratio. In recent years, denoising diffusion probabilistic models (DDPMs) have emerged as promising approaches for representation learning in various domains. Our study proposes a novel method for decoding EEG signals for imagined speech using DDPMs and a conditional autoencoder named Diff-E. Results indicate that Diff-E significantly improves the accuracy of decoding EEG signals for imagined speech compared to traditional machine learning techniques and baseline models. Our findings suggest that DDPMs can be an effective tool for EEG signal decoding, with potential implications for the development of brain-computer interfaces that enable communication through imagined speech.
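
For readers unfamiliar with DDPMs, the training signal is simply noise regression under a fixed Gaussian corruption schedule. The sketch below shows that standard step on EEG-shaped tensors; the tiny convolutional denoiser is a placeholder for Diff-E's conditional autoencoder, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # standard linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Placeholder denoiser that predicts the added noise."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 128, 3, padding=1), nn.SiLU(),
            nn.Conv1d(128, channels, 3, padding=1))
    def forward(self, x, t):
        return self.net(x)                      # timestep embedding omitted

def ddpm_loss(model, x0):
    """Sample a timestep, noise x0 accordingly, and regress the noise."""
    t = torch.randint(0, T, (x0.size(0),))
    a_bar = alphas_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t), eps)

eeg = torch.randn(8, 64, 256)                   # batch of 64-channel EEG windows
loss = ddpm_loss(TinyDenoiser(), eeg)
loss.backward()
```

The appeal for EEG is that the denoising objective forces the network to model structure in very noisy, high-dimensional signals, representations a downstream classifier can then exploit.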

This is not correct! Negation-aware Evaluation of Language Generation Systems

  • paper_url: http://arxiv.org/abs/2307.13989
  • repo_url: https://github.com/dmlls/cannot-dataset
  • paper_authors: Miriam Anschütz, Diego Miguel Lozano, Georg Groh
  • for: Proposes NegBLEURT, a negation-aware evaluation metric, to address the insensitivity of learned metrics to negation.
  • methods: Builds a rule-based sentence negation tool (sketched below) to create the CANNOT negation evaluation dataset, then fine-tunes a sentence transformer and an evaluation metric on it to improve their negation sensitivity.
  • results: The fine-tuned models outperform existing metrics on negated sentences by far while preserving their base models' performance on other perturbations.
    Abstract Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models' performances on other perturbations.
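
A toy version of such a rule-based negator, covering only the auxiliary/copula case; the paper's tool implements many more rules, so this is purely illustrative.

```python
AUXILIARIES = {"is", "are", "was", "were", "am", "can", "could", "will",
               "would", "should", "may", "might", "must", "do", "does",
               "did", "has", "have", "had"}

def negate(sentence: str):
    """Insert 'not' after the first auxiliary/copula; return None when no
    rule applies (a fuller tool would handle those with further rules)."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    return None

# Building (sentence, negated sentence) pairs for a CANNOT-style dataset:
sentences = ["The model is accurate.", "We can reproduce the results."]
pairs = [(s, n) for s in sentences if (n := negate(s))]
print(pairs)   # [('The model is accurate.', 'The model is not accurate.'), ...]
```

Pairs produced this way differ by a single negation, so any metric that scores both sides near-identically is exposed immediately.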

Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

  • paper_url: http://arxiv.org/abs/2307.14385
  • repo_url: https://github.com/neuhai/mental-llm
  • paper_authors: Xuhai Xu, Bingshen Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang
  • for: Evaluating the performance of multiple large language models (LLMs) on various mental health prediction tasks using online text data.
  • methods: Compares zero-shot prompting, few-shot prompting, and instruction fine-tuning across Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4 (a prompt-construction sketch follows the abstract).
  • results: Instruction fine-tuning significantly boosts performance on all tasks simultaneously; the best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% on balanced accuracy and the best of GPT-4 by 4.8%, performing on par with the state-of-the-art task-specific language model. An exploratory case study further probes LLMs' mental health reasoning, and the paper highlights ethical risks such as known racial and gender bias.
    Abstract Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for the mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on the mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.
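
The zero- and few-shot conditions amount to building prompts over post text; the template wording below is our invention, not the paper's exact prompt.

```python
def build_prompt(post: str, condition: str, examples=None) -> str:
    """Zero-shot (examples=None) or few-shot prompt for a binary
    mental-health screening decision over an online post."""
    header = (f"Decide whether the author of the post shows signs of "
              f"{condition}. Answer 'yes' or 'no'.\n\n")
    shots = ""
    for ex_post, ex_label in (examples or []):
        shots += f"Post: {ex_post}\nAnswer: {ex_label}\n\n"
    return header + shots + f"Post: {post}\nAnswer:"

print(build_prompt("I haven't slept in days and nothing feels worth doing.",
                   "depression",
                   examples=[("Had a great run this morning!", "no")]))
```

Instruction fine-tuning, the condition that wins in the paper, uses the same (instruction, post, label) format but updates the model weights on it rather than relying on in-context examples.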

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

  • paper_url: http://arxiv.org/abs/2307.13923
  • repo_url: https://github.com/freedomintelligence/grammargpt
  • paper_authors: Yaxin Fan, Feng Jiang, Peifeng Li, Haizhou Li
  • for: Exploring the potential of open-source large language models (LLMs) for native Chinese grammatical error correction (CGEC).
  • methods: Instruction-tunes an open-source LLM (Phoenix) on a hybrid dataset of ChatGPT-generated and human-annotated examples; for errors with clues, a heuristic guides ChatGPT to generate ungrammatical sentences from those clues, and an error-invariant augmentation method (sketched below) strengthens robustness to native Chinese grammatical errors.
  • results: GrammarGPT significantly outperforms the previous state-of-the-art system; although its parameters are 20x larger than the SOTA baseline's, the data required for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs for native CGEC. GrammarGPT ranked 3rd in NLPCC2023 SharedTask1, demonstrating the effectiveness of the approach.
    Abstract Grammatical error correction aims to correct ungrammatical sentences automatically. Recently, some work has demonstrated the excellent capabilities of closed-source Large Language Models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduced GrammarGPT, an open-source LLM, to preliminary explore its potential for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage the hybrid dataset of ChatGPT-generated and human-annotated. For grammatical errors with clues, we proposed a heuristic method to guide ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them. In addition, we employed an error-invariant augmentation method to enhance the ability of the model to correct native Chinese grammatical errors. We ultimately constructed about 1k parallel data and utilized these data to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT outperforms the existing SOTA system significantly. Although model parameters are 20x larger than the SOTA baseline, the required amount of data for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs on native CGEC. Our GrammarGPT ranks $3^{rd}$ on NLPCC2023 SharedTask1, demonstrating our approach's effectiveness. The code and data are available at \url{https://github.com/FreedomIntelligence/GrammarGPT}.
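
Error-invariant augmentation swaps error-irrelevant fillers (names, places) identically on both sides of a parallel pair, leaving the grammatical error itself untouched so the model attends to the error pattern rather than the surface words. A toy sketch; the substitution lists and the example pair are our assumptions.

```python
import random

# Fillers assumed not to interact with the grammatical error.
SUBSTITUTIONS = {
    "小明": ["小红", "小华", "老王"],
    "北京": ["上海", "广州", "深圳"],
}

def error_invariant_augment(src: str, tgt: str, n: int = 2):
    """Create n new (ungrammatical, corrected) pairs by replacing the same
    filler in both sentences, keeping the error span fixed."""
    out = []
    for _ in range(n):
        s, t = src, tgt
        for word, alts in SUBSTITUTIONS.items():
            if word in s and word in t:
                alt = random.choice(alts)
                s, t = s.replace(word, alt), t.replace(word, alt)
        out.append((s, t))
    return out

pair = ("小明对于去北京的事还没有决定。",   # assumed ungrammatical source
        "小明还没有决定去北京的事。")       # corrected target
for s, t in error_invariant_augment(*pair):
    print(s, "->", t)
```

Each augmented pair keeps the exact error-correction mapping, so the roughly 1k parallel examples stretch further during instruction tuning.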

Trustworthiness of Children Stories Generated by Large Language Models

  • paper_url: http://arxiv.org/abs/2308.00073
  • repo_url: None
  • paper_authors: Prabin Bhandari, Hannah Marie Brennan
  • for: Assessing the trustworthiness of children's stories generated by large language models (LLMs) by comparing and contrasting them with real children's stories.
  • methods: Evaluates LLM-generated children's stories with various measures and compares the results against both classic and new children's stories to better assess their significance.
  • results: LLMs still struggle to generate children's stories at the level of quality and nuance found in actual stories.
    Abstract Large Language Models (LLMs) have shown a tremendous capacity for generating literary text. However, their effectiveness in generating children's stories has yet to be thoroughly examined. In this study, we evaluate the trustworthiness of children's stories generated by LLMs using various measures, and we compare and contrast our results with both old and new children's stories to better assess their significance. Our findings suggest that LLMs still struggle to generate children's stories at the level of quality and nuance found in actual stories.

ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods Boosted by Ensemble Learning, Syntactical and Entity Features

  • paper_url: http://arxiv.org/abs/2307.13829
  • repo_url: None
  • paper_authors: Umitcan Sahin, Izzet Emre Kucukkaya, Oguzhan Ozcelik, Cagri Toraman
  • for: Detecting hate speech and its targets in text-embedded images, for the two subtasks of Multimodal Hate Speech Event Detection 2023.
  • methods: For hate speech detection, multimodal deep learning models boosted by ensemble learning and syntactical text attributes; for target detection, multimodal deep learning models boosted by named-entity features (a late-fusion sketch follows the abstract).
  • results: The models outperform all textual, visual, and text-visual baselines employed in multimodal hate speech detection and achieved first place in both subtasks on the final leaderboard of the shared task.
    Abstract Text-embedded images can serve as a means of spreading hate speech, propaganda, and extremist beliefs. Throughout the Russia-Ukraine war, both opposing factions heavily relied on text-embedded images as a vehicle for spreading propaganda and hate speech. Ensuring the effective detection of hate speech and propaganda is of utmost importance to mitigate the negative effect of hate speech dissemination. In this paper, we outline our methodologies for two subtasks of Multimodal Hate Speech Event Detection 2023. For the first subtask, hate speech detection, we utilize multimodal deep learning models boosted by ensemble learning and syntactical text attributes. For the second subtask, target detection, we employ multimodal deep learning models boosted by named entity features. Through experimentation, we demonstrate the superior performance of our models compared to all textual, visual, and text-visual baselines employed in multimodal hate speech detection. Furthermore, our models achieve the first place in both subtasks on the final leaderboard of the shared task.
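
The ensemble part of the recipe can be sketched as late fusion over the probabilities of heterogeneous members (a text model, an image model, a model with syntactic or entity features); the member choices and weights here are illustrative, not the exact competition pipeline.

```python
import numpy as np

def late_fusion(prob_lists, weights=None):
    """Weighted average of per-model hate-speech probabilities, one array
    of shape (n_samples,) per ensemble member."""
    probs = np.stack(prob_lists)                    # (n_models, n_samples)
    w = np.ones(len(prob_lists)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return (w[:, None] * probs).sum(axis=0)

# e.g., text-only, image-only, and text+syntax members:
p_text   = np.array([0.91, 0.20, 0.55])
p_image  = np.array([0.75, 0.35, 0.60])
p_syntax = np.array([0.88, 0.15, 0.40])
fused = late_fusion([p_text, p_image, p_syntax], weights=[0.4, 0.2, 0.4])
print((fused > 0.5).astype(int))                    # final per-sample labels
```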

Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy

  • paper_url: http://arxiv.org/abs/2307.13808
  • repo_url: None
  • paper_authors: Yu Fu, Deyi Xiong, Yue Dong
  • for: Mitigating the risks of machine-generated text: prior AI-detection work embeds watermarks via random vocabulary restrictions, but this significantly hurts conditional text generation.
  • methods: Proposes a semantic-aware watermarking algorithm that accounts for the characteristics of conditional text generation and the input context (a green-list sketch follows the abstract).
  • results: Yields substantial improvements across text generation models, including BART and Flan-T5, on tasks such as summarization and data-to-text generation, while maintaining detection ability.
    Abstract To mitigate potential risks associated with language models, recent AI detection research proposes incorporating watermarks into machine-generated text through random vocabulary restrictions and utilizing this information for detection. While these watermarks only induce a slight deterioration in perplexity, our empirical investigation reveals a significant detriment to the performance of conditional text generation. To address this issue, we introduce a simple yet effective semantic-aware watermarking algorithm that considers the characteristics of conditional text generation and the input context. Experimental results demonstrate that our proposed method yields substantial improvements across various text generation models, including BART and Flan-T5, in tasks such as summarization and data-to-text generation while maintaining detection ability.
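
Vocabulary-restriction watermarks pick a pseudorandom "green" subset of the vocabulary at each step and boost it; a semantic-aware variant can exempt tokens that carry the input's content so conditional tasks keep fidelity. The exemption rule below is a simplification of the paper's algorithm, shown only to make the mechanism concrete.

```python
import hashlib
import torch

def green_mask(prev_token: int, vocab_size: int, gamma: float = 0.5):
    """Deterministic gamma-fraction green list seeded by the previous
    token, as in hash-based watermarking."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2 ** 31)
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=g)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(gamma * vocab_size)]] = True
    return mask

def watermark_logits(logits, prev_token, source_token_ids, delta=2.0):
    """Boost green tokens, but leave tokens that appear in the source input
    unbiased -- a crude stand-in for semantic awareness."""
    mask = green_mask(prev_token, logits.size(-1))
    mask[list(source_token_ids)] = False     # don't distort source content words
    return logits + delta * mask.float()

logits = torch.randn(32000)
biased = watermark_logits(logits, prev_token=17,
                          source_token_ids={101, 2045, 8912})
```

Detection still works on the non-exempt positions, which is how quality on summarization and data-to-text can improve while the watermark stays detectable.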

Evaluating Large Language Models for Radiology Natural Language Processing

  • paper_url: http://arxiv.org/abs/2307.13693
  • repo_url: https://github.com/zhaozh10/LLM_CMP
  • paper_authors: Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, Lu Zhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen, Xu Liu, Peilong Wang, Pingkun Yan, Jun Liu, Bao Ge, Lichao Sun, Dajiang Zhu, Xiang Li, Wei Liu, Xiaoyan Cai, Xintao Hu, Xi Jiang, Shu Zhang, Xin Zhang, Tuo Zhang, Shijie Zhao, Quanzheng Li, Hongtu Zhu, Dinggang Shen, Tianming Liu
  • for: This study aims to evaluate the performance of 32 large language models (LLMs) in interpreting radiology reports and deriving impressions from radiologic findings.
  • methods: The study uses a dataset of radiology reports and assesses the LLMs’ ability to extract relevant information and provide accurate impressions.
  • results: The study provides insights into the strengths and weaknesses of the LLMs in this task, informing their practical applications within the medical domain.
    Abstract The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

ARB: Advanced Reasoning Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2307.13692
  • repo_url: None
  • paper_authors: Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki
  • for: Providing a more challenging benchmark for evaluating the advanced reasoning abilities of current language models across multiple fields.
  • methods: Introduces ARB, a benchmark of advanced reasoning problems in mathematics, physics, biology, chemistry, and law, including a harder math and physics subset that requires advanced symbolic reasoning and domain knowledge, together with a rubric-based evaluation approach in which GPT-4 scores its own intermediate reasoning steps (a rubric sketch follows the abstract).
  • results: Current models such as GPT-4 and Claude score well below 50% on the more demanding tasks, and a human evaluation of the symbolic subset shows promising agreement between annotators and GPT-4 rubric evaluation scores.
    Abstract Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.
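
Rubric-based evaluation decomposes grading into named criteria scored one by one, so a grader model judges intermediate steps rather than only the final answer. A sketch of building such a grading prompt; the rubric items are our assumptions, not ARB's released rubric.

```python
RUBRIC = [
    ("setup",      "Are the given quantities and the goal stated correctly?"),
    ("method",     "Is an appropriate solution method chosen?"),
    ("derivation", "Is each intermediate symbolic step valid?"),
    ("answer",     "Does the final answer match the reference?"),
]

def rubric_prompt(problem: str, solution: str, reference: str) -> str:
    """Prompt a grader model (e.g., GPT-4 scoring its own output) to rate
    each rubric item from 0 to 2."""
    items = "\n".join(f"- {name}: {q} (score 0-2)" for name, q in RUBRIC)
    return (f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
            f"Reference answer: {reference}\n\n"
            f"Grade the solution on each item:\n{items}\n"
            "Return one line per item as 'name: score'.")

print(rubric_prompt("Integrate x^2 from 0 to 1.",
                    "Antiderivative is x^3/3; evaluating gives 1/3.", "1/3"))
```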

A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check

  • paper_url: http://arxiv.org/abs/2307.13655
  • repo_url: None
  • paper_authors: Xunjian Yin, Xiaojun Wan
  • for: Investigating how Chinese Spelling Check (CSC) models that build on pre-trained models and phonetic and graphic information perform under different evaluation purposes.
  • methods: Abstracts the representative model paradigm, implements it with nine structures, and experiments on comprehensive test sets constructed with different purposes (a phonetic-graphic fusion sketch follows the abstract).
  • results: Finds that 1) reasonably fusing phonetic and graphic information is effective for CSC; 2) models are sensitive to the error distribution of the test set, revealing their shortcomings and the directions to work on; 3) whether the errors and contexts have been seen has a significant impact; and 4) the commonly used SIGHAN benchmark cannot reliably evaluate model performance.
    Abstract With the development of pre-trained models and the incorporation of phonetic and graphic information, neural models have achieved high scores in Chinese Spelling Check (CSC). However, it does not provide a comprehensive reflection of the models' capability due to the limited test sets. In this study, we abstract the representative model paradigm, implement it with nine structures and experiment them on comprehensive test sets we constructed with different purposes. We perform a detailed analysis of the results and find that: 1) Fusing phonetic and graphic information reasonably is effective for CSC. 2) Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models and reveals the direction we should work on. 3) Whether or not the errors and contexts have been seen has a significant impact on models. 4) The commonly used benchmark, SIGHAN, can not reliably evaluate models' performance.
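
"Fusing phonetic and graphic information" usually means combining pinyin and glyph features with the contextual character representation before the correction head; the gated form and dimensions below are our choices, sketched to show the shape of such a fusion module.

```python
import torch
import torch.nn as nn

class PhoneticGraphicFusion(nn.Module):
    """Gated fusion of contextual, pinyin, and glyph features per character,
    followed by a correction head over the character vocabulary."""
    def __init__(self, d: int = 768, vocab: int = 21128):
        super().__init__()
        self.gate = nn.Linear(3 * d, 2)     # mixing weights for pinyin / glyph
        self.head = nn.Linear(d, vocab)

    def forward(self, h_ctx, h_pinyin, h_glyph):
        # each input: (batch, seq_len, d)
        g = torch.sigmoid(self.gate(torch.cat([h_ctx, h_pinyin, h_glyph], -1)))
        fused = h_ctx + g[..., :1] * h_pinyin + g[..., 1:] * h_glyph
        return self.head(fused)             # logits for corrected characters

layer = PhoneticGraphicFusion()
h = torch.randn(2, 16, 768)
logits = layer(h, torch.randn_like(h), torch.randn_like(h))  # (2, 16, 21128)
```

The study's first finding is precisely about this step: how reasonably the two extra channels are mixed matters more than merely having them.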

Contributions to the Improvement of Question Answering Systems in the Biomedical Domain

  • paper_url: http://arxiv.org/abs/2307.13631
  • repo_url: None
  • paper_authors: Mourad Sarrouti
  • for: Improving the performance of question answering (QA) systems in the biomedical domain.
  • methods: Proposes four contributions: a machine learning-based question type classification method (sketched below); a method that assigns one or more topics (e.g., pharmacological, test, treatment) to questions to determine the semantic types of expected answers; document and passage retrieval methods over the MEDLINE database; and answer extraction methods that generate both exact and ideal answers, combined in a fully automated system called SemBioNLQA.
  • results: Experiments show that the proposed methods improve biomedical QA performance and produce accurate exact and ideal answers.
    Abstract This thesis work falls within the framework of question answering (QA) in the biomedical domain where several specific challenges are addressed, such as specialized lexicons and terminologies, the types of treated questions, and the characteristics of targeted documents. We are particularly interested in studying and improving methods that aim at finding accurate and short answers to biomedical natural language questions from a large scale of biomedical textual documents in English. QA aims at providing inquirers with direct, short and precise answers to their natural language questions. In this Ph.D. thesis, we propose four contributions to improve the performance of QA in the biomedical domain. In our first contribution, we propose a machine learning-based method for question type classification to determine the types of given questions which enable to a biomedical QA system to use the appropriate answer extraction method. We also propose an another machine learning-based method to assign one or more topics (e.g., pharmacological, test, treatment, etc.) to given questions in order to determine the semantic types of the expected answers which are very useful in generating specific answer retrieval strategies. In the second contribution, we first propose a document retrieval method to retrieve a set of relevant documents that are likely to contain the answers to biomedical questions from the MEDLINE database. We then present a passage retrieval method to retrieve a set of relevant passages to questions. In the third contribution, we propose specific answer extraction methods to generate both exact and ideal answers. Finally, in the fourth contribution, we develop a fully automated semantic biomedical QA system called SemBioNLQA which is able to deal with a variety of natural language questions and to generate appropriate answers by providing both exact and ideal answers.
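
The first contribution, question type classification, is a standard supervised text classification problem; a minimal scikit-learn sketch (the example questions and the BioASQ-style type labels are our assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set with BioASQ-style question types.
questions = [
    "Is metformin effective for type 2 diabetes?",
    "Which gene is associated with cystic fibrosis?",
    "What is the mechanism of action of aspirin?",
    "List the symptoms of Lyme disease.",
]
types = ["yes/no", "factoid", "summary", "list"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(questions, types)

# The predicted type routes the question to the matching answer-extraction
# strategy (e.g., yes/no questions need polarity detection, lists need
# entity enumeration).
print(clf.predict(["Are statins associated with muscle pain?"]))
```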

Diversity and Language Technology: How Techno-Linguistic Bias Can Cause Epistemic Injustice

  • paper_url: http://arxiv.org/abs/2307.13714
  • repo_url: None
  • paper_authors: Paula Helm, Gábor Bella, Gertraud Koch, Fausto Giunchiglia
  • for: This paper aims to address the issue of techno-linguistic bias in AI-based language technology, which can result in systems that only express concepts from dominant languages and cultures, rather than accurately representing concepts from marginalized language communities.
  • methods: The paper uses the concept of epistemic injustice to explore the systematic tendency of technology developer communities to apply a simplistic understanding of diversity, leading to a disregard for valuable aspects of diversity and an under-representation of the needs and diverse worldviews of marginalized language communities.
  • results: The paper shows that many attempts to extend the reach of AI technology to “underserved languages” produce flawed solutions that adhere to a hard-wired representational preference for certain languages, resulting in techno-linguistic bias and a lack of accurate representation of concepts from marginalized language communities.
    Abstract It is well known that AI-based language technology -- large language models, machine translation systems, multilingual dictionaries, and corpora -- is currently limited to 2 to 3 percent of the world's most widely spoken and/or financially and politically best supported languages. In response, recent research efforts have sought to extend the reach of AI technology to ``underserved languages.'' In this paper, we show that many of these attempts produce flawed solutions that adhere to a hard-wired representational preference for certain languages, which we call techno-linguistic bias. Techno-linguistic bias is distinct from the well-established phenomenon of linguistic bias as it does not concern the languages represented but rather the design of the technologies. As we show through the paper, techno-linguistic bias can result in systems that can only express concepts that are part of the language and culture of dominant powers, unable to correctly represent concepts from other communities. We argue that at the root of this problem lies a systematic tendency of technology developer communities to apply a simplistic understanding of diversity which does not do justice to the more profound differences that languages, and ultimately the communities that speak them, embody. Drawing on the concept of epistemic injustice, we point to the broader sociopolitical consequences of the bias we identify and show how it can lead not only to a disregard for valuable aspects of diversity but also to an under-representation of the needs and diverse worldviews of marginalized language communities.