cs.CV - 2023-08-15

CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.07837
  • repo_url: None
  • paper_authors: Yan Di, Chenyangguang Zhang, Pengyuan Wang, Guangyao Zhai, Ruida Zhang, Fabian Manhardt, Benjamin Busam, Xiangyang Ji, Federico Tombari
  • for: Presents a diffusion-model-based method for reconstructing a 3D sparse point cloud of an object captured in a single RGB image.
  • methods: Exploits a novel centered diffusion probabilistic model for consistent local feature conditioning. The noise and sampled point clouds are constrained to a subspace in which the point cloud center remains unchanged throughout the forward and reverse diffusion processes.
  • results: On the synthetic ShapeNet-R2N2 benchmark, CCD-3DR outperforms all competitors by a large margin, with over 40% improvement. The authors also report results on the real-world Pix3D dataset to demonstrate CCD-3DR's potential in practical applications.
    Abstract In this paper, we present a novel shape reconstruction method leveraging a diffusion model to generate a 3D sparse point cloud for the object captured in a single RGB image. Recent methods typically leverage global embedding or local projection-based features as the condition to guide the diffusion model. However, such strategies fail to consistently align the denoised point cloud with the given image, leading to unstable conditioning and inferior performance. In this paper, we present CCD-3DR, which exploits a novel centered diffusion probabilistic model for consistent local feature conditioning. We constrain the noise and sampled point cloud from the diffusion model into a subspace where the point cloud center remains unchanged during the forward diffusion process and reverse process. The stable point cloud center further serves as an anchor to align each point with its corresponding local projection-based features. Extensive experiments on synthetic benchmark ShapeNet-R2N2 demonstrate that CCD-3DR outperforms all competitors by a large margin, with over 40% improvement. We also provide results on real-world dataset Pix3D to thoroughly demonstrate the potential of CCD-3DR in real-world applications. Codes will be released soon.
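
A minimal sketch of the centering constraint, assuming point clouds are normalized to zero mean beforehand (function and variable names are illustrative, not from the paper's code):

```python
import math
import torch

def centered_noise(shape):
    """Sample Gaussian noise projected onto the zero-mean subspace,
    so adding it never shifts the point cloud centroid."""
    eps = torch.randn(shape)                    # (B, N, 3)
    return eps - eps.mean(dim=1, keepdim=True)  # remove the per-cloud mean

def forward_diffuse(x0, alpha_bar_t):
    """One forward step q(x_t | x_0) with centered noise.
    x0: (B, N, 3) point clouds, assumed zero-centered, so the centroid
    stays at the origin for every t and can anchor local projections."""
    eps = centered_noise(x0.shape)
    xt = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps
```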

Learning Better Keypoints for Multi-Object 6DoF Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.07827
  • repo_url: None
  • paper_authors: Yangzheng Wu, Michael Greenspan
  • for: Investigates the impact of pre-defined keypoints on pose estimation, finding that accuracy and efficiency can be improved by training a graph network (KeyGNet) to select a set of dispersed keypoints.
  • methods: KeyGNet is supervised by a combined loss measuring both Wasserstein distance and dispersion, and learns the color and geometry features of the target objects to estimate optimal keypoint locations.
  • results: Keypoints selected by KeyGNet improve accuracy for all evaluation metrics on all seven datasets tested. Notably, on the challenging Occlusion LINEMOD dataset, ADD(S) improves by +16.4% on PVN3D.
    Abstract We investigate the impact of pre-defined keypoints for pose estimation, and find that accuracy and efficiency can be improved by training a graph network to select a set of dispersed keypoints with similarly distributed votes. These votes, learned by a regression network to accumulate evidence for the keypoint locations, can be regressed more accurately compared to previous heuristic keypoint algorithms. The proposed KeyGNet, supervised by a combined loss measuring both Wasserstein distance and dispersion, learns the color and geometry features of the target objects to estimate optimal keypoint locations. Experiments demonstrate the keypoints selected by KeyGNet improved the accuracy for all evaluation metrics of all seven datasets tested, for three keypoint voting methods. The challenging Occlusion LINEMOD dataset notably improved ADD(S) by +16.4% on PVN3D, and all core BOP datasets showed an AR improvement for all objects, of between +1% and +21.5%. There was also a notable increase in performance when transitioning from single object to multiple object training using KeyGNet keypoints, essentially eliminating the SISO-MIMO gap for Occlusion LINEMOD.
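
The abstract names the two loss ingredients, Wasserstein distance and dispersion, without giving formulas; below is a plausible PyTorch sketch of each, where the sliced approximation is a stand-in for however the paper actually computes its Wasserstein term:

```python
import torch

def dispersion_loss(kps):
    """Encourage the selected keypoints to spread out:
    negative mean pairwise distance. kps: (K, 3)."""
    d = torch.cdist(kps, kps)               # (K, K) pairwise distances
    k = kps.shape[0]
    return -d.sum() / (k * (k - 1))         # zero diagonal is excluded

def sliced_wasserstein(a, b, n_proj=64):
    """Approximate W1 between two equally sized point sets by sorting
    random 1D projections. a, b: (N, 3)."""
    theta = torch.randn(n_proj, a.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    pa, _ = torch.sort(a @ theta.t(), dim=0)    # (N, n_proj)
    pb, _ = torch.sort(b @ theta.t(), dim=0)
    return (pa - pb).abs().mean()
```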

ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition

  • paper_url: http://arxiv.org/abs/2308.07815
  • repo_url: https://github.com/cool-xuan/imbalanced_sam
  • paper_authors: Yixuan Zhou, Yi Qu, Xing Xu, Hengtao Shen
  • for: Addressing the challenge of class imbalance in recognition tasks, specifically the generalization issues that arise when tail classes have limited training data.
  • methods: Proposes a class-aware smoothness optimization algorithm called Imbalanced-SAM (ImbSAM) that leverages class priors to restrict the generalization scope of the class-agnostic SAM, improving generalization targeting tail classes.
  • results: Demonstrates remarkable performance improvements for tail classes and anomaly detection in two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection.
    Abstract Class imbalance is a common challenge in real-world recognition tasks, where the majority of classes have few samples, also known as tail classes. We address this challenge with the perspective of generalization and empirically find that the promising Sharpness-Aware Minimization (SAM) fails to address generalization issues under the class-imbalanced setting. Through investigating this specific type of task, we identify that its generalization bottleneck primarily lies in the severe overfitting for tail classes with limited training data. To overcome this bottleneck, we leverage class priors to restrict the generalization scope of the class-agnostic SAM and propose a class-aware smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the guidance of class priors, our ImbSAM specifically improves generalization targeting tail classes. We also verify the efficacy of ImbSAM on two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection, where our ImbSAM demonstrates remarkable performance improvements for tail classes and anomalies. Our code implementation is available at https://github.com/cool-xuan/Imbalanced_SAM.
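
A minimal sketch of how the class-aware restriction could sit on top of a standard two-pass SAM update (the single-batch structure and the derivation of `tail_mask` from class priors are our assumptions; see the official repository for the actual implementation):

```python
import torch
import torch.nn.functional as F

def imbsam_step(model, opt, x, y, tail_mask, rho=0.05):
    """One ImbSAM-style update: head classes get a plain gradient, while
    SAM's worst-case weight perturbation is applied only to the tail loss.
    Assumes every parameter receives a gradient and the batch contains
    both head and tail samples."""
    params = [p for p in model.parameters() if p.requires_grad]

    # (1) Tail loss at w gives the direction of the sharpness perturbation.
    F.cross_entropy(model(x[tail_mask]), y[tail_mask]).backward()
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    opt.zero_grad()

    # (2) Head loss, evaluated at the original weights w.
    F.cross_entropy(model(x[~tail_mask]), y[~tail_mask]).backward()

    # (3) Tail loss at the perturbed weights w + rho * g / ||g||,
    #     accumulated into the same .grad buffers.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(rho * g / norm)
    F.cross_entropy(model(x[tail_mask]), y[tail_mask]).backward()
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(rho * g / norm)     # restore the original weights

    opt.step()                         # descend on both gradient parts
    opt.zero_grad()
```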

Grasp Transfer based on Self-Aligning Implicit Representations of Local Surfaces

  • paper_url: http://arxiv.org/abs/2308.07807
  • repo_url: None
  • paper_authors: Ahmet Tekden, Marc Peter Deisenroth, Yasemin Bekiroglu
  • for: Addresses the problem of transferring a grasp experience or demonstration to a novel object that shares shape similarities with objects the robot has previously encountered.
  • methods: Learns an implicit local surface representation model from a small dataset of object meshes using a single expert grasp demonstration; at inference time, the model transfers grasps to novel objects by identifying the surfaces most geometrically similar to the demonstrated one.
  • results: Grasp transfer to unseen object categories is performed successfully in both simulation and real-world experiments, with better spatial precision and grasp accuracy than a baseline approach.
    Abstract Objects we interact with and manipulate often share similar parts, such as handles, that allow us to transfer our actions flexibly due to their shared functionality. This work addresses the problem of transferring a grasp experience or a demonstration to a novel object that shares shape similarities with objects the robot has previously encountered. Existing approaches for solving this problem are typically restricted to a specific object category or a parametric shape. Our approach, however, can transfer grasps associated with implicit models of local surfaces shared across object categories. Specifically, we employ a single expert grasp demonstration to learn an implicit local surface representation model from a small dataset of object meshes. At inference time, this model is used to transfer grasps to novel objects by identifying the most geometrically similar surfaces to the one on which the expert grasp is demonstrated. Our model is trained entirely in simulation and is evaluated on simulated and real-world objects that are not seen during training. Evaluations indicate that grasp transfer to unseen object categories using this approach can be successfully performed both in simulation and real-world experiments. The simulation results also show that the proposed approach leads to better spatial precision and grasp accuracy compared to a baseline approach.

Neuromorphic Seatbelt State Detection for In-Cabin Monitoring with Event Cameras

  • paper_url: http://arxiv.org/abs/2308.07802
  • repo_url: None
  • paper_authors: Paul Kielty, Cian Ryan, Mehdi Sefidgar Dilmaghani, Waseem Shariff, Joe Lemley, Peter Corcoran
  • for: Provides a proof of concept for extending event-camera-based driver monitoring to seatbelt state detection for in-cabin monitoring.
  • methods: Uses synthetic neuromorphic frames generated with an event simulator from a near-infrared (NIR) dataset, together with a detection algorithm based on a recurrent convolutional neural network; a smaller set of real event data is reserved for testing.
  • results: In a binary classification task, fastened/unfastened frames were identified with F1 scores of 0.989 and 0.944 on the simulated and real test sets, respectively. When the problem was extended to also classify the fastening/unfastening action, F1 scores of 0.964 and 0.846 were achieved.
    Abstract Neuromorphic vision sensors, or event cameras, differ from conventional cameras in that they do not capture images at a specified rate. Instead, they asynchronously log local brightness changes at each pixel. As a result, event cameras only record changes in a given scene, and do so with very high temporal resolution, high dynamic range, and low power requirements. Recent research has demonstrated how these characteristics make event cameras extremely practical sensors in driver monitoring systems (DMS), enabling the tracking of high-speed eye motion and blinks. This research provides a proof of concept to expand event-based DMS techniques to include seatbelt state detection. Using an event simulator, a dataset of 108,691 synthetic neuromorphic frames of car occupants was generated from a near-infrared (NIR) dataset, and split into training, validation, and test sets for a seatbelt state detection algorithm based on a recurrent convolutional neural network (CNN). In addition, a smaller set of real event data was collected and reserved for testing. In a binary classification task, the fastened/unfastened frames were identified with an F1 score of 0.989 and 0.944 on the simulated and real test sets respectively. When the problem extended to also classify the action of fastening/unfastening the seatbelt, respective F1 scores of 0.964 and 0.846 were achieved.

Handwritten Stenography Recognition and the LION Dataset

  • paper_url: http://arxiv.org/abs/2308.07799
  • repo_url: https://zenodo.org/record/8249818
  • paper_authors: Raphaela Heil, Malin Nauwerck
  • for: Establishes a baseline for handwritten stenography recognition using the novel LION dataset, which is made publicly available to encourage future research.
  • methods: Trains a state-of-the-art text recognition model and integrates stenographic domain knowledge via four encoding methods that transform the target sequence into representations approximating selected aspects of the writing system; results are further improved with a pre-training scheme based on synthetic data.
  • results: The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Combining stenography-specific target sequence encodings with pre-training and fine-tuning reduces test error rates substantially, to CERs of 24.5%-26% and WERs of 44.8%-48.2%.
    Abstract Purpose: In this paper, we establish a baseline for handwritten stenography recognition, using the novel LION dataset, and investigate the impact of including selected aspects of stenographic theory into the recognition process. We make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition. Methods: A state-of-the-art text recognition model is trained to establish a baseline. Stenographic domain knowledge is integrated by applying four different encoding methods that transform the target sequence into representations, which approximate selected aspects of the writing system. Results are further improved by integrating a pre-training scheme, based on synthetic data. Results: The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Test error rates are reduced significantly by combining stenography-specific target sequence encodings with pre-training and fine-tuning, yielding CERs in the range of 24.5% - 26% and WERs of 44.8% - 48.2%. Conclusion: The obtained results demonstrate the challenging nature of stenography recognition. Integrating stenography-specific knowledge, in conjunction with pre-training and fine-tuning on synthetic data, yields considerable improvements. Together with our precursor study on the subject, this is the first work to apply modern handwritten text recognition to stenography. The dataset and our code are publicly available via Zenodo.
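
For reference, the reported CER is the Levenshtein edit distance divided by the reference length (the same dynamic program over word lists yields WER):

```python
def cer(pred, target):
    """Character error rate: Levenshtein distance between the predicted
    and reference strings, divided by the reference length."""
    m, n = len(pred), len(target)
    dp = list(range(n + 1))                 # distances for the empty prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

assert abs(cer("stenograpy", "stenography") - 1 / 11) < 1e-9
```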

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

  • paper_url: http://arxiv.org/abs/2308.07787
  • repo_url: https://github.com/joannahong/diffv2s
  • paper_authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
  • for: Proposes a vision-guided speaker embedding extractor, built on a self-supervised pre-trained model and prompt tuning, so that no external audio information is needed at inference time.
  • methods: Conditions a diffusion-based video-to-speech synthesis model (DiffV2S) on the extracted vision-guided speaker embeddings and the visual representation of the input video.
  • results: DiffV2S preserves the phoneme details contained in the input video frames and produces highly intelligible mel-spectrograms in which the identities of multiple speakers are preserved, achieving state-of-the-art performance compared to previous video-to-speech synthesis techniques.
    Abstract Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speaking style guidance from a reference auditory information. Nevertheless, it is not always possible to obtain the audio information from the corresponding video input, especially during the inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique. In doing so, the rich speaker embedding information can be produced solely from input visual information, and the extra audio information is not necessary during the inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, so called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of the multiple speakers are all preserved. Our experimental results show that DiffV2S achieves the state-of-the-art performance compared to the previous video-to-speech synthesis technique.

Future Video Prediction from a Single Frame for Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.07783
  • repo_url: None
  • paper_authors: Mohammad Baradaran, Robert Bergevin
  • for: Introduces future video prediction from a single frame as a novel proxy-task for video anomaly detection (VAD), enabling the learning of long-term motion patterns.
  • methods: A semi-supervised anomaly detection method that uses future frame prediction as the proxy-task and replaces the initial and future raw frames with their semantic segmentation maps, making the method class-aware and the prediction task less complex for the model.
  • results: Experiments show the method effectively learns long-term motion patterns and outperforms state-of-the-art prediction-based VAD methods.
    Abstract Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge arises from the rarity of training samples needed to model all anomaly cases. Hence, semi-supervised anomaly detection methods have gotten more attention, since they focus on modeling normals and they detect anomalies by measuring the deviations from normal patterns. Despite impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not been effectively explored so far. Inspired by the abilities of the future frame prediction proxy-task, we introduce the task of future video prediction from a single frame, as a novel proxy-task for video anomaly detection. This proxy-task alleviates the challenges of previous methods in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation maps, which not only makes the method aware of object class but also makes the prediction task less complex for the model. Extensive experiments on the benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and the superiority of its performance compared to SOTA prediction-based VAD methods.
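
A sketch of how such a proxy-task is typically turned into an anomaly score at test time, assuming (as is standard for prediction-based VAD) that the score is the deviation of the predicted future from the observed one:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def anomaly_score(predictor, seg_t, seg_future):
    """Score a test clip by how badly the normal-trained model predicts
    the future from a single frame: large deviation = likely anomaly.
    predictor: maps the segmentation map at time t to the map at t+k.
    seg_t, seg_future: (B, C, H, W) semantic segmentation maps."""
    pred = predictor(seg_t)
    err = F.mse_loss(pred, seg_future, reduction="none")
    return err.mean(dim=(1, 2, 3))    # one score per sample; threshold later
```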

Learning Image Deraining Transformer Network with Dynamic Dual Self-Attention

  • paper_url: http://arxiv.org/abs/2308.07781
  • repo_url: None
  • paper_authors: Zhentao Fan, Hongming Chen, Yufeng Li
  • for: Proposes an effective single-image deraining Transformer that uses dynamic dual self-attention (DDSA) to better capture rain information in images.
  • methods: DDSA selects only the most useful similarity values via a top-k approximate calculation to achieve sparse attention, and is combined with a novel spatial-enhanced feed-forward network (SEFN) to obtain more accurate representations for high-quality deraining.
  • results: Extensive experiments on benchmark datasets demonstrate the effectiveness of the method in producing high-quality derained results.
    Abstract Recently, Transformer-based architectures have been introduced into the single image deraining task due to their advantage in modeling non-local information. However, existing approaches tend to integrate global features based on a dense self-attention strategy that uses all token similarities between the queries and keys. In fact, this strategy leads to ignoring the most relevant information and to a blurry effect induced by irrelevant representations during the feature aggregation. To this end, this paper proposes an effective image deraining Transformer with dynamic dual self-attention (DDSA), which combines both dense and sparse attention strategies to better facilitate clear image reconstruction. Specifically, we only select the most useful similarity values based on a top-k approximate calculation to achieve sparse attention. In addition, we also develop a novel spatial-enhanced feed-forward network (SEFN) to further obtain a more accurate representation for achieving high-quality derained results. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed method.
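
One common way to realize the top-k sparse selection described in the abstract (a simplified stand-in: the paper's full DDSA also keeps a dense branch, and the SEFN is omitted here):

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k=16):
    """Sparse attention in the spirit of DDSA's sparse branch: keep only
    the top-k similarities per query and mask out the rest before softmax.
    q, k, v: (B, N, D)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N)
    kth = scores.topk(top_k, dim=-1).values[..., -1:]       # k-th largest per row
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```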

An Interpretable Machine Learning Model with Deep Learning-based Imaging Biomarkers for Diagnosis of Alzheimer’s Disease

  • paper_url: http://arxiv.org/abs/2308.07778
  • repo_url: None
  • paper_authors: Wenjie Kang, Bo Li, Janne M. Papma, Lize C. Jiskoot, Peter Paul De Deyn, Geert Jan Biessels, Jurgen A. H. R. Claassen, Huub A. M. Middelkoop, Wiesje M. van der Flier, Inez H. G. B. Ramakers, Stefan Klein, Esther E. Bron
  • for: Proposes an interpretable machine learning framework for automatic early diagnosis of Alzheimer's disease (AD).
  • methods: Combines an Explainable Boosting Machine (EBM) with deep learning-based feature extraction.
  • results: Achieves an accuracy of 0.883 and an area under the curve (AUC) of 0.970 for AD vs. control classification on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and an accuracy of 0.778 and AUC of 0.887 for AD vs. subjective cognitive decline (SCD) classification on an external test set, outperforming both an EBM using volume biomarkers instead of deep learning-based features and an optimized end-to-end convolutional neural network (CNN).
    Abstract Machine learning methods have shown large potential for the automatic early diagnosis of Alzheimer's Disease (AD). However, some machine learning methods based on imaging data have poor interpretability because it is usually unclear how they make their decisions. Explainable Boosting Machines (EBMs) are interpretable machine learning models based on the statistical framework of generalized additive modeling, but have so far only been used for tabular data. Therefore, we propose a framework that combines the strength of EBM with high-dimensional imaging data using deep learning-based feature extraction. The proposed framework is interpretable because it provides the importance of each feature. We validated the proposed framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, achieving accuracy of 0.883 and area-under-the-curve (AUC) of 0.970 on AD and control classification. Furthermore, we validated the proposed framework on an external testing set, achieving accuracy of 0.778 and AUC of 0.887 on AD and subjective cognitive decline (SCD) classification. The proposed framework significantly outperformed an EBM model using volume biomarkers instead of deep learning-based features, as well as an end-to-end convolutional neural network (CNN) with optimized architecture.
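
The pipeline pairs a frozen deep feature extractor with an EBM; a minimal sketch using the `interpret` package, where `encoder` and `train_loader` are assumed to be defined elsewhere:

```python
import numpy as np
import torch
from interpret.glassbox import ExplainableBoostingClassifier  # pip install interpret

@torch.no_grad()
def extract_features(encoder, loader):
    """Pool a frozen deep encoder into one feature vector per scan."""
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x).flatten(1).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(encoder, train_loader)  # assumed defined
ebm = ExplainableBoostingClassifier()   # generalized additive model
ebm.fit(X_train, y_train)
# ebm.explain_global() then ranks each deep feature's contribution,
# which is what makes the overall pipeline interpretable.
```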

Dual-path TokenLearner for Remote Photoplethysmography-based Physiological Measurement with Facial Videos

  • paper_url: http://arxiv.org/abs/2308.07771
  • repo_url: None
  • paper_authors: Wei Qian, Dan Guo, Kun Li, Xilan Tian, Meng Wang
  • for: Proposes a Transformer-based framework for remote photoplethysmography (rPPG) measurement from facial videos that reduces the influence of noise sources such as illumination variations, occlusions, and head movements.
  • methods: Uses two TokenLearners executed in a dual-path mode: a Spatial TokenLearner (S-TL) that explores associations among different facial ROIs, and a Temporal TokenLearner (T-TL) that infers the quasi-periodic pattern of heartbeats to eliminate temporal disturbances such as head movements.
  • results: Extensive experiments on four physiological measurement benchmark datasets show that Dual-TL achieves state-of-the-art performance in both intra- and cross-dataset testing, demonstrating its potential as a basic backbone for rPPG measurement.
    Abstract Remote photoplethysmography (rPPG) based physiological measurement is an emerging yet crucial vision task, whose challenge lies in exploring accurate rPPG prediction from facial videos accompanied by noises of illumination variations, facial occlusions, head movements, etc., in a non-contact manner. Existing mainstream CNN-based models make efforts to detect physiological signals by capturing subtle color changes in facial regions of interest (ROI) caused by heartbeats. However, such models are constrained by the limited local spatial or temporal receptive fields in the neural units. Unlike them, a native Transformer-based framework called Dual-path TokenLearner (Dual-TL) is proposed in this paper, which utilizes the concept of learnable tokens to integrate both spatial and temporal informative contexts from the global perspective of the video. Specifically, the proposed Dual-TL uses a Spatial TokenLearner (S-TL) to explore associations in different facial ROIs, which promises the rPPG prediction far away from noisy ROI disturbances. Complementarily, a Temporal TokenLearner (T-TL) is designed to infer the quasi-periodic pattern of heartbeats, which eliminates temporal disturbances such as head movements. The two TokenLearners, S-TL and T-TL, are executed in a dual-path mode. This enables the model to reduce noise disturbances for final rPPG signal prediction. Extensive experiments on four physiological measurement benchmark datasets are conducted. The Dual-TL achieves state-of-the-art performances in both intra- and cross-dataset testings, demonstrating its immense potential as a basic backbone for rPPG measurement. The source code is available at https://github.com/VUT-HFUT/Dual-TL
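
The TokenLearner primitive underlying both branches can be sketched compactly; this simplified version pools generic tokens, whereas the paper's S-TL and T-TL operate on facial-ROI and temporal tokens respectively:

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Pool N input tokens into S learned tokens via per-token attention
    maps, the basic idea the S-TL/T-TL branches build on."""
    def __init__(self, dim, num_tokens=8):
        super().__init__()
        self.attn = nn.Linear(dim, num_tokens)     # one map per output token

    def forward(self, x):                          # x: (B, N, D)
        a = self.attn(x).softmax(dim=1)            # (B, N, S), normalized over N
        return torch.einsum("bns,bnd->bsd", a, x)  # (B, S, D) weighted pools
```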

Multi-scale Promoted Self-adjusting Correlation Learning for Facial Action Unit Detection

  • paper_url: http://arxiv.org/abs/2308.07770
  • repo_url: https://github.com/yuankaishen2001/Self-adjusting-AU
  • paper_authors: Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, Jingyu Yang
  • for: Proposes a novel self-adjusting AU-correlation learning (SACL) method to improve the accuracy and efficiency of facial action unit (AU) detection.
  • methods: Adaptively learns and updates AU correlation graphs by leveraging the characteristics of different levels of AU motion and emotion representation information extracted at different stages of the network, and designs a simple yet effective multi-scale feature learning (MSFL) method to promote better performance.
  • results: Outperforms state-of-the-art methods on widely used AU detection benchmark datasets while using only 28.7% and 12.0% of the parameters and FLOPs of the best competing method, respectively.
    Abstract Facial Action Unit (AU) detection is a crucial task in affective computing and social robotics as it helps to identify emotions expressed through facial expressions. Anatomically, there are innumerable correlations between AUs, which contain rich information and are vital for AU detection. Previous methods used fixed AU correlations based on expert experience or statistical rules on specific benchmarks, but it is challenging to comprehensively reflect complex correlations between AUs via hand-crafted settings. There are alternative methods that employ a fully connected graph to learn these dependencies exhaustively. However, these approaches can result in a computational explosion and high dependency with a large dataset. To address these challenges, this paper proposes a novel self-adjusting AU-correlation learning (SACL) method with less computation for AU detection. This method adaptively learns and updates AU correlation graphs by efficiently leveraging the characteristics of different levels of AU motion and emotion representation information extracted in different stages of the network. Moreover, this paper explores the role of multi-scale learning in correlation information extraction, and design a simple yet effective multi-scale feature learning (MSFL) method to promote better performance in AU detection. By integrating AU correlation information with multi-scale features, the proposed method obtains a more robust feature representation for the final AU detection. Extensive experiments show that the proposed method outperforms the state-of-the-art methods on widely used AU detection benchmark datasets, with only 28.7\% and 12.0\% of the parameters and FLOPs of the best method, respectively. The code for this method is available at \url{https://github.com/linuxsino/Self-adjusting-AU}.
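
A sketch of the self-adjusting idea: build the AU correlation graph from current feature similarities rather than fixed expert rules, so the graph updates as training proceeds (the thresholding and normalization choices here are ours):

```python
import torch
import torch.nn.functional as F

def au_correlation_graph(au_feats, keep_ratio=0.5):
    """Adjacency over K action units from cosine similarity of their
    features, pruned to the strongest edges per AU.
    au_feats: (B, K, D) per-AU feature vectors."""
    f = F.normalize(au_feats, dim=-1)
    sim = f @ f.transpose(-2, -1)                    # (B, K, K) cosine
    k = max(1, int(keep_ratio * sim.shape[-1]))
    kth = sim.topk(k, dim=-1).values[..., -1:]       # k-th largest per row
    adj = sim.masked_fill(sim < kth, 0.0)            # sparse adjacency
    return adj / adj.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # row-normalize
```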

Whale Detection Enhancement through Synthetic Satellite Images

  • paper_url: http://arxiv.org/abs/2308.07766
  • repo_url: https://github.com/prgumd/seadronesim2
  • paper_authors: Akshaj Gaur, Cheng Liu, Xiaomin Lin, Nare Karapetyan, Yiannis Aloimonos
  • for: Develops SeaDroneSim2, a benchmark suite that generates aerial and satellite synthetic image datasets to improve whale detection and reduce the effort required for training data collection.
  • methods: Uses a simulation platform to synthetically generate image datasets for training machine learning detection algorithms.
  • results: Augmenting a 10% subset of real data with the synthetic dataset yields a 15% boost in whale detection performance compared to training on the real data alone; both the simulation platform and the generated dataset are open-sourced.
    Abstract With a number of marine populations in rapid decline, collecting and analyzing data about marine populations has become increasingly important to develop effective conservation policies for a wide range of marine animals, including whales. Modern computer vision algorithms allow us to detect whales in images in a wide range of domains, further speeding up and enhancing the monitoring process. However, these algorithms heavily rely on large training datasets, which are challenging and time-consuming to collect particularly in marine or aquatic environments. Recent advances in AI however have made it possible to synthetically create datasets for training machine learning algorithms, thus enabling new solutions that were not possible before. In this work, we present a solution - SeaDroneSim2 benchmark suite, which addresses this challenge by generating aerial, and satellite synthetic image datasets to improve the detection of whales and reduce the effort required for training data collection. We show that we can achieve a 15% performance boost on whale detection compared to using the real data alone for training, by augmenting a 10% real data. We open source both the code of the simulation platform SeaDroneSim2 and the dataset generated through it.

CASPNet++: Joint Multi-Agent Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.07751
  • repo_url: None
  • paper_authors: Maximilian Schäfer, Kun Zhao, Anton Kummert
  • for: Supports advanced driver-assistance systems and autonomous driving by predicting the future motion of road users.
  • methods: CASPNet++, an improved version of the Context-Aware Scene Prediction Network (CASPNet), enhances interaction modeling and scene understanding to support joint prediction of all road users in a scene, using spatiotemporal grids to model future occupancy; an instance-based output head provides multi-modal trajectories for agents of interest.
  • results: Extensive quantitative and qualitative analysis demonstrates the scalability of CASPNet++ in utilizing and fusing diverse environmental input sources such as HD maps, radar detections, and lidar segmentation. CASPNet++ reaches state-of-the-art performance on the urban-focused nuScenes prediction dataset, and the model has been deployed in a test vehicle, running in real time with moderate computational resources.
    Abstract The prediction of road users' future motion is a critical task in supporting advanced driver-assistance systems (ADAS). It plays an even more crucial role for autonomous driving (AD) in enabling the planning and execution of safe driving maneuvers. Based on our previous work, Context-Aware Scene Prediction Network (CASPNet), an improved system, CASPNet++, is proposed. In this work, we focus on further enhancing the interaction modeling and scene understanding to support the joint prediction of all road users in a scene using spatiotemporal grids to model future occupancy. Moreover, an instance-based output head is introduced to provide multi-modal trajectories for agents of interest. In extensive quantitative and qualitative analysis, we demonstrate the scalability of CASPNet++ in utilizing and fusing diverse environmental input sources such as HD maps, Radar detection, and Lidar segmentation. Tested on the urban-focused prediction dataset nuScenes, CASPNet++ reaches state-of-the-art performance. The model has been deployed in a testing vehicle, running in real-time with moderate computational resources.

ChartDETR: A Multi-shape Detection Network for Visual Chart Recognition

  • paper_url: http://arxiv.org/abs/2308.07743
  • repo_url: None
  • paper_authors: Wenyuan Xue, Dapeng Chen, Baosheng Yu, Yifei Chen, Sai Zhou, Wei Peng
  • for: Proposes a transformer-based multi-shape detector that automatically identifies table headers and data elements from chart images.
  • methods: Localizes keypoints at the corners of regular shapes to reconstruct multiple data elements in a single chart image; query groups in set prediction allow all data element shapes to be predicted at once, eliminating the grouping errors and post-processing of prior keypoint-detection approaches.
  • results: Achieves competitive results across all chart types on three datasets without additional enhancements, including an F1 score of 0.98 on Adobe Synthetic (versus 0.71 for the previous best model) and a new state-of-the-art result of 0.97 on ExcelChart400k.
    Abstract Visual chart recognition systems are gaining increasing attention due to the growing demand for automatically identifying table headers and values from chart images. Current methods rely on keypoint detection to estimate data element shapes in charts but suffer from grouping errors in post-processing. To address this issue, we propose ChartDETR, a transformer-based multi-shape detector that localizes keypoints at the corners of regular shapes to reconstruct multiple data elements in a single chart image. Our method predicts all data element shapes at once by introducing query groups in set prediction, eliminating the need for further postprocessing. This property allows ChartDETR to serve as a unified framework capable of representing various chart types without altering the network architecture, effectively detecting data elements of diverse shapes. We evaluated ChartDETR on three datasets, achieving competitive results across all chart types without any additional enhancements. For example, ChartDETR achieved an F1 score of 0.98 on Adobe Synthetic, significantly outperforming the previous best model with a 0.71 F1 score. Additionally, we obtained a new state-of-the-art result of 0.97 on ExcelChart400k. The code will be made publicly available.

Identity-Consistent Aggregation for Video Object Detection

  • paper_url: http://arxiv.org/abs/2308.07737
  • repo_url: https://github.com/bladewaltz1/clipvid
  • paper_authors: Chaorui Deng, Da Chen, Qi Wu
  • for: In video object detection (VID), existing methods aggregate temporal contexts from different objects indiscriminately and ignore their identities; this work instead lets the model focus on the identity-consistent temporal contexts of each object, yielding more comprehensive object representations and handling rapid appearance variations such as occlusion and motion blur.
  • methods: Proposes ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers that mine fine-grained, identity-consistent temporal contexts; a set-prediction strategy removes redundant region proposals, making the ICA layers efficient and enabling parallel clip-wise predictions for the whole video clip.
  • results: Achieves state-of-the-art performance (84.7% mAP) on the ImageNet VID dataset while running about 7x faster (39.3 fps) than previous SOTAs.
    Abstract In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. While intuitively, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle the rapid object appearance variations such as occlusion, motion blur, etc. However, realizing this goal on top of existing VID models faces low-efficiency problems due to their redundant region proposals and nonparallel frame-wise prediction manner. To aid this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained and identity-consistent temporal contexts. It effectively reduces the redundancies through the set prediction strategy, making the ICA layers very efficient and further allowing us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.

Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression

  • paper_url: http://arxiv.org/abs/2308.07733
  • repo_url: https://github.com/llvy21/duic
  • paper_authors: Yue Lv, Jinxi Xiang, Jun Zhang, Wenming Yang, Xiao Han, Wei Yang
  • for: Addresses the domain gap between training data (natural images) and inference data (e.g., artistic images) in neural image compression, improving rate-distortion performance on out-of-domain images.
  • methods: Performs low-rank matrix decomposition to update certain adaptation parameters of the client's decoder; the low-rank constraint keeps the bit rate overhead small, and a dynamic gating network, optimized end-to-end with a rate-distortion loss, decides which decoder layers should employ adaptation.
  • results: Significantly mitigates the domain gap, outperforming non-adaptive methods with an average BD-rate improvement of approximately 19% on out-of-domain images and the most advanced instance-adaptive methods by roughly 5% BD-rate; ablation studies confirm the method's universality across various image compression architectures.
    Abstract The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is \emph{non-trivial}, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately $19\%$ across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly $5\%$ BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures.
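
A minimal sketch of the low-rank decomposition applied to one decoder layer; the paper additionally encodes the adaptation parameters into the bitstream and gates which layers are adapted, both omitted here, and the names are illustrative:

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank update W + A @ B for one decoder layer. Only A and B are
    instance-optimized and transmitted, so the rank-r constraint keeps
    the per-image rate overhead small."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # shared decoder stays frozen
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x):
        # A starts at zero, so the adapter is a no-op before optimization.
        return self.base(x) + x @ (self.A @ self.B).t()
```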

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation

  • paper_url: http://arxiv.org/abs/2308.07732
  • repo_url: https://github.com/haiyang-w/unitr
  • paper_authors: Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang
  • for: Develops an efficient multi-modal backbone for outdoor 3D perception that handles a variety of modalities with unified modeling and shared parameters.
  • methods: Uses a modality-agnostic transformer encoder to handle view-discrepant sensor data, enabling parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps, together with a novel multi-modal integration strategy that considers both the semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations.
  • results: Sets a new state-of-the-art on the nuScenes benchmark, with +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation, at lower inference latency.
    Abstract Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR .

Context-Aware Pseudo-Label Refinement for Source-Free Domain Adaptive Fundus Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.07731
  • repo_url: https://github.com/xmed-lab/cpr
  • paper_authors: Zheang Huai, Xinpeng Ding, Yi Li, Xiaomeng Li
  • for: Addresses source-free unsupervised domain adaptation (SF-UDA), where source data is unavailable on the target side due to privacy or intellectual property concerns and the source model must be adapted using only unlabeled target data.
  • methods: Proposes context-aware pseudo-label refinement: a context-similarity learning module learns context relations, pseudo-labels are revised using those relations, the revised pseudo-labels are calibrated to compensate for wrong revisions caused by inaccurate context relations, and pixel-level and class-level denoising selects reliable pseudo-labels for adaptation.
  • results: Experiments on cross-domain fundus images show the approach yields state-of-the-art results.
    Abstract In the domain adaptation problem, source data may be unavailable to the target client side due to privacy or intellectual property issues. Source-free unsupervised domain adaptation (SF-UDA) aims at adapting a model trained on the source side to align the target distribution with only the source model and unlabeled target data. The source model usually produces noisy and context-inconsistent pseudo-labels on the target domain, i.e., neighbouring regions that have a similar visual appearance are annotated with different pseudo-labels. This observation motivates us to refine pseudo-labels with context relations. Another observation is that features of the same class tend to form a cluster despite the domain gap, which implies context relations can be readily calculated from feature distances. To this end, we propose a context-aware pseudo-label refinement method for SF-UDA. Specifically, a context-similarity learning module is developed to learn context relations. Next, pseudo-label revision is designed utilizing the learned context relations. Further, we propose calibrating the revised pseudo-labels to compensate for wrong revision caused by inaccurate context relations. Additionally, we adopt a pixel-level and class-level denoising scheme to select reliable pseudo-labels for domain adaptation. Experiments on cross-domain fundus images indicate that our approach yields the state-of-the-art results. Code is available at https://github.com/xmed-lab/CPR.
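
A sketch of the core refinement step, exploiting the paper's observation that same-class features cluster despite the domain gap; the k-NN weighting is our simplification of the context-similarity module, and the full N x N similarity would in practice be computed on downsampled features:

```python
import torch
import torch.nn.functional as F

def refine_pseudo_labels(feats, pseudo, k=8):
    """Re-estimate each pixel's pseudo-label from its k most similar
    pixels in feature space.
    feats: (N, D) pixel features; pseudo: (N, C) soft pseudo-labels."""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t()                              # (N, N) context relations
    vals, idx = sim.topk(k + 1, dim=-1)          # top-(k+1) includes self
    w = F.softmax(vals[:, 1:], dim=-1)           # weights over neighbors
    neigh = pseudo[idx[:, 1:]]                   # (N, k, C) neighbor labels
    return (w.unsqueeze(-1) * neigh).sum(dim=1)  # revised soft labels
```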

Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability

  • paper_url: http://arxiv.org/abs/2308.07728
  • repo_url: None
  • paper_authors: Seokhyeon Ha, Sunbeom Jung, Jungwoo Lee
  • for: Aims to improve model performance during fine-tuning, particularly on new target domains, by mitigating the distortion of pre-trained feature extractors.
  • methods: Proposes Domain-Aware Fine-Tuning (DAFT), which combines a batch normalization conversion method that reduces modifications to the network during fine-tuning with the integration of linear probing and fine-tuning, optimizing the head layer while gradually adapting the feature extractor.
  • results: Significantly mitigates feature distortion and outperforms baseline methods on both in-distribution and out-of-distribution datasets.
    Abstract Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.
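
The probing-then-fine-tuning integration can be sketched as a two-stage schedule (the `model.backbone`/`model.head` split, the `train_fn` loop, and the learning rates are our assumptions; the batch normalization conversion is orthogonal and omitted):

```python
import torch

def daft_schedule(model, train_fn, probe_epochs=5, ft_epochs=20):
    """Stage 1: linear probing, fitting the head on frozen features.
    Stage 2: fine-tuning, gradually adapting the feature extractor."""
    for p in model.backbone.parameters():
        p.requires_grad_(False)                 # freeze the feature extractor
    head_opt = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
    train_fn(model, head_opt, epochs=probe_epochs)

    for p in model.backbone.parameters():
        p.requires_grad_(True)                  # unfreeze everything
    ft_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # smaller lr
    train_fn(model, ft_opt, epochs=ft_epochs)
```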

Real-time Automatic M-mode Echocardiography Measurement with Panel Attention from Local-to-Global Pixels

  • paper_url: http://arxiv.org/abs/2308.07717
  • repo_url: https://github.com/hanktseng131415go/ramem
  • paper_authors: Ching-Hsun Tseng, Shao-Ju Chien, Po-Shen Wang, Shin-Jye Lee, Wei-Huan Hu, Bin Pu, Xiao-jun Zeng
  • for: Proposes RAMEM, a real-time automatic M-mode echocardiography measurement scheme, addressing three obstacles: the lack of an open dataset for building an automatic pipeline, the time-consuming manual labelling of M-mode echocardiograms, and the limited receptive field of existing convolutional backbones (e.g., ResNet), which cannot efficiently cover the period of a valve movement when objects occupy a large portion of the image.
  • methods: Contributes MEIS, a dataset of M-mode echocardiograms for instance segmentation; panel attention, a local-to-global efficient attention based on pixel-unshuffling, embedded with updated UPANets V2 in a real-time instance segmentation (RIS) scheme for big-object detection with a global receptive field; and AMEM, an efficient algorithm for fast and accurate automatic labelling during diagnosis.
  • results: RAMEM surpasses existing RIS backbones (with non-local attention) on PASCAL 2012 SBD and exceeds human performance in real-time MEIS testing.
    Abstract Motion mode (M-mode) recording is an essential part of echocardiography to measure cardiac dimension and function. However, the current diagnosis cannot build an automatic scheme, as there are three fundamental obstacles: firstly, there is no open dataset available to build the automation for ensuring consistent results and bridging M-mode echocardiography with real-time instance segmentation (RIS); secondly, the examination involves time-consuming manual labelling of M-mode echocardiograms; thirdly, as objects in echocardiograms occupy a significant portion of pixels, the limited receptive fields of existing backbones (e.g., ResNet) composed of multiple convolution layers are inefficient at covering the period of a valve movement. Existing non-local attention (NL) either cannot run in real time due to its high computation overhead or loses information through simplified versions of the non-local block. Therefore, we propose RAMEM, a real-time automatic M-mode echocardiography measurement scheme, which contributes three aspects to address these problems: 1) MEIS, a dataset of M-mode echocardiograms for instance segmentation, to enable consistent results and support the development of an automatic scheme; 2) panel attention, a local-to-global efficient attention by pixel-unshuffling, embedded with updated UPANets V2 in a RIS scheme toward big-object detection with a global receptive field; 3) AMEM, an efficient algorithm of automatic M-mode echocardiography measurement enabling fast and accurate automatic labelling during diagnosis. The experimental results show that RAMEM surpasses existing RIS backbones (with non-local attention) on PASCAL 2012 SBD and human performances in real-time MEIS testing. The code and the MEIS dataset are available at https://github.com/hanktseng131415go/RAME.
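
The pixel-unshuffling trick behind panel attention can be sketched as follows, a simplified stand-in for the paper's block: folding spatial detail into channels lets full self-attention run on a grid r^2 times smaller, giving a global receptive field at manageable cost:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionViaUnshuffle(nn.Module):
    """Self-attention over an unshuffled grid: (B, C, H, W) is folded to
    (B, C*r*r, H/r, W/r), attended globally, then unfolded back."""
    def __init__(self, channels, r=4, heads=4):
        super().__init__()
        self.r = r
        self.attn = nn.MultiheadAttention(channels * r * r, heads,
                                          batch_first=True)

    def forward(self, x):                         # H, W divisible by r
        z = F.pixel_unshuffle(x, self.r)          # (B, C*r*r, H/r, W/r)
        tokens = z.flatten(2).transpose(1, 2)     # (B, HW/r^2, C*r*r)
        out, _ = self.attn(tokens, tokens, tokens)
        z = out.transpose(1, 2).reshape_as(z)
        return F.pixel_shuffle(z, self.r)         # back to (B, C, H, W)
```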

Enhancing Network Initialization for Medical AI Models Using Large-Scale, Unlabeled Natural Images

  • paper_url: http://arxiv.org/abs/2308.07688
  • repo_url: None
  • paper_authors: Soroosh Tayebi Arasteh, Leo Misera, Jakob Nikolas Kather, Daniel Truhn, Sven Nebelung
  • for: Investigates whether self-supervised learning (SSL) pre-training on non-medical images can be applied to chest radiographs, and how it compares to supervised pre-training on non-medical images and on medical images.
  • methods: Uses a vision transformer whose weights are initialized from (i) SSL pre-training on natural images (DINOv2), (ii) supervised pre-training on natural images (ImageNet), or (iii) supervised pre-training on chest radiographs from the MIMIC-CXR database.
  • results: Tested on over 800,000 chest radiographs from six large global datasets covering more than 20 different imaging findings, SSL pre-training on curated natural images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but, in certain cases, also exceeded supervised pre-training on MIMIC-CXR.
    Abstract Pre-training datasets, like ImageNet, have become the gold standard in medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored if SSL for pre-training on non-medical images can be applied to chest radiographs and how it compares to supervised pre-training on non-medical images and on medical images. We utilized a vision transformer and initialized its weights based on (i) SSL pre-training on natural images (DINOv2), (ii) SL pre-training on natural images (ImageNet dataset), and (iii) SL pre-training on chest radiographs from the MIMIC-CXR database. We tested our approach on over 800,000 chest radiographs from six large global datasets, diagnosing more than 20 different imaging findings. Our SSL pre-training on curated images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pre-training strategy, especially with SSL, can be pivotal for improving artificial intelligence (AI)'s diagnostic accuracy in medical imaging. By demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging.
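As a concrete illustration of the initializations being compared, the sketch below loads a ViT backbone with DINOv2 (SSL) or ImageNet (SL) weights and attaches a multi-label head for radiograph findings; the hub entrypoint and torchvision weights are real, but the head, feature dimensions, and omitted training loop are simplifications, not the paper's code.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

NUM_FINDINGS = 20  # the paper diagnoses more than 20 findings; exact set omitted

# (i) SSL initialization: DINOv2 ViT-B/14 pretrained on curated natural images
ssl_backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")

# (ii) SL initialization: ViT-B/16 pretrained with labels on ImageNet
sl_backbone = tvm.vit_b_16(weights=tvm.ViT_B_16_Weights.IMAGENET1K_V1)
sl_backbone.heads = nn.Identity()   # expose 768-d features instead of logits

class FindingsModel(nn.Module):
    """Backbone (SSL- or SL-initialized) plus a multi-label findings head."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 768):
        super().__init__()
        self.backbone, self.head = backbone, nn.Linear(feat_dim, NUM_FINDINGS)

    def forward(self, x):
        return self.head(self.backbone(x))   # logits for BCEWithLogitsLoss

model = FindingsModel(ssl_backbone)
logits = model(torch.randn(2, 3, 224, 224))  # 224 is divisible by patch size 14
```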

A Review of Adversarial Attacks in Computer Vision

  • paper_url: http://arxiv.org/abs/2308.07673
  • repo_url: None
  • paper_authors: Yutong Zhang, Yao Li, Yin Li, Zhichang Guo
  • for: This review addresses the threat that adversarial examples pose to deep neural networks, especially in safety-critical scenarios such as autonomous driving.
  • methods: It focuses on the black-box setting, in which the attacker can only observe the model's inputs and outputs, without access to its parameters or gradients.
  • results: Black-box attacks are shown to transfer across different deep learning and machine learning models and to be achievable in real-world conditions.
    Abstract Deep neural networks have been widely used in various downstream tasks, especially in safety-critical scenarios such as autonomous driving, but deep networks are often threatened by adversarial samples. Such adversarial attacks can be invisible to the human eye, yet they lead to DNN misclassification, often exhibit transferability between deep learning and machine learning models, and are achievable in the real world. Adversarial attacks can be divided into white-box attacks, for which the attacker knows the parameters and gradients of the model, and black-box attacks, for which the attacker can only obtain the input and output of the model. In terms of the attacker's purpose, attacks can be divided into targeted attacks, where the attacker wants the model to misclassify the original sample into a specified class, and non-targeted attacks, which only need to make the model misclassify the sample. The black-box setting is the scenario we encounter in practice.
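To make the white-box/black-box and transfer vocabulary concrete, here is a minimal FGSM sketch: gradients come from a surrogate model, and the resulting example is then fed to a black-box victim whose gradients are never queried. The models are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(surrogate, x, y, eps=8 / 255):
    """One-step non-targeted attack: move the input along the gradient sign."""
    x = x.detach().clone().requires_grad_(True)
    F.cross_entropy(surrogate(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

# Transfer (black-box) evaluation: the victim's gradients are never used.
# x_adv = fgsm(surrogate_model, images, labels)
# fooled = (victim_model(x_adv).argmax(1) != labels).float().mean()
```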

Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training

  • paper_url: http://arxiv.org/abs/2308.07665
  • repo_url: https://github.com/ximinng/inversion-by-inversion
  • paper_authors: Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu
  • for: Translating sketches into photo-realistic images (exemplar-based sketch-to-photo synthesis).
  • methods: A training-free two-stage "Inversion-by-Inversion" procedure: a shape-enhancing inversion guided by a shape-energy function, followed by a full-control inversion in which an appearance-energy function controls the color and texture of the final photo.
  • results: Experiments show the method generates high-quality photo-realistic images whose color and texture can be controlled by different exemplar images.
    Abstract Exemplar-based sketch-to-photo synthesis allows users to generate photo-realistic images based on sketches. Recently, diffusion-based methods have achieved impressive performance on image generation tasks, enabling highly-flexible control through text-driven generation or energy functions. However, generating photo-realistic images with color and texture from sketch images remains challenging for diffusion models. Sketches typically consist of only a few strokes, with most regions left blank, making it difficult for diffusion-based methods to produce photo-realistic images. In this work, we propose a two-stage method named "Inversion-by-Inversion" for exemplar-based sketch-to-photo synthesis. This approach includes shape-enhancing inversion and full-control inversion. During the shape-enhancing inversion process, an uncolored photo is generated with the guidance of a shape-energy function. This step is essential to ensure control over the shape of the generated photo. In the full-control inversion process, we propose an appearance-energy function to control the color and texture of the final generated photo. Importantly, our Inversion-by-Inversion pipeline is training-free and can accept different types of exemplars for color and texture control. We conducted extensive experiments to evaluate our proposed method, and the results demonstrate its effectiveness.
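The core mechanism is energy-guided reverse diffusion: at each step, gradients of a shape energy and an appearance energy steer the sample. The sketch below is a generic illustration of that guidance under hypothetical score and energy functions, not the paper's exact SDE solver.

```python
import torch

def energy_guided_step(x_t, t, score_model, shape_energy, appear_energy,
                       step=0.01, lam_shape=1.0, lam_app=1.0):
    x_t = x_t.detach().requires_grad_(True)
    e = lam_shape * shape_energy(x_t) + lam_app * appear_energy(x_t)
    guidance = torch.autograd.grad(e.sum(), x_t)[0]       # steer toward low energy
    with torch.no_grad():
        drift = score_model(x_t, t)                       # learned reverse-SDE score
        noise = torch.randn_like(x_t)
        return x_t + step * (drift - guidance) + (2 * step) ** 0.5 * noise
```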

Gradient-Based Post-Training Quantization: Challenging the Status Quo

  • paper_url: http://arxiv.org/abs/2308.07662
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: The paper focuses on gradient-based post-training quantization (GPTQ) methods for efficient deployment of deep neural networks.
  • methods: It challenges common choices in GPTQ methods and derives best practices for designing more efficient and scalable GPTQ methods, covering both the problem formulation (loss, degrees of freedom, non-uniform quantization schemes) and the optimization process (choice of variables and optimizer).
  • results: A novel importance-based mixed-precision technique yields significant performance improvements on all tested state-of-the-art GPTQ methods and networks, achieving +6.819 points on ViT for 4-bit quantization.
    Abstract Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating point operations are converted to simpler fixed point operations. In its most naive form, it simply consists of a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, gradient-based post-training quantization (GPTQ) methods have appeared to constitute a suitable trade-off between such simple methods and more powerful, yet expensive Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists of learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. Those guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
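For readers new to the topic, the sketch below shows the two operations the abstract contrasts: naive uniform quantization (scale + round) and the GPTQ-style alternative where the round-up/round-down decision is a trainable variable fitted on a small calibration set. Shapes and the soft-rounding parameterization are illustrative, not any specific paper's implementation.

```python
import torch

def uniform_quant(w, n_bits=4):
    """Naive post-training quantization: scale, round, clamp, rescale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

class LearnedRounding(torch.nn.Module):
    """GPTQ-style idea: learn whether each weight rounds down or up."""
    def __init__(self, w, n_bits=4):
        super().__init__()
        self.qmax = 2 ** (n_bits - 1) - 1
        self.scale = w.abs().max() / self.qmax
        self.floor = torch.floor(w / self.scale)
        self.v = torch.nn.Parameter(torch.zeros_like(w))  # rounding decision

    def forward(self):
        up = torch.sigmoid(self.v)                        # soft choice in [0, 1]
        q = (self.floor + up).clamp(-self.qmax - 1, self.qmax)
        return q * self.scale                             # dequantized weights

w_q = LearnedRounding(torch.randn(64, 64))()  # optimize v on a calibration loss
```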

Geometry of the Visual Cortex with Applications to Image Inpainting and Enhancement

  • paper_url: http://arxiv.org/abs/2308.07652
  • repo_url: https://github.com/ballerin/v1diffusion
  • paper_authors: Francesco Ballerin, Erlend Grong
  • for: The paper proposes image inpainting and enhancement algorithms based on hypoelliptic diffusion on the rototranslation group $SE(2)$, equipped with a sub-Riemannian structure inspired by the visual cortex V1.
  • methods: It introduces the WaxOn-WaxOff procedure, which prevents fading and produces sharper inpainting, and exploits the sub-Riemannian structure to define a new $SE(2)$ unsharp filter for image enhancement.
  • results: The method is demonstrated on blood-vessel enhancement in retinal scans, yielding sharper results.
    Abstract Equipping the rototranslation group $SE(2)$ with a sub-Riemannian structure inspired by the visual cortex V1, we propose algorithms for image inpainting and enhancement based on hypoelliptic diffusion. We innovate on previous implementations of the methods by Citti, Sarti and Boscain et al., by proposing an alternative that prevents fading and is capable of producing sharper results in a procedure that we call WaxOn-WaxOff. We also exploit the sub-Riemannian structure to define a completely new unsharp filter using $SE(2)$, analogous to the classical unsharp filter for 2D image processing, with applications to image enhancement. We demonstrate our method on blood vessel enhancement in retinal scans.
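As a point of reference for the $SE(2)$ construction, here is the classical 2D unsharp filter the paper generalizes: sharpening adds back the difference between an image and its blur. In the paper the Gaussian blur is replaced by hypoelliptic diffusion on the orientation-lifted image, which this sketch does not reproduce.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp(image: np.ndarray, sigma: float = 2.0, amount: float = 1.0):
    blurred = gaussian_filter(image, sigma=sigma)
    return image + amount * (image - blurred)   # boost what blurring removed

sharp = unsharp(np.random.rand(128, 128))
```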

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

  • paper_url: http://arxiv.org/abs/2308.07648
  • repo_url: https://github.com/bladewaltz1/promptswitch
  • paper_authors: Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu
  • for: The paper studies text-video retrieval, specifically how to adapt pre-trained text-image foundation models such as CLIP to the video domain efficiently.
  • methods: It inserts a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switches it within the encoder layers to efficiently incorporate global video semantics into frame representations; an auxiliary video-captioning objective further supplies fine-grained semantic guidance.
  • results: With only a simple temporal fusion strategy (mean-pooling), the method achieves state-of-the-art performance on three standard benchmark datasets (MSR-VTT, MSVD, LSMDC).
    Abstract In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
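The efficiency claim can be summarized in a few lines: once frame features are semantically enhanced offline, a simple mean-pooling yields one reusable vector per video, and every text query reduces to a dot product against the cached bank. Encoders and dimensions below are hypothetical.

```python
import torch
import torch.nn.functional as F

def video_embedding(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) from the (prompt-enhanced) image encoder
    return F.normalize(frame_feats.mean(dim=0), dim=-1)   # computed offline, cached

def retrieve(text_feat: torch.Tensor, video_bank: torch.Tensor) -> torch.Tensor:
    # video_bank: (num_videos, dim) of precomputed embeddings, reused per query
    return (video_bank @ F.normalize(text_feat, dim=-1)).argsort(descending=True)

bank = torch.stack([video_embedding(torch.randn(12, 512)) for _ in range(100)])
ranking = retrieve(torch.randn(512), bank)
```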

Backpropagation Path Search On Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2308.07625
  • repo_url: None
  • paper_authors: Zhuoer Xu, Zhangxuan Gu, Jianping Zhang, Shiwen Cui, Changhua Meng, Weiqiang Wang
  • for: Deep neural networks are vulnerable to adversarial examples, so model robustness must be tested before deployment; transfer-based attackers craft adversarial examples on surrogate models and transfer them to victim models deployed as black boxes. Existing structure-based attackers neither explore the convolution modules in CNNs nor search the backpropagation graph systematically, relying on heuristics instead.
  • methods: The paper proposes SkipConv, which adjusts the backpropagation path of convolution via structural reparameterization, constructs a DAG-based search space over backpropagation paths, evaluates paths with a one-step approximation, and searches for the optimal path with Bayesian optimization.
  • results: Across a wide range of transfer settings, the proposed backPropagation pAth Search (PAS) improves the attack success rate by a large margin against both normally trained and defended models.
    Abstract Deep neural networks are vulnerable to adversarial examples, dictating the imperativeness to test the model's robustness before deployment. Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models deployed in the black-box situation. To enhance the adversarial transferability, structure-based attackers adjust the backpropagation path to avoid the attack from overfitting the surrogate model. However, existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph heuristically, leading to limited effectiveness. In this paper, we propose backPropagation pAth Search (PAS), solving the aforementioned two problems. We first propose SkipConv to adjust the backpropagation path of convolution by structural reparameterization. To overcome the drawback of heuristically designed backpropagation paths, we further construct a DAG-based search space, utilize one-step approximation for path evaluation and employ Bayesian Optimization to search for the optimal path. We conduct comprehensive experiments in a wide range of transfer settings, showing that PAS improves the attack success rate by a huge margin for both normally trained and defense models.
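The sketch below illustrates the structural-reparameterization idea in my own simplified form (not the paper's SkipConv): a convolution's output is decomposed into a skip-like part plus a residual part, the forward value is unchanged, and gradients through the residual part are rescaled by gamma, which biases the backpropagation path when crafting transferable adversarial examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipConvSketch(nn.Module):
    def __init__(self, conv: nn.Conv2d, gamma: float = 0.5):
        super().__init__()
        self.conv, self.gamma = conv, gamma
        k = torch.zeros_like(conv.weight)        # identity-passing kernel
        c = conv.kernel_size[0] // 2
        for i in range(min(conv.in_channels, conv.out_channels)):
            k[i, i, c, c] = 1.0
        self.register_buffer("id_kernel", k)

    def forward(self, x):
        skip = F.conv2d(x, self.id_kernel, stride=self.conv.stride,
                        padding=self.conv.padding)
        resid = self.conv(x) - skip              # forward value: skip + resid = conv(x)
        resid = self.gamma * resid + (1 - self.gamma) * resid.detach()
        return skip + resid                      # gradient through resid scaled by gamma
```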

Self-Prompting Large Vision Models for Few-Shot Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.07624
  • repo_url: https://github.com/peteryyzhang/few-shot-self-prompt-sam
  • paper_authors: Qi Wu, Yuyao Zhang, Marawan Elbatel
  • for: The paper applies the large foundation model Segment Anything Model (SAM) to few-shot medical image segmentation.
  • methods: It proposes a novel self-prompting approach that uses SAM's own embedding space to prompt itself, through a simple yet effective linear pixel-wise classifier.
  • results: The method achieves competitive results on multiple few-shot medical image segmentation datasets, improving on fine-tuning the mask decoder with a few images by more than 15%.
    Abstract Recent advancements in large foundation models have shown promising potential in the medical industry due to their flexible prompting capability. One such model, the Segment Anything Model (SAM), a prompt-driven segmentation model, has shown remarkable performance improvements, surpassing state-of-the-art approaches in medical image segmentation. However, existing methods primarily rely on tuning strategies that require extensive data or prior prompts tailored to the specific task, making it particularly challenging when only a limited number of data samples are available. In this paper, we propose a novel perspective on self-prompting in medical vision applications. Specifically, we harness the embedding space of SAM to prompt itself through a simple yet effective linear pixel-wise classifier. By preserving the encoding capabilities of the large model, the contextual information from its decoder, and leveraging its interactive promptability, we achieve competitive results on multiple datasets (i.e. improvement of more than 15% compared to fine-tuning the mask decoder using a few images).
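A minimal version of the self-prompting loop can be written with a frozen embedding and an off-the-shelf linear model: fit a pixel-wise classifier on a few labeled embeddings, then turn its coarse mask into a prompt (e.g., a bounding box) for SAM's decoder. Embedding shapes are hypothetical and this is not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pixel_classifier(emb: np.ndarray, mask: np.ndarray):
    # emb: (H, W, C) frozen SAM image embedding; mask: (H, W) few-shot labels
    clf = LogisticRegression(max_iter=1000)
    clf.fit(emb.reshape(-1, emb.shape[-1]), mask.reshape(-1))
    return clf

def coarse_mask_and_box(clf, emb):
    prob = clf.predict_proba(emb.reshape(-1, emb.shape[-1]))[:, 1]
    m = prob.reshape(emb.shape[:2]) > 0.5
    ys, xs = np.nonzero(m)
    box = (xs.min(), ys.min(), xs.max(), ys.max()) if xs.size else None
    return m, box        # the box can be fed back to SAM as a prompt
```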

Self-supervised Hypergraphs for Learning Multiple World Interpretations

  • paper_url: http://arxiv.org/abs/2308.07615
  • repo_url: None
  • paper_authors: Alina Marcu, Mihai Pirvu, Dragos Costea, Emanuela Haller, Emil Slusanschi, Ahmed Nabil Belbachir, Rahul Sukthankar, Marius Leordeanu
  • for: Learning multiple scene representations from a small labeled set.
  • methods: The relationships between scene representations are exploited to build a multi-task hypergraph; the hypergraph is also used to improve a powerful pretrained VisTransformer model without any additional labeled data.
  • results: Self-supervised learning with different types of hyperedges and ensemble models outperforms other multi-task graph models in the field.
    Abstract We present a method for learning multiple scene representations given a small labeled set, by exploiting the relationships between such representations in the form of a multi-task hypergraph. We also show how we can use the hypergraph to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer (e.g., depth or segmentation) of the scene. Within each hyperedge, one or several input nodes predict the layer at the output node. Thus, each node could be an input node in some hyperedges and an output node in others. In this way, multiple paths can reach the same node, to form ensembles from which we obtain robust pseudolabels, which allow self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show superior performance to other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.
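The pseudolabeling step described in the abstract can be sketched generically: several hyperedge paths predict the same output node, and a robust combination of their outputs becomes the self-supervised target. The path functions below are hypothetical stand-ins for learned edge chains.

```python
import torch
import torch.nn.functional as F

def ensemble_pseudolabel(paths, x):
    preds = torch.stack([p(x) for p in paths])   # one prediction per hyperedge path
    return preds.median(dim=0).values            # robust pseudolabel for the node

def self_supervised_loss(student, paths, x):
    with torch.no_grad():
        target = ensemble_pseudolabel(paths, x)
    return F.l1_loss(student(x), target)
```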

GAMER-MRIL identifies Disability-Related Brain Changes in Multiple Sclerosis

  • paper_url: http://arxiv.org/abs/2308.07611
  • repo_url: None
  • paper_authors: Po-Jui Lu, Benjamin Odry, Muhamed Barakovic, Matthias Weigel, Robin Sandkühler, Reza Rahmanzadeh, Xinjie Chen, Mario Ocampo-Pineda, Jens Kuhle, Ludwig Kappos, Philippe Cattin, Cristina Granziera
  • for: The paper aims to identify disability-related brain changes in multiple sclerosis (MS) patients using whole-brain quantitative MRI (qMRI) and a novel comprehensive approach called GAMER-MRIL.
  • methods: A gated-attention-based convolutional neural network (CNN) selects the patch-based qMRI most important for a given task/question, and a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), is adapted to incorporate qMRI and identify disability-related brain regions.
  • results: The approach achieved an AUC of 0.885; qT1 was the measure most sensitive to disability, followed by NDI. The adapted LRP obtained more specifically relevant regions than the saliency map, integrated gradients, and the original LRP, including the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($\rho$ = -0.37 and 0.44).
    Abstract Objective: Identifying disability-related brain changes is important for multiple sclerosis (MS) patients. Currently, there is no clear understanding about which pathological features drive disability in single MS patients. In this work, we propose a novel comprehensive approach, GAMER-MRIL, leveraging whole-brain quantitative MRI (qMRI), convolutional neural network (CNN), and an interpretability method from classifying MS patients with severe disability to investigating relevant pathological brain changes. Methods: One-hundred-sixty-six MS patients underwent 3T MRI acquisitions. qMRI informative of microstructural brain properties was reconstructed, including quantitative T1 (qT1), myelin water fraction (MWF), and neurite density index (NDI). To fully utilize the qMRI, GAMER-MRIL extended a gated-attention-based CNN (GAMER-MRI), which was developed to select patch-based qMRI important for a given task/question, to the whole-brain image. To find out disability-related brain regions, GAMER-MRIL modified a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), to incorporate qMRI. Results: The test performance was AUC=0.885. qT1 was the most sensitive measure related to disability, followed by NDI. The proposed LRP approach obtained more specifically relevant regions than other interpretability methods, including the saliency map, the integrated gradients, and the original LRP. The relevant regions included the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($\rho$=-0.37 and 0.44). Conclusion: These results demonstrated that GAMER-MRIL can classify patients with severe disability using qMRI and subsequently identify brain regions potentially important to the integrity of the mobile function. Significance: GAMER-MRIL holds promise for developing biomarkers and increasing clinicians' trust in NN.
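The gated-attention pooling at the heart of GAMER-MRI-style models follows the standard gated attention formulation (a tanh content branch gated by a sigmoid branch); the sketch below shows that mechanism with hypothetical dimensions, not the study's network.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)   # content branch (tanh)
        self.U = nn.Linear(dim, hidden)   # gate branch (sigmoid)
        self.w = nn.Linear(hidden, 1)

    def forward(self, patches):          # patches: (num_patches, dim)
        a = self.w(torch.tanh(self.V(patches)) * torch.sigmoid(self.U(patches)))
        attn = torch.softmax(a, dim=0)   # which qMRI patches matter for the task
        return (attn * patches).sum(0), attn

pooled, weights = GatedAttentionPool()(torch.randn(32, 256))
```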

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

  • paper_url: http://arxiv.org/abs/2308.07593
  • repo_url: None
  • paper_authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
  • for: The paper targets visual speech recognition (lip reading), complementing the insufficient speech information of the visual modality with audio knowledge to improve recognition accuracy.
  • methods: The proposed Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) encodes rich audio knowledge with a large-scale pretrained audio model, discards non-linguistic information through quantization so that linguistic content is stored in a compact audio memory, and matches visual features to this memory with an Audio Bridging Module, so training requires no audio input once the memory is built.
  • results: Extensive experiments validate the approach, which achieves new state-of-the-art performance on the widely used LRS2 and LRS3 datasets.
    Abstract Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs, once after the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used datasets, LRS2 and LRS3.
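Two ingredients from the abstract can be sketched compactly: a quantized codebook acting as the compact audio memory, and cross-attention letting visual features query that memory so no raw audio is needed at training time once the memory exists. Sizes and the nearest-code quantizer are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class AudioMemoryBridge(nn.Module):
    def __init__(self, num_codes=256, dim=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_codes, dim))  # compact audio memory
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def quantize(self, audio_feats):      # (B, T, dim) from a pretrained audio model
        mem = self.memory.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        idx = torch.cdist(audio_feats, mem).argmin(-1)
        return self.memory[idx]           # nearest code: keeps linguistic content

    def forward(self, visual_feats):      # (B, T, dim); no audio input needed here
        mem = self.memory.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        bridged, _ = self.attn(visual_feats, mem, mem)  # visual queries audio memory
        return visual_feats + bridged
```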

Graph-Segmenter: Graph Transformer with Boundary-aware Attention for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.07592
  • repo_url: None
  • paper_authors: Zizhang Wu, Yuanzhu Gan, Tianhao Xu, Fan Wang
  • for: Improving the performance of semantic segmentation.
  • methods: A Graph Transformer models relations between windows in a global view and between pixels within each window in a local view, while a boundary-aware attention module refines object edges at low cost.
  • results: State-of-the-art performance on three widely used semantic segmentation datasets (Cityscapes, ADE-20k, PASCAL Context).
    Abstract The transformer-based semantic segmentation approaches, which divide the image into different regions by sliding windows and model the relation inside each window, have achieved outstanding success. However, since the relation modeling between windows was not the primary emphasis of previous work, it was not fully utilized. To address this issue, we propose a Graph-Segmenter, including a Graph Transformer and a Boundary-aware Attention module, which is an effective network for simultaneously modeling the more profound relation between windows in a global view and various pixels inside each window as a local one, and for substantial low-cost boundary adjustment. Specifically, we treat every window and pixel inside the window as nodes to construct graphs for both views and devise the Graph Transformer. The introduced boundary-aware attention module optimizes the edge information of the target objects by modeling the relationship between the pixel on the object's edge. Extensive experiments on three widely used semantic segmentation datasets (Cityscapes, ADE-20k and PASCAL Context) demonstrate that our proposed network, a Graph Transformer with Boundary-aware Attention, can achieve state-of-the-art segmentation performance.
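My own simplified rendering of window-level relation modeling is shown below: pool each window to a node, attend over the nodes for a global view, and broadcast the update back to the window's pixels. It illustrates the idea, not the paper's Graph Transformer.

```python
import torch
import torch.nn as nn

def window_graph_attention(x, win=8, attn=None):
    b, c, h, w = x.shape
    xw = x.unfold(2, win, win).unfold(3, win, win)            # windows as blocks
    nodes = xw.mean(dim=(-1, -2)).flatten(2).transpose(1, 2)  # (B, N, C) node tokens
    attn = attn or nn.MultiheadAttention(c, 4, batch_first=True)
    upd, _ = attn(nodes, nodes, nodes)                        # window-to-window relations
    upd = upd.transpose(1, 2).reshape(b, c, h // win, w // win)
    return x + upd.repeat_interleave(win, 2).repeat_interleave(win, 3)

out = window_graph_attention(torch.randn(1, 64, 32, 32))
```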

ADD: An Automatic Desensitization Fisheye Dataset for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.07590
  • repo_url: None
  • paper_authors: Zizhang Wu, Chenxin Yuan, Hongyang Wei, Fan Song, Tianhao Xu
  • for: To protect private information in images captured by large-FoV fisheye cameras in autonomous driving, in line with data-security laws and regulations.
  • methods: Based on surround-view fisheye imagery, the authors build the first Autopilot Desensitization Dataset (ADD) and formulate a deep-learning-based image desensitization framework, with the efficient multitask network DesCenterNet jointly detecting and desensitizing faces and vehicle license plates.
  • results: A new evaluation criterion for desensitization performance is provided, and extensive comparison experiments verify the effectiveness and superiority of the method.
    Abstract Autonomous driving systems require many images for analyzing the surrounding environment. However, private information in these captured images, such as pedestrian faces or vehicle license plates, receives little protection, which has become a significant issue. In this paper, in response to the call for data security laws and regulations and based on the advantages of the large Field of View (FoV) of the fisheye camera, we build the first Autopilot Desensitization Dataset, called ADD, and formulate the first deep-learning-based image desensitization framework, to promote the study of image desensitization in autonomous driving scenarios. The compiled dataset consists of 650K images, including different face and vehicle license plate information captured by the surround-view fisheye camera. It covers various autonomous driving scenarios, including diverse facial characteristics and license plate colors. Then, we propose an efficient multitask desensitization network called DesCenterNet as a benchmark on the ADD dataset, which can perform face and vehicle license plate detection and desensitization tasks. Based on ADD, we further provide an evaluation criterion for desensitization performance, and extensive comparison experiments have verified the effectiveness and superiority of our method on image desensitization.
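The desensitization step itself is simple once detections exist; the sketch below blurs detected face or license-plate boxes with OpenCV. The detector producing the boxes (DesCenterNet in the paper) is assumed, not shown.

```python
import cv2
import numpy as np

def desensitize(image: np.ndarray, boxes, ksize: int = 31) -> np.ndarray:
    out = image.copy()
    for (x1, y1, x2, y2) in boxes:               # boxes from a face/plate detector
        out[y1:y2, x1:x2] = cv2.GaussianBlur(out[y1:y2, x1:x2], (ksize, ksize), 0)
    return out

img = np.random.randint(0, 255, (480, 640, 3), np.uint8)
anon = desensitize(img, boxes=[(100, 120, 180, 200)])
```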

Synthetic data generation method for hybrid image-tabular data using two generative adversarial networks

  • paper_url: http://arxiv.org/abs/2308.07573
  • repo_url: None
  • paper_authors: Tomohiro Kikuchi, Shouhei Hanaoka, Takahiro Nakao, Tomomi Takenaga, Yukihiro Nomura, Harushi Mori, Takeharu Yoshikawa
  • for: The paper proposes a method for generating synthetic medical records, to mitigate privacy concerns and promote data sharing in the medical field.
  • methods: It combines an auto-encoding GAN (αGAN), which compresses chest X-rays (CXRs) into latent vectors, with a conditional tabular GAN (CTGAN) trained on those latents joined with structured tabular data (anthropometric data and laboratory tests).
  • results: The method successfully generates diverse synthetic hybrid records of CXR images and tabular data while maintaining the correspondence between them.
    Abstract The generation of synthetic medical records using generative adversarial networks (GANs) has become increasingly important for addressing privacy concerns and promoting data sharing in the medical field. In this paper, we propose a novel method for generating synthetic hybrid medical records consisting of chest X-ray images (CXRs) and structured tabular data (including anthropometric data and laboratory tests) using an auto-encoding GAN ({\alpha}GAN) and a conditional tabular GAN (CTGAN). Our approach involves training a {\alpha}GAN model on a large public database (pDB) to reduce the dimensionality of CXRs. We then applied the trained encoder of the GAN model to the images in original database (oDB) to obtain the latent vectors. These latent vectors were combined with tabular data in oDB, and these joint data were used to train the CTGAN model. We successfully generated diverse synthetic records of hybrid CXR and tabular data, maintaining correspondence between them. We evaluated this synthetic database (sDB) through visual assessment, distribution of interrecord distances, and classification tasks. Our evaluation results showed that the sDB captured the features of the oDB while maintaining the correspondence between the images and tabular data. Although our approach relies on the availability of a large-scale pDB containing a substantial number of images with the same modality and imaging region as those in the oDB, this method has the potential for the public release of synthetic datasets without compromising the secondary use of data.
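The joining step between the two GANs can be sketched with the real ctgan package: encode each CXR to a latent vector (the trained αGAN encoder is assumed here), concatenate it with the tabular row, and fit CTGAN on the joint table. Column names and the latent size are hypothetical.

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

def build_joint_table(latents: np.ndarray, tabular: pd.DataFrame) -> pd.DataFrame:
    lat = pd.DataFrame(latents, columns=[f"z{i}" for i in range(latents.shape[1])])
    return pd.concat([lat, tabular.reset_index(drop=True)], axis=1)

latents = np.random.randn(100, 32)            # stand-in for encoder(cxr_images)
tab = pd.DataFrame({"height_cm": np.random.normal(165, 10, 100),
                    "wbc": np.random.normal(6.5, 1.5, 100)})
model = CTGAN(epochs=1)
model.fit(build_joint_table(latents, tab))    # all-continuous columns in this sketch
synthetic = model.sample(10)                  # z* columns decode back to images later
```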

Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition

  • paper_url: http://arxiv.org/abs/2308.07571
  • repo_url: https://github.com/osvai/ske2grid
  • paper_authors: Dongqi Cai, Yangyuxuan Kang, Anbang Yao, Yurong Chen
  • for: The paper presents a new representation-learning framework for improved skeleton-based action recognition.
  • methods: Three novel designs build a grid representation of the human skeleton with stronger representation power: a graph-node index transform (GIT), an up-sampling transform (UPT), and a progressive learning strategy (PLS).
  • results: Experiments show that Ske2Grid significantly outperforms existing GCN-based solutions on six mainstream skeleton-based action recognition datasets, without bells and whistles.
    Abstract This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of human skeleton, which is a compact image-like grid patch constructed and learned through three novel designs. Specifically, we propose a graph-node index transform (GIT) to construct a regular grid patch through assigning the nodes in the skeleton graph one by one to the desired grid cells. To ensure that GIT is a bijection and enrich the expressiveness of the grid representation, an up-sampling transform (UPT) is learned to interpolate the skeleton graph nodes for filling the grid patch to the full. To resolve the problem when the one-step UPT is aggressive and further exploit the representation capability of the grid patch with increasing spatial size, a progressive learning strategy (PLS) is proposed which decouples the UPT into multiple steps and aligns them to multiple paired GITs through a compact cascaded design learned progressively. We construct networks upon prevailing graph convolution networks and conduct experiments on six mainstream skeleton-based action recognition datasets. Experiments show that our Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings, without bells and whistles. Code and models are available at https://github.com/OSVAI/Ske2Grid
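The graph-node index transform can be pictured as a bijection from joints to grid cells; the toy sketch below scatters per-joint features into an image-like patch that a regular convolution can scan. The assignment and shapes are illustrative, and the up-sampling transform that fills empty cells is omitted.

```python
import torch

def skeleton_to_grid(joint_feats: torch.Tensor, assignment, grid_hw=(3, 3)):
    # joint_feats: (num_joints, C); assignment: joint index -> (row, col)
    h, w = grid_hw
    grid = torch.zeros(joint_feats.size(1), h, w)
    for j, (r, c) in assignment.items():
        grid[:, r, c] = joint_feats[j]
    return grid   # empty cells are filled by the up-sampling transform (UPT)

git = {0: (0, 1), 1: (1, 0), 2: (1, 1), 3: (1, 2), 4: (2, 1)}  # toy 5-joint GIT
patch = skeleton_to_grid(torch.randn(5, 16), git)              # (16, 3, 3)
```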

Improved mirror ball projection for more accurate merging of multiple camera outputs and process monitoring

  • paper_url: http://arxiv.org/abs/2308.10991
  • repo_url: https://github.com/FrostKiwi/Mirrorball
  • paper_authors: Wladislav Artsimovich, Yoko Hirono
  • for: Replacing wide-angle cameras with spherical mirrors enables cost-effective process monitoring in hazardous environments, including high heat, vacuum, and strong electromagnetic fields.
  • methods: The spherical reflection layers multiple camera types (e.g., color, near-infrared, long-wavelength infrared, ultraviolet) into a single wide-angle output while accounting for the different camera placements and lenses used.
  • results: The spherical projection reduces the parallax shift that different camera positions introduce, depending on mirror size and distance to the monitoring target; the paper also introduces a variant of the mirror-ball projection that corrects the distortion a perspective camera produces at the pole of the projection, and evaluates the efficacy of process monitoring via a mirror ball.
    Abstract Using spherical mirrors in place of wide-angle cameras allows for cost-effective monitoring of manufacturing processes in hazardous environment, where a camera would normally not operate. This includes environments of high heat, vacuum and strong electromagnetic fields. Moreover, it allows the layering of multiple camera types (e.g., color image, near-infrared, long-wavelength infrared, ultraviolet) into a single wide-angle output, whilst accounting for the different camera placements and lenses used. Normally, the different camera positions introduce a parallax shift between the images, but with a spherical projection as produced by a spherical mirror, this parallax shift is reduced, depending on mirror size and distance to the monitoring target. This paper introduces a variation of the 'mirror ball projection', that accounts for distortion produced by a perspective camera at the pole of the projection. Finally, the efficacy of process monitoring via a mirror ball is evaluated.
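For intuition, the classic mirror-ball unwrap maps each pixel on the ball to a world direction through the reflection law; the sketch below assumes an orthographic camera, whereas the paper's variant specifically corrects the distortion a perspective camera adds at the projection pole.

```python
import numpy as np

def mirrorball_direction(u: float, v: float):
    """(u, v) in [-1, 1]^2 on the ball image -> reflected world direction."""
    r2 = u * u + v * v
    if r2 > 1.0:
        return None                               # outside the mirror's disc
    n = np.array([u, v, np.sqrt(1.0 - r2)])       # sphere normal at the pixel
    view = np.array([0.0, 0.0, -1.0])             # orthographic incoming ray
    d = view - 2.0 * np.dot(view, n) * n          # reflection: d = v - 2(v.n)n
    return d / np.linalg.norm(d)

print(mirrorball_direction(0.0, 0.0))  # center reflects straight back: (0, 0, 1)
```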

SST: A Simplified Swin Transformer-based Model for Taxi Destination Prediction based on Existing Trajectory

  • paper_url: http://arxiv.org/abs/2308.07555
  • repo_url: None
  • paper_authors: Zepu Wang, Yifei Sun, Zhiyu Lei, Xincheng Zhu, Peng Sun
  • for: Accurately predicting the destination of taxi trajectories has many benefits for intelligent location-based services.
  • methods: The taxi trajectory is converted into a two-dimensional grid and processed with computer vision techniques.
  • results: Experiments on real trajectory data show that the proposed simplified Swin Transformer (SST) achieves higher accuracy than state-of-the-art methods.
    Abstract Accurately predicting the destination of taxi trajectories can have various benefits for intelligent location-based services. One potential method to accomplish this prediction is by converting the taxi trajectory into a two-dimensional grid and using computer vision techniques. While the Swin Transformer is an innovative computer vision architecture with demonstrated success in vision downstream tasks, it is not commonly used to solve real-world trajectory problems. In this paper, we propose a simplified Swin Transformer (SST) structure that does not use the shifted window idea in the traditional Swin Transformer, as trajectory data is consecutive in nature. Our comprehensive experiments, based on real trajectory data, demonstrate that SST can achieve higher accuracy compared to state-of-the-art methods.
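The input conversion the abstract relies on is easy to make concrete: rasterize a GPS trajectory into a 2D occupancy grid so an image backbone can consume it. The bounding box and grid size below are hypothetical.

```python
import numpy as np

def trajectory_to_grid(points, bbox, size=64):
    # points: iterable of (lon, lat); bbox: (lon_min, lat_min, lon_max, lat_max)
    lon0, lat0, lon1, lat1 = bbox
    grid = np.zeros((size, size), dtype=np.float32)
    for lon, lat in points:
        i = int((lat - lat0) / (lat1 - lat0) * (size - 1))
        j = int((lon - lon0) / (lon1 - lon0) * (size - 1))
        if 0 <= i < size and 0 <= j < size:
            grid[i, j] = 1.0                     # mark visited cell
    return grid

g = trajectory_to_grid([(116.30, 39.90), (116.31, 39.91)],
                       bbox=(116.25, 39.85, 116.45, 40.05))
```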

Multi-view 3D Face Reconstruction Based on Flame

  • paper_url: http://arxiv.org/abs/2308.07551
  • repo_url: None
  • paper_authors: Wenzhuo Zheng, Junhao Zhao, Xiaohong Liu, Yongyang Pan, Zhenghao Gan, Haozhe Han, Ning Liu
  • for: The work aims to improve 3D face reconstruction quality by combining a multi-view training framework with the face parametric model FLAME, proposing the multi-view training and testing model MFNet.
  • methods: A self-supervised training framework is built with constraints such as a multi-view optical-flow loss and a face landmark loss, including novel implementations of the multi-view optical-flow loss and a covisible mask.
  • results: The model achieves good 3D face reconstruction results on the AFLW and FaceScape datasets as well as on self-captured face photos simulating real-world scenarios.
    Abstract At present, face 3D reconstruction has broad application prospects in various fields, but the research on it is still in the development stage. In this paper, we aim to achieve better face 3D reconstruction quality by combining a multi-view training framework with the face parametric model FLAME, and propose a multi-view training and testing model, MFNet (Multi-view Flame Network). We build a self-supervised training framework and implement constraints such as a multi-view optical flow loss function and a face landmark loss, and finally obtain a complete MFNet. We propose innovative implementations of the multi-view optical flow loss and the covisible mask. We test our model on the AFLW and FaceScape datasets and also take pictures of our own faces to reconstruct 3D faces while simulating actual scenarios as much as possible, achieving good results. Our work mainly addresses the problem of combining parametric face models with multi-view face 3D reconstruction and explores the implementation of a FLAME-based multi-view training and testing framework for contributing to the field of face 3D reconstruction.
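The two constraints named in the abstract reduce to simple loss terms; the sketch below assumes a camera projection function, a flow-based warp of view B into view A, and a covisible mask, all supplied by the surrounding pipeline and not reproduced here.

```python
import torch.nn.functional as F

def landmark_loss(pred_lmk_3d, cam_project, gt_lmk_2d):
    return F.l1_loss(cam_project(pred_lmk_3d), gt_lmk_2d)   # reprojection error

def multiview_flow_loss(img_a, img_b_warped_to_a, covisible_mask):
    # only penalize pixels visible in both views after flow warping
    diff = (img_a - img_b_warped_to_a).abs() * covisible_mask
    return diff.sum() / covisible_mask.sum().clamp(min=1.0)
```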

3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack

  • paper_url: http://arxiv.org/abs/2308.07546
  • repo_url: None
  • paper_authors: Yunbo Tao, Daizong Liu, Pan Zhou, Yulai Xie, Wei Du, Wei Hu
  • for: The security of 3D point cloud models is receiving increasing attention in applications such as autonomous driving and robot navigation.
  • methods: The paper proposes a new attack, the 3D Hard-label attacker (3DHacker), which generates adversarial samples using only the predicted class label of the input: it fuses two point clouds of different classes in the spectral domain, projects the intermediate sample onto the decision boundary via binary search, and iteratively moves it along the boundary to minimize the perturbation.
  • results: Even in the challenging black-box hard-label setting, 3DHacker competitively outperforms existing 3D attacks in both attack performance and adversary quality, with small perturbations.
    Abstract With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in various applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers either follow the white-box setting to iteratively update the coordinate perturbations based on gradients, or utilize the output model logits to estimate noisy gradients in the black-box setting. However, these attack methods are hard to be deployed in real-world scenarios since realistic 3D applications will not share any model details to users. Therefore, we explore a more challenging yet practical 3D attack setting, \textit{i.e.}, attacking point clouds with black-box hard labels, in which the attacker can only have access to the prediction label of the input. To tackle this setting, we propose a novel 3D attack method, termed \textbf{3D} \textbf{H}ard-label att\textbf{acker} (\textbf{3DHacker}), based on the developed decision boundary algorithm to generate adversarial samples solely with the knowledge of class labels. Specifically, to construct the class-aware model decision boundary, 3DHacker first randomly fuses two point clouds of different classes in the spectral domain to craft their intermediate sample with high imperceptibility, then projects it onto the decision boundary via binary search. To restrict the final perturbation size, 3DHacker further introduces an iterative optimization strategy to move the intermediate sample along the decision boundary for generating adversarial point clouds with smallest trivial perturbations. Extensive evaluations show that, even in the challenging hard-label setting, 3DHacker still competitively outperforms existing 3D attacks regarding the attack performance as well as adversary quality.
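The hard-label primitive underlying such attacks is a label-only binary search onto the decision boundary; the sketch below shows it for a generic classifier and omits the spectral-domain fusion that produces the initial adversarial seed.

```python
import torch

def boundary_binary_search(model, x_clean, x_adv, y_true, steps=20):
    """Walk the segment between clean and adversarial inputs using labels only."""
    lo, hi = 0.0, 1.0                       # x_adv must already be misclassified
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        x_mid = (1 - mid) * x_clean + mid * x_adv
        if model(x_mid.unsqueeze(0)).argmax(1).item() == y_true:
            lo = mid                        # still correct: move toward adversarial
        else:
            hi = mid                        # misclassified: tighten toward clean
    return (1 - hi) * x_clean + hi * x_adv  # point just past the boundary
```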

Multimodal Dataset Distillation for Image-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.07545
  • repo_url: None
  • paper_authors: Xindi Wu, Zhiwei Deng, Olga Russakovsky
  • for: The paper extends dataset distillation beyond image classification to the training of vision-language models, so that new models can be trained from scratch on a much smaller synthetic set.
  • methods: A multimodal dataset distillation method based on trajectory matching jointly distills images and their corresponding language descriptions in a contrastive formulation.
  • results: Compared with three coreset selection methods (strategic subsampling of the training dataset) adapted to the vision-language setting, the approach shows clear gains on the challenging Flickr30K and COCO retrieval benchmarks: the best coreset method, selecting 1,000 image-text pairs for training, reaches only 5.6% image-to-text retrieval accuracy (recall@1), whereas the distillation approach almost doubles that with just 100 training pairs (an order of magnitude fewer).
    Abstract Dataset distillation methods offer the promise of reducing a large-scale dataset down to a significantly smaller set of (potentially synthetic) training examples, which preserve sufficient information for training a new model from scratch. So far dataset distillation methods have been developed for image classification. However, with the rise in capabilities of vision-language models, and especially given the scale of datasets necessary to train these models, the time is ripe to expand dataset distillation methods beyond image classification. In this work, we take the first steps towards this goal by expanding on the idea of trajectory matching to create a distillation method for vision-language datasets. The key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed multimodal dataset distillation method jointly distill the images and their corresponding language descriptions in a contrastive formulation. Since there are no existing baselines, we compare our approach to three coreset selection methods (strategic subsampling of the training dataset), which we adapt to the vision-language setting. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmark: the best coreset selection method which selects 1000 image-text pairs for training is able to achieve only 5.6% image-to-text retrieval accuracy (recall@1); in contrast, our dataset distillation approach almost doubles that with just 100 (an order of magnitude fewer) training pairs.
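The contrastive formulation being distilled is the standard symmetric InfoNCE over image-text pairs; in distillation the synthetic pairs themselves are trainable tensors (optimized, e.g., by trajectory matching, which is not shown here).

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temp=0.07):
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temp
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Distilled data as learnable parameters (100 pairs, matching the paper's budget):
syn_images = torch.nn.Parameter(torch.randn(100, 3, 224, 224))
syn_text_emb = torch.nn.Parameter(torch.randn(100, 512))
```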

Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond

  • paper_url: http://arxiv.org/abs/2308.07539
  • repo_url: None
  • paper_authors: Chen Shuai, Meng Fanman, Zhang Runtong, Qiu Heqian, Li Hongliang, Wu Qingbo, Xu Linfeng
  • for: The paper targets few-shot segmentation (FSS), specifically enhancing the generalization ability of FSS models using CLIP.
  • methods: The proposed PGMA-Net employs a class-agnostic mask assembly process to alleviate bias towards base classes and formulates diverse tasks in a unified manner by assembling the prior through affinity: a Prior-Guided Mask Assemble Module (PGMAM) with multiple General Assemble Units (GAUs) covers diverse, plug-and-play interactions (visual-textual, inter- and intra-image, training-free, high-order), and a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) flexibly exploits the assembled masks and low-level features without class-specific information.
  • results: PGMA-Net sets a new state of the art in FSS, with mIoU of 77.6 on $\text{PASCAL-}5^i$ and 59.4 on $\text{COCO-}20^i$ in the 1-shot scenario; without extra re-training, it also solves bbox-level and cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS), yielding an any-shot segmentation framework.
    Abstract Few-shot segmentation (FSS) aims to segment the novel classes with a few annotated images. Due to CLIP's advantages of aligning visual and textual information, the integration of CLIP can enhance the generalization ability of FSS model. However, even with the CLIP model, the existing CLIP-based FSS methods are still subject to the biased prediction towards base classes, which is caused by the class-specific feature level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks into a unified manner by assembling the prior through affinity. Specifically, the class-relevant textual and visual features are first transformed to class-agnostic prior in the form of probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) including multiple General Assemble Units (GAUs) is introduced. It considers diverse and plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure the class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features, without relying on any class-specific information. It achieves new state-of-the-art results in the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on $\text{COCO-}20^i$ in 1-shot scenario. Beyond this, we show that without extra re-training, the proposed PGMA-Net can solve bbox-level and cross-domain FSS, co-segmentation, zero-shot segmentation (ZSS) tasks, leading an any-shot segmentation framework.
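The class-agnostic prior can be pictured as a cosine-affinity map between query pixels and a class-relevant feature (such as a CLIP text embedding); the sketch below shows that construction with hypothetical feature shapes, leaving out the assembly units and decoder.

```python
import torch
import torch.nn.functional as F

def prior_probability_map(query_feats, class_feat):
    # query_feats: (C, H, W); class_feat: (C,), e.g. a CLIP text embedding
    q = F.normalize(query_feats.flatten(1), dim=0)   # (C, HW)
    c = F.normalize(class_feat, dim=0)               # (C,)
    affinity = (c @ q).view(query_feats.shape[1:])   # (H, W) cosine affinity
    return affinity.sigmoid()                        # class-agnostic prior map

prior = prior_probability_map(torch.randn(512, 32, 32), torch.randn(512))
```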

AttMOT: Improving Multiple-Object Tracking by Introducing Auxiliary Pedestrian Attributes

  • paper_url: http://arxiv.org/abs/2308.07537
  • repo_url: None
  • paper_authors: Yunhao Li, Zhen Xiao, Lin Yang, Dan Meng, Xin Zhou, Heng Fan, Libo Zhang
  • for: The paper addresses the underexplored use of pedestrian attributes in multi-object tracking (MOT) and proposes predicting pedestrian attributes to support general Re-ID embeddings.
  • methods: The proposed AAM explores different approaches to fuse Re-ID embeddings and pedestrian attributes, including attention mechanisms, supported by AttMOT, a large synthetic pedestrian-tracking dataset with semantic attributes.
  • results: Applied to state-of-the-art trackers on representative benchmarks including MOT17 and MOT20, AAM achieves consistent improvements in MOTA, HOTA, AssA, IDs, and IDF1 scores.
    Abstract Multi-object tracking (MOT) is a fundamental problem in computer vision with numerous applications, such as intelligent surveillance and automated driving. Despite the significant progress made in MOT, pedestrian attributes, such as gender, hairstyle, body shape, and clothing features, which contain rich and high-level information, have been less explored. To address this gap, we propose a simple, effective, and generic method to predict pedestrian attributes to support general Re-ID embedding. We first introduce AttMOT, a large, highly enriched synthetic dataset for pedestrian tracking, containing over 80k frames and 6 million pedestrian IDs with different time, weather conditions, and scenarios. To the best of our knowledge, AttMOT is the first MOT dataset with semantic attributes. Subsequently, we explore different approaches to fuse Re-ID embedding and pedestrian attributes, including attention mechanisms, which we hope will stimulate the development of attribute-assisted MOT. The proposed method AAM demonstrates its effectiveness and generality on several representative pedestrian multi-object tracking benchmarks, including MOT17 and MOT20, through experiments on the AttMOT dataset. When applied to state-of-the-art trackers, AAM achieves consistent improvements in MOTA, HOTA, AssA, IDs, and IDF1 scores. For instance, on MOT17, the proposed method yields a +1.1 MOTA, +1.7 HOTA, and +1.8 IDF1 improvement when used with FairMOT. To encourage further research on attribute-assisted MOT, we will release the AttMOT dataset.
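One way to fuse attributes with Re-ID, in the spirit of the attention variants the paper explores, is to treat the Re-ID vector and an attribute embedding as two tokens and attend over them; the dimensions and fusion head below are my own illustrative choices, not the paper's AAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttrReIDFusion(nn.Module):
    def __init__(self, reid_dim=128, num_attrs=16):
        super().__init__()
        self.attr_proj = nn.Linear(num_attrs, reid_dim)
        self.attn = nn.MultiheadAttention(reid_dim, num_heads=4, batch_first=True)

    def forward(self, reid, attrs):      # reid: (N, D); attrs: (N, A) predictions
        tokens = torch.stack([reid, self.attr_proj(attrs)], dim=1)  # (N, 2, D)
        fused, _ = self.attn(tokens, tokens, tokens)
        return F.normalize(fused.mean(dim=1), dim=-1)   # fused identity vector

emb = AttrReIDFusion()(torch.randn(5, 128), torch.rand(5, 16))
similarity = emb @ emb.t()   # association cost for the tracker
```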

Improved Region Proposal Network for Enhanced Few-Shot Object Detection

  • paper_url: http://arxiv.org/abs/2308.07535
  • repo_url: https://github.com/zshanggu/htrpn
  • paper_authors: Zeyu Shangguan, Mohammad Rostami
  • for: Addressing the limitations of standard supervised deep learning for object detection, which requires large amounts of annotated data, by improving few-shot object detection (FSOD).
  • methods: A semi-supervised approach that exploits unlabeled data during training; specifically, a hierarchical ternary classification region proposal network (HTRPN) detects potential unlabeled novel-class instances and assigns them new objectness labels, and an improved hierarchical sampling strategy for the RPN boosts perception of large objects.
  • results: Experiments on the COCO and PASCAL VOC benchmarks show that the method improves few-shot detection performance and outperforms existing state-of-the-art FSOD methods.
    Abstract Despite significant success of deep learning in object detection tasks, the standard training of deep neural networks requires access to a substantial quantity of annotated images across all classes. Data annotation is an arduous and time-consuming endeavor, particularly when dealing with infrequent objects. Few-shot object detection (FSOD) methods have emerged as a solution to the limitations of classic object detection approaches based on deep learning. FSOD methods demonstrate remarkable performance by achieving robust object detection using a significantly smaller amount of training data. A challenge for FSOD is that instances from novel classes that do not belong to the fixed set of training classes appear in the background and the base model may pick them up as potential objects. These objects behave similarly to label noise because they are classified as one of the training dataset classes, leading to FSOD performance degradation. We develop a semi-supervised algorithm to detect and then utilize these unlabeled novel objects as positive samples during the FSOD training stage to improve FSOD performance. Specifically, we develop a hierarchical ternary classification region proposal network (HTRPN) to localize the potential unlabeled novel objects and assign them new objectness labels to distinguish these objects from the base training dataset classes. Our improved hierarchical sampling strategy for the region proposal network (RPN) also boosts the perception ability of the object detection model for large objects. We test our approach on the COCO and PASCAL VOC benchmarks that are commonly used in the FSOD literature. Our experimental results indicate that our method is effective and outperforms the existing state-of-the-art (SOTA) FSOD methods. Our implementation is provided as a supplement to support reproducibility of the results.
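A minimal sketch of the ternary objectness idea, under the assumption that it can be reduced to a three-way anchor classification (background / base-class object / potential novel object); the paper's HTRPN hierarchy and losses are more involved:

```python
# Illustrative ternary objectness head for an RPN; names and dimensions are
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TernaryObjectnessHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # 3 logits per anchor: background / base object / potential novel object
        self.cls = nn.Conv2d(in_channels, num_anchors * 3, 1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        logits = self.cls(x)                         # (B, A*3, H, W)
        b, _, h, w = logits.shape
        return logits.view(b, -1, 3, h, w)           # per-anchor ternary scores

head = TernaryObjectnessHead()
scores = head(torch.randn(2, 256, 50, 50))
novel_mask = scores.softmax(dim=2)[:, :, 2] > 0.5    # candidate unlabeled novel objects
```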

Inverse Lithography Physics-informed Deep Neural Level Set for Mask Optimization

  • paper_url: http://arxiv.org/abs/2308.12299
  • repo_url: None
  • paper_authors: Xing-Yu Ma, Shaogang Hao
  • for: Improving resolution and printability in the lithography process through mask optimization (optical proximity correction).
  • methods: An inverse lithography physics-informed deep neural level set (ILDLS) approach that embeds level-set-based inverse lithography technology (ILT) as a layer within a deep learning framework and iteratively performs mask prediction and correction.
  • results: Compared with pure DL and ILT, ILDLS reduces computation time by a few orders of magnitude while enhancing printability and the process window (PW).
    Abstract As the feature size of integrated circuits continues to decrease, optical proximity correction (OPC) has emerged as a crucial resolution enhancement technology for ensuring high printability in the lithography process. Recently, level set-based inverse lithography technology (ILT) has drawn considerable attention as a promising OPC solution, showcasing its powerful pattern fidelity, especially in advanced process. However, massive computational time consumption of ILT limits its applicability to mainly correcting partial layers and hotspot regions. Deep learning (DL) methods have shown great potential in accelerating ILT. However, lack of domain knowledge of inverse lithography limits the ability of DL-based algorithms in process window (PW) enhancement and etc. In this paper, we propose an inverse lithography physics-informed deep neural level set (ILDLS) approach for mask optimization. This approach utilizes level set based-ILT as a layer within the DL framework and iteratively conducts mask prediction and correction to significantly enhance printability and PW in comparison with results from pure DL and ILT. With this approach, computation time is reduced by a few orders of magnitude versus ILT. By gearing up DL with knowledge of inverse lithography physics, ILDLS provides a new and efficient mask optimization solution.

Confidence Contours: Uncertainty-Aware Annotation for Medical Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.07528
  • repo_url: None
  • paper_authors: Andre Ye, Quan Ze Chen, Amy Zhang
  • for: Proposing a new uncertainty-aware annotation and representation for medical image segmentation, so that models better capture the uncertainty underlying visual ambiguity.
  • methods: A novel segmentation representation, Confidence Contours, which uses high- and low-confidence "contours" to capture uncertainty directly, together with a new annotation system for collecting contours.
  • results: Evaluations on the Lung Image Dataset Consortium (LIDC) and a synthetic dataset show that Confidence Contours provide high representative capacity without considerably higher annotator effort (annotation study with 30 participants); general-purpose segmentation models learn Confidence Contours at the same performance level as standard singular annotations; and interviews with 5 medical experts indicate that Confidence Contour maps are more interpretable than Bayesian maps because they represent structural uncertainty.
    Abstract Medical image segmentation modeling is a high-stakes task where understanding of uncertainty is crucial for addressing visual ambiguity. Prior work has developed segmentation models utilizing probabilistic or generative mechanisms to infer uncertainty from labels where annotators draw a singular boundary. However, as these annotations cannot represent an individual annotator's uncertainty, models trained on them produce uncertainty maps that are difficult to interpret. We propose a novel segmentation representation, Confidence Contours, which uses high- and low-confidence ``contours'' to capture uncertainty directly, and develop a novel annotation system for collecting contours. We conduct an evaluation on the Lung Image Dataset Consortium (LIDC) and a synthetic dataset. From an annotation study with 30 participants, results show that Confidence Contours provide high representative capacity without considerably higher annotator effort. We also find that general-purpose segmentation models can learn Confidence Contours at the same performance level as standard singular annotations. Finally, from interviews with 5 medical experts, we find that Confidence Contour maps are more interpretable than Bayesian maps due to representation of structural uncertainty.
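A minimal sketch of how a network might be supervised with Confidence Contours, assuming the two contours are rasterized into nested high- and low-confidence masks; the head, loss weighting, and nesting penalty below are illustrative assumptions, not the paper's exact formulation:

```python
# Toy two-channel head and loss for Confidence Contours; weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoContourHead(nn.Module):
    def __init__(self, in_ch=64):
        super().__init__()
        self.out = nn.Conv2d(in_ch, 2, 1)  # channel 0: high-conf, channel 1: low-conf

    def forward(self, feat):
        return self.out(feat)

def confidence_contour_loss(logits, high_mask, low_mask):
    # Supervise each channel with its own binary annotation.
    loss_high = F.binary_cross_entropy_with_logits(logits[:, 0], high_mask)
    loss_low = F.binary_cross_entropy_with_logits(logits[:, 1], low_mask)
    # Encourage nesting: the high-confidence region should lie inside the low-confidence one.
    p = torch.sigmoid(logits)
    nesting = F.relu(p[:, 0] - p[:, 1]).mean()
    return loss_high + loss_low + 0.1 * nesting

head = TwoContourHead()
logits = head(torch.randn(2, 64, 128, 128))
loss = confidence_contour_loss(logits,
                               torch.rand(2, 128, 128).round(),
                               torch.rand(2, 128, 128).round())
```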

Benchmarking Scalable Epistemic Uncertainty Quantification in Organ Segmentation

  • paper_url: http://arxiv.org/abs/2308.07506
  • repo_url: https://github.com/jadie1/medseguq
  • paper_authors: Jadie Adams, Shireen Y. Elhabian
  • for: Benchmarking epistemic uncertainty quantification methods for deep-learning-based automatic organ segmentation, toward reliable and robust models for clinical applications.
  • methods: A comparative benchmark of epistemic uncertainty quantification methods, including Bayesian neural networks, Monte Carlo dropout, and deep ensembles, evaluated in terms of accuracy, uncertainty calibration, and scalability.
  • results: Deep ensembles perform best in accuracy and uncertainty calibration, while Bayesian neural networks perform best in out-of-distribution detection; the paper also discusses the strengths and weaknesses of each method and gives recommendations for future improvements.
    Abstract Deep learning based methods for automatic organ segmentation have shown promise in aiding diagnosis and treatment planning. However, quantifying and understanding the uncertainty associated with model predictions is crucial in critical clinical applications. While many techniques have been proposed for epistemic or model-based uncertainty estimation, it is unclear which method is preferred in the medical image analysis setting. This paper presents a comprehensive benchmarking study that evaluates epistemic uncertainty quantification methods in organ segmentation in terms of accuracy, uncertainty calibration, and scalability. We provide a comprehensive discussion of the strengths, weaknesses, and out-of-distribution detection capabilities of each method as well as recommendations for future improvements. These findings contribute to the development of reliable and robust models that yield accurate segmentations while effectively quantifying epistemic uncertainty.
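For reference, a minimal deep-ensemble sketch of the kind of epistemic uncertainty such a benchmark compares: the mutual information between the predictive mean and the ensemble members serves as a per-pixel epistemic map (the model list below is a placeholder for trained segmentation networks):

```python
# Generic deep-ensemble epistemic uncertainty for segmentation.
import torch

@torch.no_grad()
def ensemble_uncertainty(models, image):
    """models: list of trained segmentation nets; image: (B, C, H, W)."""
    probs = torch.stack([m(image).softmax(dim=1) for m in models])  # (M, B, K, H, W)
    mean_p = probs.mean(dim=0)                                      # predictive mean
    # Epistemic part: mutual information = predictive entropy - expected entropy.
    pred_entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(dim=1)
    exp_entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=2).mean(dim=0)
    return mean_p.argmax(dim=1), pred_entropy - exp_entropy  # segmentation, epistemic map

# Toy usage with stand-in "networks".
models = [torch.nn.Conv2d(1, 4, kernel_size=1) for _ in range(5)]
seg, epistemic = ensemble_uncertainty(models, torch.randn(2, 1, 64, 64))
```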

ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection

  • paper_url: http://arxiv.org/abs/2308.07504
  • repo_url: https://github.com/chanchanchan97/icafusion
  • paper_authors: Jifeng Shen, Yifei Chen, Yue Liu, Xin Zuo, Heng Fan, Wankou Yang
  • for: Improving feature fusion of multispectral images to achieve more accurate multispectral object detection.
  • methods: A novel feature fusion framework based on dual cross-attention transformers that models global feature interaction and captures complementary information across modalities, plus an iterative interaction mechanism that shares parameters among block-wise multimodal transformers to reduce model complexity and computation cost.
  • results: Experiments on the KAIST, FLIR, and VEDAI datasets show superior performance with faster inference, making the method suitable for various practical scenarios.
    Abstract Effective feature fusion of multispectral images plays a crucial role in multi-spectral object detection. Previous studies have demonstrated the effectiveness of feature fusion using convolutional neural networks, but these methods are sensitive to image misalignment due to the inherent deficiency in local-range feature interaction, resulting in performance degradation. To address this issue, a novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction and capture complementary information across modalities simultaneously. This framework enhances the discriminability of object features through the query-guided cross-attention mechanism, leading to improved performance. However, stacking multiple transformer blocks for feature enhancement incurs a large number of parameters and high spatial complexity. To handle this, inspired by the human process of reviewing knowledge, an iterative interaction mechanism is proposed to share parameters among block-wise multimodal transformers, reducing model complexity and computation cost. The proposed method is general and effective to be integrated into different detection frameworks and used with different backbones. Experimental results on KAIST, FLIR, and VEDAI datasets show that the proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios. Code will be available at https://github.com/chanchanchan97/ICAFusion.
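A minimal sketch of the parameter-sharing idea, assuming it amounts to reusing one cross-attention block for several iterations so RGB and thermal tokens repeatedly query each other (layer names and the residual layout are assumptions):

```python
# Iterative cross-attention fusion with shared weights across iterations.
import torch
import torch.nn as nn

class SharedCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.attn_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_ir = nn.LayerNorm(dim)

    def forward(self, rgb, ir):  # (B, N, dim) token sequences per modality
        for _ in range(self.iterations):  # same weights each pass: shared parameters
            rgb = self.norm_rgb(rgb + self.attn_rgb(rgb, ir, ir)[0])  # RGB queries thermal
            ir = self.norm_ir(ir + self.attn_ir(ir, rgb, rgb)[0])     # thermal queries RGB
        return rgb, ir

fusion = SharedCrossAttentionFusion()
r, t = fusion(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```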

SpecTracle: Wearable Facial Motion Tracking from Unobtrusive Peripheral Cameras

  • paper_url: http://arxiv.org/abs/2308.07502
  • repo_url: None
  • paper_authors: Yinan Xuan, Varun Viswanath, Sunny Chu, Owen Bartolf, Jessica Echterhoff, Edward Wang
  • for: Enabling immersive "face-to-face" interaction in virtual environments through unobtrusive facial motion tracking.
  • methods: The SpecTracle system tracks a user's facial motions using two wide-angle cameras mounted right next to the visor of a HoloLens, with a neural-network model processing the wide-angle views.
  • results: The system runs in real time at 24 frames per second (fps) on a mobile GPU and tracks independent facial movement for different parts of the face with a user-independent model; a short personalized calibration improves tracking performance by 42.3% over the user-independent model.
    Abstract Facial motion tracking in head-mounted displays (HMD) has the potential to enable immersive "face-to-face" interaction in a virtual environment. However, current works on facial tracking are not suitable for unobtrusive augmented reality (AR) glasses or do not have the ability to track arbitrary facial movements. In this work, we demonstrate a novel system called SpecTracle that tracks a user's facial motions using two wide-angle cameras mounted right next to the visor of a Hololens. Avoiding the usage of cameras extended in front of the face, our system greatly improves the feasibility to integrate full-face tracking into a low-profile form factor. We also demonstrate that a neural network-based model processing the wide-angle cameras can run in real-time at 24 frames per second (fps) on a mobile GPU and track independent facial movement for different parts of the face with a user-independent model. Using a short personalized calibration, the system improves its tracking performance by 42.3% compared to the user-independent model.

BSED: Baseline Shapley-Based Explainable Detector

  • paper_url: http://arxiv.org/abs/2308.07490
  • repo_url: None
  • paper_authors: Michihiro Kuroki, Toshihiko Yamasaki
  • for: Improving the validity of explainable artificial intelligence (XAI) for image recognition by proposing the Baseline Shapley-based Explainable Detector (BSED), which satisfies the explainability axioms.
  • methods: Extending the Shapley value to object detection, attributing a model's prediction to a baseline feature; the method is model-agnostic, applicable to various detectors, and can interpret various detection targets without fine-grained parameter tuning.
  • results: Quantitative and qualitative comparisons show more valid explanations than existing methods, at a processing cost within a reasonable range (whereas the original Shapley value is prohibitively expensive); applications such as correcting detections based on the explanations are also demonstrated.
    Abstract Explainable artificial intelligence (XAI) has witnessed significant advances in the field of object recognition, with saliency maps being used to highlight image features relevant to the predictions of learned models. Although these advances have made AI-based technology more interpretable to humans, several issues have come to light. Some approaches present explanations irrelevant to predictions, and cannot guarantee the validity of XAI (axioms). In this study, we propose the Baseline Shapley-based Explainable Detector (BSED), which extends the Shapley value to object detection, thereby enhancing the validity of interpretation. The Shapley value can attribute the prediction of a learned model to a baseline feature while satisfying the explainability axioms. The processing cost for the BSED is within the reasonable range, while the original Shapley value is prohibitively computationally expensive. Furthermore, BSED is a generalizable method that can be applied to various detectors in a model-agnostic manner, and interpret various detection targets without fine-grained parameter tuning. These strengths can enable the practical applicability of XAI. We present quantitative and qualitative comparisons with existing methods to demonstrate the superior performance of our method in terms of explanation validity. Moreover, we present some applications, such as correcting detection based on explanations from our method.
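To ground the idea, here is a generic Monte Carlo estimator of baseline Shapley attributions over a feature vector; it illustrates the attribution principle BSED builds on, not the paper's detector-specific formulation:

```python
# Permutation-sampling estimator of baseline Shapley values; toy score function.
import numpy as np

def baseline_shapley(score_fn, x, baseline, n_samples=200, rng=None):
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_samples):
        perm = rng.permutation(d)
        z = baseline.copy()
        prev = score_fn(z)
        for i in perm:               # switch features from baseline to actual, in random order
            z[i] = x[i]
            cur = score_fn(z)
            phi[i] += cur - prev     # marginal contribution of feature i
            prev = cur
    return phi / n_samples           # attributions sum to score(x) - score(baseline)

score = lambda v: float(v[0] * 2.0 + v[1] ** 2)   # stand-in for a detector score
attr = baseline_shapley(score, np.array([1.0, 3.0]), np.zeros(2))
```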

Space Object Identification and Classification from Hyperspectral Material Analysis

  • paper_url: http://arxiv.org/abs/2308.07481
  • repo_url: None
  • paper_authors: Massimiliano Vasile, Lewis Walker, Andrew Campbell, Simao Marto, Paul Murray, Stephen Marshall, Vasili Savitski
  • for: A data processing pipeline that extracts information from the hyperspectral signature of unknown space objects, determining their material composition from single-pixel images.
  • methods: Two material identification and classification techniques, one based on machine learning and the other on a least-squares match against a library of known spectra, followed by a supervised machine learning classifier that assigns the object to a category based on the detected materials.
  • results: The paper presents preliminary results on space object identification and classification, including behaviour under non-ideal conditions such as weathered materials and spectra missing from the training library.
    Abstract This paper presents a data processing pipeline designed to extract information from the hyperspectral signature of unknown space objects. The methodology proposed in this paper determines the material composition of space objects from single pixel images. Two techniques are used for material identification and classification: one based on machine learning and the other based on a least square match with a library of known spectra. From this information, a supervised machine learning algorithm is used to classify the object into one of several categories based on the detection of materials on the object. The behaviour of the material classification methods is investigated under non-ideal circumstances, to determine the effect of weathered materials, and the behaviour when the training library is missing a material that is present in the object being observed. Finally the paper will present some preliminary results on the identification and classification of space objects.
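A minimal sketch of the least-squares matching branch, assuming the observed spectrum is modeled as a non-negative mixture of library spectra (the library contents below are synthetic placeholders):

```python
# Non-negative least-squares matching of a pixel spectrum against a spectral library.
import numpy as np
from scipy.optimize import nnls

def match_materials(observed, library, names):
    """observed: (bands,); library: (bands, materials)."""
    abundances, residual = nnls(library, observed)      # non-negative least squares
    order = np.argsort(abundances)[::-1]
    return [(names[i], abundances[i]) for i in order], residual

bands = 100
lib = np.abs(np.random.default_rng(1).normal(size=(bands, 3)))  # placeholder spectra
spectrum = 0.7 * lib[:, 0] + 0.3 * lib[:, 2]                    # synthetic two-material mix
ranked, res = match_materials(spectrum, lib, ["solar panel", "mylar", "aluminum"])
```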

Probabilistic MIMO U-Net: Efficient and Accurate Uncertainty Estimation for Pixel-wise Regression

  • paper_url: http://arxiv.org/abs/2308.07477
  • repo_url: https://github.com/antonbaumann/mimo-unet
  • paper_authors: Anton Baumann, Thomas Roßberg, Michael Schmitt
  • for: Improving the reliability and interpretability of machine learning models through uncertainty estimation, especially in high-stakes real-world applications.
  • methods: An adaptation of the Multiple-Input Multiple-Output (MIMO) framework, which exploits the overparameterization of deep neural networks, to pixel-wise regression; the U-Net architecture is adapted to train multiple subnetworks within a single model, and a novel procedure synchronizes subnetwork performance within the MIMO framework.
  • results: Comprehensive evaluations on two orthogonal datasets show accuracy comparable to existing models, superior calibration on in-distribution data, robust out-of-distribution detection, and considerable improvements in parameter size and inference time. Code available at github.com/antonbaumann/MIMO-Unet.
    Abstract Uncertainty estimation in machine learning is paramount for enhancing the reliability and interpretability of predictive models, especially in high-stakes real-world scenarios. Despite the availability of numerous methods, they often pose a trade-off between the quality of uncertainty estimation and computational efficiency. Addressing this challenge, we present an adaptation of the Multiple-Input Multiple-Output (MIMO) framework -- an approach exploiting the overparameterization of deep neural networks -- for pixel-wise regression tasks. Our MIMO variant expands the applicability of the approach from simple image classification to broader computer vision domains. For that purpose, we adapted the U-Net architecture to train multiple subnetworks within a single model, harnessing the overparameterization in deep neural networks. Additionally, we introduce a novel procedure for synchronizing subnetwork performance within the MIMO framework. Our comprehensive evaluations of the resulting MIMO U-Net on two orthogonal datasets demonstrate comparable accuracy to existing models, superior calibration on in-distribution data, robust out-of-distribution detection capabilities, and considerable improvements in parameter size and inference time. Code available at github.com/antonbaumann/MIMO-Unet
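A minimal sketch of the MIMO principle on a toy backbone (a stand-in for the paper's U-Net): one network consumes M stacked inputs and emits M outputs, and at test time repeating the same image across the M slots yields an ensemble-like spread:

```python
# Toy MIMO network for pixel-wise regression; the backbone is illustrative.
import torch
import torch.nn as nn

class TinyMIMO(nn.Module):
    def __init__(self, m=3, in_ch=1):
        super().__init__()
        self.m = m
        self.net = nn.Sequential(
            nn.Conv2d(m * in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, m, 3, padding=1),        # one regression map per subnetwork
        )

    def forward(self, xs):  # xs: (B, M, C, H, W); independent images during training
        b, m, c, h, w = xs.shape
        return self.net(xs.reshape(b, m * c, h, w))   # (B, M, H, W)

model = TinyMIMO()
x = torch.randn(4, 1, 1, 64, 64).expand(-1, 3, -1, -1, -1)  # test time: repeat the image
preds = model(x)
mean, epistemic = preds.mean(dim=1), preds.var(dim=1)       # spread across subnetworks
```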

Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints

  • paper_url: http://arxiv.org/abs/2308.07468
  • repo_url: None
  • paper_authors: Cole Hill, Mauricio Pamplona Segundo, Sudeep Sarkar
  • for: 3D gait recognition with reduced training-data demands, using a Linear Dynamical Systems (LDS) module and loss to enforce temporal consistency and dynamically plausible motion.
  • methods: Deep neural networks fit a 3D deformable body model to gait videos to obtain disentangled shape and pose representations per frame; an LDS module and a loss based on Koopman operator theory provide unsupervised motion regularization for the periodic nature of gait, as well as a predictive capacity for extending gait sequences.
  • results: Comparisons on the USF HumanID and CASIA-B datasets show that the LDS approach achieves better accuracy with less training data than adversarial training, and the 3D modeling approach outperforms other 3D gait methods under viewpoint variation, bag-carrying, and clothing changes.
    Abstract Deep learning research has made many biometric recognition solution viable, but it requires vast training data to achieve real-world generalization. Unlike other biometric traits, such as face and ear, gait samples cannot be easily crawled from the web to form massive unconstrained datasets. As the human body has been extensively studied for different digital applications, one can rely on prior shape knowledge to overcome data scarcity. This work follows the recent trend of fitting a 3D deformable body model into gait videos using deep neural networks to obtain disentangled shape and pose representations for each frame. To enforce temporal consistency in the network, we introduce a new Linear Dynamical Systems (LDS) module and loss based on Koopman operator theory, which provides an unsupervised motion regularization for the periodic nature of gait, as well as a predictive capacity for extending gait sequences. We compare LDS to the traditional adversarial training approach and use the USF HumanID and CASIA-B datasets to show that LDS can obtain better accuracy with less training data. Finally, we also show that our 3D modeling approach is much better than other 3D gait approaches in overcoming viewpoint variation under normal, bag-carrying and clothing change conditions.
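A minimal sketch of a Koopman-style linear-dynamics constraint, assuming latent pose codes from an encoder: a single linear operator is trained to advance codes one frame, giving both a regularizing loss and a rollout for extending sequences (dimensions are illustrative):

```python
# Koopman-style linear dynamics on latent codes: consistency loss plus rollout.
import torch
import torch.nn as nn

class KoopmanLDS(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # learned Koopman operator

    def consistency_loss(self, z):        # z: (B, T, D) latent codes from the encoder
        pred_next = self.K(z[:, :-1])     # K z_t should match z_{t+1}
        return nn.functional.mse_loss(pred_next, z[:, 1:])

    def rollout(self, z0, steps):         # extend a gait sequence from one latent state
        out, z = [], z0
        for _ in range(steps):
            z = self.K(z)
            out.append(z)
        return torch.stack(out, dim=1)

lds = KoopmanLDS()
loss = lds.consistency_loss(torch.randn(2, 30, 32))
future = lds.rollout(torch.randn(2, 32), steps=10)   # (2, 10, 32) predicted codes
```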

There Is a Digital Art History

  • paper_url: http://arxiv.org/abs/2308.07464
  • repo_url: https://github.com/Gracetyty/art-gallery
  • paper_authors: Leonardo Impett, Fabian Offert
  • for: Revisiting Johanna Drucker's decade-old question, "Is there a digital art history?", in light of the emergence of large-scale, transformer-based vision models.
  • methods: Two main lines of analysis: an examination of the visual-cultural repertoire newly encoded in large-scale vision models, and two technical case studies that use a contemporary large-scale vision model to investigate basic questions from art history and urbanism.
  • results: The analysis suggests a coming paradigm shift toward a "digital" art history in Drucker's sense: large-scale vision models can extract and automate different forms of visual logic and continuously solidify the Western visual canon through their widespread application in digital life; at the same time, such systems demand a new critical methodology that accounts for the epistemic entanglement of a model and its applications.
    Abstract In this paper, we revisit Johanna Drucker's question, "Is there a digital art history?" -- posed exactly a decade ago -- in the light of the emergence of large-scale, transformer-based vision models. While more traditional types of neural networks have long been part of digital art history, and digital humanities projects have recently begun to use transformer models, their epistemic implications and methodological affordances have not yet been systematically analyzed. We focus our analysis on two main aspects that, together, seem to suggest a coming paradigm shift towards a "digital" art history in Drucker's sense. On the one hand, the visual-cultural repertoire newly encoded in large-scale vision models has an outsized effect on digital art history. The inclusion of significant numbers of non-photographic images allows for the extraction and automation of different forms of visual logics. Large-scale vision models have "seen" large parts of the Western visual canon mediated by Net visual culture, and they continuously solidify and concretize this canon through their already widespread application in all aspects of digital life. On the other hand, based on two technical case studies of utilizing a contemporary large-scale visual model to investigate basic questions from the fields of art history and urbanism, we suggest that such systems require a new critical methodology that takes into account the epistemic entanglement of a model and its applications. This new methodology reads its corpora through a neural model's training data, and vice versa: the visual ideologies of research datasets and training datasets become entangled.

U-Turn Diffusion

  • paper_url: http://arxiv.org/abs/2308.07421
  • repo_url: None
  • paper_authors: Hamidreza Behjoo, Michael Chertkov
  • for: Examining score-based diffusion models for generating synthetic images; these models hinge on a dynamic auxiliary time mechanism driven by stochastic differential equations, with the score function acquired from input images.
  • methods: An efficiency criterion (the power of the generative process depends on the ability to de-construct fast correlations during the reverse/de-noising phase) and a technique called "U-Turn Diffusion", which runs a condensed forward diffusion process and then standard reverse dynamics initialized with the forward process's final configuration, producing a synthetic image approximating an i.i.d. sample from the distribution implicitly described by the input samples.
  • results: Analyses using auto-correlation, weighted norms of the score function, and a Kolmogorov-Smirnov Gaussianity test show that the Kernel Intersection Distance, a metric comparing synthetic samples with real data samples, is minimized at the optimal U-turn time.
    Abstract We present a comprehensive examination of score-based diffusion models of AI for generating synthetic images. These models hinge upon a dynamic auxiliary time mechanism driven by stochastic differential equations, wherein the score function is acquired from input images. Our investigation unveils a criterion for evaluating efficiency of the score-based diffusion models: the power of the generative process depends on the ability to de-construct fast correlations during the reverse/de-noising phase. To improve the quality of the produced synthetic images, we introduce an approach coined "U-Turn Diffusion". The U-Turn Diffusion technique starts with the standard forward diffusion process, albeit with a condensed duration compared to conventional settings. Subsequently, we execute the standard reverse dynamics, initialized with the concluding configuration from the forward process. This U-Turn Diffusion procedure, combining forward, U-turn, and reverse processes, creates a synthetic image approximating an independent and identically distributed (i.i.d.) sample from the probability distribution implicitly described via input samples. To analyze relevant time scales we employ various analytical tools, including auto-correlation analysis, weighted norm of the score-function analysis, and Kolmogorov-Smirnov Gaussianity test. The tools guide us to establishing that the Kernel Intersection Distance, a metric comparing the quality of synthetic samples with real data samples, is minimized at the optimal U-turn time.
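A toy 1-D illustration of the U-turn schedule, using a known Gaussian data law so the score function is analytic (a real model would use a learned score network; the step size and U-turn time below are arbitrary):

```python
# Toy U-turn diffusion in 1-D: condensed forward noising, then reverse denoising.
import numpy as np

rng = np.random.default_rng(0)
t_u, dt = 400, 1e-3                     # U-turn after t_u forward steps
x = rng.normal(2.0, 0.5, size=5000)     # samples from the "data" law N(2, 0.5^2)

# Forward VP-style noising for t_u steps: dx = -0.5 x dt + sqrt(dt) dW.
for _ in range(t_u):
    x = x - 0.5 * x * dt + np.sqrt(dt) * rng.normal(size=x.shape)

# For Gaussian data the time-t marginal stays Gaussian, so the score is closed-form.
def score(x, t):
    m = 2.0 * np.exp(-0.5 * t)                     # marginal mean
    v = 0.25 * np.exp(-t) + (1.0 - np.exp(-t))     # marginal variance
    return -(x - m) / v

# Reverse dynamics from the U-turn point back to t = 0.
for k in range(t_u, 0, -1):
    t = k * dt
    x = x + (0.5 * x + score(x, t)) * dt + np.sqrt(dt) * rng.normal(size=x.shape)

print(x.mean(), x.std())                # approximately back to N(2, 0.5)
```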

Semantify: Simplifying the Control of 3D Morphable Models using CLIP

  • paper_url: http://arxiv.org/abs/2308.07415
  • repo_url: https://github.com/Omergral/Semantify
  • paper_authors: Omer Gralnik, Guy Gafni, Ariel Shamir
  • for: Simplifying the control of 3D morphable models via self-supervised learning that exploits the semantic power of the CLIP language-vision foundation model.
  • methods: Semantify creates training data by randomly sampling a parametric model's coefficients, rendering the resulting shapes, and scoring them in CLIP's latent space against a small set of semantically meaningful and disentangled word descriptors; a neural network then learns a non-linear mapping from these scores to the parametric coefficients of the given 3D morphable model, without a human in the loop.
  • results: Results on numerous 3D morphable models, including body shape, face shape and expression, and animal shape models, show that the method defines a simple slider interface for intuitive modeling and can instantly fit a 3D parametric body shape to in-the-wild images.
    Abstract We present Semantify: a self-supervised method that utilizes the semantic power of CLIP language-vision foundation model to simplify the control of 3D morphable models. Given a parametric model, training data is created by randomly sampling the model's parameters, creating various shapes and rendering them. The similarity between the output images and a set of word descriptors is calculated in CLIP's latent space. Our key idea is first to choose a small set of semantically meaningful and disentangled descriptors that characterize the 3DMM, and then learn a non-linear mapping from scores across this set to the parametric coefficients of the given 3DMM. The non-linear mapping is defined by training a neural network without a human-in-the-loop. We present results on numerous 3DMMs: body shape models, face shape and expression models, as well as animal shapes. We demonstrate how our method defines a simple slider interface for intuitive modeling, and show how the mapping can be used to instantly fit a 3D parametric body shape to in-the-wild images.
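A minimal sketch of the pipeline's learning step, with `clip_scores` as a placeholder for the real CLIP similarity computation and a random tensor standing in for the renderer; the descriptor set and dimensions are assumptions:

```python
# Toy descriptor-score-to-coefficients mapping; CLIP and the renderer are stubbed out.
import torch
import torch.nn as nn

descriptors = ["muscular", "heavy", "tall", "broad shoulders"]  # illustrative set

def clip_scores(rendered_images):  # placeholder: (B, 3, H, W) -> (B, len(descriptors))
    return torch.rand(rendered_images.shape[0], len(descriptors))

mapper = nn.Sequential(            # non-linear mapping: descriptor scores -> 3DMM coefficients
    nn.Linear(len(descriptors), 64), nn.ReLU(),
    nn.Linear(64, 10),             # e.g., 10 shape coefficients
)

# Self-supervised training pair: random coefficients -> render -> CLIP scores.
true_coeffs = torch.randn(16, 10)
images = torch.rand(16, 3, 224, 224)          # stand-in for renderer(true_coeffs)
loss = nn.functional.mse_loss(mapper(clip_scores(images)), true_coeffs)
loss.backward()
```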

A Unified Query-based Paradigm for Camouflaged Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.07392
  • repo_url: https://github.com/dongbo811/uqformer
  • paper_authors: Do Dong, Jialun Pei, Rongrong Gao, Tian-Zhu Xiang, Shuo Wang, Huan Xiong
  • for: Improving the accuracy of camouflaged instance segmentation, where instances are highly similar to the background.
  • methods: A unified query-based multi-task learning framework, UQFormer, that builds mask queries and boundary queries, learns a shared composed query representation via cross-attention in a multi-scale unified learning transformer decoder, and integrates global camouflaged object region and boundary cues for simultaneous instance segmentation and instance boundary detection.
  • results: Compared with 14 state-of-the-art approaches, UQFormer significantly improves camouflaged instance segmentation performance.
    Abstract Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximal suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at https://github.com/dongbo811/UQFormer.

DISBELIEVE: Distance Between Client Models is Very Essential for Effective Local Model Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.07387
  • repo_url: None
  • paper_authors: Indu Joshi, Priyank Upadhya, Gaurav Kumar Nayak, Peter Schüffler, Nassir Navab
  • for: This paper focuses on privacy and security issues of federated learning in the medical image analysis domain and proposes a local model poisoning attack, DISBELIEVE, that defeats distance-based robust aggregation methods.
  • methods: The DISBELIEVE attack crafts malicious parameters or gradients whose distance to the benign clients' parameters or gradients is low, yet whose adverse effect on the global model's performance is high.
  • results: The attack significantly lowers the performance of state-of-the-art robust aggregation methods for medical image analysis on three publicly available datasets, and is also effective on natural images, causing a severe drop in multi-class classification performance of the global model on the benchmark dataset CIFAR-10.
    Abstract Federated learning is a promising direction to tackle the privacy issues related to sharing patients' sensitive data. Often, federated systems in the medical image analysis domain assume that the participating local clients are \textit{honest}. Several studies report mechanisms through which a set of malicious clients can be introduced that can poison the federated setup, hampering the performance of the global model. To overcome this, robust aggregation methods have been proposed that defend against those attacks. We observe that most of the state-of-the-art robust aggregation methods are heavily dependent on the distance between the parameters or gradients of malicious clients and benign clients, which makes them prone to local model poisoning attacks when the parameters or gradients of malicious and benign clients are close. Leveraging this, we introduce DISBELIEVE, a local model poisoning attack that creates malicious parameters or gradients such that their distance to benign clients' parameters or gradients is low respectively but at the same time their adverse effect on the global model's performance is high. Experiments on three publicly available medical image datasets demonstrate the efficacy of the proposed DISBELIEVE attack as it significantly lowers the performance of the state-of-the-art \textit{robust aggregation} methods for medical image analysis. Furthermore, compared to state-of-the-art local model poisoning attacks, DISBELIEVE attack is also effective on natural images where we observe a severe drop in classification performance of the global model for multi-class classification on benchmark dataset CIFAR-10.
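A minimal sketch of the attack intuition (the scaling rule below is an illustrative choice, not the paper's exact objective): the malicious update stays within the benign clients' spread, so distance-based defenses cannot flag it, while opposing the benign consensus:

```python
# Illustrative crafting of a distance-bounded malicious update.
import numpy as np

def disbelieve_style_update(benign_updates):
    """benign_updates: (n_clients, n_params) flattened parameter updates."""
    mu = benign_updates.mean(axis=0)
    # Largest benign distance from the mean: the budget the attacker may spend.
    budget = np.linalg.norm(benign_updates - mu, axis=1).max()
    direction = -mu / (np.linalg.norm(mu) + 1e-12)   # oppose the benign consensus
    return mu + budget * direction                    # close to benign, adverse in effect

rng = np.random.default_rng(0)
benign = rng.normal(0.1, 0.02, size=(9, 1000))
malicious = disbelieve_style_update(benign)           # passes naive distance checks
```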

The Devil in the Details: Simple and Effective Optical Flow Synthetic Data Generation

  • paper_url: http://arxiv.org/abs/2308.07378
  • repo_url: None
  • paper_authors: Kwon Byung-Ki, Kim Sung-Bin, Tae-Hyun Oh
  • for: Advancing dense optical flow, which is mostly trained in a supervised manner and therefore requires large amounts of labeled data, typically obtained via computer graphics.
  • methods: A simpler synthetic data generation method that achieves a certain level of realism with compositions of elementary operations; a systematic analysis of the simplest yet critical factors in 2D motion-based datasets; and a novel use of occlusion masks in supervised training, where suppressing gradients on occluded regions serves as a powerful initial state in the curriculum learning sense.
  • results: A RAFT network initially trained on the proposed dataset outperforms the original RAFT on the two most challenging online benchmarks, MPI Sintel and KITTI 2015.
    Abstract Recent work on dense optical flow has shown significant progress, primarily in a supervised learning manner requiring a large amount of labeled data. Due to the expensiveness of obtaining large scale real-world data, computer graphics are typically leveraged for constructing datasets. However, there is a common belief that synthetic-to-real domain gaps limit generalization to real scenes. In this paper, we show that the required characteristics in an optical flow dataset are rather simple and present a simpler synthetic data generation method that achieves a certain level of realism with compositions of elementary operations. With 2D motion-based datasets, we systematically analyze the simplest yet critical factors for generating synthetic datasets. Furthermore, we propose a novel method of utilizing occlusion masks in a supervised method and observe that suppressing gradients on occluded regions serves as a powerful initial state in the curriculum learning sense. The RAFT network initially trained on our dataset outperforms the original RAFT on the two most challenging online benchmarks, MPI Sintel and KITTI 2015.
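A minimal sketch of the occlusion-mask trick, assuming a dense occlusion mask is available with the synthetic ground truth: the flow loss is simply masked so occluded pixels contribute no gradient:

```python
# Occlusion-masked L1 flow loss: occluded pixels are excluded from supervision.
import torch

def masked_flow_loss(pred_flow, gt_flow, occlusion_mask):
    """pred/gt: (B, 2, H, W); occlusion_mask: (B, 1, H, W), 1 where occluded."""
    per_pixel = (pred_flow - gt_flow).abs().sum(dim=1, keepdim=True)  # L1 endpoint error
    visible = 1.0 - occlusion_mask
    return (per_pixel * visible).sum() / visible.sum().clamp_min(1.0)

loss = masked_flow_loss(torch.randn(2, 2, 64, 64),
                        torch.randn(2, 2, 64, 64),
                        (torch.rand(2, 1, 64, 64) > 0.8).float())
```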

Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.07316
  • repo_url: https://github.com/alexmartin1722/Revive-2I
  • paper_authors: Alexander Martin, Haitian Zheng, Jie An, Jiebo Luo
  • for: Zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry must be generated to enter the target domain, with real-world applications in criminology, astrology, environmental conservation, and paleontology.
  • methods: Text-guided latent diffusion models; a new task, Skull2Animal, for translating between skulls and living animals; and a new benchmark model, Revive-2I, that performs zero-shot I2I via text-prompted latent diffusion.
  • results: Unguided GANs cannot translate across large domain gaps, whereas guided diffusion and image editing models can; guidance is necessary for longI2I because prior knowledge about the target domain is needed, and prompting provides the best and most scalable information about the target domain, since classifier-guided diffusion models require retraining for specific use cases and impose weaker constraints on the target domain.
    Abstract With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry need to be generated to enter the target domain. Being able to perform translations across large domain gaps has a wide variety of real-world applications in criminology, astrology, environmental conservation, and paleontology. In this work, we introduce a new task Skull2Animal for translating between skulls and living animals. On this task, we find that unguided Generative Adversarial Networks (GANs) are not capable of translating across large domain gaps. Instead of these traditional I2I methods, we explore the use of guided diffusion and image editing models and provide a new benchmark model, Revive-2I, capable of performing zero-shot I2I via text-prompting latent diffusion models. We find that guidance is necessary for longI2I because, to bridge the large domain gap, prior knowledge about the target domain is needed. In addition, we find that prompting provides the best and most scalable information about the target domain as classifier-guided diffusion models require retraining for specific use cases and lack stronger constraints on the target domain because of the wide variety of images they are trained on.

Dual Associated Encoder for Face Restoration

  • paper_url: http://arxiv.org/abs/2308.07314
  • repo_url: None
  • paper_authors: Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin C. K. Chan, Ming-Hsuan Yang
  • for: Restoring facial details from low-quality (LQ) images.
  • methods: A dual-branch framework, DAEFR, in which an auxiliary LQ branch extracts crucial information from LQ inputs alongside the high-quality (HQ) branch, with association training to promote effective synergy between the two branches and enhance code prediction and output quality.
  • results: Superior performance in restoring facial details on both synthetic and real-world datasets.
    Abstract Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild. The existing codebook prior mitigates the ill-posedness by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap between LQ and HQ images. As a result, the encoding of LQ inputs may be insufficient, resulting in suboptimal performance. To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details.

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.07313
  • repo_url: https://github.com/michel-liu/grouppose-paddle
  • paper_authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang
  • for: Studying end-to-end multi-person pose estimation, where state-of-the-art solutions adopt a DETR-like framework and mainly develop complex decoders.
  • methods: A simple yet effective transformer approach, Group Pose, that treats $K$-keypoint pose estimation as predicting $N\times K$ keypoint positions, each from a keypoint query, with one instance query per pose for scoring the $N$ pose predictions; the decoder replaces a single self-attention over all $N\times(K+1)$ queries with two group self-attentions, $N$ within-instance self-attentions and $(K+1)$ same-type across-instance self-attentions.
  • results: Without human box supervision, Group Pose outperforms previous methods with complex decoders on MS COCO and CrowdPose, and is even slightly better than ED-Pose, which uses human box supervision. Code is available in Paddle (https://github.com/Michel-liu/GroupPose-Paddle) and PyTorch (https://github.com/Michel-liu/GroupPose).
    Abstract In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring $N$ pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attention, with each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attention, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ and $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$ code are available.
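A minimal sketch of the two group self-attentions, ignoring the batch dimension and the rest of the decoder (FFNs, cross-attention to image features): queries of shape (N, K+1, D) are attended within each instance, then across instances among queries of the same type:

```python
# Group self-attention via reshaping: within-instance, then same-type across-instance.
import torch
import torch.nn as nn

N, K, D = 20, 17, 256                          # instances, keypoints, feature dim
within = nn.MultiheadAttention(D, 8, batch_first=True)
across = nn.MultiheadAttention(D, 8, batch_first=True)

q = torch.randn(N, K + 1, D)                   # K keypoint queries + 1 instance query per pose

# (i) N within-instance self-attentions over K keypoint queries and one instance query.
q = q + within(q, q, q)[0]                     # batch dimension = instances

# (ii) (K+1) same-type across-instance self-attentions over the N instances.
qt = q.transpose(0, 1)                         # (K+1, N, D): batch dimension = query types
q = (qt + across(qt, qt, qt)[0]).transpose(0, 1)
```

The reshape keeps type-different across-instance queries from interacting, which is exactly the restriction the paper credits for easier optimization.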

A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

  • paper_url: http://arxiv.org/abs/2308.07301
  • repo_url: None
  • paper_authors: Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee
  • for: Proposing a novel task-independent model for human motion synthesis, UNIMASK-M, that effectively handles challenges such as motion forecasting and filling in intermediate poses conditioned on known key-poses.
  • methods: Inspired by Vision Transformers (ViTs), the model decomposes a human pose into body parts to leverage the spatio-temporal relationships in human motion, and reformulates various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input; explicitly informing the model about masked joints makes it more robust to occlusions.
  • results: The model successfully forecasts human motion on the Human3.6M dataset and achieves state-of-the-art motion in-betweening results on the LaFAN1 dataset, particularly for long transition periods. More information: https://sites.google.com/view/estevevallsmascaro/publications/unimask-m.
    Abstract The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website https://sites.google.com/view/estevevallsmascaro/publications/unimask-m.

Accurate Eye Tracking from Dense 3D Surface Reconstructions using Single-Shot Deflectometry

  • paper_url: http://arxiv.org/abs/2308.07298
  • repo_url: None
  • paper_authors: Jiazhang Wang, Tianfu Wang, Bingjie Xu, Oliver Cossairt, Florian Willomitzer
  • for: Improving the accuracy and speed of eye tracking for virtual reality devices, neuroscience research, and psychology.
  • methods: A novel method based on single-shot phase-measuring deflectometry (PMD) that acquires dense 3D surface information of both cornea and sclera within a single camera frame, improving the number of acquired reflection surface points ("glints") by factors $>3300\times$.
  • results: Experimentally evaluated gaze errors of only $\leq 0.25^\circ$, a significant improvement over the current state of the art.
    Abstract Eye-tracking plays a crucial role in the development of virtual reality devices, neuroscience research, and psychology. Despite its significance in numerous applications, achieving an accurate, robust, and fast eye-tracking solution remains a considerable challenge for current state-of-the-art methods. While existing reflection-based techniques (e.g., "glint tracking") are considered the most accurate, their performance is limited by their reliance on sparse 3D surface data acquired solely from the cornea surface. In this paper, we rethink the way how specular reflections can be used for eye tracking: We propose a novel method for accurate and fast evaluation of the gaze direction that exploits teachings from single-shot phase-measuring-deflectometry (PMD). In contrast to state-of-the-art reflection-based methods, our method acquires dense 3D surface information of both cornea and sclera within only one single camera frame (single-shot). Improvements in acquired reflection surface points("glints") of factors $>3300 \times$ are easily achievable. We show the feasibility of our approach with experimentally evaluated gaze errors of only $\leq 0.25^\circ$ demonstrating a significant improvement over the current state-of-the-art.

A Robust Approach Towards Distinguishing Natural and Computer Generated Images using Multi-Colorspace fused and Enriched Vision Transformer

  • paper_url: http://arxiv.org/abs/2308.07279
  • repo_url: https://github.com/manjaryp/mce-vit
  • paper_authors: Manjary P Gangan, Anoop Kadan, Lajish V L
  • for: Robustly distinguishing natural images from computer-generated images, covering both computer graphics and GAN-generated images.
  • methods: A fusion of two vision transformers, one operating in the RGB color space and the other in the YCbCr color space, with the two streams combined for classification.
  • results: Higher accuracy than a set of baselines in detecting both computer graphics and GAN-generated images, together with higher robustness and generalizability on post-processed images (e.g., JPEG compression, Gaussian noise); visualized features show higher class separability, and attention maps indicate the model captures more image information relevant to the forensic task.
    Abstract The works in literature classifying natural and computer generated images are mostly designed as binary tasks either considering natural images versus computer graphics images only or natural images versus GAN generated images only, but not natural images versus both classes of the generated images. Also, even though this forensic classification task of distinguishing natural and computer generated images gets the support of the new convolutional neural networks and transformer based architectures that can give remarkable classification accuracies, they are seen to fail over the images that have undergone some post-processing operations usually performed to deceive the forensic algorithms, such as JPEG compression, gaussian noise, etc. This work proposes a robust approach towards distinguishing natural and computer generated images including both, computer graphics and GAN generated images using a fusion of two vision transformers where each of the transformer networks operates in different color spaces, one in RGB and the other in YCbCr color space. The proposed approach achieves high performance gain when compared to a set of baselines, and also achieves higher robustness and generalizability than the baselines. The features of the proposed model when visualized are seen to obtain higher separability for the classes than the input image features and the baseline features. This work also studies the attention map visualizations of the networks of the fused model and observes that the proposed methodology can capture more image information relevant to the forensic task of classifying natural and generated images.
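A minimal sketch of the dual-colorspace design with stand-in encoders (the toy pooling encoder below replaces the actual vision transformers; the BT.601 color conversion is standard):

```python
# Two-stream classifier: RGB features fused with YCbCr features.
import torch
import torch.nn as nn

def rgb_to_ycbcr(x):  # x: (B, 3, H, W) in [0, 1], ITU-R BT.601 coefficients
    r, g, b = x[:, 0], x[:, 1], x[:, 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return torch.stack([y, cb, cr], dim=1)

class DualColorspaceClassifier(nn.Module):
    def __init__(self, encoder_rgb, encoder_ycbcr, feat_dim=768):
        super().__init__()
        self.enc_rgb, self.enc_ycbcr = encoder_rgb, encoder_ycbcr
        self.head = nn.Linear(2 * feat_dim, 2)   # natural vs. generated

    def forward(self, x):
        f = torch.cat([self.enc_rgb(x), self.enc_ycbcr(rgb_to_ycbcr(x))], dim=-1)
        return self.head(f)

# Toy encoders standing in for the two vision transformers.
toy_encoder = lambda: nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 768))
model = DualColorspaceClassifier(toy_encoder(), toy_encoder())
logits = model(torch.rand(4, 3, 224, 224))
```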

Diving with Penguins: Detecting Penguins and their Prey in Animal-borne Underwater Videos via Deep Learning

  • paper_url: http://arxiv.org/abs/2308.07267
  • repo_url: None
  • paper_authors: Kejia Zhang, Mingyu Yang, Stephen D. J. Lang, Alistair M. McInnes, Richard B. Sherley, Tilo Burghardt
  • for: To provide a reliable underwater penguin detector and a fish detector, and to automatically recognize penguin predation behaviour
  • methods: Uses modern bio-logging technology (animal-borne video recorders) together with deep learning systems for detecting penguins and fish
  • results: Delivers a highly reliable underwater penguin detector and first results on automatic recognition of predation behaviour, though further work is needed before the technique is useful in field scenarios
    Abstract African penguins (Spheniscus demersus) are an endangered species. Little is known regarding their underwater hunting strategies and associated predation success rates, yet this is essential for guiding conservation. Modern bio-logging technology has the potential to provide valuable insights, but manually analysing large amounts of data from animal-borne video recorders (AVRs) is time-consuming. In this paper, we publish an animal-borne underwater video dataset of penguins and introduce a ready-to-deploy deep learning system capable of robustly detecting penguins (mAP50@98.0%) and also instances of fish (mAP50@73.3%). We note that the detectors benefit explicitly from air-bubble learning to improve accuracy. Extending this detector towards a dual-stream behaviour recognition network, we also provide the first results for identifying predation behaviour in penguin underwater videos. Whilst results are promising, further work is required for useful applicability of predation behaviour detection in field scenarios. In summary, we provide a highly reliable underwater penguin detector, a fish detector, and a valuable first attempt towards an automated visual detection of complex behaviours in a marine predator. We publish the networks, the DivingWithPenguins video dataset, annotations, splits, and weights for full reproducibility and immediate usability by practitioners.

Efficient Real-time Smoke Filtration with 3D LiDAR for Search and Rescue with Autonomous Heterogeneous Robotic Systems

  • paper_url: http://arxiv.org/abs/2308.07264
  • repo_url: None
  • paper_authors: Alexander Kyuroson, Anton Koval, George Nikolakopoulos
  • for: To improve the autonomous navigation and localization accuracy of robots in smoke-filled subterranean environments
  • methods: Proposes a modular, agnostic filtration pipeline that uses intensity and spatial information (such as local point density) to remove smoke particles from point clouds before collision detection
  • results: Evaluates the framework across multiple frontier-exploration missions in the presence of smoke, comparing its computational impact against other methods and offering insights for safe autonomous navigation
    Abstract Search and Rescue (SAR) missions in harsh and unstructured Sub-Terranean (Sub-T) environments in the presence of aerosol particles have recently become the main focus in the field of robotics. Aerosol particles such as smoke and dust directly affect the performance of any mobile robotic platform due to their reliance on their onboard perception systems for autonomous navigation and localization in Global Navigation Satellite System (GNSS)-denied environments. Although obstacle avoidance and object detection algorithms are robust to the presence of noise to some degree, their performance directly relies on the quality of captured data by onboard sensors such as Light Detection And Ranging (LiDAR) and camera. Thus, this paper proposes a novel modular agnostic filtration pipeline based on intensity and spatial information such as local point density for removal of detected smoke particles from Point Cloud (PCL) prior to its utilization for collision detection. Furthermore, the efficacy of the proposed framework in the presence of smoke during multiple frontier exploration missions is investigated while the experimental results are presented to facilitate comparison with other methodologies and their computational impact. This provides valuable insight to the research community for better utilization of filtration schemes based on available computation resources while considering the safe autonomous navigation of mobile robots.
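A minimal sketch of intensity- plus density-based point-cloud filtering in the spirit of the pipeline described above (the thresholds, radius, and the rule that a point must fail both tests to be dropped are made-up values, not the paper's parameters):

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_smoke(points, intensity, i_min=10.0, radius=0.3, k_min=5):
    """Keep LiDAR points that are either bright enough or locally dense enough.

    points:    (N, 3) array of LiDAR returns; intensity: (N,) per-point values.
    Smoke returns tend to have low intensity and low local point density,
    so points failing both tests are discarded.
    """
    tree = cKDTree(points)
    neighbour_counts = tree.query_ball_point(points, r=radius, return_length=True)
    keep = (intensity >= i_min) | (np.asarray(neighbour_counts) >= k_min)
    return points[keep], intensity[keep]

# Toy usage: a dense, bright wall plus sparse, dim "smoke" returns
rng = np.random.default_rng(1)
wall = rng.normal([5.0, 0.0, 1.0], 0.05, size=(500, 3))
smoke = rng.uniform([1, -2, 0], [4, 2, 2], size=(100, 3))
pts = np.vstack([wall, smoke])
inten = np.concatenate([rng.uniform(50, 200, 500), rng.uniform(0, 8, 100)])
clean_pts, clean_inten = filter_smoke(pts, inten)
```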

Large-kernel Attention for Efficient and Robust Brain Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2308.07251
  • repo_url: https://github.com/liamchalcroft/mdunet
  • paper_authors: Liam Chalcroft, Ruben Lourenço Pereira, Mikael Brudfors, Andrew S. Kayser, Mark D’Esposito, Cathy J. Price, Ioannis Pappas, John Ashburner
  • for: Proposes a U-Net architecture built from transformer-style blocks for 3D brain lesion segmentation
  • methods: Uses an all-convolutional variant of the transformer block, mixing convolutional and transformer-like components to model long-range interactions
  • results: Shows the model offers the best compromise among three factors: performance competitive with the state of the art, the parameter efficiency of a CNN, and the favourable inductive biases of a transformer
    Abstract Vision transformers are effective deep learning models for vision tasks, including medical image segmentation. However, they lack efficiency and translational invariance, unlike convolutional neural networks (CNNs). To model long-range interactions in 3D brain lesion segmentation, we propose an all-convolutional transformer block variant of the U-Net architecture. We demonstrate that our model provides the greatest compromise in three factors: performance competitive with the state-of-the-art; parameter efficiency of a CNN; and the favourable inductive biases of a transformer. Our public implementation is available at https://github.com/liamchalcroft/MDUNet .
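The abstract does not give the exact block design; below is one plausible realization of a large-kernel attention block, following the decomposed depthwise/dilated/pointwise pattern popularized by Visual Attention Networks and extended here to 3D volumes. The kernel sizes and gating rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LargeKernelAttention3D(nn.Module):
    """Large-kernel attention in the style of Visual Attention Networks, in 3D.

    A large receptive field is decomposed into a depthwise conv, a depthwise
    dilated conv, and a pointwise conv; the result gates the input features,
    giving attention-like long-range interactions with purely convolutional,
    translation-equivariant operations.
    """
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv3d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv3d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        self.pw = nn.Conv3d(channels, channels, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # multiplicative gating, as in LKA

block = LargeKernelAttention3D(16)
out = block(torch.rand(1, 16, 32, 32, 32))  # e.g., a 3D brain MRI feature map
assert out.shape == (1, 16, 32, 32, 32)
```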

AAFACE: Attribute-aware Attentional Network for Face Recognition

  • paper_url: http://arxiv.org/abs/2308.07243
  • repo_url: None
  • paper_authors: Niloufar Alipour Talemi, Hossein Kashiani, Sahar Rahimi Malakshan, Mohammad Saeed Ebrahimi Saadabadi, Nima Najafzadeh, Mohammad Akyash, Nasser M. Nasrabadi
  • for: Proposes a new multi-branch neural network that simultaneously performs soft biometric (SB) prediction as an auxiliary task and face recognition (FR) as the main task
  • methods: Uses SB attributes to enhance the discriminative ability of the FR representation via an attribute-aware attentional integration (AAI) module that performs weighted integration of FR and SB feature maps; the AAI module is fully context-aware and learns complex relationships between input features through sequential multi-scale channel and spatial sub-modules
  • results: The proposed network outperforms state-of-the-art SB prediction and FR methods
    Abstract In this paper, we present a new multi-branch neural network that simultaneously performs soft biometric (SB) prediction as an auxiliary modality and face recognition (FR) as the main task. Our proposed network named AAFace utilizes SB attributes to enhance the discriminative ability of FR representation. To achieve this goal, we propose an attribute-aware attentional integration (AAI) module to perform weighted integration of FR with SB feature maps. Our proposed AAI module is not only fully context-aware but also capable of learning complex relationships between input features by means of the sequential multi-scale channel and spatial sub-modules. Experimental results verify the superiority of our proposed network compared with the state-of-the-art (SoTA) SB prediction and FR methods.
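A hypothetical sketch of what a weighted FR/SB feature integration with channel and spatial sub-modules could look like (module names, gate designs, and sizes are illustrative assumptions, not the paper's AAI definition):

```python
import torch
import torch.nn as nn

class AttributeAwareIntegration(nn.Module):
    """Sketch: channel and spatial gates, computed from the concatenated FR and
    SB feature maps, weight the soft-biometric features before they are
    projected and added to the face-recognition features."""
    def __init__(self, c_fr, c_sb):
        super().__init__()
        c = c_fr + c_sb
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c_sb, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(c, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(c_sb, c_fr, 1)

    def forward(self, f_fr, f_sb):
        cat = torch.cat([f_fr, f_sb], dim=1)
        gated = f_sb * self.channel_gate(cat) * self.spatial_gate(cat)
        return f_fr + self.proj(gated)

aai = AttributeAwareIntegration(c_fr=64, c_sb=32)
fused = aai(torch.rand(2, 64, 14, 14), torch.rand(2, 32, 14, 14))
assert fused.shape == (2, 64, 14, 14)
```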

UniWorld: Autonomous Driving Pre-training via World Models

  • paper_url: http://arxiv.org/abs/2308.07234
  • repo_url: https://github.com/chaytonmin/uniworld
  • paper_authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai
  • for: This paper is written for those interested in developing world models for robots, specifically for autonomous driving.
  • methods: The paper proposes a unified pre-training framework called UniWorld, which uses a spatial-temporal world model to perceive the surroundings and predict the future behavior of other participants. The framework is based on Alberto Elfes’ pioneering work in 1989 and uses a label-free pre-training process to build a foundational model.
  • results: The proposed method demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. Compared to monocular pre-training methods on the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. Additionally, the method achieves a 25% reduction in 3D training annotation costs, offering significant practical value for real-world autonomous driving.
    Abstract In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as World Models for robots. We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants. UniWorld involves initially predicting 4D geometric occupancy as the World Model for the foundational stage and subsequently fine-tuning on downstream tasks. UniWorld can estimate missing information concerning the world state and predict plausible future states of the world. Besides, UniWorld's pre-training process is label-free, enabling the utilization of massive amounts of image-LiDAR pairs to build a Foundational Model. The proposed unified pre-training framework demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. When compared to monocular pre-training methods on the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. By adopting our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for the implementation of real-world autonomous driving. Codes are publicly available at https://github.com/chaytonmin/UniWorld.

RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs

  • paper_url: http://arxiv.org/abs/2308.07228
  • repo_url: None
  • paper_authors: Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, Ping Luo
  • for: Aims to restore high-quality face images from inputs with unknown degradations
  • methods: Proposes RestoreFormer++, a blind face restoration algorithm that models contextual information in the face with multi-head cross-attention mechanisms and adds an extending degrading model to generate more realistic degraded face images, narrowing the synthetic-to-real-world gap
  • results: Compared with current algorithms, RestoreFormer++ offers several advantages, including higher realness and fidelity as well as better robustness and generalization
    Abstract Blind face restoration aims at recovering high-quality face images from those with unknown degradations. Current algorithms mainly introduce priors to complement high-quality details and achieve impressive progress. However, most of these algorithms ignore abundant contextual information in the face and its interplay with the priors, leading to sub-optimal performance. Moreover, they pay less attention to the gap between the synthetic and real-world scenarios, limiting the robustness and generalization to real-world applications. In this work, we propose RestoreFormer++, which on the one hand introduces fully-spatial attention mechanisms to model the contextual information and the interplay with the priors, and on the other hand, explores an extending degrading model to help generate more realistic degraded face images to alleviate the synthetic-to-real-world gap. Compared with current algorithms, RestoreFormer++ has several crucial benefits. First, instead of using a multi-head self-attention mechanism like the traditional visual transformer, we introduce multi-head cross-attention over multi-scale features to fully explore spatial interactions between corrupted information and high-quality priors. In this way, it can facilitate RestoreFormer++ to restore face images with higher realness and fidelity. Second, in contrast to the recognition-oriented dictionary, we learn a reconstruction-oriented dictionary as priors, which contains more diverse high-quality facial details and better accords with the restoration target. Third, we introduce an extending degrading model that contains more realistic degraded scenarios for training data synthesizing, and thus helps to enhance the robustness and generalization of our RestoreFormer++ model. Extensive experiments show that RestoreFormer++ outperforms state-of-the-art algorithms on both synthetic and real-world datasets.
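As an illustration of the core mechanism described here, a cross-attention layer in which degraded-face tokens query a learned reconstruction-oriented dictionary might be sketched as follows (the dimensions, dictionary size, and residual design are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DictionaryCrossAttention(nn.Module):
    """Sketch: degraded-image tokens act as queries against a learned
    high-quality facial dictionary (keys/values) via multi-head
    cross-attention, so corrupted information interacts spatially with
    the priors instead of only with itself."""
    def __init__(self, dim=256, dict_size=512, heads=8):
        super().__init__()
        # Reconstruction-oriented priors, learned end to end
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):            # feats: (B, H*W, dim) flattened tokens
        kv = self.dictionary.unsqueeze(0).expand(feats.size(0), -1, -1)
        restored, _ = self.attn(query=feats, key=kv, value=kv)
        return self.norm(feats + restored)   # residual: corrupted info + priors

layer = DictionaryCrossAttention()
tokens = torch.rand(2, 16 * 16, 256)     # e.g., a 16x16 feature map per image
out = layer(tokens)
```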

cs.AI - 2023-08-15

REFORMS: Reporting Standards for Machine Learning Based Science

  • paper_url: http://arxiv.org/abs/2308.07832
  • repo_url: None
  • paper_authors: Sayash Kapoor, Emily Cantrell, Kenny Peng, Thanh Hien Pham, Christopher A. Bail, Odd Erik Gundersen, Jake M. Hofman, Jessica Hullman, Michael A. Lones, Momin M. Malik, Priyanka Nanayakkara, Russell A. Poldrack, Inioluwa Deborah Raji, Michael Roberts, Matthew J. Salganik, Marta Serra-Garcia, Brandon M. Stewart, Gilles Vandewiele, Arvind Narayanan
  • for: To provide clear reporting standards for machine learning (ML) based science
  • methods: Presents the REFORMS checklist (Reporting Standards For Machine Learning Based Science), consisting of 32 questions and a paired set of guidelines, developed from a consensus of 19 researchers across computer science, data science, mathematics, the social sciences, and the biomedical sciences
  • results: Provides a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility
    Abstract Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.

Tightest Admissible Shortest Path

  • paper_url: http://arxiv.org/abs/2308.08453
  • repo_url: None
  • paper_authors: Eyal Weiss, Ariel Felner, Gal A. Kaminka
  • for: To solve the shortest-path problem in weighted directed graphs while accounting for edge-weight computation time and its common relation to weight uncertainty
  • methods: Builds on a generalized framework in which edge weights can be estimated multiple times at increasing accuracy and run-time expense, and introduces the tightest admissible shortest path (TASP) problem, which trades edge-weight uncertainty for computational cost under quality guarantees
  • results: Presents a complete algorithm with guarantees on solution quality; empirical evaluation supports the effectiveness of the approach
    Abstract The shortest path problem in graphs is fundamental to AI. Nearly all variants of the problem and relevant algorithms that solve them ignore edge-weight computation time and its common relation to weight uncertainty. This implies that taking these factors into consideration can potentially lead to a performance boost in relevant applications. Recently, a generalized framework for weighted directed graphs was suggested, where edge-weight can be computed (estimated) multiple times, at increasing accuracy and run-time expense. We build on this framework to introduce the problem of finding the tightest admissible shortest path (TASP); a path with the tightest suboptimality bound on the optimal cost. This is a generalization of the shortest path problem to bounded uncertainty, where edge-weight uncertainty can be traded for computational cost. We present a complete algorithm for solving TASP, with guarantees on solution quality. Empirical evaluation supports the effectiveness of this approach.
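The paper's algorithm is not reproduced in the abstract; the following toy interpretation only illustrates the general idea of trading edge refinements for a tighter suboptimality bound: run Dijkstra on lower-bound weights, then refine the loosest edge of the optimistic path until the path's upper-bound cost is within eps of the lower bound on the optimum.

```python
import heapq

def dijkstra_lower(edges, bounds, src, n):
    """Shortest paths using the current lower bounds on edge weights."""
    dist, prev = [float("inf")] * n, [None] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v in edges.get(u, ()):
            nd = d + bounds[(u, v)][0]
            if nd < dist[v]:
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def tasp_sketch(edges, bounds, refine, src, dst, n, eps=0.0):
    """Loop: take the optimistic path; accept it if its guaranteed
    (upper-bound) cost is within eps of the lower bound on the optimum,
    else refine the loosest edge on it, paying compute for accuracy."""
    while True:
        dist, prev = dijkstra_lower(edges, bounds, src, n)
        path, v = [], dst
        while v is not None:
            path.append(v)
            v = prev[v]
        path.reverse()
        path_edges = list(zip(path, path[1:]))
        upper = sum(bounds[e][1] for e in path_edges)
        if upper - dist[dst] <= eps:
            return path, dist[dst], upper
        loosest = max(path_edges, key=lambda e: bounds[e][1] - bounds[e][0])
        bounds[loosest] = refine(loosest)

# Toy graph: true weights hidden behind +/-1 uncertainty intervals.
true_w = {(0, 1): 2.0, (1, 3): 2.0, (0, 2): 1.0, (2, 3): 4.0}
bounds = {e: (w - 1.0, w + 1.0) for e, w in true_w.items()}
edges = {0: [1, 2], 1: [3], 2: [3]}
print(tasp_sketch(edges, bounds, lambda e: (true_w[e],) * 2, 0, 3, n=4))
```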

Learning to Identify Critical States for Reinforcement Learning from Videos

  • paper_url: http://arxiv.org/abs/2308.07795
  • repo_url: https://github.com/ai-initiative-kaust/videorlcs
  • paper_authors: Haozhe Liu, Mingchen Zhuge, Bing Li, Yuhui Wang, Francesco Faccio, Bernard Ghanem, Jürgen Schmidhuber
  • for: To extract useful policy information for deep reinforcement learning from video data that lacks explicit action annotations
  • methods: Trains a deep network to predict returns from episodes encoded as videos, then applies mask-based sensitivity analysis to extract and identify important critical states
  • results: Extensive experiments showcase the method's potential for understanding and improving agent behavior; the code and generated datasets are available on GitHub
    Abstract Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS.

Implementing Quantum Generative Adversarial Network (qGAN) and QCBM in Finance

  • paper_url: http://arxiv.org/abs/2308.08448
  • repo_url: None
  • paper_authors: Santanu Ganguly
  • for: Surveys emerging research areas in applying quantum machine learning (QML) to finance, including stock price prediction and asset risk management and valuation
  • methods: Uses real-world financial datasets and simulated environments to compare QML models such as qGAN (quantum generative adversarial networks) and QCBM (quantum circuit Born machine)
  • results: Defines quantum circuits for the qGAN discriminator and generator and shows promise of a future quantum advantage via QML in finance
    Abstract Quantum machine learning (QML) is a cross-disciplinary subject made up of two of the most exciting research areas: quantum computing and classical machine learning (ML), with ML and artificial intelligence (AI) being projected as the first fields that will be impacted by the rise of quantum machines. Quantum computers are being used today in drug discovery, material & molecular modelling and finance. In this work, we discuss emerging research areas in the application of QML in finance. We discuss certain QML models that have become areas of active interest in the financial world for various applications. We use real-world financial datasets and compare models such as qGAN (quantum generative adversarial networks) and QCBM (quantum circuit Born machine), among others, using simulated environments. For the qGAN, we define quantum circuits for discriminators and generators and show promise of a future quantum advantage via QML in finance.

Informed Named Entity Recognition Decoding for Generative Language Models

  • paper_url: http://arxiv.org/abs/2308.07791
  • repo_url: None
  • paper_authors: Tobias Deußer, Lars Hillebrand, Christian Bauckhage, Rafet Sifa
  • for: To improve the performance of named entity recognition (NER)
  • methods: Proposes Informed Named Entity Recognition Decoding (iNERD), a simple yet effective approach that treats NER as a generative process, leverages the language understanding capabilities of recent generative models in a future-proof manner, and employs an informed decoding scheme that incorporates the restricted nature of information extraction into open-ended text generation
  • results: Evaluates five generative language models on eight NER datasets with remarkable results, especially in settings with an unknown entity class set, demonstrating the adaptability of the approach
    Abstract Ever-larger language models with ever-increasing capabilities are by now well-established text processing tools. Alas, information extraction tasks such as named entity recognition are still largely unaffected by this progress as they are primarily based on the previous generation of encoder-only transformer models. Here, we propose a simple yet effective approach, Informed Named Entity Recognition Decoding (iNERD), which treats named entity recognition as a generative process. It leverages the language understanding capabilities of recent generative models in a future-proof manner and employs an informed decoding scheme incorporating the restricted nature of information extraction into open-ended text generation, improving performance and eliminating any risk of hallucinations. We coarse-tune our model on a merged named entity corpus to strengthen its performance, evaluate five generative language models on eight named entity recognition datasets, and achieve remarkable results, especially in an environment with an unknown entity class set, demonstrating the adaptability of the approach.

Do We Fully Understand Students’ Knowledge States? Identifying and Mitigating Answer Bias in Knowledge Tracing

  • paper_url: http://arxiv.org/abs/2308.07779
  • repo_url: https://github.com/lucky7-code/core
  • paper_authors: Chaoran Cui, Hebo Ma, Chen Zhang, Chunyun Zhang, Yumo Yao, Meng Chen, Yuling Ma
  • for: To address the answer bias problem in knowledge tracing (KT) so as to better understand students' knowledge states
  • methods: Approaches KT from a causality perspective and proposes a COunterfactual REasoning (CORE) framework that mitigates the impact of answer bias
  • results: Experiments show that CORE reduces the influence of answer bias in KT and can be combined with a variety of existing KT models
    Abstract Knowledge tracing (KT) aims to monitor students' evolving knowledge states through their learning interactions with concept-related questions, and can be indirectly evaluated by predicting how students will perform on future questions. In this paper, we observe that there is a common phenomenon of answer bias, i.e., a highly unbalanced distribution of correct and incorrect answers for each question. Existing models tend to memorize the answer bias as a shortcut for achieving high prediction performance in KT, thereby failing to fully understand students' knowledge states. To address this issue, we approach the KT task from a causality perspective. A causal graph of KT is first established, from which we identify that the impact of answer bias lies in the direct causal effect of questions on students' responses. A novel COunterfactual REasoning (CORE) framework for KT is further proposed, which separately captures the total causal effect and direct causal effect during training, and mitigates answer bias by subtracting the latter from the former in testing. The CORE framework is applicable to various existing KT models, and we implement it based on the prevailing DKT, DKVMN, and AKT models, respectively. Extensive experiments on three benchmark datasets demonstrate the effectiveness of CORE in making the debiased inference for KT.
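A minimal sketch of the test-time subtraction idea, assuming a generic KT model interface (the stub model and the history encodings are placeholders, not the paper's DKT/DKVMN/AKT implementations):

```python
import torch

def core_debiased_logits(kt_model, question, history, null_history):
    """CORE-style counterfactual debiasing for knowledge tracing (sketch).

    Total effect: prediction from the question plus the student's history.
    Direct effect: prediction from the question with the history blanked out,
    which captures the question's answer-bias shortcut alone. Subtracting the
    latter from the former at test time yields a debiased inference.
    """
    total_effect = kt_model(question, history)        # TE: question + knowledge
    direct_effect = kt_model(question, null_history)  # direct effect: answer bias
    return total_effect - direct_effect

# Toy stand-in model: logit = question bias + knowledge signal from history
kt_model = lambda q, h: q.sum(-1, keepdim=True) + h.mean(-1, keepdim=True)
q = torch.rand(8, 16)      # a batch of question embeddings
hist = torch.rand(8, 32)   # per-student interaction encodings
debiased = core_debiased_logits(kt_model, q, hist, torch.zeros_like(hist))
```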

Hierarchical generative modelling for autonomous robots

  • paper_url: http://arxiv.org/abs/2308.07775
  • repo_url: None
  • paper_authors: Kai Yuan, Noor Sajid, Karl Friston, Zhibin Li
  • for: To study how humans produce complex whole-body motions when interacting with their surroundings, so that autonomous robots can complete goal-directed tasks efficiently
  • methods: Uses hierarchical generative modelling with multi-level planning and low-level motor control, mimicking the deep temporal architecture of human motor control
  • results: Numerical and physical simulation experiments show that a humanoid robot using human-inspired motor control can autonomously complete complex tasks (retrieving and transporting a box, opening and walking through a door, approaching and kicking a football) while remaining robust to body damage and ground irregularities
    Abstract Humans can produce complex whole-body motions when interacting with their surroundings, by planning, executing and combining individual limb movements. We investigated this fundamental aspect of motor control in the setting of autonomous robotic operations. We approach this problem by hierarchical generative modelling equipped with multi-level planning for autonomous task completion, mimicking the deep temporal architecture of human motor control. Here, temporal depth refers to the nested time scales at which successive levels of a forward or generative model unfold; for example, delivering an object requires a global plan to contextualise the fast coordination of multiple local movements of limbs. This separation of temporal scales also motivates robotics and control. Specifically, to achieve versatile sensorimotor control, it is advantageous to hierarchically structure the planning and low-level motor control of individual limbs. We use numerical and physical simulation to conduct experiments and to establish the efficacy of this formulation. Using a hierarchical generative model, we show how a humanoid robot can autonomously complete a complex task that necessitates a holistic use of locomotion, manipulation, and grasping. Specifically, we demonstrate the ability of a humanoid robot that can retrieve and transport a box, open and walk through a door to reach the destination, approach and kick a football, while showing robust performance in presence of body damage and ground irregularities. Our findings demonstrated the effectiveness of using human-inspired motor control algorithms, and our method provides a viable hierarchical architecture for the autonomous completion of challenging goal-directed tasks.

A Graph Encoder-Decoder Network for Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.07774
  • repo_url: None
  • paper_authors: Mahsa Mesgaran, A. Ben Hamza
  • for: To detect abnormal nodes in graphs
  • methods: An unsupervised graph encoder-decoder model that learns an anomaly scoring function to rank nodes by their degree of abnormality; a locality-constrained linear coding pooling mechanism (LCPool) finds the cluster assignment matrix by solving a least-squares optimization problem with a locality regularization term
  • results: Empirical evaluations on six benchmark datasets using several evaluation metrics demonstrate superiority over state-of-the-art anomaly detection approaches
    Abstract A key component of many graph neural networks (GNNs) is the pooling operation, which seeks to reduce the size of a graph while preserving important structural information. However, most existing graph pooling strategies rely on an assignment matrix obtained by employing a GNN layer, which is characterized by trainable parameters, often leading to significant computational complexity and a lack of interpretability in the pooling process. In this paper, we propose an unsupervised graph encoder-decoder model to detect abnormal nodes from graphs by learning an anomaly scoring function to rank nodes based on their degree of abnormality. In the encoding stage, we design a novel pooling mechanism, named LCPool, which leverages locality-constrained linear coding for feature encoding to find a cluster assignment matrix by solving a least-squares optimization problem with a locality regularization term. By enforcing locality constraints during the coding process, LCPool is designed to be free from learnable parameters, capable of efficiently handling large graphs, and can effectively generate a coarser graph representation while retaining the most significant structural characteristics of the graph. In the decoding stage, we propose an unpooling operation, called LCUnpool, to reconstruct both the structure and nodal features of the original graph. We conduct empirical evaluations of our method on six benchmark datasets using several evaluation metrics, and the results demonstrate its superiority over state-of-the-art anomaly detection approaches.
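For intuition, here is a closed-form locality-constrained linear coding step of the kind LCPool builds on (following the standard LLC formulation; the paper's exact regularizer, anchor choice, and graph handling may differ):

```python
import numpy as np

def llc_assignments(X, anchors, lam=1e-2):
    """Locality-constrained linear coding as a parameter-free soft
    cluster-assignment matrix: each row of X is reconstructed from the anchor
    set with a locality penalty, so weight concentrates on nearby anchors."""
    N, M = X.shape[0], anchors.shape[0]
    S = np.zeros((N, M))
    for i, x in enumerate(X):
        z = anchors - x                  # anchors shifted to the point
        C = z @ z.T                      # local covariance (M x M)
        d2 = (z ** 2).sum(axis=1)        # squared distances = locality adaptor
        c = np.linalg.solve(C + lam * np.diag(d2) + 1e-8 * np.eye(M), np.ones(M))
        S[i] = c / c.sum()               # codes sum to one
    return S

# Toy usage: node features pooled onto 4 anchors (e.g., k-means centroids)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))             # 10 nodes, 5-dim features
anchors = rng.normal(size=(4, 5))
S = llc_assignments(X, anchors)          # (10, 4) assignment matrix
coarse_features = S.T @ X                # pooled (4, 5) coarser representation
```

Because the codes come from a closed-form least-squares solve, this pooling step involves no learnable parameters, which matches the motivation stated in the abstract.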

MOLE: MOdular Learning FramEwork via Mutual Information Maximization

  • paper_url: http://arxiv.org/abs/2308.07772
  • repo_url: None
  • paper_authors: Tianchao Li, Yulong Pei
  • for: To introduce an asynchronous and local learning framework for neural networks, the MOdular Learning FramEwork (MOLE)
  • methods: Modularizes neural networks by layers, defines a mutual-information training objective for each module, and sequentially trains each module by mutual information maximization
  • results: Experiments show MOLE handles vector-, grid-, and graph-type data, and solves both node-level and graph-level tasks on graphs; it is therefore experimentally shown to be universally applicable to different types of data
    Abstract This paper introduces an asynchronous and local learning framework for neural networks, named the Modular Learning Framework (MOLE). This framework modularizes neural networks by layers, defines the training objective via mutual information for each module, and sequentially trains each module by mutual information maximization. MOLE turns training into local optimization with gradients isolated across modules, a scheme that is more biologically plausible than backpropagation (BP). We run experiments on vector-, grid- and graph-type data. In particular, this framework is capable of solving both graph- and node-level tasks for graph-type data. Therefore, MOLE has been experimentally proven to be universally applicable to different types of data.

NeFL: Nested Federated Learning for Heterogeneous Clients

  • paper_url: http://arxiv.org/abs/2308.07761
  • repo_url: None
  • paper_authors: Honggu Kang, Seohyeon Cha, Jinwoo Shin, Jongmyeong Lee, Joonhyuk Kang
  • for: Addresses the issue of slow or incapable clients (stragglers) in federated learning (FL) and proposes a new framework, nested federated learning (NeFL), to mitigate it
  • methods: Uses a generalized framework that efficiently divides a model into submodels via both depthwise and widthwise scaling, interpreting models as solving ordinary differential equations (ODEs) with adaptive step sizes
  • results: Demonstrates that NeFL leads to significant gains, especially for the worst-case submodel (an improvement of 8.33 on CIFAR-10), and aligns with recent studies in FL
    Abstract Federated learning (FL) is a promising approach in distributed learning that preserves privacy. However, during the training pipeline of FL, slow or incapable clients (i.e., stragglers) slow down the total training time and degrade performance. System heterogeneity, including heterogeneous computing and network bandwidth, has been addressed to mitigate the impact of stragglers. Previous studies split models to tackle the issue, but with fewer degrees of freedom in terms of model architecture. We propose nested federated learning (NeFL), a generalized framework that efficiently divides a model into submodels using both depthwise and widthwise scaling. NeFL is implemented by interpreting models as solving ordinary differential equations (ODEs) with adaptive step sizes. To address the inconsistency that arises when training multiple submodels with different architecture, we decouple a few parameters. NeFL enables resource-constrained clients to effectively join the FL pipeline and the model to be trained with a larger amount of data. Through a series of experiments, we demonstrate that NeFL leads to significant gains, especially for the worst-case submodel (e.g., 8.33 improvement on CIFAR-10). Furthermore, we demonstrate NeFL aligns with recent studies in FL.

Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System

  • paper_url: http://arxiv.org/abs/2308.07760
  • repo_url: https://github.com/hebowei2000/DESS
  • paper_authors: Bowei He, Xu He, Renrui Zhang, Yingxue Zhang, Ruiming Tang, Chen Ma
  • for: To adapt recommender systems to continuously growing numbers of users and items in dynamically changing streaming environments
  • methods: Rethinks the streaming model update process, models dynamic embedding size search as a bandit problem, and makes the embedding size a dynamic variable to improve recommendation performance and reduce memory cost
  • results: Experiments on two recommendation tasks across four public datasets show better streaming recommendation performance with lower memory cost and higher time efficiency
    Abstract With the continuous increase of users and items, conventional recommender systems trained on static datasets can hardly adapt to changing environments. The high-throughput data requires the model to be updated in a timely manner for capturing the user interest dynamics, which leads to the emergence of streaming recommender systems. Due to the prevalence of deep learning-based recommender systems, the embedding layer is widely adopted to represent the characteristics of users, items, and other features in low-dimensional vectors. However, it has been proved that setting an identical and static embedding size is sub-optimal in terms of recommendation performance and memory cost, especially for streaming recommendations. To tackle this problem, we first rethink the streaming model update process and model the dynamic embedding size search as a bandit problem. Then, we analyze and quantify the factors that influence the optimal embedding sizes from the statistics perspective. Based on this, we propose the \textbf{D}ynamic \textbf{E}mbedding \textbf{S}ize \textbf{S}earch (\textbf{DESS}) method to minimize the embedding size selection regret on both user and item sides in a non-stationary manner. Theoretically, we obtain a sublinear regret upper bound superior to previous methods. Empirical results across two recommendation tasks on four public datasets also demonstrate that our approach can achieve better streaming recommendation performance with lower memory cost and higher time efficiency.
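The abstract frames size selection as a bandit problem; as a generic illustration only (this is a plain UCB sketch, not the DESS algorithm with its non-stationary regret analysis), one could write:

```python
import numpy as np

class EmbeddingSizeUCB:
    """Toy UCB bandit over candidate embedding sizes for a streaming
    recommender: pick a size, evaluate the model on the newest data slice,
    and feed back the observed reward (e.g., AUC)."""
    def __init__(self, sizes, c=1.0):
        self.sizes = sizes
        self.c = c
        self.counts = np.zeros(len(sizes))
        self.means = np.zeros(len(sizes))
        self.t = 0

    def select(self):
        self.t += 1
        ucb = self.means + self.c * np.sqrt(
            np.log(self.t) / np.maximum(self.counts, 1))
        ucb[self.counts == 0] = np.inf   # try every size once first
        return int(np.argmax(ucb))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

bandit = EmbeddingSizeUCB(sizes=[8, 16, 32, 64, 128])
rng = np.random.default_rng(0)
for step in range(200):                  # simulated reward stream
    arm = bandit.select()
    reward = rng.normal(0.6 + 0.05 * np.log2(bandit.sizes[arm]) / 7, 0.05)
    bandit.update(arm, reward)
print("most-selected size:", bandit.sizes[int(np.argmax(bandit.counts))])
```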

Forward-Backward Reasoning in Large Language Models for Verification

  • paper_url: http://arxiv.org/abs/2308.07758
  • repo_url: None
  • paper_authors: Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, James T. Kwok
  • for: To improve open-ended question answering by combining Self-Consistency with backward reasoning for verification
  • methods: Uses Self-Consistency to sample a diverse set of candidate reasoning chains, and uses backward reasoning to verify candidate answers
  • results: Experimental results on six datasets and three LLMs show that FOBAR achieves state-of-the-art performance on various reasoning benchmarks
    Abstract Chain-of-Thought (CoT) prompting has shown promising performance in various reasoning tasks. Recently, Self-Consistency \citep{wang2023selfconsistency} proposes to sample a diverse set of reasoning chains which may lead to different answers, while the answer that receives the most votes is selected. In this paper, we propose a novel method to use backward reasoning in verifying candidate answers. We mask a token in the question by ${\bf x}$ and ask the LLM to predict the masked token when a candidate answer is provided by \textit{a simple template}, i.e., "\textit{\textbf{If we know the answer of the above question is \{a candidate answer\}, what is the value of unknown variable ${\bf x}$?}" Intuitively, the LLM is expected to predict the masked token successfully if the provided candidate answer is correct. We further propose FOBAR to combine forward and backward reasoning for estimating the probability of candidate answers. We conduct extensive experiments on six data sets and three LLMs. Experimental results demonstrate that FOBAR achieves state-of-the-art performance on various reasoning benchmarks.
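A toy rendering of the forward-backward voting idea with a stub LLM (the additive score combination and the stub itself are illustrative assumptions; the paper estimates candidate-answer probabilities):

```python
from collections import Counter

def fobar_answer(llm, question, masked_question, known_x, k=8):
    """Forward: sample k answers to the original question. Backward: for each
    candidate answer, ask the model to recover the masked variable x via the
    paper's template and count how often it returns the known value. The two
    vote counts are combined additively here (an assumption)."""
    forward_votes = Counter(llm(question) for _ in range(k))
    scores = {}
    for ans, fwd in forward_votes.items():
        prompt = (masked_question +
                  f"\nIf we know the answer of the above question is {ans}, "
                  "what is the value of unknown variable x?")
        bwd = sum(llm(prompt) == known_x for _ in range(k))
        scores[ans] = fwd + bwd
    return max(scores, key=scores.get)

# Stub LLM for illustration only: it "solves" one fixed toy problem.
def llm(prompt):
    return "3" if "unknown variable x" in prompt else "12"

best = fobar_answer(llm, "If x = 3, what is 4 * x?",
                    "If x = <masked>, what is 4 * x?", known_x="3")
print(best)  # "12"
```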

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.07749
  • repo_url: None
  • paper_authors: Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang
  • for: To generate high-quality human motion videos for applications such as games and films
  • methods: Uses a pretrained T2I diffusion model to generate each video frame autoregressively, guiding the person's motion with text prompts and human poses
  • results: Compared with state-of-the-art methods, Dancing Avatar generates human motion videos of markedly higher quality, keeping the person and background consistent while achieving better temporal coherence
    Abstract The rising demand for creating lifelike avatars in the digital realm has led to an increased need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues. Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion. The crux of innovation lies in our adept utilization of the T2I diffusion model for producing video frames successively while preserving contextual relevance. We surmount the hurdles posed by maintaining human character and clothing consistency across varying poses, along with upholding the background's continuity amidst diverse human movements. To ensure consistent human appearances across the entire video, we devise an intra-frame alignment module. This module assimilates text-guided synthesized human character knowledge into the pretrained T2I diffusion model, synergizing insights from ChatGPT. For preserving background continuity, we put forth a background alignment pipeline, amalgamating insights from segment anything and image inpainting techniques. Furthermore, we propose an inter-frame alignment module that draws inspiration from an auto-regressive pipeline to augment temporal consistency between adjacent frames, where the preceding frame guides the synthesis process of the current frame. Comparisons with state-of-the-art methods demonstrate that Dancing Avatar exhibits the capacity to generate human videos with markedly superior quality, both in terms of human and background fidelity, as well as temporal coherence compared to existing state-of-the-art approaches.

Exploiting Sparsity in Automotive Radar Object Detection Networks

  • paper_url: http://arxiv.org/abs/2308.07748
  • repo_url: None
  • paper_authors: Marius Lippke, Maurice Quach, Sascha Braun, Daniel Köhler, Michael Ulrich, Bastian Bischoff, Wei Yap Tan
  • for: To improve the accuracy of environment perception in autonomous driving systems, ensuring safe and reliable operation
  • methods: Uses sparse convolutional object detection networks, which combine powerful grid-based detection with low compute cost, and proposes sparse kernel point pillars (SKPP) and dual voxel point convolutions (DVPC) as remedies for radar-specific grid rendering and sparse backbone challenges
  • results: On nuScenes, the SKPP-DPVCN architecture outperforms the baseline by 5.89% and the previous state of the art by 4.19% in Car AP4.0, and reduces the average scale error (ASE) by 21.41% over the baseline
    Abstract Having precise perception of the environment is crucial for ensuring the secure and reliable functioning of autonomous driving systems. Radar object detection networks are one fundamental part of such systems. CNN-based object detectors showed good performance in this context, but they require large compute resources. This paper investigates sparse convolutional object detection networks, which combine powerful grid-based detection with low compute resources. We investigate radar specific challenges and propose sparse kernel point pillars (SKPP) and dual voxel point convolutions (DVPC) as remedies for the grid rendering and sparse backbone architectures. We evaluate our SKPP-DPVCN architecture on nuScenes, which outperforms the baseline by 5.89% and the previous state of the art by 4.19% in Car AP4.0. Moreover, SKPP-DPVCN reduces the average scale error (ASE) by 21.41% over the baseline.

Formally-Sharp DAgger for MCTS: Lower-Latency Monte Carlo Tree Search using Data Aggregation with Formal Methods

  • paper_url: http://arxiv.org/abs/2308.07738
  • repo_url: None
  • paper_authors: Debraj Chakraborty, Damien Busatto-Gaston, Jean-François Raskin, Guillermo A. Pérez
  • for: To efficiently combine formal methods, Monte Carlo Tree Search (MCTS), and deep learning to produce high-quality receding horizon policies in large Markov decision processes (MDPs)
  • methods: Uses model-checking techniques to guide MCTS in generating offline samples of high-quality decisions on a representative set of MDP states, then trains a neural network to imitate the policy that generated them; the imitation policy can guide a lower-latency online MCTS search or serve as a full-fledged policy when minimal latency is required
  • results: Uses statistical model checking to detect when additional samples are needed and to focus them on configurations where the learnt neural network policy differs from the (computationally expensive) offline policy; the method is illustrated on MDPs modelling the Frozen Lake and Pac-Man environments
    Abstract We study how to efficiently combine formal methods, Monte Carlo Tree Search (MCTS), and deep learning in order to produce high-quality receding horizon policies in large Markov Decision processes (MDPs). In particular, we use model-checking techniques to guide the MCTS algorithm in order to generate offline samples of high-quality decisions on a representative set of states of the MDP. Those samples can then be used to train a neural network that imitates the policy used to generate them. This neural network can either be used as a guide on a lower-latency MCTS online search, or alternatively be used as a full-fledged policy when minimal latency is required. We use statistical model checking to detect when additional samples are needed and to focus those additional samples on configurations where the learnt neural network policy differs from the (computationally-expensive) offline policy. We illustrate the use of our method on MDPs that model the Frozen Lake and Pac-Man environments -- two popular benchmarks to evaluate reinforcement-learning algorithms.

Flashpoints Signal Hidden Inherent Instabilities in Land-Use Planning

  • paper_url: http://arxiv.org/abs/2308.07714
  • repo_url: None
  • paper_authors: Hazhir Aliahmadi, Maeve Beckett, Sam Connolly, Dongmei Chen, Greg van Anders
  • for: To improve the objectivity and transparency of land-use decision-making processes by using optimization-based planning approaches, such as Multi-Objective Land Allocation (MOLA)
  • methods: Uses quantitative evaluation of planning priorities to uncover a series of unstable "flashpoints" where small changes in planning priorities lead to large-scale changes in land use
  • results: Shows that quantitative methods can reduce the combinatorially large space of possible land-use patterns to a small, characteristic set that can engage stakeholders to arrive at more efficient and just outcomes, and identifies "gray areas" in land-use type that arise due to instabilities in the planning process
    Abstract Land-use decision-making processes have a long history of producing globally pervasive systemic equity and sustainability concerns. Quantitative, optimization-based planning approaches, e.g. Multi-Objective Land Allocation (MOLA), seemingly open the possibility to improve objectivity and transparency by explicitly evaluating planning priorities by the type, amount, and location of land uses. Here, we show that optimization-based planning approaches with generic planning criteria generate a series of unstable "flashpoints" whereby tiny changes in planning priorities produce large-scale changes in the amount of land use by type. We give quantitative arguments that the flashpoints we uncover in MOLA models are examples of a more general family of instabilities that occur whenever planning accounts for factors that coordinate use on- and between-sites, regardless of whether these planning factors are formulated explicitly or implicitly. We show that instabilities lead to regions of ambiguity in land-use type that we term "gray areas". By directly mapping gray areas between flashpoints, we show that quantitative methods retain utility by reducing combinatorially large spaces of possible land-use patterns to a small, characteristic set that can engage stakeholders to arrive at more efficient and just outcomes.
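A toy numeric illustration of a flashpoint: when two land uses have nearly equal per-parcel scores, a tiny change in the priority weight flips the argmax allocation for most parcels at once, so total land use by type changes discontinuously. All numbers below are synthetic and unrelated to the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_parcels = 1000
score_A = rng.normal(1.0, 0.02, n_parcels)   # e.g., agricultural value per parcel
score_B = rng.normal(1.0, 0.02, n_parcels)   # e.g., conservation value per parcel

for w in np.linspace(0.46, 0.54, 9):         # planning priority weight on use B
    utility = np.stack([(1 - w) * score_A, w * score_B])  # (2, n_parcels)
    share_B = (utility.argmax(axis=0) == 1).mean()
    print(f"w = {w:.2f}: fraction of land allocated to use B = {share_B:.2f}")
```

Sweeping w shows the allocated share of use B jumping from near 0 to near 1 over a tiny interval around w = 0.5, which is the kind of instability the abstract calls a flashpoint.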

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.07706
  • repo_url: None
  • paper_authors: Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Rabin Adhikari, Safal Thapaliya, Bishesh Khanal
  • for: Medical image segmentation is a crucial application in the medical domain, yet integrating textual guidance into segmentation models remains an area with limited progress
  • methods: Proposes using multimodal vision-language models to capture semantic information from image descriptions and images, enabling segmentation of diverse medical images
  • results: Finds that vision-language models trained on open-domain images are not directly reliable for medical image segmentation but improve with fine-tuning; evaluates the zero-shot and fine-tuned performance of 4 VLMs on 11 medical datasets using 9 types of prompts
    Abstract Medical Image Segmentation is crucial in various clinical applications within the medical domain. While state-of-the-art segmentation models have proven effective, integrating textual guidance to enhance visual features for this task remains an area with limited progress. Existing segmentation models that utilize textual guidance are primarily trained on open-domain images, raising concerns about their direct applicability in the medical domain without manual intervention or fine-tuning. To address these challenges, we propose using multimodal vision-language models for capturing semantic information from image descriptions and images, enabling the segmentation of diverse medical images. This study comprehensively evaluates existing vision language models across multiple datasets to assess their transferability from the open domain to the medical field. Furthermore, we introduce variations of image descriptions for previously unseen images in the dataset, revealing notable variations in model performance based on the generated prompts. Our findings highlight the distribution shift between the open-domain images and the medical domain and show that the segmentation models trained on open-domain images are not directly transferrable to the medical field. But their performance can be increased by finetuning them in the medical datasets. We report the zero-shot and finetuned segmentation performance of 4 Vision Language Models (VLMs) on 11 medical datasets using 9 types of prompts derived from 14 attributes.

DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.07687
  • repo_url: https://github.com/cure-lab/diffguard
  • paper_authors: Ruiyuan Gao, Chenchen Zhao, Lanqing Hong, Qiang Xu
  • for: Proposes a semantic-mismatch-guided out-of-distribution (OOD) detection method built on pre-trained diffusion models
  • methods: Earlier work enlarged semantic mismatch in image space with a conditional Generative Adversarial Network (cGAN); this work instead uses pre-trained diffusion models directly for semantic-mismatch-guided OOD detection
  • results: Experiments show that DiffGuard achieves state-of-the-art OOD detection on Cifar-10 and on hard cases of the large-scale ImageNet, and can be combined with existing OOD detection techniques for further gains
    Abstract Given a classifier, the inherent property of semantic Out-of-Distribution (OOD) samples is that their contents differ from all legal classes in terms of semantics, namely semantic mismatch. There is a recent work that directly applies it to OOD detection, which employs a conditional Generative Adversarial Network (cGAN) to enlarge semantic mismatch in the image space. While achieving remarkable OOD detection performance on small datasets, it is not applicable to ImageNet-scale datasets due to the difficulty in training cGANs with both input images and labels as conditions. As diffusion models are much easier to train and amenable to various conditions compared to cGANs, in this work, we propose to directly use pre-trained diffusion models for semantic mismatch-guided OOD detection, named DiffGuard. Specifically, given an OOD input image and the predicted label from the classifier, we try to enlarge the semantic difference between the reconstructed OOD image under these conditions and the original input image. We also present several test-time techniques to further strengthen such differences. Experimental results show that DiffGuard is effective on both Cifar-10 and hard cases of the large-scale ImageNet, and it can be easily combined with existing OOD detection techniques to achieve state-of-the-art OOD detection results.

Boosting Multi-modal Model Performance with Adaptive Gradient Modulation

  • paper_url: http://arxiv.org/abs/2308.07686
  • repo_url: https://github.com/lihong2303/agm_iccv2023
  • paper_authors: Hong Li, Xingyu Li, Pengbo Hu, Yinuo Lei, Chunxiao Li, Yi Zhou
  • for: To improve the performance of multi-modal learning models by addressing the modality competition problem
  • methods: Proposes an adaptive gradient modulation method that boosts multi-modal models with various fusion strategies
  • results: Experiments show the method surpasses all existing modulation methods, and a newly introduced competition-strength metric provides a quantitative understanding of modality competition
    Abstract While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods can only apply to late fusion models. More importantly, the mechanism of the modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to have a quantitative understanding of the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function that is designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality on which the competition is weaker than other modalities. However, this preferred modality need not dominate others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023.
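A generic sketch of gradient modulation between two modality encoders (the scaling rule, clamp range, and loss-ratio heuristic are illustrative assumptions, not the paper's exact modulation):

```python
import torch
import torch.nn as nn

def modulate_modality_gradients(enc_a, enc_v, loss_a, loss_v, alpha=1.0):
    """If modality A is weaker (higher mono-modal loss), scale its encoder's
    gradients up and the dominant modality's gradients down, nudging the
    joint model away from relying on a single branch."""
    ratio = (loss_a / loss_v).detach()
    coef_a = torch.clamp(ratio ** alpha, 0.1, 10.0)
    coef_v = torch.clamp(ratio ** (-alpha), 0.1, 10.0)
    for p in enc_a.parameters():
        if p.grad is not None:
            p.grad.mul_(coef_a)
    for p in enc_v.parameters():
        if p.grad is not None:
            p.grad.mul_(coef_v)

# Usage inside one training step (stand-in encoders and mono-modal losses):
enc_a, enc_v = nn.Linear(8, 4), nn.Linear(8, 4)
xa, xv = torch.rand(16, 8), torch.rand(16, 8)
loss_a = enc_a(xa).pow(2).mean()
loss_v = enc_v(xv).pow(2).mean()
(loss_a + loss_v).backward()
modulate_modality_gradients(enc_a, enc_v, loss_a, loss_v)
# optimizer.step() would follow here
```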

EQ-Net: Elastic Quantization Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07650
  • repo_url: https://github.com/xuke225/eq-net
  • paper_authors: Ke Xu, Lei Han, Ye Tian, Shangshang Yang, Xingyi Zhang
  • for: 该论文提出一种一次性(one-shot)网络量化范式,名为弹性量化神经网络(EQ-Net),用于训练一个鲁棒的权重共享量化超网。
  • methods: 首先,提出一个弹性量化空间(包括弹性位宽、粒度和对称性)以适应各种主流量化形式。其次,提出权重分布正则化损失(WDR-Loss)和分组渐进引导损失(GPG-Loss)两种损失函数,以弥合弹性量化空间中权重和输出 logits 的分布不一致。最后,结合遗传算法与所提出的条件量化感知精度预测器(CQAP)作为估计器,在超网中快速搜索混合精度量化神经网络。
  • results: 大量实验表明,EQ-Net 与其静态对应模型以及当前最先进的鲁棒位宽方法相当甚至更好。代码见 \href{https://github.com/xuke225/EQ-Net.git}{https://github.com/xuke225/EQ-Net}。
    Abstract Current model quantization methods have shown their promising capability in reducing storage space and computation complexity. However, due to the diversity of quantization forms supported by different hardware, one limitation of existing solutions is that usually require repeated optimization for different scenarios. How to construct a model with flexible quantization forms has been less studied. In this paper, we explore a one-shot network quantization regime, named Elastic Quantization Neural Networks (EQ-Net), which aims to train a robust weight-sharing quantization supernet. First of all, we propose an elastic quantization space (including elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantitative forms. Secondly, we propose the Weight Distribution Regularization Loss (WDR-Loss) and Group Progressive Guidance Loss (GPG-Loss) to bridge the inconsistency of the distribution for weights and output logits in the elastic quantization space gap. Lastly, we incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search mixed-precision quantized neural networks in supernet. Extensive experiments demonstrate that our EQ-Net is close to or even better than its static counterparts as well as state-of-the-art robust bit-width methods. Code can be available at \href{https://github.com/xuke225/EQ-Net.git}{https://github.com/xuke225/EQ-Net}.
    摘要 当前的模型量化方法在减少存储空间和计算复杂度方面展现出良好能力。然而,由于不同硬件支持的量化形式多种多样,现有方案通常需要针对不同场景重复优化,如何构建具有灵活量化形式的模型仍少有研究。在这篇论文中,我们探索一种一次性网络量化范式,名为弹性量化神经网络(EQ-Net),旨在训练一个鲁棒的权重共享量化超网。首先,我们提出弹性量化空间(包括弹性位宽、粒度和对称性),以适应各种主流量化形式。其次,我们提出权重分布正则化损失(WDR-Loss)和分组渐进引导损失(GPG-Loss),以弥合弹性量化空间中权重和输出 logits 的分布不一致。最后,我们结合遗传算法与所提出的条件量化感知精度预测器(CQAP)作为估计器,在超网中快速搜索混合精度量化神经网络。大量实验证明,EQ-Net 与其静态对应模型以及最先进的鲁棒位宽方法相当甚至更好。代码可在 \href{https://github.com/xuke225/EQ-Net.git}{https://github.com/xuke225/EQ-Net} 获取。
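
A minimal sketch of an elastic fake-quantizer covering the three axes the paper names (bit-width, granularity, symmetry); the threshold handling and straight-through estimator follow common quantization practice rather than EQ-Net's exact implementation.

```python
import torch

def elastic_fake_quant(w: torch.Tensor, bits: int, symmetric: bool = True,
                       per_channel: bool = False) -> torch.Tensor:
    """Uniform fake quantization whose bit-width, granularity and symmetry can
    all be re-sampled per step when training an elastic quantization supernet.
    Expects >=2-D weights when per_channel=True (channel dim 0)."""
    if per_channel:
        dims = tuple(range(1, w.dim()))
        amax = w.abs().amax(dim=dims, keepdim=True)
        wmin = w.amin(dim=dims, keepdim=True)
        wmax = w.amax(dim=dims, keepdim=True)
    else:
        amax, wmin, wmax = w.abs().max(), w.min(), w.max()
    if symmetric:
        qmax = 2 ** (bits - 1) - 1
        scale = (amax / qmax).clamp(min=1e-8)
        deq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    else:
        scale = ((wmax - wmin) / (2 ** bits - 1)).clamp(min=1e-8)
        deq = torch.clamp(torch.round((w - wmin) / scale), 0, 2 ** bits - 1) * scale + wmin
    # Straight-through estimator so the supernet's shared weights stay trainable.
    return w + (deq - w).detach()

w = torch.randn(16, 8, requires_grad=True)
for bits in (2, 4, 8):  # elastic bit-width sampled per training step
    wq = elastic_fake_quant(w, bits, symmetric=(bits > 2), per_channel=True)
```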

Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping

  • paper_url: http://arxiv.org/abs/2308.07641
  • repo_url: None
  • paper_authors: Boyu Chen, Hanxuan Chen, Jiao He, Fengyu Sun, Shangling Jui
  • for: 本文提出一种简单而新颖的参数化线性映射形式,以实现出色的网络压缩性能。
  • methods: 该方法是一种伪 SVD,称为三元 SVD(Ternary SVD,TSVD)。与标准 SVD 不同,TSVD 将 $U$ 和 $V$ 矩阵限制为取值于 $\{\pm 1, 0\}$ 的三元矩阵,因此在计算 $U(\cdot)$ 和 $V(\cdot)$ 时只需加法运算。
  • results: 在各种网络和任务上,TSVD 均可实现最先进的网络压缩性能,涵盖 ConvNext、Swin、BERT 等当前基线模型以及 OPT 等大型语言模型。
    Abstract We present a simple yet novel parameterized form of linear mapping to achieves remarkable network compression performance: a pseudo SVD called Ternary SVD (TSVD). Unlike vanilla SVD, TSVD limits the $U$ and $V$ matrices in SVD to ternary matrices form in $\{\pm 1, 0\}$. This means that instead of using the expensive multiplication instructions, TSVD only requires addition instructions when computing $U(\cdot)$ and $V(\cdot)$. We provide direct and training transition algorithms for TSVD like Post Training Quantization and Quantization Aware Training respectively. Additionally, we analyze the convergence of the direct transition algorithms in theory. In experiments, we demonstrate that TSVD can achieve state-of-the-art network compression performance in various types of networks and tasks, including current baseline models such as ConvNext, Swim, BERT, and large language model like OPT.
    摘要 我们提出了一种简单而新颖的参数化线性映射形式,可以取得出色的网络压缩性能:一种称为三元 SVD(TSVD)的伪 SVD。与普通 SVD 不同,TSVD 将 $U$ 和 $V$ 矩阵限制为取值于 $\{\pm 1, 0\}$ 的三元矩阵。这意味着在计算 $U(\cdot)$ 和 $V(\cdot)$ 时,TSVD 只需要加法指令,而不需要昂贵的乘法指令。我们分别提供了类似训练后量化(Post Training Quantization)和量化感知训练(Quantization Aware Training)的直接转换算法与训练转换算法,并从理论上分析了直接转换算法的收敛性。实验表明,TSVD 可以在不同类型的网络和任务上取得最先进的网络压缩性能,包括 ConvNext、Swin、BERT 等当前基线模型以及 OPT 等大语言模型。
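
A simplified "direct transition" sketch: take the SVD, project each column of U and V onto {±1, 0} with a least-squares column scale, and absorb the scales into the singular values. The threshold choice here is illustrative; the paper's transition algorithms and convergence analysis are more refined.

```python
import numpy as np

def ternarize(M, tau=0.7):
    """Project each column of M onto {-1, 0, +1}; return (T, alpha) so that
    M ~= T @ diag(alpha). `tau` scales the per-column magnitude threshold."""
    T = np.zeros_like(M)
    thresh = tau * np.mean(np.abs(M), axis=0)
    T[M > thresh] = 1.0
    T[M < -thresh] = -1.0
    # Least-squares column scales: alpha_j = <m_j, t_j> / <t_j, t_j>
    num = np.sum(M * T, axis=0)
    den = np.maximum(np.sum(T * T, axis=0), 1.0)
    return T, num / den

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Tu, au = ternarize(U)
Tv, av = ternarize(Vt.T)
# W ~= Tu diag(au * s * av) Tv^T: only the diagonal needs multiplications,
# the ternary factors reduce to additions/subtractions at inference.
W_hat = Tu @ np.diag(au * s * av) @ Tv.T
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```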

LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

  • paper_url: http://arxiv.org/abs/2308.07635
  • repo_url: None
  • paper_authors: Xiaoming Shi, Jie Xu, Jinru Ding, Jiali Pang, Sichen Liu, Shuqing Luo, Xingwei Peng, Lu Lu, Haihong Yang, Mingtao Hu, Tong Ruan, Shaoting Zhang
  • for: 该研究旨在提供一个统一的评估标准,用于评估医疗大语言模型(LLM)的诊断能力。
  • methods: 该研究首先建立了一个特有的评估标准,称为LLM特有的Mini-CEX,以评估医疗LLM的诊断能力。此外,研究者还开发了一个patient simulator,用于自动与LLM进行对话,并使用ChatGPT来自动评估诊断对话的质量。
  • results: 实验结果表明,LLM特有的Mini-CEX是一个有效和必需的评估标准,可以评估医疗LLM的诊断对话质量。此外,ChatGPT也可以自动评估诊断对话的人文特质,并提供可重复和自动比较不同LLM的能力。
    Abstract There is an increasing interest in developing LLMs for medical diagnosis to improve diagnosis efficiency. Despite their alluring technological potential, there is no unified and comprehensive evaluation criterion, leading to the inability to evaluate the quality and potential risks of medical LLMs, further hindering the application of LLMs in medical treatment scenarios. Besides, current evaluations heavily rely on labor-intensive interactions with LLMs to obtain diagnostic dialogues and human evaluation on the quality of diagnosis dialogue. To tackle the lack of unified and comprehensive evaluation criterion, we first initially establish an evaluation criterion, termed LLM-specific Mini-CEX to assess the diagnostic capabilities of LLMs effectively, based on original Mini-CEX. To address the labor-intensive interaction problem, we develop a patient simulator to engage in automatic conversations with LLMs, and utilize ChatGPT for evaluating diagnosis dialogues automatically. Experimental results show that the LLM-specific Mini-CEX is adequate and necessary to evaluate medical diagnosis dialogue. Besides, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible and automated comparisons between different LLMs.
    摘要 为提高诊断效率,利用大语言模型(LLM)进行医疗诊断的研究兴趣日益增长。尽管其技术潜力诱人,但目前缺乏统一而全面的评估标准,导致医疗 LLM 的质量和潜在风险难以评估,进而限制了 LLM 在医疗场景中的应用。此外,当前的评估严重依赖与 LLM 进行劳动密集的交互来获取诊断对话,并依靠人工评估诊断对话的质量。为解决评估标准缺失的问题,我们首先基于原始 Mini-CEX 建立了一个评估标准,称为 LLM 特定的 Mini-CEX,用于有效评估 LLM 的诊断能力。为解决人工交互成本高的问题,我们开发了一个病人模拟器,用于与 LLM 自动对话,并利用 ChatGPT 自动评估诊断对话的质量。实验结果表明,LLM 特定的 Mini-CEX 是评估医疗诊断对话的有效且必要的标准;此外,ChatGPT 可以在人文素质等指标上替代人工评估,并提供可复现、自动化的不同 LLM 之间的比较。

A Survey on Model Compression for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07633
  • repo_url: None
  • paper_authors: Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang
  • for: 本文旨在概述大自然语言处理任务中的语言模型压缩技术,尤其是针对资源有限的环境下进行实用部署。
  • methods: 本文介绍了各种压缩方法,包括量化、剪枝、知识蒸馏等技术,并讨论了每种方法的最新进展和创新应用。
  • results: 本文提供了评估压缩后模型效果的方法和指标,并探讨了这些方法在实际应用中的实用性。
    Abstract Large Language Models (LLMs) have revolutionized natural language processing tasks with remarkable success. However, their formidable size and computational demands present significant challenges for practical deployment, especially in resource-constrained environments. As these challenges become increasingly pertinent, the field of model compression has emerged as a pivotal research area to alleviate these limitations. This paper presents a comprehensive survey that navigates the landscape of model compression techniques tailored specifically for LLMs. Addressing the imperative need for efficient deployment, we delve into various methodologies, encompassing quantization, pruning, knowledge distillation, and more. Within each of these techniques, we highlight recent advancements and innovative approaches that contribute to the evolving landscape of LLM research. Furthermore, we explore benchmarking strategies and evaluation metrics that are essential for assessing the effectiveness of compressed LLMs. By providing insights into the latest developments and practical implications, this survey serves as an invaluable resource for both researchers and practitioners. As LLMs continue to evolve, this survey aims to facilitate enhanced efficiency and real-world applicability, establishing a foundation for future advancements in the field.
    摘要 大语言模型(LLM)以卓越的表现革新了自然语言处理任务,但其庞大的规模和计算需求给实际部署带来了重大挑战,尤其是在资源受限的环境中。随着这些挑战日益突出,模型压缩已成为缓解上述限制的关键研究领域。本文对专门面向 LLM 的模型压缩技术进行了全面综述,涵盖量化、剪枝、知识蒸馏等多种方法,并介绍各类方法的最新进展与创新应用。此外,本文还探讨了评估压缩后 LLM 效果所必需的基准策略和评估指标。通过提供最新进展与实际应用方面的洞见,本综述为研究者和从业者提供了宝贵参考,并为该领域未来的发展奠定基础。

Vision-based Semantic Communications for Metaverse Services: A Contest Theoretic Approach

  • paper_url: http://arxiv.org/abs/2308.07618
  • repo_url: None
  • paper_authors: Guangyuan Liu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Boon Hee Soong
  • for: 提出一个面向 Metaverse 中用户与服务提供商之间交互和资源分配的语义通信框架,以提升用户在虚拟世界中的体验。
  • methods: 使用竞赛理论(contest theory)对用户与服务提供商之间的交互建模,并根据每个用户的需求进行资源分配;利用语义通信技术将传输数据量降至 51 字节,从而减少网络资源消耗;使用深度 Q 网络优化奖励设置,以最大化性能和资源分配效率。
  • results: 相比传统的平均分配方法,通过优化奖励设置,将因渲染资源受限导致的下采样损失降低了 66.076%,为 Metaverse 中用户与服务提供商之间的资源分配提供了一种提升虚拟世界体验的解决方案。
    Abstract The popularity of Metaverse as an entertainment, social, and work platform has led to a great need for seamless avatar integration in the virtual world. In Metaverse, avatars must be updated and rendered to reflect users' behaviour. Achieving real-time synchronization between the virtual bilocation and the user is complex, placing high demands on the Metaverse Service Provider (MSP)'s rendering resource allocation scheme. To tackle this issue, we propose a semantic communication framework that leverages contest theory to model the interactions between users and MSPs and determine optimal resource allocation for each user. To reduce the consumption of network resources in wireless transmission, we use the semantic communication technique to reduce the amount of data to be transmitted. Under our simulation settings, the encoded semantic data only contains 51 bytes of skeleton coordinates instead of the image size of 8.243 megabytes. Moreover, we implement Deep Q-Network to optimize reward settings for maximum performance and efficient resource allocation. With the optimal reward setting, users are incentivized to select their respective suitable uploading frequency, reducing down-sampling loss due to rendering resource constraints by 66.076\% compared with the traditional average distribution method. The framework provides a novel solution to resource allocation for avatar association in VR environments, ensuring a smooth and immersive experience for all users.
    摘要 Metaverse 作为娱乐、社交和工作平台的流行,使得虚拟世界中化身(avatar)的无缝集成变得至关重要。在 Metaverse 中,化身需要实时更新和渲染以反映用户行为,而实现虚拟分身与用户之间的实时同步十分复杂,对 Metaverse 服务提供商(MSP)的渲染资源分配方案提出了很高要求。为解决这一问题,我们提出一种语义通信框架,利用竞赛理论(contest theory)对用户与 MSP 之间的交互进行建模,并为每个用户确定最优的资源分配。为减少无线传输中的网络资源消耗,我们采用语义通信技术压缩传输数据量:在我们的仿真设置下,编码后的语义数据仅包含 51 字节的骨架坐标,而原始图像大小为 8.243 MB。此外,我们使用 Deep Q-Network 优化奖励设置,以实现最佳性能和高效的资源分配。在最优奖励设置下,用户会被激励选择各自合适的上传频率,使因渲染资源受限导致的下采样损失比传统平均分配方法降低 66.076%。该框架为 VR 环境中化身关联的资源分配提供了一种新的解决方案,确保所有用户获得流畅而沉浸的体验。
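
The reported 51-byte payload is consistent with, for example, 17 skeleton keypoints (COCO layout) carrying two 12-bit coordinates each: 17 × 24 bits = 51 bytes. The packing below is an illustrative assumption, not the paper's specified format.

```python
def pack_skeleton(keypoints, width=4096, height=4096):
    """keypoints: 17 (x, y) pairs with 0 <= x < width, 0 <= y < height.
    Each coordinate is quantized to 12 bits, so a joint costs 3 bytes and the
    whole skeleton costs 51 bytes -- versus megabytes for the raw frame."""
    assert len(keypoints) == 17
    bits = 0
    for x, y in keypoints:
        qx = min(int(x * 4096 / width), 4095)   # 12-bit x
        qy = min(int(y * 4096 / height), 4095)  # 12-bit y
        bits = (bits << 24) | (qx << 12) | qy
    return bits.to_bytes(51, "big")

payload = pack_skeleton([(100.0 + i, 200.0 + i) for i in range(17)])
print(len(payload), "bytes vs 8,243,000 bytes for the raw image")  # 51
```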

ERA*: Enhanced Relaxed A* algorithm for Solving the Shortest Path Problem in Regular Grid Maps

  • paper_url: http://arxiv.org/abs/2308.10988
  • repo_url: None
  • paper_authors: Adel Ammar
  • for: 解决静态规则 8-邻接连通(G8)栅格地图中的点到点最短路径问题。
  • methods: 提出一种新算法,可视为 Hadlock 算法在 G8 栅格上的推广;在所得路径长度上与松弛 $A^*$($RA^*$)算法理论等价,但采用完全不同的计算策略,基于定义一组查找矩阵。
  • results: 通过对不同类型和大小的栅格地图的实验(43 张地图上共 1290 次运行),证明该算法平均比 $RA^*$ 快 2.25 倍、比原始 $A^*$ 快 17 倍,且由于无需存储 G 值矩阵,内存利用率更高。
    Abstract This paper introduces a novel algorithm for solving the point-to-point shortest path problem in a static regular 8-neighbor connectivity (G8) grid. This algorithm can be seen as a generalization of Hadlock algorithm to G8 grids, and is shown to be theoretically equivalent to the relaxed $A^*$ ($RA^*$) algorithm in terms of the provided solution's path length, but with substantial time and memory savings, due to a completely different computation strategy, based on defining a set of lookup matrices. Through an experimental study on grid maps of various types and sizes (1290 runs on 43 maps), it is proven to be 2.25 times faster than $RA^*$ and 17 times faster than the original $A^*$, in average. Moreover, it is more memory-efficient, since it does not need to store a G score matrix.
    摘要 这篇论文介绍了一种新算法,用于解决静态规则 8-邻接连通(G8)栅格上的点到点最短路径问题。该算法可以看作 Hadlock 算法在 G8 栅格上的推广,并且在所得解的路径长度方面与松弛 $A^*$($RA^*$)算法在理论上等价,但由于采用了基于一组查找矩阵的完全不同的计算策略,可大幅节省时间与内存。通过对不同类型和大小的栅格地图的实验研究(43 张地图上共 1290 次运行),该算法被证明平均比 $RA^*$ 快 2.25 倍、比原始 $A^*$ 快 17 倍。此外,它不需要存储 G 值矩阵,因此更加节省内存。
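
For orientation, here is a relaxed-A*-style search on a G8 grid (unit and √2 step costs, octile heuristic, no node re-expansion). ERA* reaches the same path lengths via lookup matrices instead of a priority queue; that machinery is not reproduced here.

```python
import heapq, math

def octile(a, b):
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)

def relaxed_a_star(grid, start, goal):
    """A*-style search on an 8-connected grid that never re-expands a node
    (the simplification RA* makes). grid[r][c] truthy = obstacle."""
    rows, cols = len(grid), len(grid[0])
    moves = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
    g, parent = {start: 0.0}, {start: None}
    pq, closed = [(octile(start, goal), start)], set()
    while pq:
        _, cur = heapq.heappop(pq)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur); cur = parent[cur]
            return path[::-1]
        if cur in closed:
            continue
        closed.add(cur)
        for dr, dc in moves:
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue
            cost = g[cur] + (math.sqrt(2) if dr and dc else 1.0)
            if nxt not in closed and cost < g.get(nxt, float("inf")):
                g[nxt], parent[nxt] = cost, cur
                heapq.heappush(pq, (cost + octile(nxt, goal), nxt))
    return None

grid = [[0] * 6 for _ in range(6)]
grid[2][1:5] = [1, 1, 1, 1]  # a wall the path must route around
print(relaxed_a_star(grid, (0, 0), (5, 5)))
```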

SGDiff: A Style Guided Diffusion Model for Fashion Synthesis

  • paper_url: http://arxiv.org/abs/2308.07605
  • repo_url: https://github.com/taited/sgdiff
  • paper_authors: Zhengwentai Sun, Yanghong Zhou, Honghong He, P. Y. Mok
  • for: 本研究旨在开发一种新的风格引导扩散模型(SGDiff),以解决现有图像生成模型的若干缺陷。
  • methods: 该模型将图像模态与预训练的文本到图像扩散模型相结合,以实现创造性的时尚图像生成。它通过引入补充性的风格引导,大幅降低训练成本,并克服了仅凭文本输入难以控制生成风格的问题。
  • results: 本研究还引入了专为时尚图像生成应用设计的新数据集 SG-Fashion,包含高分辨率图像和丰富的服装类别。通过全面的消融实验,我们验证了所提模型能够生成符合目标类别、产品属性和风格的时尚图像。
    Abstract This paper reports on the development of \textbf{a novel style guided diffusion model (SGDiff)} which overcomes certain weaknesses inherent in existing models for image synthesis. The proposed SGDiff combines image modality with a pretrained text-to-image diffusion model to facilitate creative fashion image synthesis. It addresses the limitations of text-to-image diffusion models by incorporating supplementary style guidance, substantially reducing training costs, and overcoming the difficulties of controlling synthesized styles with text-only inputs. This paper also introduces a new dataset -- SG-Fashion, specifically designed for fashion image synthesis applications, offering high-resolution images and an extensive range of garment categories. By means of comprehensive ablation study, we examine the application of classifier-free guidance to a variety of conditions and validate the effectiveness of the proposed model for generating fashion images of the desired categories, product attributes, and styles. The contributions of this paper include a novel classifier-free guidance method for multi-modal feature fusion, a comprehensive dataset for fashion image synthesis application, a thorough investigation on conditioned text-to-image synthesis, and valuable insights for future research in the text-to-image synthesis domain. The code and dataset are available at: \url{https://github.com/taited/SGDiff}.
    摘要 本文报告了一种新的风格引导扩散模型(SGDiff)的开发,它克服了现有图像合成模型的若干固有缺陷。SGDiff 将图像模态与预训练的文本到图像扩散模型相结合,以实现创造性的时尚图像合成;通过引入补充性的风格引导,它弥补了文本到图像扩散模型的局限,大幅降低了训练成本,并克服了仅凭文本输入难以控制合成风格的问题。本文还引入了专为时尚图像合成应用设计的新数据集 SG-Fashion,提供高分辨率图像和丰富的服装类别。通过全面的消融实验,我们考察了无分类器引导(classifier-free guidance)在多种条件下的应用,并验证了所提模型能够生成具有目标类别、产品属性和风格的时尚图像。本文的贡献包括:一种用于多模态特征融合的新型无分类器引导方法、一个面向时尚图像合成应用的完整数据集、对条件文本到图像合成的深入研究,以及对该领域未来研究的有价值启示。代码和数据集见:\url{https://github.com/taited/SGDiff}。
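
One common way to fuse two guidance signals in a diffusion sampler is additive classifier-free guidance, sketched below; `eps_model` is a hypothetical denoiser callable, and the paper's actual multi-modal fusion may differ from this form.

```python
def guided_eps(eps_model, x_t, t, text_emb, style_emb, w_text=7.5, w_style=3.0):
    """Classifier-free guidance extended to two conditions (text + style image).
    Each guidance term pushes the denoised direction toward its own condition;
    the weights trade off text fidelity against style fidelity."""
    e_uncond = eps_model(x_t, t, text=None, style=None)
    e_text = eps_model(x_t, t, text=text_emb, style=None)
    e_style = eps_model(x_t, t, text=None, style=style_emb)
    return (e_uncond
            + w_text * (e_text - e_uncond)
            + w_style * (e_style - e_uncond))
```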

Generating Personas for Games with Multimodal Adversarial Imitation Learning

  • paper_url: http://arxiv.org/abs/2308.07598
  • repo_url: None
  • paper_authors: William Ahlberg, Alessandro Sestini, Konrad Tollmar, Linus Gisslén
  • for: 这篇论文旨在生成能够模仿人类玩家多种游戏风格的智能体策略,用于游戏测试。
  • methods: 论文提出一种基于多模态生成对抗模仿学习(MultiGAIL)的新方法:使用辅助输入参数在单智能体模型中学习不同的游戏人格(persona),并使用多个判别器作为奖励模型。
  • results: 实验分析表明,该方法在连续和离散动作空间的两个环境中均有效,能够生成多种不同的游戏人格策略。
    Abstract Reinforcement learning has been widely successful in producing agents capable of playing games at a human level. However, this requires complex reward engineering, and the agent's resulting policy is often unpredictable. Going beyond reinforcement learning is necessary to model a wide range of human playstyles, which can be difficult to represent with a reward function. This paper presents a novel imitation learning approach to generate multiple persona policies for playtesting. Multimodal Generative Adversarial Imitation Learning (MultiGAIL) uses an auxiliary input parameter to learn distinct personas using a single-agent model. MultiGAIL is based on generative adversarial imitation learning and uses multiple discriminators as reward models, inferring the environment reward by comparing the agent and distinct expert policies. The reward from each discriminator is weighted according to the auxiliary input. Our experimental analysis demonstrates the effectiveness of our technique in two environments with continuous and discrete action spaces.
    摘要 强化学习已被广泛用于训练达到人类水平的游戏智能体,但这需要复杂的奖励工程,且智能体最终的策略往往难以预测。要建模人类多样的游戏风格,就必须超越强化学习,因为这些风格很难用奖励函数来表示。本文提出一种新的模仿学习方法,可生成多个人格(persona)策略用于游戏测试,称为多模态生成对抗模仿学习(MultiGAIL)。MultiGAIL 基于生成对抗模仿学习,利用一个辅助输入参数通过单智能体模型学习不同的人格,并使用多个判别器作为奖励模型,通过比较智能体与各专家策略来推断环境奖励;每个判别器的奖励按照辅助输入加权。实验分析表明,我们的技术在连续和离散动作空间的两个环境中均有效。
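
A sketch of the reward blending described above: each persona discriminator yields a GAIL-style reward, weighted according to the auxiliary persona input. The exact discriminator outputs and weighting scheme in MultiGAIL may differ.

```python
import math

def multigail_reward(discriminators, weights, state, action):
    """Blend per-persona imitation rewards. `discriminators` is a list of
    callables D_i(state, action) -> probability the pair came from expert i;
    `weights` are derived from the auxiliary persona input and sum to 1.
    The GAIL-style reward -log(1 - D) grows as the agent fools a discriminator."""
    assert abs(sum(weights) - 1.0) < 1e-6
    total = 0.0
    for D, w in zip(discriminators, weights):
        d = min(max(D(state, action), 1e-6), 1 - 1e-6)  # numerical safety
        total += w * -math.log(1.0 - d)
    return total
```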

AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing

  • paper_url: http://arxiv.org/abs/2308.07580
  • repo_url: None
  • paper_authors: Bo Lin, Shoshanna Saxe, Timothy C. Y. Chan
  • for: 这个论文是为了提供一种快速、精准和大规模的自行车压力评估方法,以便在城市道路网中规划自行车设施和路线建议。
  • methods: 这个论文使用了深度学习框架,利用街景图像来支持快速、精准和大规模的自行车压力评估。具体来说,这个框架包括一种对比学习方法,利用自行车压力标签的顺序关系,以及一种后处理技术,以保证预测结果的空间平滑性。
  • results: 在使用了39,153条道路段的 datasets 上,我们的结果表明,我们的深度学习框架可以快速、精准地进行自行车压力评估,并且可以使用街景图像来评估自行车压力,即使没有高质量的道路几何和机动车数据。
    Abstract Cycling stress assessment, which quantifies cyclists' perceived stress imposed by the built environment and motor traffics, increasingly informs cycling infrastructure planning and cycling route recommendation. However, currently calculating cycling stress is slow and data-intensive, which hinders its broader application. In this paper, We propose a deep learning framework to support accurate, fast, and large-scale cycling stress assessments for urban road networks based on street-view images. Our framework features i) a contrastive learning approach that leverages the ordinal relationship among cycling stress labels, and ii) a post-processing technique that enforces spatial smoothness into our predictions. On a dataset of 39,153 road segments collected in Toronto, Canada, our results demonstrate the effectiveness of our deep learning framework and the value of using image data for cycling stress assessment in the absence of high-quality road geometry and motor traffic data.
    摘要 自行车压力评估量化骑行者对建成环境和机动车流的感知压力,正日益成为自行车基础设施规划和骑行路线推荐的重要依据。但目前计算自行车压力既缓慢又依赖大量数据,限制了其更广泛的应用。在这篇论文中,我们提出一个深度学习框架,基于街景图像支持精确、快速、大规模的自行车压力评估。我们的框架包括:(i)一种利用自行车压力标签间顺序关系的对比学习方法;(ii)一种对预测结果施加空间平滑性的后处理技术。在加拿大多伦多收集的 39,153 条道路路段数据集上,我们的结果显示了该深度学习框架的有效性,以及在缺乏高质量道路几何和机动车流数据时使用图像数据进行自行车压力评估的价值。

IoT Data Trust Evaluation via Machine Learning

  • paper_url: http://arxiv.org/abs/2308.11638
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Timothy Tadj, Reza Arablouei, Volkan Dedeoglu
  • for: 评估物联网(IoT)数据的可信度。
  • methods: 使用随机游走填充(RWI)方法从现有可信数据合成不可信数据以扩充数据集,并从传感器时序数据中提取能有效刻画其自相关性及其与邻近(对等)传感器数据互相关性的新特征。
  • results: 通过对多种基于机器学习(ML)的 IoT 数据可信度评估方法的大量实验,发现常用的 ML 方法表现不佳,这可归因于其不可靠的假设——聚类能为数据可信度提供可靠标签。同时,基于 RWI 生成的数据和所提取特征训练的 ML 模型对未见数据具有良好泛化能力并优于现有相关方法。此外,仅需标注约 10% 数据的半监督 ML 方法即可提供具有竞争力的性能,在实际应用中更具吸引力。
    Abstract Various approaches based on supervised or unsupervised machine learning (ML) have been proposed for evaluating IoT data trust. However, assessing their real-world efficacy is hard mainly due to the lack of related publicly-available datasets that can be used for benchmarking. Since obtaining such datasets is challenging, we propose a data synthesis method, called random walk infilling (RWI), to augment IoT time-series datasets by synthesizing untrustworthy data from existing trustworthy data. Thus, RWI enables us to create labeled datasets that can be used to develop and validate ML models for IoT data trust evaluation. We also extract new features from IoT time-series sensor data that effectively capture its auto-correlation as well as its cross-correlation with the data of the neighboring (peer) sensors. These features can be used to learn ML models for recognizing the trustworthiness of IoT sensor data. Equipped with our synthesized ground-truth-labeled datasets and informative correlation-based feature, we conduct extensive experiments to critically examine various approaches to evaluating IoT data trust via ML. The results reveal that commonly used ML-based approaches to IoT data trust evaluation, which rely on unsupervised cluster analysis to assign trust labels to unlabeled data, perform poorly. This poor performance can be attributed to the underlying unsubstantiated assumption that clustering provides reliable labels for data trust, a premise that is found to be untenable. The results also show that the ML models learned from datasets augmented via RWI while using the proposed features generalize well to unseen data and outperform existing related approaches. Moreover, we observe that a semi-supervised ML approach that requires only about 10% of the data labeled offers competitive performance while being practically more appealing compared to the fully-supervised approaches.
    摘要 针对物联网(IoT)数据可信度评估,已有多种基于监督或无监督机器学习(ML)的方法被提出。然而,由于缺乏可用于基准测试的公开数据集,评估这些方法的实际效果十分困难。鉴于获取此类数据集的难度,我们提出一种数据合成方法——随机游走填充(RWI),通过从现有可信数据合成不可信数据来扩充 IoT 时序数据集。借助 RWI,我们可以构建带标签的数据集,用于开发和验证面向 IoT 数据可信度评估的 ML 模型。此外,我们还从 IoT 时序传感器数据中提取新特征,有效刻画其自相关性以及与邻近(对等)传感器数据的互相关性;这些特征可用于训练识别 IoT 传感器数据可信度的 ML 模型。借助我们合成的带真值标签数据集和富含信息的相关性特征,我们开展了大量实验,以批判性地检验各种基于 ML 的 IoT 数据可信度评估方法。结果表明,依赖无监督聚类分析为无标签数据赋予可信度标签的常用 ML 方法表现不佳;其原因在于“聚类能为数据可信度提供可靠标签”这一前提缺乏依据,经检验并不成立。结果还表明,利用 RWI 扩充的数据集和所提特征训练的 ML 模型对未见数据泛化良好,优于现有相关方法。此外,我们观察到,仅需标注约 10% 数据的半监督 ML 方法即可提供具有竞争力的性能,且与全监督方法相比在实际中更具吸引力。
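
A simplified reading of random walk infilling: replace a window of a trustworthy series with a random walk anchored at the window boundary, producing a labeled untrustworthy segment. The step-size calibration below is an assumption, not the paper's exact procedure.

```python
import numpy as np

def random_walk_infill(series, start, length, step_std=None, seed=0):
    """Return a copy of `series` with series[start:start+length] replaced by a
    random walk, yielding a synthetic untrustworthy segment for labeling."""
    rng = np.random.default_rng(seed)
    out = series.astype(float).copy()
    if step_std is None:
        step_std = np.std(np.diff(series))  # scale steps to the sensor's dynamics
    anchor = out[start - 1] if start > 0 else out[0]
    walk = anchor + np.cumsum(rng.normal(0.0, step_std, size=length))
    out[start:start + length] = walk
    return out

clean = np.sin(np.linspace(0, 8 * np.pi, 200)) * 10 + 20  # trustworthy sensor trace
corrupt = random_walk_infill(clean, start=80, length=40)  # labeled untrustworthy
```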

Story Visualization by Online Text Augmentation with Context Memory

  • paper_url: http://arxiv.org/abs/2308.07575
  • repo_url: https://github.com/yonseivnl/cmota
  • paper_authors: Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Dongyeop Kang, Jonghyun Choi
  • for: 提高 Story Visualization task 的效果,使得模型能够更好地从文本描述中提取视觉细节并在多句文本中保持上下文。
  • methods: 提出了一种用于双向 Transformer 框架的新型记忆架构,并在训练过程中使用在线文本增强,生成多个伪描述(pseudo-descriptions)作为补充性监督,以提升模型对推理阶段语言变化的泛化能力。
  • results: 在 Pororo-SV 和 Flintstones-SV 两个常用的故事可视化基准上,所提方法在 FID、角色 F1、帧准确率、BLEU-2/3 和 R-precision 等多项指标上显著优于现有最优方法,且计算复杂度相近或更低。
    Abstract Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity.
    摘要 故事可视化(SV)是一项具有挑战性的文本到图像生成任务:不仅需要从文本描述中渲染视觉细节,还需要编码跨越多个句子的长程上下文。已有工作大多关注为每个句子生成语义相关的图像,而如何编码整段文字中的上下文以生成上下文一致的图像(例如,人物正确、场景背景恰当)仍是难题。为此,我们提出一种用于双向 Transformer 框架的新型记忆架构,并配合在线文本增强,在训练期间生成多个伪描述作为补充性监督,以提升模型对推理阶段语言变化的泛化能力。在 Pororo-SV 和 Flintstones-SV 两个流行的 SV 基准上进行的大量实验表明,所提方法在 FID、角色 F1、帧准确率、BLEU-2/3 和 R-precision 等多项指标上显著优于现有最优方法,且计算复杂度相近或更低。

Action Class Relation Detection and Classification Across Multiple Video Datasets

  • paper_url: http://arxiv.org/abs/2308.07558
  • repo_url: None
  • paper_authors: Yuya Yoshikawa, Yutaro Shigeto, Masashi Shimbo, Akikazu Takeuchi
  • for: 通过跨数据集的动作类别关系,为视频人体动作识别提供数据集增强。
  • methods: 使用语言和视觉信息关联类别进行类别关系探测和分类
  • results: 使用预训练的最新神经网络模型对文本和视频进行预测,可以获得高度预测性能,并且文本标签预测性能高于视频预测,可以将多模态融合以提高预测性能。
    Abstract The Meta Video Dataset (MetaVD) provides annotated relations between action classes in major datasets for human action recognition in videos. Although these annotated relations enable dataset augmentation, it is only applicable to those covered by MetaVD. For an external dataset to enjoy the same benefit, the relations between its action classes and those in MetaVD need to be determined. To address this issue, we consider two new machine learning tasks: action class relation detection and classification. We propose a unified model to predict relations between action classes, using language and visual information associated with classes. Experimental results show that (i) pre-trained recent neural network models for texts and videos contribute to high predictive performance, (ii) the relation prediction based on action label texts is more accurate than based on videos, and (iii) a blending approach that combines predictions by both modalities can further improve the predictive performance in some cases.
    摘要 Meta 视频数据集(MetaVD)为主流视频人体动作识别数据集中的动作类别提供了带注释的类别间关系。这些注释关系可用于数据集增强,但仅适用于 MetaVD 覆盖的数据集;外部数据集若想获得同样的收益,就需要确定其动作类别与 MetaVD 中类别之间的关系。为了解决这个问题,我们考虑了两个新的机器学习任务:动作类别关系检测和分类。我们提出一种统一的模型,利用与类别相关的语言和视觉信息预测动作类别之间的关系。实验结果表明:(i)使用最新的文本和视频预训练神经网络模型可以获得较高的预测性能;(ii)基于动作标签文本的关系预测比基于视频的预测更准确;(iii)将两种模态的预测结果融合,在某些情况下可进一步提高预测性能。

Reinforcement Learning (RL) Augmented Cold Start Frequency Reduction in Serverless Computing

  • paper_url: http://arxiv.org/abs/2308.07541
  • repo_url: None
  • paper_authors: Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya
  • for: 本研究旨在通过强化学习在函数即服务(FaaS)平台上提前初始化函数,以降低冷启动频率。
  • methods: 本研究使用 Q-learning 算法,综合考虑函数 CPU 利用率、现有函数实例数和响应失败率等指标,基于预期需求提前初始化函数。
  • results: 对比 Kubeless 默认策略和函数保持活动策略,RL 算法能够提高吞吐量达到 8.81%,降低计算负担和资源浪费达到 55% 和 37%,这直接归结于减少冷启动。
    Abstract Function-as-a-Service is a cloud computing paradigm offering an event-driven execution model to applications. It features serverless attributes by eliminating resource management responsibilities from developers and offers transparent and on-demand scalability of applications. Typical serverless applications have stringent response time and scalability requirements and therefore rely on deployed services to provide quick and fault-tolerant feedback to clients. However, the FaaS paradigm suffers from cold starts as there is a non-negligible delay associated with on-demand function initialization. This work focuses on reducing the frequency of cold starts on the platform by using Reinforcement Learning. Our approach uses Q-learning and considers metrics such as function CPU utilization, existing function instances, and response failure rate to proactively initialize functions in advance based on the expected demand. The proposed solution was implemented on Kubeless and was evaluated using a normalised real-world function demand trace with matrix multiplication as the workload. The results demonstrate a favourable performance of the RL-based agent when compared to Kubeless' default policy and function keep-alive policy by improving throughput by up to 8.81% and reducing computation load and resource wastage by up to 55% and 37%, respectively, which is a direct outcome of reduced cold starts.
    摘要 函数即服务(FaaS)是一种云计算范式,为应用提供事件驱动的执行模型。它通过将资源管理职责从开发者手中剥离而具备无服务器(serverless)特性,并为应用提供透明的按需扩缩容。典型的无服务器应用对响应时间和可扩展性要求严格,因而依赖已部署的服务向客户端提供快速且容错的反馈。然而,FaaS 范式受冷启动困扰:按需初始化函数会带来不可忽视的延迟。本工作致力于利用强化学习降低平台上冷启动的频率。我们的方法采用 Q-learning,综合考虑函数 CPU 利用率、现有函数实例数和响应失败率等指标,基于预期需求提前初始化函数。该方案在 Kubeless 上实现,并使用归一化的真实函数需求轨迹(以矩阵乘法为工作负载)进行评估。结果表明,与 Kubeless 的默认策略和函数保活策略相比,基于 RL 的代理将吞吐量提升至多 8.81%,并将计算负载和资源浪费分别降低至多 55% 和 37%,这直接得益于冷启动的减少。
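
A toy tabular Q-learning loop over the metrics the paper mentions; the state discretization, action set, and reward shape below are illustrative assumptions, not the paper's.

```python
import random
from collections import defaultdict

ACTIONS = ("noop", "prewarm", "scale_down")
Q = defaultdict(float)                 # (state, action) -> value
alpha, gamma, eps = 0.1, 0.95, 0.1     # learning rate, discount, exploration

def choose(state):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def reward_fn(cold_starts, idle_instances):
    # Penalize cold starts heavily, idle pre-warmed instances lightly,
    # so the agent balances latency against resource wastage.
    return -5.0 * cold_starts - 0.5 * idle_instances

state = (2, 1, 0)   # (cpu-utilization bucket, live instances, failure-rate bucket)
a = choose(state)
update(state, a, reward_fn(cold_starts=1, idle_instances=0), (2, 2, 0))
```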

Domain Adaptation via Minimax Entropy for Real/Bogus Classification of Astronomical Alerts

  • paper_url: http://arxiv.org/abs/2308.07538
  • repo_url: None
  • paper_authors: Guillermo Cabrera-Vives, César Bolivar, Francisco Förster, Alejandra M. Muñoz Arancibia, Manuel Pérez-Carrasco, Esteban Reyes
  • for: 这篇论文研究领域自适应(Domain Adaptation)在天文数据分析中的应用,以提高天文警报真假分类的准确率。
  • methods: 该论文使用四个不同的数据集:HiTS、DES、ATLAS 和 ZTF,研究这些数据集之间的域偏移;并以一个简单的深度学习分类模型为基础,通过微调和基于极小极大熵(MME)的半监督深度领域自适应加以改进。
  • results: 研究发现,只要目标数据集中每个类别有至少一个标注样本,微调和 MME 两种方法都能显著改进基础模型;并且 MME 不会损害模型在源数据集上的性能。
    Abstract Time domain astronomy is advancing towards the analysis of multiple massive datasets in real time, prompting the development of multi-stream machine learning models. In this work, we study Domain Adaptation (DA) for real/bogus classification of astronomical alerts using four different datasets: HiTS, DES, ATLAS, and ZTF. We study the domain shift between these datasets, and improve a naive deep learning classification model by using a fine tuning approach and semi-supervised deep DA via Minimax Entropy (MME). We compare the balanced accuracy of these models for different source-target scenarios. We find that both the fine tuning and MME models improve significantly the base model with as few as one labeled item per class coming from the target dataset, but that the MME does not compromise its performance on the source dataset.
    摘要 时域天文学正朝着实时分析多个大规模数据集的方向发展,推动了多流机器学习模型的研发。在这项工作中,我们研究用于天文警报真假分类的领域自适应(DA),使用四个不同的数据集:HiTS、DES、ATLAS 和 ZTF。我们研究这些数据集之间的域偏移,并通过微调和基于极小极大熵(MME)的半监督深度领域自适应来改进一个朴素的深度学习分类模型。我们比较了这些模型在不同源-目标场景下的平衡准确率,发现只要目标数据集中每个类别有至少一个标注样本,微调和 MME 模型都能显著改进基础模型,且 MME 不会损害其在源数据集上的性能。
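
A minimal sketch of the minimax-entropy (MME) adversarial objective with a gradient-reversal layer; the temperature and λ values follow common MME practice and are illustrative here.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

def mme_loss(backbone_features, classifier, lam=0.1, T=0.05):
    """Entropy of classifier predictions on unlabeled target features.
    Minimizing -lam * H drives the classifier to *maximize* entropy, while the
    reversed gradient drives the feature extractor to *minimize* it -- the
    minimax game that aligns target features with class prototypes."""
    f = GradReverse.apply(F.normalize(backbone_features, dim=1))
    p = F.softmax(classifier(f) / T, dim=1)
    H = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return -lam * H  # add to the supervised loss before backward()
```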

KMF: Knowledge-Aware Multi-Faceted Representation Learning for Zero-Shot Node Classification

  • paper_url: http://arxiv.org/abs/2308.08563
  • repo_url: None
  • paper_authors: Likang Wu, Junji Jiang, Hongke Zhao, Hao Wang, Defu Lian, Mengdi Zhang, Enhong Chen
  • for: Zero-Shot Node Classification (ZNC) 任务在图数据分析中,预测从训练过程中未经见过的节点。
  • methods: Knowledge-Aware Multi-Faceted (KMF) 框架,通过提取的知识图(KG)来增强标签 semantics,并将节点内容重建到一个话题级别表示。
  • results: 在多个公开图数据集上进行了广泛的实验,并设计了零样本跨领域推荐应用,与最先进基线的比较证明了 KMF 的有效性和泛化能力。
    Abstract Recently, Zero-Shot Node Classification (ZNC) has been an emerging and crucial task in graph data analysis. This task aims to predict nodes from unseen classes which are unobserved in the training process. Existing work mainly utilizes Graph Neural Networks (GNNs) to associate features' prototypes and labels' semantics thus enabling knowledge transfer from seen to unseen classes. However, the multi-faceted semantic orientation in the feature-semantic alignment has been neglected by previous work, i.e. the content of a node usually covers diverse topics that are relevant to the semantics of multiple labels. It's necessary to separate and judge the semantic factors that tremendously affect the cognitive ability to improve the generality of models. To this end, we propose a Knowledge-Aware Multi-Faceted framework (KMF) that enhances the richness of label semantics via the extracted KG (Knowledge Graph)-based topics. And then the content of each node is reconstructed to a topic-level representation that offers multi-faceted and fine-grained semantic relevancy to different labels. Due to the particularity of the graph's instance (i.e., node) representation, a novel geometric constraint is developed to alleviate the problem of prototype drift caused by node information aggregation. Finally, we conduct extensive experiments on several public graph datasets and design an application of zero-shot cross-domain recommendation. The quantitative results demonstrate both the effectiveness and generalization of KMF with the comparison of state-of-the-art baselines.
    摘要 近来,零样本节点分类(ZNC)已成为图数据分析中一项新兴而关键的任务,其目标是预测训练过程中未出现过的类别的节点。现有工作主要利用图神经网络(GNN)将特征原型与标签语义相关联,从而实现从已见类别到未见类别的知识迁移。然而,先前工作忽略了特征-语义对齐中多面向的语义取向,即一个节点的内容通常涵盖与多个标签语义相关的多样主题。为提高模型的泛化能力,有必要分离并判别那些对认知能力影响巨大的语义因素。为此,我们提出一种知识感知多面框架(KMF),通过抽取的基于知识图(KG)的主题来增强标签语义的丰富性,并将每个节点的内容重构为主题级表示,从而提供对不同标签的多面、细粒度的语义相关性。鉴于图实例(即节点)表示的特殊性,我们设计了一种新的几何约束,以缓解节点信息聚合导致的原型漂移问题。最后,我们在多个公开图数据集上开展了大量实验,并设计了零样本跨领域推荐的应用。定量结果表明,与最先进基线相比,KMF 兼具有效性与泛化能力。

Nonlinearity, Feedback and Uniform Consistency in Causal Structural Learning

  • paper_url: http://arxiv.org/abs/2308.07520
  • repo_url: None
  • paper_authors: Shuyan Wang
  • for: 这个论文的目的是找到自动搜索方法,以便从观察数据中学习 causal structure。
  • methods: 这个论文使用的方法包括提出一种弱 faithfulness 定义,以及一种修改后的 causal discovery 算法,以relaxing Various simplification assumptions,使其适用于更广泛的 causal mechanism 和统计现象。
  • results: 这个论文的结果表明,使用修改后的 causal discovery 算法,可以在不同的 distributive 下学习 causal structure,并且可以找到 latent variables 的 causal connections。
    Abstract The goal of Causal Discovery is to find automated search methods for learning causal structures from observational data. In some cases all variables of the interested causal mechanism are measured, and the task is to predict the effects one measured variable has on another. In contrast, sometimes the variables of primary interest are not directly observable but instead inferred from their manifestations in the data. These are referred to as latent variables. One commonly known example is the psychological construct of intelligence, which cannot directly measured so researchers try to assess through various indicators such as IQ tests. In this case, casual discovery algorithms can uncover underlying patterns and structures to reveal the causal connections between the latent variables and between the latent and observed variables. This thesis focuses on two questions in causal discovery: providing an alternative definition of k-Triangle Faithfulness that (i) is weaker than strong faithfulness when applied to the Gaussian family of distributions, (ii) can be applied to non-Gaussian families of distributions, and (iii) under the assumption that the modified version of Strong Faithfulness holds, can be used to show the uniform consistency of a modified causal discovery algorithm; relaxing the sufficiency assumption to learn causal structures with latent variables. Given the importance of inferring cause-and-effect relationships for understanding and forecasting complex systems, the work in this thesis of relaxing various simplification assumptions is expected to extend the causal discovery method to be applicable in a wider range with diversified causal mechanism and statistical phenomena.
    摘要 因果发现的目标是寻找自动化搜索方法,从观测数据中学习因果结构。在某些情况下,所关注因果机制的全部变量都可被测量,任务是预测一个被测变量对另一个被测变量的影响;而在另一些情况下,主要关注的变量无法直接观测,只能从其在数据中的表现推断出来,这类变量称为潜变量。一个众所周知的例子是心理学中的智力构念:它无法直接测量,研究者只能通过 IQ 测试等各种指标来评估。在这种情况下,因果发现算法可以挖掘潜在的模式与结构,揭示潜变量之间以及潜变量与观测变量之间的因果联系。本论文聚焦因果发现中的两个问题:其一,给出 k-三角忠实性(k-Triangle Faithfulness)的一种替代定义,它(i)在高斯分布族上弱于强忠实性,(ii)可应用于非高斯分布族,且(iii)在修改版强忠实性假设成立时,可用于证明修改后因果发现算法的一致收敛性;其二,放宽充分性假设,以学习含潜变量的因果结构。鉴于推断因果关系对理解和预测复杂系统的重要性,本论文放宽各类简化假设的工作,有望将因果发现方法扩展到具有多样因果机制和统计现象的更广泛场景。

Boosting Semi-Supervised Learning by bridging high and low-confidence predictions

  • paper_url: http://arxiv.org/abs/2308.07509
  • repo_url: None
  • paper_authors: Khanh-Binh Nguyen, Joon-Sung Yang
  • for: 本研究旨在解决 Pseudo-labeling 方法中的三大问题,提高 semi-supervised learning 的性能和泛化能力。
  • methods: 本研究提出了一种新的 ReFixMatch 方法,通过在训练中充分利用全部无标注数据来提高模型的泛化能力和性能。
  • results: 在 ImageNet 上使用 10 万个标注样本,ReFixMatch 实现了 41.05% 的 top-1 准确率,超过 FixMatch 和当前最先进的方法。
    Abstract Pseudo-labeling is a crucial technique in semi-supervised learning (SSL), where artificial labels are generated for unlabeled data by a trained model, allowing for the simultaneous training of labeled and unlabeled data in a supervised setting. However, several studies have identified three main issues with pseudo-labeling-based approaches. Firstly, these methods heavily rely on predictions from the trained model, which may not always be accurate, leading to a confirmation bias problem. Secondly, the trained model may be overfitted to easy-to-learn examples, ignoring hard-to-learn ones, resulting in the \textit{"Matthew effect"} where the already strong become stronger and the weak weaker. Thirdly, most of the low-confidence predictions of unlabeled data are discarded due to the use of a high threshold, leading to an underutilization of unlabeled data during training. To address these issues, we propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training, thus improving the generalizability of the model and performance on SSL benchmarks. Notably, ReFixMatch achieves 41.05\% top-1 accuracy with 100k labeled examples on ImageNet, outperforming the baseline FixMatch and current state-of-the-art methods.
    摘要 伪标注是半监督学习(SSL)中的一项关键技术:由训练好的模型为无标注数据生成人工标签,从而可以在监督设置下同时利用有标注和无标注数据进行训练。然而,多项研究指出基于伪标注的方法存在三大问题。第一,这类方法严重依赖训练模型的预测,而预测未必总是准确,会导致确认偏差问题。第二,训练模型可能过拟合易学样本而忽略难学样本,产生“马太效应”——强者愈强,弱者愈弱。第三,由于使用高阈值,大多数低置信度的无标注预测被丢弃,导致训练中无标注数据利用不足。为解决这些问题,我们提出一种名为 ReFixMatch 的新方法,旨在训练中利用全部无标注数据,从而提高模型的泛化能力和在 SSL 基准上的性能。值得注意的是,ReFixMatch 在 ImageNet 上使用 10 万个标注样本取得了 41.05% 的 top-1 准确率,优于基线 FixMatch 和当前最先进的方法。
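
A hedged sketch contrasting FixMatch's thresholded hard pseudo-labels with a ReFixMatch-style variant that recycles the low-confidence predictions as soft targets, so no unlabeled sample is wasted; the paper's exact recycling rule may differ from this form.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits_weak, logits_strong, threshold=0.95):
    """logits_weak / logits_strong: model outputs on weakly / strongly
    augmented views of the same unlabeled batch (FixMatch convention)."""
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=1)
        conf, hard = probs.max(dim=1)
        mask = (conf >= threshold).float()   # FixMatch keeps only these
    log_p_strong = F.log_softmax(logits_strong, dim=1)
    # Confident samples: hard cross-entropy, as in FixMatch.
    hard_loss = (F.nll_loss(log_p_strong, hard, reduction="none") * mask).mean()
    # Low-confidence samples: reuse the full distribution as a soft target
    # instead of discarding them (the ReFixMatch-style modification).
    soft_loss = (-(probs * log_p_strong).sum(dim=1) * (1.0 - mask)).mean()
    return hard_loss + soft_loss
```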

Detecting The Corruption Of Online Questionnaires By Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2308.07499
  • repo_url: None
  • paper_authors: Benjamin Lebrun, Sharon Temtsin, Andrew Vonasch, Christoph Bartneck
  • for: 这个研究是为了检测在在线问卷中使用人工智能生成的文本是否可以被识别出来。
  • methods: 这个研究使用了人类和自动AI检测系统来检测文本的作者性。
  • results: 人类参与者能以高于随机水平的准确率(76%)识别文本作者,但这仍不足以保证数据质量;自动 AI 检测系统则完全不可用。如果 AI 提交的回答过于普遍,检测伪造提交的成本将超过在线问卷带来的收益。这一问题只能由众包平台从系统层面解决。
    Abstract Online questionnaires that use crowd-sourcing platforms to recruit participants have become commonplace, due to their ease of use and low costs. Artificial Intelligence (AI) based Large Language Models (LLM) have made it easy for bad actors to automatically fill in online forms, including generating meaningful text for open-ended tasks. These technological advances threaten the data quality for studies that use online questionnaires. This study tested if text generated by an AI for the purpose of an online study can be detected by both humans and automatic AI detection systems. While humans were able to correctly identify authorship of text above chance level (76 percent accuracy), their performance was still below what would be required to ensure satisfactory data quality. Researchers currently have to rely on the disinterest of bad actors to successfully use open-ended responses as a useful tool for ensuring data quality. Automatic AI detection systems are currently completely unusable. If AIs become too prevalent in submitting responses then the costs associated with detecting fraudulent submissions will outweigh the benefits of online questionnaires. Individual attention checks will no longer be a sufficient tool to ensure good data quality. This problem can only be systematically addressed by crowd-sourcing platforms. They cannot rely on automatic AI detection systems and it is unclear how they can ensure data quality for their paying clients.
    摘要 借助众包平台招募参与者的在线问卷因其易用性和低成本而变得普遍。基于人工智能(AI)的大语言模型(LLM)使恶意行为者可以轻易地自动填写在线表单,包括为开放式问题生成有意义的文本。这些技术进步威胁着使用在线问卷的研究的数据质量。本研究测试了为在线研究目的由 AI 生成的文本能否被人类和自动 AI 检测系统识别。人类能以高于随机水平的准确率(76%)正确识别文本作者,但这一表现仍低于保证数据质量所需的水平。目前,研究者只能寄希望于恶意行为者缺乏作假动机,才能将开放式回答作为保证数据质量的有效工具;而自动 AI 检测系统目前完全不可用。如果 AI 提交的回答过于普遍,检测伪造提交的成本将超过在线问卷的收益,个别的注意力检查也将不再足以保证良好的数据质量。这一问题只能由众包平台系统性地解决:它们无法依赖自动 AI 检测系统,而如何为付费客户保证数据质量也尚不明朗。

DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation

  • paper_url: http://arxiv.org/abs/2308.07498
  • repo_url: https://github.com/HanqingWangAI/Dreamwalker
  • paper_authors: Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang
  • for: DREAMWALKER 是一个基于世界模型的 VLN-CE 智能体,借助“思维实验”在可自由穿行的环境中进行规划和策略决策。
  • methods: 构建世界模型,将环境的视觉、拓扑和动态特性概括为离散、结构化且紧凑的表示;DREAMWALKER 在执行代价高昂的动作之前,先在这一内部抽象世界中模拟并评估可能的计划。
  • results: 在 VLN-CE 数据集上的大量实验和消融研究证实了所提方法的有效性,并为后续工作指出了富有前景的方向。
    Abstract VLN-CE is a recently released embodied task, where AI agents need to navigate a freely traversable environment to reach a distant target location, given language instructions. It poses great challenges due to the huge space of possible strategies. Driven by the belief that the ability to anticipate the consequences of future actions is crucial for the emergence of intelligent and interpretable planning behavior, we propose DREAMWALKER -- a world model based VLN-CE agent. The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment into a discrete, structured, and compact representation. DREAMWALKER can simulate and evaluate possible plans entirely in such internal abstract world, before executing costly actions. As opposed to existing model-free VLN-CE agents simply making greedy decisions in the real world, which easily results in shortsighted behaviors, DREAMWALKER is able to make strategic planning through large amounts of ``mental experiments.'' Moreover, the imagined future scenarios reflect our agent's intention, making its decision-making process more transparent. Extensive experiments and ablation studies on VLN-CE dataset confirm the effectiveness of the proposed approach and outline fruitful directions for future work.
    摘要 VLN-CE 是最近发布的一项具身任务:AI 智能体需要根据语言指令,在可自由穿行的环境中导航至远处的目标位置。由于可能的策略空间巨大,该任务极具挑战性。我们相信,预判未来动作后果的能力对产生智能且可解释的规划行为至关重要,因此提出了基于世界模型的 VLN-CE 智能体 DREAMWALKER。世界模型将复杂连续环境的视觉、拓扑和动态特性概括为离散、结构化且紧凑的表示;DREAMWALKER 可以完全在这一内部抽象世界中模拟并评估可能的计划,然后再执行代价高昂的动作。与现有的无模型 VLN-CE 智能体只在真实世界中做贪心决策(容易导致短视行为)不同,DREAMWALKER 能够通过大量“思维实验”进行策略性规划;同时,想象出的未来场景反映了智能体的意图,使其决策过程更加透明。在 VLN-CE 数据集上的大量实验和消融研究证实了所提方法的有效性,并为未来工作指出了富有前景的方向。

ST-MLP: A Cascaded Spatio-Temporal Linear Framework with Channel-Independence Strategy for Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2308.07496
  • repo_url: None
  • paper_authors: Zepu Wang, Yuqi Nie, Peng Sun, Nam H. Nguyen, John Mulvey, H. Vincent Poor
  • for: 优化智能交通系统(ITS)中的交通流量管理所需的交通流量预测。
  • methods: 仅使用级联的多层感知机(MLP)模块和线性层,同时融合时间信息、空间信息和预定义的图结构,并采用通道独立策略。
  • results: 与最先进的 STGNN 和其他模型相比,ST-MLP 在准确率和计算效率上均表现更优。
    Abstract The criticality of prompt and precise traffic forecasting in optimizing traffic flow management in Intelligent Transportation Systems (ITS) has drawn substantial scholarly focus. Spatio-Temporal Graph Neural Networks (STGNNs) have been lauded for their adaptability to road graph structures. Yet, current research on STGNNs architectures often prioritizes complex designs, leading to elevated computational burdens with only minor enhancements in accuracy. To address this issue, we propose ST-MLP, a concise spatio-temporal model solely based on cascaded Multi-Layer Perceptron (MLP) modules and linear layers. Specifically, we incorporate temporal information, spatial information and predefined graph structure with a successful implementation of the channel-independence strategy - an effective technique in time series forecasting. Empirical results demonstrate that ST-MLP outperforms state-of-the-art STGNNs and other models in terms of accuracy and computational efficiency. Our finding encourages further exploration of more concise and effective neural network architectures in the field of traffic forecasting.
    摘要 及时而精准的交通预测对优化智能交通系统(ITS)中的交通流管理至关重要,已引起大量学术关注。时空图神经网络(STGNN)因能适应道路图结构而备受推崇,但目前关于 STGNN 架构的研究往往偏重复杂设计,在准确率仅有小幅提升的同时带来了高昂的计算负担。为解决这一问题,我们提出 ST-MLP——一种仅基于级联多层感知机(MLP)模块和线性层的简洁时空模型。具体而言,我们将时间信息、空间信息和预定义的图结构相融合,并成功运用了通道独立策略——一种在时间序列预测中行之有效的技术。实证结果表明,ST-MLP 在准确率和计算效率上均优于最先进的 STGNN 和其他模型。我们的发现鼓励在交通预测领域进一步探索更简洁而有效的神经网络架构。
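
A minimal reading of the channel-independence strategy: one weight-shared MLP maps each sensor's own history to its forecast, so channels never mix inside the block (the full ST-MLP cascades several such modules with spatial and graph components not shown here).

```python
import torch
import torch.nn as nn

class ChannelIndependentMLP(nn.Module):
    """Maps (batch, channels, lookback) -> (batch, channels, horizon).
    nn.Linear acts on the last dim only, so every channel (road sensor) is
    forecast from its own history with shared weights."""
    def __init__(self, lookback: int, horizon: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lookback, hidden), nn.ReLU(), nn.Linear(hidden, horizon))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ChannelIndependentMLP(lookback=12, horizon=3)
y = model(torch.randn(8, 207, 12))  # e.g. 207 sensors, METR-LA-style data
print(y.shape)                      # torch.Size([8, 207, 3])
```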

Omega-Regular Reward Machines

  • paper_url: http://arxiv.org/abs/2308.07469
  • repo_url: None
  • paper_authors: Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
  • for: 这篇论文探讨如何在强化学习中表达超出马尔可夫假设的复杂学习目标,以设计合适的奖励机制。
  • methods: 论文将奖励机(reward machine)与 ω-正则语言相结合,提出 ω-正则奖励机,用于表达非马尔可夫奖励。
  • results: 论文提出一种无模型强化学习算法,用于计算针对 ω-正则奖励机的 ε-最优策略,并通过实验验证了该算法的有效性。
    Abstract Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and omega-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces omega-regular reward machines, which integrate reward machines with omega-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute epsilon-optimal strategies against omega-egular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.
    摘要 强化学习(RL)是训练智能体完成任务的有力方法,但设计合适的奖励机制是其成功的关键。在许多情况下,学习目标的复杂性超出了马尔可夫假设的能力范围,需要更精巧的奖励机制。奖励机(reward machine)和 ω-正则语言是分别用于表达定量与定性目标的非马尔可夫奖励的两种形式化方法。本文提出 ω-正则奖励机,将奖励机与 ω-正则语言相结合,为 RL 提供富有表达力且有效的奖励机制。我们提出一种无模型 RL 算法,用于计算针对 ω-正则奖励机的 ε-最优策略,并通过实验评估了所提算法的有效性。
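
For intuition, a minimal (finite-trace) reward machine is sketched below: a finite automaton over atomic propositions that emits rewards the MDP state alone cannot express. ω-regular reward machines extend this automaton view with ω-regular acceptance conditions over infinite runs, which this toy example does not capture.

```python
class RewardMachine:
    def __init__(self, transitions, start):
        self.transitions = transitions  # (state, label) -> (next_state, reward)
        self.state = start

    def step(self, label):
        # Unlisted (state, label) pairs self-loop with zero reward.
        self.state, reward = self.transitions.get(
            (self.state, label), (self.state, 0.0))
        return reward

# Non-Markovian objective: reaching `goal` pays off only after `key` was seen.
rm = RewardMachine(
    transitions={
        ("u0", "key"): ("u1", 0.0),   # remember the key was collected
        ("u1", "goal"): ("u2", 1.0),  # goal rewards only in the post-key state
    },
    start="u0",
)
print([rm.step(l) for l in ["goal", "key", "goal"]])  # [0.0, 0.0, 1.0]
```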

Playing with Words: Comparing the Vocabulary and Lexical Richness of ChatGPT and Humans

  • paper_url: http://arxiv.org/abs/2308.07462
  • repo_url: None
  • paper_authors: Pedro Reviriego, Javier Conde, Elena Merino-Gómez, Gonzalo Martínez, José Alberto Hernández
  • for: 这项研究目的是对 chatGPT 和人类回答相同任务的 vocabulary 和语言丰富度进行比较。
  • methods: 研究使用两个数据集,每个数据集包含不同类型的问题,chatGPT 和人类回答的答案,并对每个数据集进行 vocabulary 和语言丰富度的分析。
  • results: 初步结果显示,chatGPT 使用的单词数量和语言丰富度比人类低。
    Abstract The introduction of Artificial Intelligence (AI) generative language models such as GPT (Generative Pre-trained Transformer) and tools such as ChatGPT has triggered a revolution that can transform how text is generated. This has many implications, for example, as AI-generated text becomes a significant fraction of the text in many disciplines, would this have an effect on the language capabilities of readers and also on the training of newer AI tools? Would it affect the evolution of languages? Focusing on one specific aspect of the language: words; will the use of tools such as ChatGPT increase or reduce the vocabulary used or the lexical richness (understood as the number of different words used in a written or oral production) when writing a given text? This has implications for words, as those not included in AI-generated content will tend to be less and less popular and may eventually be lost. In this work, we perform an initial comparison of the vocabulary and lexical richness of ChatGPT and humans when performing the same tasks. In more detail, two datasets containing the answers to different types of questions answered by ChatGPT and humans are used, and the analysis shows that ChatGPT tends to use fewer distinct words and lower lexical richness than humans. These results are very preliminary and additional datasets and ChatGPT configurations have to be evaluated to extract more general conclusions. Therefore, further research is needed to understand how the use of ChatGPT and more broadly generative AI tools will affect the vocabulary and lexical richness in different types of text and languages.
    摘要 GPT(生成式预训练 Transformer)等人工智能(AI)生成式语言模型及 ChatGPT 等工具的出现引发了一场可能改变文本生成方式的革命。这带来许多影响,例如:当 AI 生成文本在许多领域占据相当比例时,是否会影响读者的语言能力以及新一代 AI 工具的训练?是否会影响语言的演化?聚焦语言的一个具体方面——词汇:使用 ChatGPT 等工具会增加还是减少写作中使用的词汇量或词汇丰富度(即书面或口头表达中使用的不同单词数量)?这关系到词汇的命运——未被 AI 生成内容收录的词汇将日渐冷僻,甚至最终消亡。在这项工作中,我们对 ChatGPT 与人类在执行相同任务时的词汇量和词汇丰富度进行了初步比较。具体而言,我们使用两个分别包含 ChatGPT 与人类对不同类型问题作答的数据集,分析表明 ChatGPT 倾向于使用更少的不同单词和更低的词汇丰富度。这些结果还非常初步,需要评估更多数据集和 ChatGPT 配置才能得出更一般的结论。因此,还需进一步研究 ChatGPT 乃至更广泛的生成式 AI 工具将如何影响不同类型文本和语言中的词汇量与词汇丰富度。
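
The measurement involved is straightforward; below is a small sketch of distinct-word counts and type-token-ratio-style richness (the paper's exact metrics and tokenization may differ).

```python
import re

def lexical_stats(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    types = set(words)
    return {
        "tokens": len(words),
        "distinct_words": len(types),
        "type_token_ratio": len(types) / max(len(words), 1),
        # Root TTR is less sensitive to text length than plain TTR.
        "root_ttr": len(types) / max(len(words), 1) ** 0.5,
    }

human = "The quick brown fox jumps over the lazy dog near the riverbank."
model = "The fox jumps over the dog and the fox jumps over the dog again."
print(lexical_stats(human)["distinct_words"],
      lexical_stats(model)["distinct_words"])
```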

Inductive Knowledge Graph Completion with GNNs and Rules: An Analysis

  • paper_url: http://arxiv.org/abs/2308.07942
  • repo_url: https://github.com/anilakash/indkgc
  • paper_authors: Akash Anil, Víctor Gutiérrez-Basulto, Yazmín Ibañéz-García, Steven Schockaert
  • for: 这篇论文研究归纳式知识图谱补全任务,即从训练图谱中学习推理模式,并将其应用于不相交的测试图谱进行预测。
  • methods: 基于规则的方法看似天然适合这一任务,但在实践中其表现远逊于 NBFNet 等基于图神经网络(GNN)的最新方法。作者认为这源于两个因素:(i)不合理的实体完全没有参与排序;(ii)确定链接预测答案的置信度时只考虑了信息量最大的一条路径。为此,作者研究了专门针对这两个问题的若干规则方法变体。
  • results: 研究发现,这些变体的性能可以接近 NBFNet,而它们只使用了 NBFNet 所依赖证据的一小部分,因而在很大程度上保留了规则方法的可解释性优势。此外,作者还发现,进一步考虑整个知识图谱的变体可以稳定地超越 NBFNet。
    摘要 归纳式知识图谱补全任务要求模型从训练图谱中学习推理模式,再用于在不相交的测试图谱上进行预测。基于规则的方法看似天然适合这一任务,但在实践中其表现明显逊于 NBFNet 等基于图神经网络(GNN)的最新方法。我们推测这源于两个因素:(i)不合理的实体完全没有参与排序;(ii)确定链接预测答案的置信度时只考虑了信息量最大的一条路径。为分析这些因素的影响,我们研究了专门针对上述问题的若干规则方法变体。我们发现,所得模型的性能可以接近 NBFNet;关键在于,这些变体只使用了 NBFNet 所依赖证据的一小部分,因而在很大程度上保留了规则方法的可解释性优势。此外,我们还表明,进一步考虑整个知识图谱的变体可以稳定地超越 NBFNet。
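
To see why the aggregation choice matters, compare best-path scoring (the weakness identified above) with noisy-or accumulation, one natural form a path-combining variant can take; whether the paper's variants use exactly this rule is an assumption.

```python
def score_max(path_confidences):
    """Only the single most informative rule path counts."""
    return max(path_confidences, default=0.0)

def score_noisy_or(path_confidences):
    """Treat each supporting path as independent evidence, so corroborating
    paths raise the overall confidence in the candidate link."""
    score = 1.0
    for c in path_confidences:
        score *= (1.0 - c)
    return 1.0 - score

paths = [0.6, 0.5, 0.4]       # three rules support the same candidate link
print(score_max(paths))       # 0.6  -- extra evidence is ignored
print(score_noisy_or(paths))  # 0.88 -- independent evidence accumulates
```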

Artificial Intelligence for Smart Transportation

  • paper_url: http://arxiv.org/abs/2308.07457
  • repo_url: https://github.com/SarmisthaDutta/application-of-artificial-Intelligence-for-future-sustainable-smart-city
  • paper_authors: Michael Wilbur, Amutheezan Sivagnanam, Afiya Ayman, Samitha Samaranayeke, Abhishek Dubey, Aron Laszka
  • for: 提高公共交通系统的效率和使用率,以满足社会发展和人类价值创造的需求。
  • methods: 利用人工智能技术,提供数据驱动的智能交通系统,包括数据收集、人工智能决策支持和计算机科学问题解决方案。
  • results: 通过对交通系统的数据分析和人工智能技术应用,提高交通系统的效率和使用率,为社会发展和人类价值创造提供可能性。
    Abstract There are more than 7,000 public transit agencies in the U.S. (and many more private agencies), and together, they are responsible for serving 60 billion passenger miles each year. A well-functioning transit system fosters the growth and expansion of businesses, distributes social and economic benefits, and links the capabilities of community members, thereby enhancing what they can accomplish as a society. Since affordable public transit services are the backbones of many communities, this work investigates ways in which Artificial Intelligence (AI) can improve efficiency and increase utilization from the perspective of transit agencies. This book chapter discusses the primary requirements, objectives, and challenges related to the design of AI-driven smart transportation systems. We focus on three major topics. First, we discuss data sources and data. Second, we provide an overview of how AI can aid decision-making with a focus on transportation. Lastly, we discuss computational problems in the transportation domain and AI approaches to these problems.
    摘要 美国有超过 7,000 个公共交通机构(私营机构数量更多),它们每年共承担 600 亿乘客英里的运输任务。运转良好的公共交通系统能促进企业的成长与扩张、分配社会和经济效益,并连接社区成员的能力,从而提升整个社会的成就能力。由于平价的公共交通服务是许多社区的支柱,本工作从交通机构的视角研究人工智能(AI)如何提高效率、提升利用率。本章讨论 AI 驱动的智能交通系统设计的主要需求、目标与挑战,重点围绕三个主题:第一,数据来源与数据;第二,概述 AI 如何辅助交通领域的决策;第三,交通领域的计算问题及相应的 AI 方法。

GRU-D-Weibull: A Novel Real-Time Individualized Endpoint Prediction

  • paper_url: http://arxiv.org/abs/2308.07452
  • repo_url: None
  • paper_authors: Xiaoyang Ruan, Liwei Wang, Charat Thongprayoon, Wisit Cheungpasitporn, Hongfang Liu
  • for: 这份研究旨在开发一种新方法 GRU-D-Weibull,用于预测个体层面的终点事件及其发生时间,从而实现实时的个体化终点预测和群体层面的风险管理。
  • methods: 该方法将带衰减的门控循环单元(GRU-D)与威布尔(Weibull)分布建模相结合:由循环网络输出威布尔分布的参数,实现实时个体化终点预测。
  • results: 在终点预测中,GRU-D-Weibull 的 C 指数在索引日期约为 0.7,并在 4.3 年随访期内升至约 0.77,与随机生存森林相当;其 L1 损失在 CKD4 索引日期约为 1.1 年(SD 0.95),随访 4 年时最低约 0.45 年(SD 0.3),显著优于竞争方法。此外,GRU-D-Weibull 还能在整个随访期内将事件发生时刻的预测生存概率约束在更小且更稳定的范围内。
    Abstract Accurate prediction models for individual-level endpoints and time-to-endpoints are crucial in clinical practice. In this study, we propose a novel approach, GRU-D-Weibull, which combines gated recurrent units with decay (GRU-D) to model the Weibull distribution. Our method enables real-time individualized endpoint prediction and population-level risk management. Using a cohort of 6,879 patients with stage 4 chronic kidney disease (CKD4), we evaluated the performance of GRU-D-Weibull in endpoint prediction. The C-index of GRU-D-Weibull was ~0.7 at the index date and increased to ~0.77 after 4.3 years of follow-up, similar to random survival forest. Our approach achieved an absolute L1-loss of ~1.1 years (SD 0.95) at the CKD4 index date and a minimum of ~0.45 years (SD0.3) at 4 years of follow-up, outperforming competing methods significantly. GRU-D-Weibull consistently constrained the predicted survival probability at the time of an event within a smaller and more fixed range compared to other models throughout the follow-up period. We observed significant correlations between the error in point estimates and missing proportions of input features at the index date (correlations from ~0.1 to ~0.3), which diminished within 1 year as more data became available. By post-training recalibration, we successfully aligned the predicted and observed survival probabilities across multiple prediction horizons at different time points during follow-up. Our findings demonstrate the considerable potential of GRU-D-Weibull as the next-generation architecture for endpoint risk management, capable of generating various endpoint estimates for real-time monitoring using clinical data.

Open-set Face Recognition using Ensembles trained on Clustered Data

  • paper_url: http://arxiv.org/abs/2308.07445
  • repo_url: None
  • paper_authors: Rafael Henrique Vareto, William Robson Schwartz
  • for: The purpose of this paper is to develop a scalable open-set face recognition approach that can handle large numbers of unfamiliar faces.
  • methods: The method uses clustering and an ensemble of binary learning algorithms to classify query face samples and retrieve their correct identity from a gallery of hundreds or thousands of subjects.
  • results: Experimental results show that the approach achieves competitive performance even when targeting scalability, handling large numbers of unfamiliar faces.
    Abstract Open-set face recognition describes a scenario where unknown subjects, unseen during the training stage, appear at test time. Not only does it require methods that accurately identify individuals of interest, but it also demands approaches that effectively deal with unfamiliar faces. This work details a scalable open-set face identification approach to galleries composed of hundreds and thousands of subjects. It is composed of clustering and an ensemble of binary learning algorithms that estimates when query face samples belong to the face gallery and then retrieves their correct identity. The approach selects the most suitable gallery subjects and uses the ensemble to improve prediction performance. We carry out experiments on well-known LFW and YTF benchmarks. Results show that competitive performance can be achieved even when targeting scalability.
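    The clustering-plus-ensemble pipeline can be sketched in a few lines of scikit-learn. The gallery embeddings, cluster count, and rejection threshold below are stand-ins, and the paper's exact expert design may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

# toy stand-ins: gallery face embeddings and their identity labels
# (assumption: embeddings come from any pretrained face encoder)
rng = np.random.default_rng(0)
gallery = rng.normal(size=(600, 128))
identities = rng.integers(0, 50, size=600)

# 1) cluster the gallery so each binary expert sees a coherent subset
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(gallery)

# 2) one binary expert per cluster: "does a sample look like my cluster?"
experts = [LinearSVC().fit(gallery, (clusters == c).astype(int))
           for c in range(8)]

def identify(query, threshold=0.0):
    """Ensemble vote: if no expert claims the query, reject it as unknown;
    otherwise retrieve the nearest gallery identity inside the claimed cluster."""
    scores = np.array([e.decision_function(query[None])[0] for e in experts])
    if scores.max() < threshold:
        return None                            # open-set rejection
    members = np.where(clusters == scores.argmax())[0]
    dists = np.linalg.norm(gallery[members] - query, axis=1)
    return identities[members[dists.argmin()]]
```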

The Performance of Transferability Metrics does not Translate to Medical Tasks

  • paper_url: http://arxiv.org/abs/2308.07444
  • repo_url: None
  • paper_authors: Levy Chaves, Alceu Bissoto, Eduardo Valle, Sandra Avila
  • for: Assess whether transferability scores, used to cheaply choose deep learning architectures for a target dataset, actually work in medical image analysis.
  • methods: Thoroughly evaluates seven transferability scoring methods across three medical image analysis applications, including out-of-distribution scenarios.
  • results: No transferability score reliably and consistently estimates target performance in medical contexts, inviting further work in that direction.
    Abstract Transfer learning boosts the performance of medical image analysis by enabling deep learning (DL) on small datasets through the knowledge acquired from large ones. As the number of DL architectures explodes, exhaustively attempting all candidates becomes unfeasible, motivating cheaper alternatives for choosing them. Transferability scoring methods emerge as an enticing solution, allowing to efficiently calculate a score that correlates with the architecture accuracy on any target dataset. However, since transferability scores have not been evaluated on medical datasets, their use in this context remains uncertain, preventing them from benefiting practitioners. We fill that gap in this work, thoroughly evaluating seven transferability scores in three medical applications, including out-of-distribution scenarios. Despite promising results in general-purpose datasets, our results show that no transferability score can reliably and consistently estimate target performance in medical contexts, inviting further work in that direction.
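    As an illustration of what a transferability score computes, below is LEEP (Nguyen et al., 2020), one widely used score of this family, estimated purely from the source model's predictions on the target data; whether it is among the paper's seven evaluated scores is not stated here.

```python
import numpy as np

def leep(source_probs, target_labels):
    """LEEP transferability score (Nguyen et al., 2020): the average
    log-likelihood of target labels under the empirical conditional
    distribution P(target label | source prediction).

    source_probs:  (n, Z) softmax outputs of the pretrained source model
    target_labels: (n,) integer labels of the target task
    """
    n, _ = source_probs.shape
    classes = target_labels.max() + 1
    # joint empirical distribution P(y, z)
    joint = np.zeros((classes, source_probs.shape[1]))
    for y in range(classes):
        joint[y] = source_probs[target_labels == y].sum(axis=0) / n
    marginal_z = joint.sum(axis=0, keepdims=True)          # P(z)
    cond = joint / np.clip(marginal_z, 1e-12, None)        # P(y | z)
    # expected empirical predictor: sum_z P(y|z) * theta_z(x)
    eep = source_probs @ cond.T                            # (n, classes)
    return np.log(np.clip(eep[np.arange(n), target_labels], 1e-12, None)).mean()
```

    A higher LEEP score suggests the source model's representation transfers more easily to the target task; the paper's point is that such correlations, observed on general-purpose datasets, do not carry over reliably to medical ones.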

Physics-Informed Deep Learning to Reduce the Bias in Joint Prediction of Nitrogen Oxides

  • paper_url: http://arxiv.org/abs/2308.07441
  • repo_url: None
  • paper_authors: Lianfa Li, Roxana Khalili, Frederick Lurmann, Nathan Pavlovic, Jun Wu, Yan Xu, Yisi Liu, Karl O’Sharkey, Beate Ritz, Luke Oman, Meredith Franklin, Theresa Bastain, Shohreh F. Farzan, Carrie Breton, Rima Habre
  • for: Improve the accuracy and reliability of air quality prediction, particularly for nitrogen oxides (NOx).
  • methods: A physics-informed deep learning framework that encodes advection-diffusion mechanisms and fluid dynamics constraints, fusing the physicochemical knowledge of chemical transport models (CTMs) with machine learning to jointly predict NO2 and NOx.
  • results: Reduces ML model bias by 21-42%, captures fine-scale transport of NO2 and NOx, generates robust spatial extrapolation, and provides explicit uncertainty estimation.
    Abstract Atmospheric nitrogen oxides (NOx) primarily from fuel combustion have recognized acute and chronic health and environmental effects. Machine learning (ML) methods have significantly enhanced our capacity to predict NOx concentrations at ground-level with high spatiotemporal resolution but may suffer from high estimation bias since they lack physical and chemical knowledge about air pollution dynamics. Chemical transport models (CTMs) leverage this knowledge; however, accurate predictions of ground-level concentrations typically necessitate extensive post-calibration. Here, we present a physics-informed deep learning framework that encodes advection-diffusion mechanisms and fluid dynamics constraints to jointly predict NO2 and NOx and reduce ML model bias by 21-42%. Our approach captures fine-scale transport of NO2 and NOx, generates robust spatial extrapolation, and provides explicit uncertainty estimation. The framework fuses knowledge-driven physicochemical principles of CTMs with the predictive power of ML for air quality exposure, health, and policy applications. Our approach offers significant improvements over purely data-driven ML methods and has unprecedented bias reduction in joint NO2 and NOx prediction.
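    A physics-informed term of the kind described can be implemented by penalizing the residual of the advection-diffusion equation via automatic differentiation. The sketch below shows the general mechanism only; the velocity field, diffusion coefficient, and network interface are assumptions, not the paper's exact formulation.

```python
import torch

def advection_diffusion_residual(model, x, y, t, u, v, D):
    """Physics residual of 2-D advection-diffusion for a concentration
    network c = model(x, y, t):
        r = dc/dt + u dc/dx + v dc/dy - D (d2c/dx2 + d2c/dy2)
    Penalizing r**2 alongside the usual data loss injects transport
    physics into an otherwise data-driven NOx predictor."""
    x, y, t = (s.clone().requires_grad_(True) for s in (x, y, t))
    c = model(torch.stack([x, y, t], dim=-1))
    grad = lambda out, var: torch.autograd.grad(
        out, var, grad_outputs=torch.ones_like(out), create_graph=True)[0]
    c_t, c_x, c_y = grad(c, t), grad(c, x), grad(c, y)
    c_xx, c_yy = grad(c_x, x), grad(c_y, y)
    return c_t + u * c_x + v * c_y - D * (c_xx + c_yy)

# total loss (sketch): data fit + physics penalty
# loss = mse(model(obs_inputs), obs_no2) + lam * residual.pow(2).mean()
```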

Interaction-Aware Personalized Vehicle Trajectory Prediction Using Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07439
  • repo_url: None
  • paper_authors: Amr Abdelraouf, Rohit Gupta, Kyungtae Han
  • for: Improve vehicle trajectory prediction for advanced driver assistance systems and autonomous vehicles.
  • methods: Models spatio-temporal interactions between target vehicles and surrounding traffic with Graph Convolution Networks (GCN) and Long Short-Term Memory (LSTM), personalized via transfer learning: pre-training on a large-scale trajectory dataset followed by fine-tuning on each driver's own data.
  • results: The personalized GCN-LSTM model outperforms its generic counterpart, particularly at longer prediction horizons, and also beats per-driver models trained without pre-training, which would otherwise overfit.
    Abstract Accurate prediction of vehicle trajectories is vital for advanced driver assistance systems and autonomous vehicles. Existing methods mainly rely on generic trajectory predictions derived from large datasets, overlooking the personalized driving patterns of individual drivers. To address this gap, we propose an approach for interaction-aware personalized vehicle trajectory prediction that incorporates temporal graph neural networks. Our method utilizes Graph Convolution Networks (GCN) and Long Short-Term Memory (LSTM) to model the spatio-temporal interactions between target vehicles and their surrounding traffic. To personalize the predictions, we establish a pipeline that leverages transfer learning: the model is initially pre-trained on a large-scale trajectory dataset and then fine-tuned for each driver using their specific driving data. We employ human-in-the-loop simulation to collect personalized naturalistic driving trajectories and corresponding surrounding vehicle trajectories. Experimental results demonstrate the superior performance of our personalized GCN-LSTM model, particularly for longer prediction horizons, compared to its generic counterpart. Moreover, the personalized model outperforms individual models created without pre-training, emphasizing the significance of pre-training on a large dataset to avoid overfitting. By incorporating personalization, our approach enhances trajectory prediction accuracy.

Semantic Similarity Loss for Neural Source Code Summarization

  • paper_url: http://arxiv.org/abs/2308.07429
  • repo_url: https://github.com/apcl-research/funcom-useloss
  • paper_authors: Chia-Yi Su, Collin McMillan
  • for: Propose an improved loss function for neural source code summarization, i.e., automatically generating natural language descriptions of source code.
  • methods: Computes loss with a semantic similarity metric over the whole predicted sentence per training batch, combined with the traditional per-word categorical cross-entropy loss.
  • results: Evaluated against multiple baselines, with improvements in the vast majority of conditions.
    Abstract This paper presents an improved loss function for neural source code summarization. Code summarization is the task of writing natural language descriptions of source code. Neural code summarization refers to automated techniques for generating these descriptions using neural networks. Almost all current approaches involve neural networks as either standalone models or as part of a pretrained large language models e.g., GPT, Codex, LLaMA. Yet almost all also use a categorical cross-entropy (CCE) loss function for network optimization. Two problems with CCE are that 1) it computes loss over each word prediction one-at-a-time, rather than evaluating a whole sentence, and 2) it requires a perfect prediction, leaving no room for partial credit for synonyms. We propose and evaluate a loss function to alleviate this problem. In essence, we propose to use a semantic similarity metric to calculate loss over the whole output sentence prediction per training batch, rather than just loss for each word. We also propose to combine our loss with traditional CCE for each word, which streamlines the training process compared to baselines. We evaluate our approach over several baselines and report an improvement in the vast majority of conditions.
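    A sketch of the two-part loss: standard per-word cross-entropy plus a sentence-level semantic term. To keep the semantic term differentiable, this version pools soft token embeddings; the paper's actual similarity metric (the repository name suggests Universal Sentence Encoder scores) may be computed differently.

```python
import torch
import torch.nn.functional as F

def combined_summarization_loss(logits, target_ids, tok_embedding, alpha=0.5):
    """Sketch: per-word cross-entropy plus a sentence-level semantic
    similarity term, so synonyms earn partial credit instead of zero.
    tok_embedding is the model's nn.Embedding table (an assumption)."""
    vocab = logits.size(-1)
    cce = F.cross_entropy(logits.reshape(-1, vocab), target_ids.reshape(-1))
    # soft predicted embeddings keep the sentence-level term differentiable
    probs = logits.softmax(dim=-1)                          # (B, T, V)
    pred_vec = (probs @ tok_embedding.weight).mean(dim=1)   # (B, D)
    gold_vec = tok_embedding(target_ids).mean(dim=1)        # (B, D)
    semantic = (1 - F.cosine_similarity(pred_vec, gold_vec, dim=-1)).mean()
    return cce + alpha * semantic
```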

UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

  • paper_url: http://arxiv.org/abs/2308.07428
  • repo_url: None
  • paper_authors: Weijian Mai, Zhijun Zhang
  • for: Reconstruct both images and captions from brain activity evoked by visual stimuli, to better understand the connection between the human brain and the visual perception system.
  • methods: Proposes UniBrain, which unifies image reconstruction and captioning in one latent diffusion model (Versatile Diffusion) by transforming fMRI voxels into image and text latents and guiding the reverse diffusion process with CLIP-derived fMRI-based image and text conditions.
  • results: Outperforms current methods both qualitatively and quantitatively in image reconstruction and reports the first image-captioning results on the Natural Scenes Dataset (NSD); ablation experiments and functional ROI analysis further demonstrate its advantages for visual brain decoding.
    Abstract Image reconstruction and captioning from brain activity evoked by visual stimuli allow researchers to further understand the connection between the human brain and the visual perception system. While deep generative models have recently been employed in this field, reconstructing realistic captions and images with both low-level details and high semantic fidelity is still a challenging problem. In this work, we propose UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity. For the first time, we unify image reconstruction and captioning from visual-evoked functional magnetic resonance imaging (fMRI) through a latent diffusion model termed Versatile Diffusion. Specifically, we transform fMRI voxels into text and image latent for low-level information and guide the backward diffusion process through fMRI-based image and text conditions derived from CLIP to generate realistic captions and images. UniBrain outperforms current methods both qualitatively and quantitatively in terms of image reconstruction and reports image captioning results for the first time on the Natural Scenes Dataset (NSD) dataset. Moreover, the ablation experiments and functional region-of-interest (ROI) analysis further exhibit the superiority of UniBrain and provide comprehensive insight for visual-evoked brain decoding.

Exploring the Intersection of Large Language Models and Agent-Based Modeling via Prompt Engineering

  • paper_url: http://arxiv.org/abs/2308.07411
  • repo_url: https://github.com/ejunprung/llm-agents
  • paper_authors: Edward Junprung
  • for: Explore using large language models (LLMs) to simulate believable human behavior and interaction within complex social systems.
  • methods: Prompt engineering, inspired by Park et al. (2023), across two designed scenarios: a two-agent negotiation and a six-agent murder mystery game.
  • results: LLM agents produce believable proxies of human behavior, including interaction and decision-making in negotiation and deduction, suggesting LLMs can be effective tools for simulating human-driven interactions.
    Abstract The final frontier for simulation is the accurate representation of complex, real-world social systems. While agent-based modeling (ABM) seeks to study the behavior and interactions of agents within a larger system, it is unable to faithfully capture the full complexity of human-driven behavior. Large language models (LLMs), like ChatGPT, have emerged as a potential solution to this bottleneck by enabling researchers to explore human-driven interactions in previously unimaginable ways. Our research investigates simulations of human interactions using LLMs. Through prompt engineering, inspired by Park et al. (2023), we present two simulations of believable proxies of human behavior: a two-agent negotiation and a six-agent murder mystery game.

PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects

  • paper_url: http://arxiv.org/abs/2308.07391
  • repo_url: https://github.com/3dlg-hcvc/paris
  • paper_authors: Jiayi Liu, Ali Mahdavi-Amiri, Manolis Savva
  • for: simultaneous part-level reconstruction and motion parameter estimation for articulated objects
  • methods: self-supervised, end-to-end architecture with implicit shape and appearance models, optimizes motion parameters jointly without 3D supervision or semantic annotation
  • results: generalizes better across object categories, outperforms baselines and prior work, improves reconstruction with a Chamfer-L1 distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, achieves 5% error rate for motion estimation across 10 object categories.
    Abstract We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects. Given two sets of multi-view images of an object in two static articulation states, we decouple the movable part from the static part and reconstruct shape and appearance while predicting the motion parameters. To tackle this problem, we present PARIS: a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly without any 3D supervision, motion, or semantic annotation. Our experiments show that our method generalizes better across object categories, and outperforms baselines and prior work that are given 3D point clouds as input. Our approach improves reconstruction relative to state-of-the-art baselines with a Chamfer-L1 distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, and achieves 5% error rate for motion estimation across 10 object categories. Video summary at: https://youtu.be/tDSrROPCgUc

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

  • paper_url: http://arxiv.org/abs/2308.07308
  • repo_url: None
  • paper_authors: Alec Helbling, Mansi Phute, Matthew Hull, Duen Horng Chau
  • for: Prevent language models from presenting harmful content (e.g., instructions for committing crimes) to users.
  • methods: Have a large language model filter its own responses, validating each candidate output before it is shown.
  • results: Even if a model is not fine-tuned to be aligned with human values, it can be stopped from presenting harmful content to users by validating the content with a language model.
    Abstract Large language models (LLMs) have skyrocketed in popularity in recent years due to their ability to generate high-quality text in response to human prompting. However, these models have been shown to have the potential to generate harmful content in response to user prompting (e.g., giving users instructions on how to commit crimes). There has been a focus in the literature on mitigating these risks, through methods like aligning models with human values through reinforcement learning. However, it has been shown that even aligned language models are susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple approach to defending against these attacks by having a large language model filter its own responses. Our current results show that even if a model is not fine-tuned to be aligned with human values, it is possible to stop it from presenting harmful content to users by validating the content using a language model.
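    The self-examination loop is simple to sketch. `llm` below is an assumed prompt-to-completion callable standing in for any chat model; the filter wording is illustrative, not the paper's exact prompt.

```python
def is_harmful(candidate_response, llm):
    """Ask a language model to judge a candidate response before it is
    shown to the user (self-examination)."""
    verdict = llm(
        "Does the following text contain harmful, dangerous, or illegal "
        "content? Answer strictly 'Yes' or 'No'.\n\n"
        f"Text: {candidate_response}"
    )
    return verdict.strip().lower().startswith("yes")

def guarded_generate(prompt, llm):
    """Generate normally, then gate the output through the harm filter."""
    response = llm(prompt)
    if is_harmful(response, llm):
        return "I can't help with that."
    return response
```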

Extend Wave Function Collapse to Large-Scale Content Generation

  • paper_url: http://arxiv.org/abs/2308.07307
  • repo_url: None
  • paper_authors: Yuhe Nie, Shaoming Zheng, Zhan Zhuang, Xuan Song
  • for: Address the constraint conflicts and time complexity that prevent the Wave Function Collapse (WFC) algorithm from generating commercial-scale or infinite content.
  • methods: Proposes the Nested WFC (N-WFC) framework with complete and sub-complete tileset preparation strategies, which avoid conflicts and backtracking and can generate deterministic, aperiodic infinite content from a small number of tiles; a weight-brush system combines N-WFC with sub-complete tilesets.
  • results: Demonstrates the feasibility of N-WFC and its suitability for game design, providing a theoretical basis for implementing concrete games.
    Abstract Wave Function Collapse (WFC) is a widely used tile-based algorithm in procedural content generation, including textures, objects, and scenes. However, the current WFC algorithm and related research lack the ability to generate commercialized large-scale or infinite content due to constraint conflict and time complexity costs. This paper proposes a Nested WFC (N-WFC) algorithm framework to reduce time complexity. To avoid conflict and backtracking problems, we offer a complete and sub-complete tileset preparation strategy, which requires only a small number of tiles to generate aperiodic and deterministic infinite content. We also introduce the weight-brush system that combines N-WFC and sub-complete tileset, proving its suitability for game design. Our contribution addresses WFC's challenge in massive content generation and provides a theoretical basis for implementing concrete games.
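    For readers unfamiliar with WFC, here is a minimal one-dimensional version of the core collapse-and-propagate loop that N-WFC builds on. The tiles and adjacency rules are toy examples; the paper's contributions (nesting, complete/sub-complete tilesets, the weight-brush system) are not shown.

```python
import random

def wave_function_collapse(n, tiles, allowed, seed=0):
    """Minimal 1-D WFC: every cell starts as the full tile set; repeatedly
    collapse the lowest-entropy cell, then propagate adjacency constraints.
    allowed[a] is the set of tiles permitted immediately to the right of a."""
    rng = random.Random(seed)
    wave = [set(tiles) for _ in range(n)]
    while any(len(c) > 1 for c in wave):
        # pick the undecided cell with the fewest remaining options
        i = min((i for i in range(n) if len(wave[i]) > 1),
                key=lambda i: len(wave[i]))
        wave[i] = {rng.choice(sorted(wave[i]))}
        # propagate constraints in both directions until a fixed point
        changed = True
        while changed:
            changed = False
            for j in range(n - 1):
                right = {t for t in wave[j + 1]
                         if any(t in allowed[a] for a in wave[j])}
                left = {a for a in wave[j] if allowed[a] & right}
                if not right or not left:
                    raise RuntimeError("contradiction (plain WFC backtracks here)")
                if right != wave[j + 1] or left != wave[j]:
                    wave[j + 1], wave[j], changed = right, left, True
    return [next(iter(c)) for c in wave]

# toy tiles: land (L), shore (S), water (W); shore mediates land and water
print(wave_function_collapse(
    12, "LSW", {"L": {"L", "S"}, "S": {"L", "S", "W"}, "W": {"S", "W"}}))
```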

Neural Authorship Attribution: Stylometric Analysis on Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07305
  • repo_url: None
  • paper_authors: Tharindu Kumarage, Huan Liu
  • for: Study neural authorship attribution, i.e., tracing AI-generated text back to the large language model that produced it, motivated by concerns about AI-generated misinformation.
  • methods: Empirically analyzes the writing signatures of proprietary (e.g., GPT-4, PaLM) and open-source (e.g., Llama) LLMs, integrating stylometric features across lexical, syntactic, and structural aspects of language to augment pretrained language-model-based classifiers.
  • results: Reveals distinguishing differences between proprietary and open-source models, as well as variations within each group, providing empirical insights for mitigating threats posed by AI-generated misinformation.
    Abstract Large language models (LLMs) such as GPT-4, PaLM, and Llama have significantly propelled the generation of AI-crafted text. With rising concerns about their potential misuse, there is a pressing need for AI-generated-text forensics. Neural authorship attribution is a forensic effort, seeking to trace AI-generated text back to its originating LLM. The LLM landscape can be divided into two primary categories: proprietary and open-source. In this work, we delve into these emerging categories of LLMs, focusing on the nuances of neural authorship attribution. To enrich our understanding, we carry out an empirical analysis of LLM writing signatures, highlighting the contrasts between proprietary and open-source models, and scrutinizing variations within each group. By integrating stylometric features across lexical, syntactic, and structural aspects of language, we explore their potential to yield interpretable results and augment pre-trained language model-based classifiers utilized in neural authorship attribution. Our findings, based on a range of state-of-the-art LLMs, provide empirical insights into neural authorship attribution, paving the way for future investigations aimed at mitigating the threats posed by AI-generated misinformation.
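    A toy version of the stylometric side of such a pipeline: hand-crafted lexical and structural features fed to a standard classifier. The feature set is an illustrative subset, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stylometric_features(text):
    """A few lexical/structural stylometry features of the kind combined
    with pretrained classifiers for authorship attribution."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return [
        np.mean([len(w) for w in words]),             # mean word length
        len(set(words)) / max(len(words), 1),         # type-token ratio
        np.mean([len(s.split()) for s in sentences]), # mean sentence length
        text.count(",") / max(len(words), 1),         # comma rate
    ]

def train_attributor(texts, sources):
    """texts: list of strings; sources: which LLM produced each one
    (e.g., 0 = GPT-4, 1 = PaLM, 2 = Llama)."""
    X = np.array([stylometric_features(t) for t in texts])
    return RandomForestClassifier(n_estimators=200).fit(X, sources)
```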

Why Not? Explaining Missing Entailments with Evee (Technical Report)

  • paper_url: http://arxiv.org/abs/2308.07294
  • repo_url: None
  • paper_authors: Christian Alrabbaa, Stefan Borgwardt, Tom Friese, Patrick Koopmann, Mikhail Kotlov
  • for: Help ontology users understand not only the logical entailments derived by a description logic reasoner, but also why an expected consequence does not follow.
  • methods: Extends the Evee plug-in for the ontology editor Protégé with existing and new techniques based on abduction and counterexamples.
  • results: Presents a new version of the $\rm E{\scriptsize VEE}$ plug-in that explains missing consequences in addition to derived ones.
    Abstract Understanding logical entailments derived by a description logic reasoner is not always straight-forward for ontology users. For this reason, various methods for explaining entailments using justifications and proofs have been developed and implemented as plug-ins for the ontology editor Prot\'eg\'e. However, when the user expects a missing consequence to hold, it is equally important to explain why it does not follow from the ontology. In this paper, we describe a new version of $\rm E{\scriptsize VEE}$, a Prot\'eg\'e plugin that now also provides explanations for missing consequences, via existing and new techniques based on abduction and counterexamples.

Cross-Attribute Matrix Factorization Model with Shared User Embedding

  • paper_url: http://arxiv.org/abs/2308.07284
  • repo_url: None
  • paper_authors: Wen Liang, Zeng Fan, Youzhi Liang, Jianguo Jia
  • for: Apply deep learning to recommender systems while addressing the cold-start and robustness issues that arise when user and item attributes are ignored.
  • methods: A refined Neural Matrix Factorization (NeuMF) model that considers interactions across associated user and item attributes, with a shared user embedding integrated into the architecture.
  • results: The proposed Cross-Attribute Matrix Factorization (CAMF) model outperforms baselines on the MovieLens and Pinterest datasets, particularly in scenarios with higher dataset sparsity.
    Abstract Over the past few years, deep learning has firmly established its prowess across various domains, including computer vision, speech recognition, and natural language processing. Motivated by its outstanding success, researchers have been directing their efforts towards applying deep learning techniques to recommender systems. Neural collaborative filtering (NCF) and Neural Matrix Factorization (NeuMF) refreshes the traditional inner product in matrix factorization with a neural architecture capable of learning complex and data-driven functions. While these models effectively capture user-item interactions, they overlook the specific attributes of both users and items. This can lead to robustness issues, especially for items and users that belong to the "long tail". Such challenges are commonly recognized in recommender systems as a part of the cold-start problem. A direct and intuitive approach to address this issue is by leveraging the features and attributes of the items and users themselves. In this paper, we introduce a refined NeuMF model that considers not only the interaction between users and items, but also acrossing associated attributes. Moreover, our proposed architecture features a shared user embedding, seamlessly integrating with user embeddings to imporve the robustness and effectively address the cold-start problem. Rigorous experiments on both the Movielens and Pinterest datasets demonstrate the superiority of our Cross-Attribute Matrix Factorization model, particularly in scenarios characterized by higher dataset sparsity.
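    A compact sketch of the architectural idea: attribute embeddings alongside NeuMF-style user/item embeddings, with the user table shared across branches. Layer sizes and the attribute-pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttributeMF(nn.Module):
    """Sketch: NeuMF-style user/item embeddings augmented with attribute
    embeddings; the user embedding table is shared between the interaction
    branch and the attribute branch (dimensions are illustrative)."""
    def __init__(self, n_users, n_items, n_attrs, d=32):
        super().__init__()
        self.user = nn.Embedding(n_users, d)      # shared user embedding
        self.item = nn.Embedding(n_items, d)
        self.attr = nn.Embedding(n_attrs, d)
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, u, i, item_attrs):          # item_attrs: (B, n_attr_ids)
        a = self.attr(item_attrs).mean(dim=1)     # pooled item attributes
        ui = self.user(u) * self.item(i)          # GMF-style interaction
        ua = self.user(u) * a                     # user x attribute interaction
        x = torch.cat([ui, ua, self.user(u)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)
```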

Autonomous Point Cloud Segmentation for Power Lines Inspection in Smart Grid

  • paper_url: http://arxiv.org/abs/2308.07283
  • repo_url: None
  • paper_authors: Alexander Kyuroson, Anton Koval, George Nikolakopoulos
  • for: Propose an unsupervised machine learning (ML) framework to detect, extract, and analyze high- and low-voltage power lines, as well as the surrounding vegetation in a Power Line Corridor (PLC), solely from LiDAR data.
  • methods: First eliminates ground points from higher-elevation points via statistical analysis with density criteria and histogram thresholding; then denoises and transforms the remaining candidate points using Principal Component Analysis (PCA) and a Kd-tree; finally segments each power line individually with two-stage DBSCAN clustering.
  • results: Experiments show the framework is agnostic to the acquisition platform, efficiently detects power lines, and supports PLC-based hazard analysis.
    Abstract LiDAR is currently one of the most utilized sensors to effectively monitor the status of power lines and facilitate the inspection of remote power distribution networks and related infrastructures. To ensure the safe operation of the smart grid, various remote data acquisition strategies, such as Airborne Laser Scanning (ALS), Mobile Laser Scanning (MLS), and Terrestrial Laser Scanning (TSL) have been leveraged to allow continuous monitoring of regional power networks, which are typically surrounded by dense vegetation. In this article, an unsupervised Machine Learning (ML) framework is proposed, to detect, extract and analyze the characteristics of power lines of both high and low voltage, as well as the surrounding vegetation in a Power Line Corridor (PLC) solely from LiDAR data. Initially, the proposed approach eliminates the ground points from higher elevation points based on statistical analysis that applies density criteria and histogram thresholding. After denoising and transforming of the remaining candidate points by applying Principle Component Analysis (PCA) and Kd-tree, power line segmentation is achieved by utilizing a two-stage DBSCAN clustering to identify each power line individually. Finally, all high elevation points in the PLC are identified based on their distance to the newly segmented power lines. Conducted experiments illustrate that the proposed framework is an agnostic method that can efficiently detect the power lines and perform PLC-based hazard analysis.
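    The geometric intuition behind the pipeline (ground removal, then linearity filtering, then density clustering) can be sketched as follows. The thresholds are illustrative, and the paper's statistical ground filter and two-stage DBSCAN are simplified to a single pass here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def segment_power_lines(points, ground_pct=30, linearity=0.95):
    """Sketch on an (N, 3) LiDAR cloud:
    1) drop low-elevation (ground) points by a height threshold,
    2) cluster the elevated points with DBSCAN,
    3) keep clusters whose PCA shows one dominant direction (wires
       are nearly one-dimensional)."""
    z_cut = np.percentile(points[:, 2], ground_pct)
    elevated = points[points[:, 2] > z_cut]

    labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(elevated)
    lines = {}
    for lab in set(labels) - {-1}:                 # -1 = DBSCAN noise
        cluster = elevated[labels == lab]
        var = PCA(n_components=3).fit(cluster).explained_variance_ratio_
        if var[0] >= linearity:                    # one dominant axis => wire
            lines[lab] = cluster
    return lines
```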

Data-Efficient Energy-Aware Participant Selection for UAV-Enabled Federated Learning

  • paper_url: http://arxiv.org/abs/2308.07273
  • repo_url: None
  • paper_authors: Youssra Cheriguene, Wael Jaafar, Chaker Abdelaziz Kerrache, Halim Yanikomeroglu, Fatima Zohra Bousbaa, Nasreddine Lagraa
  • for: Improve the accuracy of edge federated learning (FL) models under UAV constraints on energy consumption, communication quality, and local dataset heterogeneity.
  • methods: Proposes a data-efficient energy-aware participant selection strategy (DEEPS) that selects the best FL participant in each sub-region based on the structural similarity index measure (SSIM) average score of its local dataset and its power-consumption profile.
  • results: Experiments show the proposed selection scheme outperforms the benchmark random selection method in model accuracy, training time, and UAV energy consumption.
    Abstract Unmanned aerial vehicle (UAV)-enabled edge federated learning (FL) has sparked a rise in research interest as a result of the massive and heterogeneous data collected by UAVs, as well as the privacy concerns related to UAV data transmissions to edge servers. However, due to the redundancy of UAV collected data, e.g., imaging data, and non-rigorous FL participant selection, the convergence time of the FL learning process and bias of the FL model may increase. Consequently, we investigate in this paper the problem of selecting UAV participants for edge FL, aiming to improve the FL model's accuracy, under UAV constraints of energy consumption, communication quality, and local datasets' heterogeneity. We propose a novel UAV participant selection scheme, called data-efficient energy-aware participant selection strategy (DEEPS), which consists of selecting the best FL participant in each sub-region based on the structural similarity index measure (SSIM) average score of its local dataset and its power consumption profile. Through experiments, we demonstrate that the proposed selection scheme is superior to the benchmark random selection method, in terms of model accuracy, training time, and UAV energy consumption.

EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07269
  • repo_url: https://github.com/zjunlp/easyedit
  • paper_authors: Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, Huajun Chen
  • for: Improve the ability to update and correct knowledge in LLMs, enhancing their reliability and generality.
  • methods: An easy-to-use framework that supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, and LlaMA.
  • results: Knowledge editing experiments on LlaMA-2 show that editing surpasses traditional fine-tuning in terms of reliability and generalization.
    Abstract Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to the outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners to apply knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily apply to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video at http://knowlm.zjukg.cn/easyedit.mp4.

Can we Agree? On the Rashōmon Effect and the Reliability of Post-Hoc Explainable AI

  • paper_url: http://arxiv.org/abs/2308.07247
  • repo_url: None
  • paper_authors: Clement Poiret, Antoine Grigis, Justin Thomas, Marion Noulhiane
  • for: Examine the challenges the Rashōmon effect poses for deriving reliable knowledge from machine learning models.
  • methods: Uses SHAP to explain models in a Rashōmon set across five public datasets while varying the sample size.
  • results: Explanations gradually converge as the sample size increases; with fewer than 128 samples they exhibit high variability, limiting reliable knowledge extraction. Agreement between models improves with more data, allowing for consensus, and bagging ensembles often agree more. The results provide guidance on how much data is sufficient to trust explanations.
    Abstract The Rashōmon effect poses challenges for deriving reliable knowledge from machine learning models. This study examined the influence of sample size on explanations from models in a Rashōmon set using SHAP. Experiments on 5 public datasets showed that explanations gradually converged as the sample size increased. Explanations from <128 samples exhibited high variability, limiting reliable knowledge extraction. However, agreement between models improved with more data, allowing for consensus. Bagging ensembles often had higher agreement. The results provide guidance on sufficient data to trust explanations. Variability at low samples suggests that conclusions may be unreliable without validation. Further work is needed with more model types, data domains, and explanation methods. Testing convergence in neural networks and with model-specific explanation methods would be impactful. The approaches explored here point towards principled techniques for eliciting knowledge from ambiguous models.
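    The study's measurement can be reproduced in miniature: fit several near-equivalent models, explain each with SHAP, and quantify agreement between their global feature rankings. The model choice and agreement statistic below are assumptions.

```python
import numpy as np
import shap
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

def rashomon_agreement(X, y, n_models=5, seed=0):
    """Fit near-equivalent models on bootstrap resamples, explain each with
    SHAP, and measure agreement between their global feature-importance
    rankings (mean pairwise Spearman correlation)."""
    rng = np.random.default_rng(seed)
    importances = []
    for m in range(n_models):
        idx = rng.choice(len(X), size=len(X), replace=True)
        model = RandomForestRegressor(random_state=m).fit(X[idx], y[idx])
        sv = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)
        importances.append(np.abs(sv).mean(axis=0))    # global importance
    rhos = [spearmanr(importances[i], importances[j])[0]
            for i in range(n_models) for j in range(i + 1, n_models)]
    return float(np.mean(rhos))

# sweeping the size of X (e.g. 32, 64, 128, ...) mirrors the paper's
# observation that explanation agreement grows with sample size
```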

Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents

  • paper_url: http://arxiv.org/abs/2308.07241
  • repo_url: None
  • paper_authors: Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, Jonghyun Choi
  • for: Improve the completion of household tasks, which requires planning a sequence of actions while considering the consequences of previous actions.
  • methods: Proposes Context-Aware Planning and Environment-Aware Memory (CAPEAM), which incorporates semantic context (e.g., appropriate objects to interact with) and the changed spatial arrangement and states of interacted objects (e.g., where an object has been moved) when inferring subsequent actions.
  • results: An agent equipped with CAPEAM achieves state-of-the-art performance on multiple metrics of a challenging interactive instruction-following benchmark, in both seen and unseen environments, by large margins (up to +10.70% in unseen environments).
    Abstract Accomplishing household tasks requires to plan step-by-step actions considering the consequences of previous actions. However, the state-of-the-art embodied agents often make mistakes in navigating the environment and interacting with proper objects due to imperfect learning by imitating experts or algorithmic planners without such knowledge. To improve both visual navigation and object interaction, we propose to consider the consequence of taken actions by CAPEAM (Context-Aware Planning and Environment-Aware Memory) that incorporates semantic context (e.g., appropriate objects to interact with) in a sequence of actions, and the changed spatial arrangement and states of interacted objects (e.g., location that the object has been moved to) in inferring the subsequent actions. We empirically show that the agent with the proposed CAPEAM achieves state-of-the-art performance in various metrics using a challenging interactive instruction following benchmark in both seen and unseen environments by large margins (up to +10.70% in unseen env.).

cs.CL - 2023-08-15

DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion

  • paper_url: http://arxiv.org/abs/2308.12877
  • repo_url: None
  • paper_authors: Anthony Yazdani, Hossein Rouhizadeh, David Vicente Alvarez, Douglas Teodoro
  • for: Evaluate a social media text-mining system for normalizing adverse drug event mentions in tweets to the Medical Dictionary for Regulatory Activities (MedDRA) terminology.
  • methods: A two-stage approach: BERT fine-tuning for entity recognition, followed by zero-shot normalization using sentence transformers and reciprocal-rank fusion.
  • results: Achieves 44.9% precision, 40.5% recall, and 42.6% F1-score, exceeding the median performance in shared task 5 by 10% and ranking highest among all participants, demonstrating the approach's effectiveness for adverse drug event normalization in social media text mining.
    Abstract This paper outlines the performance evaluation of a system for adverse drug event normalization, developed by the Data Science for Digital Health group for the Social Media Mining for Health Applications 2023 shared task 5. Shared task 5 targeted the normalization of adverse drug event mentions in Twitter to standard concepts from the Medical Dictionary for Regulatory Activities terminology. Our system hinges on a two-stage approach: BERT fine-tuning for entity recognition, followed by zero-shot normalization using sentence transformers and reciprocal-rank fusion. The approach yielded a precision of 44.9%, recall of 40.5%, and an F1-score of 42.6%. It outperformed the median performance in shared task 5 by 10% and demonstrated the highest performance among all participants. These results substantiate the effectiveness of our approach and its potential application for adverse drug event normalization in the realm of social media text mining.
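    The zero-shot normalization stage is straightforward to sketch with the sentence-transformers library: rank candidate terms under each encoder, then merge the rankings with reciprocal-rank fusion. The encoder checkpoints named below are examples, not necessarily the ones the team used.

```python
from sentence_transformers import SentenceTransformer, util

# two complementary encoders (example checkpoints)
encoders = [SentenceTransformer("all-MiniLM-L6-v2"),
            SentenceTransformer("all-mpnet-base-v2")]

def normalize_ade(mention, meddra_terms, k=60, top=5):
    """Zero-shot normalization sketch: rank MedDRA terms by cosine
    similarity under each encoder, then merge the rankings with
    reciprocal-rank fusion, score(t) = sum_r 1 / (k + rank_r(t))."""
    fused = {}
    for enc in encoders:
        q = enc.encode(mention, convert_to_tensor=True)
        t = enc.encode(meddra_terms, convert_to_tensor=True)
        sims = util.cos_sim(q, t)[0]
        order = sims.argsort(descending=True).tolist()
        for rank, idx in enumerate(order):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (k + rank + 1)
    best = sorted(fused, key=fused.get, reverse=True)[:top]
    return [meddra_terms[i] for i in best]
```

    Reciprocal-rank fusion is attractive here because it merges rankings without requiring the encoders' similarity scores to be on comparable scales.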

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

  • paper_url: http://arxiv.org/abs/2308.07777
  • repo_url: None
  • paper_authors: Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao
  • for: Improve visually-rich document understanding by exploiting knowledge of document layout structure, modeled as a graph over text nodes.
  • methods: Proposes GraphLayoutLM, which uses a graph reordering algorithm to adjust the text sequence based on the layout graph and a layout-aware multi-head self-attention layer to learn document layout knowledge.
  • results: Achieves state-of-the-art results on several benchmarks, including FUNSD, XFUND, and CORD; an ablation study shows both the graph reordering algorithm and the layout-aware self-attention layer are crucial to the best performance.
    Abstract In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationship between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that leverages the modeling of layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results among these datasets. Our experimental results demonstrate that our proposed method provides a significant improvement over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.

  • paper_url: http://arxiv.org/abs/2308.07711
  • repo_url: None
  • paper_authors: Wen Zan, Yaopeng Han, Xiaotian Jiang, Yao Xiao, Yang Yang, Dayao Chen, Sheng Chen
  • for: Improve the relevance between queries and documents in life service platform search, a key requirement for user experience.
  • methods: Proposes a two-stage pretraining and matching architecture that takes the query and multiple structured document fields as input, with an effective information-compression method for lengthy fields.
  • results: Extensive offline experiments and online A/B tests on millions of users show the architecture effectively improves relevance modeling; it has served Meituan's search traffic for over a year.
    Abstract In e-commerce search, relevance between query and documents is an essential requirement for satisfying user experience. Different from traditional e-commerce platforms that offer products, users search on life service platforms such as Meituan mainly for product providers, which usually have abundant structured information, e.g. name, address, category, thousands of products. Modeling search relevance with these rich structured contents is challenging due to the following issues: (1) there is language distribution discrepancy among different fields of structured document, making it difficult to directly adopt off-the-shelf pretrained language model based methods like BERT. (2) different fields usually have different importance and their length vary greatly, making it difficult to extract document information helpful for relevance matching. To tackle these issues, in this paper we propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents. At pretraining stage, we propose an effective pretraining method that employs both query and multiple fields of document as inputs, including an effective information compression method for lengthy fields. At relevance matching stage, a novel matching method is proposed by leveraging domain knowledge in search query to generate more effective document representations for relevance scoring. Extensive offline experiments and online A/B tests on millions of users verify that the proposed architectures effectively improve the performance of relevance modeling. The model has already been deployed online, serving the search traffic of Meituan for over a year.

Better Zero-Shot Reasoning with Role-Play Prompting

  • paper_url: http://arxiv.org/abs/2308.07702
  • repo_url: https://github.com/HLT-NLP/Role-Play-Prompting
  • paper_authors: Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, Xin Zhou
  • for: Investigate how role-play affects the reasoning abilities of LLMs.
  • methods: A strategically designed role-play prompting method evaluated in the zero-shot setting on twelve diverse reasoning benchmarks, covering arithmetic, commonsense reasoning, symbolic reasoning, and more, using models such as ChatGPT and Llama 2.
  • results: Role-play prompting consistently surpasses the standard zero-shot approach on most datasets; for instance, accuracy on AQuA rises from 53.5% to 63.8% and on Last Letter from 23.8% to 84.2%, suggesting that role-play prompting enhances contextual understanding and acts as an implicit Chain-of-Thought trigger.
    Abstract Modern large language models (LLMs), such as ChatGPT, exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities like a Linux terminal. This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs' reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks, encompassing arithmetic, commonsense reasoning, symbolic reasoning, and more. Leveraging models such as ChatGPT and Llama 2, our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Beyond enhancing contextual understanding, we posit that role-play prompting serves as an implicit Chain-of-Thought (CoT) trigger, thereby improving the quality of reasoning. By comparing our approach with the Zero-Shot-CoT technique, which prompts the model to "think step by step", we further demonstrate that role-play prompting can generate a more effective CoT. This highlights its potential to augment the reasoning capabilities of LLMs.
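    The prompting strategy can be sketched as a short message sequence: a role-immersion turn, a fixed acknowledgement, then the task. The wording is illustrative, and `chat` is an assumed messages-to-reply interface.

```python
def role_play_prompt(question):
    """Build a role-play conversation before posing the task; the fixed
    assistant acknowledgement deepens the role immersion (illustrative
    wording, not the paper's exact prompts)."""
    immersion = ("From now on, you are an excellent math teacher who always "
                 "solves problems carefully, step by step.")
    acknowledgement = ("Understood! As a math teacher, I will reason through "
                       "each problem carefully before answering.")
    return [
        {"role": "user", "content": immersion},
        {"role": "assistant", "content": acknowledgement},
        {"role": "user", "content": question},
    ]

# answer = chat(role_play_prompt("A train travels 60 km in 45 minutes. ..."))
```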

Attention Is Not All You Need Anymore

  • paper_url: http://arxiv.org/abs/2308.07661
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Zhe Chen
  • for: Propose a drop-in replacement for the self-attention mechanism in the Transformer that reduces computational and memory complexity without trading off performance.
  • methods: The proposed Extractor replaces self-attention in the Transformer; the paper also formulates sequence prediction for text generation using variable-length discrete-time Markov chains.
  • results: Experiments show that the Extractor improves the Transformer's performance and, having a much shorter critical path of computation, has the potential to run faster than self-attention.
    Abstract In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key for the continuing success of the Transformer. In this paper, a drop-in replacement for the self-attention mechanism in the Transformer, called the Extractor, is proposed. Experimental results show that replacing the self-attention mechanism with the Extractor improves the performance of the Transformer. Furthermore, the proposed Extractor has the potential to run faster than the self-attention since it has a much shorter critical path of computation. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.

SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR

  • paper_url: http://arxiv.org/abs/2308.07654
  • repo_url: None
  • paper_authors: Jianyi Cheng, Samuel Coward, Lorenzo Chelini, Rafael Barbalho, Theo Drane
  • for: This paper aims to improve the performance of hardware designs produced by high-level synthesis (HLS) tools by automatically rewriting software programs into efficient HLS code.
  • methods: The proposed method, called SEER, uses an e-graph data structure to efficiently explore equivalent implementations of a program at scale, and orchestrates existing software compiler passes and hardware synthesis optimizers.
  • results: The paper shows that SEER achieves up to 38x the performance within 1.4x the area of the original program, and outperforms manually optimized designs produced by hardware experts in an Intel-provided case study.
    Abstract High-level synthesis (HLS) is a process that automatically translates a software program in a high-level language into a low-level hardware description. However, the hardware designs produced by HLS tools still suffer from a significant performance gap compared to manual implementations. This is because the input HLS programs must still be written using hardware design principles. Existing techniques either leave the program source unchanged or perform a fixed sequence of source transformation passes, potentially missing opportunities to find the optimal design. We propose a super-optimization approach for HLS that automatically rewrites an arbitrary software program into efficient HLS code that can be used to generate an optimized hardware design. We developed a toolflow named SEER, based on the e-graph data structure, to efficiently explore equivalent implementations of a program at scale. SEER provides an extensible framework, orchestrating existing software compiler passes and hardware synthesis optimizers. Our work is the first attempt to exploit e-graph rewriting for large software compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we show that SEER achieves up to 38x the performance within 1.4x the area of the original program. Via an Intel-provided case study, SEER demonstrates the potential to outperform manually optimized designs produced by hardware experts.

Steering Language Generation: Harnessing Contrastive Expert Guidance and Negative Prompting for Coherent and Diverse Synthetic Data Generation

  • paper_url: http://arxiv.org/abs/2308.07645
  • repo_url: None
  • paper_authors: Charles O’Neill, Yuan-Sen Ting, Ioana Ciuca, Jack Miller, Thang Bui
  • for: Improve the coherence and diversity of synthetic data generated by large language models, for downstream model training and practical data utilisation.
  • methods: Introduces contrastive expert guidance, emphasizing the difference between the logit distributions of fine-tuned and base language models to ensure domain adherence, and uses existing real and synthetic examples as negative prompts to ensure diversity and authenticity.
  • results: Outperforms previous synthetic-data generation techniques on three distinct tasks (hypothesis generation, toxic and non-toxic comment generation, and commonsense-reasoning task generation), exhibiting a better balance between data diversity and coherence.
    Abstract Large Language Models (LLMs) hold immense potential to generate synthetic data of high quality and utility, which has numerous applications from downstream model training to practical data utilisation. However, contemporary models, despite their impressive capacities, consistently struggle to produce both coherent and diverse data. To address the coherency issue, we introduce contrastive expert guidance, where the difference between the logit distributions of fine-tuned and base language models is emphasised to ensure domain adherence. In order to ensure diversity, we utilise existing real and synthetic examples as negative prompts to the model. We deem this dual-pronged approach to logit reshaping as STEER: Semantic Text Enhancement via Embedding Repositioning. STEER operates at inference-time and systematically guides the LLMs to strike a balance between adherence to the data distribution (ensuring semantic fidelity) and deviation from prior synthetic examples or existing real datasets (ensuring diversity and authenticity). This delicate balancing act is achieved by dynamically moving towards or away from chosen representations in the latent space. STEER demonstrates improved performance over previous synthetic data generation techniques, exhibiting better balance between data diversity and coherency across three distinct tasks: hypothesis generation, toxic and non-toxic comment generation, and commonsense reasoning task generation. We demonstrate how STEER allows for fine-tuned control over the diversity-coherency trade-off via its hyperparameters, highlighting its versatility.
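    At decoding time, contrastive expert guidance with negative prompting amounts to reshaping logits before sampling. The combination rule below is an illustrative form of that idea, with assumed weighting hyperparameters, not the paper's exact formula.

```python
import torch

def steered_logits(base_logits, expert_logits, negative_logits,
                   gamma=1.0, delta=0.5):
    """Sketch of contrastive guidance at inference time: amplify what the
    domain-tuned expert knows beyond the base model, and push away from
    tokens favoured under negative prompts (prior synthetic or real
    examples)."""
    contrast = expert_logits - base_logits          # expert-specific signal
    steered = expert_logits + gamma * contrast - delta * negative_logits
    return torch.log_softmax(steered, dim=-1)

# next_token = torch.multinomial(steered_logits(b, e, n).exp(), 1)
```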

LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis

  • paper_url: http://arxiv.org/abs/2308.07610
  • repo_url: None
  • paper_authors: Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yanqing Zhao, Yuhang Chen, Hao Yang, Yanfei Jiang, Xun Chen
  • for: Propose a novel zero-shot and interpretable log analysis approach, improving reliability and resilience throughout software maintenance and engineering life cycles.
  • methods: Employs large language models (LLMs) for zero-shot log analysis via a suite of advanced prompt strategies tailored for log tasks, which enhances LLM performance by up to 107.5% compared with simple prompts.
  • results: Despite using no training data, LogPrompt outperforms existing approaches trained on thousands of logs by up to around 50% across nine public datasets and two tasks; six practitioners with over 10 years of experience rated its interpretability highly (4.42/5 on average).
    Abstract Automated log analysis is crucial in modern software-intensive systems for ensuring reliability and resilience throughout software maintenance and engineering life cycles. Existing methods perform tasks such as log parsing and log anomaly detection by providing a single prediction value without interpretation. However, given the increasing volume of system events, the limited interpretability of analysis results hinders analysts' trust and their ability to take appropriate actions. Moreover, these methods require substantial in-domain training data, and their performance declines sharply (by up to 62.5%) in online scenarios involving unseen logs from new domains, a common occurrence due to rapid software updates. In this paper, we propose LogPrompt, a novel zero-shot and interpretable log analysis approach. LogPrompt employs large language models (LLMs) to perform zero-shot log analysis tasks via a suite of advanced prompt strategies tailored for log tasks, which enhances LLMs' performance by up to 107.5% compared with simple prompts. Experiments on nine publicly available evaluation datasets across two tasks demonstrate that LogPrompt, despite using no training data, outperforms existing approaches trained on thousands of logs by up to around 50%. We also conduct a human evaluation of LogPrompt's interpretability, with six practitioners possessing over 10 years of experience, who highly rated the generated content in terms of usefulness and readability (averagely 4.42/5). LogPrompt also exhibits remarkable compatibility with open-source and smaller-scale LLMs, making it flexible for practical deployment.

VBD-MT Chinese-Vietnamese Translation Systems for VLSP 2022

  • paper_url: http://arxiv.org/abs/2308.07601
  • repo_url: None
  • paper_authors: Hai Long Trieu, Song Kiet Bui, Tan Minh Tran, Van Khanh Tran, Hai An Nguyen
  • for: This work describes the systems submitted to the VLSP 2022 machine translation shared task.
  • methods: The systems build on the neural Transformer architecture with the strong multilingual denoising pre-trained model mBART, apply a sampling method for backtranslation to exploit large-scale available monolingual data, and further improve quality via ensembling and postprocessing.
  • results: The systems achieve 38.9 BLEU on Chinese-Vietnamese and 38.0 BLEU on Vietnamese-Chinese on the public test sets, outperforming several strong baselines.
    Abstract We present our systems that participated in the VLSP 2022 machine translation shared task. In this year's shared task, we participated in both translation directions, i.e., Chinese-Vietnamese and Vietnamese-Chinese. We build our systems on the neural Transformer model with the powerful multilingual denoising pre-trained model mBART. The systems are enhanced by a sampling method for backtranslation, which leverages large-scale available monolingual data. Additionally, several other methods are applied to improve translation quality, including ensembling and postprocessing. We achieve 38.9 BLEU on Chinese-Vietnamese and 38.0 BLEU on Vietnamese-Chinese on the public test sets, which outperforms several strong baselines.

A User-Centered Evaluation of Spanish Text Simplification

  • paper_url: http://arxiv.org/abs/2308.07556
  • repo_url: None
  • paper_authors: Adrian de Wynter, Anthony Hevia, Si-Qing Chen
  • for: The paper evaluates Spanish text simplification (TS) for a production system, using two corpora focused on complex-sentence and complex-word identification.
  • methods: The authors compare the most prevalent Spanish-specific readability scores with neural networks for predicting user preferences regarding TS.
  • results: Neural networks consistently outperform the readability scores; multilingual models underperform equivalent Spanish-only models on the same task; and all models focus too often on spurious statistical features such as sentence length. The evaluation corpora are released to the broader community.
    Abstract We present an evaluation of text simplification (TS) in Spanish for a production system, by means of two corpora focused on both complex-sentence and complex-word identification. We compare the most prevalent Spanish-specific readability scores with neural networks, and show that the latter are consistently better at predicting user preferences regarding TS. As part of our analysis, we find that multilingual models underperform against equivalent Spanish-only models on the same task, yet all models focus too often on spurious statistical features, such as sentence length. We release the corpora in our evaluation to the broader community with the hopes of pushing forward the state-of-the-art in Spanish natural language processing.

Improving CTC-AED model with integrated-CTC and auxiliary loss regularization

  • paper_url: http://arxiv.org/abs/2308.08449
  • repo_url: None
  • paper_authors: Daobin Zhu, Xiangdong Su, Hongbin Zhang
  • for: automatic speech recognition (ASR)
  • methods: Connectionist temporal classification (CTC) and attention-based encoder decoder (AED) joint training, with two fusion methods (DAL and PMP) and auxiliary loss regularization
  • results: Experimental results show that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
    Abstract Connectionist temporal classification (CTC) and attention-based encoder decoder (AED) joint training has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models that separately calculate the CTC and AED losses, our proposed integrated-CTC utilizes the attention mechanism of AED to guide the output of CTC. In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP). We achieve dimensional consistency by adaptively affine transforming the attention results to match the dimensions of CTC. To accelerate model convergence and improve accuracy, we introduce auxiliary loss regularization for accelerated convergence. Experimental results demonstrate that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
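A minimal sketch of the DAL (direct addition of logits) fusion described above, with an adaptive affine transform for dimensional consistency; the tensor shapes, the single linear layer, and the assumption that the two streams are already time-aligned are all illustrative, not the paper's implementation.

```python
# Sketch of DAL fusion: attention-decoder features are affine-transformed to
# the CTC output dimension and added directly to the CTC logits.
import torch
import torch.nn as nn

class DALFusion(nn.Module):
    def __init__(self, attn_dim: int, ctc_vocab: int):
        super().__init__()
        # Adaptive affine transform so attention results match CTC dimensions.
        self.affine = nn.Linear(attn_dim, ctc_vocab)

    def forward(self, ctc_logits: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        return ctc_logits + self.affine(attn_out)  # direct addition of logits

fusion = DALFusion(attn_dim=256, ctc_vocab=5000)
ctc_logits = torch.randn(8, 120, 5000)   # (batch, frames, vocab)
attn_out = torch.randn(8, 120, 256)      # time-aligned attention features (assumed)
print(fusion(ctc_logits, attn_out).shape)  # torch.Size([8, 120, 5000])
```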

CALYPSO: LLMs as Dungeon Masters’ Assistants

  • paper_url: http://arxiv.org/abs/2308.07540
  • repo_url: https://github.com/northern-lights-province/calypso-aiide-artifact
  • paper_authors: Andrew Zhu, Lara J. Martin, Andrew Head, Chris Callison-Burch
  • for: The paper explores use cases for large language models (LLMs) in the tabletop role-playing game Dungeons & Dragons (D&D) and in tabletop gaming more broadly.
  • methods: The authors use LLMs (GPT-3 and ChatGPT) to generate coherent natural-language text and conduct a formative evaluation with Dungeon Masters (DMs) to establish use cases, introducing CALYPSO, a system of LLM-powered interfaces that supply scenario-specific information and inspiration.
  • results: With access to CALYPSO, DMs reported that it generated high-fidelity text suitable for direct presentation to players, as well as low-fidelity ideas they could develop further while maintaining their creative agency.
    Abstract The role of a Dungeon Master, or DM, in the game Dungeons & Dragons is to perform multiple tasks simultaneously. The DM must digest information about the game setting and monsters, synthesize scenes to present to other players, and respond to the players' interactions with the scene. Doing all of these tasks while maintaining consistency within the narrative and story world is no small feat of human cognition, making the task tiring and unapproachable to new players. Large language models (LLMs) like GPT-3 and ChatGPT have shown remarkable abilities to generate coherent natural language text. In this paper, we conduct a formative evaluation with DMs to establish the use cases of LLMs in D&D and tabletop gaming generally. We introduce CALYPSO, a system of LLM-powered interfaces that support DMs with information and inspiration specific to their own scenario. CALYPSO distills game context into bite-sized prose and helps brainstorm ideas without distracting the DM from the game. When given access to CALYPSO, DMs reported that it generated high-fidelity text suitable for direct presentation to players, and low-fidelity ideas that the DM could develop further while maintaining their creative agency. We see CALYPSO as exemplifying a paradigm of AI-augmented tools that provide synchronous creative assistance within established game worlds, and tabletop gaming more broadly.

Finding Stakeholder-Material Information from 10-K Reports using Fine-Tuned BERT and LSTM Models

  • paper_url: http://arxiv.org/abs/2308.07522
  • repo_url: None
  • paper_authors: Victor Zitian Chen
  • for: The paper aims to identify stakeholder-material information in annual 10-K reports to help companies and investors efficiently extract material information.
  • methods: The authors fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information, using business expert-labeled training data.
  • results: The best model achieved an accuracy of 0.904 and an F1 score of 0.899 on test data, significantly outperforming the baseline model.
    Abstract All public companies are required by federal securities law to disclose their business and financial activities in their annual 10-K reports. Each report typically spans hundreds of pages, making it difficult for human readers to identify and extract the material information efficiently. To solve the problem, I have fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information, defined as statements that carry information about a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment. The existing practice uses keyword search to identify such information, which is my baseline model. Using business expert-labeled training data of nearly 6,000 sentences from 62 10-K reports published in 2022, the best model has achieved an accuracy of 0.904 and an F1 score of 0.899 in test data, significantly above the baseline model's 0.781 and 0.749 respectively. Furthermore, the same work was replicated on more granular taxonomies, based on which four distinct groups of stakeholders (i.e., customers, investors, employees, and the community and natural environment) are tested separately. Similarly, fine-tuned BERT models outperformed LSTM and the baseline. The implications for industry application and ideas for future extensions are discussed.
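A brief sketch of the fine-tuning setup described above, using the Hugging Face transformers library; the checkpoint name, learning rate, and the two in-line example sentences are illustrative assumptions, not the paper's data or code.

```python
# Sketch of fine-tuning BERT to flag stakeholder-material sentences.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # material vs. not material

sentences = ["We reduced plant emissions by 12% this year.",
             "The annual meeting was held in May."]
labels = torch.tensor([1, 0])  # hypothetical expert labels

batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)  # HF models return loss when labels given
out.loss.backward()
optim.step()
print(float(out.loss))
```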

Data Race Detection Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07505
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Le Chen, Xianzhong Ding, Murali Emani, Tristan Vanderbruggen, Pei-hung Lin, Chuanhua Liao
  • for: The paper explores a large language model (LLM)-based data race detection approach as an alternative to resource-intensive manual tool creation.
  • methods: The authors apply prompt engineering and fine-tuning, create the dedicated DRB-ML dataset derived from DataRaceBench with fine-grained labels, and use it to evaluate representative LLMs and fine-tune open-source ones.
  • results: LLMs show promise as a viable approach to data race detection, but they cannot yet match traditional tools when detailed information about the variable pairs causing a race is required.
    Abstract Large language models (LLMs) are demonstrating significant promise as an alternate strategy to facilitate analyses and optimizations of high-performance computing programs, circumventing the need for resource-intensive manual tool creation. In this paper, we explore a novel LLM-based data race detection approach combining prompting engineering and fine-tuning techniques. We create a dedicated dataset named DRB-ML, which is derived from DataRaceBench, with fine-grain labels showing the presence of data race pairs and their associated variables, line numbers, and read/write information. DRB-ML is then used to evaluate representative LLMs and fine-tune open-source ones. Our experiment shows that LLMs can be a viable approach to data race detection. However, they still cannot compete with traditional data race detection tools when we need detailed information about variable pairs causing data races.

SOTASTREAM: A Streaming Approach to Machine Translation Training

  • paper_url: http://arxiv.org/abs/2308.07489
  • repo_url: https://github.com/marian-nmt/sotastream
  • paper_authors: Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain, Marcin Junczys-Dowmunt
  • for: The paper addresses the limitations of traditional data-preparation steps in machine translation toolkits, which are time-consuming, expensive, and cumbersome.
  • methods: The proposed approach separates the generation of data from its consumption, allowing on-the-fly modifications and eliminating the separate pre-processing step.
  • results: The approach reduces training time, adds flexibility, lowers experiment-management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
    Abstract Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
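A conceptual sketch of the streaming idea, not SOTASTREAM's actual API: raw data is yielded as an infinite stream of permutations, modified on the fly by user-definable operators, and consumed directly by the trainer with no pre-processing step.

```python
# Conceptual sketch: infinite permuted data stream with pluggable operators.
import random
from typing import Callable, Iterator

def infinite_stream(corpus: list[str], seed: int = 0) -> Iterator[str]:
    rng = random.Random(seed)
    while True:                      # each pass is a fresh permutation
        order = list(corpus)
        rng.shuffle(order)
        yield from order

def apply_ops(stream: Iterator[str],
              ops: list[Callable[[str], str]]) -> Iterator[str]:
    for line in stream:
        for op in ops:               # on-the-fly normalization/augmentation/filtering
            line = op(line)
        yield line

corpus = ["Hello world .", "Guten Tag !", "Bonjour le monde ."]
ops = [str.lower]                    # stand-in for a real normalization operator
stream = apply_ops(infinite_stream(corpus), ops)
batch = [next(stream) for _ in range(4)]  # the trainer tensorizes as it consumes
print(batch)
```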

O-1: Self-training with Oracle and 1-best Hypothesis

  • paper_url: http://arxiv.org/abs/2308.07486
  • repo_url: None
  • paper_authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi
  • for: The goal is to reduce training bias and unify training and evaluation metrics for speech recognition.
  • methods: O-1 is a self-training objective, a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and accommodates both supervised and unsupervised data.
  • results: On SpeechStew, O-1 closes the gap between actual and oracle performance by 80% relative, versus 43% for EMBR; it achieves 13%-25% relative improvement over EMBR across the SpeechStew datasets and a 12% relative gap reduction with respect to the oracle WER on a large-scale in-house dataset, for an overall 9% relative WER improvement over EMBR.
    Abstract We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew datasets and a large-scale, in-house data set. On Speechstew, the O-1 objective closes the gap between the actual and oracle performance by 80\% relative compared to EMBR which bridges the gap by 43\% relative. O-1 achieves 13\% to 25\% relative improvement over EMBR on the various datasets that SpeechStew comprises of, and a 12\% relative gap reduction with respect to the oracle WER over EMBR training on the in-house dataset. Overall, O-1 results in a 9\% relative improvement in WER over EMBR, thereby speaking to the scalability of the proposed objective for large-scale datasets.

Development and Evaluation of Three Chatbots for Postpartum Mood and Anxiety Disorders

  • paper_url: http://arxiv.org/abs/2308.07407
  • repo_url: None
  • paper_authors: Xuewen Yao, Miriam Mikhelson, S. Craig Watkins, Eunsol Choi, Edison Thomaz, Kaya de Barbaro
  • for: The goal is to develop chatbots that provide context-specific empathetic support to postpartum caregivers, in collaboration with Postpartum Support International (PSI).
  • methods: The authors build both rule-based and generative models and evaluate them with machine-based metrics and human questionnaires.
  • results: The rule-based model performs best, with outputs close to the ground-truth references and the highest levels of empathy; human users prefer it for its context-specific, human-like replies. The generative chatbot also produces empathetic, engaging responses, but limitations in the training data often lead to confusing or nonsensical output.
    Abstract In collaboration with Postpartum Support International (PSI), a non-profit organization dedicated to supporting caregivers with postpartum mood and anxiety disorders, we developed three chatbots to provide context-specific empathetic support to postpartum caregivers, leveraging both rule-based and generative models. We present and evaluate the performance of our chatbots using both machine-based metrics and human-based questionnaires. Overall, our rule-based model achieves the best performance, with outputs that are close to ground truth reference and contain the highest levels of empathy. Human users prefer the rule-based chatbot over the generative chatbot for its context-specific and human-like replies. Our generative chatbot also produced empathetic responses and was described by human users as engaging. However, limitations in the training dataset often result in confusing or nonsensical responses. We conclude by discussing practical benefits of rule-based vs. generative models for supporting individuals with mental health challenges. In light of the recent surge of ChatGPT and BARD, we also discuss the possibilities and pitfalls of large language models for digital mental healthcare.

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

  • paper_url: http://arxiv.org/abs/2308.07395
  • repo_url: None
  • paper_authors: Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath
  • for: The goal is to improve performance on auxiliary (non-ASR) tasks.
  • methods: The authors train an ASR model with text injection, using joint end-to-end and internal language model training (JEIT), on two auxiliary tasks: capitalization and turn-taking prediction.
  • results: Text injection boosts capitalization performance on long-tail data and improves turn-taking detection recall.
    Abstract Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.

Using Text Injection to Improve Recognition of Personal Identifiers in Speech

  • paper_url: http://arxiv.org/abs/2308.07393
  • repo_url: None
  • paper_authors: Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran
  • for: The goal is to improve recognition of personally identifiable information (PII) in automatic speech recognition (ASR) systems.
  • methods: The authors use text injection to include fake textual substitutes for PII categories in the training data.
  • results: The method substantially improves recall of names and dates in medical notes while improving overall WER, and improves character error rate and sentence accuracy for alphanumeric digit sequences.
    Abstract Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or eliminate Personally Identifiable Information (PII) from collection altogether. However, this results in ASR models that tend to have lower recognition accuracy of these categories. We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. We demonstrate substantial improvement to Recall of Names and Dates in medical notes while improving overall WER. For alphanumeric digit sequences we show improvements to Character Error Rate and Sentence Accuracy.

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

  • paper_url: http://arxiv.org/abs/2308.07317
  • repo_url: https://github.com/arielnlee/Platypus
  • paper_authors: Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
  • for: The paper presents Platypus, a family of fine-tuned and merged Large Language Models (LLMs) that tops HuggingFace's Open LLM Leaderboard as of the work's release date.
  • methods: The authors curate and release the Open-Platypus dataset, a subset of other open datasets, and fine-tune and merge LoRA modules so that the strong prior of pretrained LLMs is conserved while specific domain knowledge is brought to the surface; they also check for test-data leaks and contamination in the training data.
  • results: The Platypus family achieves strong quantitative LLM metrics across model sizes while using only a fraction of the fine-tuning data and compute required by other state-of-the-art fine-tuned LLMs; for example, a 13B Platypus model can be trained on a single A100 GPU with 25k questions in 5 hours, a testament to the quality of the Open-Platypus dataset.
    Abstract We present $\textbf{Platypus}$, a family of fine-tuned and merged Large Language Models (LLMs) that achieves the strongest performance and currently stands at first place in HuggingFace's Open LLM Leaderboard as of the release date of this work. In this work we describe (1) our curated dataset $\textbf{Open-Platypus}$, that is a subset of other open datasets and which $\textit{we release to the public}$ (2) our process of fine-tuning and merging LoRA modules in order to conserve the strong prior of pretrained LLMs, while bringing specific domain knowledge to the surface (3) our efforts in checking for test data leaks and contamination in the training data, which can inform future research. Specifically, the Platypus family achieves strong performance in quantitative LLM metrics across model sizes, topping the global Open LLM leaderboard while using just a fraction of the fine-tuning data and overall compute that are required for other state-of-the-art fine-tuned LLMs. In particular, a 13B Platypus model can be trained on $\textit{a single}$ A100 GPU using 25k questions in 5 hours. This is a testament of the quality of our Open-Platypus dataset, and opens opportunities for more improvements in the field. Project page: https://platypus-llm.github.io
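A hedged sketch of the fine-tune-and-merge step using the PEFT library's LoRA support; the model and adapter identifiers are placeholders, not the released Platypus artifacts, and the recipe here is only one plausible way to realize the merging described above.

```python
# Sketch of merging a trained LoRA adapter back into a base model with PEFT,
# so the pretrained prior is kept while the adapter's domain knowledge is
# folded into the weights. Identifiers below are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = peft_model.merge_and_unload()  # fold low-rank updates into base weights
merged.save_pretrained("platypus-style-merged-model")
```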

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

  • paper_url: http://arxiv.org/abs/2308.07286
  • repo_url: None
  • paper_authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
  • for: The paper targets automatic evaluation of machine translation (MT) systems, a critical tool for their rapid iterative development.
  • methods: AutoMQM is a prompting technique that leverages the reasoning and in-context learning capabilities of large language models (LLMs), asking them to identify and categorize errors in translations; the authors also study score-prediction prompting and the impact of labeled data through in-context learning and finetuning.
  • results: AutoMQM improves performance over score-only prompting, with particularly large gains for larger models, and provides interpretability through error spans that align with human annotations.
    Abstract Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
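A hypothetical AutoMQM-style prompt: instead of asking for a single score, the LLM is asked to mark and categorize error spans. The exact wording, category list, and output format below are assumptions for illustration, not the paper's prompts.

```python
# Hypothetical MQM-style error-annotation prompt for an LLM evaluator.
def automqm_prompt(source: str, translation: str) -> str:
    return (
        "Identify the errors in the translation below. For each error, give "
        "the exact span, a category (accuracy, fluency, terminology, style), "
        "and a severity (major or minor).\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Errors:"
    )

print(automqm_prompt(
    "Der Hund schläft im Garten.",
    "The dog sleeps in the house.",
))  # an LLM should flag "house" as a major accuracy error
```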

Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

  • paper_url: http://arxiv.org/abs/2308.07282
  • repo_url: None
  • paper_authors: Olesya Razuvayevskaya, Ben Wu, Joao A. Leite, Freddy Heppell, Ivan Srba, Carolina Scarton, Kalina Bontcheva, Xingyi Song
  • for: The study investigates how parameter-efficient fine-tuning techniques affect classification performance and computation cost, compared with full fine-tuning, on multilingual text classification tasks (genre, framing, and persuasion-technique detection) with varying input lengths, numbers of predicted classes, and classification difficulty, some with limited training data.
  • methods: The authors apply adapters and LoRA, with in-depth analyses across training scenarios (the original multilingual data, translations into English, and an English-only subset) and across languages.
  • results: Adapters and LoRA reduce training time and computation cost, and in some cases improve performance, offering valuable insight into their applicability to complex multilingual and multilabel classification tasks.
    Abstract Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence the classification performance and computation costs compared to full fine-tuning when applied to multilingual text classification tasks (genre, framing, and persuasion techniques detection; with different input lengths, number of predicted classes and classification difficulty), some of which have limited training data. In addition, we conduct in-depth analyses of their efficacy across different training scenarios (training on the original multilingual data; on the translations into English; and on a subset of English-only data) and different languages. Our findings provide valuable insights into the applicability of the parameter-efficient fine-tuning techniques, particularly to complex multilingual and multilabel classification tasks.

Dialogue for Prompting: a Policy-Gradient-Based Discrete Prompt Optimization for Few-shot Learning

  • paper_url: http://arxiv.org/abs/2308.07272
  • repo_url: None
  • paper_authors: Chengzhengxu Li, Xiaoming Liu, Yichen Wang, Duyi Li, Yu Lan, Chao Shen
  • for: The goal is to improve few-shot natural language understanding (NLU) performance while reducing the expert knowledge and manual effort required for prompt design.
  • methods: The authors design a multi-round dialogue alignment strategy based on GPT-4 for generating readable prompt sets, propose an efficient prompt-screening metric with linear complexity, and build a policy-gradient reinforcement learning (RL) framework with a policy network to match prompts to inputs.
  • results: Across four open-source datasets, DP_2O outperforms the state-of-the-art (SOTA) method by 1.52% accuracy on average in the few-shot setting, and shows good universality, robustness, and generalization.
    Abstract The prompt-based paradigm for pre-trained language models (PLMs) has succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization ($DP_2O$) method. We first design a multi-round dialogue alignment strategy for readability prompt set generation based on GPT-4. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.67% of the PLM parameter size on the tasks in the few-shot setting, $DP_2O$ outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that $DP_2O$ has good universality, robustness, and generalization ability.
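A minimal REINFORCE-style sketch of matching discrete prompts to inputs with a small policy network, in the spirit of the framework described above; the feature encoder, candidate-prompt count, and the stand-in reward are all assumptions, not the paper's setup.

```python
# REINFORCE sketch: a tiny policy network learns which candidate prompt to
# pair with each input, driven by a (stand-in) downstream reward.
import torch
import torch.nn as nn

N_PROMPTS, FEAT = 4, 16
policy = nn.Linear(FEAT, N_PROMPTS)   # tiny policy network over prompt choices
optim = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reward(prompt_idx: torch.Tensor) -> torch.Tensor:
    # Stand-in reward: pretend prompt 2 yields the best downstream accuracy.
    return (prompt_idx == 2).float()

for step in range(200):
    x = torch.randn(32, FEAT)                      # batch of input representations
    dist = torch.distributions.Categorical(logits=policy(x))
    a = dist.sample()                              # pick one prompt per input
    loss = -(dist.log_prob(a) * reward(a)).mean()  # REINFORCE objective
    optim.zero_grad()
    loss.backward()
    optim.step()

print(policy(torch.randn(1, FEAT)).argmax().item())  # likely 2 after training
```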

cs.LG - 2023-08-15

Dyadic Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.07843
  • repo_url: https://github.com/statisticalreinforcementlearninglab/roadmap2.0testbed
  • paper_authors: Shuangning Li, Lluis Salvat Niell, Sung Won Choi, Inbal Nahum-Shani, Guy Shani, Susan Murphy
  • for: The paper aims to enhance health outcomes by delivering mobile-health interventions that target the dyadic relationship between a target person and their care partner.
  • methods: The authors develop dyadic RL, an online, Bayesian, hierarchical reinforcement learning algorithm that personalizes intervention delivery based on contextual factors and the past responses of the target person and their care partner.
  • results: The paper formally introduces the problem setup, establishes a regret bound, and demonstrates empirical performance through simulation studies on toy scenarios and a realistic test bed built from data collected in a mobile health study.
    Abstract Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily life. The involvement of care partners and social support networks often proves crucial in helping individuals managing burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and on a realistic test bed constructed from data collected in a mobile health study.

Simple and Efficient Partial Graph Adversarial Attack: A New Perspective

  • paper_url: http://arxiv.org/abs/2308.07834
  • repo_url: https://github.com/pasalab/pga
  • paper_authors: Guanghui Zhu, Mengyu Chen, Chunfeng Yuan, Yihua Huang
  • for: The work advances the study of graph neural network robustness and security by questioning global attack methods that treat every node in the graph as an attack target.
  • methods: The authors propose the partial graph attack (PGA), which selects vulnerable nodes as targets via a hierarchical target-selection policy, a cost-effective anchor-picking policy for adding or removing edges, and a more aggressive iterative greedy-based attack method.
  • results: PGA achieves significant improvements in both attack effect and attack efficiency compared with existing global graph attack methods.
    Abstract As the study of graph neural networks becomes more intensive and comprehensive, their robustness and security have received great research interest. The existing global attack methods treat all nodes in the graph as their attack targets. Although existing methods have achieved excellent results, there is still considerable space for improvement. The key problem is that the current approaches rigidly follow the definition of global attacks. They ignore an important issue, i.e., different nodes have different robustness and are not equally resilient to attacks. From a global attacker's view, we should arrange the attack budget wisely, rather than wasting them on highly robust nodes. To this end, we propose a totally new method named partial graph attack (PGA), which selects the vulnerable nodes as attack targets. First, to select the vulnerable items, we propose a hierarchical target selection policy, which allows attackers to only focus on easy-to-attack nodes. Then, we propose a cost-effective anchor-picking policy to pick the most promising anchors for adding or removing edges, and a more aggressive iterative greedy-based attack method to perform more efficient attacks. Extensive experimental results demonstrate that PGA can achieve significant improvements in both attack effect and attack efficiency compared to other existing graph global attack methods.
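One illustrative way to pick "easy-to-attack" nodes is to rank them by the margin between a surrogate model's top-two class probabilities and spend the attack budget on the low-margin ones; this simple heuristic is a stand-in for PGA's hierarchical target selection policy, not the paper's method.

```python
# Heuristic sketch: target the nodes whose predictions are most fragile.
import torch

def select_vulnerable_nodes(logits: torch.Tensor, budget: int) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values
    margin = top2[:, 0] - top2[:, 1]      # small margin -> fragile prediction
    return margin.argsort()[:budget]      # spend the budget on fragile nodes

logits = torch.randn(100, 7)              # surrogate GNN outputs, 7 classes
targets = select_vulnerable_nodes(logits, budget=10)
print(targets.tolist())
```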

REFORMS: Reporting Standards for Machine Learning Based Science

  • paper_url: http://arxiv.org/abs/2308.07832
  • repo_url: None
  • paper_authors: Sayash Kapoor, Emily Cantrell, Kenny Peng, Thanh Hien Pham, Christopher A. Bail, Odd Erik Gundersen, Jake M. Hofman, Jessica Hullman, Michael A. Lones, Momin M. Malik, Priyanka Nanayakkara, Russell A. Poldrack, Inioluwa Deborah Raji, Michael Roberts, Matthew J. Salganik, Marta Serra-Garcia, Brandon M. Stewart, Gilles Vandewiele, Arvind Narayanan
  • for: The paper aims to provide clear reporting standards for machine learning (ML) based science, addressing failures of validity, reproducibility, and generalizability in scientific research.
  • methods: The REFORMS checklist (Reporting Standards For Machine Learning Based Science) comprises 32 questions and paired guidelines, developed from a consensus of 19 researchers across computer science, data science, mathematics, the social sciences, and the biomedical sciences.
  • results: REFORMS can serve as a resource for researchers designing and implementing studies, for referees reviewing papers, and for journals enforcing standards of transparency and reproducibility.
    Abstract Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.

CMISR: Circular Medical Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.08567
  • repo_url: None
  • paper_authors: Honggui Li, Maria Trocan, Dimitri Galayko, Mohamad Sawan
  • for: The paper improves medical image super-resolution (MISR) by proposing a closed-cycle framework with global feedback, circular MISR (CMISR).
  • methods: CMISR couples explicit under-resolution (UR) and super-resolution (SR) elements in a global feedback loop; the authors build its mathematical model and closed-loop equation and prove, via Taylor-series approximation, zero recovery error in steady state. The framework is plug-and-play and can be built on any existing MISR algorithm.
  • results: Five CMISR algorithms built on state-of-the-art open-loop MISR algorithms outperform their open-loop counterparts in reconstruction across three scale factors and three open medical image datasets, and are particularly suited to medical images with strong edges or intense contrast.
    Abstract Classical methods of medical image super-resolution (MISR) utilize an open-loop architecture with an implicit under-resolution (UR) unit and an explicit super-resolution (SR) unit. The UR unit can always be given, assumed, or estimated, while the SR unit is elaborately designed according to various SR algorithms. The closed-loop feedback mechanism is widely employed in current MISR approaches and can efficiently improve their performance. The feedback mechanism may be divided into two categories: local and global feedback. Therefore, this paper proposes a global feedback-based closed-cycle framework, circular MISR (CMISR), with unambiguous UR and SR elements. The mathematical model and closed-loop equation of CMISR are built. A mathematical proof with Taylor-series approximation indicates that CMISR has zero recovery error in steady state. In addition, CMISR holds a plug-and-play characteristic and can be established on any existing MISR algorithms. Five CMISR algorithms are respectively proposed based on state-of-the-art open-loop MISR algorithms. Experimental results with three scale factors and on three open medical image datasets show that CMISR is superior to MISR in reconstruction performance and is particularly suited to medical images with strong edges or intense contrast.
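A toy sketch of a closed-cycle loop with global feedback: the high-resolution estimate is repeatedly corrected by super-resolving the residual between the observed low-resolution image and the down-sampled estimate. The average-pool/repeat operators below are crude stand-ins for the learned UR and SR units, chosen only to show the feedback structure.

```python
# Toy closed-cycle super-resolution loop with global feedback.
import numpy as np

def ur(x: np.ndarray, s: int = 2) -> np.ndarray:      # under-resolution unit
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def sr(r: np.ndarray, s: int = 2) -> np.ndarray:      # super-resolution unit
    return np.repeat(np.repeat(r, s, axis=0), s, axis=1)

truth = np.random.rand(8, 8)
y = ur(truth)                       # observed low-resolution image
x = sr(y)                           # initial open-loop reconstruction
for _ in range(20):                 # global feedback iterations
    x = x + sr(y - ur(x))           # residual in LR space, corrected in HR space
print(np.abs(ur(x) - y).max())      # residual -> 0: consistent in steady state
```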

Cerberus: A Deep Learning Hybrid Model for Lithium-Ion Battery Aging Estimation and Prediction Based on Relaxation Voltage Curves

  • paper_url: http://arxiv.org/abs/2308.07824
  • repo_url: None
  • paper_authors: Yue Xiang, Bo Jiang, Haifeng Dai
  • for: The paper aims to estimate and predict the capacity aging of lithium-ion batteries using a hybrid model based on deep learning, which can accurately forecast the future capacity of the batteries.
  • methods: The model uses historical capacity decay data and extracts salient features from charge and discharge relaxation processes to estimate the present capacity and predict future capacity.
  • results: The model achieves a mean absolute percentage error (MAPE) of 0.29% under a charging condition of 0.25C, demonstrating its effectiveness in estimating and predicting capacity aging using real-world relaxation processes and historical capacity records within battery management systems (BMS).
    Abstract The degradation process of lithium-ion batteries is intricately linked to their entire lifecycle as power sources and energy storage devices, encompassing aspects such as performance delivery and cycling utilization. Consequently, the accurate and expedient estimation or prediction of the aging state of lithium-ion batteries has garnered extensive attention. Nonetheless, prevailing research predominantly concentrates on either aging estimation or prediction, neglecting the dynamic fusion of both facets. This paper proposes a hybrid model for capacity aging estimation and prediction based on deep learning, wherein salient features highly pertinent to aging are extracted from charge and discharge relaxation processes. By amalgamating historical capacity decay data, the model dynamically furnishes estimations of the present capacity and forecasts of future capacity for lithium-ion batteries. Our approach is validated against a novel dataset involving charge and discharge cycles at varying rates. Specifically, under a charging condition of 0.25C, a mean absolute percentage error (MAPE) of 0.29% is achieved. This outcome underscores the model's adeptness in harnessing relaxation processes commonly encountered in the real world and synergizing with historical capacity records within battery management systems (BMS), thereby affording estimations and prognostications of capacity decline with heightened precision.

Deep reinforcement learning for process design: Review and perspective

  • paper_url: http://arxiv.org/abs/2308.07822
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Qinghe Gao, Artur M. Schweidtmann
  • for: The review explores how artificial intelligence can accelerate the chemical industry's transition towards renewable energy and feedstock supply.
  • methods: The survey examines deep reinforcement learning, a subclass of machine learning for complex decision-making, through three major elements of process design: information representation, agent architecture, and environment and reward.
  • results: The paper summarizes state-of-the-art applications of deep reinforcement learning in process design and discusses underlying challenges and promising future work toward unfolding its full potential in chemical engineering.
    Abstract The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate this transition. Specifically, deep reinforcement learning, a subclass of machine learning, has shown the potential to solve complex decision-making problems and aid sustainable process design. We survey state-of-the-art research in reinforcement learning for process design through three major elements: (i) information representation, (ii) agent architecture, and (iii) environment and reward. Moreover, we discuss perspectives on underlying challenges and promising future works to unfold the full potential of reinforcement learning for process design in chemical engineering.

Quantifying the Cost of Learning in Queueing Systems

  • paper_url: http://arxiv.org/abs/2308.07817
  • repo_url: None
  • paper_authors: Daniel Freund, Thodoris Lykouris, Wentao Weng
  • for: This paper is written for researchers and practitioners interested in queueing systems and their optimal control, particularly in the context of parameter uncertainty.
  • methods: The paper proposes a new metric called the Cost of Learning in Queueing (CLQ) to quantify the maximum increase in time-averaged queue length caused by parameter uncertainty, together with a unified analysis framework that bridges Lyapunov and bandit analysis.
  • results: The paper characterizes the CLQ of a single-queue multi-server system and extends the results to multi-queue multi-server systems and networks of queues, showing that the CLQ is a useful metric for evaluating queueing systems under parameter uncertainty.
    Abstract Queueing systems are widely applicable stochastic models with use cases in communication networks, healthcare, service systems, etc. Although their optimal control has been extensively studied, most existing approaches assume perfect knowledge of system parameters. Of course, this assumption rarely holds in practice where there is parameter uncertainty, thus motivating a recent line of work on bandit learning for queueing systems. This nascent stream of research focuses on the asymptotic performance of the proposed algorithms. In this paper, we argue that an asymptotic metric, which focuses on late-stage performance, is insufficient to capture the intrinsic statistical complexity of learning in queueing systems which typically occurs in the early stage. Instead, we propose the Cost of Learning in Queueing (CLQ), a new metric that quantifies the maximum increase in time-averaged queue length caused by parameter uncertainty. We characterize the CLQ of a single-queue multi-server system, and then extend these results to multi-queue multi-server systems and networks of queues. In establishing our results, we propose a unified analysis framework for CLQ that bridges Lyapunov and bandit analysis, which could be of independent interest.
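One hedged way to write the metric described above, with notation assumed for illustration (learning policy $\pi$, horizon $T$, uncertain parameter $\theta$ in a set $\Theta$, queue length $Q$) rather than taken from the paper:

```latex
% Worst-case gap in time-averaged queue length between a learning policy
% \pi and the optimal policy with known parameters (notation assumed).
\[
  \mathrm{CLQ}(\pi, T) \;=\; \max_{\theta \in \Theta}
  \left(
    \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{\theta}\!\left[Q^{\pi}(t)\right]
    \;-\;
    \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{\theta}\!\left[Q^{*}(t)\right]
  \right)
\]
```

Here $Q^{\pi}$ denotes queue length under the learning policy and $Q^{*}$ under the optimal policy with known parameters.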

Fairness and Privacy in Federated Learning and Their Implications in Healthcare

  • paper_url: http://arxiv.org/abs/2308.07805
  • repo_url: https://github.com/UVA-MLSys/DS7406
  • paper_authors: Navya Annapareddy, Jade Preston, Judy Fox
  • for: This paper aims to provide an overview of the typical lifecycle of fair federated learning in research and an updated taxonomy to account for the current state of fairness in implementations, with a focus on the healthcare domain.
  • methods: The paper examines federated learning, a decentralized approach to training machine learning models, as a way to address data security, privacy, and vulnerability considerations.
  • results: The paper highlights the challenges of implementing fairness in federated learning, including node data not being independent and identically distributed (iid), clients requiring high levels of communication overhead between peers, and the heterogeneity of clients within a network with respect to dataset bias and size, and provides added insight into the implications and challenges of supporting fairness in the healthcare domain.
    Abstract Currently, many contexts exist where distributed learning is difficult or otherwise constrained by security and communication limitations. One common domain where this is a consideration is healthcare, where data is often governed by data-use ordinances like HIPAA. On the other hand, larger sample sizes and shared data models are necessary to allow models to better generalize on account of the potential for more variability and balancing underrepresented classes. Federated learning is a type of distributed learning model that allows data to be trained in a decentralized manner. This, in turn, addresses data security, privacy, and vulnerability considerations as the data itself is not shared across the nodes of a given learning network. Three main challenges to federated learning are that node data are not independent and identically distributed (iid), that clients require high levels of communication overhead between peers, and that clients within a network are heterogeneous with respect to dataset bias and size. As the field has grown, the notion of fairness in federated learning has also been introduced through novel implementations. Fairness approaches differ from the standard form of federated learning and also have distinct challenges and considerations for the healthcare domain. This paper endeavors to outline the typical lifecycle of fair federated learning in research as well as provide an updated taxonomy to account for the current state of fairness in implementations. Lastly, this paper provides added insight into the implications and challenges of implementing and supporting fairness in federated learning in the healthcare domain.

Adaptive Noise Covariance Estimation under Colored Noise using Dynamic Expectation Maximization

  • paper_url: http://arxiv.org/abs/2308.07797
  • repo_url: https://github.com/ajitham123/DEM_NCM
  • paper_authors: Ajith Anil Meera, Pablo Lanillos
  • for: The paper presents a novel brain-inspired algorithm that accurately and adaptively estimates the noise covariance matrix (NCM) of dynamic systems subjected to colored noise.
  • methods: The authors extend the Dynamic Expectation Maximization (DEM) algorithm to perform online noise covariance and state estimation by optimizing the free energy objective, and prove that the NCM estimator converges to the global optimum of that objective.
  • results: In randomized numerical simulations, the estimator outperforms nine baseline methods under colored noise, including the best baseline (Variational Bayes) for joint noise and state estimation under high colored noise.
    Abstract The accurate estimation of the noise covariance matrix (NCM) in a dynamic system is critical for state estimation and control, as it has a major influence in their optimality. Although a large number of NCM estimation methods have been developed, most of them assume the noises to be white. However, in many real-world applications, the noises are colored (e.g., they exhibit temporal autocorrelations), resulting in suboptimal solutions. Here, we introduce a novel brain-inspired algorithm that accurately and adaptively estimates the NCM for dynamic systems subjected to colored noise. Particularly, we extend the Dynamic Expectation Maximization algorithm to perform both online noise covariance and state estimation by optimizing the free energy objective. We mathematically prove that our NCM estimator converges to the global optimum of this free energy objective. Using randomized numerical simulations, we show that our estimator outperforms nine baseline methods with minimal noise covariance estimation error under colored noise conditions. Notably, we show that our method outperforms the best baseline (Variational Bayes) in joint noise and state estimation for high colored noise. We foresee that the accuracy and the adaptive nature of our estimator make it suitable for online estimation in real-world applications.

Implementing Quantum Generative Adversarial Network (qGAN) and QCBM in Finance

  • paper_url: http://arxiv.org/abs/2308.08448
  • repo_url: None
  • paper_authors: Santanu Ganguly
  • for: The paper discusses upcoming active research areas in the application of quantum machine learning (QML) in finance.
  • methods: Using real-world financial datasets and simulated environments, the authors compare QML models that have drawn active interest in the financial world, including qGAN (quantum generative adversarial networks) and QCBM (quantum circuit Born machine).
  • results: The authors define quantum circuits for the qGAN discriminator and generator and show promise of a future quantum advantage via QML in finance.
    Abstract Quantum machine learning (QML) is a cross-disciplinary subject made up of two of the most exciting research areas: quantum computing and classical machine learning (ML), with ML and artificial intelligence (AI) being projected as the first fields that will be impacted by the rise of quantum machines. Quantum computers are being used today in drug discovery, material & molecular modelling and finance. In this work, we discuss some upcoming active new research areas in application of quantum machine learning (QML) in finance. We discuss certain QML models that has become areas of active interest in the financial world for various applications. We use real world financial dataset and compare models such as qGAN (quantum generative adversarial networks) and QCBM (quantum circuit Born machine) among others, using simulated environments. For the qGAN, we define quantum circuits for discriminators and generators and show promises of future quantum advantage via QML in finance.

Informed Named Entity Recognition Decoding for Generative Language Models

  • paper_url: http://arxiv.org/abs/2308.07791
  • repo_url: None
  • paper_authors: Tobias Deußer, Lars Hillebrand, Christian Bauckhage, Rafet Sifa
  • for: The paper proposes a simple yet effective approach, Informed Named Entity Recognition Decoding (iNERD), which treats named entity recognition as a generative process.
  • methods: iNERD leverages the language-understanding capabilities of recent generative models and employs an informed decoding scheme that incorporates the restricted nature of information extraction into open-ended text generation, improving performance and eliminating hallucinations; the model is coarse-tuned on a merged named-entity corpus to strengthen performance.
  • results: Evaluating five generative language models on eight named entity recognition datasets yields remarkable results, especially in settings with an unknown entity class set, demonstrating the adaptability of the approach.
    Abstract Ever-larger language models with ever-increasing capabilities are by now well-established text processing tools. Alas, information extraction tasks such as named entity recognition are still largely unaffected by this progress as they are primarily based on the previous generation of encoder-only transformer models. Here, we propose a simple yet effective approach, Informed Named Entity Recognition Decoding (iNERD), which treats named entity recognition as a generative process. It leverages the language understanding capabilities of recent generative models in a future-proof manner and employs an informed decoding scheme incorporating the restricted nature of information extraction into open-ended text generation, improving performance and eliminating any risk of hallucinations. We coarse-tune our model on a merged named entity corpus to strengthen its performance, evaluate five generative language models on eight named entity recognition datasets, and achieve remarkable results, especially in an environment with an unknown entity class set, demonstrating the adaptability of the approach.
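A minimal sketch of "informed" decoding: when the generative model must emit an entity label, the next-token distribution is masked to a fixed label set so labels outside the extraction schema cannot be hallucinated. The label vocabulary and logits below are illustrative assumptions, not the paper's decoding scheme in full.

```python
# Restricted (informed) decoding step: mask logits to schema-valid tokens.
import torch

def informed_step(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                    # keep only schema-valid tokens
    return int(torch.argmax(logits + mask))   # greedy, restricted decoding

vocab = {"PER": 0, "ORG": 1, "LOC": 2, "the": 3, "cat": 4}
allowed = [vocab["PER"], vocab["ORG"], vocab["LOC"]]
logits = torch.tensor([0.2, 0.1, 0.3, 2.0, 1.5])   # model favours "the"
print(informed_step(logits, allowed))               # -> 2 ("LOC"), never "the"
```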

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

  • paper_url: http://arxiv.org/abs/2308.07787
  • repo_url: https://github.com/joannahong/diffv2s
  • paper_authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
  • for: The goal is to improve the accuracy and intelligibility of video-to-speech synthesis, so that high-quality speech can be reconstructed from visual input alone.
  • methods: The authors introduce a vision-guided speaker embedding extractor built on a self-supervised pre-trained model with prompt tuning, so no reference audio is needed at inference time, together with a diffusion-based video-to-speech model, DiffV2S, conditioned on the extracted embeddings and the visual representation of the input frames.
  • results: DiffV2S preserves the phoneme details contained in the input video frames while producing highly intelligible mel-spectrograms that preserve the identities of multiple speakers, achieving state-of-the-art performance over previous video-to-speech techniques.
    Abstract Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speaking style guidance from a reference auditory information. Nevertheless, it is not always possible to obtain the audio information from the corresponding video input, especially during the inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique. In doing so, the rich speaker embedding information can be produced solely from input visual information, and the extra audio information is not necessary during the inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, so called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of the multiple speakers are all preserved. Our experimental results show that DiffV2S achieves the state-of-the-art performance compared to the previous video-to-speech synthesis technique.
    摘要 近期研究在视频到语音合成(即仅从视觉输入重建语音)方面展现了令人瞩目的成果。然而,以往的工作由于缺乏足够的引导,模型难以推断出正确的内容并配以合适的声音。为解决这一问题,它们引入了额外的说话人嵌入,以参考音频信息作为说话风格的引导。但在推理阶段,往往无法从对应的视频输入中获得音频信息。在本文中,我们提出了一种新的视觉引导说话人嵌入提取器,利用自监督预训练模型和提示调整技术。这样,仅凭输入的视觉信息即可产生丰富的说话人嵌入信息,推理时无需额外的音频信息。基于提取的视觉引导说话人嵌入表示,我们进一步开发了一种基于扩散的视频到语音合成模型 DiffV2S,该模型以说话人嵌入和从输入视频中提取的视觉表示为条件。DiffV2S 既能保留输入视频帧中的音素细节,又能生成高度可懂的 mel 频谱图,并完整保留多个说话人各自的身份。实验结果表明,DiffV2S 相比之前的视频到语音合成技术达到了最先进的性能。

Hierarchical generative modelling for autonomous robots

  • paper_url: http://arxiv.org/abs/2308.07775
  • repo_url: None
  • paper_authors: Kai Yuan, Noor Sajid, Karl Friston, Zhibin Li
  • for: investigate the fundamental aspect of motor control in autonomous robotic operations, and develop a hierarchical generative model to achieve versatile sensorimotor control.
  • methods: use hierarchical generative modeling, multi-level planning, and numerical/physical simulation to achieve autonomous completion of complex tasks.
  • results: demonstrate the effectiveness of using human-inspired motor control algorithms, and show the ability of a humanoid robot to retrieve and transport a box, open and walk through a door, and approach and kick a football, while showing robust performance in the presence of body damage and ground irregularities.
    Abstract Humans can produce complex whole-body motions when interacting with their surroundings, by planning, executing and combining individual limb movements. We investigated this fundamental aspect of motor control in the setting of autonomous robotic operations. We approach this problem by hierarchical generative modelling equipped with multi-level planning for autonomous task completion, mimicking the deep temporal architecture of human motor control. Here, temporal depth refers to the nested time scales at which successive levels of a forward or generative model unfold; for example, delivering an object requires a global plan to contextualise the fast coordination of multiple local movements of limbs. This separation of temporal scales also motivates robotics and control. Specifically, to achieve versatile sensorimotor control, it is advantageous to hierarchically structure the planning and low-level motor control of individual limbs. We use numerical and physical simulation to conduct experiments and to establish the efficacy of this formulation. Using a hierarchical generative model, we show how a humanoid robot can autonomously complete a complex task that necessitates a holistic use of locomotion, manipulation, and grasping. Specifically, we demonstrate the ability of a humanoid robot that can retrieve and transport a box, open and walk through a door to reach the destination, approach and kick a football, while showing robust performance in presence of body damage and ground irregularities. Our findings demonstrated the effectiveness of using human-inspired motor control algorithms, and our method provides a viable hierarchical architecture for the autonomous completion of challenging goal-directed tasks.
    摘要 人类在与周围环境交互时,能够通过规划、执行和组合各个肢体动作来产生复杂的全身运动。我们在自主机器人操作的场景下研究了运动控制的这一基本问题。我们采用配备多级规划的层次生成建模来解决该问题,以模仿人类运动控制的深层时间架构。这里的时间深度指的是生成模型各层级展开所对应的嵌套时间尺度,例如,递送物体需要一个全局规划来为多个肢体局部动作的快速协调提供上下文。这种时间尺度的分离同样对机器人学和控制具有启发意义:为实现多样化的感知运动控制,对各个肢体的规划与低层运动控制进行层次化组织是有利的。我们通过数值和物理仿真进行实验,验证了该框架的有效性。利用层次生成模型,我们展示了一个人形机器人如何自主完成需要综合运用移动、操作和抓取能力的复杂任务。具体而言,机器人能够拾取并搬运箱子,打开并穿过一扇门到达目的地,接近并踢出足球,并在躯体损伤和地面不平的情况下仍表现出稳健的性能。我们的研究结果证明了受人类启发的运动控制算法的有效性,该方法为自主完成具有挑战性的目标导向任务提供了一种可行的层次架构。

A Graph Encoder-Decoder Network for Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.07774
  • repo_url: None
  • paper_authors: Mahsa Mesgaran, A. Ben Hamza
  • for: 检测图中的异常节点
  • methods: 使用无监督的图编码器-解码器模型,学习异常评分函数,根据节点的异常程度对其排序。编码阶段提出了新颖的 LCPool 池化方法,通过带局部性约束的线性编码求解一个最小二乘优化问题来得到聚类分配矩阵,无需可学习参数,从而提高了效率和可解释性;解码阶段提出 LCUnpool 方法,重建原始图的结构和节点特征。
  • results: 在六个基准数据集上进行了实验评估,结果表明该方法优于当前最先进的异常检测方法。
    Abstract A key component of many graph neural networks (GNNs) is the pooling operation, which seeks to reduce the size of a graph while preserving important structural information. However, most existing graph pooling strategies rely on an assignment matrix obtained by employing a GNN layer, which is characterized by trainable parameters, often leading to significant computational complexity and a lack of interpretability in the pooling process. In this paper, we propose an unsupervised graph encoder-decoder model to detect abnormal nodes from graphs by learning an anomaly scoring function to rank nodes based on their degree of abnormality. In the encoding stage, we design a novel pooling mechanism, named LCPool, which leverages locality-constrained linear coding for feature encoding to find a cluster assignment matrix by solving a least-squares optimization problem with a locality regularization term. By enforcing locality constraints during the coding process, LCPool is designed to be free from learnable parameters, capable of efficiently handling large graphs, and can effectively generate a coarser graph representation while retaining the most significant structural characteristics of the graph. In the decoding stage, we propose an unpooling operation, called LCUnpool, to reconstruct both the structure and nodal features of the original graph. We conduct empirical evaluations of our method on six benchmark datasets using several evaluation metrics, and the results demonstrate its superiority over state-of-the-art anomaly detection approaches.
    摘要 许多图神经网络(GNN)的关键组成部分是池化操作,其目的是在保留重要结构信息的同时缩小图的规模。然而,现有的大多数图池化策略依赖于由 GNN 层得到的分配矩阵,其中包含可训练参数,往往导致计算复杂度高且池化过程缺乏可解释性。在本文中,我们提出了一种无监督的图编码器-解码器模型,通过学习异常评分函数并按异常程度对节点排序,来检测图中的异常节点。在编码阶段,我们设计了一种名为 LCPool 的新型池化机制,利用带局部性约束的线性编码进行特征编码,通过求解一个带局部性正则项的最小二乘优化问题来得到聚类分配矩阵。由于编码过程中施加了局部性约束,LCPool 不含可学习参数,能够高效处理大规模图,并在保留图中最重要结构特征的同时有效生成更粗粒度的图表示。在解码阶段,我们提出了一种称为 LCUnpool 的反池化操作,以重建原始图的结构和节点特征。我们在六个基准数据集上使用多种评估指标进行了实验评估,结果表明该方法优于当前最先进的异常检测方法。
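The locality-constrained linear coding step has a well-known closed-form solution (in the style of LLC, Wang et al., 2010), which the sketch below uses to produce a parameter-free cluster-assignment matrix; the anchor choice and exact regularization may differ from the paper's LCPool.

```python
import numpy as np

def lcpool_assign(X, B, lam=1e-2, sigma=1.0):
    """Locality-constrained linear coding of node features X (n x f) over
    anchors B (k x f); returns an n x k cluster-assignment matrix."""
    n, k = X.shape[0], B.shape[0]
    S = np.zeros((n, k))
    for i in range(n):
        z = B - X[i]                                   # shift anchors to the node
        d = np.exp(np.linalg.norm(z, axis=1) / sigma)  # locality penalties
        C = z @ z.T + lam * np.diag(d ** 2)            # regularized local Gram matrix
        c = np.linalg.solve(C, np.ones(k))             # closed-form least squares
        S[i] = c / c.sum()                             # codes sum to one
    return S

X = np.random.randn(6, 4)        # 6 nodes, 4-dim features
B = np.random.randn(3, 4)        # 3 anchors -> coarse graph with 3 clusters
S = lcpool_assign(X, B)
X_coarse = S.T @ X               # pooled (coarsened) node features
```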

MOLE: MOdular Learning FramEwork via Mutual Information Maximization

  • paper_url: http://arxiv.org/abs/2308.07772
  • repo_url: None
  • paper_authors: Tianchao Li, Yulong Pei
  • for: 这篇论文旨在介绍一种异步且局部的神经网络学习框架,即 Modular Learning Framework(MOLE)。
  • methods: 该框架将神经网络按层模块化,以互信息为每个模块定义训练目标,并通过互信息最大化逐个训练各模块。
  • results: 实验表明,MOLE 可以在向量、网格和图类型数据上有效训练,并能解决图数据上的节点级和图级任务,从而在不同类型的数据上得到了实验验证。
    Abstract This paper introduces an asynchronous and local learning framework for neural networks, named Modular Learning Framework (MOLE). This framework modularizes neural networks by layers, defines the training objective via mutual information for each module, and sequentially trains each module by mutual information maximization. MOLE turns training into local optimization with gradients isolated across modules, a scheme that is more biologically plausible than backpropagation (BP). We run experiments on vector-, grid- and graph-type data. In particular, this framework is capable of solving both graph- and node-level tasks for graph-type data. Therefore, MOLE has been experimentally proven to be universally applicable to different types of data.
    摘要 本文介绍一种用于神经网络的异步且局部的学习框架,名为模块学习框架(MOLE)。该框架将神经网络按层模块化,通过互信息为每个模块定义训练目标,并以互信息最大化的方式逐个训练各模块。MOLE 将训练转化为局部优化,梯度在模块之间相互隔离,这种方案比反向传播(BP)更符合生物学合理性。我们在向量、网格和图类型数据上进行了实验,证明 MOLE 能够解决图数据上的图级和节点级任务。因此,MOLE 在不同类型的数据上都具有普适性。
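A minimal sketch of module-wise local training in the spirit of MOLE is shown below. It uses a per-module classification head with cross-entropy as a tractable surrogate for the mutual-information objective (an assumption, since the paper's MI estimator is not specified here), and detaches features between modules so gradients stay local.

```python
import torch
import torch.nn as nn

modules = nn.ModuleList([nn.Sequential(nn.Linear(16, 32), nn.ReLU()),
                         nn.Sequential(nn.Linear(32, 32), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(32, 10), nn.Linear(32, 10)])  # per-module critics

def mi_surrogate(logits, y):
    # Negative cross-entropy is (up to constants) a lower-bound surrogate for I(h; y).
    return -nn.functional.cross_entropy(logits, y)

x = torch.randn(64, 16); y = torch.randint(0, 10, (64,))
h = x
for module, head in zip(modules, heads):               # sequential, local training
    opt = torch.optim.Adam(list(module.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        out = module(h)                                # gradients stay inside this module
        loss = -mi_surrogate(head(out), y)             # maximize the MI surrogate
        loss.backward()
        opt.step()
    h = module(h).detach()                             # freeze and pass features onward
```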

NeFL: Nested Federated Learning for Heterogeneous Clients

  • paper_url: http://arxiv.org/abs/2308.07761
  • repo_url: None
  • paper_authors: Honggu Kang, Seohyeon Cha, Jinwoo Shin, Jongmyeong Lee, Joonhyuk Kang
  • for: 這個研究旨在解決聯邦學習(Federated Learning,FL)訓練過程中,速度慢或無法完成訓練的客戶端(即落後者)拖慢整體訓練時間並降低性能的問題,並提升訓練模型的可擴展性。
  • methods: 本研究提出了一個稱為嵌套聯邦學習(NeFL)的通用框架,透過深度與寬度兩個方向的縮放,將模型有效地劃分為多個子模型。NeFL 將模型詮釋為以自適應步長求解常微分方程(ODE),並為解決不同架構子模型共同訓練時的不一致問題,將部分參數解耦。
  • results: 一系列實驗表明,NeFL 帶來顯著的性能提升,特別是對最差的子模型(例如在 CIFAR-10 上提升 8.33)。此外,NeFL 與近期的 FL 研究方向保持一致。
    Abstract Federated learning (FL) is a promising approach to privacy-preserving distributed learning. However, during the training pipeline of FL, slow or incapable clients (i.e., stragglers) slow down the total training time and degrade performance. System heterogeneity, including heterogeneous computing and network bandwidth, has been addressed to mitigate the impact of stragglers. Previous studies split models to tackle the issue, but with fewer degrees of freedom in terms of model architecture. We propose nested federated learning (NeFL), a generalized framework that efficiently divides a model into submodels using both depthwise and widthwise scaling. NeFL is implemented by interpreting models as solving ordinary differential equations (ODEs) with adaptive step sizes. To address the inconsistency that arises when training multiple submodels with different architectures, we decouple a few parameters. NeFL enables resource-constrained clients to effectively join the FL pipeline and the model to be trained with a larger amount of data. Through a series of experiments, we demonstrate that NeFL leads to significant gains, especially for the worst-case submodel (e.g., an 8.33 improvement on CIFAR-10). Furthermore, we demonstrate that NeFL aligns with recent studies in FL.
    摘要 联邦学习(FL)是一种有前景的隐私保护分布式学习方法。然而,在 FL 的训练流程中,速度慢或无法完成训练的客户端(即落后者)会拖长总训练时间并降低性能。为减轻落后者的影响,已有工作针对系统异构性(包括异构的计算能力和网络带宽)提出对策。先前的研究通过拆分模型来解决该问题,但在模型架构方面自由度较低。我们提出嵌套联邦学习(NeFL),一个通用框架,通过深度和宽度两个方向的缩放,高效地将模型划分为多个子模型。NeFL 通过将模型解释为以自适应步长求解常微分方程(ODE)来实现。为解决以不同架构训练多个子模型时产生的不一致,我们将少量参数解耦。NeFL 使资源受限的客户端能够有效加入 FL 流程,并使模型得以用更多数据进行训练。通过一系列实验,我们证明 NeFL 带来显著增益,特别是对最差的子模型(例如在 CIFAR-10 上提升 8.33)。此外,我们还证明 NeFL 与近期的 FL 研究方向一致。

Forward-Backward Reasoning in Large Language Models for Verification

  • paper_url: http://arxiv.org/abs/2308.07758
  • repo_url: None
  • paper_authors: Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, James T. Kwok
  • for: 提高理解任务中的推理能力
  • methods: 使用反向推理和前向推理的组合方法
  • results: 实验结果表明,FOBAR 方法在多个数据集和三种 LLM 上均达到了最先进的推理性能
    Abstract Chain-of-Thought (CoT) prompting has shown promising performance in various reasoning tasks. Recently, Self-Consistency \citep{wang2023selfconsistency} proposes to sample a diverse set of reasoning chains which may lead to different answers while the answer that receives the most votes is selected. In this paper, we propose a novel method to use backward reasoning in verifying candidate answers. We mask a token in the question by ${\bf x}$ and ask the LLM to predict the masked token when a candidate answer is provided by \textit{a simple template}, i.e., "\textit{\textbf{If we know the answer of the above question is \{a candidate answer\}, what is the value of unknown variable ${\bf x}$?}" Intuitively, the LLM is expected to predict the masked token successfully if the provided candidate answer is correct. We further propose FOBAR to combine forward and backward reasoning for estimating the probability of candidate answers. We conduct extensive experiments on six data sets and three LLMs. Experimental results demonstrate that FOBAR achieves state-of-the-art performance on various reasoning benchmarks.
    摘要 思维链(CoT)提示法在多种推理任务中表现出色。最近,Self-Consistency \citep{wang2023selfconsistency} 提出采样多条可能得出不同答案的推理链,并选取得票最多的答案。在本文中,我们提出了一种利用反向推理来验证候选答案的新方法:将问题中的一个 token 用 ${\bf x}$ 遮盖,并通过一个简单的模板,即"如果我们知道上述问题的答案是{某个候选答案},那么未知变量 ${\bf x}$ 的值是多少?",让 LLM 预测被遮盖的 token。直观地说,如果提供的候选答案是正确的,LLM 应能成功预测被遮盖的 token。我们进一步提出 FOBAR,将前向推理与反向推理相结合来估计候选答案的概率。我们在六个数据集和三种 LLM 上进行了广泛的实验,结果表明 FOBAR 在多种推理基准上达到了最先进的性能。
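A schematic of the forward-backward combination might look as follows; the backward prompt follows the template quoted in the abstract, `llm` is a placeholder for any completion API, and the product of forward vote share and backward success rate is our simplified reading of the estimator.

```python
# Hedged sketch of FOBAR-style candidate verification.
from collections import Counter

FORWARD = "Q: {question}\nLet's think step by step."
BACKWARD = ("{question_with_x} If we know the answer of the above question is "
            "{candidate}, what is the value of unknown variable x?")

def fobar_best(llm, question, question_with_x, masked_value,
               n_forward=8, n_backward=4):
    # Forward pass: self-consistency voting over sampled reasoning chains.
    forward_answers = [llm(FORWARD.format(question=question)) for _ in range(n_forward)]
    votes = Counter(forward_answers)
    scores = {}
    for cand, v in votes.items():
        # Backward pass: can the LLM recover the masked token given this candidate?
        prompt = BACKWARD.format(question_with_x=question_with_x, candidate=cand)
        hits = sum(llm(prompt) == masked_value for _ in range(n_backward))
        scores[cand] = (v / n_forward) * (hits / n_backward)  # forward x backward
    return max(scores, key=scores.get)

# Usage: best = fobar_best(my_llm, question, question_with_x, masked_value)
# where my_llm is any callable mapping a prompt string to a short answer string.
```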

Exploiting Sparsity in Automotive Radar Object Detection Networks

  • paper_url: http://arxiv.org/abs/2308.07748
  • repo_url: None
  • paper_authors: Marius Lippke, Maurice Quach, Sascha Braun, Daniel Köhler, Michael Ulrich, Bastian Bischoff, Wei Yap Tan
  • for: 这 paper 的目的是提出一种基于 sparse convolutional neural network 的对象检测方法,用于解决自动驾驶系统中的环境感知问题。
  • methods: 该 paper 使用了 grid-based detection 和 sparse backbone 架构,并提出了 sparse kernel point pillars (SKPP) 和 dual voxel point convolutions (DVPC) 等技术来解决 радиар特有的挑战。
  • results: 该 paper 在 nuScenes 数据集上进行了评测,并证明了 SKPP-DPVCN 架构可以比基线和前一个状态的对象检测方法提高 Car AP4.0 的性能,并降低了平均缩放错误 (ASE) 值。
    Abstract Having precise perception of the environment is crucial for ensuring the secure and reliable functioning of autonomous driving systems. Radar object detection networks are one fundamental part of such systems. CNN-based object detectors showed good performance in this context, but they require large compute resources. This paper investigates sparse convolutional object detection networks, which combine powerful grid-based detection with low compute resources. We investigate radar specific challenges and propose sparse kernel point pillars (SKPP) and dual voxel point convolutions (DVPC) as remedies for the grid rendering and sparse backbone architectures. We evaluate our SKPP-DPVCN architecture on nuScenes, which outperforms the baseline by 5.89% and the previous state of the art by 4.19% in Car AP4.0. Moreover, SKPP-DPVCN reduces the average scale error (ASE) by 21.41% over the baseline.
    摘要 精确感知环境对于保障自动驾驶系统的安全可靠运行至关重要,而雷达目标检测网络是这类系统的基础组件之一。基于 CNN 的目标检测器在此场景下表现良好,但需要大量计算资源。本文研究稀疏卷积目标检测网络,将强大的基于网格的检测与较低的计算开销相结合。我们分析了雷达特有的挑战,并提出 sparse kernel point pillars(SKPP)和 dual voxel point convolutions(DVPC),以解决网格渲染与稀疏主干架构方面的问题。我们在 nuScenes 上评估了 SKPP-DPVCN 架构,其在 Car AP4.0 上比基线高 5.89%,比此前最佳方法高 4.19%;此外,SKPP-DPVCN 将平均尺度误差(ASE)比基线降低了 21.41%。

Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World

  • paper_url: http://arxiv.org/abs/2308.07741
  • repo_url: None
  • paper_authors: Nico Gürtler, Felix Widmaier, Cansu Sancaktar, Sebastian Blaes, Pavel Kolev, Stefan Bauer, Manuel Wüthrich, Markus Wulfmeier, Martin Riedmiller, Arthur Allshire, Qiang Wang, Robert McCarthy, Hangyeol Kim, Jongchan Baek Pohang, Wookyong Kwon, Shanliang Qian, Yasunori Toshimitsu, Mike Yan Michelis, Amirhossein Kazemipour, Arman Raayatsanati, Hehui Zheng, Barnabasa Gavin Cangan, Bernhard Schölkopf, Georg Martius
  • for: 本研究的目的是架起强化学习(RL)与机器人学界之间的桥梁,让参与者能够像在仿真中一样便捷地在真实机器人上远程实验,以检验 RL 算法的实际性能。
  • methods: 本研究提供了预先采集的真实机器人数据集、详尽的软件文档,以及一个基于真实装置仿真的初始阶段,使参与者能够方便地通过离线强化学习在真实机器人上完成学习与评估。
  • results: 比赛结果表明,获胜队伍的方法能够在真实机器人上高效完成灵巧操作任务;本文还将这些方法与当前最先进的离线强化学习算法在挑战数据集上进行了基准比较。
    Abstract Experimentation on real robots is demanding in terms of time and costs. For this reason, a large part of the reinforcement learning (RL) community uses simulators to develop and benchmark algorithms. However, insights gained in simulation do not necessarily translate to real robots, in particular for tasks involving complex interactions with the environment. The Real Robot Challenge 2022 therefore served as a bridge between the RL and robotics communities by allowing participants to experiment remotely with a real robot - as easily as in simulation. In the last years, offline reinforcement learning has matured into a promising paradigm for learning from pre-collected datasets, alleviating the reliance on expensive online interactions. We therefore asked the participants to learn two dexterous manipulation tasks involving pushing, grasping, and in-hand orientation from provided real-robot datasets. An extensive software documentation and an initial stage based on a simulation of the real set-up made the competition particularly accessible. By giving each team plenty of access budget to evaluate their offline-learned policies on a cluster of seven identical real TriFinger platforms, we organized an exciting competition for machine learners and roboticists alike. In this work we state the rules of the competition, present the methods used by the winning teams and compare their results with a benchmark of state-of-the-art offline RL algorithms on the challenge datasets.
    摘要 在真实机器人上开展实验需要耗费大量时间与成本。因此,强化学习(RL)社区的很大一部分研究者使用仿真器来开发和评测算法。然而,在仿真中获得的洞见并不一定能迁移到真实机器人上,对于涉及与环境复杂交互的任务尤其如此。因此,2022 年真实机器人挑战赛(Real Robot Challenge)在 RL 与机器人学界之间架起了桥梁,让参与者能够像在仿真中一样便捷地远程操控真实机器人进行实验。近年来,离线强化学习已发展成为一种很有前景的范式,能够从预先采集的数据集中学习,从而减少对昂贵在线交互的依赖。因此,我们要求参与者基于提供的真实机器人数据集,学习两项涉及推动、抓取和手内调姿的灵巧操作任务。详尽的软件文档和一个基于真实装置仿真的初始阶段,大幅降低了比赛的参与门槛。每支队伍都获得了充足的访问预算,可在由七台相同的 TriFinger 真实平台组成的集群上评估其离线学习的策略,这场比赛令机器学习与机器人学研究者都倍感兴奋。在本文中,我们给出比赛规则,介绍获胜队伍所采用的方法,并将其结果与当前最先进的离线 RL 算法在挑战数据集上的基准进行比较。

Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability

  • paper_url: http://arxiv.org/abs/2308.07728
  • repo_url: None
  • paper_authors: Seokhyeon Ha, Sunbeom Jung, Jungwoo Lee
  • for: 本研究旨在提出一种新方法,使模型在适应新目标领域时既能缓解特征扭曲,又能优化性能。
  • methods: 本研究提出了批归一化转换方法以减少微调过程中对网络的改动,从而缓解特征扭曲;并将线性探测(linear probing)与微调相结合,在逐步适应特征提取器的同时优化头层。
  • results: 与基线方法相比,该方法在分布内和分布外数据上均表现更佳,并有效减少了特征扭曲。
    Abstract Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.
    摘要 微调预训练神经网络模型已成为各领域广泛采用的方法。然而,这可能会扭曲本已具备强大泛化能力的预训练特征提取器。在适应新目标领域时缓解特征扭曲至关重要。近期研究表明,在微调之前先在分布内数据集上对齐头层,可在处理特征扭曲方面取得可喜的成果。然而,这类方法在微调过程中对批归一化层的处理存在明显不足,导致性能欠佳。在本文中,我们提出了领域感知微调(DAFT),一种结合批归一化转换以及线性探测与微调相融合的新方法。我们的批归一化转换方法通过减少微调过程中对神经网络的改动,有效缓解特征扭曲。此外,我们将线性探测与微调相结合,在逐步适应特征提取器的同时优化头层。通过利用批归一化层并融合线性探测与微调,DAFT 显著缓解了特征扭曲,并在分布内与分布外数据集上均取得了更优的模型性能。大量实验表明,我们的方法优于其他基线方法,不仅提升了性能,还有效缓解了特征扭曲。

Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening

  • paper_url: http://arxiv.org/abs/2308.07707
  • repo_url: https://github.com/if-loops/selective-synaptic-dampening
  • paper_authors: Jack Foster, Stefan Schoepf, Alexandra Brintrup
  • for: 本研究旨在解决让机器学习模型忘记特定信息的挑战,以遵守数据隐私法规,并移除有害、被操纵或过时的信息。
  • methods: 本研究提出了一种名为选择性突触衰减(Selective Synaptic Dampening,SSD)的两阶段、事后(post hoc)、无需重训练的方法,该方法快速、高效,且不需要长期存储训练数据。
  • results: 与现有的遗忘方法相比,SSD 的性能与基于重训练的方法相当,这表明了无需重训练的事后遗忘方法的可行性。
    Abstract Machine unlearning, the ability for a machine learning model to forget, is becoming increasingly important to comply with data privacy regulations, as well as to remove harmful, manipulated, or outdated information. The key challenge lies in forgetting specific information while protecting model performance on the remaining data. While current state-of-the-art methods perform well, they typically require some level of retraining over the retained data, in order to protect or restore model performance. This adds computational overhead and mandates that the training data remain available and accessible, which may not be feasible. In contrast, other methods employ a retrain-free paradigm, however, these approaches are prohibitively computationally expensive and do not perform on par with their retrain-based counterparts. We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data. First, SSD uses the Fisher information matrix of the training and forgetting data to select parameters that are disproportionately important to the forget set. Second, SSD induces forgetting by dampening these parameters proportional to their relative importance to the forget set with respect to the wider training data. We evaluate our method against several existing unlearning methods in a range of experiments using ResNet18 and Vision Transformer. Results show that the performance of SSD is competitive with retrain-based post hoc methods, demonstrating the viability of retrain-free post hoc unlearning approaches.
    摘要 机器遗忘,即让机器学习模型忘记特定信息的能力,对于遵守数据隐私法规以及移除有害、被操纵或过时的信息正变得日益重要。其关键挑战在于忘记特定信息的同时,保持模型在其余数据上的性能。虽然当前最先进的方法表现良好,但它们通常需要在保留数据上进行一定程度的重训练,以保护或恢复模型性能。这会增加计算开销,并要求训练数据始终可用、可访问,而这未必可行。相比之下,另一些方法采用免重训练的范式,但这些方法计算代价过高,且性能不及基于重训练的方法。我们提出选择性突触衰减(SSD),一种新颖的两阶段、事后、免重训练的机器遗忘方法,它快速、高效,且无需长期存储训练数据。首先,SSD 利用训练数据与遗忘数据的 Fisher 信息矩阵,选出对遗忘集不成比例地重要的参数;其次,SSD 按照这些参数对遗忘集相对于整体训练数据的相对重要性,对其进行成比例的衰减,从而实现遗忘。我们在 ResNet18 和 Vision Transformer 上通过一系列实验,将该方法与多种现有遗忘方法进行了比较。结果表明,SSD 的性能可与基于重训练的事后方法相媲美,证明了免重训练的事后遗忘方法的可行性。
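The two SSD steps map naturally onto a diagonal-Fisher computation followed by a masked, proportional rescaling of parameters; the threshold `alpha` and dampening constant `lam` below are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

def diag_fisher(model, data, loss_fn):
    """Diagonal Fisher information, approximated by accumulated squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return fisher

@torch.no_grad()
def selective_dampening(model, f_forget, f_retain, alpha=10.0, lam=0.5):
    """Step 2: dampen parameters disproportionately important to the forget set,
    proportionally to their relative importance (threshold rule is an assumption)."""
    for n, p in model.named_parameters():
        mask = f_forget[n] > alpha * f_retain[n]                 # forget-specific params
        scale = (lam * f_retain[n] / (f_forget[n] + 1e-12)).clamp(max=1.0)
        p[mask] *= scale[mask]

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
forget = [(torch.randn(16, 8), torch.randint(0, 2, (16,)))]      # stand-in batches
retain = [(torch.randn(64, 8), torch.randint(0, 2, (64,)))]
selective_dampening(model, diag_fisher(model, forget, loss_fn),
                    diag_fisher(model, retain, loss_fn))
```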

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.07706
  • repo_url: None
  • paper_authors: Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Rabin Adhikari, Safal Thapaliya, Bishesh Khanal
  • for: 本研究旨在通过文本引导增强视觉特征,以提升医疗领域图像分割任务的效果。
  • methods: 本研究使用多模态视觉语言模型,从图像描述和图像中捕捉语义信息,以实现对多种医学图像的分割。
  • results: 研究发现,现有的视觉语言模型在多个数据集上向医疗领域的迁移能力有限,需要人工调整或微调才能适应;同时,为此前未见的图像生成不同的描述语句会使模型性能产生明显波动。
    Abstract Medical Image Segmentation is crucial in various clinical applications within the medical domain. While state-of-the-art segmentation models have proven effective, integrating textual guidance to enhance visual features for this task remains an area with limited progress. Existing segmentation models that utilize textual guidance are primarily trained on open-domain images, raising concerns about their direct applicability in the medical domain without manual intervention or fine-tuning. To address these challenges, we propose using multimodal vision-language models for capturing semantic information from image descriptions and images, enabling the segmentation of diverse medical images. This study comprehensively evaluates existing vision language models across multiple datasets to assess their transferability from the open domain to the medical field. Furthermore, we introduce variations of image descriptions for previously unseen images in the dataset, revealing notable variations in model performance based on the generated prompts. Our findings highlight the distribution shift between the open-domain images and the medical domain and show that the segmentation models trained on open-domain images are not directly transferrable to the medical field. But their performance can be increased by finetuning them in the medical datasets. We report the zero-shot and finetuned segmentation performance of 4 Vision Language Models (VLMs) on 11 medical datasets using 9 types of prompts derived from 14 attributes.
    摘要 医学图像分割是医疗领域的重要应用之一。虽然现有的分割模型已被证明有效,但将文本引导整合进视觉特征以提升分割性能的研究仍然有限。现有利用文本引导的分割模型主要在开放领域图像上训练,这使人担忧它们能否在不经人工干预或微调的情况下直接应用于医疗领域。为了解决这些挑战,我们提议使用多模态视觉语言模型来捕捉图像描述和图像中的语义信息,以便分割多种医学图像。本研究在多个数据集上对现有视觉语言模型进行了全面评估,以考察其从开放领域到医疗领域的可迁移性。此外,我们还为数据集中此前未见的图像引入了不同形式的图像描述,发现模型性能会随生成的提示语产生明显变化。我们的发现揭示了开放领域图像与医疗领域之间的分布差异,并表明在开放领域图像上训练的分割模型无法直接迁移到医疗领域,但通过在医疗数据集上微调可以提升其性能。我们基于 14 个属性派生出 9 类提示语,报告了 4 种视觉语言模型(VLM)在 11 个医疗数据集上的零样本与微调分割性能。

Parametric entropy based Cluster Centroid Initialization for k-means clustering of various Image datasets

  • paper_url: http://arxiv.org/abs/2308.07705
  • repo_url: None
  • paper_authors: Faheem Hussayn, Shahid M Shah
  • for: 这篇论文的目的是提出一种基于参数熵的 k-means 初始化方法,以提高 k-means 算法在图像数据上的表现。
  • methods: 该论文使用多种参数熵来初始化 k-means 算法的聚类中心,并在不同的图像数据集上进行了测试。
  • results: 研究发现,不同的数据集适合不同的参数熵,所提方法能够提升 k-means 算法在图像数据上的表现。例如,在 Satellite、Toys、Fruits、Cars 等数据集上,采用 Taneja 熵、Kapur 熵、Aczel-Daroczy 熵、Sharma-Mittal 熵等参数熵可以获得更好的结果。
    Abstract One of the most employed yet simple algorithms for cluster analysis is the k-means algorithm. k-means has successfully witnessed its use in artificial intelligence, market segmentation, fraud detection, data mining, psychology, etc., to name a few. The k-means algorithm, however, does not always yield the best quality results. Its performance heavily depends upon the number of clusters supplied and the proper initialization of the cluster centroids or seeds. In this paper, we conduct an analysis of the performance of k-means on image data by employing parametric entropies in an entropy based centroid initialization method and propose the best fitting entropy measures for general image datasets. We use several entropies like Taneja entropy, Kapur entropy, Aczel-Daroczy entropy, and Sharma-Mittal entropy. We observe that for different datasets, different entropies provide better results than the conventional methods. We have applied our proposed algorithm on these datasets: Satellite, Toys, Fruits, Cars, Brain MRI, Covid X-Ray.
    摘要 k-means 算法是聚类分析中最常用且最简单的算法之一,已成功应用于人工智能、市场细分、欺诈检测、数据挖掘、心理学等众多领域,仅举数例。然而,k-means 算法并不总能给出最佳质量的结果,其性能在很大程度上取决于给定的聚类数量以及聚类中心(种子)的合理初始化。在本文中,我们在一种基于熵的中心初始化方法中引入参数熵,分析了 k-means 在图像数据上的性能,并提出了适用于一般图像数据集的最佳熵度量。我们使用了 Taneja 熵、Kapur 熵、Aczel-Daroczy 熵和 Sharma-Mittal 熵等多种熵。我们观察到,对于不同的数据集,不同的熵能够给出优于传统方法的结果。我们将所提算法应用于以下数据集:Satellite、Toys、Fruits、Cars、脑部 MRI、Covid X 光。
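As a speculative sketch of entropy-based seeding, one can rank candidate patches by a parametric entropy of their intensity histogram (Sharma-Mittal below, one of the families the paper considers) and keep well-separated, high-entropy candidates as initial centroids; the paper's exact seeding rule may differ.

```python
import numpy as np

def sharma_mittal(p, alpha=0.7, beta=0.5):
    """Sharma-Mittal parametric entropy of a discrete distribution p."""
    p = p[p > 0]
    return ((p ** alpha).sum() ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

def entropy_seeds(patches, k, min_dist=0.5, bins=8):
    """Rank candidate patches by the parametric entropy of their intensity
    histogram and greedily keep k well-separated, high-entropy ones as seeds
    (a speculative reading of the paper's initialization)."""
    def patch_entropy(x):
        h, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
        return sharma_mittal(h / h.sum())
    order = np.argsort([-patch_entropy(x) for x in patches])
    seeds = []
    for i in order:
        if all(np.linalg.norm(patches[i] - s) > min_dist for s in seeds):
            seeds.append(patches[i])
        if len(seeds) == k:
            break
    return np.array(seeds)

patches = np.random.rand(500, 49)      # e.g. flattened 7x7 grayscale patches
init = entropy_seeds(patches, k=5)     # pass as the initial centroids to k-means
```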

Enhancing Network Initialization for Medical AI Models Using Large-Scale, Unlabeled Natural Images

  • paper_url: http://arxiv.org/abs/2308.07688
  • repo_url: None
  • paper_authors: Soroosh Tayebi Arasteh, Leo Misera, Jakob Nikolas Kather, Daniel Truhn, Sven Nebelung
  • for: 这个研究的目的是探索可以使用非医学影像进行自主学习预训(SSL),以提高医学影像分析中的人工智能(AI)精度。
  • methods: 我们使用了一个视觉转化器,并将其初始化为(i)SSL预训自然影像(DINOv2)、(ii)SL预训自然影像(ImageNet dataset)和(iii)SL预训颈部X线成像(MIMIC-CXR dataset)。
  • results: 我们在6个大型全球颈部X线成像数据集上进行了过80万张颈部X线成像的测试,并识别了20多种不同的医学影像找到结果。我们的SSL预训策略不仅在所有数据集上比ImageNet预训(P<0.001)表现更好,甚至在某些情况下还超过了SL在MIMIC-CXR数据集上的表现。
    Abstract Pre-training datasets, like ImageNet, have become the gold standard in medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored if SSL for pre-training on non-medical images can be applied to chest radiographs and how it compares to supervised pre-training on non-medical images and on medical images. We utilized a vision transformer and initialized its weights based on (i) SSL pre-training on natural images (DINOv2), (ii) SL pre-training on natural images (ImageNet dataset), and (iii) SL pre-training on chest radiographs from the MIMIC-CXR database. We tested our approach on over 800,000 chest radiographs from six large global datasets, diagnosing more than 20 different imaging findings. Our SSL pre-training on curated images not only outperformed ImageNet-based pre-training (P<0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pre-training strategy, especially with SSL, can be pivotal for improving artificial intelligence (AI)'s diagnostic accuracy in medical imaging. By demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging.
    摘要 诸如 ImageNet 的预训练数据集已成为医学影像分析中的金标准。然而,自监督学习(SSL)的兴起,使得利用无标注数据学习鲁棒特征成为可能,为绕过费力的标注过程提供了机会。在本研究中,我们探索了在非医学图像上进行 SSL 预训练能否应用于胸部 X 光片,并将其与在非医学图像和医学图像上的监督式预训练进行比较。我们使用视觉变换器,并分别基于(i)在自然图像上的 SSL 预训练(DINOv2)、(ii)在自然图像上的 SL 预训练(ImageNet 数据集)、(iii)在 MIMIC-CXR 数据库的胸部 X 光片上的 SL 预训练来初始化其权重。我们在来自六个大型全球数据集的超过 80 万张胸部 X 光片上进行了测试,诊断超过 20 种不同的影像发现。我们在精选图像上的 SSL 预训练不仅优于基于 ImageNet 的预训练(所有数据集均 P<0.001),在某些情况下甚至超越了在 MIMIC-CXR 数据集上的 SL 预训练。我们的发现表明,选择合适的预训练策略,尤其是 SSL,对提升人工智能(AI)在医学影像中的诊断准确度可能至关重要。通过展示 SSL 在胸部 X 光片分析中的前景,我们强调了医学影像领域向更高效、更准确的 AI 模型的变革性转变。

DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.07687
  • repo_url: https://github.com/cure-lab/diffguard
  • paper_authors: Ruiyuan Gao, Chenchen Zhao, Lanqing Hong, Qiang Xu
  • for: 本研究的目的是提出一种基于扩散模型的语义分布外(OOD)检测方法,以提高图像分类器的 OOD 检测性能。
  • methods: 先前的工作使用条件生成对抗网络(cGAN)在图像空间中放大语义不匹配;本文改用预训练的扩散模型实现语义不匹配引导的 OOD 检测,并辅以多种测试时技术进一步强化差异。
  • results: 实验结果表明,DiffGuard 在 Cifar-10 与 ImageNet 的困难样本上均达到了最先进的 OOD 检测性能,并可与现有 OOD 检测技术结合以获得更好的检测效果。
    Abstract Given a classifier, the inherent property of semantic Out-of-Distribution (OOD) samples is that their contents differ from all legal classes in terms of semantics, namely semantic mismatch. There is a recent work that directly applies it to OOD detection, which employs a conditional Generative Adversarial Network (cGAN) to enlarge semantic mismatch in the image space. While achieving remarkable OOD detection performance on small datasets, it is not applicable to ImageNet-scale datasets due to the difficulty in training cGANs with both input images and labels as conditions. As diffusion models are much easier to train and amenable to various conditions compared to cGANs, in this work, we propose to directly use pre-trained diffusion models for semantic mismatch-guided OOD detection, named DiffGuard. Specifically, given an OOD input image and the predicted label from the classifier, we try to enlarge the semantic difference between the reconstructed OOD image under these conditions and the original input image. We also present several test-time techniques to further strengthen such differences. Experimental results show that DiffGuard is effective on both Cifar-10 and hard cases of the large-scale ImageNet, and it can be easily combined with existing OOD detection techniques to achieve state-of-the-art OOD detection results.
    摘要 给定一个分类器,语义分布外(OOD)样本的内在特性是其内容在语义层面不同于所有合法类别,即语义不匹配。近期有工作将这一特性直接用于 OOD 检测,利用条件生成对抗网络(cGAN)在图像空间中放大语义不匹配。该方法虽然在小型数据集上取得了出色的 OOD 检测性能,但由于以输入图像和标签同时作为条件训练 cGAN 十分困难,无法扩展到 ImageNet 规模的数据集。相比 cGAN,扩散模型更易训练且适用于各种条件。因此,在本工作中我们提出直接利用预训练扩散模型进行语义不匹配引导的 OOD 检测,称为 DiffGuard。具体而言,给定一个 OOD 输入图像以及分类器预测的标签,我们尝试放大在这些条件下重建的 OOD 图像与原始输入图像之间的语义差异。我们还提出了若干测试时技术以进一步强化这种差异。实验结果表明,DiffGuard 在 Cifar-10 和 ImageNet 的困难样本上均有效,并且可以方便地与现有 OOD 检测技术相结合,达到最先进的 OOD 检测结果。

Portfolio Selection via Topological Data Analysis

  • paper_url: http://arxiv.org/abs/2308.07944
  • repo_url: None
  • paper_authors: Petr Sokerin, Kristian Kuznetsov, Elizaveta Makhneva, Alexey Zaytsev
  • for: 投资组合管理是投资决策中的一项重要任务,但传统方法往往难以取得理想的表现。
  • methods: 本文提出了一种两阶段的股票投资组合构建方法:先生成时间序列表示,再对其进行聚类。该方法利用拓扑数据分析(TDA)特征来生成表示,从而揭示时间序列数据中的拓扑结构。
  • results: 实验结果显示,所提方法在不同时间范围内均优于其他方法,且表现稳定可靠,表明 TDA 可以作为投资组合选择的强大工具。
    Abstract Portfolio management is an essential part of investment decision-making. However, traditional methods often fail to deliver reasonable performance. This problem stems from the inability of these methods to account for the unique characteristics of multivariate time series data from stock markets. We present a two-stage method for constructing an investment portfolio of common stocks. The method involves the generation of time series representations followed by their subsequent clustering. Our approach utilizes features based on Topological Data Analysis (TDA) for the generation of representations, allowing us to elucidate the topological structure within the data. Experimental results show that our proposed system outperforms other methods. This superior performance is consistent over different time frames, suggesting the viability of TDA as a powerful tool for portfolio selection.
    摘要 投资组合管理是投资决策的重要组成部分,但传统方法通常无法提供合理的性能。这一问题源于这些方法无法刻画股票市场多元时间序列数据的独特特征。我们提出了一种两阶段方法,用于构建普通股投资组合:先生成时间序列表示,再对其进行聚类。我们的方法利用基于拓扑数据分析(TDA)的特征来生成表示,从而揭示数据中的拓扑结构。实验结果显示,所提系统在不同时间范围内均优于其他方法,这种稳定的优势表明 TDA 可以成为投资组合选择中的强大工具。

Gradient-Based Post-Training Quantization: Challenging the Status Quo

  • paper_url: http://arxiv.org/abs/2308.07662
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: 這篇論文的目的是提出新的量化最佳實務,以提升量化深度神經網路的效率與可擴展性。
  • methods: 這篇論文研究基於梯度的訓練後量化(GPTQ)方法,並挑戰 GPTQ 方法中的常見設計選擇。具體而言,論文在問題形式(損失函數、自由度、非均勻量化方案)與優化過程(變數與優化器的選擇)方面提出了一系列最佳實務,並提出一種基於重要性的混合精度技術。
  • results: 實驗結果顯示,這些最佳實務能帶來顯著的性能改進(例如,在 ViT 模型的 4 位元量化上提升 6.819 點),顯示所提量化方法的可行性與有效性。
    Abstract Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating point operations are converted to simpler fixed point operations. In its most naive form, it simply consists in a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, Gradient-based post-training quantization (GPTQ) methods appear to constitute a suitable trade-off between such simple methods and more powerful, yet expensive Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists in learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. Those guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
    摘要 量化已成为高效部署深度神经网络的关键步骤,它将浮点运算转换为更简单的定点运算。在最朴素的形式下,量化仅由缩放与舍入两种变换组合而成,结果要么压缩率有限,要么精度显著下降。近来,基于梯度的训练后量化(GPTQ)方法在这类简单方法与更强大但代价高昂的量化感知训练(QAT)方法之间取得了合适的折中,在量化 LLM 这类对量化过程可扩展性要求极高的场景中尤为重要。GPTQ 的核心是利用一个小型校准集来学习舍入操作。在本工作中,我们挑战了 GPTQ 方法中的常见选择。特别地,我们发现该过程对若干变量(权重选择、特征增强、校准集选择)具有一定的鲁棒性。更重要的是,我们在问题形式(损失函数、自由度、非均匀量化方案的使用)与优化过程(变量与优化器的选择)方面总结出一系列设计更高效、更可扩展的 GPTQ 方法的最佳实践。最后,我们提出了一种新颖的基于重要性的混合精度技术。这些指导原则在所有测试的最先进 GPTQ 方法与网络上均带来了显著的性能提升(例如,ViT 的 4 比特量化提升 6.819 点),为设计可扩展且有效的量化方法铺平了道路。
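For orientation, the "naive" scaling-plus-rounding baseline the abstract describes, together with one GPTQ-flavoured refinement (choosing the scale that minimizes layer-output error on a small calibration batch rather than weight error), can be sketched as follows; this is not the paper's method, only the setting it studies.

```python
import torch

def quantize_weight(w, n_bits=4):
    """Naive symmetric post-training quantization: scaling plus rounding."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q * scale, scale

def calibrated_scale(w, x, n_bits=4, grid=80):
    """A calibration-set refinement: grid-search the scale that minimizes the
    layer OUTPUT error on a small batch, not the weight reconstruction error."""
    qmax = 2 ** (n_bits - 1) - 1
    best, best_err = None, float("inf")
    for r in torch.linspace(0.5, 1.0, grid):
        scale = r * w.abs().max() / qmax
        wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        err = ((x @ wq.T - x @ w.T) ** 2).mean()
        if err < best_err:
            best, best_err = scale, err
    return best

w = torch.randn(64, 128)          # a linear layer's weight (out x in)
x = torch.randn(32, 128)          # calibration activations
wq, _ = quantize_weight(w)
s = calibrated_scale(w, x)        # typically yields a lower output error
```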

Attention Is Not All You Need Anymore

  • paper_url: http://arxiv.org/abs/2308.07661
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Zhe Chen
  • for: 提高 transformer 性能
  • methods: 提出一种可直接替换自注意力机制的模块,称为 Extractor
  • results: 实验结果表明,用 Extractor 替换自注意力机制可以提升 transformer 的性能;此外,Extractor 的计算关键路径更短,有望比自注意力机制运行得更快。
    Abstract In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key for the continuing success of the Transformer. In this paper, a drop-in replacement for the self-attention mechanism in the Transformer, called the Extractor, is proposed. Experimental results show that replacing the self-attention mechanism with the Extractor improves the performance of the Transformer. Furthermore, the proposed Extractor has the potential to run faster than the self-attention since it has a much shorter critical path of computation. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.
    摘要 近年来,流行的 Transformer 架构在自然语言处理和计算机视觉等众多应用领域取得了巨大成功。许多现有工作试图以牺牲性能为代价,降低 Transformer 中自注意力机制的计算与内存复杂度。然而,性能是 Transformer 持续成功的关键。本文提出了一种可直接替换 Transformer 中自注意力机制的模块,称为 Extractor。实验结果表明,用 Extractor 替换自注意力机制能够提升 Transformer 的性能。此外,由于 Extractor 的计算关键路径远短于自注意力机制,它有望运行得更快。另外,本文还利用变长离散时间马尔可夫链对文本生成中的序列预测问题进行了形式化,并基于我们的理解对 Transformer 进行了梳理。

From Commit Message Generation to History-Aware Commit Message Completion

  • paper_url: http://arxiv.org/abs/2308.07655
  • repo_url: https://github.com/jetbrains-research/commit_message_generation
  • paper_authors: Aleksandra Eliseeva, Yaroslav Sokolov, Egor Bogomolov, Yaroslav Golubev, Danny Dig, Timofey Bryksin
  • for: 提高提交消息的质量和个性化程度,使开发者更容易跟踪变更和协作。
  • methods: 将研究重点从提交消息生成转向提交消息补全,并利用此前的提交历史作为额外上下文,以生成高质量的提交消息。
  • results: 结果显示,在某些情况下,补全方式比生成方式质量更高;而历史信息既能提升 CMG 模型在生成任务中的表现,也能提升 GPT-3.5-turbo 在生成与补全两种任务中的表现。
    Abstract Commit messages are crucial to software development, allowing developers to track changes and collaborate effectively. Despite their utility, most commit messages lack important information since writing high-quality commit messages is tedious and time-consuming. The active research on commit message generation (CMG) has not yet led to wide adoption in practice. We argue that if we could shift the focus from commit message generation to commit message completion and use previous commit history as additional context, we could significantly improve the quality and the personal nature of the resulting commit messages. In this paper, we propose and evaluate both of these novel ideas. Since the existing datasets lack historical data, we collect and share a novel dataset called CommitChronicle, containing 10.7M commits across 20 programming languages. We use this dataset to evaluate the completion setting and the usefulness of the historical context for state-of-the-art CMG models and GPT-3.5-turbo. Our results show that in some contexts, commit message completion shows better results than generation, and that while in general GPT-3.5-turbo performs worse, it shows potential for long and detailed messages. As for the history, the results show that historical information improves the performance of CMG models in the generation task, and the performance of GPT-3.5-turbo in both generation and completion.
    摘要 提交消息对软件开发至关重要,它使开发者能够跟踪变更并高效协作。尽管提交消息十分有用,但由于撰写高质量提交消息既枯燥又耗时,大多数提交消息都缺乏重要信息。关于提交消息生成(CMG)的活跃研究至今尚未在实践中得到广泛采用。我们认为,如果将关注点从提交消息生成转向提交消息补全,并利用此前的提交历史作为额外上下文,就能显著提升所得提交消息的质量与个性化程度。在本文中,我们提出并评估了这两个新想法。由于现有数据集缺乏历史数据,我们收集并公开了一个名为 CommitChronicle 的新数据集,涵盖 20 种编程语言、共 1070 万次提交。我们利用该数据集评估了补全设定,以及历史上下文对最先进 CMG 模型和 GPT-3.5-turbo 的作用。结果表明,在某些场景下,提交消息补全优于生成;GPT-3.5-turbo 总体表现较差,但在长而详细的消息上展现出潜力。至于历史信息,结果显示它能提升 CMG 模型在生成任务中的表现,并能提升 GPT-3.5-turbo 在生成与补全两种任务中的表现。

Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping

  • paper_url: http://arxiv.org/abs/2308.07641
  • repo_url: None
  • paper_authors: Boyu Chen, Hanxuan Chen, Jiao He, Fengyu Sun, Shangling Jui
  • for: 这篇论文的目的是提出一种简单而新颖的线性映射参数化方法,以实现出色的网络压缩性能。
  • methods: 这个论文使用的方法是一种叫做 ternary SVD(TSVD)的伪 SVD,它将 SVD 中的 $U$ 和 $V$ 矩阵限制为取值于 $\{\pm 1, 0\}$ 的三元矩阵。这意味着在计算 $U(\cdot)$ 和 $V(\cdot)$ 时只需要加法运算。
  • results: 实验结果表明,TSVD 可以在不同类型的网络和任务中,对 ConvNext、Swim、BERT 以及大型语言模型 OPT 等当前基线模型实现最先进的压缩性能。
    Abstract We present a simple yet novel parameterized form of linear mapping that achieves remarkable network compression performance: a pseudo SVD called Ternary SVD (TSVD). Unlike vanilla SVD, TSVD limits the $U$ and $V$ matrices in SVD to ternary matrices taking values in $\{\pm 1, 0\}$. This means that instead of using the expensive multiplication instructions, TSVD only requires addition instructions when computing $U(\cdot)$ and $V(\cdot)$. We provide direct and training transition algorithms for TSVD, like Post Training Quantization and Quantization Aware Training respectively. Additionally, we analyze the convergence of the direct transition algorithms in theory. In experiments, we demonstrate that TSVD can achieve state-of-the-art network compression performance in various types of networks and tasks, including current baseline models such as ConvNext, Swim, BERT, and large language models like OPT.
    摘要 我们提出了一种简单而新颖的线性映射参数化形式,可以实现出色的网络压缩性能:一种称为 ternary SVD(TSVD)的伪 SVD。与普通 SVD 不同,TSVD 将 SVD 中的 $U$ 和 $V$ 矩阵限制为取值于 $\{\pm 1, 0\}$ 的三元矩阵。这意味着在计算 $U(\cdot)$ 和 $V(\cdot)$ 时,TSVD 只需使用加法指令,而无需昂贵的乘法指令。我们为 TSVD 提供了类似于训练后量化(Post Training Quantization)和量化感知训练(Quantization Aware Training)的直接迁移算法与训练迁移算法,并对直接迁移算法的收敛性进行了理论分析。在实验中,我们证明 TSVD 能在不同类型的网络和任务上实现最先进的压缩性能,包括 ConvNext、Swim、BERT 等当前基线模型以及大型语言模型 OPT。
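A toy version of the ternary factorization can be obtained by ternarizing the top singular vectors and refitting the diagonal scales; the thresholding rule and least-squares refit below are our assumptions (the refit is exact only if the ternary columns were orthogonal).

```python
import numpy as np

def ternarize(M, thresh=0.6745):
    """Map a matrix to {-1, 0, +1} by thresholding at a fraction of its
    per-column mean absolute value (the threshold rule is an assumption)."""
    t = thresh * np.abs(M).mean(axis=0, keepdims=True)
    return np.sign(M) * (np.abs(M) > t)

def tsvd(W, rank):
    """Pseudo-SVD with ternary factors: W ~ T_u diag(s) T_v^T, so applying
    the factors needs only additions; scales s are refit per component."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    Tu, Tv = ternarize(U[:, :rank]), ternarize(Vt[:rank].T)
    s = np.array([(Tu[:, i] @ W @ Tv[:, i]) /
                  ((Tu[:, i] @ Tu[:, i]) * (Tv[:, i] @ Tv[:, i]) + 1e-12)
                  for i in range(rank)])
    return Tu, s, Tv

W = np.random.randn(128, 64)
Tu, s, Tv = tsvd(W, rank=32)
W_hat = Tu @ np.diag(s) @ Tv.T                        # compressed linear map
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)   # relative reconstruction error
```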

Backpropagation Path Search On Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2308.07625
  • repo_url: None
  • paper_authors: Zhuoer Xu, Zhangxuan Gu, Jianping Zhang, Shiwen Cui, Changhua Meng, Weiqiang Wang
  • for: 防御深度神经网络受到敌意例之攻击,需要在部署前测试模型的可靠性。
  • methods: 基于传输的攻击者使用拷贝模型构建敌意例,然后将其传输到黑盒环境中部署的受害者模型。为了增强攻击性能,结构基于的攻击者修改了反propagation路径,但现有的结构基于的攻击者忽略了 convolution 模块,并使用伪函数来修改反propagation图。
  • results: 我们提出了 backPropagation pAth Search (PAS),解决了上述两个问题。我们首先提出了 SkipConv,用于调整 convolution 模块的反propagation路径。以免攻击路径过拟合 surrogate 模型,我们还构建了 DAG 基于搜索空间,使用一步靠近法评估路径,并使用 bayesian 优化来搜索最佳路径。我们在各种传输设置下进行了广泛的实验,显示 PAS 可以大幅提高攻击成功率,包括常训练的模型和防御模型。
    Abstract Deep neural networks are vulnerable to adversarial examples, dictating the imperativeness to test the model's robustness before deployment. Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models deployed in the black-box situation. To enhance the adversarial transferability, structure-based attackers adjust the backpropagation path to avoid the attack from overfitting the surrogate model. However, existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph heuristically, leading to limited effectiveness. In this paper, we propose backPropagation pAth Search (PAS), solving the aforementioned two problems. We first propose SkipConv to adjust the backpropagation path of convolution by structural reparameterization. To overcome the drawback of heuristically designed backpropagation paths, we further construct a DAG-based search space, utilize one-step approximation for path evaluation and employ Bayesian Optimization to search for the optimal path. We conduct comprehensive experiments in a wide range of transfer settings, showing that PAS improves the attack success rate by a huge margin for both normally trained and defense models.
    摘要 深度神经网络容易受到对抗样本的攻击,因此在部署之前检验模型的鲁棒性十分必要。基于迁移的攻击者先在代理模型上构造对抗样本,再将其迁移到部署于黑盒环境中的受害者模型。为增强对抗样本的迁移性,基于结构的攻击者会调整反向传播路径,以避免攻击过拟合代理模型。然而,现有的基于结构的攻击者未能利用 CNN 中的卷积模块,且仅以启发式方式修改反向传播图,导致效果有限。在本文中,我们提出了 backPropagation pAth Search(PAS),以解决上述两个问题。我们首先提出 SkipConv,通过结构重参数化来调整卷积的反向传播路径。为克服启发式设计反向传播路径的缺陷,我们进一步构建了基于 DAG 的搜索空间,采用一步近似方法评估路径,并利用贝叶斯优化搜索最优路径。我们在多种迁移设置下进行了广泛的实验,结果显示 PAS 能够大幅提升对常规训练模型和防御模型的攻击成功率。

A Multilayer Perceptron-based Fast Sunlight Assessment for the Conceptual Design of Residential Neighborhoods under Chinese Policy

  • paper_url: http://arxiv.org/abs/2308.07616
  • repo_url: None
  • paper_authors: Can Jiang, Xiong Liang, Yu-Cheng Zhou, Yong Tian, Shengli Xu, Jia-Rui Lin, Zhiliang Ma, Shiji Yang, Hao Zhou
  • for: 本研究旨在应用深度学习技术来加速建筑设计阶段的日照时数 simulations,以减少计算时间和提高设计效率。
  • methods: 本研究提出了一个多层感知器(Multilayer Perceptron,MLP)基本的一阶预测方法,可以快速地预测建筑物的日照时数。方法首先将建筑物分解为多个立方体形状的部分,然后运用一个一阶预测模型来预测每个部分的日照时数。
  • results: 经过三个 numeral experiments,包括水平层和倾斜分析、模拟运算和优化,结果显示,本方法可以将计算时间降低到1/841/50,并保持96.5%98%的准确性。此外,基于提案的模型,也开发了一个实用的住宅区布局规划插件 для Rhino 7/Grasshopper。
    Abstract In Chinese building codes, it is required that residential buildings receive a minimum number of hours of natural, direct sunlight on a specified winter day, which represents the worst sunlight condition in a year. This requirement is a prerequisite for obtaining a building permit during the conceptual design of a residential project. Thus, officially sanctioned software is usually used to assess the sunlight performance of buildings. These software programs predict sunlight hours based on repeated shading calculations, which is time-consuming. This paper proposed a multilayer perceptron-based method, a one-stage prediction approach, which outputs a shading time interval caused by the inputted cuboid-form building. The sunlight hours of a site can be obtained by calculating the union of the sunlight time intervals (complement of shading time interval) of all the buildings. Three numerical experiments, i.e., horizontal level and slope analysis, and simulation-based optimization are carried out; the results show that the method reduces the computation time to 1/84~1/50 with 96.5%~98% accuracies. A residential neighborhood layout planning plug-in for Rhino 7/Grasshopper is also developed based on the proposed model. This paper indicates that deep learning techniques can be adopted to accelerate sunlight hour simulations at the conceptual design phase.
    摘要 中国建筑规范要求住宅建筑在指定的冬季日(全年日照条件最差的一天)获得不少于规定时数的自然直射日照,这是住宅项目概念设计阶段取得建筑许可的前提条件。因此,通常需要使用官方认可的软件来评估建筑的日照性能。这些软件基于重复的遮挡计算来预测日照时数,十分耗时。本文提出了一种基于多层感知器的单阶段预测方法,输入立方体形建筑物,输出其造成的遮蔽时间区间;将所有建筑物日照时间区间(即遮蔽时间区间的补集)求并集,即可得到场地的日照时数。我们开展了水平面与坡面分析以及基于仿真的优化共三项数值实验,结果表明该方法可将计算时间缩短至 1/84~1/50,同时保持 96.5%~98% 的准确率。基于所提模型,我们还开发了一个面向 Rhino 7/Grasshopper 的住宅区布局规划插件。本文表明,深度学习技术可用于在概念设计阶段加速日照时数模拟。
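The second stage, turning per-building shading intervals into site sunlight hours via a union-and-complement over the daylight window, is simple to sketch; the intervals below are hard-coded stand-ins for what the MLP would predict.

```python
# Sunlight hours = daylight window minus the union of shading intervals.
def sunlight_hours(day_window, shading_intervals):
    lo, hi = day_window
    merged = []
    for s, e in sorted(shading_intervals):
        s, e = max(s, lo), min(e, hi)           # clip to the daylight window
        if s >= e:
            continue
        if merged and s <= merged[-1][1]:       # overlaps the previous -> merge
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    shaded = sum(e - s for s, e in merged)
    return (hi - lo) - shaded

# e.g., two buildings shade the site point 9:00-10:30 and 10:00-12:00
print(sunlight_hours((8.0, 16.0), [(9.0, 10.5), (10.0, 12.0)]))  # -> 5.0 hours
```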

Searching for Novel Chemistry in Exoplanetary Atmospheres using Machine Learning for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.07604
  • repo_url: None
  • paper_authors: Roy T. Forestano, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu
  • for: 本研究旨在开发新的快速高效的机器学习方法,用于检测望远镜观测数据中异常的行星,以找到具有不同化学成分的行星和可能的生物标志物。
  • methods: 本研究使用了两种流行的异常检测方法:局部异常因子(Local Outlier Factor)和单类支持向量机(One-Class SVM)。
  • results: 研究成功地应用了这两种方法于大量的人工数据库中,并通过ROC曲线评估和比较了两种方法的性能。
    Abstract The next generation of telescopes will yield a substantial increase in the availability of high-resolution spectroscopic data for thousands of exoplanets. The sheer volume of data and number of planets to be analyzed greatly motivate the development of new, fast and efficient methods for flagging interesting planets for reobservation and detailed analysis. We advocate the application of machine learning (ML) techniques for anomaly (novelty) detection to exoplanet transit spectra, with the goal of identifying planets with unusual chemical composition and even searching for unknown biosignatures. We successfully demonstrate the feasibility of two popular anomaly detection methods (Local Outlier Factor and One Class Support Vector Machine) on a large public database of synthetic spectra. We consider several test cases, each with different levels of instrumental noise. In each case, we use ROC curves to quantify and compare the performance of the two ML techniques.
    摘要 下一代望远镜将大幅增加数千颗系外行星的高分辨率光谱数据。庞大的数据量与待分析的行星数量,极大地推动了研发快速高效的新方法,以筛选出值得再次观测和详细分析的有趣行星。我们提议将机器学习(ML)的异常(新颖)检测技术应用于系外行星凌星光谱,以识别具有异常化学组成的行星,甚至搜索未知的生物标志。我们在一个大型公开合成光谱数据库上成功验证了两种流行异常检测方法(局部异常因子和单类支持向量机)的可行性。我们考虑了多个具有不同仪器噪声水平的测试案例,并在每个案例中使用 ROC 曲线来量化和比较两种 ML 技术的性能。
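The described evaluation is easy to reproduce in miniature with scikit-learn: fit LOF (in novelty mode) and a One-Class SVM on "normal" spectra, score a held-out mix, and compare ROC AUCs. The synthetic spectra below are random placeholders for the public database used in the paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 50))        # typical transit spectra
anomal = rng.normal(1.5, 1.0, size=(50, 50))         # "unusual chemistry" spectra
X_test = np.vstack([normal[400:], anomal])
y_test = np.r_[np.zeros(100), np.ones(50)]           # 1 = anomaly

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(normal[:400])
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(normal[:400])

# Higher score should mean "more anomalous", so negate the decision functions.
for name, model in [("LOF", lof), ("OC-SVM", ocsvm)]:
    scores = -model.decision_function(X_test)
    print(name, "AUC =", round(roc_auc_score(y_test, scores), 3))
```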

Generating Personas for Games with Multimodal Adversarial Imitation Learning

  • paper_url: http://arxiv.org/abs/2308.07598
  • repo_url: None
  • paper_authors: William Ahlberg, Alessandro Sestini, Konrad Tollmar, Linus Gisslén
  • for: 本研究旨在开发一种能够生成多种个性化策略的模仿学习方法,以模拟人类游戏玩家的多样化玩法。
  • methods: 本研究提出多模态生成对抗模仿学习(MultiGAIL)方法,借助辅助输入参数,用单一智能体模型学习多个专家策略,并以多个判别器作为奖励模型,通过比较智能体与各专家策略来推断环境奖励。
  • results: 实验结果表明,MultiGAIL 方法在连续与离散动作空间的两个环境中都能有效生成多种个性化策略。
    Abstract Reinforcement learning has been widely successful in producing agents capable of playing games at a human level. However, this requires complex reward engineering, and the agent's resulting policy is often unpredictable. Going beyond reinforcement learning is necessary to model a wide range of human playstyles, which can be difficult to represent with a reward function. This paper presents a novel imitation learning approach to generate multiple persona policies for playtesting. Multimodal Generative Adversarial Imitation Learning (MultiGAIL) uses an auxiliary input parameter to learn distinct personas using a single-agent model. MultiGAIL is based on generative adversarial imitation learning and uses multiple discriminators as reward models, inferring the environment reward by comparing the agent and distinct expert policies. The reward from each discriminator is weighted according to the auxiliary input. Our experimental analysis demonstrates the effectiveness of our technique in two environments with continuous and discrete action spaces.
    摘要 强化学习已被广泛成功地用于训练能以人类水平玩游戏的智能体。然而,这需要复杂的奖励工程,且智能体最终习得的策略往往难以预测。要刻画人类玩家的多样化玩法,仅靠奖励函数难以表达,因此需要超越强化学习。本文提出了一种新颖的模仿学习方法,用于生成多种用于游戏测试的个性化策略。多模态生成对抗模仿学习(MultiGAIL)利用一个辅助输入参数,以单一智能体模型学习多个不同的人格。MultiGAIL 基于生成对抗模仿学习,使用多个判别器作为奖励模型,通过比较智能体与各个专家策略来推断环境奖励;每个判别器给出的奖励根据辅助输入加权。我们的实验分析表明,该技术在连续与离散动作空间的两个环境中均有效。
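The multi-discriminator reward could be assembled as below, with per-persona discriminators scored GAIL-style and weighted by the auxiliary persona input; the exact reward form and weighting are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# One discriminator per expert persona; the auxiliary input weights the personas.
discriminators = nn.ModuleList([nn.Sequential(nn.Linear(12, 64), nn.ReLU(),
                                              nn.Linear(64, 1)) for _ in range(3)])

def multigail_reward(state_action, aux_weights):
    """Environment reward inferred by weighting per-persona discriminator
    scores by the auxiliary persona input (standard GAIL reward -log(1 - D))."""
    rewards = []
    for d in discriminators:
        p = torch.sigmoid(d(state_action))           # P(expert | s, a)
        rewards.append(-torch.log(1.0 - p + 1e-8))   # GAIL-style reward
    return (torch.stack(rewards, dim=-1).squeeze(-2) * aux_weights).sum(-1)

sa = torch.randn(4, 12)                              # batch of state-action pairs
aux = torch.tensor([0.0, 1.0, 0.0]).expand(4, 3)     # select persona #2
r = multigail_reward(sa, aux)                        # shape (4,), one reward per sample
```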

High-Probability Risk Bounds via Sequential Predictors

  • paper_url: http://arxiv.org/abs/2308.07588
  • repo_url: None
  • paper_authors: Dirk van der Hoeven, Nikita Zhivotovskiy, Nicolò Cesa-Bianchi
  • for: 本文旨在给出一种在线学习方法,在最小假设下同时提供序列遗憾界以及统计学习中接近最优的高概率风险界。
  • methods: 本文对一般在线学习算法施加在线到批量(online-to-batch)转换,并对定义遗憾的损失函数引入一般的二阶校正;分析还利用了许多在线学习算法的"不当"(improper)性质,即其预测器不必局限于给定的参考类。
  • results: 对离散分布估计、线性回归、逻辑回归和条件密度估计等多个经典统计估计问题,本文得到了接近最优的高概率风险界,并讨论了序列算法相对于现有批量算法的一些计算优势。
    Abstract Online learning methods yield sequential regret bounds under minimal assumptions and provide in-expectation risk bounds for statistical learning. However, despite the apparent advantage of online guarantees over their statistical counterparts, recent findings indicate that in many important cases, regret bounds may not guarantee tight high-probability risk bounds in the statistical setting. In this work we show that online to batch conversions applied to general online learning algorithms can bypass this limitation. Via a general second-order correction to the loss function defining the regret, we obtain nearly optimal high-probability risk bounds for several classical statistical estimation problems, such as discrete distribution estimation, linear regression, logistic regression, and conditional density estimation. Our analysis relies on the fact that many online learning algorithms are improper, as they are not restricted to use predictors from a given reference class. The improper nature of our estimators enables significant improvements in the dependencies on various problem parameters. Finally, we discuss some computational advantages of our sequential algorithms over their existing batch counterparts.
    摘要 在线学习方法能在最小假设下给出序列遗憾界,并为统计学习提供期望意义下的风险界。然而,尽管在线保证看似优于其统计对应物,近期研究表明,在许多重要情形下,遗憾界并不能在统计设定中保证紧致的高概率风险界。在本工作中,我们证明,对一般在线学习算法施加在线到批量转换可以绕过这一局限。通过对定义遗憾的损失函数引入一般的二阶校正,我们为若干经典统计估计问题(如离散分布估计、线性回归、逻辑回归和条件密度估计)得到了接近最优的高概率风险界。我们的分析依赖于这样一个事实:许多在线学习算法是"不当"的,即它们不必使用来自给定参考类的预测器。这种不当性使我们的估计器在对各种问题参数的依赖上获得显著改进。最后,我们讨论了序列算法相对于现有批量算法的一些计算优势。
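For reference, the classical online-to-batch conversion the paper builds on reads as follows (for convex losses, via Jensen's inequality); the paper's contribution, indicated only schematically in the closing comment, is a second-order correction to the loss so the conversion yields high-probability rather than in-expectation bounds.

```latex
% Classical online-to-batch conversion (the baseline being strengthened).
% An online learner emitting f_1, ..., f_T with cumulative regret R_T against
% the best comparator f* yields, for convex losses and i.i.d. data:
\[
  \bar{f} = \frac{1}{T}\sum_{t=1}^{T} f_t,
  \qquad
  \mathbb{E}\big[\operatorname{risk}(\bar{f})\big] - \operatorname{risk}(f^{\star})
  \;\le\; \frac{\mathbb{E}[R_T]}{T}.
\]
% The paper instead runs the conversion on a second-order-corrected loss
% (schematically, \ell_t(f) plus a term quadratic in \ell_t(f)), upgrading the
% guarantee from in-expectation to high-probability.
```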

Temporal Interest Network for Click-Through Rate Prediction

  • paper_url: http://arxiv.org/abs/2308.08487
  • repo_url: https://github.com/shenweichen/DSIN
  • paper_authors: Haolin Zhou, Junwei Pan, Xinyi Zhou, Xihua Chen, Jie Jiang, Xiaofeng Gao, Guihai Chen
  • for: 预测点击率 (CTR) 的预测,研究者发现了用户行为历史记录的四元相关性(行为语义、目标语义、行为时间和目标时间)对性能的影响。
  • methods: 研究者使用了各种用户行为方法,包括 Semantic Embedding 和 Temporal Encoding,以及 Target-Aware Attention 和 Target-Aware Representation。
  • results: 研究者发现,现有方法无法学习这种四元相关性,而他们提出的 Temporal Interest Network (TIN) 可以有效地捕捉这种相关性,并在 Amazon 和 Alibaba 数据集上进行了广泛的评估,并与最佳基eline相比,TIN 表现出了0.43% 和 0.29% 的提升。
    Abstract The history of user behaviors constitutes one of the most significant characteristics in predicting the click-through rate (CTR), owing to their strong semantic and temporal correlation with the target item. While the literature has individually examined each of these correlations, research has yet to analyze them in combination, that is, the quadruple correlation of (behavior semantics, target semantics, behavior temporal, and target temporal). The effect of this correlation on performance and the extent to which existing methods learn it remain unknown. To address this gap, we empirically measure the quadruple correlation and observe intuitive yet robust quadruple patterns. We measure the learned correlation of several representative user behavior methods, but to our surprise, none of them learn such a pattern, especially the temporal one. In this paper, we propose the Temporal Interest Network (TIN) to capture the quadruple semantic and temporal correlation between behaviors and the target. We achieve this by incorporating target-aware temporal encoding, in addition to semantic embedding, to represent behaviors and the target. Furthermore, we deploy target-aware attention, along with target-aware representation, to explicitly conduct the 4-way interaction. We performed comprehensive evaluations on the Amazon and Alibaba datasets. Our proposed TIN outperforms the best-performing baselines by 0.43\% and 0.29\% on two datasets, respectively. Comprehensive analysis and visualization show that TIN is indeed capable of learning the quadruple correlation effectively, while all existing methods fail to do so. We provide our implementation of TIN in Tensorflow.

IoT Data Trust Evaluation via Machine Learning

  • paper_url: http://arxiv.org/abs/2308.11638
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Timothy Tadj, Reza Arablouei, Volkan Dedeoglu
  • for: This paper addresses the lack of publicly available datasets for evaluating the trustworthiness of IoT data by proposing a data synthesis method, random walk infilling (RWI), that augments existing trustworthy datasets with untrustworthy data.
  • methods: RWI synthesizes untrustworthy data from existing trustworthy data, and new features are extracted from IoT time-series sensor data that capture its auto-correlation and its cross-correlation with neighboring (peer) sensors; these features are used to learn ML models for recognizing the trustworthiness of IoT sensor data.
  • results: The proposed method outperforms existing ML-based approaches to IoT data trust evaluation; a semi-supervised approach requiring only about 10% of the data to be labeled offers competitive performance while being more practical, and models learned from RWI-augmented datasets generalize well to unseen data.
    Abstract Various approaches based on supervised or unsupervised machine learning (ML) have been proposed for evaluating IoT data trust. However, assessing their real-world efficacy is hard mainly due to the lack of related publicly-available datasets that can be used for benchmarking. Since obtaining such datasets is challenging, we propose a data synthesis method, called random walk infilling (RWI), to augment IoT time-series datasets by synthesizing untrustworthy data from existing trustworthy data. Thus, RWI enables us to create labeled datasets that can be used to develop and validate ML models for IoT data trust evaluation. We also extract new features from IoT time-series sensor data that effectively capture its auto-correlation as well as its cross-correlation with the data of the neighboring (peer) sensors. These features can be used to learn ML models for recognizing the trustworthiness of IoT sensor data. Equipped with our synthesized ground-truth-labeled datasets and informative correlation-based feature, we conduct extensive experiments to critically examine various approaches to evaluating IoT data trust via ML. The results reveal that commonly used ML-based approaches to IoT data trust evaluation, which rely on unsupervised cluster analysis to assign trust labels to unlabeled data, perform poorly. This poor performance can be attributed to the underlying unsubstantiated assumption that clustering provides reliable labels for data trust, a premise that is found to be untenable. The results also show that the ML models learned from datasets augmented via RWI while using the proposed features generalize well to unseen data and outperform existing related approaches. Moreover, we observe that a semi-supervised ML approach that requires only about 10% of the data labeled offers competitive performance while being practically more appealing compared to the fully-supervised approaches.
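    The sketch below shows one plausible reading of random walk infilling: a window of a trustworthy series is overwritten by a random walk and labeled untrustworthy. The paper's exact procedure (window selection, step distribution) may differ.

```python
import numpy as np

def random_walk_infill(series, start, length, step_std=None, seed=0):
    """Synthesize untrustworthy data by overwriting a window of a
    trustworthy series with a random walk (a schematic reading of RWI)."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    if step_std is None:
        step_std = np.std(np.diff(series))       # match local volatility
    steps = rng.normal(scale=step_std, size=length)
    out[start:start + length] = series[start] + np.cumsum(steps)
    labels = np.ones_like(series)                # 1 = trustworthy
    labels[start:start + length] = 0             # 0 = untrustworthy
    return out, labels

t = np.linspace(0, 10, 500)
clean = np.sin(t) + 0.05 * np.random.default_rng(1).normal(size=t.size)
augmented, labels = random_walk_infill(clean, start=200, length=100)
```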

Story Visualization by Online Text Augmentation with Context Memory

  • paper_url: http://arxiv.org/abs/2308.07575
  • repo_url: https://github.com/yonseivnl/cmota
  • paper_authors: Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Dongyeop Kang, Jonghyun Choi
  • for: Story visualization (text-to-image generation across multiple sentences), with better robustness to linguistic variation at inference.
  • methods: A novel memory architecture for the bi-directional Transformer framework, combined with online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training.
  • results: On the popular Pororo-SV and Flintstones-SV benchmarks, the method significantly outperforms the state of the art on metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision, with similar or less computational complexity.
    Abstract Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity.

Synthetic data generation method for hybrid image-tabular data using two generative adversarial networks

  • paper_url: http://arxiv.org/abs/2308.07573
  • repo_url: None
  • paper_authors: Tomohiro Kikuchi, Shouhei Hanaoka, Takahiro Nakao, Tomomi Takenaga, Yukihiro Nomura, Harushi Mori, Takeharu Yoshikawa
  • for: Generating synthetic hybrid medical records of chest X-ray images (CXRs) and structured tabular data, to address privacy concerns and promote data sharing in medicine.
  • methods: An auto-encoding GAN (αGAN) is trained on a large public database (pDB) to reduce the dimensionality of CXRs; the trained encoder is applied to images in the original database (oDB) to obtain latent vectors, which are joined with the tabular data to train a conditional tabular GAN (CTGAN).
  • results: Diverse synthetic records were generated while maintaining the correspondence between images and tabular data, as evaluated by visual assessment, inter-record distance distributions, and classification tasks.
    Abstract The generation of synthetic medical records using generative adversarial networks (GANs) has become increasingly important for addressing privacy concerns and promoting data sharing in the medical field. In this paper, we propose a novel method for generating synthetic hybrid medical records consisting of chest X-ray images (CXRs) and structured tabular data (including anthropometric data and laboratory tests) using an auto-encoding GAN (αGAN) and a conditional tabular GAN (CTGAN). Our approach involves training an αGAN model on a large public database (pDB) to reduce the dimensionality of CXRs. We then applied the trained encoder of the GAN model to the images in the original database (oDB) to obtain the latent vectors. These latent vectors were combined with tabular data in oDB, and these joint data were used to train the CTGAN model. We successfully generated diverse synthetic records of hybrid CXR and tabular data, maintaining correspondence between them. We evaluated this synthetic database (sDB) through visual assessment, distribution of interrecord distances, and classification tasks. Our evaluation results showed that the sDB captured the features of the oDB while maintaining the correspondence between the images and tabular data. Although our approach relies on the availability of a large-scale pDB containing a substantial number of images with the same modality and imaging region as those in the oDB, this method has the potential for the public release of synthetic datasets without compromising the secondary use of data.
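    The two-stage pipeline can be sketched as follows, assuming the open-source ctgan package for the tabular GAN; the image encoder here is a random placeholder for the trained αGAN encoder, and all file names and column names are illustrative.

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN  # open-source CTGAN implementation (assumed installed)

def encode_cxr(images):
    """Placeholder for the trained alpha-GAN encoder: maps each chest X-ray
    to a low-dimensional latent vector. Random here, for illustration."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(images), 8))

# Original database: image paths plus anthropometric/lab tabular values.
images = [f"cxr_{i}.png" for i in range(100)]            # hypothetical paths
tabular = pd.DataFrame({
    "age": np.random.default_rng(1).integers(20, 90, size=100),
    "sex": np.random.default_rng(2).choice(["M", "F"], size=100),
    "wbc": np.random.default_rng(3).normal(7.0, 2.0, size=100),
})

# Joint table: latent image vectors concatenated with tabular columns.
latents = pd.DataFrame(encode_cxr(images),
                       columns=[f"z{i}" for i in range(8)])
joint = pd.concat([latents, tabular], axis=1)

ctgan = CTGAN(epochs=5)
ctgan.fit(joint, discrete_columns=["sex"])
synthetic = ctgan.sample(50)  # decode z0..z7 with the alpha-GAN generator
```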

Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition

  • paper_url: http://arxiv.org/abs/2308.07571
  • repo_url: https://github.com/osvai/ske2grid
  • paper_authors: Dongqi Cai, Yangyuxuan Kang, Anbang Yao, Yurong Chen
  • for: Ske2Grid, a representation learning framework for improved skeleton-based action recognition.
  • methods: Three novel designs: a graph-node index transform (GIT) that builds a fixed-size grid patch by assigning skeleton graph nodes to grid cells; an up-sampling transform (UPT) that interpolates skeleton nodes to fill the grid patch; and a progressive learning strategy (PLS) that decouples the UPT into multiple steps, avoiding an overly aggressive one-step UPT while exploiting grid patches of increasing spatial size.
  • results: Experiments on six mainstream skeleton-based action recognition datasets show Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings; code and models are available at https://github.com/OSVAI/Ske2Grid.
    Abstract This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of human skeleton, which is a compact image-like grid patch constructed and learned through three novel designs. Specifically, we propose a graph-node index transform (GIT) to construct a regular grid patch through assigning the nodes in the skeleton graph one by one to the desired grid cells. To ensure that GIT is a bijection and enrich the expressiveness of the grid representation, an up-sampling transform (UPT) is learned to interpolate the skeleton graph nodes for filling the grid patch to the full. To resolve the problem when the one-step UPT is aggressive and further exploit the representation capability of the grid patch with increasing spatial size, a progressive learning strategy (PLS) is proposed which decouples the UPT into multiple steps and aligns them to multiple paired GITs through a compact cascaded design learned progressively. We construct networks upon prevailing graph convolution networks and conduct experiments on six mainstream skeleton-based action recognition datasets. Experiments show that our Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings, without bells and whistles. Code and models are available at https://github.com/OSVAI/Ske2Grid

Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels

  • paper_url: http://arxiv.org/abs/2308.07562
  • repo_url: None
  • paper_authors: Jason Lu, Michael Ma, Huaze Xu, Zixi Xu
  • for: Semi-supervised learning (SSL) under the three main missing-label mechanisms: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR), the last being the most challenging since class distributions cannot be assumed equal.
  • methods: Two new ways of combining multiple imputation models: (1) building confidence intervals from multiple imputation models and thresholding away low-confidence pseudo-labels; (2) SSL with De-biased Imputations (SSL-DI), which reduces bias by filtering out inaccurate data to find an accurate and reliable subset, which is then imputed into another SSL model.
  • results: Experiments show the proposed methods are effective in both MCAR and MNAR settings, outperforming existing methods in classification accuracy and bias reduction.
    Abstract Semi-Supervised Learning (SSL) is implemented when algorithms are trained on both labeled and unlabeled data. This is a very common application of ML as it is unrealistic to obtain a fully labeled dataset. Researchers have tackled three main issues: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). The MNAR problem is the most challenging of the three as one cannot safely assume that all class distributions are equal. Existing methods, including Class-Aware Imputation (CAI) and Class-Aware Propensity (CAP), mostly overlook the non-randomness in the unlabeled data. This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias. 1) We use multiple imputation models, create confidence intervals, and apply a threshold to ignore pseudo-labels with low confidence. 2) Our new method, SSL with De-biased Imputations (SSL-DI), aims to reduce bias by filtering out inaccurate data and finding a subset that is accurate and reliable. This subset of the larger dataset could be imputed into another SSL model, which will be less biased. The proposed models have been shown to be effective in both MCAR and MNAR situations, and experimental results show that our methodology outperforms existing methods in terms of classification accuracy and reducing bias.
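    A minimal sketch of the first idea, multiple imputation models with confidence thresholding, using scikit-learn; the agreement-and-threshold rule stands in for the paper's confidence intervals, and the retained subset is what SSL-DI would feed to a second SSL model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy setup: labels observed only for a subset of the data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:300] = True

# Multiple "imputation" models trained on the labeled subset.
models = [RandomForestClassifier(random_state=0),
          LogisticRegression(max_iter=1000)]
for m in models:
    m.fit(X[labeled], y[labeled])

# Pseudo-label the unlabeled pool; keep only high-confidence agreements
# (a schematic version of the confidence-interval/threshold filtering).
probs = np.stack([m.predict_proba(X[~labeled]) for m in models])  # (M, N, C)
mean_p = probs.mean(axis=0)
pseudo = mean_p.argmax(axis=1)
agree = (probs.argmax(axis=2) == pseudo).all(axis=0)
confident = mean_p.max(axis=1) > 0.9
keep = agree & confident                     # de-biased, reliable subset

print(f"kept {keep.sum()} of {len(keep)} pseudo-labels")
```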

A User-Centered Evaluation of Spanish Text Simplification

  • paper_url: http://arxiv.org/abs/2308.07556
  • repo_url: None
  • paper_authors: Adrian de Wynter, Anthony Hevia, Si-Qing Chen
  • for: Evaluating Spanish text simplification (TS) for a production system, using two corpora focused on complex-sentence and complex-word identification.
  • methods: The most prevalent Spanish-specific readability scores are compared with neural networks, which prove consistently better at predicting user preferences regarding TS.
  • results: Multilingual models underperform equivalent Spanish-only models on the same task, and all models focus too often on spurious statistical features such as sentence length; the evaluation corpora are released to the community.
    Abstract We present an evaluation of text simplification (TS) in Spanish for a production system, by means of two corpora focused in both complex-sentence and complex-word identification. We compare the most prevalent Spanish-specific readability scores with neural networks, and show that the latter are consistently better at predicting user preferences regarding TS. As part of our analysis, we find that multilingual models underperform against equivalent Spanish-only models on the same task, yet all models focus too often on spurious statistical features, such as sentence length. We release the corpora in our evaluation to the broader community with the hopes of pushing forward the state-of-the-art in Spanish natural language processing.

Enhancing the Antidote: Improved Pointwise Certifications against Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.07553
  • repo_url: None
  • paper_authors: Shijie Liu, Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I. P. Rubinstein
  • for: Defending against poisoning attacks that influence model behaviour through small changes to the training corpus.
  • methods: Differential privacy and the Sampled Gaussian Mechanism are exploited to certify that each test instance's prediction is invariant to a finite number of poisoned training examples (pointwise certification).
  • results: The model provides adversarial robustness guarantees more than twice as large as those of prior certifications.
    Abstract Poisoning attacks can disproportionately influence model behaviour by making small changes to the training corpus. While defences against specific poisoning attacks do exist, they in general do not provide any guarantees, leaving them potentially countered by novel attacks. In contrast, by examining worst-case behaviours Certified Defences make it possible to provide guarantees of the robustness of a sample against adversarial attacks modifying a finite number of training samples, known as pointwise certification. We achieve this by exploiting both Differential Privacy and the Sampled Gaussian Mechanism to ensure the invariance of prediction for each testing instance against finite numbers of poisoned examples. In doing so, our model provides guarantees of adversarial robustness that are more than twice as large as those provided by prior certifications.

Domain Adaptation via Minimax Entropy for Real/Bogus Classification of Astronomical Alerts

  • paper_url: http://arxiv.org/abs/2308.07538
  • repo_url: None
  • paper_authors: Guillermo Cabrera-Vives, César Bolivar, Francisco Förster, Alejandra M. Muñoz Arancibia, Manuel Pérez-Carrasco, Esteban Reyes
  • for: Domain adaptation (DA) for real/bogus classification of astronomical alerts, motivated by the real-time analysis of multiple massive time-domain datasets.
  • methods: The domain shift between four datasets (HiTS, DES, ATLAS, and ZTF) is studied, and a naive deep learning classifier is improved via fine-tuning and semi-supervised deep DA with Minimax Entropy (MME).
  • results: Both the fine-tuned and MME models significantly improve the base model with as few as one labeled item per class from the target dataset, and MME does not compromise performance on the source dataset.
    Abstract Time domain astronomy is advancing towards the analysis of multiple massive datasets in real time, prompting the development of multi-stream machine learning models. In this work, we study Domain Adaptation (DA) for real/bogus classification of astronomical alerts using four different datasets: HiTS, DES, ATLAS, and ZTF. We study the domain shift between these datasets, and improve a naive deep learning classification model by using a fine tuning approach and semi-supervised deep DA via Minimax Entropy (MME). We compare the balanced accuracy of these models for different source-target scenarios. We find that both the fine tuning and MME models improve significantly the base model with as few as one labeled item per class coming from the target dataset, but that the MME does not compromise its performance on the source dataset.
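    For reference, Minimax Entropy alternates two updates on unlabeled target data: the classifier maximizes prediction entropy (pulling class prototypes toward target features) while the feature extractor minimizes it. The explicit alternating form below is a schematic PyTorch rendering; the original MME plays the same game with a gradient-reversal layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
feat = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32))
clf = nn.Linear(32, 2, bias=False)          # prototype-style classifier
opt_f = torch.optim.SGD(feat.parameters(), lr=1e-2)
opt_c = torch.optim.SGD(clf.parameters(), lr=1e-2)
lam = 0.1

def entropy(logits):
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

# Stand-ins for labeled source alerts and unlabeled target alerts.
x_src, y_src = torch.randn(32, 20), torch.randint(0, 2, (32,))
x_tgt = torch.randn(32, 20)

for step in range(100):
    # Classifier step: fit source labels while *maximizing* target entropy,
    # which moves the prototypes toward the unlabeled target features.
    loss_c = F.cross_entropy(clf(feat(x_src).detach()), y_src) \
             - lam * entropy(clf(feat(x_tgt).detach()))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Feature step: fit source labels while *minimizing* target entropy,
    # clustering target features around the (now fixed) prototypes.
    loss_f = F.cross_entropy(clf(feat(x_src)), y_src) \
             + lam * entropy(clf(feat(x_tgt)))
    opt_f.zero_grad(); opt_c.zero_grad(); loss_f.backward(); opt_f.step()
```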

KMF: Knowledge-Aware Multi-Faceted Representation Learning for Zero-Shot Node Classification

  • paper_url: http://arxiv.org/abs/2308.08563
  • repo_url: None
  • paper_authors: Likang Wu, Junji Jiang, Hongke Zhao, Hao Wang, Defu Lian, Mengdi Zhang, Enhong Chen
  • for: zero-shot node classification (ZNC) task in graph data analysis, to predict nodes from unseen classes
  • methods: Knowledge-Aware Multi-Faceted (KMF) framework that enhances label semantics via extracted KG-based topics, and reconstructs node content to a topic-level representation
  • results: extensive experiments on several public graph datasets, demonstrating effectiveness and generalization of KMF compared to state-of-the-art baselines, and an application of zero-shot cross-domain recommendation.
    Abstract Recently, Zero-Shot Node Classification (ZNC) has been an emerging and crucial task in graph data analysis. This task aims to predict nodes from unseen classes which are unobserved in the training process. Existing work mainly utilizes Graph Neural Networks (GNNs) to associate features' prototypes and labels' semantics thus enabling knowledge transfer from seen to unseen classes. However, the multi-faceted semantic orientation in the feature-semantic alignment has been neglected by previous work, i.e. the content of a node usually covers diverse topics that are relevant to the semantics of multiple labels. It's necessary to separate and judge the semantic factors that tremendously affect the cognitive ability to improve the generality of models. To this end, we propose a Knowledge-Aware Multi-Faceted framework (KMF) that enhances the richness of label semantics via the extracted KG (Knowledge Graph)-based topics. And then the content of each node is reconstructed to a topic-level representation that offers multi-faceted and fine-grained semantic relevancy to different labels. Due to the particularity of the graph's instance (i.e., node) representation, a novel geometric constraint is developed to alleviate the problem of prototype drift caused by node information aggregation. Finally, we conduct extensive experiments on several public graph datasets and design an application of zero-shot cross-domain recommendation. The quantitative results demonstrate both the effectiveness and generalization of KMF with the comparison of state-of-the-art baselines.

Projection-Free Methods for Stochastic Simple Bilevel Optimization with Convex Lower-level Problem

  • paper_url: http://arxiv.org/abs/2308.07536
  • repo_url: None
  • paper_authors: Jincheng Cao, Ruichen Jiang, Nazanin Abolfazli, Erfan Yazdandoost Hamedani, Aryan Mokhtari
  • for: Stochastic simple bilevel optimization, where a smooth stochastic objective is minimized over the optimal solution set of another stochastic convex optimization problem.
  • methods: Novel stochastic bilevel optimization methods that locally approximate the lower-level solution set via a stochastic cutting plane, then run conditional gradient updates with variance reduction to control the error induced by stochastic gradients.
  • results: When the upper-level function is convex, the method requires $\tilde{\mathcal{O}}(\max\{1/\epsilon_f^{2},1/\epsilon_g^{2}\})$ stochastic oracle queries for a solution that is $\epsilon_f$-optimal for the upper level and $\epsilon_g$-optimal for the lower level, improving on the previous best $\mathcal{O}(\max\{1/\epsilon_f^{4},1/\epsilon_g^{4}\})$. When the upper-level function is non-convex, at most $\tilde{\mathcal{O}}(\max\{1/\epsilon_f^{3},1/\epsilon_g^{3}\})$ queries suffice to find an $(\epsilon_f,\epsilon_g)$-stationary point. In the finite-sum setting, the method requires $\tilde{\mathcal{O}}(\sqrt{n}/\epsilon)$ and $\tilde{\mathcal{O}}(\sqrt{n}/\epsilon^{2})$ queries in the convex and non-convex settings, respectively, where $\epsilon=\min\{\epsilon_f,\epsilon_g\}$.
    Abstract In this paper, we study a class of stochastic bilevel optimization problems, also known as stochastic simple bilevel optimization, where we minimize a smooth stochastic objective function over the optimal solution set of another stochastic convex optimization problem. We introduce novel stochastic bilevel optimization methods that locally approximate the solution set of the lower-level problem via a stochastic cutting plane, and then run a conditional gradient update with variance reduction techniques to control the error induced by using stochastic gradients. For the case that the upper-level function is convex, our method requires $\tilde{\mathcal{O}}(\max\{1/\epsilon_f^{2},1/\epsilon_g^{2}\})$ stochastic oracle queries to obtain a solution that is $\epsilon_f$-optimal for the upper-level and $\epsilon_g$-optimal for the lower-level. This guarantee improves the previous best-known complexity of $\mathcal{O}(\max\{1/\epsilon_f^{4},1/\epsilon_g^{4}\})$. Moreover, for the case that the upper-level function is non-convex, our method requires at most $\tilde{\mathcal{O}}(\max\{1/\epsilon_f^{3},1/\epsilon_g^{3}\})$ stochastic oracle queries to find an $(\epsilon_f, \epsilon_g)$-stationary point. In the finite-sum setting, we show that the number of stochastic oracle calls required by our method are $\tilde{\mathcal{O}}(\sqrt{n}/\epsilon)$ and $\tilde{\mathcal{O}}(\sqrt{n}/\epsilon^{2})$ for the convex and non-convex settings, respectively, where $\epsilon=\min \{\epsilon_f,\epsilon_g\}$.
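    In notation suggested by the abstract (symbols are ours), the stochastic simple bilevel problem and a cutting-plane-style relaxation of its lower level read:

```latex
\min_{x \in \mathcal{X}^*} \; f(x) := \mathbb{E}_{\theta}\big[\tilde f(x,\theta)\big],
\qquad
\mathcal{X}^* := \operatorname*{arg\,min}_{z \in \mathcal{Z}} \; g(z), \quad
g(z) := \mathbb{E}_{\xi}\big[\tilde g(z,\xi)\big];
\qquad
\mathcal{X}^* \;\approx\; \big\{ z \in \mathcal{Z} : \hat g_t(z) \le \hat g_t(\hat z_t) + \varepsilon_t \big\},
```

    where $\hat g_t$ is a stochastic estimate of $g$, $\hat z_t$ an approximate lower-level minimizer, and $\varepsilon_t$ a shrinking tolerance; the conditional gradient step is then taken over this relaxed set.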

Inverse Lithography Physics-informed Deep Neural Level Set for Mask Optimization

  • paper_url: http://arxiv.org/abs/2308.12299
  • repo_url: None
  • paper_authors: Xing-Yu Ma, Shaogang Hao
  • for: Mask optimization in lithography via an inverse lithography physics-informed deep neural level set (ILDLS) approach.
  • methods: Level-set-based inverse lithography technology (ILT) is incorporated as a layer within the deep learning framework, which iteratively conducts mask prediction and correction.
  • results: Compared with pure DL and ILT, ILDLS cuts computation time by a few orders of magnitude while significantly enhancing printability and process window (PW), providing an efficient mask optimization solution.
    Abstract As the feature size of integrated circuits continues to decrease, optical proximity correction (OPC) has emerged as a crucial resolution enhancement technology for ensuring high printability in the lithography process. Recently, level set-based inverse lithography technology (ILT) has drawn considerable attention as a promising OPC solution, showcasing its powerful pattern fidelity, especially in advanced process. However, massive computational time consumption of ILT limits its applicability to mainly correcting partial layers and hotspot regions. Deep learning (DL) methods have shown great potential in accelerating ILT. However, lack of domain knowledge of inverse lithography limits the ability of DL-based algorithms in process window (PW) enhancement and etc. In this paper, we propose an inverse lithography physics-informed deep neural level set (ILDLS) approach for mask optimization. This approach utilizes level set based-ILT as a layer within the DL framework and iteratively conducts mask prediction and correction to significantly enhance printability and PW in comparison with results from pure DL and ILT. With this approach, computation time is reduced by a few orders of magnitude versus ILT. By gearing up DL with knowledge of inverse lithography physics, ILDLS provides a new and efficient mask optimization solution.

FeatGeNN: Improving Model Performance for Tabular Data with Correlation-based Feature Extraction

  • paper_url: http://arxiv.org/abs/2308.07527
  • repo_url: None
  • paper_authors: Sammuel Ramos Silva, Rodrigo Silva
  • for: Automated feature engineering (AutoFE) to improve model performance and provide more information for statistical analysis.
  • methods: A convolutional method, FeatGeNN, that extracts and creates new features from the data matrix using correlation as the pooling function, which better suits the linear relationships in tabular data than max-pooling.
  • results: On various benchmark datasets, FeatGeNN outperforms existing AutoFE approaches in model performance, suggesting correlation-based pooling as a promising alternative to max-pooling for tabular data.
    Abstract Automated Feature Engineering (AutoFE) has become an important task for any machine learning project, as it can help improve model performance and gain more information for statistical analysis. However, most current approaches for AutoFE rely on manual feature creation or use methods that can generate a large number of features, which can be computationally intensive and lead to overfitting. To address these challenges, we propose a novel convolutional method called FeatGeNN that extracts and creates new features using correlation as a pooling function. Unlike traditional pooling functions like max-pooling, correlation-based pooling considers the linear relationship between the features in the data matrix, making it more suitable for tabular data. We evaluate our method on various benchmark datasets and demonstrate that FeatGeNN outperforms existing AutoFE approaches regarding model performance. Our results suggest that correlation-based pooling can be a promising alternative to max-pooling for AutoFE in tabular data applications.
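    A toy illustration of correlation as a pooling function: instead of taking the maximum over a window, the pooled value summarizes the linear relationship between features. The exact operator and windowing in FeatGeNN may differ.

```python
import numpy as np

def correlation_pooling(window):
    """Pool an (n_samples, k) window of features into one value using
    Pearson correlation instead of max (a schematic reading of FeatGeNN)."""
    corr = np.corrcoef(window, rowvar=False)       # (k, k) feature corr.
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu].mean()                         # mean pairwise corr.

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                      # tabular data matrix
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)     # correlated feature pair

# Slide a "convolution" over a feature pair to create a new feature.
new_feature = np.array([
    correlation_pooling(X[i:i + 50, :2]) for i in range(0, 450, 50)
])
print(new_feature)
```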

Potential of Deep Operator Networks in Digital Twin-enabling Technology for Nuclear System

  • paper_url: http://arxiv.org/abs/2308.07523
  • repo_url: None
  • paper_authors: Kazuma Kobayashi, Syed Bahauddin Alam
  • for: Introducing the Deep Operator Network (DeepONet) as a robust, high-accuracy surrogate modeling method for digital twin (DT) systems in nuclear engineering.
  • methods: DeepONet takes functions as input data and constructs the operator G from training data.
  • results: DeepONet shows remarkable prediction accuracy on a challenging particle transport problem, outperforming traditional ML methods, though optimal sensor placement and model evaluation remain open challenges for real-world deployment.
    Abstract This research introduces the Deep Operator Network (DeepONet) as a robust surrogate modeling method within the context of digital twin (DT) systems for nuclear engineering. With the increasing importance of nuclear energy as a carbon-neutral solution, adopting DT technology has become crucial to enhancing operational efficiencies, safety, and predictive capabilities in nuclear engineering applications. DeepONet exhibits remarkable prediction accuracy, outperforming traditional ML methods. Through extensive benchmarking and evaluation, this study showcases the scalability and computational efficiency of DeepONet in solving a challenging particle transport problem. By taking functions as input data and constructing the operator $G$ from training data, DeepONet can handle diverse and complex scenarios effectively. However, the application of DeepONet also reveals challenges related to optimal sensor placement and model evaluation, critical aspects of real-world implementation. Addressing these challenges will further enhance the method's practicality and reliability. Overall, DeepONet presents a promising and transformative tool for nuclear engineering research and applications. Its accurate prediction and computational efficiency capabilities can revolutionize DT systems, advancing nuclear engineering research. This study marks an important step towards harnessing the power of surrogate modeling techniques in critical engineering domains.
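    For readers unfamiliar with the architecture, a minimal DeepONet has a branch net for the input function (sampled at fixed sensors) and a trunk net for the query point, combined by an inner product: G(u)(y) ≈ Σ_k b_k(u) t_k(y). The toy usage below is illustrative, not the paper's particle-transport setup.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet: branch net encodes the input function u sampled at
    m sensors; trunk net encodes the query location y; the operator output
    is their inner product."""
    def __init__(self, m=100, p=32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m, 64), nn.Tanh(), nn.Linear(64, p))
        self.trunk = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, p))

    def forward(self, u, y):
        return (self.branch(u) * self.trunk(y)).sum(dim=-1, keepdim=True)

m = 100
xs = torch.linspace(0, 1, m)
u = torch.sin(2 * torch.pi * xs).repeat(8, 1)    # batch of input functions
y = torch.rand(8, 1)                             # query locations
model = DeepONet(m=m)
print(model(u, y).shape)                         # torch.Size([8, 1])
```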

Nonlinearity, Feedback and Uniform Consistency in Causal Structural Learning

  • paper_url: http://arxiv.org/abs/2308.07520
  • repo_url: None
  • paper_authors: Shuyan Wang
  • for: Automated search methods for learning causal structures from observational data.
  • methods: An alternative definition of k-Triangle Faithfulness that is weaker than strong faithfulness for the Gaussian family yet applicable to non-Gaussian families, together with a relaxation of the sufficiency assumption to learn causal structures with latent variables.
  • results: Under a modified version of Strong Faithfulness, the alternative definition yields uniform consistency of a modified causal discovery algorithm, extending causal discovery to a wider range of causal mechanisms and statistical phenomena.
    Abstract The goal of Causal Discovery is to find automated search methods for learning causal structures from observational data. In some cases all variables of the causal mechanism of interest are measured, and the task is to predict the effects one measured variable has on another. In contrast, sometimes the variables of primary interest are not directly observable but instead inferred from their manifestations in the data. These are referred to as latent variables. One commonly known example is the psychological construct of intelligence, which cannot be directly measured, so researchers try to assess it through various indicators such as IQ tests. In this case, causal discovery algorithms can uncover underlying patterns and structures to reveal the causal connections between the latent variables and between the latent and observed variables. This thesis focuses on two questions in causal discovery: providing an alternative definition of k-Triangle Faithfulness that (i) is weaker than strong faithfulness when applied to the Gaussian family of distributions, (ii) can be applied to non-Gaussian families of distributions, and (iii) under the assumption that the modified version of Strong Faithfulness holds, can be used to show the uniform consistency of a modified causal discovery algorithm; relaxing the sufficiency assumption to learn causal structures with latent variables. Given the importance of inferring cause-and-effect relationships for understanding and forecasting complex systems, the work in this thesis of relaxing various simplification assumptions is expected to extend the causal discovery method to a wider range of causal mechanisms and statistical phenomena.

Distilling Knowledge from Resource Management Algorithms to Neural Networks: A Unified Training Assistance Approach

  • paper_url: http://arxiv.org/abs/2308.07511
  • repo_url: None
  • paper_authors: Longfei Ma, Nan Cheng, Xiucheng Wang, Zhisheng Yin, Haibo Zhou, Wei Quan
  • for: Improving the accuracy and convergence speed of signal-to-interference-plus-noise ratio (SINR) optimization in multi-user communication systems.
  • methods: A knowledge distillation (KD) based algorithm distillation (AD) approach in which traditional SINR optimization methods act as "teachers" assisting the training of neural network "students", thereby enhancing unsupervised and reinforcement learning techniques.
  • results: Simulations show the proposed AD-based methods outperform traditional learning methods, alleviating common issues such as the infeasibility of obtaining optimal solutions as labels, overfitting in supervised learning, and low training efficiency in reinforcement learning.
    Abstract As a fundamental problem, numerous methods are dedicated to the optimization of signal-to-interference-plus-noise ratio (SINR), in a multi-user setting. Although traditional model-based optimization methods achieve strong performance, the high complexity raises the research of neural network (NN) based approaches to trade-off the performance and complexity. To fully leverage the high performance of traditional model-based methods and the low complexity of the NN-based method, a knowledge distillation (KD) based algorithm distillation (AD) method is proposed in this paper to improve the performance and convergence speed of the NN-based method, where traditional SINR optimization methods are employed as ``teachers" to assist the training of NNs, which are ``students", thus enhancing the performance of unsupervised and reinforcement learning techniques. This approach aims to alleviate common issues encountered in each of these training paradigms, including the infeasibility of obtaining optimal solutions as labels and overfitting in supervised learning, ensuring higher convergence performance in unsupervised learning, and improving training efficiency in reinforcement learning. Simulation results demonstrate the enhanced performance of the proposed AD-based methods compared to traditional learning methods. Remarkably, this research paves the way for the integration of traditional optimization insights and emerging NN techniques in wireless communication system optimization.
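    A schematic rendering of the algorithm distillation idea: a classical optimizer's solutions supervise a neural network student. The teacher below is a random placeholder for any traditional SINR optimizer, and in the paper the distillation term assists unsupervised or RL objectives rather than replacing them.

```python
import torch
import torch.nn as nn

def teacher_power_allocation(h):
    """Placeholder for an iterative SINR optimizer's output (the "teacher");
    random here, purely for illustration."""
    return torch.sigmoid(h.sum(dim=1, keepdim=True).repeat(1, h.shape[1]))

# Student NN maps channel gains to per-user power allocations in [0, 1].
student = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                        nn.Linear(64, 4), nn.Sigmoid())
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    h = torch.rand(32, 4)                  # random channel realizations
    with torch.no_grad():
        p_teacher = teacher_power_allocation(h)
    p_student = student(h)
    # Distillation term guiding the student toward the teacher's solutions.
    loss = ((p_student - p_teacher) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```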

Data Race Detection Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07505
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Le Chen, Xianzhong Ding, Murali Emani, Tristan Vanderbruggen, Pei-hung Lin, Chuanhua Liao
  • for: Exploring large language models (LLMs) for data race detection, as an alternative to resource-intensive manual tool creation.
  • methods: A novel detection approach combining prompt engineering and fine-tuning, built on DRB-ML, a dedicated dataset derived from DataRaceBench with fine-grained labels for data race pairs and their variables, line numbers, and read/write information; representative LLMs are evaluated and open-source ones fine-tuned.
  • results: Experiments show LLMs are a viable approach to data race detection, but they cannot yet match traditional tools when detailed information about the variable pairs causing the race is required.
    Abstract Large language models (LLMs) are demonstrating significant promise as an alternate strategy to facilitate analyses and optimizations of high-performance computing programs, circumventing the need for resource-intensive manual tool creation. In this paper, we explore a novel LLM-based data race detection approach combining prompting engineering and fine-tuning techniques. We create a dedicated dataset named DRB-ML, which is derived from DataRaceBench, with fine-grain labels showing the presence of data race pairs and their associated variables, line numbers, and read/write information. DRB-ML is then used to evaluate representative LLMs and fine-tune open-source ones. Our experiment shows that LLMs can be a viable approach to data race detection. However, they still cannot compete with traditional data race detection tools when we need detailed information about variable pairs causing data races.
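    A hypothetical prompt in the spirit of the paper's prompt engineering; DRB-ML's actual templates and label schema are not reproduced here, so the requested fields are illustrative, and `llm.generate` stands in for any chat-completion API.

```python
# Example OpenMP snippet with a genuine data race: the shared variable
# `sum` is updated without synchronization or a reduction clause.
SNIPPET = """
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    sum += a[i];
}
"""

prompt = (
    "You are a data race detection tool. Analyze the OpenMP code below.\n"
    "Report: (1) whether a data race exists, (2) the variable pair involved,\n"
    "(3) the line numbers, and (4) the read/write access types.\n\n"
    f"Code:\n{SNIPPET}"
)
# response = llm.generate(prompt)   # placeholder for any LLM API call
print(prompt)
```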

ST-MLP: A Cascaded Spatio-Temporal Linear Framework with Channel-Independence Strategy for Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2308.07496
  • repo_url: None
  • paper_authors: Zepu Wang, Yuqi Nie, Peng Sun, Nam H. Nguyen, John Mulvey, H. Vincent Poor
  • for: Optimizing traffic flow management in Intelligent Transportation Systems (ITS) through prompt, precise, and computationally light traffic forecasting.
  • methods: ST-MLP, a concise spatio-temporal model based solely on cascaded Multi-Layer Perceptron (MLP) modules and linear layers, which combines temporal information, spatial information, and a predefined graph structure with a channel-independence strategy, an effective technique from time series forecasting.
  • results: Empirical results show ST-MLP outperforms state-of-the-art STGNNs and other models in both accuracy and computational efficiency.
    Abstract The criticality of prompt and precise traffic forecasting in optimizing traffic flow management in Intelligent Transportation Systems (ITS) has drawn substantial scholarly focus. Spatio-Temporal Graph Neural Networks (STGNNs) have been lauded for their adaptability to road graph structures. Yet, current research on STGNNs architectures often prioritizes complex designs, leading to elevated computational burdens with only minor enhancements in accuracy. To address this issue, we propose ST-MLP, a concise spatio-temporal model solely based on cascaded Multi-Layer Perceptron (MLP) modules and linear layers. Specifically, we incorporate temporal information, spatial information and predefined graph structure with a successful implementation of the channel-independence strategy - an effective technique in time series forecasting. Empirical results demonstrate that ST-MLP outperforms state-of-the-art STGNNs and other models in terms of accuracy and computational efficiency. Our finding encourages further exploration of more concise and effective neural network architectures in the field of traffic forecasting.
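    A schematic PyTorch block showing the channel-independence strategy: one temporal MLP is shared across all nodes, with the predefined graph mixing applied afterwards. The real ST-MLP's layer layout, cascading, and normalization will differ.

```python
import torch
import torch.nn as nn

class ChannelIndependentMLP(nn.Module):
    """Schematic ST-MLP-style block: the same temporal MLP is applied to
    every node/channel independently (channel-independence strategy), with
    spatial structure injected afterwards via a fixed graph matrix."""
    def __init__(self, t_in, t_out, adj):
        super().__init__()
        self.temporal = nn.Sequential(nn.Linear(t_in, 64), nn.ReLU(),
                                      nn.Linear(64, t_out))
        self.register_buffer("adj", adj)    # predefined road-graph weights

    def forward(self, x):                   # x: (batch, nodes, t_in)
        h = self.temporal(x)                # shared weights across channels
        return torch.einsum("nm,bmt->bnt", self.adj, h)  # spatial mixing

n_nodes, t_in, t_out = 207, 12, 12
adj = torch.softmax(torch.randn(n_nodes, n_nodes), dim=1)
model = ChannelIndependentMLP(t_in, t_out, adj)
print(model(torch.randn(8, n_nodes, t_in)).shape)  # (8, 207, 12)
```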

Adaptive Tracking of a Single-Rigid-Body Character in Various Environments

  • paper_url: http://arxiv.org/abs/2308.07491
  • repo_url: None
  • paper_authors: Taesoo Kwon, Taehong Gu, Jaewon Ahn, Yoonsang Lee
  • for: This paper proposes a deep reinforcement learning method for simulating full-body human motions in various scenarios, with the goal of adapting to unobserved environmental changes and controller transitions without requiring additional learning.
  • methods: The proposed method uses the centroidal dynamics model (CDM) to express the full-body character as a single rigid body (SRB) and trains a policy to track a reference motion using deep reinforcement learning. The SRB simulation is formulated as a quadratic programming (QP) problem, and the policy outputs an action that allows the SRB character to follow the reference motion.
  • results: The proposed method is demonstrated to be sample-efficient and able to cope with environments that have not been experienced during learning, such as running on uneven terrain or pushing a box, as well as transitions between learned policies, without requiring any additional learning. The policy can be efficiently trained within 30 minutes on an ultraportable laptop.
    Abstract Since the introduction of DeepMimic [Peng et al. 2018], subsequent research has focused on expanding the repertoire of simulated motions across various scenarios. In this study, we propose an alternative approach for this goal, a deep reinforcement learning method based on the simulation of a single-rigid-body character. Using the centroidal dynamics model (CDM) to express the full-body character as a single rigid body (SRB) and training a policy to track a reference motion, we can obtain a policy that is capable of adapting to various unobserved environmental changes and controller transitions without requiring any additional learning. Due to the reduced dimension of state and action space, the learning process is sample-efficient. The final full-body motion is kinematically generated in a physically plausible way, based on the state of the simulated SRB character. The SRB simulation is formulated as a quadratic programming (QP) problem, and the policy outputs an action that allows the SRB character to follow the reference motion. We demonstrate that our policy, efficiently trained within 30 minutes on an ultraportable laptop, has the ability to cope with environments that have not been experienced during learning, such as running on uneven terrain or pushing a box, and transitions between learned policies, without any additional learning.

O-1: Self-training with Oracle and 1-best Hypothesis

  • paper_url: http://arxiv.org/abs/2308.07486
  • repo_url: None
  • paper_authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi
  • for: O-1, a new self-training objective that reduces training bias and unifies training and evaluation metrics for speech recognition.
  • methods: O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle hypothesis and accommodates both supervised and unsupervised data.
  • results: On SpeechStew, O-1 closes the gap between actual and oracle performance by 80% relative, versus 43% for EMBR, with 13% to 25% relative improvement over EMBR across the constituent datasets; on a large-scale in-house dataset it reduces the gap to the oracle WER by 12% relative, for an overall 9% relative WER improvement over EMBR.
    Abstract We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew datasets and a large-scale, in-house data set. On SpeechStew, the O-1 objective closes the gap between the actual and oracle performance by 80% relative compared to EMBR, which bridges the gap by 43% relative. O-1 achieves 13% to 25% relative improvement over EMBR on the various datasets that SpeechStew comprises, and a 12% relative gap reduction with respect to the oracle WER over EMBR training on the in-house dataset. Overall, O-1 results in a 9% relative improvement in WER over EMBR, thereby speaking to the scalability of the proposed objective for large-scale datasets.

OCDaf: Ordered Causal Discovery with Autoregressive Flows

  • paper_url: http://arxiv.org/abs/2308.07480
  • repo_url: https://github.com/vahidzee/ocdaf
  • paper_authors: Hamidreza Kamkari, Vahid Zehtab, Vahid Balazadeh, Rahul G. Krishnan
  • for: Learning causal graphs from observational data.
  • methods: OCDaf, an order-based method with a continuous search algorithm for finding causal structures, built on the structural similarity between multivariate heteroscedastic noise models and affine autoregressive normalizing flows.
  • results: State-of-the-art performance on the Sachs and SynTReN benchmarks in Structural Hamming Distance (SHD) and Structural Intervention Distance (SID), with the identifiability theory validated across various parametric and nonparametric synthetic datasets and superior performance over existing baselines.
    Abstract We propose OCDaf, a novel order-based method for learning causal graphs from observational data. We establish the identifiability of causal graphs within multivariate heteroscedastic noise models, a generalization of additive noise models that allow for non-constant noise variances. Drawing upon the structural similarities between these models and affine autoregressive normalizing flows, we introduce a continuous search algorithm to find causal structures. Our experiments demonstrate state-of-the-art performance across the Sachs and SynTReN benchmarks in Structural Hamming Distance (SHD) and Structural Intervention Distance (SID). Furthermore, we validate our identifiability theory across various parametric and nonparametric synthetic datasets and showcase superior performance compared to existing baselines.
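    The heteroscedastic noise models referenced above generalize additive noise models by letting the noise scale depend on the parents; a standard location-scale form consistent with the abstract's description is:

```latex
X_i = f_i\big(X_{\mathrm{pa}(i)}\big) + \sigma_i\big(X_{\mathrm{pa}(i)}\big)\,\varepsilon_i,
\qquad \varepsilon_i \perp X_{\mathrm{pa}(i)},
```

    with the additive noise model recovered when $\sigma_i$ is constant; in the affine autoregressive flow view, each variable's shift and scale are conditioned on its predecessors in the learned order.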

Symphony: Optimized Model Serving using Centralized Orchestration

  • paper_url: http://arxiv.org/abs/2308.07470
  • repo_url: None
  • paper_authors: Lequn Chen, Weixin Deng, Anirudh Canumalla, Yu Xin, Matthai Philipose, Arvind Krishnamurthy
  • for: Achieving high accelerator efficiency and meeting latency service level objectives (SLOs) for deep neural network (DNN) inference on GPU clusters.
  • methods: A centralized scheduling system that scales to millions of requests per second and coordinates tens of thousands of GPUs, using a non-work-conserving scheduling algorithm for high batch efficiency and robust autoscaling, plus an epoch-scale algorithm that allocates models to sub-clusters based on their compute and memory needs.
  • results: Extensive experiments show Symphony outperforms prior systems by up to 4.7x higher goodput.
    Abstract The orchestration of deep neural network (DNN) model inference on GPU clusters presents two significant challenges: achieving high accelerator efficiency given the batching properties of model inference while meeting latency service level objectives (SLOs), and adapting to workload changes both in terms of short-term fluctuations and long-term resource allocation. To address these challenges, we propose Symphony, a centralized scheduling system that can scale to millions of requests per second and coordinate tens of thousands of GPUs. Our system utilizes a non-work-conserving scheduling algorithm capable of achieving high batch efficiency while also enabling robust autoscaling. Additionally, we developed an epoch-scale algorithm that allocates models to sub-clusters based on the compute and memory needs of the models. Through extensive experiments, we demonstrate that Symphony outperforms prior systems by up to 4.7x higher goodput.

Omega-Regular Reward Machines

  • paper_url: http://arxiv.org/abs/2308.07469
  • repo_url: None
  • paper_authors: Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
  • for: Reinforcement learning with learning objectives whose complexity goes beyond the Markovian assumption, where designing an appropriate reward mechanism is critical to success.
  • methods: Omega-regular reward machines, which integrate reward machines (for quantitative objectives) with omega-regular languages (for qualitative objectives) to provide an expressive and effective reward mechanism for RL.
  • results: A model-free RL algorithm computes epsilon-optimal strategies against omega-regular reward machines, with effectiveness demonstrated through experiments.
    Abstract Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and omega-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces omega-regular reward machines, which integrate reward machines with omega-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute epsilon-optimal strategies against omega-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.
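    For context, a plain reward machine is a finite automaton over label sets whose transitions emit rewards; the omega-regular extension in the paper adds acceptance conditions over infinite runs. The sketch below shows only the finite core, with a made-up pick-up-key-then-reach-goal task.

```python
class RewardMachine:
    """Minimal reward machine: transitions map (state, label) to a
    successor state and an emitted reward."""
    def __init__(self, transitions, start):
        self.transitions = transitions
        self.state = start

    def step(self, label):
        self.state, reward = self.transitions[(self.state, label)]
        return reward

# Task: "reach goal g, but only after picking up key k" ('-' = no event).
rm = RewardMachine({
    ("u0", "k"): ("u1", 0.0), ("u0", "g"): ("u0", 0.0), ("u0", "-"): ("u0", 0.0),
    ("u1", "g"): ("u2", 1.0), ("u1", "k"): ("u1", 0.0), ("u1", "-"): ("u1", 0.0),
    ("u2", "k"): ("u2", 0.0), ("u2", "g"): ("u2", 0.0), ("u2", "-"): ("u2", 0.0),
}, start="u0")

print([rm.step(l) for l in ["-", "k", "-", "g"]])  # [0.0, 0.0, 0.0, 1.0]
```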

There Is a Digital Art History

  • paper_url: http://arxiv.org/abs/2308.07464
  • repo_url: https://github.com/Gracetyty/art-gallery
  • paper_authors: Leonardo Impett, Fabian Offert
  • for: Revisiting Johanna Drucker's decade-old question, "Is there a digital art history?", in light of large-scale transformer-based vision models, whose epistemic implications and methodological affordances have not yet been systematically analyzed.
  • methods: An analysis of two aspects: the visual-cultural repertoire newly encoded in large-scale vision models, including significant numbers of non-photographic images; and two technical case studies applying a contemporary large-scale visual model to basic questions from art history and urbanism.
  • results: The authors argue that such systems require a new critical methodology that accounts for the epistemic entanglement of a model and its applications, reading research corpora through a model's training data and vice versa.
    Abstract In this paper, we revisit Johanna Drucker's question, "Is there a digital art history?" -- posed exactly a decade ago -- in the light of the emergence of large-scale, transformer-based vision models. While more traditional types of neural networks have long been part of digital art history, and digital humanities projects have recently begun to use transformer models, their epistemic implications and methodological affordances have not yet been systematically analyzed. We focus our analysis on two main aspects that, together, seem to suggest a coming paradigm shift towards a "digital" art history in Drucker's sense. On the one hand, the visual-cultural repertoire newly encoded in large-scale vision models has an outsized effect on digital art history. The inclusion of significant numbers of non-photographic images allows for the extraction and automation of different forms of visual logics. Large-scale vision models have "seen" large parts of the Western visual canon mediated by Net visual culture, and they continuously solidify and concretize this canon through their already widespread application in all aspects of digital life. On the other hand, based on two technical case studies of utilizing a contemporary large-scale visual model to investigate basic questions from the fields of art history and urbanism, we suggest that such systems require a new critical methodology that takes into account the epistemic entanglement of a model and its applications. This new methodology reads its corpora through a neural model's training data, and vice versa: the visual ideologies of research datasets and training datasets become entangled.

Inductive Knowledge Graph Completion with GNNs and Rules: An Analysis

  • paper_url: http://arxiv.org/abs/2308.07942
  • repo_url: https://github.com/anilakash/indkgc
  • paper_authors: Akash Anil, Víctor Gutiérrez-Basulto, Yazmín Ibañéz-García, Steven Schockaert
  • for: 本研究旨在解释归纳式知识图谱补全任务中,模型如何从训练图谱学习推理规则,并将其用于预测测试图谱中的链接。
  • methods: 本研究使用基于规则的方法,并研究了若干变体以解决具体问题。
  • results: 研究发现,这些变体可以减少不合理实体的影响,并保持可解释性优势;此外,一种利用完整知识图谱的变体能够持续优于 NBFNet。
    Abstract The task of inductive knowledge graph completion requires models to learn inference patterns from a training graph, which can then be used to make predictions on a disjoint test graph. Rule-based methods seem like a natural fit for this task, but in practice they significantly underperform state-of-the-art methods based on Graph Neural Networks (GNNs), such as NBFNet. We hypothesise that the underperformance of rule-based methods is due to two factors: (i) implausible entities are not ranked at all and (ii) only the most informative path is taken into account when determining the confidence in a given link prediction answer. To analyse the impact of these factors, we study a number of variants of a rule-based approach, which are specifically aimed at addressing the aforementioned issues. We find that the resulting models can achieve a performance which is close to that of NBFNet. Crucially, the considered variants only use a small fraction of the evidence that NBFNet relies on, which means that they largely keep the interpretability advantage of rule-based methods. Moreover, we show that a further variant, which does look at the full KG, consistently outperforms NBFNet.
    摘要 归纳式知识图谱补全任务要求模型从训练图谱中学习推理模式,并将其用于在不相交的测试图谱上进行链接预测。基于规则的方法看似天然适合这一任务,但在实践中,其表现显著落后于基于图神经网络(GNN)的最新方法,如 NBFNet。我们推测这一差距源于两个因素:(i)不合理的实体完全没有被排序;(ii)在确定某一链接预测答案的置信度时只考虑了信息量最大的一条路径。为分析这些因素的影响,我们研究了基于规则方法的若干变体,专门针对上述问题。我们发现,这些模型可以达到接近 NBFNet 的性能;关键在于,这些变体只使用了 NBFNet 所依赖证据的一小部分,因而在很大程度上保留了规则方法的可解释性优势。此外,我们还证明了一种进一步的、利用完整知识图谱的变体能够持续优于 NBFNet。
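
A small illustration of the second failure mode discussed above: if each grounded rule path carries a confidence in [0, 1], max-aggregation discards all but the best path, whereas a noisy-or combination lets several weaker paths accumulate evidence. The sketch below is our own illustration of this idea, not code from the paper.

```python
def noisy_or(confidences):
    """Combine the confidences of all rule paths supporting one candidate
    link. Unlike max-aggregation, many weak paths can add up, while a
    single strong path still dominates."""
    keep_none = 1.0
    for c in confidences:
        keep_none *= (1.0 - c)   # probability that no path fires
    return 1.0 - keep_none

# Three paths support (h, r, t) with confidences 0.6, 0.3, 0.3.
# Max-aggregation scores 0.6; noisy-or credits the extra evidence.
print(noisy_or([0.6, 0.3, 0.3]))   # 0.804
```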

GRU-D-Weibull: A Novel Real-Time Individualized Endpoint Prediction

  • paper_url: http://arxiv.org/abs/2308.07452
  • repo_url: None
  • paper_authors: Xiaoyang Ruan, Liwei Wang, Charat Thongprayoon, Wisit Cheungpasitporn, Hongfang Liu
  • for: 这个研究的目的是提出一种新的方法,GRU-D-Weibull,用于模型Weibull分布,以实现个人化终点预测和人口水平风险管理。
  • methods: 该方法将带衰减的门控循环单元(GRU-D)与 Weibull 分布建模相结合,实现实时的个体化终点预测和人群级风险管理。
  • results: 基于6,879名4期慢性肾病(CKD4)患者的队列,我们评估了GRU-D-Weibull的终点预测性能:其C指数在索引日期约为0.7,随访4.3年后升至约0.77,与随机生存森林相当。该方法在CKD4索引日期的绝对L1损失约为1.1年(SD 0.95),在随访4年时降至最低约0.45年(SD 0.3),显著优于其他方法。在整个随访期间,GRU-D-Weibull对事件发生时刻预测生存概率的约束范围比其他模型更小、更稳定。
    Abstract Accurate prediction models for individual-level endpoints and time-to-endpoints are crucial in clinical practice. In this study, we propose a novel approach, GRU-D-Weibull, which combines gated recurrent units with decay (GRU-D) to model the Weibull distribution. Our method enables real-time individualized endpoint prediction and population-level risk management. Using a cohort of 6,879 patients with stage 4 chronic kidney disease (CKD4), we evaluated the performance of GRU-D-Weibull in endpoint prediction. The C-index of GRU-D-Weibull was ~0.7 at the index date and increased to ~0.77 after 4.3 years of follow-up, similar to random survival forest. Our approach achieved an absolute L1-loss of ~1.1 years (SD 0.95) at the CKD4 index date and a minimum of ~0.45 years (SD0.3) at 4 years of follow-up, outperforming competing methods significantly. GRU-D-Weibull consistently constrained the predicted survival probability at the time of an event within a smaller and more fixed range compared to other models throughout the follow-up period. We observed significant correlations between the error in point estimates and missing proportions of input features at the index date (correlations from ~0.1 to ~0.3), which diminished within 1 year as more data became available. By post-training recalibration, we successfully aligned the predicted and observed survival probabilities across multiple prediction horizons at different time points during follow-up. Our findings demonstrate the considerable potential of GRU-D-Weibull as the next-generation architecture for endpoint risk management, capable of generating various endpoint estimates for real-time monitoring using clinical data.
    摘要 准确的个体级终点及终点时间预测模型在临床实践中至关重要。在本研究中,我们提出了一种新方法GRU-D-Weibull,它将带衰减的门控循环单元(GRU-D)与Weibull分布建模相结合,可实现实时的个体化终点预测和人群级风险管理。基于6,879名4期慢性肾病(CKD4)患者的队列,我们评估了GRU-D-Weibull的终点预测性能:其C指数在索引日期约为0.7,随访4.3年后升至约0.77,与随机生存森林相当。该方法在CKD4索引日期的绝对L1损失约为1.1年(SD 0.95),在随访4年时最低约为0.45年(SD 0.3),显著优于其他方法。在整个随访期间,GRU-D-Weibull对事件发生时刻的预测生存概率约束在比其他模型更小、更固定的范围内。我们发现,索引日期的点估计误差与输入特征缺失比例之间存在显著相关(相关系数约0.1至0.3),且随着一年内更多数据的获得而逐渐减弱。通过训练后的重新校准,我们成功地在随访期间多个时间点、多个预测窗口上对齐了预测与观测的生存概率。我们的结果表明,GRU-D-Weibull作为下一代终点风险管理架构具有可观潜力,能够利用临床数据实时生成多种终点估计。
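
To make the modeling target concrete: the network emits per-patient Weibull parameters, and training can maximize the censored survival likelihood. This is a standard reconstruction from the abstract; the paper's exact parameterization and loss may differ.

$$ S(t \mid \lambda, k) = \exp\!\left[-\left(\tfrac{t}{\lambda}\right)^{k}\right], \qquad f(t \mid \lambda, k) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} \exp\!\left[-\left(\tfrac{t}{\lambda}\right)^{k}\right], $$

and, writing $\delta_i = 1$ when patient $i$'s endpoint was observed (0 if censored at time $t_i$), the negative log-likelihood is

$$ \mathcal{L} = -\sum_{i} \Big[ \delta_i \log f(t_i \mid \lambda_i, k_i) + (1-\delta_i) \log S(t_i \mid \lambda_i, k_i) \Big], $$

where $(\lambda_i, k_i)$ are the GRU-D outputs for patient $i$.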

Open-set Face Recognition using Ensembles trained on Clustered Data

  • paper_url: http://arxiv.org/abs/2308.07445
  • repo_url: None
  • paper_authors: Rafael Henrique Vareto, William Robson Schwartz
  • for: 在开放集人脸识别场景下,未知人物会在测试阶段出现,需要准确识别目标人员并有效处理陌生人脸。本文描述了一种面向成百上千个对象底库的可扩展开放集人脸识别方法。
  • methods: 方法包括聚类和一组二元学习算法的集成,用于判断查询人脸样本是否属于人脸底库,并正确确定其身份。
  • results: 实验表明,即使以可扩展性为目标,也能取得有竞争力的性能。
    Abstract Open-set face recognition describes a scenario where unknown subjects, unseen during the training stage, appear on test time. Not only it requires methods that accurately identify individuals of interest, but also demands approaches that effectively deal with unfamiliar faces. This work details a scalable open-set face identification approach to galleries composed of hundreds and thousands of subjects. It is composed of clustering and an ensemble of binary learning algorithms that estimates when query face samples belong to the face gallery and then retrieves their correct identity. The approach selects the most suitable gallery subjects and uses the ensemble to improve prediction performance. We carry out experiments on well-known LFW and YTF benchmarks. Results show that competitive performance can be achieved even when targeting scalability.
    摘要 开放集人脸识别描述这样一种场景:训练阶段未出现过的未知对象会在测试时出现。这不仅要求方法能够准确识别目标人员,还要求其能有效处理陌生人脸。本文介绍了一种可扩展的开放集人脸辨识方法,适用于包含成百上千个对象的底库。该方法由聚类和一组二元学习算法的集成构成:先判断查询人脸样本是否属于人脸底库,再检索其正确身份。该方法选择最合适的底库对象,并利用集成提升预测性能。我们在著名的LFW和YTF基准上进行了实验,结果表明即使以可扩展性为目标,也能取得有竞争力的性能。

Physics-Informed Deep Learning to Reduce the Bias in Joint Prediction of Nitrogen Oxides

  • paper_url: http://arxiv.org/abs/2308.07441
  • repo_url: None
  • paper_authors: Lianfa Li, Roxana Khalili, Frederick Lurmann, Nathan Pavlovic, Jun Wu, Yan Xu, Yisi Liu, Karl O’Sharkey, Beate Ritz, Luke Oman, Meredith Franklin, Theresa Bastain, Shohreh F. Farzan, Carrie Breton, Rima Habre
  • for: 这个论文主要是为了提高地面氮氧化物(NOx)的预测,以便更好地了解它们对健康和环境的影响。
  • methods: 现有的机器学习(ML)方法缺乏物理和化学知识,可能产生较大的估计偏差。作者们提出了一种物理信息深度学习框架,该框架编码了平流-扩散机制和流体动力学约束,以提高NO2和NOx联合预测的准确性。
  • results: 该框架可将ML模型的估计偏差降低21-42%,提供显式的不确定性评估,并能捕捉NO2和NOx的细尺度传输,生成稳健的空间外推。
    Abstract Atmospheric nitrogen oxides (NOx) primarily from fuel combustion have recognized acute and chronic health and environmental effects. Machine learning (ML) methods have significantly enhanced our capacity to predict NOx concentrations at ground-level with high spatiotemporal resolution but may suffer from high estimation bias since they lack physical and chemical knowledge about air pollution dynamics. Chemical transport models (CTMs) leverage this knowledge; however, accurate predictions of ground-level concentrations typically necessitate extensive post-calibration. Here, we present a physics-informed deep learning framework that encodes advection-diffusion mechanisms and fluid dynamics constraints to jointly predict NO2 and NOx and reduce ML model bias by 21-42%. Our approach captures fine-scale transport of NO2 and NOx, generates robust spatial extrapolation, and provides explicit uncertainty estimation. The framework fuses knowledge-driven physicochemical principles of CTMs with the predictive power of ML for air quality exposure, health, and policy applications. Our approach offers significant improvements over purely data-driven ML methods and has unprecedented bias reduction in joint NO2 and NOx prediction.
    摘要 主要来自燃料燃烧的大气氮氧化物(NOx)已被确认具有急性与慢性的健康和环境影响。机器学习(ML)方法显著提升了我们以高时空分辨率预测地面NOx浓度的能力,但由于缺乏关于空气污染动力学的物理与化学知识,可能存在较大的估计偏差。化学传输模型(CTM)利用了这些知识,但要准确预测地面浓度通常需要大量的后校准。本文提出了一个物理信息深度学习框架,该框架编码了平流-扩散机制和流体动力学约束,用于联合预测NO2与NOx,并将ML模型偏差降低21-42%。我们的方法能够捕捉NO2与NOx的细尺度传输,生成稳健的空间外推,并提供显式的不确定性估计。该框架将CTM的知识驱动物理化学原理与ML的预测能力相融合,服务于空气质量暴露、健康与政策应用。与纯数据驱动的ML方法相比,我们的方法带来了显著改进,并在NO2与NOx联合预测中实现了前所未有的偏差降低。
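
The phrase "encodes advection-diffusion mechanisms" typically means penalizing the PDE residual alongside the data-fit term. The following is a generic physics-informed formulation consistent with the abstract, not the paper's exact loss:

$$ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{c}(x_i,t_i)-c_i\big)^2 \;+\; \lambda\,\frac{1}{M}\sum_{j=1}^{M}\left(\frac{\partial \hat{c}}{\partial t} + \mathbf{u}\cdot\nabla\hat{c} - D\,\nabla^{2}\hat{c} - S\right)^{2}\Bigg|_{(x_j,\,t_j)}, $$

where $\hat{c}$ is the predicted NO2/NOx concentration field, $\mathbf{u}$ the wind (advection) field, $D$ a diffusion coefficient, $S$ a source/emission term, and $\lambda$ trades data fit against physical consistency; the residual is evaluated at collocation points via automatic differentiation.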

Interaction-Aware Personalized Vehicle Trajectory Prediction Using Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07439
  • repo_url: None
  • paper_authors: Amr Abdelraouf, Rohit Gupta, Kyungtae Han
  • for: 预测汽车轨迹的精度是自动驾驶系统和高级驾驶助手系统中的关键。现有方法主要依靠大规模的数据集来生成通用的轨迹预测,忽略了每位司机的个性驾驶模式。
  • methods: 我们提出了一种交互感知的个性化轨迹预测方法,利用时序图神经网络——具体为图卷积网络(GCN)和长短期记忆网络(LSTM)——对目标车辆与其周围交通之间的时空交互进行建模。为实现个性化预测,我们建立了一条基于迁移学习的流水线:模型先在大规模轨迹数据集上预训练,再在每位驾驶员的具体驾驶数据上进行微调。
  • results: 我们的个性化GCN-LSTM模型在较长的预测时间范围内表现出优于其通用版本,并且与没有预training的个体模型相比,具有更高的预测精度。此外,我们的个性化模型还能够避免过拟合现象,强调了大规模数据集的预training对个性化预测的重要性。通过个性化,我们的方法提高了轨迹预测精度。
    Abstract Accurate prediction of vehicle trajectories is vital for advanced driver assistance systems and autonomous vehicles. Existing methods mainly rely on generic trajectory predictions derived from large datasets, overlooking the personalized driving patterns of individual drivers. To address this gap, we propose an approach for interaction-aware personalized vehicle trajectory prediction that incorporates temporal graph neural networks. Our method utilizes Graph Convolution Networks (GCN) and Long Short-Term Memory (LSTM) to model the spatio-temporal interactions between target vehicles and their surrounding traffic. To personalize the predictions, we establish a pipeline that leverages transfer learning: the model is initially pre-trained on a large-scale trajectory dataset and then fine-tuned for each driver using their specific driving data. We employ human-in-the-loop simulation to collect personalized naturalistic driving trajectories and corresponding surrounding vehicle trajectories. Experimental results demonstrate the superior performance of our personalized GCN-LSTM model, particularly for longer prediction horizons, compared to its generic counterpart. Moreover, the personalized model outperforms individual models created without pre-training, emphasizing the significance of pre-training on a large dataset to avoid overfitting. By incorporating personalization, our approach enhances trajectory prediction accuracy.
    摘要 准确的车辆轨迹预测对高级驾驶辅助系统和自动驾驶汽车至关重要。现有方法主要依赖从大规模数据集中得出的通用轨迹预测,忽视了每位驾驶员的个性化驾驶模式。为弥补这一差距,我们提出了一种结合时序图神经网络的交互感知个性化车辆轨迹预测方法,利用图卷积网络(GCN)和长短期记忆网络(LSTM)对目标车辆与周围交通之间的时空交互进行建模。为实现个性化预测,我们建立了一条基于迁移学习的流水线:模型先在大规模轨迹数据集上预训练,再利用每位驾驶员的专属驾驶数据进行微调。我们采用人在环(human-in-the-loop)仿真收集个性化的自然驾驶轨迹及相应的周边车辆轨迹。实验结果表明,个性化的GCN-LSTM模型尤其在较长预测时域上显著优于其通用版本;同时,个性化模型也优于未经预训练而单独建立的个体模型,凸显了在大数据集上预训练以避免过拟合的重要性。通过引入个性化,我们的方法提升了轨迹预测精度。
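
A minimal sketch of the pretrain-then-personalize pipeline the abstract describes. The model class, data loader, and loss are hypothetical stand-ins; only the transfer-learning structure follows the paper.

```python
import copy
import torch

def personalize(pretrained_model, driver_loader, epochs=5, lr=1e-4):
    """Fine-tune a copy of the population-level trajectory model on one
    driver's data. Starting from the pretrained weights (rather than from
    scratch) is what guards the per-driver model against overfitting."""
    model = copy.deepcopy(pretrained_model)      # one personalized copy per driver
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()                 # e.g., displacement-error regression loss
    model.train()
    for _ in range(epochs):
        for history, neighbors, future in driver_loader:
            pred = model(history, neighbors)     # GCN-LSTM forward pass (assumed signature)
            loss = loss_fn(pred, future)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```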

A Hybrid Deep Spatio-Temporal Attention-Based Model for Parkinson’s Disease Diagnosis Using Resting State EEG Signals

  • paper_url: http://arxiv.org/abs/2308.07436
  • repo_url: None
  • paper_authors: Niloufar Delfan, Mohammadreza Shahsavari, Sadiq Hussain, Robertas Damaševičius, U. Rajendra Acharya
  • for: 这个研究的目的是为了开发一个自动化的 Parkinson’s disease 诊断模型,使用休息状态 EEG 信号。
  • methods: 该模型采用混合结构,包括卷积神经网络(CNN)、双向门控循环单元(Bi-GRU)和注意力机制。
  • results: 研究结果显示,提案的模型可以高度准确地诊断 Parkinson’s disease,并且在不同测试数据上也能够获得高性能。此外,模型还能够对部分输入信息的损失具有耐性。
    Abstract Parkinson's disease (PD), a severe and progressive neurological illness, affects millions of individuals worldwide. For effective treatment and management of PD, an accurate and early diagnosis is crucial. This study presents a deep learning-based model for the diagnosis of PD using resting state electroencephalogram (EEG) signal. The objective of the study is to develop an automated model that can extract complex hidden nonlinear features from EEG and demonstrate its generalizability on unseen data. The model is designed using a hybrid model, consists of convolutional neural network (CNN), bidirectional gated recurrent unit (Bi-GRU), and attention mechanism. The proposed method is evaluated on three public datasets (Uc San Diego Dataset, PRED-CT, and University of Iowa (UI) dataset), with one dataset used for training and the other two for evaluation. The results show that the proposed model can accurately diagnose PD with high performance on both the training and hold-out datasets. The model also performs well even when some part of the input information is missing. The results of this work have significant implications for patient treatment and for ongoing investigations into the early detection of Parkinson's disease. The suggested model holds promise as a non-invasive and reliable technique for PD early detection utilizing resting state EEG.
    摘要 帕金森病(PD)是一种严重且进行性的神经系统疾病,影响着全球数百万人。为了有效治疗和管理PD,准确的早期诊断至关重要。本研究提出了一种基于深度学习的PD诊断模型,使用静息态脑电图(EEG)信号。研究目标是开发一个能从EEG中提取复杂隐含非线性特征、并在未见数据上展现泛化能力的自动化模型。该模型采用混合结构,由卷积神经网络(CNN)、双向门控循环单元(Bi-GRU)和注意力机制组成。所提方法在三个公开数据集(UC San Diego数据集、PRED-CT和University of Iowa(UI)数据集)上进行评估,其中一个数据集用于训练,另外两个用于评估。结果表明,所提模型能够准确诊断PD,在训练集和保留集上均表现出色;即使部分输入信息缺失,模型仍能保持良好性能。本研究结果对患者治疗及帕金森病早期检测的后续研究具有重要意义。所提模型有望成为一种利用静息态EEG进行PD早期检测的无创且可靠的技术。
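
A minimal sketch of a hybrid CNN + Bi-GRU + attention classifier of the kind the abstract describes. All layer sizes, kernel widths, and the additive attention pooling are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CNNBiGRUAttention(nn.Module):
    """1D convolutions over time, a bidirectional GRU, and a simple
    additive attention pooling over time steps."""
    def __init__(self, n_channels=32, n_classes=2, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.bigru = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)      # scores each time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, channels, time)
        h = self.cnn(x).transpose(1, 2)            # (batch, time', 64)
        h, _ = self.bigru(h)                       # (batch, time', 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over time
        ctx = (w * h).sum(dim=1)                   # attention-pooled summary
        return self.head(ctx)

model = CNNBiGRUAttention()
logits = model(torch.randn(4, 32, 512))            # 4 EEG windows, 32 channels
```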

Addressing Distribution Shift in RTB Markets via Exponential Tilting

  • paper_url: http://arxiv.org/abs/2308.07424
  • repo_url: None
  • paper_authors: Minji Kim, Seong Jin Lee, Bumsik Kim
  • for: This paper aims to address the issue of distribution shift in machine learning models, specifically in the context of Real-Time Bidding (RTB) market models.
  • methods: The proposed method is called Exponential Tilt Reweighting Alignment (ExTRA), which uses importance weights to minimize the KL divergence between the weighted source and target datasets. The ExTRA method can operate using labeled source data and unlabeled target data.
  • results: The paper evaluates the effectiveness of the ExTRA method through simulated real-world data, demonstrating its ability to address distribution shift and improve the performance of machine learning models.
    Abstract Distribution shift in machine learning models can be a primary cause of performance degradation. This paper delves into the characteristics of these shifts, primarily motivated by Real-Time Bidding (RTB) market models. We emphasize the challenges posed by class imbalance and sample selection bias, both potent instigators of distribution shifts. This paper introduces the Exponential Tilt Reweighting Alignment (ExTRA) algorithm, as proposed by Marty et al. (2023), to address distribution shifts in data. The ExTRA method is designed to determine the importance weights on the source data, aiming to minimize the KL divergence between the weighted source and target datasets. A notable advantage of this method is its ability to operate using labeled source data and unlabeled target data. Through simulated real-world data, we investigate the nature of distribution shift and evaluate the applicacy of the proposed model.
    摘要 机器学习模型中的分布偏移可能是性能退化的主要原因。本文主要以实时竞价(RTB)市场模型为背景,探讨这类偏移的特征。我们强调类别不平衡与样本选择偏差所带来的挑战,二者都是分布偏移的有力诱因。本文介绍了Marty等人(2023)提出的指数倾斜重加权对齐(ExTRA)算法,用于应对数据中的分布偏移。ExTRA方法旨在确定源数据上的重要性权重,以最小化加权后源数据集与目标数据集之间的KL散度。该方法的一个显著优点是,只需有标签的源数据和无标签的目标数据即可运行。我们通过模拟真实世界数据,考察了分布偏移的本质并评估了所提模型的适用性。
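
A sketch of exponential tilting in its simplest moment-matching form: weights w(x) ∝ exp(θ·T(x)) on labeled source data are fit so that the weighted source mean of T(x) matches the unlabeled target mean. The ExTRA paper's actual objective is the KL divergence between weighted source and target; the convex dual below is a standard simplification, not the paper's exact algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def fit_tilt(source_feats, target_feats):
    """Fit exponential-tilt importance weights using only source features
    (labels not needed for the weights) and unlabeled target features."""
    target_mean = target_feats.mean(axis=0)

    def objective(theta):
        logw = source_feats @ theta
        # log-partition of the tilted source distribution (log-mean-exp)
        logZ = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
        # convex dual: minimized where weighted source moments match target
        return logZ - theta @ target_mean

    theta = minimize(objective, np.zeros(source_feats.shape[1])).x
    w = np.exp(source_feats @ theta)
    return w / w.mean()              # normalized importance weights for training

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(2000, 3))
tgt = rng.normal(0.5, 1.0, size=(2000, 3))   # shifted target domain
weights = fit_tilt(src, tgt)                 # upweights source points near target
```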

U-Turn Diffusion

  • paper_url: http://arxiv.org/abs/2308.07421
  • repo_url: None
  • paper_authors: Hamidreza Behjoo, Michael Chertkov
  • for: 这类扩散模型用于生成合成图像。
  • methods: 这些模型基于由随机微分方程驱动的动态辅助时间机制,其中得分函数从输入图像中习得。
  • results: 我们的研究给出了评估扩散模型效率的一个准则:生成过程在反向/去噪阶段分解快速关联的能力直接影响生成图像质量。此外,我们还提出了"U-Turn扩散"技术,通过组合前向、U-turn和反向过程,生成近似独立同分布(i.i.d.)的样本。
    Abstract We present a comprehensive examination of score-based diffusion models of AI for generating synthetic images. These models hinge upon a dynamic auxiliary time mechanism driven by stochastic differential equations, wherein the score function is acquired from input images. Our investigation unveils a criterion for evaluating efficiency of the score-based diffusion models: the power of the generative process depends on the ability to de-construct fast correlations during the reverse/de-noising phase. To improve the quality of the produced synthetic images, we introduce an approach coined "U-Turn Diffusion". The U-Turn Diffusion technique starts with the standard forward diffusion process, albeit with a condensed duration compared to conventional settings. Subsequently, we execute the standard reverse dynamics, initialized with the concluding configuration from the forward process. This U-Turn Diffusion procedure, combining forward, U-turn, and reverse processes, creates a synthetic image approximating an independent and identically distributed (i.i.d.) sample from the probability distribution implicitly described via input samples. To analyze relevant time scales we employ various analytical tools, including auto-correlation analysis, weighted norm of the score-function analysis, and Kolmogorov-Smirnov Gaussianity test. The tools guide us to establishing that the Kernel Intersection Distance, a metric comparing the quality of synthetic samples with real data samples, is minimized at the optimal U-turn time.
    摘要 我们对用于生成合成图像的基于得分的扩散模型进行了全面考察。这类模型依赖由随机微分方程驱动的动态辅助时间机制,其中得分函数从输入图像中习得。我们的研究揭示了评估基于得分的扩散模型效率的一个准则:生成过程的能力取决于在反向/去噪阶段分解快速关联的能力。为提高生成合成图像的质量,我们提出了一种称为"U-Turn扩散"的方法。U-Turn扩散从标准的前向扩散过程开始,但其时长较常规设置大为缩短;随后,我们以前向过程的最终构型为初始化,执行标准的反向动力学。这一结合前向、U-turn与反向过程的流程,所生成的合成图像近似于从输入样本隐式描述的概率分布中抽取的独立同分布(i.i.d.)样本。为分析相关的时间尺度,我们采用了多种分析工具,包括自相关分析、得分函数加权范数分析以及Kolmogorov-Smirnov高斯性检验。这些工具引导我们确认:比较合成样本与真实数据样本质量的核交叉距离(Kernel Intersection Distance)指标在最优U-turn时刻取得最小值。
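
A structural sketch of the U-Turn procedure described above. The single-step integrators `forward_step` and `reverse_step` are assumed callables (one step of the forward SDE and of the score-based reverse SDE, respectively); only the control flow follows the abstract.

```python
import torch

@torch.no_grad()
def u_turn_sample(score_model, x0, T_uturn, forward_step, reverse_step):
    """Run the forward (noising) dynamics for a shortened horizon T_uturn
    starting from a real image x0, then run the standard reverse
    (denoising) dynamics initialized at the turning point."""
    x = x0
    for t in range(T_uturn):                 # forward phase, shorter than usual
        x = forward_step(x, t)
    for t in reversed(range(T_uturn)):       # reverse phase from the U-turn point
        x = reverse_step(x, t, score_model)
    return x                                  # approximately an i.i.d. sample
```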

Locally Adaptive and Differentiable Regression

  • paper_url: http://arxiv.org/abs/2308.07418
  • repo_url: None
  • paper_authors: Mingxuan Han, Varun Shankar, Jeff M Phillips, Chenglong Ye
  • for: 提出了一种基于本地学习模型的全球连续可导模型框架,以寻求处理数据中存在不同密度或函数值规模的问题。
  • methods: 使用权重加权平均方法将本地学习模型在相应的地方进行连续拟合,以实现全球连续可导模型。
  • results: 理论上,该模型可实现更快的统计收敛,并在各类实际应用中带来性能提升。
    Abstract Over-parameterized models like deep nets and random forests have become very popular in machine learning. However, the natural goals of continuity and differentiability, common in regression models, are now often ignored in modern overparametrized, locally-adaptive models. We propose a general framework to construct a global continuous and differentiable model based on a weighted average of locally learned models in corresponding local regions. This model is competitive in dealing with data with different densities or scales of function values in different local regions. We demonstrate that when we mix kernel ridge and polynomial regression terms in the local models, and stitch them together continuously, we achieve faster statistical convergence in theory and improved performance in various practical settings.
    摘要 深度网络和随机森林等过参数化模型在机器学习中已非常流行。然而,回归模型中常见的连续性与可微性这两个自然目标,在现代过参数化、局部自适应的模型中往往被忽视。我们提出了一个通用框架,基于对各局部区域内局部学习模型的加权平均,构建全局连续且可微的模型。该模型在处理不同局部区域中密度或函数值尺度各异的数据时具有竞争力。我们证明,当在局部模型中混合核岭回归与多项式回归项并将其连续拼接时,在理论上可获得更快的统计收敛,并在多种实际场景中取得更好的表现。
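
A minimal sketch of the idea: fit one local learner per region and blend the local predictions with smooth weights, yielding a globally continuous, differentiable predictor. Mixing kernel-ridge and polynomial terms follows the abstract; the Gaussian partition-of-unity weights and all hyperparameters are assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

def fit_local_blend(X, y, centers, bandwidth=0.5):
    """Fit per-region (kernel ridge + polynomial) models with locality
    weights, then blend them smoothly at prediction time."""
    poly = PolynomialFeatures(degree=2).fit(X)
    local_models = []
    for c in centers:
        w = np.exp(-np.sum((X - c) ** 2, axis=1) / (2 * bandwidth ** 2))
        kr = KernelRidge(kernel="rbf", alpha=1e-2).fit(X, y, sample_weight=w)
        pr = Ridge(alpha=1e-2).fit(poly.transform(X), y, sample_weight=w)
        local_models.append((c, kr, pr))

    def predict(Xq):
        num, den = np.zeros(len(Xq)), np.zeros(len(Xq))
        for c, kr, pr in local_models:
            w = np.exp(-np.sum((Xq - c) ** 2, axis=1) / (2 * bandwidth ** 2))
            local = 0.5 * kr.predict(Xq) + 0.5 * pr.predict(poly.transform(Xq))
            num += w * local
            den += w
        return num / den          # smooth partition-of-unity blend

    return predict

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
predict = fit_local_blend(X, y, centers=np.linspace(-2, 2, 5)[:, None])
y_hat = predict(X)                # continuous everywhere, including region seams
```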

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

  • paper_url: http://arxiv.org/abs/2308.07395
  • repo_url: None
  • paper_authors: Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath
  • for: 提升辅助任务(非ASR任务)的性能
  • methods: 采用联合端到端与内部语言模型训练(JEIT)作为文本注入算法训练ASR模型,使其同时完成两项辅助任务
  • results: 提升了长尾数据上的大写恢复性能,并提高了话轮转换检测的召回率
    Abstract Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.
    摘要 自动语音识别(ASR)中的文本注入,即利用无配对的纯文本数据补充配对的音频-文本数据,已在词错误率上显示出可观的改进。本研究考察了将文本注入用于辅助任务,即端到端(E2E)模型常常承担的非ASR任务。在这项工作中,我们采用联合端到端与内部语言模型训练(JEIT)作为文本注入算法,训练一个同时执行两项辅助任务的ASR模型:其一是大写恢复,属于一种反规范化任务;其二是话轮转换预测,即在数字助理交互中判断用户是否已完成其对话轮次。结果表明,我们的文本注入方法提升了长尾数据上的大写恢复性能,并改善了话轮转换检测的召回率。

DISBELIEVE: Distance Between Client Models is Very Essential for Effective Local Model Poisoning Attacks

  • paper_url: http://arxiv.org/abs/2308.07387
  • repo_url: None
  • paper_authors: Indu Joshi, Priyank Upadhya, Gaurav Kumar Nayak, Peter Schüffler, Nassir Navab
  • for: 该研究旨在探讨 Federated Learning 如何解决医疗数据隐私问题,并研究如何防止恶意客户端攻击 Federated 系统。
  • methods: 该研究提出了一种新的本地模型投毒攻击(DISBELIEVE),该攻击能够在稳健聚合方法下构造与良性客户端距离很小的恶意参数或梯度,从而降低全局模型的性能。
  • results: 实验结果表明,DISBELIEVE 攻击可以在三个公共可用的医疗图像集上显著降低 Robust Aggregation 方法的性能,并且在自然图像集上也有较好的效果。
    Abstract Federated learning is a promising direction to tackle the privacy issues related to sharing patients' sensitive data. Often, federated systems in the medical image analysis domain assume that the participating local clients are \textit{honest}. Several studies report mechanisms through which a set of malicious clients can be introduced that can poison the federated setup, hampering the performance of the global model. To overcome this, robust aggregation methods have been proposed that defend against those attacks. We observe that most of the state-of-the-art robust aggregation methods are heavily dependent on the distance between the parameters or gradients of malicious clients and benign clients, which makes them prone to local model poisoning attacks when the parameters or gradients of malicious and benign clients are close. Leveraging this, we introduce DISBELIEVE, a local model poisoning attack that creates malicious parameters or gradients such that their distance to benign clients' parameters or gradients is low respectively but at the same time their adverse effect on the global model's performance is high. Experiments on three publicly available medical image datasets demonstrate the efficacy of the proposed DISBELIEVE attack as it significantly lowers the performance of the state-of-the-art \textit{robust aggregation} methods for medical image analysis. Furthermore, compared to state-of-the-art local model poisoning attacks, DISBELIEVE attack is also effective on natural images where we observe a severe drop in classification performance of the global model for multi-class classification on benchmark dataset CIFAR-10.
    摘要 联邦学习是解决共享患者敏感数据所涉隐私问题的一个有前景的方向。在医学影像分析领域,联邦系统通常假设参与的本地客户端是诚实的。然而,多项研究表明,可以引入一组恶意客户端来毒化联邦系统,损害全局模型的性能。为应对这一问题,研究者提出了能够抵御此类攻击的稳健聚合方法。我们观察到,大多数最先进的稳健聚合方法严重依赖恶意客户端与良性客户端之间参数或梯度的距离,因此当二者的参数或梯度相近时,这些方法容易受到本地模型投毒攻击。基于这一点,我们提出了DISBELIEVE——一种本地模型投毒攻击,它构造的恶意参数或梯度与良性客户端的参数或梯度距离很小,但对全局模型性能的不利影响却很大。在三个公开医学影像数据集上的实验证明了DISBELIEVE攻击的有效性:它显著降低了医学影像分析中最先进稳健聚合方法的性能。此外,与最先进的本地模型投毒攻击相比,DISBELIEVE在自然图像上同样有效:在基准数据集CIFAR-10的多类分类中,全局模型的分类性能出现了严重下降。

DiffSED: Sound Event Detection with Denoising Diffusion

  • paper_url: http://arxiv.org/abs/2308.07293
  • repo_url: None
  • paper_authors: Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
  • for: 这个论文的目标是提出一种基于生成学习的声音时间边界检测方法,以提高声音事件检测的精度和效率。
  • methods: 该方法在去噪扩散过程中以目标音频为条件,在Transformer解码器框架内将含噪提议逐步修复为高质量的事件时间边界。
  • results: 在Urban-SED和EPIC-Sounds数据集上,该方法性能优于现有方法,且训练收敛速度快40%以上。
    Abstract Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the splitand-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions in the elegant Transformer decoder framework. Doing so enables the model generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training.
    摘要 声音事件检测(SED)的目标是在不受约束的音频样本中,预测所有目标事件的时间边界及其类别标签。无论采用切分-分类(即帧级)策略,还是更具原则性的事件级建模方法,现有方法都从判别学习的视角看待SED问题。在这项工作中,我们从生成学习的视角重新表述SED问题:具体而言,我们以目标音频样本为条件,在去噪扩散过程中由含噪提议生成声音时间边界。在训练中,我们的模型在优雅的Transformer解码器框架中学习逆转加噪过程,将含噪的潜在查询转换为其真值版本;这使得模型在推理时即便面对含噪查询也能生成准确的事件边界。在Urban-SED和EPIC-Sounds数据集上的大量实验表明,我们的模型显著优于现有替代方案,且训练收敛速度快40%以上。

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

  • paper_url: http://arxiv.org/abs/2308.07286
  • repo_url: None
  • paper_authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
  • for: 提高机器翻译系统的质量
  • methods: 利用大语言模型的推理与上下文学习能力,提示模型识别并归类翻译中的错误
  • results: 与仅提示打分相比,AutoMQM可提升模型性能(对更大的模型增益尤为显著),并通过与人工标注对齐的错误区间提供可解释性
    Abstract Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
    摘要 机器翻译(MT)的自动评估是推动MT系统快速迭代开发的关键工具。尽管在估计单一标量质量分数方面已取得长足进展,但现有指标缺乏更精细方案(如多维质量指标MQM)中逐个错误标注所带来的信息量。在本文中,我们提出AutoMQM来帮助填补这一空白:这是一种提示技术,利用大语言模型(LLM)的推理与上下文学习能力,要求其识别并归类翻译中的错误。我们首先通过简单的分数预测提示评估了PaLM和PaLM-2等近期LLM,并研究了带标注数据经由上下文学习和微调所带来的影响。随后我们用PaLM-2模型评估AutoMQM,发现相较于仅提示打分,它能提升性能(对更大的模型增益尤为显著),同时通过与人工标注对齐的错误区间提供了可解释性。
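
An illustrative MQM-style error-annotation prompt in the spirit of AutoMQM. The exact wording, few-shot exemplars, and scoring scheme used in the paper are not reproduced here; this only shows the shape of the task given to the LLM.

```python
AUTOMQM_PROMPT = """You are an expert translator reviewing a translation.
Identify all translation errors. For each error, report:
- the error span (quoted from the translation),
- its category (accuracy, fluency, terminology, style, ...),
- its severity (major or minor).
If there are no errors, answer "No errors".

Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}
Errors:"""

prompt = AUTOMQM_PROMPT.format(
    src_lang="German", tgt_lang="English",
    source="Der Hund jagt die Katze.",
    translation="The dog chases the mouse.",
)
# The LLM's answer is parsed into (span, category, severity) tuples;
# severity-weighted MQM scoring of the spans then yields a scalar score.
```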

Cross-Attribute Matrix Factorization Model with Shared User Embedding

  • paper_url: http://arxiv.org/abs/2308.07284
  • repo_url: None
  • paper_authors: Wen Liang, Zeng Fan, Youzhi Liang, Jianguo Jia
  • for: 提高推荐系统的准确率和稳定性,特别是对于“长尾”用户和 Item。
  • methods: 使用神经网络抽象来捕捉用户-项目交互,同时考虑用户和项目的特性和属性,以解决冷启始问题。
  • results: 对于 MovieLens 和 Pinterest 数据集,我们的 Cross-Attribute Matrix Factorization 模型在 sparse 数据场景下显示出优于常见方法的性能。
    Abstract Over the past few years, deep learning has firmly established its prowess across various domains, including computer vision, speech recognition, and natural language processing. Motivated by its outstanding success, researchers have been directing their efforts towards applying deep learning techniques to recommender systems. Neural collaborative filtering (NCF) and Neural Matrix Factorization (NeuMF) refreshes the traditional inner product in matrix factorization with a neural architecture capable of learning complex and data-driven functions. While these models effectively capture user-item interactions, they overlook the specific attributes of both users and items. This can lead to robustness issues, especially for items and users that belong to the "long tail". Such challenges are commonly recognized in recommender systems as a part of the cold-start problem. A direct and intuitive approach to address this issue is by leveraging the features and attributes of the items and users themselves. In this paper, we introduce a refined NeuMF model that considers not only the interaction between users and items, but also acrossing associated attributes. Moreover, our proposed architecture features a shared user embedding, seamlessly integrating with user embeddings to imporve the robustness and effectively address the cold-start problem. Rigorous experiments on both the Movielens and Pinterest datasets demonstrate the superiority of our Cross-Attribute Matrix Factorization model, particularly in scenarios characterized by higher dataset sparsity.
    摘要 过去几年,深度学习已在计算机视觉、语音识别和自然语言处理等多个领域确立了其卓越地位。受这一成功的激励,研究者们开始将深度学习技术应用于推荐系统。神经协同过滤(NCF)与神经矩阵分解(NeuMF)以能够学习复杂数据驱动函数的神经架构,革新了矩阵分解中传统的内积运算。尽管这些模型能够有效捕捉用户-物品交互,它们却忽视了用户和物品各自的具体属性,这可能导致稳健性问题,尤其是对于属于"长尾"的物品和用户。此类挑战在推荐系统中通常被视为冷启动问题的一部分。解决这一问题的一个直接而直观的途径,是利用物品和用户自身的特征与属性。本文提出了一种改进的NeuMF模型,它不仅考虑用户与物品之间的交互,还考虑相关属性之间的交叉;此外,所提架构还包含共享的用户嵌入,与用户嵌入无缝整合,以提升稳健性并有效应对冷启动问题。在MovieLens和Pinterest数据集上的严格实验表明,我们的交叉属性矩阵分解模型尤其在数据集稀疏度较高的场景下表现出色。
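
A minimal sketch of a factorization model with attribute embeddings and a shared user embedding, in the spirit of the abstract. The embedding dimensions, cross terms, and MLP fusion are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossAttributeMF(nn.Module):
    """Matrix factorization that also embeds user/item attributes; the same
    user embedding feeds both the interaction and the attribute pathways."""
    def __init__(self, n_users, n_items, n_user_attrs, n_item_attrs, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)          # shared user embedding
        self.item_emb = nn.Embedding(n_items, dim)
        self.user_attr_emb = nn.Embedding(n_user_attrs, dim)
        self.item_attr_emb = nn.Embedding(n_item_attrs, dim)
        self.mlp = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, u, i, u_attr, i_attr):
        pu, qi = self.user_emb(u), self.item_emb(i)
        au, ai = self.user_attr_emb(u_attr), self.item_attr_emb(i_attr)
        # user-item interaction plus cross-attribute interaction terms
        x = torch.cat([pu, qi, pu * ai, au * qi], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)

model = CrossAttributeMF(1000, 500, 10, 20)
score = model(torch.tensor([3]), torch.tensor([7]),
              torch.tensor([1]), torch.tensor([4]))   # predicted preference
```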

Data-Efficient Energy-Aware Participant Selection for UAV-Enabled Federated Learning

  • paper_url: http://arxiv.org/abs/2308.07273
  • repo_url: None
  • paper_authors: Youssra Cheriguene, Wael Jaafar, Chaker Abdelaziz Kerrache, Halim Yanikomeroglu, Fatima Zohra Bousbaa, Nasreddine Lagraa
  • for: 本研究旨在提高边缘 federated learning(FL)模型的准确性,通过选择合适的无人机参与者,并且考虑无人机的能源消耗、通信质量和本地数据的不同性。
  • methods: 本研究提出了一种新的无人机参与者选择策略,即基于数据效率和能源占用率的能源意识参与者选择策略(DEEPS),该策略通过选择每个子区域中最佳的FL参与者,基于本地数据的结构相似度指数平均分数和能源占用资料来实现。
  • results: 通过实验,本研究表明,对于边缘FL,使用DEEPS策略可以提高模型准确性、减少训练时间和无人机的能源消耗,相比于随机选择策略。
    Abstract Unmanned aerial vehicle (UAV)-enabled edge federated learning (FL) has sparked a rise in research interest as a result of the massive and heterogeneous data collected by UAVs, as well as the privacy concerns related to UAV data transmissions to edge servers. However, due to the redundancy of UAV collected data, e.g., imaging data, and non-rigorous FL participant selection, the convergence time of the FL learning process and bias of the FL model may increase. Consequently, we investigate in this paper the problem of selecting UAV participants for edge FL, aiming to improve the FL model's accuracy, under UAV constraints of energy consumption, communication quality, and local datasets' heterogeneity. We propose a novel UAV participant selection scheme, called data-efficient energy-aware participant selection strategy (DEEPS), which consists of selecting the best FL participant in each sub-region based on the structural similarity index measure (SSIM) average score of its local dataset and its power consumption profile. Through experiments, we demonstrate that the proposed selection scheme is superior to the benchmark random selection method, in terms of model accuracy, training time, and UAV energy consumption.
    摘要 “无人航空器(UAV)启动的边缘联合学习(FL)已经引起了研究者们的探索,因为UAV所收集的数据量巨大且多样,同时也存在资料传输到边缘服务器的隐私问题。然而,由于UAV收集的数据存在重复性,例如影像数据,以及不充分的FL参与者选择,FL学习过程的参数调整和模型偏好可能会增加。因此,本文研究UAV参与者选择的问题,以提高FL模型的准确性,并且遵循UAV的能源消耗、通信质量和本地数据的多样性限制。我们提出了一个 novel UAV参与者选择策略,called 数据效率能源注意的参与者选择策略(DEEPS),它是根据每个子区域中的本地数据和能源消耗观察所得到的结构相似度平均分数(SSIM)的平均分数,选择每个子区域中最佳的 FL 参与者。经过实验,我们发现,提案的选择策略与参考随机选择方法相比,在于模型准确性、训练时间和UAV能源消耗方面均有优势。”

Dialogue for Prompting: a Policy-Gradient-Based Discrete Prompt Optimization for Few-shot Learning

  • paper_url: http://arxiv.org/abs/2308.07272
  • repo_url: None
  • paper_authors: Chengzhengxu Li, Xiaoming Liu, Yichen Wang, Duyi Li, Yu Lan, Chao Shen
  • for: 提升自然语言处理(NLP)任务中的少样本学习效果,并解决现有离散提示优化方法的问题。
  • methods: 使用对话对齐策略生成可读性良好的提示集,并提出高效的提示筛选指标来甄别高质量提示;随后通过策略梯度强化学习算法将提示与输入匹配。
  • results: 在少样本设定下,DP_2O方法在四个开源数据集上的准确率平均超过当前最佳方法1.52%,并在不同任务和数据集上表现出良好的通用性、稳健性和泛化能力。
    Abstract Prompt-based pre-trained language models (PLMs) paradigm have succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost, and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization ($DP_2O$) method. We first design a multi-round dialogue alignment strategy for readability prompt set generation based on GPT-4. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.67% of the PLM parameter size on the tasks in the few-shot setting, $DP_2O$ outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that $DP_2O$ has good universality, robustness, and generalization ability.
    摘要 基于提示的预训练语言模型(PLM)范式在少样本自然语言处理(NLP)任务中取得了巨大成功。然而,已有的离散提示优化方法需要专家知识来设计基础提示集并甄别高质量提示,成本高、效率低且带有主观性;而现有的连续提示优化方法虽然通过PLM的梯度信息学习理想提示来提升性能,但其计算代价高、可读性与泛化性差,常令人担忧。为填补这一研究空白,我们提出了一种包含对话的基于策略梯度的离散提示优化方法(DP_2O)。我们首先基于GPT-4设计了多轮对话对齐策略,用于生成可读性良好的提示集;进而提出了一种线性复杂度的高效提示筛选指标,用于甄别高质量提示;最后,我们构建了基于策略梯度的强化学习(RL)框架,将提示与输入进行最优匹配。在少样本设定下,DP_2O仅用相当于PLM参数量0.67%的策略网络进行训练,就在四个开源数据集上平均准确率超过最先进(SOTA)方法1.52%。此外,后续实验还表明DP_2O具有良好的通用性、鲁棒性和泛化能力。

EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07269
  • repo_url: https://github.com/zjunlp/easyedit
  • paper_authors: Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, Huajun Chen
  • for: 这个论文的目的是提出一个轻松使用的知识编辑框架,以便在大语言模型(LLMs)上应用多种 cutting-edge 知识编辑方法。
  • methods: 该框架集成并支持多种前沿知识编辑方法,可直接应用于T5、GPT-J、LlaMA等多种知名大语言模型。
  • results: 该论文的实验结果表明,使用知识编辑方法可以超过传统的精度调整,并且具有更好的一致性和普适性。
    Abstract Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to the outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners to apply knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily apply to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video at http://knowlm.zjukg.cn/easyedit.mp4.
    摘要 大语言模型(LLM)通常存在知识截止或谬误问题,即由于数据过时或含噪,它们不了解未见过的事件,或生成包含错误事实的文本。为此,涌现出许多面向LLM的知识编辑方法,旨在以微妙方式注入/编辑更新后的知识或调整不良行为,同时尽量减小对无关输入的影响。然而,由于各类知识编辑方法之间差异显著、任务设定各不相同,社区中尚无可用的标准实现框架,这阻碍了从业者将知识编辑应用到实际场景中。为解决这些问题,我们提出了EasyEdit,一个面向LLM的易用知识编辑框架。它支持多种前沿知识编辑方法,并可直接应用于T5、GPT-J、LlaMA等众多知名LLM。我们在LlaMA-2上用EasyEdit报告了知识编辑结果,实验表明知识编辑在可靠性与泛化性方面超越了传统微调。我们已在GitHub(https://github.com/zjunlp/EasyEdit)上发布了源代码,并附有Google Colab教程和面向初学者的完整文档。此外,我们还提供了一个实时知识编辑的在线系统,以及演示视频:http://knowlm.zjukg.cn/easyedit.mp4。

LCE: An Augmented Combination of Bagging and Boosting in Python

  • paper_url: http://arxiv.org/abs/2308.07250
  • repo_url: https://github.com/localcascadeensemble/lce
  • paper_authors: Kevin Fauvel, Élisa Fromont, Véronique Masson, Philippe Faverdin, Alexandre Termier
  • for: 本研究开发了一个高性能、可扩展、易用的Python包lcensemble,用于分类和回归这类通用任务。
  • methods: 本研究采用局部级联集成(LCE)机器学习方法,将随机森林和XGBoost这两种最先进方法的优势相融合,以获得泛化能力更强的预测器。
  • results: lcensemble可以与scikit-learn集成,并且可以与scikit-learn的管道和模型选择工具互动。它在处理大规模数据时表现出了高性能。
    Abstract lcensemble is a high-performing, scalable and user-friendly Python package for the general tasks of classification and regression. The package implements Local Cascade Ensemble (LCE), a machine learning method that further enhances the prediction performance of the current state-of-the-art methods Random Forest and XGBoost. LCE combines their strengths and adopts a complementary diversification approach to obtain a better generalizing predictor. The package is compatible with scikit-learn, therefore it can interact with scikit-learn pipelines and model selection tools. It is distributed under the Apache 2.0 license, and its source code is available at https://github.com/LocalCascadeEnsemble/LCE.
    摘要 lcensemble是一个高性能、可扩展且易用的Python包,用于分类和回归这类通用任务。该包实现了局部级联集成(LCE)机器学习方法,它进一步提升了当前最先进方法随机森林和XGBoost的预测性能:LCE结合了二者的优势,并采用互补的多样化方法来获得泛化能力更强的预测器。该包与scikit-learn兼容,因此可以与scikit-learn的流水线和模型选择工具交互。它以Apache 2.0许可证分发,源代码可在 https://github.com/LocalCascadeEnsemble/LCE 获取。

Can we Agree? On the Rashōmon Effect and the Reliability of Post-Hoc Explainable AI

  • paper_url: http://arxiv.org/abs/2308.07247
  • repo_url: None
  • paper_authors: Clement Poiret, Antoine Grigis, Justin Thomas, Marion Noulhiane
  • for: 这项研究探讨了使用SHAP在Rashomon集中获得可靠知识的挑战。
  • methods: 研究使用5个公共数据集进行实验,发现采样大小的增加可以提高模型的解释的一致性。但在少量采样下(<128个样本),解释具有高度的变化性,因此不可靠地抽取知识。然而,随着更多的数据,模型之间的一致性提高,允许达成共识。bagging ensemble通常具有更高的一致性。
  • results: 研究结果表明,要在少量采样下(<128个样本)进行验证,以确保结论的可靠性。此外,对于不同的模型类型、数据领域和解释方法,进一步的研究是必要的。测试神经网络和特定解释方法的收敛性也是有价值的。本研究的方法指向了可靠地从模糊模型中提取知识的原则方法。
    Abstract The Rash\=omon effect poses challenges for deriving reliable knowledge from machine learning models. This study examined the influence of sample size on explanations from models in a Rash\=omon set using SHAP. Experiments on 5 public datasets showed that explanations gradually converged as the sample size increased. Explanations from <128 samples exhibited high variability, limiting reliable knowledge extraction. However, agreement between models improved with more data, allowing for consensus. Bagging ensembles often had higher agreement. The results provide guidance on sufficient data to trust explanations. Variability at low samples suggests that conclusions may be unreliable without validation. Further work is needed with more model types, data domains, and explanation methods. Testing convergence in neural networks and with model-specific explanation methods would be impactful. The approaches explored here point towards principled techniques for eliciting knowledge from ambiguous models.
    摘要 罗生门(Rashomon)效应给从机器学习模型中提取可靠知识带来了挑战。本研究利用SHAP考察了样本量对Rashomon集中模型解释的影响。在5个公开数据集上的实验表明,随着样本量增加,解释逐渐收敛;而来自少于128个样本的解释变异性很高,限制了可靠的知识提取。不过,随着数据增多,模型之间的一致性得到改善,从而可以达成共识;bagging集成往往具有更高的一致性。这些结果为"需要多少数据才能信任解释"提供了指引:小样本下的高变异性表明,未经验证所得的结论可能并不可靠。后续工作需要覆盖更多的模型类型、数据领域和解释方法;检验神经网络中以及特定于模型的解释方法下的收敛性将很有价值。本文所探索的途径指向了从歧义模型中提取知识的有原则的技术。
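
A sketch of the experimental protocol the abstract describes: train Rashomon-set members on subsamples of increasing size, compute each model's global SHAP importance profile, and measure how well the profiles agree. The dataset and models here are synthetic stand-ins, not the paper's benchmarks.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=4096, n_features=10, random_state=0)
rng = np.random.default_rng(0)

def mean_abs_shap(n, seed):
    """Train one model on an n-sample draw and return its global
    feature-importance profile (mean |SHAP| per feature)."""
    idx = rng.choice(len(X), size=n, replace=False)
    model = RandomForestClassifier(random_state=seed).fit(X[idx], y[idx])
    sv = shap.TreeExplainer(model).shap_values(X[:256])
    if isinstance(sv, list):      # older shap: one array per class
        sv = sv[1]
    elif sv.ndim == 3:            # newer shap: (samples, features, classes)
        sv = sv[..., 1]
    return np.abs(sv).mean(axis=0)

# Agreement between two Rashomon-set members at each sample size.
for n in (64, 128, 512, 2048):
    a, b = mean_abs_shap(n, 1), mean_abs_shap(n, 2)
    print(n, np.corrcoef(a, b)[0, 1])   # tends to rise toward 1 with more data
```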

A Unifying Generator Loss Function for Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2308.07233
  • repo_url: None
  • paper_authors: Justin Veiner, Fady Alajaji, Bahman Gharesifard
  • for: 这个论文主要关注的是 dual-objective generative adversarial network (GAN) 的 $\alpha$-parametrized generator loss function,用于替代原始 GAN 系统中的 classical discriminator loss function。
  • methods: 这个论文提出了一种基于 symmetric class probability estimation 类型的 generator loss function,称为 $\mathcal{L}_\alpha$,并使用这个loss function来定义 $\mathcal{L}_\alpha$-GAN 系统。
  • results: 研究人员通过分析 generator 的优化问题,发现 generator 的优化问题可以表示为一个 Jensen-$f_\alpha$- divergence 的最小化问题,其中 $f_\alpha$ 是一个 convex 函数,具体表示为 loss function $\mathcal{L}_\alpha$。此外,这个 $\mathcal{L}_\alpha$-GAN 问题还可以恢复一些在文献中提出的 GAN 问题,包括 VanillaGAN、LSGAN、L$k$GAN 和 $({\alpha_D},{\alpha_G})$-GAN 中的 $\alpha_D=1$。最后,在 MNIST、CIFAR-10 和 Stacked MNIST 三个数据集上进行了实验,以证明不同的例子的 $\mathcal{L}_\alpha$-GAN 系统的性能。
    Abstract A unifying $\alpha$-parametrized generator loss function is introduced for a dual-objective generative adversarial network (GAN), which uses a canonical (or classical) discriminator loss function such as the one in the original GAN (VanillaGAN) system. The generator loss function is based on a symmetric class probability estimation type function, $\mathcal{L}_\alpha$, and the resulting GAN system is termed $\mathcal{L}_\alpha$-GAN. Under an optimal discriminator, it is shown that the generator's optimization problem consists of minimizing a Jensen-$f_\alpha$-divergence, a natural generalization of the Jensen-Shannon divergence, where $f_\alpha$ is a convex function expressed in terms of the loss function $\mathcal{L}_\alpha$. It is also demonstrated that this $\mathcal{L}_\alpha$-GAN problem recovers as special cases a number of GAN problems in the literature, including VanillaGAN, Least Squares GAN (LSGAN), Least $k$th order GAN (L$k$GAN) and the recently introduced $(\alpha_D,\alpha_G)$-GAN with $\alpha_D=1$. Finally, experimental results are conducted on three datasets, MNIST, CIFAR-10, and Stacked MNIST to illustrate the performance of various examples of the $\mathcal{L}_\alpha$-GAN system.
    摘要 本文为双目标生成对抗网络(GAN)引入了一种统一的、以$\alpha$为参数的生成器损失函数,该GAN采用经典的判别器损失函数,例如原始GAN(VanillaGAN)系统中的判别器损失。生成器损失函数基于一种对称类概率估计型函数$\mathcal{L}_\alpha$,由此得到的GAN系统称为$\mathcal{L}_\alpha$-GAN。我们证明,在最优判别器下,生成器的优化问题等价于最小化一个Jensen-$f_\alpha$-散度,这是Jensen-Shannon散度的自然推广,其中$f_\alpha$是一个由损失函数$\mathcal{L}_\alpha$表示的凸函数。我们还证明,这一$\mathcal{L}_\alpha$-GAN问题可作为特例恢复文献中的多个GAN问题,包括VanillaGAN、最小二乘GAN(LSGAN)、最小$k$阶GAN(L$k$GAN)以及近期提出的$(\alpha_D,\alpha_G)$-GAN(取$\alpha_D=1$)。最后,我们在MNIST、CIFAR-10和Stacked MNIST三个数据集上进行了实验,以展示$\mathcal{L}_\alpha$-GAN系统各种实例的性能。
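
To make the special-case claim concrete, the VanillaGAN instance that the framework recovers is the classical result (Goodfellow et al., 2014): under the optimal discriminator $D^*$, the generator minimizes

$$ \min_G \; V(G, D^*) \;=\; 2\,\mathrm{JSD}\big(p_{\mathrm{data}} \,\|\, p_G\big) - \log 4, \qquad \mathrm{JSD}(p\|q) = \tfrac{1}{2}\,\mathrm{KL}\!\left(p \,\Big\|\, \tfrac{p+q}{2}\right) + \tfrac{1}{2}\,\mathrm{KL}\!\left(q \,\Big\|\, \tfrac{p+q}{2}\right). $$

In the paper's generalization, the Jensen-Shannon divergence is replaced by a Jensen-$f_\alpha$-divergence, with $f_\alpha$ a convex function induced by the generator loss $\mathcal{L}_\alpha$; the exact form of $f_\alpha$ is given in the paper and is not reproduced here.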

eess.IV - 2023-08-15

Targeted Multispectral Filter Array Design for Endoscopic Cancer Detection in the Gastrointestinal Tract

  • paper_url: http://arxiv.org/abs/2308.07947
  • repo_url: None
  • paper_authors: Michaela Taylor-Williams, Ran Tao, Travis W Sawyer, Dale J Waterhouse, Jonghee Yoon, Sarah E Bohndiek
  • for: 检测胃肠道中健康与病变组织之间的颜色差异,以提高疾病检测的准确性。
  • methods: 使用定制的多光谱滤光片阵列(MSFA),并利用开源工具箱Opti-MSFA进行优化设计。
  • results: 结果显示MSFA设计具有很高的分类精度,表明未来将其实现于内窥镜硬件中,有望改善胃肠道疾病的早期检出。
    Abstract Colour differences between healthy and diseased tissue in the gastrointestinal tract are detected visually by clinicians during white light endoscopy (WLE); however, the earliest signs of disease are often just a slightly different shade of pink compared to healthy tissue. Here, we propose to target alternative colours for imaging to improve contrast using custom multispectral filter arrays (MSFAs) that could be deployed in an endoscopic chip-on-tip configuration. Using an open-source toolbox, Opti-MSFA, we examined the optimal design of MSFAs for early cancer detection in the gastrointestinal tract. The toolbox was first extended to use additional classification models (k-Nearest Neighbour, Support Vector Machine, and Spectral Angle Mapper). Using input spectral data from published clinical trials examining the oesophagus and colon, we optimised the design of MSFAs with 3 to 9 different bands. We examined the variation of the spectral and spatial classification accuracy as a function of number of bands. The MSFA designs have high classification accuracies, suggesting that future implementation in endoscopy hardware could potentially enable improved early detection of disease in the gastrointestinal tract during routine screening and surveillance. Optimal MSFA configurations can achieve similar classification accuracies as the full spectral data in an implementation that could be realised in far simpler hardware. The reduced number of spectral bands could enable future deployment of multispectral imaging in an endoscopic chip-on-tip configuration.
    摘要 临床医生在白光内窥镜(WLE)检查中通过视觉检测胃肠道内健康组织与病变组织之间的颜色差异;然而,疾病最早期的征象往往只是与健康组织相比略有不同的粉色色调。为此,我们提出利用可部署于内窥镜镜头端芯片(chip-on-tip)配置的定制多光谱滤光片阵列(MSFA),选取替代颜色进行成像以提升对比度。我们使用开源工具箱Opti-MSFA考察了面向胃肠道早期癌症检测的MSFA最优设计。我们首先扩展了该工具箱,使其支持更多分类模型(k近邻、支持向量机和光谱角映射)。利用来自已发表的食管和结肠临床试验的光谱输入数据,我们优化了具有3至9个不同波段的MSFA设计,并分析了光谱与空间分类精度随波段数的变化。这些MSFA设计具有很高的分类精度,表明未来将其实现于内窥镜硬件中,有望在常规筛查和监测中改善胃肠道疾病的早期检出。最优的MSFA配置能够以远为简单的硬件实现,取得与完整光谱数据相近的分类精度;而更少的光谱波段数,则有望使多光谱成像未来以镜头端芯片配置部署于内窥镜中。

DSFNet: Convolutional Encoder-Decoder Architecture Combined Dual-GCN and Stand-alone Self-attention by Fast Normalized Fusion for Polyps Segmentation

  • paper_url: http://arxiv.org/abs/2308.07946
  • repo_url: None
  • paper_authors: Juntong Fan, Tieyong Zeng, Dayang Wang
  • for: 本文旨在利用一种新型U形网络DSFNet,解决结肠镜图像中息肉分割这一挑战性任务。
  • methods: 所提DSFNet结合了Dual-GCN与自注意力机制的优势,包括一个特征增强块模块和一个独立自注意力模块,以及用于高效特征融合的快速归一化融合(Fast Normalized Fusion)方法。
  • results: 所提模型在两个公开数据集(Endoscene和Kvasir-SEG)上的Dice、MAE和IoU指标均超越其他最先进模型,消融实验验证了各模块的有效性与效能;结果表明该模型具有重要的临床意义。
    Abstract In the past few decades, deep learning technology has been widely used in medical image segmentation and has made significant breakthroughs in the fields of liver and liver tumor segmentation, brain and brain tumor segmentation, video disc segmentation, heart image segmentation, and so on. However, the segmentation of polyps is still a challenging task since the surface of the polyps is flat and the color is very similar to that of surrounding tissues. Thus, It leads to the problems of the unclear boundary between polyps and surrounding mucosa, local overexposure, and bright spot reflection. To counter this problem, this paper presents a novel U-shaped network, namely DSFNet, which effectively combines the advantages of Dual-GCN and self-attention mechanisms. First, we introduce a feature enhancement block module based on Dual-GCN module as an attention mechanism to enhance the feature extraction of local spatial and structural information with fine granularity. Second, the stand-alone self-attention module is designed to enhance the integration ability of the decoding stage model to global information. Finally, the Fast Normalized Fusion method with trainable weights is used to efficiently fuse the corresponding three feature graphs in encoding, bottleneck, and decoding blocks, thus promoting information transmission and reducing the semantic gap between encoder and decoder. Our model is tested on two public datasets including Endoscene and Kvasir-SEG and compared with other state-of-the-art models. Experimental results show that the proposed model surpasses other competitors in many indicators, such as Dice, MAE, and IoU. In the meantime, ablation studies are also conducted to verify the efficacy and effectiveness of each module. Qualitative and quantitative analysis indicates that the proposed model has great clinical significance.
    摘要 在过去几十年中,深度学习技术在医学影像分割领域得到广泛应用,并在肝脏和肝肿瘤分割、脑和脑肿瘤分割、视盘分割、心脏影像分割等领域取得了显著突破。然而,息肉分割仍然是一项具有挑战性的任务:息肉表面平坦、颜色与周围组织相近,导致息肉与周围粘膜边界不清、局部过曝和亮斑反射等问题。为解决这些问题,本文提出了一种新型U形网络DSFNet,它有效结合了Dual-GCN与自注意力机制的优势。首先,我们引入了基于Dual-GCN模块的特征增强块模块作为注意力机制,以细粒度增强局部空间与结构信息的特征提取;其次,设计了独立自注意力模块,以增强解码阶段模型对全局信息的整合能力;最后,采用带可训练权重的快速归一化融合方法,高效融合编码、瓶颈和解码块中对应的三幅特征图,从而促进信息传递并缩小编码器与解码器之间的语义鸿沟。我们的模型在Endoscene和Kvasir-SEG两个公开数据集上进行测试,并与其他最先进模型比较。实验结果表明,所提模型在Dice、MAE和IoU等多项指标上超越其他竞争者;同时,消融实验也验证了各模块的有效性与效能。定性与定量分析表明,所提模型具有重要的临床意义。
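
The Fast Normalized Fusion named above is the weighting scheme popularized by EfficientDet: same-shaped feature maps are combined with non-negative learnable weights normalized by their sum. A minimal sketch (channel counts and usage are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """out = sum_i (w_i * f_i) / (eps + sum_i w_i), with trainable w_i >= 0.
    ReLU keeps the weights non-negative so the normalization stays bounded,
    which trains faster than softmax-based weighting."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, features):                   # list of (B, C, H, W) tensors
        w = F.relu(self.weights)
        fused = sum(wi * fi for wi, fi in zip(w, features))
        return fused / (w.sum() + self.eps)

fuse = FastNormalizedFusion(3)                     # e.g., encoder/bottleneck/decoder maps
out = fuse([torch.randn(2, 64, 32, 32) for _ in range(3)])
```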

An Interpretable Machine Learning Model with Deep Learning-based Imaging Biomarkers for Diagnosis of Alzheimer’s Disease

  • paper_url: http://arxiv.org/abs/2308.07778
  • repo_url: None
  • paper_authors: Wenjie Kang, Bo Li, Janne M. Papma, Lize C. Jiskoot, Peter Paul De Deyn, Geert Jan Biessels, Jurgen A. H. R. Claassen, Huub A. M. Middelkoop, Wiesje M. van der Flier, Inez H. G. B. Ramakers, Stefan Klein, Esther E. Bron
  • for: 实现阿尔茨海默病(AD)的早期诊断。
  • methods: 将可解释提升机(EBM)与基于深度学习的特征提取相结合,并给出每个特征的重要性。
  • results: 在阿尔茨海默病神经影像计划(ADNI)数据集上,AD与对照分类的准确率达0.883、曲线下面积(AUC)达0.970;在外部测试集上,AD与主观认知下降(SCD)分类的准确率为0.778、AUC为0.887。
    Abstract Machine learning methods have shown large potential for the automatic early diagnosis of Alzheimer's Disease (AD). However, some machine learning methods based on imaging data have poor interpretability because it is usually unclear how they make their decisions. Explainable Boosting Machines (EBMs) are interpretable machine learning models based on the statistical framework of generalized additive modeling, but have so far only been used for tabular data. Therefore, we propose a framework that combines the strength of EBM with high-dimensional imaging data using deep learning-based feature extraction. The proposed framework is interpretable because it provides the importance of each feature. We validated the proposed framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, achieving accuracy of 0.883 and area-under-the-curve (AUC) of 0.970 on AD and control classification. Furthermore, we validated the proposed framework on an external testing set, achieving accuracy of 0.778 and AUC of 0.887 on AD and subjective cognitive decline (SCD) classification. The proposed framework significantly outperformed an EBM model using volume biomarkers instead of deep learning-based features, as well as an end-to-end convolutional neural network (CNN) with optimized architecture.
    摘要 机器学习方法在阿尔茨海默病(AD)的自动早期诊断方面展现了巨大潜力。然而,一些基于影像数据的机器学习方法可解释性较差,因为通常无法明确它们是如何做出决策的。可解释提升机(EBM)是基于广义可加建模统计框架的可解释机器学习模型,但迄今为止仅被用于表格数据。因此,我们提出了一个框架,借助基于深度学习的特征提取,将EBM的优势与高维影像数据相结合。由于给出了每个特征的重要性,所提框架具有可解释性。我们在阿尔茨海默病神经影像计划(ADNI)数据集上验证了该框架,在AD与对照分类上取得了0.883的准确率和0.970的曲线下面积(AUC);在外部测试集上,AD与主观认知下降(SCD)分类的准确率为0.778,AUC为0.887。所提框架显著优于使用体积生物标志物(而非深度学习特征)的EBM模型,也优于经过架构优化的端到端卷积神经网络(CNN)。

Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression

  • paper_url: http://arxiv.org/abs/2308.07733
  • repo_url: https://github.com/llvy21/duic
  • paper_authors: Yue Lv, Jinxi Xiang, Jun Zhang, Wenming Yang, Xiao Han, Wei Yang
  • for: 本研究旨在提升神经图像压缩在不同领域下的表现,即弥合训练数据集(自然图像)与推理数据集(如艺术图像)之间的领域差距。
  • methods: 我们提出了一种基于低秩自适应的方法:在客户端解码器中进行低秩矩阵分解,并将更新后的自适应参数传输至解码端;此外,我们还引入了一个动态门控网络,用于决定哪些层需要自适应。
  • results: 我们的方法在各类图像数据集上具有普适性,在域外图像上的平均BD-rate较非自适应方法改进约19%;此外,它还能普遍增强多种图像压缩架构。
    Abstract The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is \emph{non-trivial}, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately $19\%$ across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly $5\%$ BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures.
    摘要 神经图像压缩的最新进展展现出超越传统标准编解码器率失真性能的巨大潜力。然而,训练所用数据集(即自然图像)与推理所用数据集(如艺术图像)之间存在难以消除的领域差距。我们提出了一种低秩自适应方法,旨在应对在域外数据集上观察到的率失真下降。具体而言,我们通过低秩矩阵分解来更新客户端解码器的部分自适应参数;在实际场景中,这些更新后的参数与图像潜变量一起被编码进码流并传输给解码器。由于对自适应参数施加了低秩约束,由此产生的码率开销很小。此外,考虑到不同输入需要不同的自适应码流,低秩自适应的码率分配并非平凡。为此,我们在低秩自适应方法之上引入了一个动态门控网络,用于决定哪些解码器层应进行自适应;该动态自适应网络使用率失真损失进行端到端优化。我们所提的方法在各类图像数据集上具有普适性。大量结果表明,该范式显著缓解了领域差距:在域外图像上,其平均BD-rate较非自适应方法改进约19%,并比最先进的实例自适应方法高出约5%的BD-rate。消融研究证实了我们的方法能够普遍增强多种图像压缩架构。
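
A sketch of the low-rank adaptation idea for a decoder layer: the base convolution stays frozen, a rank-r update W + B·A is overfitted per image (and its factors transmitted in the bitstream), and a scalar gate stands in for the dynamic gating decision. Shapes, the gate, and initialization are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LowRankAdaptedConv(nn.Module):
    """Frozen decoder conv plus a trainable rank-r weight update."""
    def __init__(self, conv: nn.Conv2d, rank=4):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad_(False)                # base codec weights stay fixed
        out_c, in_c, kh, kw = conv.weight.shape
        self.A = nn.Parameter(torch.zeros(rank, in_c * kh * kw))
        self.B = nn.Parameter(torch.zeros(out_c, rank))   # zero init: delta starts at 0
        self.gate = nn.Parameter(torch.tensor(1.0))       # stand-in for the gating network

    def forward(self, x):
        delta = (self.B @ self.A).view_as(self.conv.weight)
        w = self.conv.weight + torch.sigmoid(self.gate) * delta
        return nn.functional.conv2d(x, w, self.conv.bias,
                                    stride=self.conv.stride,
                                    padding=self.conv.padding)

layer = LowRankAdaptedConv(nn.Conv2d(192, 192, 3, padding=1), rank=4)
out = layer(torch.randn(1, 192, 16, 16))
# Only A, B, and the gate are trainable, so the per-image side information
# that must be signaled in the bitstream remains small, as the abstract notes.
```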

A deep deformable residual learning network for SAR images segmentation

  • paper_url: http://arxiv.org/abs/2308.07627
  • repo_url: None
  • paper_authors: Chenwei Wang, Jifang Pei, Yulin Huang, Jianyu Yang
  • for: 这篇论文是为了提出一个基于深度学习网络的新方法 для SAR 对象分类,以提高 SAR 对象分类的精度和速度。
  • methods: 本文使用的方法包括对于 SAR 图像进行深度学习网络的建立,并将对象分类问题转化为一个对于图像进行分类的问题。另外,本文还使用了扭转卷网络和复原学习块,以提高网络的准确性和稳定性。
  • results: 根据 MSTAR 资料集的实验结果显示,提出的深度对象分类网络能够实现高精度和高速度的 SAR 对象分类,并且比传统方法更加精确和可靠。
    Abstract Reliable automatic target segmentation in Synthetic Aperture Radar (SAR) imagery has played an important role in the SAR fields. Different from the traditional methods, Spectral Residual (SR) and CFAR detector, with the recent adavance in machine learning theory, there has emerged a novel method for SAR target segmentation, based on the deep learning networks. In this paper, we proposed a deep deformable residual learning network for target segmentation that attempts to preserve the precise contour of the target. For this, the deformable convolutional layers and residual learning block are applied, which could extract and preserve the geometric information of the targets as much as possible. Based on the Moving and Stationary Target Acquisition and Recognition (MSTAR) data set, experimental results have shown the superiority of the proposed network for the precise targets segmentation.

  • paper_url: http://arxiv.org/abs/2308.07611
  • repo_url: None
  • paper_authors: Po-Jui Lu, Benjamin Odry, Muhamed Barakovic, Matthias Weigel, Robin Sandkühler, Reza Rahmanzadeh, Xinjie Chen, Mario Ocampo-Pineda, Jens Kuhle, Ludwig Kappos, Philippe Cattin, Cristina Granziera
  • for: Identifying disability-related brain changes in multiple sclerosis (MS) patients.
  • methods: Uses whole-brain quantitative MRI (qMRI), a convolutional neural network (CNN), and an interpretability method to classify MS patients into severely and non-severely disabled groups.
  • results: Achieves a test AUC of 0.885; qT1 is the measure most sensitive to disability, followed by the neurite density index (NDI). The study also identifies disability-related brain regions, including the corticospinal tract, where average qT1 and NDI correlate significantly with patients' disability scores (ρ = -0.37 and 0.44).
    Abstract Objective: Identifying disability-related brain changes is important for multiple sclerosis (MS) patients. Currently, there is no clear understanding about which pathological features drive disability in single MS patients. In this work, we propose a novel comprehensive approach, GAMER-MRIL, leveraging whole-brain quantitative MRI (qMRI), convolutional neural network (CNN), and an interpretability method from classifying MS patients with severe disability to investigating relevant pathological brain changes. Methods: One-hundred-sixty-six MS patients underwent 3T MRI acquisitions. qMRI informative of microstructural brain properties was reconstructed, including quantitative T1 (qT1), myelin water fraction (MWF), and neurite density index (NDI). To fully utilize the qMRI, GAMER-MRIL extended a gated-attention-based CNN (GAMER-MRI), which was developed to select patch-based qMRI important for a given task/question, to the whole-brain image. To find out disability-related brain regions, GAMER-MRIL modified a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), to incorporate qMRI. Results: The test performance was AUC=0.885. qT1 was the most sensitive measure related to disability, followed by NDI. The proposed LRP approach obtained more specifically relevant regions than other interpretability methods, including the saliency map, the integrated gradients, and the original LRP. The relevant regions included the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($\rho$=-0.37 and 0.44). Conclusion: These results demonstrated that GAMER-MRIL can classify patients with severe disability using qMRI and subsequently identify brain regions potentially important to the integrity of the mobile function. Significance: GAMER-MRIL holds promise for developing biomarkers and increasing clinicians' trust in NN.

Benchmarking Scalable Epistemic Uncertainty Quantification in Organ Segmentation

  • paper_url: http://arxiv.org/abs/2308.07506
  • repo_url: https://github.com/jadie1/medseguq
  • paper_authors: Jadie Adams, Shireen Y. Elhabian
  • for: Aiding diagnosis and treatment planning through deep-learning-based automatic organ segmentation with reliable uncertainty estimates.
  • methods: Benchmarks scalable epistemic (model-based) uncertainty quantification methods in organ segmentation.
  • results: A comprehensive benchmarking study evaluating the accuracy, uncertainty calibration, scalability, and out-of-distribution detection capabilities of the different methods, with recommendations for future improvements.
    Abstract Deep learning based methods for automatic organ segmentation have shown promise in aiding diagnosis and treatment planning. However, quantifying and understanding the uncertainty associated with model predictions is crucial in critical clinical applications. While many techniques have been proposed for epistemic or model-based uncertainty estimation, it is unclear which method is preferred in the medical image analysis setting. This paper presents a comprehensive benchmarking study that evaluates epistemic uncertainty quantification methods in organ segmentation in terms of accuracy, uncertainty calibration, and scalability. We provide a comprehensive discussion of the strengths, weaknesses, and out-of-distribution detection capabilities of each method as well as recommendations for future improvements. These findings contribute to the development of reliable and robust models that yield accurate segmentations while effectively quantifying epistemic uncertainty.

Brain Tumor Detection Based on a Novel and High-Quality Prediction of the Tumor Pixel Distributions

  • paper_url: http://arxiv.org/abs/2308.07495
  • repo_url: None
  • paper_authors: Yanming Sun, Chunyan Wang
  • for: Proposes a system for detecting brain tumors in 3D MRI brain scans of Flair modality.
  • methods: Uses a 2D histogram presentation that comprehends the gray-level and pixel-location distributions of a 3D object; exploits the left-right asymmetry of the brain structure to establish particular 2D histograms, which are modulated to attenuate elements irrelevant to the tumor regions. The tumor pixel distribution is predicted in three steps on the axial, coronal, and sagittal slice series, with each step identifying and removing tumor-free slices to increase the tumor information density for the next step.
  • results: Delivers very good tumor detection results, comparable to those of state-of-the-art CNN systems with mono-modality inputs, at an extremely low computation cost and without any training.
    Abstract In this paper, we propose a system to detect brain tumor in 3D MRI brain scans of Flair modality. It performs 2 functions: (a) predicting gray-level and locational distributions of the pixels in the tumor regions and (b) generating tumor mask in pixel-wise precision. To facilitate 3D data analysis and processing, we introduced a 2D histogram presentation that comprehends the gray-level distribution and pixel-location distribution of a 3D object. In the proposed system, particular 2D histograms, in which tumor-related feature data get concentrated, are established by exploiting the left-right asymmetry of a brain structure. A modulation function is generated from the input data of each patient case and applied to the 2D histograms to attenuate the element irrelevant to the tumor regions. The prediction of the tumor pixel distribution is done in 3 steps, on the axial, coronal and sagittal slice series, respectively. In each step, the prediction result helps to identify/remove tumor-free slices, increasing the tumor information density in the remaining data to be applied to the next step. After the 3-step removal, the 3D input is reduced to a minimum bounding box of the tumor region. It is used to finalize the prediction and then transformed into a 3D tumor mask, by means of gray level thresholding and low-pass-based morphological operations. The final prediction result is used to determine the critical threshold. The proposed system has been tested extensively with the data of more than one thousand patient cases in the datasets of BraTS 2018~21. The test results demonstrate that the predicted 2D histograms have a high degree of similarity with the true ones. The system delivers also very good tumor detection results, comparable to those of state-of-the-art CNN systems with mono-modality inputs, which is achieved at an extremely low computation cost and no need for training.

Space Object Identification and Classification from Hyperspectral Material Analysis

  • paper_url: http://arxiv.org/abs/2308.07481
  • repo_url: None
  • paper_authors: Massimiliano Vasile, Lewis Walker, Andrew Campbell, Simao Marto, Paul Murray, Stephen Marshall, Vasili Savitski
  • for: Extracting information from the hyperspectral signature of unknown space objects and using it to determine their material composition.
  • methods: Two material identification and classification techniques are used: one based on machine learning and one based on a least-squares match against a library of known spectra. A supervised machine-learning algorithm then classifies the object into one of several categories according to the materials detected on it.
  • results: The study investigates how the material classification behaves when the spectral library is missing a material present on the observed object, and under non-ideal conditions such as weathered materials; preliminary results on space object identification and classification are presented.
    Abstract This paper presents a data processing pipeline designed to extract information from the hyperspectral signature of unknown space objects. The methodology proposed in this paper determines the material composition of space objects from single pixel images. Two techniques are used for material identification and classification: one based on machine learning and the other based on a least square match with a library of known spectra. From this information, a supervised machine learning algorithm is used to classify the object into one of several categories based on the detection of materials on the object. The behaviour of the material classification methods is investigated under non-ideal circumstances, to determine the effect of weathered materials, and the behaviour when the training library is missing a material that is present in the object being observed. Finally the paper will present some preliminary results on the identification and classification of space objects.

Probabilistic MIMO U-Net: Efficient and Accurate Uncertainty Estimation for Pixel-wise Regression

  • paper_url: http://arxiv.org/abs/2308.07477
  • repo_url: https://github.com/antonbaumann/mimo-unet
  • paper_authors: Anton Baumann, Thomas Roßberg, Michael Schmitt
  • for: Improving the reliability and interpretability of machine-learning models, especially in high-stakes real-world applications.
  • methods: Adapts the Multiple-Input Multiple-Output (MIMO) framework, which exploits the overparameterization of deep neural networks, to pixel-wise regression tasks, and introduces a novel procedure for synchronizing subnetwork performance.
  • results: Comprehensive evaluations on two orthogonal datasets show that the MIMO U-Net achieves accuracy comparable to existing models, better calibration on in-distribution data, robust out-of-distribution detection, and considerable reductions in parameter size and inference time. Code is available at github.com/antonbaumann/MIMO-Unet.
    Abstract Uncertainty estimation in machine learning is paramount for enhancing the reliability and interpretability of predictive models, especially in high-stakes real-world scenarios. Despite the availability of numerous methods, they often pose a trade-off between the quality of uncertainty estimation and computational efficiency. Addressing this challenge, we present an adaptation of the Multiple-Input Multiple-Output (MIMO) framework -- an approach exploiting the overparameterization of deep neural networks -- for pixel-wise regression tasks. Our MIMO variant expands the applicability of the approach from simple image classification to broader computer vision domains. For that purpose, we adapted the U-Net architecture to train multiple subnetworks within a single model, harnessing the overparameterization in deep neural networks. Additionally, we introduce a novel procedure for synchronizing subnetwork performance within the MIMO framework. Our comprehensive evaluations of the resulting MIMO U-Net on two orthogonal datasets demonstrate comparable accuracy to existing models, superior calibration on in-distribution data, robust out-of-distribution detection capabilities, and considerable improvements in parameter size and inference time. Code available at github.com/antonbaumann/MIMO-Unet

Large-kernel Attention for Efficient and Robust Brain Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2308.07251
  • repo_url: https://github.com/liamchalcroft/mdunet
  • paper_authors: Liam Chalcroft, Ruben Lourenço Pereira, Mikael Brudfors, Andrew S. Kayser, Mark D’Esposito, Cathy J. Price, Ioannis Pappas, John Ashburner
  • for: Addressing the limited efficiency and lack of translational invariance of vision transformers for medical image segmentation.
  • methods: Proposes an all-convolutional transformer block variant of the U-Net architecture to model long-range interactions for 3D brain lesion segmentation.
  • results: The model provides the best compromise between performance competitive with the state of the art, the parameter efficiency of a CNN, and the favourable inductive biases of a transformer.
    Abstract Vision transformers are effective deep learning models for vision tasks, including medical image segmentation. However, they lack efficiency and translational invariance, unlike convolutional neural networks (CNNs). To model long-range interactions in 3D brain lesion segmentation, we propose an all-convolutional transformer block variant of the U-Net architecture. We demonstrate that our model provides the greatest compromise in three factors: performance competitive with the state-of-the-art; parameter efficiency of a CNN; and the favourable inductive biases of a transformer. Our public implementation is available at https://github.com/liamchalcroft/MDUNet .

cs.SD - 2023-08-14

Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

  • paper_url: http://arxiv.org/abs/2308.07145
  • repo_url: https://github.com/w-wu/steer
  • paper_authors: Wen Wu, Chao Zhang, Philip C. Woodland
  • for: Proposes a jointly trained system that integrates automatic emotion recognition (AER), automatic speech recognition (ASR), and speaker diarisation (SD) for use in dialogue systems.
  • methods: Builds distinct output layers on a shared encoder for four sub-tasks: AER, ASR, voice activity detection, and speaker classification.
  • results: On the IEMOCAP dataset, the proposed system consistently outperforms two baselines with separately trained single-task systems on AER, ASR, and SD; two metrics based on time-weighted emotion and speaker classification errors are proposed to evaluate AER with automatic segmentation.
    Abstract Although automatic emotion recognition (AER) has recently drawn significant research interest, most current AER studies use manually segmented utterances, which are usually unavailable for dialogue systems. This paper proposes integrating AER with automatic speech recognition (ASR) and speaker diarisation (SD) in a jointly-trained system. Distinct output layers are built for four sub-tasks including AER, ASR, voice activity detection and speaker classification based on a shared encoder. Taking the audio of a conversation as input, the integrated system finds all speech segments and transcribes the corresponding emotion classes, word sequences, and speaker identities. Two metrics are proposed to evaluate AER performance with automatic segmentation based on time-weighted emotion and speaker classification errors. Results on the IEMOCAP dataset show that the proposed system consistently outperforms two baselines with separately trained single-task systems on AER, ASR and SD.

  • paper_url: http://arxiv.org/abs/2308.07056
  • repo_url: None
  • paper_authors: Yuke Lin, Xiaoyi Qin, Ming Cheng, Ning Jiang, Guoqing Zhao, Ming Li
  • for: Contributes a novel and extensive speaker verification dataset for training and studying speaker recognition models.
  • methods: Uses an automatic and scalable data collection pipeline that downloads large numbers of short videos from YouTube and extracts the relevant speech and video segments.
  • results: Experiments with different backbones trained on a mix of VoxCeleb2 and VoxBlink-Clean show performance improvements of 13%-30% across architectures.
    Abstract In this paper, we contribute a novel and extensive dataset for speaker verification, which contains noisy 38k identities/1.45M utterances (VoxBlink) and relatively cleaned 18k identities/1.02M (VoxBlink-Clean) utterances for training. Firstly, we accumulate a 60K+ users' list with their avatars and download their short videos on YouTube. We then established an automatic and scalable pipeline to extract relevant speech and video segments from these videos. To our knowledge, the VoxBlink dataset is one of the largest speaker recognition datasets available. Secondly, we conduct a series of experiments based on different backbones trained on a mix of the VoxCeleb2 and the VoxBlink-Clean. Our findings highlight a notable performance improvement, ranging from 13% to 30%, across different backbone architectures upon integrating our dataset for training. The dataset will be made publicly available shortly.

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

  • paper_url: http://arxiv.org/abs/2308.08488
  • repo_url: https://github.com/mispchallenge/misp-icme-avsr
  • paper_authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee
  • for: Improving the performance of audio-visual speech recognition (AVSR) systems.
  • methods: Two novel techniques: exploiting the correlation between lip shapes and syllable-level subword units in Mandarin to establish accurate frame-level syllable boundaries, and an audio-guided cross-modal fusion encoder (CMFE) neural network that makes full use of modality complementarity.
  • results: Experiments on the MISP2021-AVSR dataset demonstrate the effectiveness of both techniques; using only a relatively small amount of training data, the final system outperforms state-of-the-art systems with more complex front-ends and back-ends.
    Abstract In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Unmatching convergence rates and specialized input representations between audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. This enables accurate alignment of video and audio streams during visual model pre-training and cross-modal fusion. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends.

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

  • paper_url: http://arxiv.org/abs/2308.06981
  • repo_url: None
  • paper_authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji
  • for: Summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23).
  • methods: Details the structure of the competition and the datasets used, in particular CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions.
  • results: Offers insights into the most successful approaches: compared to the cocktail-fork baseline, the best system trained exclusively on the simulated Divide and Remaster (DnR) dataset improved SDR by 1.8 dB, while the top system on the open leaderboard, where any data could be used for training, improved SDR by 5.7 dB.
    Abstract This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8dB in SDR whereas the top performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7dB.

The Sound Demixing Challenge 2023 – Music Demixing Track

  • paper_url: http://arxiv.org/abs/2308.06979
  • repo_url: https://github.com/zfturbo/mvsep-mdx23-music-separation-model
  • paper_authors: Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, Jiafeng Liu, Yuki Mitsufuji
  • for: The music demixing (MDX) track of the Sound Demixing Challenge 2023 (SDX'23).
  • methods: Introduces the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data; proposes a formalization of such errors and two new datasets that simulate them, SDXDB23_LabelNoise and SDXDB23_Bleeding1.
  • results: Describes the methods that achieved the highest scores; the best system under the standard MSS formulation improved the signal-to-distortion ratio by over 1.6 dB compared to the winner of the previous edition (the Music Demixing Challenge 2021) when evaluated on MDXDB21, and a listening test with renowned producers/musicians assessed perceptual quality.
    Abstract This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system under the standard MSS formulation achieved an improvement of over 1.6dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems and report here the results. Finally, we provide our insights into the organization of the competition and our prospects for future editions.

cs.CV - 2023-08-14

DS-Depth: Dynamic and Static Depth Estimation via a Fusion Cost Volume

  • paper_url: http://arxiv.org/abs/2308.07225
  • repo_url: https://github.com/xingy038/ds-depth
  • paper_authors: Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Xinxing Xu, Yang Long, Yefeng Zheng
  • for: Improving the accuracy of self-supervised monocular depth estimation by addressing the feature mismatch and occlusion errors caused by dynamic objects.
  • methods: Uses the reprojection error to capture geometric relationships in static regions, and a dynamic cost volume based on residual optical flow to describe moving objects; a fusion module lets the static and dynamic cost volumes compensate for each other's occlusions and noise.
  • results: The model outperforms previously published baselines on the KITTI and Cityscapes datasets and handles errors caused by dynamic objects more robustly.
    Abstract Self-supervised monocular depth estimation methods typically rely on the reprojection error to capture geometric relationships between successive frames in static environments. However, this assumption does not hold in dynamic objects in scenarios, leading to errors during the view synthesis stage, such as feature mismatch and occlusion, which can significantly reduce the accuracy of the generated depth maps. To address this problem, we propose a novel dynamic cost volume that exploits residual optical flow to describe moving objects, improving incorrectly occluded regions in static cost volumes used in previous work. Nevertheless, the dynamic cost volume inevitably generates extra occlusions and noise, thus we alleviate this by designing a fusion module that makes static and dynamic cost volumes compensate for each other. In other words, occlusion from the static volume is refined by the dynamic volume, and incorrect information from the dynamic volume is eliminated by the static volume. Furthermore, we propose a pyramid distillation loss to reduce photometric error inaccuracy at low resolutions and an adaptive photometric error loss to alleviate the flow direction of the large gradient in the occlusion regions. We conducted extensive experiments on the KITTI and Cityscapes datasets, and the results demonstrate that our model outperforms previously published baselines for self-supervised monocular depth estimation.

Distance Matters For Improving Performance Estimation Under Covariate Shift

  • paper_url: http://arxiv.org/abs/2308.07223
  • repo_url: https://github.com/melanibe/distance_matters_performance_estimation
  • paper_authors: Mélanie Roschewitz, Ben Glocker
  • for: Proposes a new performance estimation method for the safe deployment of AI models under covariate shift, especially in sensitive applications.
  • methods: Many existing approaches derive accuracy estimates from model predictions or softmax confidence, but confidence scores can become ill-calibrated under dataset shift when test samples lie far from the training distribution. This work introduces a "distance-check" that flags samples lying too far from the expected training distribution, so that their untrustworthy model outputs are not relied upon in the accuracy estimation step.
  • results: Across 13 image classification tasks spanning a wide range of natural and synthetic distribution shifts and hundreds of models, the method achieves a median relative MAE improvement of 27% over the best baseline on all tasks and state-of-the-art performance on 10 of the 13 tasks. Code is available at https://github.com/melanibe/distance_matters_performance_estimation.
    Abstract Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Precisely, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide-range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.

Automated Ensemble-Based Segmentation of Adult Brain Tumors: A Novel Approach Using the BraTS AFRICA Challenge Data

  • paper_url: http://arxiv.org/abs/2308.07214
  • repo_url: None
  • paper_authors: Chiranjeewee Prasad Koirala, Sovesh Mohapatra, Advait Gosai, Gottfried Schlaug
  • for: Explores the application of deep learning to multi-modality MRI data to improve brain tumor segmentation precision in the Sub-Saharan Africa patient population.
  • methods: Introduces an ensemble method comprising eleven unique variants based on three core architectures (UNet3D, ONet3D, and SphereNet3D) together with modified loss functions.
  • results: The ensemble approach, combining different architectures, outperforms single models, achieving Dice scores of 0.82, 0.82, and 0.87 for the enhancing tumor, tumor core, and whole tumor labels respectively.
    Abstract Brain tumors, particularly glioblastoma, continue to challenge medical diagnostics and treatments globally. This paper explores the application of deep learning to multi-modality magnetic resonance imaging (MRI) data for enhanced brain tumor segmentation precision in the Sub-Saharan Africa patient population. We introduce an ensemble method that comprises eleven unique variations based on three core architectures: UNet3D, ONet3D, SphereNet3D and modified loss functions. The study emphasizes the need for both age- and population-based segmentation models, to fully account for the complexities in the brain. Our findings reveal that the ensemble approach, combining different architectures, outperforms single models, leading to improved evaluation metrics. Specifically, the results exhibit Dice scores of 0.82, 0.82, and 0.87 for enhancing tumor, tumor core, and whole tumor labels respectively. These results underline the potential of tailored deep learning techniques in precisely segmenting brain tumors and lay groundwork for future work to fine-tune models and assess performance across different brain regions.

Automated Ensemble-Based Segmentation of Pediatric Brain Tumors: A Novel Approach Using the CBTN-CONNECT-ASNR-MICCAI BraTS-PEDs 2023 Challenge Data

  • paper_url: http://arxiv.org/abs/2308.07212
  • repo_url: None
  • paper_authors: Shashidhar Reddy Javaji, Sovesh Mohapatra, Advait Gosai, Gottfried Schlaug
  • for: Advancing diagnostic techniques and treatment planning for brain tumors, in particular age-specific segmentation models for pediatric patients.
  • methods: Deep learning on MRI modalities, with a novel ensemble approach that combines ONet and modified versions of UNet with innovative loss functions; single and composite data augmentations ensure robustness across different scanning protocols.
  • results: Achieves lesion-wise Dice scores of 0.52, 0.72, and 0.78 for the enhancing tumor, tumor core, and whole tumor labels respectively; visual comparisons confirm that the ensemble covers tumor regions more accurately.
    Abstract Brain tumors remain a critical global health challenge, necessitating advancements in diagnostic techniques and treatment methodologies. In response to the growing need for age-specific segmentation models, particularly for pediatric patients, this study explores the deployment of deep learning techniques using magnetic resonance imaging (MRI) modalities. By introducing a novel ensemble approach using ONet and modified versions of UNet, coupled with innovative loss functions, this study achieves a precise segmentation model for the BraTS-PEDs 2023 Challenge. Data augmentation, including both single and composite transformations, ensures model robustness and accuracy across different scanning protocols. The ensemble strategy, integrating the ONet and UNet models, shows greater effectiveness in capturing specific features and modeling diverse aspects of the MRI images which result in lesion_wise dice scores of 0.52, 0.72 and 0.78 for enhancing tumor, tumor core and whole tumor labels respectively. Visual comparisons further confirm the superiority of the ensemble method in accurate tumor region coverage. The results indicate that this advanced ensemble approach, building upon the unique strengths of individual models, offers promising prospects for enhanced diagnostic accuracy and effective treatment planning for brain tumors in pediatric brains.

Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

  • paper_url: http://arxiv.org/abs/2308.07209
  • repo_url: None
  • paper_authors: Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, Yong Liu
  • for: Reducing the inference time and memory footprint of neural networks, including for applications with sensitive or proprietary data where the original training set is unavailable.
  • methods: Proposes Unified Data-Free Compression (UDFC), which performs pruning and quantization simultaneously without any data or fine-tuning. It assumes that the partial information of a damaged (pruned or quantized) channel can be preserved by a linear combination of other channels, and derives a closed-form solution to the resulting reconstruction error.
  • results: On large-scale image classification, UDFC achieves significant improvements across various architectures and compression methods, e.g., a 20.54% accuracy improvement on ImageNet over the SOTA method with a 30% pruning ratio and 6-bit quantization on ResNet-34.
    Abstract Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods are proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression(UDFC), which performs pruning and quantization simultaneously without any data and fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged(e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from the assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate the UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on ImageNet dataset compared to SOTA method with 30% pruning ratio and 6-bit quantization on ResNet-34.

FOLT: Fast Multiple Object Tracking from UAV-captured Videos Based on Optical Flow

  • paper_url: http://arxiv.org/abs/2308.07207
  • repo_url: None
  • paper_authors: Mufeng Yao, Jiaqi Wang, Jinlong Peng, Mingmin Chi, Chao Liu
  • for: Addressing the challenges of multiple object tracking in UAV-captured videos: small object size, blurred object appearance, and large or irregular motion of both ground objects and the UAV platform.
  • methods: FOLT adopts a modern detector and a light-weight optical flow extractor to obtain detection and motion features at minimum cost; flow-guided feature augmentation improves small-object detection, and flow-guided motion prediction handles large displacements between adjacent frames.
  • results: Experiments on the Visdrone and UAVDT datasets show that the model successfully tracks small objects with large and irregular motion and outperforms existing state-of-the-art methods on UAV-MOT tasks.
    Abstract Multiple object tracking (MOT) has been successfully investigated in computer vision. However, MOT for the videos captured by unmanned aerial vehicles (UAV) is still challenging due to small object size, blurred object appearance, and very large and/or irregular motion in both ground objects and UAV platforms. In this paper, we propose FOLT to mitigate these problems and reach fast and accurate MOT in UAV view. Aiming at speed-accuracy trade-off, FOLT adopts a modern detector and light-weight optical flow extractor to extract object detection features and motion features at a minimum cost. Given the extracted flow, the flow-guided feature augmentation is designed to augment the object detection feature based on its optical flow, which improves the detection of small objects. Then the flow-guided motion prediction is also proposed to predict the object's position in the next frame, which improves the tracking performance of objects with very large displacements between adjacent frames. Finally, the tracker matches the detected objects and predicted objects using a spatially matching scheme to generate tracks for every object. Experiments on Visdrone and UAVDT datasets show that our proposed model can successfully track small objects with large and irregular motion and outperform existing state-of-the-art methods in UAV-MOT tasks.

Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

  • paper_url: http://arxiv.org/abs/2308.07202
  • repo_url: None
  • paper_authors: Xugong Qin, Pengyuan Lyu, Chengquan Zhang, Yu Zhou, Kun Yao, Peng Zhang, Hailun Lin, Weiping Wang
  • for: Improving the accuracy and robustness of bottom-up, segmentation-based real-time scene text detection through representation learning.
  • methods: Global-dense semantic contrast (GDSC), which extracts a vector for global semantic representation and contrasts it element-wise with dense grid features, and top-down modeling (TDM) combined with the bottom-up framework to provide implicit instance-level clues; both strengthen the encoder without adding any parameters or computation at inference.
  • results: On four public datasets the method outperforms or matches the state of the art in both accuracy and speed, achieving 87.2% F-measure at 48.2 FPS on Total-Text and 89.6% F-measure at 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.
    Abstract Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-up segmentation-based methods begin to be mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning in which auxiliary tasks are utilized to enable the encoder to jointly learn robust features with the main task of per-pixel classification during optimization. For semantic representation learning, we propose global-dense semantic contrast (GDSC), in which a vector is extracted for global semantic representation, then used to perform element-wise contrast with the dense grid features. To learn instance-aware representation, we propose to combine top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. With the proposed GDSC and TDM, the encoder network learns stronger representation without introducing any parameters and computations during inference. Equipped with a very light decoder, the detector can achieve more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method can outperform or be comparable to the state-of-the-art on both accuracy and speed. Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.

SEMI-CenterNet: A Machine Learning Facilitated Approach for Semiconductor Defect Inspection

  • paper_url: http://arxiv.org/abs/2308.07180
  • repo_url: None
  • paper_authors: Vic De Ridder, Bappaditya Dey, Enrique Dehaerne, Sandip Halder, Stefan De Gendt, Bartel Van Waeyenberge
  • for: Proposes an automated deep-learning-based approach for efficient localization and classification of defects in semiconductor SEM images.
  • methods: SEMI-CenterNet (SEMI-CN), a customized CenterNet architecture trained on SEM images of wafer defects; it predicts only the center, class, size, and offset of likely defect instances rather than redundant anchor-based bounding boxes, which improves computational efficiency.
  • results: Trained on two datasets with two ResNet backbones, SEMI-CN shows a significant improvement in inference time over previous work; transfer learning between the ADI and AEI datasets reduces the training time needed for both backbones to reach the best mAP compared with conventional training.
    Abstract Continual shrinking of pattern dimensions in the semiconductor domain is making it increasingly difficult to inspect defects due to factors such as the presence of stochastic noise and the dynamic behavior of defect patterns and types. Conventional rule-based methods and non-parametric supervised machine learning algorithms like KNN mostly fail at the requirements of semiconductor defect inspection at these advanced nodes. Deep Learning (DL)-based methods have gained popularity in the semiconductor defect inspection domain because they have been proven robust towards these challenging scenarios. In this research work, we have presented an automated DL-based approach for efficient localization and classification of defects in SEM images. We have proposed SEMI-CenterNet (SEMI-CN), a customized CN architecture trained on SEM images of semiconductor wafer defects. The use of the proposed CN approach allows improved computational efficiency compared to previously studied DL models. SEMI-CN gets trained to output the center, class, size, and offset of a defect instance. This is different from the approach of most object detection models that use anchors for bounding box prediction. Previous methods predict redundant bounding boxes, most of which are discarded in postprocessing. CN mitigates this by only predicting boxes for likely defect center points. We train SEMI-CN on two datasets and benchmark two ResNet backbones for the framework. Initially, ResNet models pretrained on the COCO dataset undergo training using two datasets separately. Primarily, SEMI-CN shows significant improvement in inference time against previous research works. Finally, transfer learning (using weights of custom SEM dataset) is applied from ADI dataset to AEI dataset and vice-versa, which reduces the required training time for both backbones to reach the best mAP against conventional training method.

HyperSparse Neural Networks: Shifting Exploration to Exploitation through Adaptive Regularization

  • paper_url: http://arxiv.org/abs/2308.07163
  • repo_url: https://github.com/greenautoml4fas/hypersparse
  • paper_authors: Patrick Glandorf, Timo Kaiser, Bodo Rosenhahn
  • for: Proposes Adaptive Regularized Training (ART), a novel and powerful sparse learning method for compressing dense networks into sparse ones.
  • methods: Instead of the commonly used binary mask during training, weights are shrunk towards zero iteratively under increasing weight regularization, compressing the pre-trained model's knowledge into the weights of highest magnitude; a novel regularization loss named HyperSparse exploits the highest weights while conserving the ability of weight exploration.
  • results: Extensive experiments on CIFAR and TinyImageNet show notable performance gains over other sparsification methods, especially in extremely high sparsity regimes of up to 99.8% model sparsity; additional investigations provide new insights into the patterns encoded in high-magnitude weights.
    Abstract Sparse neural networks are a key factor in developing resource-efficient machine learning applications. We propose the novel and powerful sparse learning method Adaptive Regularized Training (ART) to compress dense into sparse networks. Instead of the commonly used binary mask during training to reduce the number of model weights, we inherently shrink weights close to zero in an iterative manner with increasing weight regularization. Our method compresses the pre-trained model knowledge into the weights of highest magnitude. Therefore, we introduce a novel regularization loss named HyperSparse that exploits the highest weights while conserving the ability of weight exploration. Extensive experiments on CIFAR and TinyImageNet show that our method leads to notable performance gains compared to other sparsification methods, especially in extremely high sparsity regimes up to 99.8 percent model sparsity. Additional investigations provide new insights into the patterns that are encoded in weights with high magnitudes.

SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation

  • paper_url: http://arxiv.org/abs/2308.07156
  • repo_url: None
  • paper_authors: An Wang, Mobarakol Islam, Mengya Xu, Yang Zhang, Hongliang Ren
  • for: An empirical study of the Segment Anything Model (SAM) for semantic segmentation of surgical instruments in robotic surgery.
  • methods: Comprehensively evaluates SAM across prompted and unprompted settings, bounding-box and point-based prompts, and corruptions and perturbations at five severity levels, and compares it with state-of-the-art supervised models on the MICCAI EndoVis 2017 and 2018 datasets.
  • results: SAM shows zero-shot generalization with bounding-box prompts but struggles with point-based prompts and unprompted settings, particularly in complex scenes with blood, reflection, blur, and shade, and is insufficiently robust to data corruption. A variant fine-tuned with Low-rank Adaptation (LoRA), SurgicalSAM, shows class-wise mask prediction without prompts.
    Abstract The Segment Anything Model (SAM) serves as a fundamental model for semantic segmentation and demonstrates remarkable generalization capabilities across a wide range of downstream scenarios. In this empirical study, we examine SAM's robustness and zero-shot generalizability in the field of robotic surgery. We comprehensively explore different scenarios, including prompted and unprompted situations, bounding box and points-based prompt approaches, as well as the ability to generalize under corruptions and perturbations at five severity levels. Additionally, we compare the performance of SAM with state-of-the-art supervised models. We conduct all the experiments with two well-known robotic instrument segmentation datasets from MICCAI EndoVis 2017 and 2018 challenges. Our extensive evaluation results reveal that although SAM shows remarkable zero-shot generalization ability with bounding box prompts, it struggles to segment the whole instrument with point-based prompts and unprompted settings. Furthermore, our qualitative figures demonstrate that the model either failed to predict certain parts of the instrument mask (e.g., jaws, wrist) or predicted parts of the instrument as wrong classes in the scenario of overlapping instruments within the same bounding box or with the point-based prompt. In fact, SAM struggles to identify instruments in complex surgical scenarios characterized by the presence of blood, reflection, blur, and shade. Additionally, SAM is insufficiently robust to maintain high performance when subjected to various forms of data corruption. We also attempt to fine-tune SAM using Low-rank Adaptation (LoRA) and propose SurgicalSAM, which shows the capability in class-wise mask prediction without prompt. Therefore, we can argue that, without further domain-specific fine-tuning, SAM is not ready for downstream surgical tasks.

DELO: Deep Evidential LiDAR Odometry using Partial Optimal Transport

  • paper_url: http://arxiv.org/abs/2308.07153
  • repo_url: None
  • paper_authors: Sk Aziz Ali, Djamila Aouada, Gerd Reis, Didier Stricker
  • for: Providing accurate, robust, real-time LiDAR-based odometry (LO) for applications such as robot navigation, globally consistent 3D scene map reconstruction, and safe motion planning.
  • methods: A deep learning-based real-time (approx. 35-40 ms per frame) LO method that jointly learns accurate frame-to-frame correspondences, via partial optimal transport of LiDAR feature descriptors, and the model's predictive uncertainty (PU) as evidence to safeguard LO predictions; PU also serves as evidence for pose-graph optimization when the LO network is under- or over-confident.
  • results: Evaluated on the KITTI dataset, the method shows competitive performance and even superior generalization ability compared with recent state-of-the-art approaches. Source codes are available.
    Abstract Accurate, robust, and real-time LiDAR-based odometry (LO) is imperative for many applications like robot navigation, globally consistent 3D scene map reconstruction, or safe motion-planning. Though LiDAR sensor is known for its precise range measurement, the non-uniform and uncertain point sampling density induce structural inconsistencies. Hence, existing supervised and unsupervised point set registration methods fail to establish one-to-one matching correspondences between LiDAR frames. We introduce a novel deep learning-based real-time (approx. 35-40ms per frame) LO method that jointly learns accurate frame-to-frame correspondences and model's predictive uncertainty (PU) as evidence to safe-guard LO predictions. In this work, we propose (i) partial optimal transportation of LiDAR feature descriptor for robust LO estimation, (ii) joint learning of predictive uncertainty while learning odometry over driving sequences, and (iii) demonstrate how PU can serve as evidence for necessary pose-graph optimization when LO network is either under or over confident. We evaluate our method on KITTI dataset and show competitive performance, even superior generalization ability over recent state-of-the-art approaches. Source codes are available.

Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage

  • paper_url: http://arxiv.org/abs/2308.07151
  • repo_url: https://github.com/ciodar/cultural-heritage-diffaug
  • paper_authors: Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del Bimbo
  • for: This paper aims to address the challenges of limited annotated data and domain shifts in the cultural heritage domain by leveraging generative vision-language models to augment art datasets.
  • methods: The proposed approach uses generative vision-language models to generate diverse variations of artworks conditioned on their captions, enhancing dataset diversity and improving the alignment of visual cues with knowledge from general-purpose datasets.
  • results: The generated variations assist in training vision and language models with a deeper understanding of artistic characteristics, allowing for better caption generation with appropriate jargon.
    Abstract Cultural heritage applications and advanced machine learning models are creating a fruitful synergy to provide effective and accessible ways of interacting with artworks. Smart audio-guides, personalized art-related content and gamification approaches are just a few examples of how technology can be exploited to provide additional value to artists or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often not enough to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift is present between art images and standard natural image datasets used to train such models. As a result, this can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain. By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions. This augmentation strategy enhances dataset diversity, bridging the gap between natural images and artworks, and improving the alignment of visual cues with knowledge from general-purpose datasets. The generated variations assist in training vision and language models with a deeper understanding of artistic characteristics and that are able to generate better captions with appropriate jargon.
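As a rough sketch of caption-conditioned augmentation, the snippet below uses Hugging Face's `diffusers` img2img pipeline to produce variations of an artwork from its caption. The model id, `strength` value, and file names are assumptions for illustration; the paper does not specify this exact pipeline.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Hypothetical model choice; any caption-conditioned img2img model would do.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

artwork = Image.open("artwork.jpg").convert("RGB").resize((512, 512))
caption = "An oil painting of a harbor at dusk"  # the artwork's caption

# Low strength keeps the composition; each seed yields a new variation.
variations = [
    pipe(prompt=caption, image=artwork, strength=0.4,
         generator=torch.Generator("cuda").manual_seed(s)).images[0]
    for s in range(4)
]
```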

A Time-aware tensor decomposition for tracking evolving patterns

  • paper_url: http://arxiv.org/abs/2308.07126
  • repo_url: None
  • paper_authors: Christos Chatzis, Max Pfeffer, Pedro Lind, Evrim Acar
  • for: Extracting gradually evolving underlying patterns from temporal data.
  • methods: A PARAFAC2-based tensor decomposition augmented with temporal regularization to capture evolving patterns in temporal data.
  • results: Extensive experiments on synthetic data show that tPARAFAC2 captures the underlying evolving patterns more accurately than PARAFAC2 and coupled matrix factorization with temporal smoothness regularization.
    Abstract Time-evolving data sets can often be arranged as a higher-order tensor with one of the modes being the time mode. While tensor factorizations have been successfully used to capture the underlying patterns in such higher-order data sets, the temporal aspect is often ignored, allowing for the reordering of time points. In recent studies, temporal regularizers are incorporated in the time mode to tackle this issue. Nevertheless, existing approaches still do not allow underlying patterns to change in time (e.g., spatial changes in the brain, contextual changes in topics). In this paper, we propose temporal PARAFAC2 (tPARAFAC2): a PARAFAC2-based tensor factorization method with temporal regularization to extract gradually evolving patterns from temporal data. Through extensive experiments on synthetic data, we demonstrate that tPARAFAC2 can capture the underlying evolving patterns accurately performing better than PARAFAC2 and coupled matrix factorization with temporal smoothness regularization.
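The temporal regularization can be pictured as a squared-difference penalty on consecutive slices of the time-evolving factor. Below is a minimal torch sketch that fits a PARAFAC2-like model by gradient descent with such a penalty; for brevity it omits PARAFAC2's cross-product constraint on the evolving factors, and the tensor sizes, rank `R`, and penalty weight `lam` are made-up values.

```python
import torch

# Toy temporal tensor: K time slices of shape (I, J).
K, I, J, R = 20, 30, 25, 4
X = torch.randn(K, I, J)

A = torch.randn(I, R, requires_grad=True)      # shared mode
D = torch.randn(K, R, requires_grad=True)      # per-slice weights
B = torch.randn(K, J, R, requires_grad=True)   # time-evolving mode

opt = torch.optim.Adam([A, D, B], lr=0.01)
lam = 1.0                                       # smoothness strength (assumed)
for step in range(200):
    recon = torch.einsum('ir,kr,kjr->kij', A, D, B)
    fit = ((X - recon) ** 2).sum()
    smooth = ((B[1:] - B[:-1]) ** 2).sum()      # temporal regularizer
    loss = fit + lam * smooth
    opt.zero_grad(); loss.backward(); opt.step()
print(float(fit), float(smooth))
```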

An Outlook into the Future of Egocentric Vision

  • paper_url: http://arxiv.org/abs/2308.07123
  • repo_url: None
  • paper_authors: Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, Tatiana Tommasi
  • for: Exploring the gap between current egocentric vision research and the anticipated future in which wearable computing, with outward-facing cameras and digital overlays, is integrated into everyday life.
  • methods: The paper first envisages the future through character-based stories, showcasing the limitations of current technology through examples; it then maps this future onto previously defined research tasks, surveying for each task its seminal works, state-of-the-art methodologies, and available datasets.
  • results: The paper concludes with recommendations for areas of immediate exploration to unlock the path toward always-on, personalized, and life-enhancing egocentric vision.
    Abstract What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated in our every day lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate explorations so as to unlock our path to the future always-on, personalised and life-enhancing egocentric vision.

On the Importance of Spatial Relations for Few-shot Action Recognition

  • paper_url: http://arxiv.org/abs/2308.07119
  • repo_url: None
  • paper_authors: Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
  • for: This paper targets improving few-shot action recognition in videos by leveraging both spatial and temporal information.
  • methods: The proposed method, called Spatial Alignment Cross Transformer (SA-CT), incorporates a novel spatial alignment mechanism to re-adjust the spatial relations between objects in videos, and integrates temporal information through a Temporal Mixer module.
  • results: The proposed method achieves comparable performance to temporal-based methods on 3/4 benchmarks, and outperforms the state-of-the-art few-shot action recognition methods on 2 benchmarks. Additionally, the authors exploit large-scale pretrained models for few-shot action recognition and provide useful insights for this research direction.
    Abstract Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer knowledge from a source dataset to a novel target dataset with only one or a few labeled videos. However, existing methods mainly focus on modeling the temporal relations between the query and support videos while ignoring the spatial relations. In this paper, we find that the spatial misalignment between objects also occurs in videos, notably more common than the temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. Particularly, a novel Spatial Alignment Cross Transformer (SA-CT) which learns to re-adjust the spatial relations and incorporates the temporal information is contributed. Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal based methods on 3/4 benchmarks. To further incorporate the temporal information, we propose a simple yet effective Temporal Mixer module. The Temporal Mixer enhances the video representation and improves the performance of the full SA-CT model, achieving very competitive results. In this work, we also exploit large-scale pretrained models for few-shot action recognition, providing useful insights for this research direction.

SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers

  • paper_url: http://arxiv.org/abs/2308.07110
  • repo_url: None
  • paper_authors: Xijun Wang, Xiaojie Chu, Chunrui Han, Xiangyu Zhang
  • for: This paper presents a module called Spatial Cross-scale Convolution (SCSC) that improves the performance of both Convolutional Neural Networks (CNNs) and Transformers.
  • methods: The SCSC module uses an efficient spatial cross-scale encoder and spatial embed module to capture a variety of features in one layer, addressing the issues of large dense kernels and self-attention in existing architectures.
  • results: The SCSC module is shown to improve the performance of various base models on the face recognition task and ImageNet classification task, with 2.7% and 5.3% improvement in accuracy, respectively, while reducing the number of parameters and FLOPs by 79% and 22%, respectively. Additionally, a traditional network embedded with SCSC can match the performance of Swin Transformer.
    Abstract This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. Nowadays, CNNs and Transformers have been successful in a variety of tasks. Especially for Transformers, increasing works achieve state-of-the-art performance in the computer vision community. Therefore, researchers start to explore the mechanism of those architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weight have been considered keys to designing effective base models. However, there are still some issues to be addressed: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Inspired by the above analyses and to solve the mentioned problems, in this paper, we design a general module taking in these design keys to enhance both CNNs and Transformers. SCSC introduces an efficient spatial cross-scale encoder and spatial embed module to capture assorted features in one layer. On the face recognition task, FaceResNet with SCSC can improve 2.7% with 68% fewer FLOPs and 79% fewer parameters. On the ImageNet classification task, Swin Transformer with SCSC can achieve even better performance with 22% fewer FLOPs, and ResNet with CSCS can improve 5.3% with similar complexity. Furthermore, a traditional network (e.g., ResNet) embedded with SCSC can match Swin Transformer's performance.

Checklist to Transparently Define Test Oracles for TP, FP, and FN Objects in Automated Driving

  • paper_url: http://arxiv.org/abs/2308.07106
  • repo_url: https://github.com/michael-hoss/paper-oracle-definition
  • paper_authors: Michael Hoss
  • for: Providing a checklist for transparently defining test oracles for the perception subsystem of driving automation systems.
  • methods: The checklist covers the functional aspects and implementation details that affect test oracle behavior.
  • results: The checklist helps practitioners maximize the transparency of their oracles, making statements on object perception more reliable and comparable.
    Abstract Popular test oracles for the perception subsystem of driving automation systems identify true-positive (TP), false-positive (FP), and false-negative (FN) objects. Oracle transparency is needed for comparing test results and for safety cases. To date, there exists a common notion of TPs, FPs, and FNs in the field, but apparently no published way to comprehensively define their oracles. Therefore, this paper provides a checklist of functional aspects and implementation details that affect the oracle behavior. Besides labeling policies of the test set, we cover fields of view, occlusion handling, safety-relevant areas, matching criteria, temporal and probabilistic issues, and further aspects. Even though our checklist can hardly be formalized, it can help practitioners maximize the transparency of their oracles, which, in turn, makes statements on object perception more reliable and comparable.
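As a concrete example of the implementation details such an oracle must make explicit, the sketch below is a minimal IoU-based matcher that yields TP/FP/FN counts. The IoU threshold, greedy matching order, and box format are precisely the kinds of oracle choices the checklist asks to be documented; the values here are illustrative, not prescribed by the paper.

```python
def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match(preds, gts, thr=0.5):
    """Greedy one-to-one matching; thr and the greedy order are oracle
    choices that the checklist asks implementers to state explicitly."""
    unmatched_gt = set(range(len(gts)))
    tp, fp = 0, 0
    for p in preds:
        best, best_iou = None, thr
        for g in unmatched_gt:
            v = iou(p, gts[g])
            if v >= best_iou:
                best, best_iou = g, v
        if best is None:
            fp += 1
        else:
            tp += 1
            unmatched_gt.remove(best)
    fn = len(unmatched_gt)
    return tp, fp, fn

preds = [[0, 0, 10, 10], [20, 20, 30, 30]]
gts = [[1, 1, 11, 11], [50, 50, 60, 60]]
print(match(preds, gts))  # -> (1, 1, 1)
```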

FocusFlow: Boosting Key-Points Optical Flow Estimation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.07104
  • repo_url: https://github.com/zhonghuayi/focusflow_official
  • paper_authors: Zhonghua Yi, Hao Shi, Kailun Yang, Qi Jiang, Yaozu Ye, Ze Wang, Kaiwei Wang
  • for: Improving the accuracy and robustness of data-driven optical flow estimation, especially at key points.
  • methods: A points-based modeling method that makes the model learn key-point-related priors explicitly, with a conditioned controlling mechanism that focuses attention on key points. On top of this, a mix loss function combines a classic photometric loss with the proposed Conditional Point Control Loss (CPCL) for diverse point-wise supervision.
  • results: Compared with the original models, FocusFlow achieves up to +44.5% precision improvement on various key points such as ORB, SIFT, and the learning-based SiLK, with excellent scalability and flexibility across existing data-driven optical flow methods.
    Abstract Key-point-based scene understanding is fundamental for autonomous driving applications. At the same time, optical flow plays an important role in many vision tasks. However, due to the implicit bias of equal attention on all points, classic data-driven optical flow estimation methods yield less satisfactory performance on key points, limiting their implementations in key-point-critical safety-relevant scenarios. To address these issues, we introduce a points-based modeling method that requires the model to learn key-point-related priors explicitly. Based on the modeling method, we present FocusFlow, a framework consisting of 1) a mix loss function combined with a classic photometric loss function and our proposed Conditional Point Control Loss (CPCL) function for diverse point-wise supervision; 2) a conditioned controlling model which substitutes the conventional feature encoder by our proposed Condition Control Encoder (CCE). CCE incorporates a Frame Feature Encoder (FFE) that extracts features from frames, a Condition Feature Encoder (CFE) that learns to control the feature extraction behavior of FFE from input masks containing information of key points, and fusion modules that transfer the controlling information between FFE and CFE. Our FocusFlow framework shows outstanding performance with up to +44.5% precision improvement on various key points such as ORB, SIFT, and even learning-based SiLK, along with exceptional scalability for most existing data-driven optical flow methods like PWC-Net, RAFT, and FlowFormer. Notably, FocusFlow yields competitive or superior performances rivaling the original models on the whole frame. The source code will be available at https://github.com/ZhonghuaYi/FocusFlow_official.
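One plausible reading of point-wise supervision conditioned on a key-point mask is a re-weighted endpoint-error loss, sketched below in torch. The weighting form and the factor `alpha` are assumptions for illustration; the paper's actual CPCL conditions the model itself on the mask rather than only re-weighting the loss.

```python
import torch

def keypoint_weighted_flow_loss(pred, target, kp_mask, alpha=10.0):
    """Endpoint error with extra weight at key points.

    pred, target: (B, 2, H, W) flow fields; kp_mask: (B, 1, H, W) in {0, 1}.
    alpha is an assumed up-weighting factor, not the paper's exact CPCL."""
    epe = torch.norm(pred - target, dim=1, keepdim=True)  # (B, 1, H, W)
    weight = 1.0 + alpha * kp_mask
    return (weight * epe).mean()

B, H, W = 2, 64, 64
pred = torch.randn(B, 2, H, W)
target = torch.randn(B, 2, H, W)
mask = (torch.rand(B, 1, H, W) < 0.02).float()  # sparse ORB/SIFT-like key points
print(keypoint_weighted_flow_loss(pred, target, mask))
```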

Masked Motion Predictors are Strong 3D Action Representation Learners

  • paper_url: http://arxiv.org/abs/2308.07092
  • repo_url: None
  • paper_authors: Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang, Houqiang Li
  • for: Proposing an effective self-supervised pre-training method to improve 3D human action recognition.
  • methods: The Masked Motion Prediction (MAMP) framework takes a masked spatio-temporal skeleton sequence and predicts the temporal motion of the masked human joints. Exploiting the high temporal redundancy of skeleton sequences, motion information also serves as an empirical semantic-richness prior that guides the masking process toward semantically rich temporal regions.
  • results: With MAMP pre-training, a vanilla transformer achieves state-of-the-art results on NTU-60, NTU-120, and PKU-MMD without bells and whistles. Source code is available at https://github.com/maoyunyao/MAMP.
    Abstract In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task to perform masked self-component reconstruction in human joints, explicit contextual motion modeling is key to the success of learning effective feature representation for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an empirical semantic richness prior that guide the masking process, promoting better attention to semantically rich temporal regions. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of our MAMP is available at https://github.com/maoyunyao/MAMP.
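A minimal sketch of how masked-motion-prediction inputs and targets might be constructed is given below: temporal motion is the frame-to-frame joint difference, and per-frame motion energy biases which frames get masked. The sampling scheme and mask ratio are simplified stand-ins for the paper's procedure.

```python
import torch

def mamp_targets(skeleton, mask_ratio=0.6):
    """Build masked-motion-prediction inputs/targets.

    skeleton: (T, J, 3) joint coordinates. Motion magnitude acts as the
    'semantic richness' prior that biases which frames are masked; this
    sampling scheme is a simplified stand-in for the paper's."""
    motion = skeleton[1:] - skeleton[:-1]              # (T-1, J, 3) targets
    saliency = motion.norm(dim=-1).mean(dim=-1)        # per-frame motion energy
    probs = saliency / saliency.sum()
    n_mask = int(mask_ratio * (len(skeleton) - 1))
    idx = torch.multinomial(probs, n_mask, replacement=False)
    masked = skeleton.clone()
    masked[idx] = 0.0                                  # zero out masked frames
    return masked, motion, idx                         # encoder input, target, mask

T, J = 50, 25
skel = torch.cumsum(torch.randn(T, J, 3) * 0.01, dim=0)
inp, tgt, idx = mamp_targets(skel)
print(inp.shape, tgt.shape, idx.shape)
```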

ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.07078
  • repo_url: None
  • paper_authors: Chaohui Yu, Qiang Zhou, Zhibin Wang, Fan Wang
  • for: Improving multimodal alignment in semantic segmentation to better transfer CLIP knowledge.
  • methods: Two improvements to multimodal alignment: instance-conditioned dynamic prompting that better exploits the text encoder, and an align-guided contrastive loss that refines the alignment of vision and text embeddings.
  • results: Extensive experiments on three large-scale datasets (ADE20K, COCO-Stuff10k, and ADE20K-Full) show consistent gains across backbones; with ResNet-50, ICPC improves mIoU by 1.71%, 1.05%, and 1.41% on the three datasets, respectively.
    Abstract Modern supervised semantic segmentation methods are usually finetuned based on the supervised or self-supervised models pre-trained on ImageNet. Recent work shows that transferring the knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance. The performance boost comes from the feature enhancement with multimodal alignment, i.e., the dot product between vision and text embeddings. However, how to improve the multimodal alignment for better transfer performance in dense tasks remains underexplored. In this work, we focus on improving the quality of vision-text alignment from two aspects of prompting design and loss function, and present an instance-conditioned prompting with contrastive learning (ICPC) framework. First, compared with the static prompt designs, we reveal that dynamic prompting conditioned on image content can more efficiently utilize the text encoder for complex dense tasks. Second, we propose an align-guided contrastive loss to refine the alignment of vision and text embeddings. We further propose lightweight multi-scale alignment for better performance. Extensive experiments on three large-scale datasets (ADE20K, COCO-Stuff10k, and ADE20K-Full) demonstrate that ICPC brings consistent improvements across diverse backbones. Taking ResNet-50 as an example, ICPC outperforms the state-of-the-art counterpart by 1.71%, 1.05%, and 1.41% mIoU on the three datasets, respectively.
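The align-guided objective can be sketched as an InfoNCE-style loss between region embeddings and class-text embeddings, as below. The temperature `tau` and tensor shapes are assumptions; the actual ICPC loss and its multi-scale variant differ in detail.

```python
import torch
import torch.nn.functional as F

def align_contrastive_loss(vis, txt, labels, tau=0.07):
    """InfoNCE-style vision-text alignment.

    vis: (N, D) region/pixel embeddings; txt: (C, D) class-text embeddings
    (e.g., from CLIP's text encoder with instance-conditioned prompts);
    labels: (N,) ground-truth class per region. tau is an assumed temperature."""
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = vis @ txt.T / tau          # (N, C) cosine similarities
    return F.cross_entropy(logits, labels)

N, C, D = 128, 20, 512
loss = align_contrastive_loss(torch.randn(N, D), torch.randn(C, D),
                              torch.randint(0, C, (N,)))
print(loss)
```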

Teeth And Root Canals Segmentation Using ZXYFormer With Uncertainty Guidance And Weight Transfer

  • paper_url: http://arxiv.org/abs/2308.07072
  • repo_url: None
  • paper_authors: Shangxuan Li, Yu Du, Li Ye, Chichi Li, Yanshu Fang, Cheng Wang, Wu Zhou
  • for: Segmenting teeth and root canals simultaneously from CBCT images, a task with several notable challenges.
  • methods: A coarse-to-fine segmentation method based on an inverse feature-fusion transformer and uncertainty estimation.
  • results: Combined tooth and root-canal segmentation experiments on 157 clinical high-resolution CBCT scans show the method outperforms existing tooth or root-canal segmentation approaches.
    Abstract This study attempts to segment teeth and root-canals simultaneously from CBCT images, but there are very challenging problems in this process. First, the clinical CBCT image data is very large (e.g., 672 *688 * 688), and the use of downsampling operation will lose useful information about teeth and root canals. Second, teeth and root canals are very different in morphology, and it is difficult for a simple network to identify them precisely. In addition, there are weak edges at the tooth, between tooth and root canal, which makes it very difficult to segment such weak edges. To this end, we propose a coarse-to-fine segmentation method based on inverse feature fusion transformer and uncertainty estimation to address above challenging problems. First, we use the downscaled volume data (e.g., 128 * 128 * 128) to conduct coarse segmentation and map it to the original volume to obtain the area of teeth and root canals. Then, we design a transformer with reverse feature fusion, which can bring better segmentation effect of different morphological objects by transferring deeper features to shallow features. Finally, we design an auxiliary branch to calculate and refine the difficult areas in order to improve the weak edge segmentation performance of teeth and root canals. Through the combined tooth and root canal segmentation experiment of 157 clinical high-resolution CBCT data, it is verified that the proposed method is superior to the existing tooth or root canal segmentation methods.

A Local Iterative Approach for the Extraction of 2D Manifolds from Strongly Curved and Folded Thin-Layer Structures

  • paper_url: http://arxiv.org/abs/2308.07070
  • repo_url: None
  • paper_authors: Nicolas Klenert, Verena Lepper, Daniel Baum
  • for: Analyzing ancient rolled and folded thin-layer structures such as papyrus, parchment, paper, and silver and lead sheets from micro-computed tomography (Micro-CT) image data.
  • methods: A novel method for extracting 2D manifolds based on a local fast marching scheme combined with separating the covered region into two sub-regions.
  • results: The applicability and robustness of the method are demonstrated on both artificial and real-world data.
    Abstract Ridge surfaces represent important features for the analysis of 3-dimensional (3D) datasets in diverse applications and are often derived from varying underlying data including flow fields, geological fault data, and point data, but they can also be present in the original scalar images acquired using a plethora of imaging techniques. Our work is motivated by the analysis of image data acquired using micro-computed tomography (Micro-CT) of ancient, rolled and folded thin-layer structures such as papyrus, parchment, and paper as well as silver and lead sheets. From these documents we know that they are 2-dimensional (2D) in nature. Hence, we are particularly interested in reconstructing 2D manifolds that approximate the document's structure. The image data from which we want to reconstruct the 2D manifolds are often very noisy and represent folded, densely-layered structures with many artifacts, such as ruptures or layer splitting and merging. Previous ridge-surface extraction methods fail to extract the desired 2D manifold for such challenging data. We have therefore developed a novel method to extract 2D manifolds. The proposed method uses a local fast marching scheme in combination with a separation of the region covered by fast marching into two sub-regions. The 2D manifold of interest is then extracted as the surface separating the two sub-regions. The local scheme can be applied for both automatic propagation as well as interactive analysis. We demonstrate the applicability and robustness of our method on both artificial data as well as real-world data including folded silver and papyrus sheets.

Survey on video anomaly detection in dynamic scenes with moving cameras

  • paper_url: http://arxiv.org/abs/2308.07050
  • repo_url: None
  • paper_authors: Runyu Jiao, Yi Wan, Fabio Poiesi, Yiming Wang
  • for: Providing the first comprehensive survey on video anomaly detection in dynamic scenes recorded by moving cameras (MC-VAD).
  • methods: The survey critically assesses the research papers related to MC-VAD, covering three application domains (security, urban transportation, and marine environments) and six specific tasks, and compiles 25 publicly available datasets spanning underwater, water-surface, ground, and aerial environments.
  • results: The survey identifies the limitations and challenges of existing methods and proposes future research directions and novel contributions that could advance the field.
    Abstract The increasing popularity of compact and inexpensive cameras, e.g.~dash cameras, body cameras, and cameras equipped on robots, has sparked a growing interest in detecting anomalies within dynamic scenes recorded by moving cameras. However, existing reviews primarily concentrate on Video Anomaly Detection (VAD) methods assuming static cameras. The VAD literature with moving cameras remains fragmented, lacking comprehensive reviews to date. To address this gap, we endeavor to present the first comprehensive survey on Moving Camera Video Anomaly Detection (MC-VAD). We delve into the research papers related to MC-VAD, critically assessing their limitations and highlighting associated challenges. Our exploration encompasses three application domains: security, urban transportation, and marine environments, which in turn cover six specific tasks. We compile an extensive list of 25 publicly-available datasets spanning four distinct environments: underwater, water surface, ground, and aerial. We summarize the types of anomalies these datasets correspond to or contain, and present five main categories of approaches for detecting such anomalies. Lastly, we identify future research directions and discuss novel contributions that could advance the field of MC-VAD. With this survey, we aim to offer a valuable reference for researchers and practitioners striving to develop and advance state-of-the-art MC-VAD methods.

An Inherent Trade-Off in Noisy Neural Communication with Rank-Order Coding

  • paper_url: http://arxiv.org/abs/2308.07034
  • repo_url: None
  • paper_authors: Ibrahim Alsolami, Tomoki Fukai
  • for: Studying rank-order coding, a temporal coding scheme proposed to explain the rapid processing ability of the mammalian brain.
  • methods: Analyzing rank-order coding under noise, deriving the fundamentally achievable information rates and the trade-offs at stake.
  • results: An unexpected finding is a special class of errors that, in a certain regime, increase as noise decreases.
    Abstract Rank-order coding, a form of temporal coding, has emerged as a promising scheme to explain the rapid ability of the mammalian brain. Owing to its speed as well as efficiency, rank-order coding is increasingly gaining interest in diverse research areas beyond neuroscience. However, much uncertainty still exists about the performance of rank-order coding under noise. Herein we show what information rates are fundamentally possible and what trade-offs are at stake. An unexpected finding in this paper is the emergence of a special class of errors that, in a regime, increase with less noise.

S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields

  • paper_url: http://arxiv.org/abs/2308.07032
  • repo_url: https://github.com/madaoer/s3im_nerf
  • paper_authors: Zeke Xie, Xindi Yang, Yujie Yang, Qi Sun, Yixiang Jiang, Haoran Wang, Yunfeng Cai, Mingming Sun
  • for: Improving NeRF and related neural field methods (e.g., neural surface representations) for novel-view synthesis.
  • methods: A nonlocal multiplex training paradigm built on a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of independently.
  • results: S3IM reduces the test MSE loss of TensoRF and DVGO by more than 90%, yields a 198% F-score gain and a 64% Chamfer $L_{1}$ distance reduction for NeuS, and remains robust with sparse inputs, corrupted images, and dynamic scenes.
    Abstract Recently, Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation with only posed RGB images. NeRF and relevant neural field methods (e.g., neural surface representation) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and relevant neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of process multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representation for nearly free. The improvements of quality metrics can be particularly significant for those relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks; a 198% F-score gain and a 64% Chamfer $L_{1}$ distance reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes.
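A minimal torch sketch of the stochastic-structural idea: randomly drawn rays from a batch are tiled into a pseudo-image so that SSIM can see cross-ray structure. The simple pooled SSIM, patch size, and repeat count below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01**2, c2=0.03**2):
    # Simple SSIM with 4x4 average-pooling windows; inputs in [0,1], (B,C,H,W).
    mu_x, mu_y = F.avg_pool2d(x, 4), F.avg_pool2d(y, 4)
    var_x = F.avg_pool2d(x * x, 4) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 4) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 4) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def s3im_loss(pred_rgb, gt_rgb, patch=64, repeats=10):
    """Stochastic structural similarity over randomly grouped rays.

    pred_rgb, gt_rgb: (N, 3) per-ray colors from a NeRF batch. Random rays
    are tiled into a pseudo-image so SSIM captures cross-ray structure;
    patch/repeats follow the spirit, not the exact settings, of the paper."""
    loss = 0.0
    for _ in range(repeats):
        idx = torch.randperm(pred_rgb.shape[0])[: patch * patch]
        p = pred_rgb[idx].T.reshape(1, 3, patch, patch)
        g = gt_rgb[idx].T.reshape(1, 3, patch, patch)
        loss = loss + (1.0 - ssim(p, g))
    return loss / repeats

pred = torch.rand(8192, 3); gt = torch.rand(8192, 3)
print(s3im_loss(pred, gt))
```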

AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.07026
  • repo_url: https://github.com/cgcl-codes/advclip
  • paper_authors: Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, Hai Jin
  • for: Examining the security of publicly available cross-modal pre-trained encoders such as CLIP, which are trained on vast unlabeled image-text data and underpin many complex downstream tasks.
  • methods: A topological graph structure captures the relevant positions between target samples and their neighbors, and a topology-deviation-based generative adversarial network generates a universal adversarial patch.
  • results: Adding the patch to images minimizes their embedding similarity to the other modality and perturbs the sample distribution in feature space, achieving universal non-targeted attacks.
    Abstract Multimodal contrastive learning aims to train a general-purpose feature extractor, such as CLIP, on vast amounts of raw, unlabeled paired image-text data. This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of cross-modal pre-trained encoder has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. AdvCLIP aims to construct a universal adversarial patch for a set of natural images that can fool all the downstream tasks inheriting the victim cross-modal pre-trained encoder. To address the challenges of heterogeneity between different modalities and unknown downstream tasks, we first build a topological graph structure to capture the relevant positions between target samples and their neighbors. Then, we design a topology-deviation based generative adversarial network to generate a universal adversarial patch. By adding the patch to images, we minimize their embeddings similarity to different modality and perturb the sample distribution in the feature space, achieving unviersal non-targeted attacks. Our results demonstrate the excellent attack performance of AdvCLIP on two types of downstream tasks across eight datasets. We also tailor three popular defenses to mitigate AdvCLIP, highlighting the need for new defense mechanisms to defend cross-modal pre-trained encoders.
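To make the universal non-targeted objective concrete, the sketch below directly optimizes a single patch that pushes patched-image embeddings away from their clean embeddings under a frozen encoder. The toy linear encoder stands in for a pre-trained CLIP image encoder, and direct optimization replaces the paper's generator network and topology loss; the patch placement and sizes are arbitrary.

```python
import torch

# Placeholder frozen encoder standing in for a pre-trained CLIP image encoder;
# swap in a real encode_image in practice.
img_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
for p in img_enc.parameters():
    p.requires_grad_(False)

images = torch.rand(16, 3, 64, 64)                  # surrogate natural images
patch = torch.zeros(3, 16, 16, requires_grad=True)  # universal adversarial patch
opt = torch.optim.Adam([patch], lr=0.05)

with torch.no_grad():
    clean = torch.nn.functional.normalize(img_enc(images), dim=-1)

for step in range(100):
    adv = images.clone()
    adv[:, :, :16, :16] = torch.clamp(patch, 0, 1)  # paste patch (fixed corner)
    emb = torch.nn.functional.normalize(img_enc(adv), dim=-1)
    # Non-targeted: push patched embeddings away from their clean embeddings.
    loss = (emb * clean).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```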

PGT-Net: Progressive Guided Multi-task Neural Network for Small-area Wet Fingerprint Denoising and Recognition

  • paper_url: http://arxiv.org/abs/2308.07024
  • repo_url: None
  • paper_authors: Yu-Ting Li, Ching-Te Chiu, An-Ting Hsieh, Mao-Hsiu Hsu, Long Wenyong, Jui-Min Hsu
  • for: Improving fingerprint recognition accuracy, particularly for small-area wet fingerprints.
  • methods: An end-to-end trainable Progressive Guided multi-Task neural Network (PGT-Net) with a shared stage and specific multi-task stages, enabling the network to train on binary and non-binary fingerprint images sequentially; a novel residual scaling mechanism stabilizes training.
  • results: PGT-Net excels at wet-fingerprint denoising and markedly improves recognition: on the FT-lightnoised dataset the FRR drops from 17.75% to 4.47%, and on FW9395 from 9.45% to 1.09%.
    Abstract Fingerprint recognition on mobile devices is an important method for identity verification. However, real fingerprints usually contain sweat and moisture, which leads to poor recognition performance. In addition, to roll out slimmer and thinner phones, technology companies reduce the size of recognition sensors by embedding them in the power button. The limited size of fingerprint data therefore further increases the difficulty of recognition. Denoising small-area wet fingerprint images into clean ones thus becomes crucial for improving recognition performance. In this paper, we propose an end-to-end trainable progressive guided multi-task neural network (PGT-Net). The PGT-Net includes a shared stage and specific multi-task stages, enabling the network to train binary and non-binary fingerprints sequentially. The binary information is regarded as guidance for output enhancement, which is enriched with ridge and valley details. Moreover, a novel residual scaling mechanism is introduced to stabilize the training process. Experimental results on the FW9395 and FT-lightnoised datasets provided by FocalTech show that PGT-Net has promising performance on wet-fingerprint denoising and significantly reduces the false rejection rate (FRR). On the FT-lightnoised dataset, the FRR of fingerprint recognition is reduced from 17.75% to 4.47%. On the FW9395 dataset, it is reduced from 9.45% to 1.09%.

Contrastive Bi-Projector for Unsupervised Domain Adaption

  • paper_url: http://arxiv.org/abs/2308.07017
  • repo_url: https://github.com/tom99763/Contrastive-Bi-Projector-for-Unsupervised-Domain-Adaption
  • paper_authors: Lin-Chieh Huang, Hung-Hsu Tsai
  • for: The paper proposes a novel unsupervised domain adaptation (UDA) method called CBPUDA, which improves existing UDA methods by reducing the generation of ambiguous features for classification and domain adaptation.
  • methods: CBPUDA uses contrastive bi-projectors (CBP) to train feature extractors (FEs) adversarially, obtaining more refined decision boundaries and powerful classification performance. The proposed contrastive discrepancy (CD) loss is analyzed for its properties, including an upper bound of joint prediction entropy and a gradient scaling (GS) scheme to overcome instability.
  • results: CBPUDA is shown to be superior to conventional UDA methods for UDA and fine-grained UDA tasks, achieving better performance in classification and domain adaptation.
    Abstract This paper proposes a novel unsupervised domain adaption (UDA) method based on contrastive bi-projector (CBP), which can improve the existing UDA methods. It is called CBPUDA here, which effectively promotes the feature extractors (FEs) to reduce the generation of ambiguous features for classification and domain adaption. The CBP differs from traditional bi-classifier-based methods at that these two classifiers are replaced with two projectors of performing a mapping from the input feature to two distinct features. These two projectors and the FEs in the CBPUDA can be trained adversarially to obtain more refined decision boundaries so that it can possess powerful classification performance. Two properties of the proposed loss function are analyzed here. The first property is to derive an upper bound of joint prediction entropy, which is used to form the proposed loss function, contrastive discrepancy (CD) loss. The CD loss takes the advantages of the contrastive learning and the bi-classifier. The second property is to analyze the gradient of the CD loss and then overcome the drawback of the CD loss. The result of the second property is utilized in the development of the gradient scaling (GS) scheme in this paper. The GS scheme can be exploited to tackle the unstable problem of the CD loss because training the CBPUDA requires using contrastive learning and adversarial learning at the same time. Therefore, using the CD loss with the GS scheme overcomes the problem mentioned above to make features more compact for intra-class and distinguishable for inter-class. Experimental results express that the CBPUDA is superior to conventional UDA methods under consideration in this paper for UDA and fine-grained UDA tasks.

HPFormer: Hyperspectral image prompt object tracking

  • paper_url: http://arxiv.org/abs/2308.07016
  • repo_url: None
  • paper_authors: Yuedong Tan
  • for: Improving visual tracking performance by exploiting hyperspectral imagery.
  • methods: A Transformer-based tracker whose core is a Hyperspectral Hybrid Attention (HHA) module that unifies feature extraction and fusion through token interactions, plus a Transform Band Module (TBM) that selectively aggregates spatial details and spectral signatures from the full hyperspectral input to inject informative target representations.
  • results: State-of-the-art performance on benchmark NIR and VIS tracking datasets, offering a new avenue for combining transformers with hyperspectral fusion in object tracking.
    Abstract Hyperspectral imagery contains abundant spectral information beyond the visible RGB bands, providing rich discriminative details about objects in a scene. Leveraging such data has the potential to enhance visual tracking performance. While prior hyperspectral trackers employ CNN or hybrid CNN-Transformer architectures, we propose a novel approach HPFormer on Transformers to capitalize on their powerful representation learning capabilities. The core of HPFormer is a Hyperspectral Hybrid Attention (HHA) module which unifies feature extraction and fusion within one component through token interactions. Additionally, a Transform Band Module (TBM) is introduced to selectively aggregate spatial details and spectral signatures from the full hyperspectral input for injecting informative target representations. Extensive experiments demonstrate state-of-the-art performance of HPFormer on benchmark NIR and VIS tracking datasets. Our work provides new insights into harnessing the strengths of transformers and hyperspectral fusion to advance robust object tracking.

ACTIVE: Towards Highly Transferable 3D Physical Camouflage for Universal and Robust Vehicle Evasion

  • paper_url: http://arxiv.org/abs/2308.07009
  • repo_url: None
  • paper_authors: Naufal Suryanto, Yongsu Kim, Harashta Tatimma Larasati, Hyoeun Kang, Thi-Thu-Huong Le, Yoonyoung Hong, Hunmin Yang, Se-Yoon Oh, Howon Kim
  • for: Attacking object detectors with universal and robust adversarial camouflage that conceals any 3D vehicle from detection.
  • methods: A refined texture rendering applies a common texture to different vehicles without being constrained to a specific texture map; a novel stealth loss renders the vehicle undetectable; and a smooth-and-camouflage loss enhances the naturalness of the adversarial camouflage.
  • results: Extensive experiments on 15 different models show that ACTIVE consistently outperforms existing works on various public detectors, including the latest YOLOv7, with promising transferability to other vehicle classes, tasks (segmentation models), and the real world.
    Abstract Adversarial camouflage has garnered attention for its ability to attack object detectors from any viewpoint by covering the entire object's surface. However, universality and robustness in existing methods often fall short as the transferability aspect is often overlooked, thus restricting their application only to a specific target with limited performance. To address these challenges, we present Adversarial Camouflage for Transferable and Intensive Vehicle Evasion (ACTIVE), a state-of-the-art physical camouflage attack framework designed to generate universal and robust adversarial camouflage capable of concealing any 3D vehicle from detectors. Our framework incorporates innovative techniques to enhance universality and robustness, including a refined texture rendering that enables common texture application to different vehicles without being constrained to a specific texture map, a novel stealth loss that renders the vehicle undetectable, and a smooth and camouflage loss to enhance the naturalness of the adversarial camouflage. Our extensive experiments on 15 different models show that ACTIVE consistently outperforms existing works on various public detectors, including the latest YOLOv7. Notably, our universality evaluations reveal promising transferability to other vehicle classes, tasks (segmentation models), and the real world, not just other vehicles.

Deepbet: Fast brain extraction of T1-weighted MRI using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07003
  • repo_url: None
  • paper_authors: Lukas Fisch, Stefan Zumdick, Carlotta Barkhau, Daniel Emden, Jan Ernsting, Ramona Leenings, Kelvin Sarink, Nils R. Winter, Benjamin Risse, Udo Dannlowski, Tim Hahn
  • for: Developing a fast, high-precision brain extraction tool for the segmentation step of many neuroimaging preprocessing pipelines.
  • methods: A unique dataset of 568 T1-weighted (T1w) MR images from 191 studies is combined with cutting-edge deep learning: deepbet uses LinkNet, a modern UNet architecture, in a two-stage prediction process.
  • results: Compared with current state-of-the-art models (DSC = 97.8% and DSC = 97.9%), deepbet sets a new state of the art with a median Dice score of 99.0% in cross-validation on unseen datasets, keeps the Dice score above 96.9% for all samples, and accelerates brain extraction by a factor of ~10, processing one image in ~2 seconds on low-end hardware.
    Abstract Brain extraction in magnetic resonance imaging (MRI) data is an important segmentation step in many neuroimaging preprocessing pipelines. Image segmentation is one of the research fields in which deep learning had the biggest impact in recent years enabling high precision segmentation with minimal compute. Consequently, traditional brain extraction methods are now being replaced by deep learning-based methods. Here, we used a unique dataset comprising 568 T1-weighted (T1w) MR images from 191 different studies in combination with cutting edge deep learning methods to build a fast, high-precision brain extraction tool called deepbet. deepbet uses LinkNet, a modern UNet architecture, in a two stage prediction process. This increases its segmentation performance, setting a novel state-of-the-art performance during cross-validation with a median Dice score (DSC) of 99.0% on unseen datasets, outperforming current state of the art models (DSC = 97.8% and DSC = 97.9%). While current methods are more sensitive to outliers, resulting in Dice scores as low as 76.5%, deepbet manages to achieve a Dice score of > 96.9% for all samples. Finally, our model accelerates brain extraction by a factor of ~10 compared to current methods, enabling the processing of one image in ~2 seconds on low level hardware.

Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing

  • paper_url: http://arxiv.org/abs/2308.06998
  • repo_url: https://github.com/it-hao/mitnet
  • paper_authors: Hao Shen, Zhong-Qiu Zhao, Yulun Zhang, Zhao Zhang
  • for: Image dehazing, decomposed into multiple more tractable sub-tasks that progressively estimate the latent haze-free image.
  • methods: MITNet, a two-stage architecture built on spatial-frequency dual-domain information: an amplitude-guided haze-removal stage recovers the amplitude spectrum, a phase-guided stage learns the transformation and refinement of the phase spectrum, and an Adaptive Triple Interaction Module (ATIM) aggregates cross-domain, cross-scale, and cross-stage features between the stages.
  • results: Extensive experiments on multiple public datasets show that MITNet delivers superior performance with lower model complexity.
    Abstract Multi-stage architectures have exhibited efficacy in image dehazing, which usually decomposes a challenging task into multiple more tractable sub-tasks and progressively estimates latent haze-free images. Despite the remarkable progress, existing methods still suffer from the following shortcomings: (1) limited exploration of frequency domain information; (2) insufficient information interaction; (3) severe feature redundancy. To remedy these issues, we propose a novel Mutual Information-driven Triple interaction Network (MITNet) based on spatial-frequency dual domain information and a two-stage architecture. To be specific, the first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal. The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum. To facilitate the information exchange between the two stages, an Adaptive Triple Interaction Module (ATIM) is developed to simultaneously aggregate cross-domain, cross-scale, and cross-stage features, where the fused features are further used to generate content-adaptive dynamic filters that are applied to enhance the global context representation. In addition, we impose a mutual information minimization constraint on paired scale encoder and decoder features from both stages. Such an operation can effectively reduce information redundancy and enhance cross-stage feature complementarity. Extensive experiments on multiple public datasets show that our MITNet achieves superior performance with lower model complexity. The code and models are available at https://github.com/it-hao/MITNet.
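The amplitude/phase split that drives the two stages is just a Fourier decomposition, as the torch sketch below shows; the 0.9 scaling stands in for a learned amplitude prediction and is purely illustrative.

```python
import torch

def amplitude_phase(img):
    """Split an image (B, C, H, W) into Fourier amplitude and phase."""
    spec = torch.fft.fft2(img)
    return spec.abs(), spec.angle()

def recombine(amp, phase):
    # Rebuild the spatial image from amplitude and phase.
    return torch.fft.ifft2(torch.polar(amp, phase)).real

hazy = torch.rand(1, 3, 64, 64)
amp, phase = amplitude_phase(hazy)

# Stage 1 would predict a haze-free amplitude; here we just perturb it.
amp_dehazed = amp * 0.9
coarse = recombine(amp_dehazed, phase)  # stage-2 input: refine phase/structure
print(coarse.shape)
```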

PatchContrast: Self-Supervised Pre-training for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.06985
  • repo_url: None
  • paper_authors: Oren Shrout, Ori Nitzan, Yizhak Ben-Shabat, Ayellet Tal
  • for: Accurately detecting objects is a key challenge for autonomous vehicles, yet annotated detection data is expensive and time-consuming to obtain; PatchContrast is a novel self-supervised point-cloud pre-training framework for 3D object detection.
  • methods: Two levels of abstraction learn discriminative representations from unlabeled data: the proposal level localizes objects relative to their surroundings, while the patch level adds information about the internal connections between an object's components, distinguishing objects by their individual parts. Both levels can be integrated into self-supervised pre-training for various backbones to enhance the downstream 3D detection task.
  • results: The method outperforms existing state-of-the-art models on three commonly used 3D detection datasets.
    Abstract Accurately detecting objects in the environment is a key challenge for autonomous vehicles. However, obtaining annotated data for detection is expensive and time-consuming. We introduce PatchContrast, a novel self-supervised point cloud pre-training framework for 3D object detection. We propose to utilize two levels of abstraction to learn discriminative representation from unlabeled data: proposal-level and patch-level. The proposal-level aims at localizing objects in relation to their surroundings, whereas the patch-level adds information about the internal connections between the object's components, hence distinguishing between different objects based on their individual components. We demonstrate how these levels can be integrated into self-supervised pre-training for various backbones to enhance the downstream 3D detection task. We show that our method outperforms existing state-of-the-art models on three commonly-used 3D detection datasets.

A One Stop 3D Target Reconstruction and multilevel Segmentation Method

  • paper_url: http://arxiv.org/abs/2308.06974
  • repo_url: https://github.com/ganlab/ostra
  • paper_authors: Jiexiong Xu, Weikun Zhao, Zhiyan Tang, Xiangchao Gan
  • for: This work proposes OSTRA, an open-source one-stop framework for 3D target reconstruction and multilevel segmentation that segments, tracks, and reconstructs multiple instances from an image sequence.
  • methods: OSTRA performs segmentation on 2D images, tracks multiple instances with segmentation labels across the sequence, and then reconstructs labelled 3D objects or their parts with Multi-View Stereo (MVS) or RGBD-based 3D reconstruction methods, extending 2D segmentation into 3D space with support for continuous segmentation labels.
  • results: OSTRA achieves high performance for semantic, instance, and part segmentation on several 3D datasets, even surpassing manual segmentation in scenes with complex structures and occlusions.
    Abstract 3D object reconstruction and multilevel segmentation are fundamental to computer vision research. Existing algorithms usually perform 3D scene reconstruction and target objects segmentation independently, and the performance is not fully guaranteed due to the challenge of the 3D segmentation. Here we propose an open-source one stop 3D target reconstruction and multilevel segmentation framework (OSTRA), which performs segmentation on 2D images, tracks multiple instances with segmentation labels in the image sequence, and then reconstructs labelled 3D objects or multiple parts with Multi-View Stereo (MVS) or RGBD-based 3D reconstruction methods. We extend object tracking and 3D reconstruction algorithms to support continuous segmentation labels to leverage the advances in the 2D image segmentation, especially the Segment-Anything Model (SAM) which uses the pretrained neural network without additional training for new scenes, for 3D object segmentation. OSTRA supports most popular 3D object models including point cloud, mesh and voxel, and achieves high performance for semantic segmentation, instance segmentation and part segmentation on several 3D datasets. It even surpasses the manual segmentation in scenes with complex structures and occlusions. Our method opens up a new avenue for reconstructing 3D targets embedded with rich multi-scale segmentation information in complex scenes. OSTRA is available from https://github.com/ganlab/OSTRA.

How inter-rater variability relates to aleatoric and epistemic uncertainty: a case study with deep learning-based paraspinal muscle segmentation

  • paper_url: http://arxiv.org/abs/2308.06964
  • repo_url: None
  • paper_authors: Parinaz Roshanzamir, Hassan Rivaz, Joshua Ahn, Hamza Mirza, Neda Naghdi, Meagan Anstruther, Michele C. Battié, Maryse Fortin, Yiming Xiao
  • for: This paper investigates how inter-rater variability relates to the uncertainty of deep learning (DL) models in medical image segmentation, with a focus on recent Transformer models and their variants.
  • methods: Test-time augmentation (TTA), test-time dropout (TTD), and deep ensembles are used to quantify aleatoric and epistemic uncertainty and to assess their relationship with inter-rater variability; UNet and TransUNet are compared under two label fusion strategies to study the impact of Transformers on model uncertainty.
  • results: The study reveals an interplay between inter-rater variability and model uncertainty that depends on the choice of label fusion strategy and DL model.
    Abstract Recent developments in deep learning (DL) techniques have led to great performance improvement in medical image segmentation tasks, especially with the latest Transformer model and its variants. While labels from fusing multi-rater manual segmentations are often employed as ideal ground truths in DL model training, inter-rater variability due to factors such as training bias, image noise, and extreme anatomical variability can still affect the performance and uncertainty of the resulting algorithms. Knowledge regarding how inter-rater variability affects the reliability of the resulting DL algorithms, a key element in clinical deployment, can help inform better training data construction and DL models, but has not been explored extensively. In this paper, we measure aleatoric and epistemic uncertainties using test-time augmentation (TTA), test-time dropout (TTD), and deep ensemble to explore their relationship with inter-rater variability. Furthermore, we compare UNet and TransUNet to study the impacts of Transformers on model uncertainty with two label fusion strategies. We conduct a case study using multi-class paraspinal muscle segmentation from T2w MRIs. Our study reveals the interplay between inter-rater variability and uncertainties, affected by choices of label fusion strategies and DL models.
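
As a concrete illustration of one of the three uncertainty estimators used in the study, here is a minimal test-time dropout (TTD) sketch: dropout layers stay stochastic at inference and the variance over repeated forward passes serves as an epistemic uncertainty map. The toy network and sample count are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

def ttd_uncertainty(model: nn.Module, image: torch.Tensor, n_samples: int = 20):
    """Test-time dropout: keep dropout stochastic at inference and return the
    per-pixel mean prediction and variance over repeated forward passes."""
    model.eval()
    for m in model.modules():                     # re-enable dropout only
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack([model(image).softmax(dim=1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)    # prediction, epistemic map

# Toy segmentation head with dropout, for illustration only.
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.5), nn.Conv2d(16, 4, 1))
mean_prob, epistemic = ttd_uncertainty(net, torch.randn(1, 1, 64, 64))
```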

Color-NeuS: Reconstructing Neural Implicit Surfaces with Color

  • paper_url: http://arxiv.org/abs/2308.06962
  • repo_url: https://github.com/Colmar-zlicheng/Color-NeuS
  • paper_authors: Licheng Zhong, Lixin Yang, Kailin Li, Haoyu Zhen, Mei Han, Cewu Lu
  • for: This work reconstructs object surfaces, together with color, from multi-view images or monocular video.
  • methods: View-dependent color is removed from neural volume rendering while rendering performance is retained through a relighting network; the mesh is extracted from a signed distance function (SDF) network, and the color of each surface vertex is drawn from a global color network.
  • results: Evaluated on an in-hand object scanning task featuring heavy occlusion and dramatic lighting changes, the method surpasses existing approaches that reconstruct mesh together with color, and it also performs well on the public DTU, BlendedMVS, and OmniObject3D datasets.
    Abstract The reconstruction of object surfaces from multi-view images or monocular video is a fundamental issue in computer vision. However, much of the recent research concentrates on reconstructing geometry through implicit or explicit methods. In this paper, we shift our focus towards reconstructing mesh in conjunction with color. We remove the view-dependent color from neural volume rendering while retaining volume rendering performance through a relighting network. Mesh is extracted from the signed distance function (SDF) network for the surface, and color for each surface vertex is drawn from the global color network. To evaluate our approach, we conceived an in-hand object scanning task featuring numerous occlusions and dramatic shifts in lighting conditions. We've gathered several videos for this task, and the results surpass those of any existing methods capable of reconstructing mesh alongside color. Additionally, our method's performance was assessed using public datasets, including DTU, BlendedMVS, and OmniObject3D. The results indicated that our method performs well across all these datasets. Project page: https://colmar-zlicheng.github.io/color_neus.

CEmb-SAM: Segment Anything Model with Condition Embedding for Joint Learning from Heterogeneous Datasets

  • paper_url: http://arxiv.org/abs/2308.06957
  • repo_url: None
  • paper_authors: Dongik Shin, Beomsuk Kim, Seungjun Baek
  • for: assist medical experts with diagnostic and therapeutic procedures
  • methods: jointly learning from heterogeneous datasets, using Segment Anything model (SAM) with Condition Embedding block (CEmb-SAM)
  • results: outperforms baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer
    Abstract Automated segmentation of ultrasound images can assist medical experts with diagnostic and therapeutic procedures. Although using the common modality of ultrasound, one typically needs separate datasets in order to segment, for example, different anatomical structures or lesions with different levels of malignancy. In this paper, we consider the problem of jointly learning from heterogeneous datasets so that the model can improve generalization abilities by leveraging the inherent variability among datasets. We merge the heterogeneous datasets into one dataset and refer to each component dataset as a subgroup. We propose to train a single segmentation model so that the model can adapt to each sub-group. For robust segmentation, we leverage recently proposed Segment Anything model (SAM) in order to incorporate sub-group information into the model. We propose SAM with Condition Embedding block (CEmb-SAM) which encodes sub-group conditions and combines them with image embeddings from SAM. The conditional embedding block effectively adapts SAM to each image sub-group by incorporating dataset properties through learnable parameters for normalization. Experiments show that CEmb-SAM outperforms the baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer. The experiments highlight the effectiveness of Cemb-SAM in learning from heterogeneous datasets in medical image segmentation tasks.
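
A hedged sketch of the idea behind the condition embedding block: each dataset sub-group indexes learnable normalization parameters that modulate the image features. The module below is a simplification under our own assumptions (layer sizes, the GroupNorm choice, and naming are illustrative, not the paper's).

```python
import torch
import torch.nn as nn

class ConditionEmbeddingNorm(nn.Module):
    """Sub-group-conditioned normalization: each sub-group id selects a
    learned scale/shift applied to the normalized feature map."""

    def __init__(self, num_subgroups: int, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels, affine=False)
        self.gamma = nn.Embedding(num_subgroups, channels)
        self.beta = nn.Embedding(num_subgroups, channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, feats: torch.Tensor, subgroup: torch.Tensor):
        # feats: (B, C, H, W); subgroup: (B,) integer sub-group ids
        g = self.gamma(subgroup)[:, :, None, None]
        b = self.beta(subgroup)[:, :, None, None]
        return self.norm(feats) * g + b

layer = ConditionEmbeddingNorm(num_subgroups=2, channels=32)
out = layer(torch.randn(4, 32, 16, 16), torch.tensor([0, 1, 0, 1]))
```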

Global Features are All You Need for Image Retrieval and Reranking

  • paper_url: http://arxiv.org/abs/2308.06954
  • repo_url: https://github.com/shihaoshao-gh/superglobal
  • paper_authors: Shihao Shao, Kaifeng Chen, Arjun Karpur, Qinghua Cui, Andre Araujo, Bingyi Cao
  • for: This work aims to improve the efficiency and accuracy of image retrieval systems and to provide a scalable, high-performing retrieval system.
  • methods: Global features are used exclusively for both initial retrieval and reranking; new modules improve global feature extraction (GeM pooling), and a compute- and memory-efficient reranking step refines the global features of the query and top-ranked images using only a small set of images.
  • results: Compared with existing systems, SuperGlobal delivers substantial improvements: on the Revisited Oxford+1M Hard dataset, single-stage results improve by 7.1% and two-stage results by 3.7% with a 64,865x speedup, and the two-stage system surpasses the current single-stage state of the art by 16.3%.
    Abstract Image retrieval systems conventionally use a two-stage paradigm, leveraging global features for initial retrieval and local features for reranking. However, the scalability of this method is often limited due to the significant storage and computation cost incurred by local feature matching in the reranking stage. In this paper, we present SuperGlobal, a novel approach that exclusively employs global features for both stages, improving efficiency without sacrificing accuracy. SuperGlobal introduces key enhancements to the retrieval system, specifically focusing on the global feature extraction and reranking processes. For extraction, we identify sub-optimal performance when the widely-used ArcFace loss and Generalized Mean (GeM) pooling methods are combined and propose several new modules to improve GeM pooling. In the reranking stage, we introduce a novel method to update the global features of the query and top-ranked images by only considering feature refinement with a small set of images, thus being very compute and memory efficient. Our experiments demonstrate substantial improvements compared to the state of the art in standard benchmarks. Notably, on the Revisited Oxford+1M Hard dataset, our single-stage results improve by 7.1%, while our two-stage gain reaches 3.7% with a strong 64,865x speedup. Our two-stage system surpasses the current single-stage state-of-the-art by 16.3%, offering a scalable, accurate alternative for high-performing image retrieval systems with minimal time overhead. Code: https://github.com/ShihaoShao-GH/SuperGlobal.
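
To make the global-feature-only reranking idea concrete, here is a simplified sketch: retrieve with cosine similarity, refine the query by averaging it with its top-k neighbours, and re-score. SuperGlobal's actual refinement modules are more elaborate; this is plain average query expansion under our own assumptions.

```python
import torch
import torch.nn.functional as F

def rerank_with_global_features(query: torch.Tensor, db: torch.Tensor, top_k: int = 10):
    """Two-stage retrieval with global features only: rank by cosine
    similarity, refine the query with its top-k neighbours, re-rank.
    `query` is (D,), `db` is (N, D)."""
    q = F.normalize(query, dim=0)
    d = F.normalize(db, dim=1)
    sims = d @ q                                          # stage 1 ranking
    top = sims.topk(top_k).indices
    refined = F.normalize(q + d[top].mean(dim=0), dim=0)  # stage 2 refinement
    return (d @ refined).argsort(descending=True)         # re-ranked indices

ranking = rerank_with_global_features(torch.randn(256), torch.randn(1000, 256))
```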

Channel-Wise Contrastive Learning for Learning with Noisy Labels

  • paper_url: http://arxiv.org/abs/2308.06952
  • repo_url: None
  • paper_authors: Hui Kang, Sheng Liu, Huaxi Huang, Tongliang Liu
  • for: This work addresses learning with noisy labels (LNL) by introducing channel-wise contrastive learning (CWCL) to distinguish authentic label information from noise.
  • methods: Contrastive learning is performed across diverse channels to extract features aligned with the authentic labels; the cleanly labeled samples identified this way are then used for progressive fine-tuning.
  • results: Evaluations on several benchmark datasets show higher accuracy and better robustness than existing methods.
    Abstract In real-world datasets, noisy labels are pervasive. The challenge of learning with noisy labels (LNL) is to train a classifier that discerns the actual classes from given instances. For this, the model must identify features indicative of the authentic labels. While research indicates that genuine label information is embedded in the learned features of even inaccurately labeled data, it's often intertwined with noise, complicating its direct application. Addressing this, we introduce channel-wise contrastive learning (CWCL). This method distinguishes authentic label information from noise by undertaking contrastive learning across diverse channels. Unlike conventional instance-wise contrastive learning (IWCL), CWCL tends to yield more nuanced and resilient features aligned with the authentic labels. Our strategy is twofold: firstly, using CWCL to extract pertinent features to identify cleanly labeled samples, and secondly, progressively fine-tuning using these samples. Evaluations on several benchmark datasets validate our method's superiority over existing approaches.

MixBCT: Towards Self-Adapting Backward-Compatible Training

  • paper_url: http://arxiv.org/abs/2308.06948
  • repo_url: https://github.com/yuleung/mixbct
  • paper_authors: Yu Liang, Shiliang Zhang, Yaowei Wang, Sheng Xiao, Kenli Li, Xiaoyu Wang
  • for: Improving image retrieval systems through backward-compatible training that works for old models of varying quality.
  • methods: A simple yet highly effective backward-compatible training method (MixBCT) that constrains the distribution of new features to ensure compatibility, adaptively adjusting the constraint domain based on the distribution of the old embeddings.
  • results: Experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C show a clear advantage over previous methods.
    Abstract The exponential growth of data, alongside advancements in model structures and loss functions, has necessitated the enhancement of image retrieval systems through the utilization of new models with superior feature embeddings. However, the expensive process of updating the old retrieval database by replacing embeddings poses a challenge. As a solution, backward-compatible training can be employed to avoid the necessity of updating old retrieval datasets. While previous methods achieved backward compatibility by aligning prototypes of the old model, they often overlooked the distribution of the old features, thus limiting their effectiveness when the old model's low quality leads to a weakly discriminative feature distribution. On the other hand, instance-based methods like L2 regression take into account the distribution of old features but impose strong constraints on the performance of the new model itself. In this paper, we propose MixBCT, a simple yet highly effective backward-compatible training method that serves as a unified framework for old models of varying qualities. Specifically, we summarize four constraints that are essential for ensuring backward compatibility in an ideal scenario, and we construct a single loss function to facilitate backward-compatible training. Our approach adaptively adjusts the constraint domain for new features based on the distribution of the old embeddings. We conducted extensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the effectiveness of our method. The experimental results clearly demonstrate its superiority over previous methods. Code is available at https://github.com/yuleung/MixBCT
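
For intuition, a hedged sketch of the basic compatibility term that backward-compatible training builds on: the new model's embedding of an image is pulled toward the frozen old model's embedding so that old gallery features remain searchable. MixBCT's adaptive constraint domain is omitted here; the margin formulation is our own simplification.

```python
import torch
import torch.nn.functional as F

def compatibility_loss(new_feats: torch.Tensor, old_feats: torch.Tensor,
                       margin: float = 0.0):
    """Pull the new model's embeddings toward the frozen old model's
    embeddings of the same images so old gallery features stay usable."""
    new_feats = F.normalize(new_feats, dim=1)
    old_feats = F.normalize(old_feats, dim=1)
    dist = (new_feats - old_feats).pow(2).sum(dim=1)
    return F.relu(dist - margin).mean()  # margin=0 reduces to plain L2

loss = compatibility_loss(torch.randn(8, 512), torch.randn(8, 512))
```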

Knowing Where to Focus: Event-aware Transformer for Video Grounding

  • paper_url: http://arxiv.org/abs/2308.06947
  • repo_url: https://github.com/jinhyunj/eatr
  • paper_authors: Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn
  • for: This paper proposes a DETR-based video grounding model that exploits the event and temporal structure of videos.
  • methods: An event-aware dynamic moment query is introduced: a slot attention mechanism captures the distinctive event units constituting a video, and the moment queries are fused with the sentence representation through a gated fusion transformer layer to predict moment timestamps.
  • results: The event-aware dynamic moment queries outperform previous state-of-the-art approaches in both effectiveness and efficiency on several video grounding benchmarks.
    Abstract Recent DETR-based video grounding models have made the model directly predict moment timestamps without any hand-crafted components, such as a pre-defined proposal or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook an intrinsic temporal structure of a video, providing limited positional information. In this paper, we formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) Event reasoning that captures distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning that fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.

Semantic-aware Network for Aerial-to-Ground Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.06945
  • repo_url: https://github.com/jinhyunj/sanet
  • paper_authors: Jinhyun Jang, Taeyong Song, Kwanghoon Sohn
  • for: This paper addresses aerial-to-ground image synthesis, the complex and challenging problem of synthesizing a ground image from an aerial image.
  • methods: A novel framework with enhanced structural alignment and semantic awareness: a semantic-attentive feature transformation module aligns aerial features to the ground layout so that complex geographic structures can be reconstructed, and semantic-aware loss functions built on a pre-trained segmentation network enforce realistic synthesis across object classes by computing and balancing per-class losses.
  • results: Comparisons with previous methods and ablation studies show the effectiveness of the framework both qualitatively and quantitatively.
    Abstract Aerial-to-ground image synthesis is an emerging and challenging problem that aims to synthesize a ground image from an aerial image. Due to the highly different layout and object representation between the aerial and ground images, existing approaches usually fail to transfer the components of the aerial scene into the ground scene. In this paper, we propose a novel framework to explore the challenges by imposing enhanced structural alignment and semantic awareness. We introduce a novel semantic-attentive feature transformation module that allows to reconstruct the complex geographic structures by aligning the aerial feature to the ground layout. Furthermore, we propose semantic-aware loss functions by leveraging a pre-trained segmentation network. The network is enforced to synthesize realistic objects across various classes by separately calculating losses for different classes and balancing them. Extensive experiments including comparisons with previous methods and ablation studies show the effectiveness of the proposed framework both qualitatively and quantitatively.

One-shot lip-based biometric authentication: extending behavioral features with authentication phrase information

  • paper_url: http://arxiv.org/abs/2308.06944
  • repo_url: None
  • paper_authors: Brando Koch, Ratko Grbić
  • for: This paper proposes a lip-based biometric authentication (LBBA) method that extracts physical and behavioral features from video of a person's lip movements during speech, modeling behavioral features so that they discriminate on what is being said in addition to style of speech.
  • methods: A deep siamese network built from 3D convolutions and recurrent neural network layers is trained on triplets obtained by customizing the GRID dataset, with a custom triplet loss for batch-wise hard-negative mining.
  • results: On the test set of the customized GRID dataset, the method achieves 3.2% FAR and 3.8% FRR; additional analysis quantifies the influence and discriminative power of the physical and behavioral features used for LBBA.
    Abstract Lip-based biometric authentication (LBBA) is an authentication method based on a person's lip movements during speech in the form of video data captured by a camera sensor. LBBA can utilize both physical and behavioral characteristics of lip movements without requiring any additional sensory equipment apart from an RGB camera. State-of-the-art (SOTA) approaches use one-shot learning to train deep siamese neural networks which produce an embedding vector out of these features. Embeddings are further used to compute the similarity between an enrolled user and a user being authenticated. A flaw of these approaches is that they model behavioral features as style-of-speech without relation to what is being said. This makes the system vulnerable to video replay attacks of the client speaking any phrase. To solve this problem we propose a one-shot approach which models behavioral features to discriminate against what is being said in addition to style-of-speech. We achieve this by customizing the GRID dataset to obtain required triplets and training a siamese neural network based on 3D convolutions and recurrent neural network layers. A custom triplet loss for batch-wise hard-negative mining is proposed. Obtained results using an open-set protocol are 3.2% FAR and 3.8% FRR on the test set of the customized GRID dataset. Additional analysis of the results was done to quantify the influence and discriminatory power of behavioral and physical features for LBBA.
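
The paper's custom triplet loss is not published, but batch-wise hard-negative mining has a standard form, sketched below under our own assumptions: for every anchor, the hardest in-batch positive and negative are selected before applying the margin. Embedding sizes and labels are illustrative.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.2):
    """For each anchor, mine the hardest (farthest) in-batch positive and the
    hardest (closest) in-batch negative, then apply the triplet margin."""
    dist = torch.cdist(emb, emb)                      # (N, N) pairwise L2
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

loss = batch_hard_triplet_loss(torch.randn(8, 64),
                               torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```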

Radiomics-Informed Deep Learning for Classification of Atrial Fibrillation Sub-Types from Left-Atrium CT Volumes

  • paper_url: http://arxiv.org/abs/2308.06933
  • repo_url: https://github.com/xmed-lab/ridl
  • paper_authors: Weihang Dai, Xiaomeng Li, Taihui Yu, Di Zhao, Jun Shen, Kwang-Ting Cheng
  • for: This study aims to improve the automatic classification of Atrial Fibrillation (AF) sub-types from left-atrium CT volumes.
  • methods: The method combines the strengths of deep learning and radiomics: radiomic features guide the deep learning model, with locally computed radiomic features supplementing low-level DNN features to reduce over-fitting, and a feature de-correlation loss ensures the two feature sets carry complementary information.
  • results: The method achieves 86.9% AUC on the AF sub-type classification task, surpassing existing radiomic, deep learning, and hybrid approaches.
    Abstract Atrial Fibrillation (AF) is characterized by rapid, irregular heartbeats, and can lead to fatal complications such as heart failure. The disease is divided into two sub-types based on severity, which can be automatically classified through CT volumes for disease screening of severe cases. However, existing classification approaches rely on generic radiomic features that may not be optimal for the task, whilst deep learning methods tend to over-fit to the high-dimensional volume inputs. In this work, we propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification. Unlike existing hybrid techniques that mostly rely on naïve feature concatenation, we observe that radiomic feature selection methods can serve as an information prior, and propose supplementing low-level deep neural network (DNN) features with locally computed radiomic features. This reduces DNN over-fitting and allows local variations between radiomic features to be better captured. Furthermore, we ensure complementary information is learned by deep and radiomic features by designing a novel feature de-correlation loss. Combined, our method addresses the limitations of deep learning and radiomic approaches and outperforms state-of-the-art radiomic, deep learning, and hybrid approaches, achieving 86.9% AUC for the AF sub-type classification task. Code is available at https://github.com/xmed-lab/RIDL.
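
A hedged sketch of a feature de-correlation loss of the kind the paper describes: the batch cross-correlation between (normalized) deep and radiomic features is penalized so the two branches learn complementary information. The exact formulation in RIDL may differ; dimensions here are illustrative.

```python
import torch

def decorrelation_loss(deep_feats: torch.Tensor, radiomic_feats: torch.Tensor):
    """Penalize the batch cross-correlation between two feature sets so they
    carry complementary information. Inputs: (N, D1) and (N, D2)."""
    a = (deep_feats - deep_feats.mean(0)) / (deep_feats.std(0) + 1e-6)
    b = (radiomic_feats - radiomic_feats.mean(0)) / (radiomic_feats.std(0) + 1e-6)
    cross_corr = a.t() @ b / a.size(0)     # (D1, D2) correlation matrix
    return cross_corr.pow(2).mean()

loss = decorrelation_loss(torch.randn(16, 128), torch.randn(16, 32))
```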

OpenGCD: Assisting Open World Recognition with Generalized Category Discovery

  • paper_url: http://arxiv.org/abs/2308.06926
  • repo_url: https://github.com/fulin-gao/opengcd
  • paper_authors: Fulin Gao, Weimin Zhong, Zhixing Cao, Xin Peng, Zhi Li
  • for: This paper targets a reliable open world recognition (OWR) system that performs open set recognition (OSR) online, groups and labels unknown data as novel known classes, and carries out incremental learning (IL).
  • methods: Three key ideas are combined: (1) the origin of an instance (unknown or specifically known) is scored by the uncertainty of the classifier's prediction; (2) generalized category discovery (GCD) techniques are introduced in OWR for the first time to assist humans in grouping unlabeled data; (3) for the smooth execution of IL and GCD, an equal number of informative exemplars is retained for each class with diversity as the goal.
  • results: Experiments show that OpenGCD not only offers excellent compatibility but also substantially outperforms other baselines. Code: https://github.com/Fulin-Gao/OpenGCD.
    Abstract A desirable open world recognition (OWR) system requires performing three tasks: (1) Open set recognition (OSR), i.e., classifying the known (classes seen during training) and rejecting the unknown (unseen$/$novel classes) online; (2) Grouping and labeling these unknown as novel known classes; (3) Incremental learning (IL), i.e., continual learning these novel classes and retaining the memory of old classes. Ideally, all of these steps should be automated. However, existing methods mostly assume that the second task is completely done manually. To bridge this gap, we propose OpenGCD that combines three key ideas to solve the above problems sequentially: (a) We score the origin of instances (unknown or specifically known) based on the uncertainty of the classifier's prediction; (b) For the first time, we introduce generalized category discovery (GCD) techniques in OWR to assist humans in grouping unlabeled data; (c) For the smooth execution of IL and GCD, we retain an equal number of informative exemplars for each class with diversity as the goal. Moreover, we present a new performance evaluation metric for GCD called harmonic clustering accuracy. Experiments on two standard classification benchmarks and a challenging dataset demonstrate that OpenGCD not only offers excellent compatibility but also substantially outperforms other baselines. Code: https://github.com/Fulin-Gao/OpenGCD.

CBA: Improving Online Continual Learning via Continual Bias Adaptor

  • paper_url: http://arxiv.org/abs/2308.06925
  • repo_url: https://github.com/wqza/cba-online-cl
  • paper_authors: Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, Deyu Meng
  • for: Learning new knowledge from non-stationary data streams while stably consolidating previously learned knowledge.
  • methods: A Continual Bias Adaptor (CBA) module augments the classifier network during training so that it adapts to catastrophic distribution change, allowing the classifier to stably consolidate previously learned tasks.
  • results: Extensive experiments show that CBA effectively alleviates catastrophic distribution shift, and the module can be removed at test time with no extra computation cost or memory overhead.
    Abstract Online continual learning (CL) aims to learn new knowledge and consolidate previously learned knowledge from non-stationary data streams. Due to the time-varying training setting, the model learned from a changing distribution easily forgets the previously learned knowledge and biases toward the newly received task. To address this problem, we propose a Continual Bias Adaptor (CBA) module to augment the classifier network to adapt to catastrophic distribution change during training, such that the classifier network is able to learn a stable consolidation of previously learned tasks. In the testing stage, CBA can be removed which introduces no additional computation cost and memory overhead. We theoretically reveal the reason why the proposed method can effectively alleviate catastrophic distribution shifts, and empirically demonstrate its effectiveness through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks.

Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

  • paper_url: http://arxiv.org/abs/2308.06904
  • repo_url: https://github.com/kangben258/hit
  • paper_authors: Ben Kang, Xin Chen, Dong Wang, Houwen Peng, Huchuan Lu
  • for: Increasing the running speed of visual trackers so that high performance is attainable on devices with limited computational power.
  • methods: A new family of efficient tracking models, HiT, whose Bridge Module connects modern lightweight transformers with the tracking framework by incorporating high-level information from deep features into shallow large-resolution features; a novel dual-image position encoding technique jointly encodes the positions of the search region and the template image.
  • results: HiT runs at 61 frames per second (fps) on the Nvidia Jetson AGX edge device and attains 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
    Abstract Transformer-based visual trackers have demonstrated significant progress owing to their superior modeling capabilities. However, existing trackers are hampered by low speed, limiting their applicability on devices with limited computational power. To alleviate this problem, we propose HiT, a new family of efficient tracking models that can run at high speed on different devices while retaining high performance. The central idea of HiT is the Bridge Module, which bridges the gap between modern lightweight transformers and the tracking framework. The Bridge Module incorporates the high-level information of deep features into the shallow large-resolution features. In this way, it produces better features for the tracking head. We also propose a novel dual-image position encoding technique that simultaneously encodes the position information of both the search region and template images. The HiT model achieves promising speed with competitive performance. For instance, it runs at 61 frames per second (fps) on the Nvidia Jetson AGX edge device. Furthermore, HiT attains 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.

Orthogonal Temporal Interpolation for Zero-Shot Video Recognition

  • paper_url: http://arxiv.org/abs/2308.06897
  • repo_url: None
  • paper_authors: Yan Zhu, Junbao Zhuo, Bin Ma, Jiajia Geng, Xiaoming Wei, Xiaolin Wei, Shuhui Wang
  • for: zero-shot video recognition (ZSVR)
  • methods: vision-language models (VLMs) with an additional temporal learning module and orthogonal temporal interpolation, as well as a matching loss
  • results: the proposed OTI model outperforms previous state-of-the-art methods on popular video datasets (Kinetics-600, UCF101, and HMDB51) with clear margins.
    Abstract Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during the model training process. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often use an additional temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for video from unseen categories, we observe an abnormal phenomenon where the model that uses spatial-temporal feature performs much worse than the model that removes temporal learning module and uses only spatial feature. We conjecture that improper temporal modeling on video disrupts the spatial feature of the video. To verify our hypothesis, we propose Feature Factorization to retain the orthogonal temporal feature of the video and use interpolation to construct refined spatial-temporal feature. The model using appropriately refined spatial-temporal feature performs better than the one using only spatial feature, which verifies the effectiveness of the orthogonal temporal feature for the ZSVR task. Therefore, an Orthogonal Temporal Interpolation module is designed to learn a better refined spatial-temporal video feature during training. Additionally, a Matching Loss is introduced to improve the quality of the orthogonal temporal feature. We propose a model called OTI for ZSVR by employing orthogonal temporal interpolation and the matching loss based on VLMs. The ZSVR accuracies on popular video datasets (i.e., Kinetics-600, UCF101 and HMDB51) show that OTI outperforms the previous state-of-the-art method by a clear margin.

Robustness Stress Testing in Medical Image Classification

  • paper_url: http://arxiv.org/abs/2308.06889
  • repo_url: https://github.com/mobarakol/robustness_stress_testing
  • paper_authors: Mobarakol Islam, Zeju Li, Ben Glocker
  • for: This paper assesses the reliability and robustness of image-based disease detection models.
  • methods: Progressive stress testing is used to assess model robustness and subgroup performance disparities, applying five bidirectional and unidirectional image perturbations at six severity levels to chest X-ray and skin lesion models.
  • results: Models differ considerably in robustness across perturbations and severity levels, and pretraining characteristics play an important role in downstream robustness; the results suggest that progressive stress testing is a viable, important tool that should become standard practice in the clinical validation of image-based disease detection models.
    Abstract Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data is drawn from the same distribution as the training data, the iid test set performance can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five different bidirectional and unidirectional image perturbations with six different severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool and should become standard practice in the clinical validation of image-based disease detection models.
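
A minimal sketch of progressive stress testing with a single example perturbation (additive Gaussian noise) swept over severity levels; the paper applies five different bidirectional and unidirectional perturbations at six severities in the same spirit. The noise scale and toy model are assumptions.

```python
import torch

def stress_test(model, images, labels, severities=range(1, 7)):
    """Evaluate accuracy under one perturbation (additive Gaussian noise)
    applied at progressively higher severity levels."""
    model.eval()
    results = {}
    with torch.no_grad():
        for s in severities:
            noisy = (images + 0.05 * s * torch.randn_like(images)).clamp(0, 1)
            pred = model(noisy).argmax(dim=1)
            results[s] = (pred == labels).float().mean().item()
    return results  # {severity: accuracy}

# Toy classifier and data, for illustration only.
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
acc_by_severity = stress_test(net, torch.rand(16, 3, 32, 32),
                              torch.randint(0, 10, (16,)))
```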

Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization

  • paper_url: http://arxiv.org/abs/2308.06879
  • repo_url: None
  • paper_authors: Jungsoo Lee, Debasmit Das, Jaegul Choo, Sungha Choi
  • for: This paper proposes a simple yet effective sample selection method for test-time adaptation (TTA) that addresses the noisy signals, including open-set predictions, that hamper existing TTA methods.
  • methods: The method builds on a key empirical observation: under entropy minimization, incorrect or open-set predictions inject noisy signals into adaptation, and such noisy samples tend to show decreased rather than increased confidence values; the method therefore filters out samples whose confidence is lower in the adapted model than in the original model.
  • results: The filtering improves the long-term adaptation of TTA methods in both image classification (e.g., 49.4% reduced error rates with TENT) and semantic segmentation (e.g., an 11.7% gain in mIoU with TENT).
    Abstract Test-time adaptation (TTA) methods, which generally rely on the model's predictions (e.g., entropy minimization) to adapt the source pretrained model to the unlabeled target domain, suffer from noisy signals originating from 1) incorrect or 2) open-set predictions. Long-term stable adaptation is hampered by such noisy signals, so training models without such error accumulation is crucial for practical TTA. To address these issues, including open-set TTA, we propose a simple yet effective sample selection method inspired by the following crucial empirical finding. While entropy minimization compels the model to increase the probability of its predicted label (i.e., confidence values), we found that noisy samples rather show decreased confidence values. To be more specific, entropy minimization attempts to raise the confidence values of an individual sample's prediction, but individual confidence values may rise or fall due to the influence of signals from numerous other predictions (i.e., wisdom of crowds). Due to this fact, noisy signals misaligned with such 'wisdom of crowds', generally found in the correct signals, fail to raise the individual confidence values of wrong samples, despite attempts to increase them. Based on such findings, we filter out the samples whose confidence values are lower in the adapted model than in the original model, as they are likely to be noisy. Our method is widely applicable to existing TTA methods and improves their long-term adaptation performance in both image classification (e.g., 49.4% reduced error rates with TENT) and semantic segmentation (e.g., 11.7% gain in mIoU with TENT).
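
A hedged sketch of the sample-selection rule on top of entropy minimization: a sample contributes to the adaptation loss only if its confidence under the adapting model has not dropped below its confidence under the frozen source model. Model shapes and the exact comparison are illustrative simplifications, not the paper's implementation.

```python
import torch

def selective_entropy_loss(adapted_model, source_model, x):
    """Entropy-minimization TTA loss computed only on samples whose
    confidence has not dropped relative to the frozen source model; such
    drops are treated as signs of wrong or open-set predictions."""
    with torch.no_grad():
        conf_src = source_model(x).softmax(dim=1).max(dim=1).values
    probs = adapted_model(x).softmax(dim=1)
    conf_now = probs.max(dim=1).values
    keep = conf_now >= conf_src                     # the selection rule
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    return entropy[keep].mean() if keep.any() else entropy.sum() * 0.0

src, adapted = torch.nn.Linear(32, 10), torch.nn.Linear(32, 10)
loss = selective_entropy_loss(adapted, src, torch.randn(8, 32))
```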

Shape-Graph Matching Network (SGM-net): Registration for Statistical Shape Analysis

  • paper_url: http://arxiv.org/abs/2308.06869
  • repo_url: None
  • paper_authors: Shenyuan Liang, Mauricio Pamplona Segundo, Sathyanarayanan N. Aakur, Sudeep Sarkar, Anuj Srivastava
  • for: This work targets the statistical analysis of shapes of data objects called shape graphs: sets of nodes connected by articulated curves with arbitrary shapes.
  • methods: A novel neural network architecture solves the constrained registration of points (nodes to nodes, edges to edges) across objects, a problem made challenging by differences in nodes (numbers, locations) and edges (shapes, placements, sizes), using an unsupervised loss function based on the elastic shape metric for curves.
  • results: The approach yields (1) state-of-the-art matching performance and (2) an order-of-magnitude reduction in computational cost relative to baseline methods, demonstrated on both simulated data and real-world 2D and 3D shape graphs.
    Abstract This paper focuses on the statistical analysis of shapes of data objects called shape graphs, a set of nodes connected by articulated curves with arbitrary shapes. A critical need here is a constrained registration of points (nodes to nodes, edges to edges) across objects. This, in turn, requires optimization over the permutation group, made challenging by differences in nodes (in terms of numbers, locations) and edges (in terms of shapes, placements, and sizes) across objects. This paper tackles this registration problem using a novel neural-network architecture and involves an unsupervised loss function developed using the elastic shape metric for curves. This architecture results in (1) state-of-the-art matching performance and (2) an order of magnitude reduction in the computational cost relative to baseline approaches. We demonstrate the effectiveness of the proposed approach using both simulated data and real-world 2D and 3D shape graphs. Code and data will be made publicly available after review to foster research.

Camera Based mmWave Beam Prediction: Towards Multi-Candidate Real-World Scenarios

  • paper_url: http://arxiv.org/abs/2308.06868
  • repo_url: None
  • paper_authors: Gouranga Charan, Muhammad Alrabeiah, Tawfik Osman, Ahmed Alkhateeb
  • for: Exploring the use of sensory information to aid the beam selection process in millimeter-wave (mmWave) and sub-terahertz (sub-THz) communication systems.
  • methods: A machine learning-based framework that utilizes visual and positional data to predict the optimal beam indices, as an alternative to conventional beam sweeping approaches.
  • results: The proposed solutions achieve close to 100% top-5 beam prediction accuracy for single-user scenarios and close to 95% for multi-candidate scenarios, and identify the probable transmitting candidate with over 93% accuracy across different scenarios, highlighting a promising approach for nearly eliminating the beam training overhead in mmWave/THz communication systems.
    Abstract Leveraging sensory information to aid the millimeter-wave (mmWave) and sub-terahertz (sub-THz) beam selection process is attracting increasing interest. This sensory data, captured for example by cameras at the basestations, has the potential of significantly reducing the beam sweeping overhead and enabling highly-mobile applications. The solutions developed so far, however, have mainly considered single-candidate scenarios, i.e., scenarios with a single candidate user in the visual scene, and were evaluated using synthetic datasets. To address these limitations, this paper extensively investigates the sensing-aided beam prediction problem in a real-world multi-object vehicle-to-infrastructure (V2I) scenario and presents a comprehensive machine learning-based framework. In particular, this paper proposes to utilize visual and positional data to predict the optimal beam indices as an alternative to the conventional beam sweeping approaches. For this, a novel user (transmitter) identification solution has been developed, a key step in realizing sensing-aided multi-candidate and multi-user beam prediction solutions. The proposed solutions are evaluated on the large-scale real-world DeepSense $6$G dataset. Experimental results in realistic V2I communication scenarios indicate that the proposed solutions achieve close to $100\%$ top-5 beam prediction accuracy for the scenarios with single-user and close to $95\%$ top-5 beam prediction accuracy for multi-candidate scenarios. Furthermore, the proposed approach can identify the probable transmitting candidate with more than $93\%$ accuracy across the different scenarios. This highlights a promising approach for nearly eliminating the beam training overhead in mmWave/THz communication systems.

Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation

  • paper_url: http://arxiv.org/abs/2308.06866
  • repo_url: None
  • paper_authors: Md Mahedi Hasan, Nasser Nasrabadi
  • for: This paper proposes a new framework, caption-guided face recognition (CGFR), to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems.
  • methods: Facial descriptions provided by face examiners serve as auxiliary information; a contextual feature aggregation module (CFAM) and a textual feature refinement module (TFRM) are proposed to effectively fuse the image and textual features.
  • results: On the Multi-Modal CelebA-HQ dataset, the CGFR framework significantly improves the performance of ArcFace in both the 1:1 verification and 1:N identification protocols.
    Abstract We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems. In contrast to combining soft biometrics (e.g., facial marks, gender, and age) with face images, in this work, we use facial descriptions provided by face examiners as a piece of auxiliary information. However, due to the heterogeneity of the modalities, improving the performance by directly fusing the textual and facial features is very challenging, as both lie in different embedding spaces. In this paper, we propose a contextual feature aggregation module (CFAM) that addresses this issue by effectively exploiting the fine-grained word-region interaction and global image-caption association. Specifically, CFAM adopts a self-attention and a cross-attention scheme for improving the intra-modality and inter-modality relationship between the image and textual features, respectively. Additionally, we design a textual feature refinement module (TFRM) that refines the textual features of the pre-trained BERT encoder by updating the contextual embeddings. This module enhances the discriminative power of textual features with a cross-modal projection loss and realigns the word and caption embeddings with visual features by incorporating a visual-semantic alignment loss. We implemented the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluated its performance on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the performance of ArcFace in both 1:1 verification and 1:N identification protocol.

Manifold DivideMix: A Semi-Supervised Contrastive Learning Framework for Severe Label Noise

  • paper_url: http://arxiv.org/abs/2308.06861
  • repo_url: https://github.com/fahim-f/manifolddividemix
  • paper_authors: Fahimeh Fooladgar, Minh Nguyen Nhat To, Parvin Mousavi, Purang Abolmaesumi
  • for: Improving the performance of deep neural networks trained on data with noisy labels, targeting real-world datasets that contain both in-distribution and out-of-distribution noisy-label samples.
  • methods: Self-supervised training extracts a meaningful and generalizable embedding space for every sample regardless of its label; a simple yet effective k-nearest-neighbor step then removes a portion of the out-of-distribution samples, after which an iterative "Manifold DivideMix" algorithm separates clean and noisy samples and trains the model in a semi-supervised way. A new "MixEMatch" algorithm additionally applies mixup augmentation at both the input and the final hidden representations, extracting better representations by interpolating in both the input and manifold spaces.
  • results: Extensive experiments on multiple synthetic-noise image benchmarks and real-world web-crawled datasets demonstrate the effectiveness of the proposed framework.
    Abstract Deep neural networks have proven to be highly effective when large amounts of data with clean labels are available. However, their performance degrades when training data contains noisy labels, leading to poor generalization on the test set. Real-world datasets contain noisy label samples that either have similar visual semantics to other classes (in-distribution) or have no semantic relevance to any class (out-of-distribution) in the dataset. Most state-of-the-art methods leverage ID labeled noisy samples as unlabeled data for semi-supervised learning, but OOD labeled noisy samples cannot be used in this way because they do not belong to any class within the dataset. Hence, in this paper, we propose incorporating the information from all the training data by leveraging the benefits of self-supervised training. Our method aims to extract a meaningful and generalizable embedding space for each sample regardless of its label. Then, we employ a simple yet effective K-nearest neighbor method to remove portions of out-of-distribution samples. By discarding these samples, we propose an iterative "Manifold DivideMix" algorithm to find clean and noisy samples, and train our model in a semi-supervised way. In addition, we propose "MixEMatch", a new algorithm for the semi-supervised step that involves mixup augmentation at the input and final hidden representations of the model. This will extract better representations by interpolating both in the input and manifold spaces. Extensive experiments on multiple synthetic-noise image benchmarks and real-world web-crawled datasets demonstrate the effectiveness of our proposed framework. Code is available at https://github.com/Fahim-F/ManifoldDivideMix.
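
A minimal sketch of the k-nearest-neighbor step used to discard likely out-of-distribution samples before the semi-supervised stage: samples whose embeddings are far from their k nearest neighbours receive high scores and can be filtered. The value of k and the 90% cutoff are arbitrary illustrative choices, not the paper's settings.

```python
import torch

def knn_ood_scores(embeddings: torch.Tensor, k: int = 10):
    """Score each sample by the mean distance to its k nearest neighbours in
    the embedding space; high scores flag likely out-of-distribution samples."""
    dist = torch.cdist(embeddings, embeddings)
    dist.fill_diagonal_(float('inf'))                # ignore self-distance
    return dist.topk(k, largest=False).values.mean(dim=1)

scores = knn_ood_scores(torch.randn(100, 64))
keep = scores < scores.quantile(0.9)                 # e.g. drop the top 10%
```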

UGC Quality Assessment: Exploring the Impact of Saliency in Deep Feature-Based Quality Assessment

  • paper_url: http://arxiv.org/abs/2308.06853
  • repo_url: None
  • paper_authors: Xinyi Wang, Angeliki Katsenou, David Bull
  • for: This study aims to improve methods for assessing the quality of user-generated content (UGC).
  • methods: State-of-the-art metrics that extract and combine natural scene statistics and deep neural network features are explored, with saliency maps introduced to improve perceptibility; models are trained and tested on the public YouTube-UGC and KoNViD-1k datasets.
  • results: Preliminary results indicate that high correlations are achieved using deep features alone, while adding saliency does not always boost performance; results and code will be released publicly as a benchmark for the research community.
    Abstract The volume of User Generated Content (UGC) has increased in recent years. The challenge with this type of content is assessing its quality. So far, the state-of-the-art metrics are not exhibiting a very high correlation with perceptual quality. In this paper, we explore state-of-the-art metrics that extract/combine natural scene statistics and deep neural network features. We experiment with these by introducing saliency maps to improve perceptibility. We train and test our models using public datasets, namely, YouTube-UGC and KoNViD-1k. Preliminary results indicate that high correlations are achieved by using only deep features while adding saliency is not always boosting the performance. Our results and code will be made publicly available to serve as a benchmark for the research community and can be found on our project page: https://github.com/xinyiW915/SPIE-2023-Supplementary.

Optimizing Brain Tumor Classification: A Comprehensive Study on Transfer Learning and Imbalance Handling in Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.06821
  • repo_url: https://github.com/razaimam45/ai701-project-transfer-learning-approach-for-imbalance-classification-of-brain-tumor-mri-
  • paper_authors: Raza Imam, Mohammed Talha Alam
  • for: This study develops a deep learning approach for class-imbalanced data to improve the classification accuracy of brain tumor MRI images.
  • methods: A transfer learning approach (Transfer Learning-CNN) transfers the predictive capabilities of publicly available pre-trained models to a CNN; experiments evaluate different loss functions, including focal loss, and oversampling methods such as SMOTE and ADASYN to address the data imbalance.
  • results: The proposed strategy, which combines VGG-16 and CNN, achieves an accuracy of 96%, significantly surpassing alternative approaches.
    Abstract Deep learning has emerged as a prominent field in recent literature, showcasing the introduction of models that utilize transfer learning to achieve remarkable accuracies in the classification of brain tumor MRI images. However, the majority of these proposals primarily focus on balanced datasets, neglecting the inherent data imbalance present in real-world scenarios. Consequently, there is a pressing need for approaches that not only address the data imbalance but also prioritize precise classification of brain cancer. In this work, we present a novel deep learning-based approach, called Transfer Learning-CNN, for brain tumor classification using MRI data. The proposed model leverages the predictive capabilities of existing publicly available models by utilizing their pre-trained weights and transferring those weights to the CNN. By leveraging a publicly available Brain MRI dataset, the experiment evaluated various transfer learning models for classifying different tumor types, including meningioma, glioma, and pituitary tumors. We investigate the impact of different loss functions, including focal loss, and oversampling methods, such as SMOTE and ADASYN, in addressing the data imbalance issue. Notably, the proposed strategy, which combines VGG-16 and CNN, achieved an impressive accuracy rate of 96%, surpassing alternative approaches significantly.
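
Since focal loss is one of the imbalance-handling choices the study evaluates, here is a standard multi-class implementation for reference; the gamma and alpha values below are common defaults, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25):
    """Multi-class focal loss: down-weights easy, well-classified examples so
    training focuses on hard samples from under-represented classes."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets[:, None]).squeeze(1)
    pt = log_pt.exp()                                # prob of the true class
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```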

cs.AI - 2023-08-14

MM-GEF: Multi-modal representation meet collaborative filtering

  • paper_url: http://arxiv.org/abs/2308.07222
  • repo_url: None
  • paper_authors: Hao Wu, Alejandro Ariza-Casabona, Bartłomiej Twardowski, Tri Kurniawan Wijaya
  • for: This paper proposes a graph-based item structure enhancement method to improve the performance of multi-modal recommender systems.
  • methods: Graph early-fusion combines multi-modal content features into refined item representations, injecting structural information obtained from both multi-modal and collaborative signals.
  • results: Extensive experiments on four publicly available datasets show systematic improvements over state-of-the-art multi-modal recommendation methods.
    Abstract In modern e-commerce, item content features in various modalities offer accurate yet comprehensive information to recommender systems. The majority of previous work either focuses on learning effective item representation during modelling user-item interactions, or exploring item-item relationships by analysing multi-modal features. Those methods, however, fail to incorporate the collaborative item-user-item relationships into the multi-modal feature-based item structure. In this work, we propose a graph-based item structure enhancement method MM-GEF: Multi-Modal recommendation with Graph Early-Fusion, which effectively combines the latent item structure underlying multi-modal contents with the collaborative signals. Instead of processing the content feature in different modalities separately, we show that the early-fusion of multi-modal features provides significant improvement. MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals. Through extensive experiments on four publicly available datasets, we demonstrate systematical improvements of our method over state-of-the-art multi-modal recommendation methods.
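
The early-fusion step can be illustrated with a toy sketch: per-item features from each modality are concatenated before the item-item graph is built, rather than constructing a separate graph per modality. The feature shapes and the cosine k-NN graph below are illustrative assumptions, not MM-GEF's actual pipeline.

```python
import numpy as np

def early_fuse(visual, textual):
    """Concatenate modality features into one item representation."""
    return np.concatenate([visual, textual], axis=-1)

def knn_item_graph(features, k=2):
    """Cosine k-NN adjacency over fused item features (dense, for illustration)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)           # no self-edges
    idx = np.argsort(-sim, axis=1)[:, :k]    # top-k most similar items
    adj = np.zeros_like(sim)
    for i, nbrs in enumerate(idx):
        adj[i, nbrs] = 1.0
    return adj

items = early_fuse(np.random.rand(5, 4), np.random.rand(5, 3))
print(knn_item_graph(items))
```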

Generating Individual Trajectories Using GPT-2 Trained from Scratch on Encoded Spatiotemporal Data

  • paper_url: http://arxiv.org/abs/2308.07940
  • repo_url: None
  • paper_authors: Taizo Horikomi, Shouji Fujimoto, Atushi Ishikawa, Takayuki Mizuno
  • for: This paper builds a deep-learning model that generates individual daily trajectories, enabling the prediction of people's daily movements.
  • methods: A GPT-2 architecture is trained from scratch as a sequence-generation model over location tokens at varied spatial scales, interleaved with time-interval tokens.
  • results: With special tokens representing environmental factors and individual attributes, the model generates daily trajectories across multiple spatial scales that reflect both environment and attributes, supporting prediction of behavior over time and space.
    Abstract Following Mizuno, Fujimoto, and Ishikawa's research (Front. Phys. 2022), we transpose geographical coordinates expressed in latitude and longitude into distinctive location tokens that embody positions across varied spatial scales. We encapsulate an individual daily trajectory as a sequence of tokens by adding unique time interval tokens to the location tokens. Using the architecture of an autoregressive language model, GPT-2, this sequence of tokens is trained from scratch, allowing us to construct a deep learning model that sequentially generates an individual daily trajectory. Environmental factors such as meteorological conditions and individual attributes such as gender and age are symbolized by unique special tokens, and by training these tokens and trajectories on the GPT-2 architecture, we can generate trajectories that are influenced by both environmental factors and individual attributes.
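
A toy sketch of the encoding described above: coordinates are discretized into location tokens at a chosen spatial scale and interleaved with time-interval tokens, yielding a plain token sequence a GPT-2-style model can be trained on from scratch. The grid resolution, interval buckets, and token names are assumptions for illustration.

```python
def location_token(lat: float, lon: float, cell_deg: float = 0.01) -> str:
    """Map a coordinate to a grid-cell token at the given spatial scale."""
    row, col = int(lat / cell_deg), int(lon / cell_deg)
    return f"<loc_{row}_{col}>"

def interval_token(minutes: int) -> str:
    """Bucket the elapsed time between visits into a discrete interval token."""
    bucket = min(minutes // 30, 12)  # half-hour buckets, capped
    return f"<dt_{bucket}>"

# One day of movement becomes a token sequence; attribute/weather special
# tokens (assumed names) are prepended as conditioning context.
trajectory = [(35.6895, 139.6917, 0), (35.6586, 139.7454, 45), (35.6895, 139.6917, 600)]
tokens = ["<female>", "<age_30s>", "<rainy>"]
for lat, lon, dt in trajectory:
    tokens += [interval_token(dt), location_token(lat, lon)]
print(" ".join(tokens))
```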

Algorithms for the Training of Neural Support Vector Machines

  • paper_url: http://arxiv.org/abs/2308.07204
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Lars Simon, Manuel Radons
  • for: This paper examines how neural support vector machines (NSVMs) can incorporate domain knowledge into the design of the model architecture.
  • methods: It introduces a set of NSVM training algorithms that leverage the Pegasos algorithm.
  • results: The effectiveness of NSVMs is demonstrated by solving a set of standard machine learning tasks.
    Abstract Neural support vector machines (NSVMs) allow for the incorporation of domain knowledge in the design of the model architecture. In this article we introduce a set of training algorithms for NSVMs that leverage the Pegasos algorithm and provide a proof of concept by solving a set of standard machine learning tasks.
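
Pegasos is a standard stochastic sub-gradient method for the SVM objective, so the training primitive can be sketched directly; the regularization constant, iteration count, and toy data below are illustrative, and the neural feature map that turns this into an NSVM is omitted.

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Stochastic sub-gradient descent for the linear SVM objective."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:   # margin violated: hinge-loss sub-gradient
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = pegasos(X, y)
print(np.sign(X @ w))  # should recover the labels on this separable toy set
```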

Neural Categorical Priors for Physics-Based Character Control

  • paper_url: http://arxiv.org/abs/2308.07200
  • repo_url: https://github.com/Tencent-RoboticsX/NCP
  • paper_authors: Qingxu Zhu, He Zhang, Mengting Lan, Lei Han
  • for: This work proposes a new learning framework for controlling physics-based humanoid characters with higher-quality and more diverse motions.
  • methods: Reinforcement learning (RL) tracks and imitates life-like movements, and a Vector Quantized Variational AutoEncoder (VQ-VAE) compresses the information in motion clips.
  • results: The method generates high-quality life-like motions and facilitates upper-level policy learning on downstream tasks; extensive experiments on humanoid characters yield considerably higher-quality movements.
    Abstract Recent advances in learning reusable motion priors have demonstrated their effectiveness in generating naturalistic behaviors. In this paper, we propose a new learning framework in this paradigm for controlling physics-based characters with significantly improved motion quality and diversity over existing state-of-the-art methods. The proposed method uses reinforcement learning (RL) to initially track and imitate life-like movements from unstructured motion clips using the discrete information bottleneck, as adopted in the Vector Quantized Variational AutoEncoder (VQ-VAE). This structure compresses the most relevant information from the motion clips into a compact yet informative latent space, i.e., a discrete space over vector quantized codes. By sampling codes in the space from a trained categorical prior distribution, high-quality life-like behaviors can be generated, similar to the usage of VQ-VAE in computer vision. Although this prior distribution can be trained with the supervision of the encoder's output, it follows the original motion clip distribution in the dataset and could lead to imbalanced behaviors in our setting. To address the issue, we further propose a technique named prior shifting to adjust the prior distribution using curiosity-driven RL. The outcome distribution is demonstrated to offer sufficient behavioral diversity and significantly facilitates upper-level policy learning for downstream tasks. We conduct comprehensive experiments using humanoid characters on two challenging downstream tasks, sword-shield striking and two-player boxing game. Our results demonstrate that the proposed framework is capable of controlling the character to perform considerably high-quality movements in terms of behavioral strategies, diversity, and realism. Videos, codes, and data are available at https://tencent-roboticsx.github.io/NCP/.

Explaining Black-Box Models through Counterfactuals

  • paper_url: http://arxiv.org/abs/2308.07198
  • repo_url: https://github.com/juliatrustworthyai/counterfactualexplanations.jl
  • paper_authors: Patrick Altmeyer, Arie van Deursen, Cynthia C. S. Liem
  • for: This paper provides a Julia package for generating Counterfactual Explanations (CE) and Algorithmic Recourse (AR) to explain the outputs of black-box models.
  • methods: The package, developed in Julia, bundles a suite of counterfactual generators that produce actionable, plausible counterfactual explanations.
  • results: These generators yield practical, actionable explanations and can provide algorithmic recourse to help improve undesirable outcomes.
    Abstract We present CounterfactualExplanations.jl: a package for generating Counterfactual Explanations (CE) and Algorithmic Recourse (AR) for black-box models in Julia. CE explain how inputs into a model need to change to yield specific model predictions. Explanations that involve realistic and actionable changes can be used to provide AR: a set of proposed actions for individuals to change an undesirable outcome for the better. In this article, we discuss the usefulness of CE for Explainable Artificial Intelligence and demonstrate the functionality of our package. The package is straightforward to use and designed with a focus on customization and extensibility. We envision it to one day be the go-to place for explaining arbitrary predictive models in Julia through a diverse suite of counterfactual generators.
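
The package itself is written in Julia; as a language-agnostic illustration of the core idea behind gradient-based counterfactual generators, the Python sketch below perturbs an input until a black-box classifier assigns the desired class, while penalizing distance from the original instance. The toy model and all names are assumptions, not the package's API.

```python
import numpy as np

def counterfactual(predict_proba, x, target_class, lr=0.05, dist_weight=0.1, steps=500):
    """Search for a nearby input the model assigns to target_class (finite differences)."""
    x_cf = x.copy()
    eps = 1e-4
    for _ in range(steps):
        grad = np.zeros_like(x_cf)
        for j in range(len(x_cf)):  # numerical gradient of the target probability
            e = np.zeros_like(x_cf); e[j] = eps
            grad[j] = (predict_proba(x_cf + e)[target_class]
                       - predict_proba(x_cf - e)[target_class]) / (2 * eps)
        x_cf += lr * (grad - dist_weight * (x_cf - x))  # ascend target prob., stay close
        if predict_proba(x_cf).argmax() == target_class:
            break
    return x_cf

# Toy black-box: 2-class softmax model over 2 features (illustrative only).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
def predict_proba(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

x0 = np.array([1.0, -1.0])  # currently classified as class 0
print(counterfactual(predict_proba, x0, target_class=1))
```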

Task Offloading for Smart Glasses in Healthcare: Enhancing Detection of Elevated Body Temperature

  • paper_url: http://arxiv.org/abs/2308.07193
  • repo_url: None
  • paper_authors: Abdenacer Naouri, Nabil Abdelkader Nouri, Attia Qammar, Feifei Shi, Huansheng Ning, Sahraoui Dhelim
  • for: The goal is to analyze task-offloading scenarios for a healthcare monitoring application running on smart glasses and to identify the optimal conditions for offloading.
  • methods: Offloading effectiveness is evaluated under realistic conditions using performance metrics including task completion time, computing capability, and energy consumption.
  • results: In an indoor setting such as an airport, using smart glasses to detect elevated body temperature can reduce healthcare staff workload and improve service quality, demonstrating the practicality and relevance of task offloading for wearables in healthcare.
    Abstract Wearable devices like smart glasses have gained popularity across various applications. However, their limited computational capabilities pose challenges for tasks that require extensive processing, such as image and video processing, leading to drained device batteries. To address this, offloading such tasks to nearby powerful remote devices, such as mobile devices or remote servers, has emerged as a promising solution. This paper focuses on analyzing task-offloading scenarios for a healthcare monitoring application performed on smart wearable glasses, aiming to identify the optimal conditions for offloading. The study evaluates performance metrics including task completion time, computing capabilities, and energy consumption under realistic conditions. A specific use case is explored within an indoor area like an airport, where security agents wearing smart glasses to detect elevated body temperature in individuals, potentially indicating COVID-19. The findings highlight the potential benefits of task offloading for wearable devices in healthcare settings, demonstrating its practicality and relevance.
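
The trade-off being analyzed can be captured by a toy decision rule: offload when the transfer-plus-remote-compute time beats local compute time and the transmission energy fits within budget. The cost model and every number below are illustrative assumptions, not measurements from the paper.

```python
def should_offload(task_cycles, data_mb, local_hz, remote_hz, uplink_mbps,
                   tx_joule_per_mb, local_joule_per_cycle, energy_budget_j):
    """Offload iff remote completion is faster and transmission energy is affordable."""
    t_local = task_cycles / local_hz
    t_remote = data_mb * 8 / uplink_mbps + task_cycles / remote_hz  # transfer + compute
    e_local = task_cycles * local_joule_per_cycle
    e_tx = data_mb * tx_joule_per_mb
    return t_remote < t_local and e_tx <= min(energy_budget_j, e_local)

# Example: 5 GCycle vision task, 2 MB frame, 1 GHz glasses vs 20 GHz edge server.
print(should_offload(5e9, 2.0, 1e9, 2e10, 50, 0.5, 1e-9, 5.0))  # -> True
```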

Context-Aware Service Recommendation System for the Social Internet of Things

  • paper_url: http://arxiv.org/abs/2308.08499
  • repo_url: None
  • paper_authors: Amar Khelloufi, Huansheng Ning, Abdelkarim Ben Sada, Abdenacer Naouri, Sahraoui Dhelim
  • for: This paper aims to improve the accuracy and relevance of personalized service recommendations in the Social Internet of Things (SIoT) context by exploring the contextual representation of each device-service pair.
  • methods: The proposed framework uses a latent features combination technique to capture latent feature interactions and Factorization Machines to model higher-order feature interactions specific to each SIoT device-service pair.
  • results: The experimental evaluation demonstrates the framework’s effectiveness in improving service recommendation accuracy and relevance.
    Abstract The Social Internet of Things (SIoT) enables interconnected smart devices to share data and services, opening up opportunities for personalized service recommendations. However, existing research often overlooks crucial aspects that can enhance the accuracy and relevance of recommendations in the SIoT context. Specifically, existing techniques tend to consider the extraction of social relationships between devices and neglect the contextual presentation of service reviews. This study aims to address these gaps by exploring the contextual representation of each device-service pair. Firstly, we propose a latent features combination technique that can capture latent feature interactions, by aggregating the device-device relationships within the SIoT. Then, we leverage Factorization Machines to model higher-order feature interactions specific to each SIoT device-service pair to accomplish accurate rating prediction. Finally, we propose a service recommendation framework for SIoT based on review aggregation and feature learning processes. The experimental evaluation demonstrates the framework's effectiveness in improving service recommendation accuracy and relevance.
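
The Factorization Machines step can be made concrete with the standard second-order FM score, computed via the usual O(k·n) identity; the dimensions and random inputs below are illustrative.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM score: y = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j."""
    linear = w0 + w @ x
    # Pairwise term via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
    s = V.T @ x
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

n, k = 8, 4                       # features, latent dimension
rng = np.random.default_rng(0)
x = rng.random(n)                 # contextual features of one device-service pair
print(fm_predict(x, 0.0, rng.normal(size=n), rng.normal(size=(n, k))))
```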

Conformal Predictions Enhanced Expert-guided Meshing with Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07358
  • repo_url: https://github.com/ahnobari/autosurf
  • paper_authors: Amin Heyrani Nobari, Justin Rey, Suhas Kodali, Matthew Jones, Faez Ahmed
  • for: This paper targets the automatic generation of meshes for CFD models.
  • methods: Graph Neural Networks (GNNs) and expert guidance are combined to generate CFD meshes automatically; a new 3D segmentation algorithm classifies surfaces more effectively.
  • results: A real-world case study shows the automatically generated meshes are comparable in quality to expert-generated ones and allow the CFD solver to converge to accurate results; the overall process is also 5 times faster than adaptive remeshing. Code and data are available at https://github.com/ahnobari/AutoSurf.
    Abstract Computational Fluid Dynamics (CFD) is widely used in different engineering fields, but accurate simulations are dependent upon proper meshing of the simulation domain. While highly refined meshes may ensure precision, they come with high computational costs. Similarly, adaptive remeshing techniques require multiple simulations and come at a great computational cost. This means that the meshing process is reliant upon expert knowledge and years of experience. Automating mesh generation can save significant time and effort and lead to a faster and more efficient design process. This paper presents a machine learning-based scheme that utilizes Graph Neural Networks (GNN) and expert guidance to automatically generate CFD meshes for aircraft models. In this work, we introduce a new 3D segmentation algorithm that outperforms two state-of-the-art models, PointNet++ and PointMLP, for surface classification. We also present a novel approach to project predictions from 3D mesh segmentation models to CAD surfaces using the conformal predictions method, which provides marginal statistical guarantees and robust uncertainty quantification and handling. We demonstrate that the addition of conformal predictions effectively enables the model to avoid under-refinement, hence failure, in CFD meshing even for weak and less accurate models. Finally, we demonstrate the efficacy of our approach through a real-world case study that demonstrates that our automatically generated mesh is comparable in quality to expert-generated meshes and enables the solver to converge and produce accurate results. Furthermore, we compare our approach to the alternative of adaptive remeshing in the same case study and find that our method is 5 times faster in the overall process of simulation. The code and data for this project are made publicly available at https://github.com/ahnobari/AutoSurf.

Knowledge Prompt-tuning for Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2308.08459
  • repo_url: https://github.com/zhaijianyang/kp4sr
  • paper_authors: Jianyang Zhai, Xiawu Zheng, Chang-Dong Wang, Hui Li, Yonghong Tian
  • for: The goal is a knowledge-base-grounded sequential recommendation (SR) method that addresses the lack of domain knowledge and fine-grained user preferences in existing SR methods.
  • methods: Knowledge Prompt-tuning for Sequential Recommendation (KP4SR) uses an external knowledge base and knowledge prompts to close the semantic gap, constructing relation templates and a knowledge tree with a knowledge tree mask to mitigate the noise the prompts introduce.
  • results: On three real-world datasets, KP4SR outperforms existing PLM-based methods, with improvements of 40.65%, 36.42% and 22.17% on the NDCG@5 and HR@5 metrics.
    Abstract Pre-trained language models (PLMs) have demonstrated strong performance in sequential recommendation (SR), which are utilized to extract general knowledge. However, existing methods still lack domain knowledge and struggle to capture users' fine-grained preferences. Meanwhile, many traditional SR methods improve this issue by integrating side information while suffering from information loss. To summarize, we believe that a good recommendation system should utilize both general and domain knowledge simultaneously. Therefore, we introduce an external knowledge base and propose Knowledge Prompt-tuning for Sequential Recommendation (KP4SR). Specifically, we construct a set of relationship templates and transform a structured knowledge graph (KG) into knowledge prompts to solve the problem of the semantic gap. However, knowledge prompts disrupt the original data structure and introduce a significant amount of noise. We further construct a knowledge tree and propose a knowledge tree mask, which restores the data structure in a mask matrix form, thus mitigating the noise problem. We evaluate KP4SR on three real-world datasets, and experimental results show that our approach outperforms state-of-the-art methods on multiple evaluation metrics. Specifically, compared with PLM-based methods, our method improves NDCG@5 and HR@5 by 40.65% and 36.42% on the books dataset, 11.17% and 11.47% on the music dataset, and 22.17% and 19.14% on the movies dataset, respectively. Our code is publicly available at https://github.com/zhaijianyang/KP4SR.
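
The knowledge-prompt construction can be sketched as verbalizing knowledge-graph triples through relation templates; the templates and triples below are invented for illustration and are not KP4SR's actual prompt set (nor does the sketch show the knowledge tree mask).

```python
# Relation templates mapping KG relations to natural-language sentences (assumed).
TEMPLATES = {
    "author_of": "{h} is written by {t}.",
    "genre_of":  "{h} belongs to the genre {t}.",
}

def triples_to_prompt(triples):
    """Verbalize (head, relation, tail) triples into a knowledge prompt."""
    return " ".join(TEMPLATES[r].format(h=h, t=t) for h, r, t in triples)

print(triples_to_prompt([("Dune", "author_of", "Frank Herbert"),
                         ("Dune", "genre_of", "science fiction")]))
```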

Demonstration of CORNET: A System For Learning Spreadsheet Formatting Rules By Example

  • paper_url: http://arxiv.org/abs/2308.07357
  • repo_url: None
  • paper_authors: Mukul Singh, Jose Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
  • for: CORNET automatically learns conditional formatting rules by example and applies them in Microsoft Excel.
  • methods: It combines symbolic rule enumeration, semi-supervised clustering, and iterative decision tree learning with a neural ranker to generate conditional formatting rules.
  • results: Given one or two formatted cells as examples, CORNET generates formatting-rule suggestions for the user to apply to the spreadsheet.
    Abstract Data management and analysis tasks are often carried out using spreadsheet software. A popular feature in most spreadsheet platforms is the ability to define data-dependent formatting rules. These rules can express actions such as "color red all entries in a column that are negative" or "bold all rows not containing error or failure." Unfortunately, users who want to exercise this functionality need to manually write these conditional formatting (CF) rules. We introduce CORNET, a system that automatically learns such conditional formatting rules from user examples. CORNET takes inspiration from inductive program synthesis and combines symbolic rule enumeration, based on semi-supervised clustering and iterative decision tree learning, with a neural ranker to produce accurate conditional formatting rules. In this demonstration, we show CORNET in action as a simple add-in to Microsoft Excel. After the user provides one or two formatted cells as examples, CORNET generates formatting rule suggestions for the user to apply to the spreadsheet.
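
As a toy illustration of the rule-learning idea, a shallow decision tree fit on a user's formatted cells recovers a threshold rule; CORNET's actual pipeline additionally uses symbolic enumeration, semi-supervised clustering, and a neural ranker, so this sketch covers only the decision-tree ingredient.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

cells = [-3.2, 5.0, -0.1, 7.4, -9.9, 2.2]
formatted = [1, 0, 1, 0, 1, 0]            # user colored the negative entries red
X = [[v] for v in cells]

tree = DecisionTreeClassifier(max_depth=1).fit(X, formatted)
# The learned split separates the negative (formatted) cells from the rest,
# i.e. a candidate rule like "color red when value < threshold".
print(export_text(tree, feature_names=["value"]))
```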

SPEGTI: Structured Prediction for Efficient Generative Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2308.10997
  • repo_url: None
  • paper_authors: Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar
  • for: Improve the quality and efficiency of text-to-image generation.
  • methods: Use a Markov Random Field (MRF) model to improve the compatibility between different regions of an image and reduce the number of Muse prediction steps.
  • results: The proposed method, SPEGTI, achieves a 1.5X speedup in Muse inference with no loss in output image quality.
    Abstract Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running inference multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. This method is shown to work in conjunction with the recently proposed Muse model. The MRF encodes the compatibility among image tokens at different spatial locations and enables us to significantly reduce the required number of Muse prediction steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, SPEGTI, uses this proposed MRF model to speed up Muse by 1.5X with no loss in output image quality.

HyperBandit: Contextual Bandit with Hypernetwork for Time-Varying User Preferences in Streaming Recommendation

  • paper_url: http://arxiv.org/abs/2308.08497
  • repo_url: None
  • paper_authors: Chenglei Shen, Xiao Zhang, Wei Wei, Jun Xu
  • for: This work proposes a streaming recommendation model that adapts quickly to time-varying user preferences, matching the dynamic preference shifts seen in real streaming scenarios.
  • methods: A contextual bandit method uses a hypernetwork to model time-varying user preferences and a bandit policy for online recommendation; to meet real-time requirements, training exploits a low-rank factorization of the parameter matrix.
  • results: Extensive experiments on real-world datasets show HyperBandit consistently outperforms state-of-the-art baselines in streaming recommendation and adapts quickly to changing preferences.
    Abstract In real-world streaming recommender systems, user preferences often dynamically change over time (e.g., a user may have different preferences during weekdays and weekends). Existing bandit-based streaming recommendation models only consider time as a timestamp, without explicitly modeling the relationship between time variables and time-varying user preferences. This leads to recommendation models that cannot quickly adapt to dynamic scenarios. To address this issue, we propose a contextual bandit approach using hypernetwork, called HyperBandit, which takes time features as input and dynamically adjusts the recommendation model for time-varying user preferences. Specifically, HyperBandit maintains a neural network capable of generating the parameters for estimating time-varying rewards, taking into account the correlation between time features and user preferences. Using the estimated time-varying rewards, a bandit policy is employed to make online recommendations by learning the latent item contexts. To meet the real-time requirements in streaming recommendation scenarios, we have verified the existence of a low-rank structure in the parameter matrix and utilize low-rank factorization for efficient training. Theoretically, we demonstrate a sublinear regret upper bound against the best policy. Extensive experiments on real-world datasets show that the proposed HyperBandit consistently outperforms the state-of-the-art baselines in terms of accumulated rewards.

AIGC In China: Current Developments And Future Outlook

  • paper_url: http://arxiv.org/abs/2308.08451
  • repo_url: None
  • paper_authors: Xiangyu Li, Yuqing Fan, Shenghui Cheng
  • for: This study analyzes the current state of the AI Generated Content (AIGC) field in China, including its technological foundations and application areas.
  • methods: Keyword searches identify relevant academic papers, which are used to analyze the market status, policy landscape, and development trajectory of AIGC in China.
  • results: AIGC in China is developing rapidly while facing challenges and risks; the study provides a comprehensive analysis of AIGC products and their ecosystem, along with a forward-looking view of the industry's future.
    Abstract The increasing attention given to AI Generated Content (AIGC) has brought a profound impact on various aspects of daily life, industrial manufacturing, and the academic sector. Recognizing the global trends and competitiveness in AIGC development, this study aims to analyze China's current status in the field. The investigation begins with an overview of the foundational technologies and current applications of AIGC. Subsequently, the study delves into the market status, policy landscape, and development trajectory of AIGC in China, utilizing keyword searches to identify relevant scholarly papers. Furthermore, the paper provides a comprehensive examination of AIGC products and their corresponding ecosystem, emphasizing the ecological construction of AIGC. Finally, this paper discusses the challenges and risks faced by the AIGC industry while presenting a forward-looking perspective on the industry's future based on competitive insights in AIGC.

OctoPack: Instruction Tuning Code Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07124
  • repo_url: https://github.com/bigcode-project/octopack
  • paper_authors: Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre
  • for: This paper instruction-tunes large language models (LLMs) to improve performance on natural language tasks.
  • methods: It leverages the natural structure of code via Git commits, which pair code changes with human instructions; the authors compile CommitPack, 4 terabytes of Git commits across 350 programming languages.
  • results: Training the 16B-parameter StarCoder model on CommitPack achieves state-of-the-art performance on the HumanEval Python benchmark (46.2% pass@1) among models not trained on OpenAI outputs, and the best performance on the HumanEvalPack benchmark.
    Abstract Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack.

CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

  • paper_url: http://arxiv.org/abs/2308.07146
  • repo_url: https://github.com/kevinlight831/ctp
  • paper_authors: Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, Yao Zhao
  • for: This paper studies Vision-Language Continual Pretraining (VLCP).
  • methods: It proposes a new algorithm combining Compatible Momentum Contrast with Topology Preservation.
  • results: Experiments show the algorithm not only outperforms multiple baselines but also avoids expensive training costs.
    Abstract Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Regarding the growing nature of real-world data, such an offline training paradigm on ever-expanding data is unsustainable, because models lack the continual learning ability to accumulate knowledge constantly. However, most continual learning studies are limited to uni-modal classification and existing multi-modal datasets cannot simulate continual non-stationary data stream scenarios. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D which contains over one million product image-text pairs from 9 industries. The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data. We comprehensively study the characteristics and challenges of VLCP, and propose a new algorithm: Compatible momentum contrast with Topology Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge of the current and previous-task models to flexibly update the modal feature. Moreover, Topology Preservation transfers the knowledge of embedding across tasks while preserving the flexibility of feature adjustment. The experimental results demonstrate our method not only achieves superior performance compared with other baselines but also does not bring an expensive training burden. Dataset and codes are available at https://github.com/KevinLight831/CTP.

Natural Language is All a Graph Needs

  • paper_url: http://arxiv.org/abs/2308.07134
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, Yongfeng Zhang
  • for: The study asks whether large language models (LLMs) can replace graph neural networks (GNNs) as the foundation model for graphs.
  • methods: InstructGLM (Instruction-finetuned Graph Language Model) systematically designs scalable prompts based on natural language instructions, describing a graph's geometric structure and node features in natural language for instruction tuning.
  • results: The method surpasses all competitive GNN baselines on the ogbn-arxiv, Cora, and PubMed datasets, demonstrating its effectiveness and the potential of generative LLMs as foundation models for graph machine learning.
    Abstract The emergence of large-scale pre-trained language models, such as ChatGPT, has revolutionized various research fields in artificial intelligence. Transformers-based large language models (LLMs) have gradually replaced CNNs and RNNs to unify fields of computer vision and natural language processing. Compared with the data that exists relatively independently such as images, videos or texts, graph is a type of data that contains rich structural and relational information. Meanwhile, natural language, as one of the most expressive mediums, excels in describing complex structures. However, existing work on incorporating graph learning problems into the generative language modeling framework remains very limited. As the importance of large language models continues to grow, it becomes essential to explore whether LLMs can also replace GNNs as the foundation model for graphs. In this paper, we propose InstructGLM (Instruction-finetuned Graph Language Model), systematically design highly scalable prompts based on natural language instructions, and use natural language to describe the geometric structure and node features of the graph for instruction tuning an LLM to perform learning and inference on graphs in a generative manner. Our method exceeds all competitive GNN baselines on ogbn-arxiv, Cora and PubMed datasets, which demonstrates the effectiveness of our method and sheds light on generative large language models as the foundation model for graph machine learning.

Implementation of The Future of Drug Discovery: Quantum-Based Machine Learning Simulation (QMLS)

  • paper_url: http://arxiv.org/abs/2308.08561
  • repo_url: None
  • paper_authors: Yew Kee Wong, Yifan Zhou, Yan Shing Liang, Haichuan Qiu, Yu Xi Wu, Bin He
  • for: This work aims to shorten the R&D phase of drug development to three to six months and reduce its cost to fifty to eighty thousand USD.
  • methods: The concept uses machine learning to generate possible hits and quantum simulation to filter them by reaction and binding effectiveness with the target protein.
  • results: The concept shortens the R&D phase to three to six months, cuts the cost to fifty to eighty thousand USD, and yields dozens of drug candidates ready for pre-clinical trials.
    Abstract The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of the target protein while the Quantum Simulation (QS) filters molecules from the primary assay based on the reaction and binding effectiveness with the target protein. Then, for Lead Optimization, the resultant molecules generated and filtered from MLMG and QS are compared, and molecules that appear as a result of both processes will be made into dozens of molecular variations through Machine Learning Molecule Variation (MLMV), while others will only be made into a few variations. Lastly, all optimized molecules would undergo multiple rounds of QS filtering with a high standard for reaction effectiveness and safety, creating a few dozen pre-clinical-trial-ready drugs. This paper is based on our first paper, where we pitched the concept of machine learning combined with quantum simulations. In this paper we will go over the detailed design and framework of QMLS, including MLMG, MLMV, and QS.

Ada-QPacknet – adaptive pruning with bit width reduction as an efficient continual learning method without forgetting

  • paper_url: http://arxiv.org/abs/2308.07939
  • repo_url: None
  • paper_authors: Marcin Pietroń, Dominik Żurek, Kamil Faber, Roberto Corizzo
  • for: This paper aims to improve the efficiency of Continual Learning (CL) algorithms in dynamic and complex environments.
  • methods: The proposed approach, called Ada-QPacknet, incorporates both pruning and quantization techniques to reduce the size of the model and improve its performance in CL scenarios.
  • results: The presented results show that the proposed approach achieves similar accuracy as floating-point sub-networks in well-known CL scenarios, and outperforms most other CL strategies in task and class incremental scenarios.
    Abstract Continual Learning (CL) is a setting in which there is still a huge gap between human and deep learning model efficiency. Many CL algorithms have been designed recently, but most struggle to learn in dynamic and complex environments. In this work, a new architecture-based approach, Ada-QPacknet, is described. It incorporates pruning to extract a sub-network for each task. The crucial aspect of architecture-based CL methods is their capacity. In the presented method, the model size is reduced by an efficient linear and nonlinear quantisation approach that lowers the bit-width of the weight format. The presented results show that hybrid 8- and 4-bit quantisation achieves accuracy similar to the floating-point sub-network on well-known CL scenarios. To our knowledge, it is the first CL strategy that incorporates both compression techniques, pruning and quantisation, for generating task sub-networks. The presented algorithm was tested on well-known episode combinations and compared with the most popular algorithms. Results show that the proposed approach outperforms most CL strategies in task and class incremental scenarios.

#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07074
  • repo_url: https://github.com/ofa-sys/instag
  • paper_authors: Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, Jingren Zhou
  • for: This paper investigates how language models acquire instruction-following ability through supervised fine-tuning (SFT).
  • methods: InsTag, an open-set fine-grained tagger, tags samples in SFT datasets based on semantics and intentions, and defines quantitative measures of instruction diversity and complexity over the tags.
  • results: Fine-tuning on 6K diverse and complex samples selected with InsTag markedly improves model ability, matching or surpassing models trained on far larger SFT data as evaluated by MT-Bench.
    Abstract Foundation language models obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions and define instruction diversity and complexity regarding tags. We obtain 6.6K tags to describe comprehensive user queries. Then we analyze popular open-sourced SFT datasets and find that the model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag in https://github.com/OFA-Sys/InsTag.

Machine Unlearning: Solutions and Challenges

  • paper_url: http://arxiv.org/abs/2308.07061
  • repo_url: None
  • paper_authors: Jie Xu, Zihan Wu, Cong Wang, Xiaohua Jia
  • for: This work provides a comprehensive taxonomy of machine unlearning research and an analysis and evaluation of existing work.
  • methods: Existing research is categorized into exact unlearning algorithms and approximate unlearning methods, which are analyzed and critically evaluated.
  • results: The work proposes future directions for machine unlearning research and encourages impactful contributions that address real-world needs for selective data removal.
    Abstract Machine learning models may inadvertently memorize sensitive, unauthorized, or malicious data, posing risks of privacy violations, security breaches, and performance deterioration. To address these issues, machine unlearning has emerged as a critical technique to selectively remove specific training data points' influence on trained models. This paper provides a comprehensive taxonomy and analysis of machine unlearning research. We categorize existing research into exact unlearning that algorithmically removes data influence entirely and approximate unlearning that efficiently minimizes influence through limited parameter updates. By reviewing the state-of-the-art solutions, we critically discuss their advantages and limitations. Furthermore, we propose future directions to advance machine unlearning and establish it as an essential capability for trustworthy and adaptive machine learning. This paper provides researchers with a roadmap of open problems, encouraging impactful contributions to address real-world needs for selective data removal.

Distinguishing Risk Preferences using Repeated Gambles

  • paper_url: http://arxiv.org/abs/2308.07054
  • repo_url: None
  • paper_authors: James Price, Colm Connaughton
  • for: The paper explores the practical challenges of inferring risk preferences from the observed choices of artificial agents in sequences of repeated gambles.
  • methods: The paper uses the Yeo-Johnson transformation to construct a family of gambles that interpolates smoothly between the additive and multiplicative cases, and analyzes the optimal strategy for this family both analytically and numerically.
  • results: The paper finds that it becomes increasingly difficult to distinguish the risk preferences of agents as their wealth increases, because agents with different risk preferences eventually make the same decisions for sufficiently high wealth.
    Abstract Sequences of repeated gambles provide an experimental tool to characterize the risk preferences of humans or artificial decision-making agents. The difficulty of this inference depends on factors including the details of the gambles offered and the number of iterations of the game played. In this paper we explore in detail the practical challenges of inferring risk preferences from the observed choices of artificial agents who are presented with finite sequences of repeated gambles. We are motivated by the fact that the strategy to maximize long-run wealth for sequences of repeated additive gambles (where gains and losses are independent of current wealth) is different to the strategy for repeated multiplicative gambles (where gains and losses are proportional to current wealth.) Accurate measurement of risk preferences would be needed to tell whether an agent is employing the optimal strategy or not. To generalize the types of gambles our agents face we use the Yeo-Johnson transformation, a tool borrowed from feature engineering for time series analysis, to construct a family of gambles that interpolates smoothly between the additive and multiplicative cases. We then analyze the optimal strategy for this family, both analytically and numerically. We find that it becomes increasingly difficult to distinguish the risk preferences of agents as their wealth increases. This is because agents with different risk preferences eventually make the same decisions for sufficiently high wealth. We believe that these findings are informative for the effective design of experiments to measure risk preferences in humans.
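
The Yeo-Johnson transform is available off the shelf; the sketch below applies it to a wealth grid at several λ values, with λ = 1 the identity-like additive regime and λ → 0 the logarithmic, multiplicative-like regime. How exactly λ parameterizes the paper's gamble family is an assumption here, not the paper's definition.

```python
import numpy as np
from scipy.stats import yeojohnson

wealth = np.array([50.0, 100.0, 200.0, 400.0])
for lam in (1.0, 0.5, 0.0):  # lambda=1 ~ additive regime, lambda=0 ~ log/multiplicative
    transformed = yeojohnson(wealth, lmbda=lam)
    print(lam, np.round(transformed, 2))
```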

Diagnosis of Scalp Disorders using Machine Learning and Deep Learning Approach – A Review

  • paper_url: http://arxiv.org/abs/2308.07052
  • repo_url: None
  • paper_authors: Hrishabh Tiwari, Jatin Moolchandani, Shamla Mantri
  • for: The study aims to improve the accuracy and efficiency of scalp disorder diagnosis.
  • methods: It reviews deep learning models, including CNNs and FCNs, together with an accompanying app, for accurate scalp disorder diagnosis.
  • results: Deep learning models diagnose scalp disorders accurately and efficiently: some reviewed systems reach 97.41%-99.09% accuracy, while others reach 82.9% and 91.4%.
    Abstract The morbidity of scalp diseases is minuscule compared to other diseases, but the impact on the patient's life is enormous. It is common for people to experience scalp problems including dandruff, psoriasis, tinea capitis, alopecia and atopic dermatitis. According to WHO research, approximately 70% of adults have problems with their scalp. Descriptive research has demonstrated that hair quality is impaired by an impaired scalp, but these impacts are reversible with early diagnosis and treatment. Deep learning advances have demonstrated the effectiveness of CNNs paired with FCNs in diagnosing scalp and skin disorders. In one proposed deep-learning-based scalp inspection and diagnosis system, an imaging microscope and a trained model are combined with an app that classifies scalp disorders accurately with an average precision of 97.41%-99.09%. Another study classified psoriasis using a CNN with an accuracy of 82.9%. In a further study, an ML-based algorithm accurately classified healthy scalp and alopecia areata with 91.4% and 88.9% accuracy using SVM and KNN algorithms. Diagnosing scalp-related diseases with deep learning models has improved thanks to advancements in computational capabilities and computer vision, but there remains a wide horizon for further improvement.

The minimal computational substrate of fluid intelligence

  • paper_url: http://arxiv.org/abs/2308.07039
  • repo_url: None
  • paper_authors: Amy PK Nelson, Joe Mole, Guilherme Pombo, Robert J Gray, James K Ruffle, Edgar Chan, Geraint E Rees, Lisa Cipolotti, Parashkev Nachev
  • for: The study evaluates a compact version of Raven's Advanced Progressive Matrices (RAPM), a widely used test of fluid intelligence, to test whether a self-supervised artificial neural network (LaMa) can pass a human-level intelligence test.
  • methods: LaMa is a self-supervised network trained solely on completing partially masked images of natural environmental scenes.
  • results: LaMa achieves human-level RAPM scores, varies with item difficulty much like healthy and focally lesioned participants, and produces errors characteristic of right frontal lobe damage when its ability to integrate global spatial patterns is degraded.
    Abstract The quantification of cognitive powers rests on identifying a behavioural task that depends on them. Such dependence cannot be assured, for the powers a task invokes cannot be experimentally controlled or constrained a priori, resulting in unknown vulnerability to failure of specificity and generalisability. Evaluating a compact version of Raven's Advanced Progressive Matrices (RAPM), a widely used clinical test of fluid intelligence, we show that LaMa, a self-supervised artificial neural network trained solely on the completion of partially masked images of natural environmental scenes, achieves human-level test scores a prima vista, without any task-specific inductive bias or training. Compared with cohorts of healthy and focally lesioned participants, LaMa exhibits human-like variation with item difficulty, and produces errors characteristic of right frontal lobe damage under degradation of its ability to integrate global spatial patterns. LaMa's narrow training and limited capacity -- comparable to the nervous system of the fruit fly -- suggest RAPM may be open to computationally simple solutions that need not necessarily invoke abstract reasoning.

Bayesian Flow Networks

  • paper_url: http://arxiv.org/abs/2308.07037
  • repo_url: https://github.com/stefanradev93/BayesFlow
  • paper_authors: Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, Faustino Gomez
  • for: This paper introduces a new generative model, the Bayesian Flow Network (BFN), in which the parameters of a set of independent distributions are modified with Bayesian inference in light of noisy data samples and then passed to a neural network that outputs a second, interdependent distribution.
  • methods: The model updates the parameters via Bayesian inference and outputs a generative distribution through a neural network; experiments use discrete- and continuous-time loss functions together with sample generation procedures.
  • results: Experiments show BFNs achieve competitive log-likelihoods on dynamically binarized MNIST and CIFAR-10 image modelling and surpass all known discrete diffusion models on the text8 character-level language modelling task.
    Abstract This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.

Bayesian Physics-Informed Neural Network for the Forward and Inverse Simulation of Engineered Nano-particles Mobility in a Contaminated Aquifer

  • paper_url: http://arxiv.org/abs/2308.07352
  • repo_url: None
  • paper_authors: Shikhar Nilabh, Fidel Grandia
  • for: This study develops a predictive model of engineered nanoparticle (ENP) transport in a contaminated aquifer to support effective restoration of the local environment and ecosystem.
  • methods: A Bayesian Physics-Informed Neural Network (B-PINN) framework models nanoparticle mobility within the aquifer.
  • results: B-PINN accurately predicts nanoparticle mobility and quantifies uncertainty; the inverse model identifies the governing transport parameters, providing predictive insights for developing groundwater remediation strategies.
    Abstract Globally, there are many polluted groundwater sites that need an active remediation plan for the restoration of local ecosystem and environment. Engineered nanoparticles (ENPs) have proven to be an effective reactive agent for the in-situ degradation of pollutants in groundwater. While the performance of these ENPs has been highly promising on the laboratory scale, their application in real field case conditions is still limited. The complex transport and retention mechanisms of ENPs hinder the development of an efficient remediation strategy. Therefore, a predictive tool to comprehend the transport and retention behavior of ENPs is highly required. The existing tools in the literature are dominated with numerical simulators, which have limited flexibility and accuracy in the presence of sparse datasets and the aquifer heterogeneity. This work uses a Bayesian Physics-Informed Neural Network (B-PINN) framework to model the nano-particles mobility within an aquifer. The result from the forward model demonstrates the effective capability of B-PINN in accurately predicting the ENPs mobility and quantifying the uncertainty. The inverse model output is then used to predict the governing parameters for the ENPs mobility in a small-scale aquifer. The research demonstrates the capability of the tool to provide predictive insights for developing an efficient groundwater remediation strategy.

IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse

  • paper_url: http://arxiv.org/abs/2308.07351
  • repo_url: None
  • paper_authors: Siyuan Li, Hao Li, Jin Zhang, Zhen Wang, Peng Liu, Chongjie Zhang
  • for: This work addresses the challenge of selecting source policies suited to a target task and proposes a new transfer RL method.
  • methods: The method uses the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the target policy. It integrates optimization transfer and behavior transfer (IOB) by regularizing the learned policy to imitate the guidance policy, improving transfer effectiveness.
  • results: On benchmark tasks, the method surpasses state-of-the-art transfer RL baselines and improves final performance and knowledge transferability in continual learning scenarios. The optimization transfer technique is also proven to improve target policy learning.
    Abstract Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.

Efficient Neural PDE-Solvers using Quantization Aware Training

  • paper_url: http://arxiv.org/abs/2308.07350
  • repo_url: None
  • paper_authors: Winfried van den Dool, Tijmen Blankevoort, Max Welling, Yuki M. Asano
  • for: Reducing the computational cost of solving Partial Differential Equations (PDEs) with neural networks, which serve as an alternative to classical numerical methods.
  • methods: Applies state-of-the-art quantization methods to the network weights and activations to lower inference cost, without restricting the spatial resolution on which the PDEs are defined (a minimal fake-quantization sketch follows the abstract).
  • results: Experiments on four standard PDE datasets and three network architectures show that quantization-aware training lowers computational cost across settings and three orders of magnitude in FLOPs while maintaining performance. Pareto-optimality of computational cost versus performance is almost always achieved only by incorporating quantization.
    Abstract In the past years, the application of neural networks as an alternative to classical numerical methods to solve Partial Differential Equations has emerged as a potential paradigm shift in this century-old mathematical field. However, in terms of practical applicability, computational cost remains a substantial bottleneck. Classical approaches try to mitigate this challenge by limiting the spatial resolution on which the PDEs are defined. For neural PDE solvers, we can do better: Here, we investigate the potential of state-of-the-art quantization methods on reducing computational costs. We show that quantizing the network weights and activations can successfully lower the computational cost of inference while maintaining performance. Our results on four standard PDE datasets and three network architectures show that quantization-aware training works across settings and three orders of FLOPs magnitudes. Finally, we empirically demonstrate that Pareto-optimality of computational cost vs performance is almost always achieved only by incorporating quantization.
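    Code sketch: Quantization-aware training is commonly implemented by inserting "fake quantization" into the forward pass while letting gradients flow through unchanged (a straight-through estimator). A minimal sketch of that general mechanism, not the paper's implementation:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate uniform quantization in the forward pass; the straight-
    through estimator keeps the backward pass at full precision."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale)
    x_q = (torch.round(x / scale + zero_point).clamp(qmin, qmax) - zero_point) * scale
    return x + (x_q - x).detach()  # quantized value forward, identity gradient
```

    During training, the solver's weights and activations pass through such a function so the network adapts to the precision it will see at inference time.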

Aggregating Intrinsic Information to Enhance BCI Performance through Federated Learning

  • paper_url: http://arxiv.org/abs/2308.11636
  • repo_url: None
  • paper_authors: Rui Liu, Yuanyuan Chen, Anran Li, Yi Ding, Han Yu, Cuntai Guan
  • for: Building high-performance deep learning models for Brain-Computer Interfaces (BCI) despite data scarcity and device heterogeneity.
  • methods: Proposes a hierarchical personalized Federated Learning EEG decoding (FLEEG) framework in which datasets with disparate data formats collaborate in model training, improving generalization and robustness (a simplified aggregation sketch follows the abstract).
  • results: On Motor Imagery (MI) classification across nine EEG datasets collected with different devices but implementing the same MI task, collaborative training boosts classification performance by up to 16.7%, with the largest gains on smaller datasets.
    Abstract Insufficient data is a long-standing challenge for Brain-Computer Interface (BCI) to build a high-performance deep learning model. Though numerous research groups and institutes collect a multitude of EEG datasets for the same BCI task, sharing EEG data from multiple sites is still challenging due to the heterogeneity of devices. The significance of this challenge cannot be overstated, given the critical role of data diversity in fostering model robustness. However, existing works rarely discuss this issue, predominantly centering their attention on model training within a single dataset, often in the context of inter-subject or inter-session settings. In this work, we propose a hierarchical personalized Federated Learning EEG decoding (FLEEG) framework to surmount this challenge. This innovative framework heralds a new learning paradigm for BCI, enabling datasets with disparate data formats to collaborate in the model training process. Each client is assigned a specific dataset and trains a hierarchical personalized model to manage diverse data formats and facilitate information exchange. Meanwhile, the server coordinates the training procedure to harness knowledge gleaned from all datasets, thus elevating overall performance. The framework has been evaluated in Motor Imagery (MI) classification with nine EEG datasets collected by different devices but implementing the same MI task. Results demonstrate that the proposed frame can boost classification performance up to 16.7% by enabling knowledge sharing between multiple datasets, especially for smaller datasets. Visualization results also indicate that the proposed framework can empower the local models to put a stable focus on task-related areas, yielding better performance. To the best of our knowledge, this is the first end-to-end solution to address this important challenge.
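    Code sketch: A toy illustration of the hierarchical-personalization idea: the server averages only the shared parameters, while each client keeps its personalized (e.g., device-specific) layers local. The split into shared vs. personalized parameters is an illustrative assumption, not the paper's code.

```python
def server_aggregate(client_weights, shared_keys):
    """Federated averaging restricted to the shared parameters.

    client_weights: list of dicts mapping parameter name -> array;
    shared_keys:    names of parameters averaged across clients, while
                    personalized layers stay untouched on each client.
    """
    n = len(client_weights)
    averaged = {k: sum(w[k] for w in client_weights) / n for k in shared_keys}
    for w in client_weights:  # broadcast the shared part back to clients
        for k in shared_keys:
            w[k] = averaged[k].copy()
    return client_weights
```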

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

  • paper_url: http://arxiv.org/abs/2308.08488
  • repo_url: https://github.com/mispchallenge/misp-icme-avsr
  • paper_authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee
  • for: Improving audio-visual speech recognition (AVSR), especially with low-quality video.
  • methods: Proposes two techniques: (1) exploiting the correlation between lip shapes and syllable-level subword units in Mandarin to establish accurate frame-level syllable boundaries, enabling precise alignment of the video and audio streams during visual pre-training and cross-modal fusion; (2) an audio-guided cross-modal fusion encoder (CMFE) that devotes its main training parameters to multiple cross-modal attention layers to fully exploit modality complementarity (a single fusion layer is sketched after the abstract).
  • results: Experiments on the MISP2021-AVSR dataset confirm the effectiveness of both techniques. Using only a relatively small amount of training data, the final system outperforms state-of-the-art systems with more complex front-ends and back-ends.
    Abstract In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Unmatching convergence rates and specialized input representations between audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. This enables accurate alignment of video and audio streams during visual model pre-training and cross-modal fusion. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends.
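    Code sketch: One plausible reading of an audio-guided fusion layer, where audio features query the visual stream through cross-attention. Dimensions and layer composition are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """One cross-modal block: audio queries attend over visual keys/values."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_audio, dim), visual: (B, T_video, dim)
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual keeps the audio stream primary
```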

pNNCLR: Stochastic Pseudo Neighborhoods for Contrastive Learning based Unsupervised Representation Learning Problems

  • paper_url: http://arxiv.org/abs/2308.06983
  • repo_url: None
  • paper_authors: Momojit Biswas, Himanshu Buckchash, Dilip K. Prasad
  • for: Increasing semantic variation in nearest-neighbor-based self-supervised learning (SSL) for image recognition.
  • methods: Refines the nearest-neighbor SSL approach (NNCLR) with pseudo nearest neighbors (pNN) that control the quality of the support set: rather than sampling the nearest neighbors directly, it samples in the vicinity of hard nearest neighbors by varying the magnitude of the resultant vector with a stochastic sampling strategy (sketched after the abstract), and stabilizes training with a smooth-weight-update approach.
  • results: On multiple public image recognition and medical image recognition datasets, the method performs up to 8% better than the baseline nearest-neighbor method and is comparable to other previously proposed SSL methods.
    Abstract Nearest neighbor (NN) sampling provides more semantic variations than pre-defined transformations for self-supervised learning (SSL) based image recognition problems. However, its performance is restricted by the quality of the support set, which holds positive samples for the contrastive loss. In this work, we show that the quality of the support set plays a crucial role in any nearest neighbor based method for SSL. We then provide a refined baseline (pNNCLR) to the nearest neighbor based SSL approach (NNCLR). To this end, we introduce pseudo nearest neighbors (pNN) to control the quality of the support set, wherein, rather than sampling the nearest neighbors, we sample in the vicinity of hard nearest neighbors by varying the magnitude of the resultant vector and employing a stochastic sampling strategy to improve the performance. Additionally, to stabilize the effects of uncertainty in NN-based learning, we employ a smooth-weight-update approach for training the proposed network. Evaluation of the proposed method on multiple public image recognition and medical image recognition datasets shows that it performs up to 8 percent better than the baseline nearest neighbor method, and is comparable to other previously proposed SSL methods.
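    Code sketch: A rough sketch of the pseudo-nearest-neighbor step described in the abstract: look up each query's hard nearest neighbor in the support set, then stochastically perturb its magnitude. Shapes and the noise model are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pseudo_nearest_neighbor(z: torch.Tensor, support: torch.Tensor,
                            noise_scale: float = 0.1) -> torch.Tensor:
    """Sample near each query's hard nearest neighbor in the support set.

    z: (B, D) query embeddings; support: (N, D) support-set embeddings.
    """
    sims = F.normalize(z, dim=1) @ F.normalize(support, dim=1).t()
    nn_vec = support[sims.argmax(dim=1)]           # hard nearest neighbors
    magnitude = 1.0 + noise_scale * torch.randn(z.size(0), 1)
    return nn_vec * magnitude                      # stochastic pseudo neighbor
```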

Routing Recovery for UAV Networks with Deliberate Attacks: A Reinforcement Learning based Approach

  • paper_url: http://arxiv.org/abs/2308.06973
  • repo_url: None
  • paper_authors: Sijie He, Ziye Jia, Chao Dong, Wei Wang, Yilu Cao, Yang Yang, Qihui Wu
  • for: Routing planning and recovery for UAV networks under deliberate attacks.
  • methods: Designs a deliberate attack model based on node importance, presents a node importance ranking mechanism that considers node degree and link importance (an illustrative scoring sketch follows the abstract), and proposes a reinforcement-learning-based algorithm to recover routing paths when UAVs are attacked.
  • results: Simulations show the proposed mechanism recovers routing paths more effectively than the other reference methods, improving the reliability of the UAV network.
    Abstract The unmanned aerial vehicle (UAV) network is popular these years due to its various applications. In the UAV network, routing is significantly affected by the distributed network topology, leading to the issue that UAVs are vulnerable to deliberate damage. Hence, this paper focuses on the routing plan and recovery for UAV networks with attacks. In detail, a deliberate attack model based on the importance of nodes is designed to represent enemy attacks. Then, a node importance ranking mechanism is presented, considering the degree of nodes and link importance. However, it is intractable to handle the routing problem by traditional methods for UAV networks, since link connections change with the UAV availability. Hence, an intelligent algorithm based on reinforcement learning is proposed to recover the routing path when UAVs are attacked. Simulations are conducted and numerical results verify the proposed mechanism performs better than other referred methods.
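    Code sketch: An illustrative way to combine node degree with link importance when ranking nodes, using edge betweenness as a stand-in for the paper's link-importance measure; the weighting `alpha` is a hypothetical knob:

```python
import networkx as nx

def rank_node_importance(g: nx.Graph, alpha: float = 0.5):
    """Score each node by a mix of degree centrality and the betweenness
    of its incident links, then rank nodes from most to least important."""
    degree = nx.degree_centrality(g)
    edge_bc = nx.edge_betweenness_centrality(g)
    scores = {}
    for v in g:
        link_imp = sum(edge_bc.get(e, edge_bc.get((e[1], e[0]), 0.0))
                       for e in g.edges(v))
        scores[v] = alpha * degree[v] + (1 - alpha) * link_imp
    return sorted(scores, key=scores.get, reverse=True)
```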

BIRP: Bitcoin Information Retrieval Prediction Model Based on Multimodal Pattern Matching

  • paper_url: http://arxiv.org/abs/2308.08558
  • repo_url: None
  • paper_authors: Minsuk Kim, Byungchul Kim, Junyeong Yong, Jeongwoo Park, Gyeongmin Kim
  • for: Proposes a direction prediction model based on past chart (PC) pattern matching to improve forecasting of financial time series.
  • methods: Uses a multimodal pattern matching algorithm to rank similar PC movements given the current chart (CC) information, and feeds the ranked matches to the direction prediction model as additional features (a distance-based ranking sketch follows the abstract).
  • results: Applied to Bitcoin, whose highly volatile prices make future movements hard to predict, the added features improve the model's directional prediction capacity.
    Abstract Financial time series have historically been assumed to be a martingale process under the Random Walk hypothesis. Instead of making investment decisions using the raw prices alone, various multimodal pattern matching algorithms have been developed to help detect subtly hidden repeatable patterns within the financial market. Many of the chart-based pattern matching tools only retrieve similar past chart (PC) patterns given the current chart (CC) pattern, and leaves the entire interpretive and predictive analysis, thus ultimately the final investment decision, to the investors. In this paper, we propose an approach of ranking similar PC movements given the CC information and show that exploiting this as additional features improves the directional prediction capacity of our model. We apply our ranking and directional prediction modeling methodologies on Bitcoin due to its highly volatile prices that make it challenging to predict its future movements.
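    Code sketch: One simple way to retrieve and rank similar past chart windows, using z-normalized Euclidean distance; the paper's multimodal matching is richer, so treat this purely as an illustrative baseline:

```python
import numpy as np

def rank_similar_patterns(history: np.ndarray, current: np.ndarray, top_k: int = 5):
    """Return start indices of the past windows closest to the current
    chart pattern under z-normalized Euclidean distance."""
    def znorm(w: np.ndarray) -> np.ndarray:
        return (w - w.mean()) / (w.std() + 1e-8)

    m = len(current)
    cc = znorm(current)
    dists = [(np.linalg.norm(znorm(history[i:i + m]) - cc), i)
             for i in range(len(history) - m)]
    return [i for _, i in sorted(dists)[:top_k]]
```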

Graph Structural Residuals: A Learning Approach to Diagnosis

  • paper_url: http://arxiv.org/abs/2308.06961
  • repo_url: None
  • paper_authors: Jan Lukas Augustin, Oliver Niggemann
  • for: Proposes a data-driven approach to system diagnosis that combines concepts from model-based diagnosis with deep graph structure learning.
  • methods: Redefines the constructs of system representation, observations, and faults, and introduces two versions of a self-supervised graph structure learning architecture that learn the system's underlying structure from dynamic observational data, represented by two distinct graph adjacency matrices.
  • results: Experiments on a system of coupled oscillators demonstrate the feasibility and potential of the data-driven diagnostic method.
    Abstract Traditional model-based diagnosis relies on constructing explicit system models, a process that can be laborious and expertise-demanding. In this paper, we propose a novel framework that combines concepts of model-based diagnosis with deep graph structure learning. This data-driven approach leverages data to learn the system's underlying structure and provide dynamic observations, represented by two distinct graph adjacency matrices. Our work facilitates a seamless integration of graph structure learning with model-based diagnosis by making three main contributions: (i) redefining the constructs of system representation, observations, and faults (ii) introducing two distinct versions of a self-supervised graph structure learning model architecture and (iii) demonstrating the potential of our data-driven diagnostic method through experiments on a system of coupled oscillators.

Search to Fine-tune Pre-trained Graph Neural Networks for Graph-level Tasks

  • paper_url: http://arxiv.org/abs/2308.06960
  • repo_url: None
  • paper_authors: Zhili Wang, Shimin Di, Lei Chen, Xiaofang Zhou
  • for: This paper aims to design a better fine-tuning strategy for pre-trained graph neural networks (GNNs) to improve their performance on downstream graph-level tasks.
  • methods: The proposed method, called S2PGNN, searches for an appropriate fine-tuning framework for the given labeled data on the downstream task, adaptively designing a suitable strategy for each task.
  • results: The empirical studies show that S2PGNN can be implemented on the top of 10 famous pre-trained GNNs and consistently improve their performance. Additionally, S2PGNN achieves better performance than existing fine-tuning strategies within and outside the GNN area.
    Abstract Recently, graph neural networks (GNNs) have shown its unprecedented success in many graph-related tasks. However, GNNs face the label scarcity issue as other neural networks do. Thus, recent efforts try to pre-train GNNs on a large-scale unlabeled graph and adapt the knowledge from the unlabeled graph to the target downstream task. The adaptation is generally achieved by fine-tuning the pre-trained GNNs with a limited number of labeled data. Despite the importance of fine-tuning, current GNNs pre-training works often ignore designing a good fine-tuning strategy to better leverage transferred knowledge and improve the performance on downstream tasks. Only few works start to investigate a better fine-tuning strategy for pre-trained GNNs. But their designs either have strong assumptions or overlook the data-aware issue for various downstream datasets. Therefore, we aim to design a better fine-tuning strategy for pre-trained GNNs to improve the model performance in this paper. Given a pre-trained GNN, we propose to search to fine-tune pre-trained graph neural networks for graph-level tasks (S2PGNN), which adaptively design a suitable fine-tuning framework for the given labeled data on the downstream task. To ensure the improvement brought by searching fine-tuning strategy, we carefully summarize a proper search space of fine-tuning framework that is suitable for GNNs. The empirical studies show that S2PGNN can be implemented on the top of 10 famous pre-trained GNNs and consistently improve their performance. Besides, S2PGNN achieves better performance than existing fine-tuning strategies within and outside the GNN area. Our code is publicly available at \url{https://anonymous.4open.science/r/code_icde2024-A9CB/}.

Approximating Human-Like Few-shot Learning with GPT-based Compression

  • paper_url: http://arxiv.org/abs/2308.06942
  • repo_url: None
  • paper_authors: Cynthia Huang, Yuqing Xie, Zhiying Jiang, Jimmy Lin, Ming Li
  • For: Equipping generative pre-trained models with human-like few-shot learning by framing learning as information compression during inference.
  • Methods: Uses the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity and estimate the optimal Information Distance for few-shot learning, first employing GPT as a prior for lossless text compression (the underlying distance formula is sketched after the abstract).
  • Results: With a LLAMA2-7B backbone, the method achieves a compression ratio of 15.5 on enwik9 and outperforms embedding and prompt baselines on challenging NLP tasks, including semantic similarity, zero- and one-shot text classification, and zero-shot text ranking.
    Abstract In this work, we conceptualize the learning process as information compression. We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference. We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity, with the aim of estimating the optimal Information Distance for few-shot learning. We first propose using GPT as a prior for lossless text compression, achieving a noteworthy compression ratio. Experiment with LLAMA2-7B backbone achieves a compression ratio of 15.5 on enwik9. We justify the pre-training objective of GPT models by demonstrating its equivalence to the compression length, and, consequently, its ability to approximate the information distance for texts. Leveraging the approximated information distance, our method allows the direct application of GPT models in quantitative text similarity measurements. Experiment results show that our method overall achieves superior performance compared to embedding and prompt baselines on challenging NLP tasks, including semantic similarity, zero and one-shot text classification, and zero-shot text ranking.
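    Code sketch: The information distance the paper approximates is, in practice, the normalized compression distance, where a compressor's code length stands in for Kolmogorov complexity. The sketch below uses gzip as a placeholder compressor; the paper instead uses the coding length implied by a GPT model's likelihoods:

```python
import gzip

def code_length(x: bytes) -> int:
    """Compressed size in bytes; a GPT-based compressor would return the
    model's coding length (negative log-likelihood) instead."""
    return len(gzip.compress(x))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts."""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = code_length(bx), code_length(by), code_length(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

    Few-shot classification then reduces to assigning each test text the label of its nearest labeled example under this distance.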

FusionPlanner: A Multi-task Motion Planner for Mining Trucks using Multi-sensor Fusion Method

  • paper_url: http://arxiv.org/abs/2308.06931
  • repo_url: None
  • paper_authors: Siyu Teng, Luxi Li, Yuchen Li, Xuemin Hu, Lingxi Li, Yunfeng Ai, Long Chen
  • for: Proposes a comprehensive paradigm for unmanned transportation in open-pit mines, including a simulation platform, a testing benchmark, and a trustworthy and robust motion planner.
  • methods: Proposes FusionPlanner, a multi-task motion planning algorithm that uses a multi-sensor fusion method to adapt both lateral and longitudinal control tasks for unmanned transportation.
  • results: FusionPlanner is tested by MiningNav in the Parallel Mining Simulator (PMS); the empirical results demonstrate a significant reduction in the number of collisions and takeovers.
    Abstract In recent years, significant achievements have been made in motion planning for intelligent vehicles. However, as a typical unstructured environment, open-pit mining attracts limited attention due to its complex operational conditions and adverse environmental factors. A comprehensive paradigm for unmanned transportation in open-pit mines is proposed in this research, including a simulation platform, a testing benchmark, and a trustworthy and robust motion planner. Firstly, we propose a multi-task motion planning algorithm, called FusionPlanner, for autonomous mining trucks by the Multi-sensor fusion method to adapt both lateral and longitudinal control tasks for unmanned transportation. Then, we develop a novel benchmark called MiningNav, which offers three validation approaches to evaluate the trustworthiness and robustness of well-trained algorithms in transportation roads of open-pit mines. Finally, we introduce the Parallel Mining Simulator (PMS), a new high-fidelity simulator specifically designed for open-pit mining scenarios. PMS enables the users to manage and control open-pit mine transportation from both the single-truck control and multi-truck scheduling perspectives. The performance of FusionPlanner is tested by MiningNav in PMS, and the empirical results demonstrate a significant reduction in the number of collisions and takeovers of our planner. We anticipate our unmanned transportation paradigm will bring mining trucks one step closer to trustworthiness and robustness in continuous round-the-clock unmanned transportation.

FedEdge AI-TC: A Semi-supervised Traffic Classification Method based on Trusted Federated Deep Learning for Mobile Edge Computing

  • paper_url: http://arxiv.org/abs/2308.06924
  • repo_url: None
  • paper_authors: Pan Wang, Zeyi Li, Mengyi Fu, Zixuan Wang, Ze Zhang, MinYao Liu
  • for: Proposes a trusted network traffic classification (TC) framework based on Federated Learning (FL) to improve TC performance in 5G Customer Premise Equipment (CPE).
  • methods: Applies machine learning and deep learning to TC, proposing a semi-supervised TC algorithm based on a Variational Auto-Encoder (VAE) and a convolutional neural network (CNN) to reduce data dependency while maintaining accuracy, plus XAI-Pruning, an AI model compression method combined with model interpretability, for lightweight deployment.
  • results: Experimental evaluation shows FedEdge AI-TC outperforms the benchmarks in accuracy and efficient TC performance while protecting user privacy and model credibility, offering a dependable and transparent solution that enhances service quality and security.
    Abstract As a typical entity of MEC (Mobile Edge Computing), 5G CPE (Customer Premise Equipment)/HGU (Home Gateway Unit) has proven to be a promising alternative to traditional Smart Home Gateway. Network TC (Traffic Classification) is a vital service quality assurance and security management method for communication networks, which has become a crucial functional entity in 5G CPE/HGU. In recent years, many researchers have applied Machine Learning or Deep Learning (DL) to TC, namely AI-TC, to improve its performance. However, AI-TC faces challenges, including data dependency, resource-intensive traffic labeling, and user privacy concerns. The limited computing resources of 5G CPE further complicate efficient classification. Moreover, the "black box" nature of AI-TC models raises transparency and credibility issues. The paper proposes the FedEdge AI-TC framework, leveraging Federated Learning (FL) for reliable Network TC in 5G CPE. FL ensures privacy by employing local training, model parameter iteration, and centralized training. A semi-supervised TC algorithm based on Variational Auto-Encoder (VAE) and convolutional neural network (CNN) reduces data dependency while maintaining accuracy. To optimize model light-weight deployment, the paper introduces XAI-Pruning, an AI model compression method combined with DL model interpretability. Experimental evaluation demonstrates FedEdge AI-TC's superiority over benchmarks in terms of accuracy and efficient TC performance. The framework enhances user privacy and model credibility, offering a comprehensive solution for dependable and transparent Network TC in 5G CPE, thus enhancing service quality and security.

Probabilistic contingent planning based on HTN for high-quality plans

  • paper_url: http://arxiv.org/abs/2308.06922
  • repo_url: None
  • paper_authors: Peng Zhao
  • For: Planning in partially observable environments, where traditional deterministic planning methods are not practical.
  • Methods: Proposes a probabilistic contingent Hierarchical Task Network (HTN) planner called High-Quality Contingent Planner (HQCP) that extends HTN planning formalisms to partial observability and evaluates plans based on cost.
  • Results: Explores a novel heuristic for high-quality plans, develops an integrated planning algorithm, and empirically verifies the planner's effectiveness and efficiency both in probabilistic contingent planning and in obtaining high-quality plans.
    Abstract Deterministic planning assumes that the planning evolves along a fully predictable path, and therefore it loses the practical value in most real projections. A more realistic view is that planning ought to take into consideration partial observability beforehand and aim for a more flexible and robust solution. What is more significant, it is inevitable that the quality of plan varies dramatically in the partially observable environment. In this paper we propose a probabilistic contingent Hierarchical Task Network (HTN) planner, named High-Quality Contingent Planner (HQCP), to generate high-quality plans in the partially observable environment. The formalisms in HTN planning are extended into partial observability and are evaluated regarding the cost. Next, we explore a novel heuristic for high-quality plans and develop the integrated planning algorithm. Finally, an empirical study verifies the effectiveness and efficiency of the planner both in probabilistic contingent planning and for obtaining high-quality plans.

Chatbots in Drug Discovery: A Case Study on Anti-Cocaine Addiction Drug Development with ChatGPT

  • paper_url: http://arxiv.org/abs/2308.06920
  • repo_url: None
  • paper_authors: Rui Wang, Hongsong Feng, Guo-Wei Wei
  • for: Developing anti-cocaine-addiction drugs with GPT-4 serving as a virtual guide that offers researchers strategic and methodological insights for generative drug-candidate models.
  • methods: Employs the GPT-4 language model chatbot as a virtual guide throughout the drug discovery process, steering researchers toward innovative methodologies and productive paths.
  • results: The study finds that this AI-researcher partnership helps generate optimal drug-like molecules with desired properties and can improve the efficiency of drug development.
    Abstract The birth of ChatGPT, a cutting-edge language model chatbot developed by OpenAI, ushered in a new era in AI, and this paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.

A Novel Enhanced Move Recognition Algorithm Based on Pre-trained Models with Positional Embeddings

  • paper_url: http://arxiv.org/abs/2308.10822
  • repo_url: None
  • paper_authors: Hao Wen, Jie Wang, Xiaodong Qiao
  • for: Improving move recognition in the abstracts of Chinese scientific and technological papers.
  • methods: Proposes an enhanced move recognition algorithm that combines an improved pre-trained model with a gated network with attention mechanism, incorporating word positional information (via the EP-ERNIE$\_$AT-GRU framework) for deep semantic learning and targeted feature extraction.
  • results: The algorithm achieves 13.37% higher accuracy on the split dataset than on the original dataset and a 7.55% improvement in accuracy over the basic comparison model.
    Abstract The recognition of abstracts is crucial for effectively locating the content and clarifying the article. Existing move recognition algorithms lack the ability to learn word position information to obtain contextual semantics. This paper proposes a novel enhanced move recognition algorithm with an improved pre-trained model and a gated network with attention mechanism for unstructured abstracts of Chinese scientific and technological papers. The proposed algorithm first performs summary data segmentation and vocabulary training. The EP-ERNIE$\_$AT-GRU framework is leveraged to incorporate word positional information, facilitating deep semantic learning and targeted feature extraction. Experimental results demonstrate that the proposed algorithm achieves 13.37$\%$ higher accuracy on the split dataset than on the original dataset and a 7.55$\%$ improvement in accuracy over the basic comparison model.

Hierarchy Flow For High-Fidelity Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.06909
  • repo_url: https://github.com/weichenfan/hierarchyflow
  • paper_authors: Weichen Fan, Jinghuan Chen, Ziwei Liu
  • for: Improving content preservation in image-to-image translation.
  • methods: Proposes Hierarchy Flow, a novel flow-based model built on hierarchical coupling for reversible feature transformation and multi-scale modeling, together with a dedicated aligned-style loss for a better trade-off between content preservation and stylization.
  • results: Achieves state-of-the-art performance across a wide range of image-to-image translation benchmarks, with convincing advantages in both strong- and normal-fidelity tasks.
    Abstract Image-to-image (I2I) translation comprises a wide spectrum of tasks. Here we divide this problem into three levels: strong-fidelity translation, normal-fidelity translation, and weak-fidelity translation, indicating the extent to which the content of the original image is preserved. Although existing methods achieve good performance in weak-fidelity translation, they fail to fully preserve the content in both strong- and normal-fidelity tasks, e.g. sim2real, style transfer and low-level vision. In this work, we propose Hierarchy Flow, a novel flow-based model to achieve better content preservation during translation. Specifically, 1) we first unveil the drawbacks of standard flow-based models when applied to I2I translation. 2) Next, we propose a new design, namely hierarchical coupling for reversible feature transformation and multi-scale modeling, to constitute Hierarchy Flow. 3) Finally, we present a dedicated aligned-style loss for a better trade-off between content preservation and stylization during translation. Extensive experiments on a wide range of I2I translation benchmarks demonstrate that our approach achieves state-of-the-art performance, with convincing advantages in both strong- and normal-fidelity tasks. Code and models will be at https://github.com/WeichenFan/HierarchyFlow.

Generative Interpretation

  • paper_url: http://arxiv.org/abs/2308.06907
  • repo_url: https://github.com/yonathanarbel/generativeinterpretation
  • paper_authors: Yonathan A. Arbel, David Hoffman
  • for: Proposes generative interpretation, a new approach to estimating contractual meaning using large language models.
  • methods: Applies large language models in grounded case studies, drawing on well-known contract opinions and the actual agreements they adjudicated.
  • results: Shows the models can help factfinders ascertain ordinary meaning in context, quantify ambiguity, and fill gaps in parties' agreements, and can also calculate the probative value of individual pieces of extrinsic evidence.
    Abstract We introduce generative interpretation, a new approach to estimating contractual meaning using large language models. As AI triumphalism is the order of the day, we proceed by way of grounded case studies, each illustrating the capabilities of these novel tools in distinct ways. Taking well-known contracts opinions, and sourcing the actual agreements that they adjudicated, we show that AI models can help factfinders ascertain ordinary meaning in context, quantify ambiguity, and fill gaps in parties' agreements. We also illustrate how models can calculate the probative value of individual pieces of extrinsic evidence. After offering best practices for the use of these models given their limitations, we consider their implications for judicial practice and contract theory. Using LLMs permits courts to estimate what the parties intended cheaply and accurately, and as such generative interpretation unsettles the current interpretative stalemate. Their use responds to efficiency-minded textualists and justice-oriented contextualists, who argue about whether parties will prefer cost and certainty or accuracy and fairness. Parties--and courts--would prefer a middle path, in which adjudicators strive to predict what the contract really meant, admitting just enough context to approximate reality while avoiding unguided and biased assimilation of evidence. As generative interpretation offers this possibility, we argue it can become the new workhorse of contractual interpretation.

The Michigan Robotics Undergraduate Curriculum: Defining the Discipline of Robotics for Equity and Excellence

  • paper_url: http://arxiv.org/abs/2308.06905
  • repo_url: None
  • paper_authors: Odest Chadwicke Jenkins, Jessy Grizzle, Ella Atkins, Leia Stirling, Elliott Rouse, Mark Guzdial, Damen Provost, Kimberly Mann, Joanna Millunchick
  • for: Proposes and establishes a new undergraduate program in Robotics at the University of Michigan, with a focus on equity and excellence.
  • methods: Designs an adaptable curriculum accessible through a diversity of student pathways, including partnerships with Historically Black Colleges and Universities.
  • results: The program was highly successful in its first academic year: more than 100 students declared Robotics as their major, the first two graduates completed the major, and enrollments in Robotics classes soared.
    Abstract The Robotics Major at the University of Michigan was successfully launched in the 2022-23 academic year as an innovative step forward to better serve students, our communities, and our society. Building on our guiding principle of "Robotics with Respect" and our larger Robotics Pathways model, the Michigan Robotics Major was designed to define robotics as a true academic discipline with both equity and excellence as our highest priorities. Understanding that talent is equally distributed but opportunity is not, the Michigan Robotics Major has embraced an adaptable curriculum that is accessible through a diversity of student pathways and enables successful and sustained career-long participation in robotics, AI, and automation professions. The results after our planning efforts (2019-22) and first academic year (2022-23) have been highly encouraging: more than 100 students declared Robotics as their major, completion of the Robotics major by our first two graduates, soaring enrollments in our Robotics classes, thriving partnerships with Historically Black Colleges and Universities. This document provides our original curricular proposal for the Robotics Undergraduate Program at the University of Michigan, submitted to the Michigan Association of State Universities in April 2022 and approved in June 2022. The dissemination of our program design is in the spirit of continued growth for higher education towards realizing equity and excellence. The most recent version of this document is also available on Google Docs through this link: https://ocj.me/robotics_major

Robustified ANNs Reveal Wormholes Between Human Category Percepts

  • paper_url: http://arxiv.org/abs/2308.06887
  • repo_url: https://github.com/ggaziv/wormholes
  • paper_authors: Guy Gaziv, Michael J. Lee, James J. DiCarlo
  • for: Examines the sensitivity of artificial neural networks (ANNs) to tiny adversarial image perturbations and what it implies about ANNs as models of human visual perception.
  • methods: Generates small-norm image perturbations with standard and robustified ANN models and measures the stability of human object category percepts under them.
  • results: Human percepts are highly stable under perturbations generated by standard ANNs, yet robustified ANNs reliably discover low-norm perturbations that strongly disrupt human percepts. The findings suggest "wormholes" in image space: from an arbitrary starting point, nearby perturbations can carry a subject from the current category percept into a semantically very different one.
    Abstract The visual object category reports of artificial neural networks (ANNs) are notoriously sensitive to tiny, adversarial image perturbations. Because human category reports (aka human percepts) are thought to be insensitive to those same small-norm perturbations -- and locally stable in general -- this argues that ANNs are incomplete scientific models of human visual perception. Consistent with this, we show that when small-norm image perturbations are generated by standard ANN models, human object category percepts are indeed highly stable. However, in this very same "human-presumed-stable" regime, we find that robustified ANNs reliably discover low-norm image perturbations that strongly disrupt human percepts. These previously undetectable human perceptual disruptions are massive in amplitude, approaching the same level of sensitivity seen in robustified ANNs. Further, we show that robustified ANNs support precise perceptual state interventions: they guide the construction of low-norm image perturbations that strongly alter human category percepts toward specific prescribed percepts. These observations suggest that for arbitrary starting points in image space, there exists a set of nearby "wormholes", each leading the subject from their current category perceptual state into a semantically very different state. Moreover, contemporary ANN models of biological visual processing are now accurate enough to consistently guide us to those portals.

Optimizing Offensive Gameplan in the National Basketball Association with Machine Learning

  • paper_url: http://arxiv.org/abs/2308.06851
  • repo_url: None
  • paper_authors: Eamon Mukhopadhyay
  • for: This paper aims to verify the effectiveness of the Offensive Rating (ORTG) metric in predicting different NBA playtypes.
  • methods: The authors use both linear regression and neural network regression models to evaluate the correlation between ORTG and different playtypes.
  • results: The authors find that both models have a strong correlation with the playtypes, but the neural network model performs slightly better than the linear regression model. They also use the accuracy of the models to optimize the model's output with test examples, demonstrating the combination of features that best achieves a highly functioning offense.
    Abstract Throughout the analytical revolution that has occurred in the NBA, the development of specific metrics and formulas has given teams, coaches, and players a new way to see the game. However - the question arises - how can we verify any metrics? One method would simply be eyeball approximation (trying out many different gameplans) and/or trial and error - an estimation-based and costly approach. Another approach is to try to model already existing metrics with a unique set of features using machine learning techniques. The key to this approach is that with these features that are selected, we can try to gauge the effectiveness of these features combined, rather than using individual analysis in simple metric evaluation. If we have an accurate model, it can particularly help us determine the specifics of gameplan execution. In this paper, the statistic ORTG (Offensive Rating, developed by Dean Oliver) was found to have a correlation with different NBA playtypes using both a linear regression model and a neural network regression model, although ultimately, a neural network worked slightly better than linear regression. Using the accuracy of the models as a justification, the next step was to optimize the output of the model with test examples, which would demonstrate the combination of features to best achieve a highly functioning offense.
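    Code sketch: A minimal sketch of the regression setup the paper describes: fit ORTG as a function of playtype features, then inspect held-out accuracy. The data below is synthetic placeholder data and the feature set is assumed, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 8))              # placeholder playtype-frequency features
y = 100 + X @ rng.random(8) * 10      # placeholder ORTG targets

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```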

A Parallel Ensemble of Metaheuristic Solvers for the Traveling Salesman Problem

  • paper_url: http://arxiv.org/abs/2308.07347
  • repo_url: None
  • paper_authors: Swetha Varadarajan, Darrell Whitley
  • for: Solving the traveling salesman problem (TSP), a widely studied NP-hard problem.
  • methods: Combines the Lin-Kernighan-Helsgaun (LKH) heuristic and the Edge Assembly crossover (EAX) algorithm, along with hybrids of EAX and the Mixing Genetic Algorithm (MGA), in a parallel ensemble (a portfolio sketch follows the abstract).
  • results: The ensemble outperforms the individual solvers, and the ensemble of the hybrid versions beats the state-of-the-art solvers on problems larger than 10,000 cities; the setup is an efficient way to exploit abundant compute resources.
    Abstract The travelling salesman problem (TSP) is one of the well-studied NP-hard problems in the literature. The state-of-the-art inexact TSP solvers are the Lin-Kernighan-Helsgaun (LKH) heuristic and Edge Assembly crossover (EAX). A recent study suggests that EAX with restart mechanisms performs well on a wide range of TSP instances. However, this study is limited to 2,000-city problems. We study problems ranging from 2,000 to 85,900 cities. We see that the performance of the solver varies with the type of the problem. However, combining these solvers in an ensemble setup, we are able to outperform the individual solvers' performance. We see the ensemble setup as an efficient way to make use of the abundance of compute resources. In addition to EAX and LKH, we use several versions of the hybrid of EAX and the Mixing Genetic Algorithm (MGA). A hybrid of MGA and EAX is known to solve some hard problems. We see that the ensemble of the hybrid version outperforms the state-of-the-art solvers on problems larger than 10,000 cities.
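    Code sketch: A bare-bones portfolio runner in the spirit of the ensemble described above: launch each solver on the instance in parallel and keep the shortest tour any of them finds. The `solvers` entries are assumed to be picklable Python wrappers around external LKH/EAX/EAX+MGA binaries, each returning a `(tour, length)` pair:

```python
from multiprocessing import Pool

def _run(args):
    solver, instance = args
    return solver(instance)          # each solver returns (tour, length)

def ensemble_solve(instance, solvers, workers: int = 4):
    """Run a portfolio of TSP solvers in parallel and return the best tour."""
    with Pool(workers) as pool:
        results = pool.map(_run, [(s, instance) for s in solvers])
    return min(results, key=lambda r: r[1])
```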

Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine

  • paper_url: http://arxiv.org/abs/2308.06834
  • repo_url: None
  • paper_authors: Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath Rangan, Jonathan H Chen
  • for: Investigating whether large language models can perform clinical reasoning to accurately form a diagnosis, and by what prompting methods.
  • methods: Develops novel diagnostic reasoning prompts and uses them to evaluate whether GPT-4 can mimic the common clinical reasoning processes of clinicians (an illustrative prompt appears after the abstract).
  • results: GPT-4 can be prompted to imitate clinicians' reasoning processes without sacrificing diagnostic accuracy, suggesting that such prompts can make LLM diagnoses interpretable and evaluable by physicians.
    Abstract One of the major barriers to using large language models (LLMs) in medicine is the perception they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop novel diagnostic reasoning prompts to study whether LLMs can perform clinical reasoning to accurately form a diagnosis. We find that GPT4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can use clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether LLMs can be trusted for patient care. Novel prompting methods have the potential to expose the black box of LLMs, bringing them one step closer to safe and effective use in medicine.
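    Code sketch: An illustrative prompt in the spirit of the diagnostic reasoning prompts described above; the exact wording is an assumption, not the authors' prompt:

```python
DIAGNOSTIC_REASONING_PROMPT = """\
You are an experienced physician. For the case below, reason as a clinician:
1. Summarize the salient findings.
2. Form a differential diagnosis.
3. Weigh the evidence for and against each candidate diagnosis.
4. State the single most likely diagnosis and your rationale.

Case: {case_description}
"""

def build_prompt(case_description: str) -> str:
    # Hypothetical helper; fill the template with a de-identified case.
    return DIAGNOSTIC_REASONING_PROMPT.format(case_description=case_description)
```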

InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

  • paper_url: http://arxiv.org/abs/2308.08500
  • repo_url: None
  • paper_authors: Kabir Nagrecha, Lingyi Liu, Pablo Delgado, Prasanna Padmanabhan
  • for: Studies the online data-ingestion bottlenecks and pipeline challenges in training deep learning recommendation models (DLRMs).
  • methods: Employs a reinforcement learning (RL) agent that learns how to distribute the CPU resources of a trainer machine across the DLRM data pipeline, parallelizing data loading more effectively and improving throughput (a toy policy sketch follows the abstract).
  • results: InTune builds an optimized data-pipeline configuration within minutes and integrates easily into existing training workflows. By raising online data ingestion rates it reduces idle time in model execution; on a real-world cluster it improves ingestion throughput by up to 2.29x over state-of-the-art pipeline optimizers while also improving CPU and GPU utilization.
    Abstract Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time- saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling either yields sub-optimal performance, frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU & GPU utilization.
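    Code sketch: A toy epsilon-greedy stand-in for the kind of policy such an agent might learn: each action reallocates CPU workers between pipeline stages, and the reward is the resulting throughput change. Everything here, including the action encoding, is an illustrative assumption:

```python
import random

def choose_reallocation(q_values: dict, actions: list, epsilon: float = 0.1):
    """Pick a CPU reallocation action, e.g. ('decode', 'shuffle') meaning
    'move one worker from the decode stage to the shuffle stage'."""
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))      # exploit

def update_q(q_values: dict, action, reward: float, lr: float = 0.5):
    """Reward = observed throughput delta after applying the action."""
    q_values[action] = (1 - lr) * q_values.get(action, 0.0) + lr * reward
```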

An Ensemble Approach to Question Classification: Integrating Electra Transformer, GloVe, and LSTM

  • paper_url: http://arxiv.org/abs/2308.06828
  • repo_url: None
  • paper_authors: Sanad Aburass, Osama Dorgham
  • for: Proposes a novel ensemble approach to the question classification task.
  • methods: Combines three state-of-the-art components, the Electra transformer, GloVe word embeddings, and an LSTM, into one ensemble model (a soft-voting sketch follows the abstract).
  • results: Trained and evaluated on the TREC dataset, the ensemble outperforms BERT, RoBERTa, and DistilBERT on all evaluation metrics, reaching an accuracy of 0.8 on the test set.
    Abstract This paper introduces a novel ensemble approach for question classification using state-of-the-art models -- Electra, GloVe, and LSTM. The proposed model is trained and evaluated on the TREC dataset, a well-established benchmark for question classification tasks. The ensemble model combines the strengths of Electra, a transformer-based model for language understanding, GloVe, a global vectors for word representation, and LSTM, a recurrent neural network variant, providing a robust and efficient solution for question classification. Extensive experiments were carried out to compare the performance of the proposed ensemble approach with other cutting-edge models, such as BERT, RoBERTa, and DistilBERT. Our results demonstrate that the ensemble model outperforms these models across all evaluation metrics, achieving an accuracy of 0.8 on the test set. These findings underscore the effectiveness of the ensemble approach in enhancing the performance of question classification tasks, and invite further exploration of ensemble methods in natural language processing.
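    Code sketch: One common way to combine such branches is soft voting over their class-probability outputs; the paper does not spell out its exact fusion rule, so the weights and interfaces below are assumptions:

```python
import numpy as np

def soft_vote(prob_electra: np.ndarray, prob_glove_lstm: np.ndarray,
              weights=(0.5, 0.5)) -> np.ndarray:
    """Weighted average of per-class probabilities from the Electra branch
    and the GloVe+LSTM branch; argmax gives the predicted question class."""
    probs = weights[0] * prob_electra + weights[1] * prob_glove_lstm
    return probs.argmax(axis=1)
```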

Reinforcement Graph Clustering with Unknown Cluster Number

  • paper_url: http://arxiv.org/abs/2308.06827
  • repo_url: https://github.com/yueliu1999/awesome-deep-graph-clustering
  • paper_authors: Yue Liu, Ke Liang, Jun Xia, Xihong Yang, Sihang Zhou, Meng Liu, Xinwang Liu, Stan Z. Li
  • for: Performing deep graph clustering with neural networks when the cluster number is not known in advance.
  • methods: Learns node representations with a contrastive pretext task, then uses a reinforcement learning mechanism over node and cluster states to evaluate candidate cluster numbers, greedily choosing the best one and shaping training with a clustering-oriented reward (a simplified selection loop follows the abstract).
  • results: Extensive experiments demonstrate the method's effectiveness and efficiency without requiring a predefined cluster number.
    Abstract Deep graph clustering, which aims to group nodes into disjoint clusters by neural networks in an unsupervised manner, has attracted great attention in recent years. Although the performance has been largely improved, the excellent performance of the existing methods heavily relies on an accurately predefined cluster number, which is not always available in the real-world scenario. To enable the deep graph clustering algorithms to work without the guidance of the predefined cluster number, we propose a new deep graph clustering method termed Reinforcement Graph Clustering (RGC). In our proposed method, cluster number determination and unsupervised representation learning are unified into a uniform framework by the reinforcement learning mechanism. Concretely, the discriminative node representations are first learned with the contrastive pretext task. Then, to capture the clustering state accurately with both local and global information in the graph, both node and cluster states are considered. Subsequently, at each state, the qualities of different cluster numbers are evaluated by the quality network, and the greedy action is executed to determine the cluster number. In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method. The source code of RGC is shared at https://github.com/yueliu1999/RGC and a collection (papers, codes and, datasets) of deep graph clustering is shared at https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering on Github.
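    Code sketch: A simplified version of the greedy cluster-number selection, with the silhouette score standing in for RGC's learned quality network:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_cluster_number(embeddings: np.ndarray, k_max: int = 10) -> int:
    """Evaluate candidate cluster counts on learned node embeddings and
    greedily keep the one with the highest quality estimate."""
    best_k, best_q = 2, -1.0
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        quality = silhouette_score(embeddings, labels)
        if quality > best_q:
            best_k, best_q = k, quality
    return best_k
```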
    摘要 深度图聚类旨在通过神经网络以无监督方式将节点划分为互不相交的簇,近年来受到广泛关注。尽管性能已大幅提升,但现有方法的优异表现严重依赖于预先准确给定的聚类数量,而这在真实场景中并不总是可得。为使深度图聚类算法能够在没有预定义聚类数量的情况下工作,我们提出了一种新的深度图聚类方法,称为强化图聚类(RGC)。该方法通过强化学习机制,将聚类数量确定与无监督表示学习统一到同一框架中。具体而言,首先通过对比预训练任务学习有判别性的节点表示;然后,为了结合图中的局部与全局信息来准确刻画聚类状态,同时考虑节点状态与簇状态;随后,在每个状态下由质量网络评估不同聚类数量的优劣,并执行贪心动作确定聚类数量;为了提供反馈,我们提出了面向聚类的奖励函数,以增强同簇内聚并分离不同簇。大量实验验证了该方法的有效性与效率。RGC 的源代码发布于 https://github.com/yueliu1999/RGC,深度图聚类的论文、代码与数据集合集发布于 https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering。
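
The greedy cluster-number selection at each state can be sketched compactly. In the toy below, silhouette_score stands in for RGC's learned quality network and KMeans stands in for the clustering step; both substitutions are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_cluster_number(Z, candidates=range(2, 11), seed=0):
    """Evaluate each candidate cluster number on node embeddings Z and
    greedily pick the best; silhouette_score is a stand-in for the
    learned quality network of RGC."""
    best_k, best_q = None, -np.inf
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
        q = silhouette_score(Z, labels)   # quality of this clustering state
        if q > best_q:
            best_k, best_q = k, q
    return best_k, best_q

rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(c, 0.3, size=(50, 8)) for c in (0.0, 3.0, 6.0)])
print(greedy_cluster_number(Z))  # expected: k = 3 on this toy embedding
```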

Approximate and Weighted Data Reconstruction Attack in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.06822
  • repo_url: None
  • paper_authors: Ziqi Wang, Yongcun Song, Enrique Zuazua
  • for: 这个论文的目的是提出一种可以攻击 Federated Learning(FL)中最常使用的水平 Federated Averaging(FedAvg)scenario的方法,并且提高攻击效率和预测品质。
  • methods: 本文使用了一种 interpolating-based approximation method,将客户端的本地训练过程中的模型更新转换为可以攻击的形式,然后设计了一个层别加权损失函数来提高攻击的数据质量。
  • results: 实验结果显示,提出的approximate和weighted攻击(AWA)方法在攻击效率和重建质量上均优于其他现有方法,特别是在图像数据重建任务上。
    Abstract Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserved by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data based on the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the data quality of reconstruction. We assign different weights to model updates in different layers concerning the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over the other state-of-the-art methods, as demonstrated by the substantial improvement in different evaluation metrics for image data reconstructions.
    摘要 联邦学习(FL)是一种分布式学习范式,允许多个客户端在不分享私有数据的情况下共同构建一个机器学习模型。虽然 FL 在设计上被视为保护隐私,但近来的数据重建攻击表明,攻击者可以根据 FL 中共享的参数恢复客户端的训练数据。然而,大多数现有方法无法攻击最常用的水平 Federated Averaging(FedAvg)场景:在该场景下,客户端在多步本地训练后才共享模型参数。为解决这一问题,我们提出了一种基于插值的近似方法,通过生成客户端本地训练过程中的中间模型更新,使攻击 FedAvg 场景成为可能。随后,我们设计了层级加权损失函数以提高重建数据的质量:根据神经网络结构,为不同层的模型更新分配不同权重,并通过贝叶斯优化调节这些权重。最后,实验结果验证了我们提出的近似加权攻击(AWA)方法优于其他最先进方法,在图像数据重建的各项评估指标上均有显著提升。
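
The two ingredients of AWA, approximating the unseen intermediate updates and weighting the loss per layer, can be illustrated on toy weights. A minimal sketch; the linear interpolation scheme and the layer weights below are simplifications and hypothetical values, not the paper's exact construction:

```python
import numpy as np

def interpolate_updates(w_start, w_end, num_steps):
    """Approximate the client's unseen intermediate models during a FedAvg
    round by linear interpolation between the weights it held before and
    after local training (a simplified version of the paper's idea)."""
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    return [{k: (1 - a) * w_start[k] + a * w_end[k] for k in w_start}
            for a in alphas]

def layerwise_weighted_loss(grads_true, grads_dummy, layer_weights):
    """Layer-wise weighted gradient-matching loss; in AWA the weights
    would be tuned by Bayesian optimization (hypothetical values here)."""
    return sum(w * np.sum((grads_true[k] - grads_dummy[k]) ** 2)
               for k, w in layer_weights.items())

w0 = {"conv1": np.zeros((3, 3)), "fc": np.zeros(4)}
w1 = {"conv1": np.ones((3, 3)), "fc": np.full(4, 2.0)}
mids = interpolate_updates(w0, w1, num_steps=4)
print(mids[2]["fc"])  # halfway point: [1. 1. 1. 1.]
```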

Ground Manipulator Primitive Tasks to Executable Actions using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06810
  • repo_url: None
  • paper_authors: Yue Cao, C. S. George Lee
  • for: 这篇论文旨在解决机器人系统中高级任务与低级动作之间的转换问题。
  • methods: 论文使用大型语言模型(LLM)将机械臂基元任务接地(ground)到机器人低级动作。
  • results: 论文基于任务框架形式化设计了程序式提示,使 LLM 能够生成用于混合控制的位置/力设定点,并对多个最新的 LLM 进行了评估。
    Abstract Layered architectures have been widely used in robot systems. The majority of them implement planning and execution functions in separate layers. However, there still lacks a straightforward way to transit high-level tasks in the planning layer to the low-level motor commands in the execution layer. In order to tackle this challenge, we propose a novel approach to ground the manipulator primitive tasks to robot low-level actions using large language models (LLMs). We designed a program-like prompt based on the task frame formalism. In this way, we enable LLMs to generate position/force set-points for hybrid control. Evaluations over several state-of-the-art LLMs are provided.
    摘要 层次架构在机器人系统中得到广泛应用,其中大多数将规划与执行功能置于不同层次。然而,如何将规划层的高级任务直接转换为执行层的低级电机指令仍缺乏直接的途径。为了解决这一挑战,我们提出了一种利用大型语言模型(LLM)将机械臂基元任务接地到机器人低级动作的新方法。我们基于任务框架形式化设计了程序式提示,使 LLM 能够为混合控制生成位置/力设定点,并对多个最新的 LLM 进行了评估。
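
The paper's program-like, task-frame-based prompt is not reproduced in the abstract; the sketch below shows what such a prompt could look like. The field names, the example task, and the output format are assumptions, not the authors' exact template:

```python
# Illustrative prompt in the spirit of the task-frame formalism; all
# labels and the sample task below are assumptions for demonstration.
TASK_FRAME_PROMPT = """You are controlling a robot manipulator.
Describe the primitive task as a task frame with hybrid control set-points.

Task: wipe the table surface.
Output format (one line per axis x, y, z):
  axis, mode (position|force), set_point, unit

Example:
  x, position, 0.30, m
  y, position, 0.10, m
  z, force, 5.0, N

Now produce the set-points for: polish a flat metal plate.
"""

def build_messages(prompt: str):
    """Wrap the prompt for a chat-style LLM API (provider-agnostic)."""
    return [{"role": "user", "content": prompt}]

print(build_messages(TASK_FRAME_PROMPT)[0]["content"][:60])
```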

Neural Networks for Programming Quantum Annealers

  • paper_url: http://arxiv.org/abs/2308.06807
  • repo_url: https://github.com/boschsamuel/nnforprogrammingquantumannealers
  • paper_authors: Samuel Bosch, Bobak Kiani, Rui Yang, Adrian Lupascu, Seth Lloyd
  • for: 这篇论文探讨量子机器学习能否推动人工智能的发展,特别是分类问题的求解。
  • methods: 论文将一个功能完整的经典神经网络与一个小型量子退火器相连,由神经网络编程量子退火器的控制参数来完成分类任务。
  • results: 研究发现,添加小型量子退火器并未带来明显的性能提升,仅使用常规的非线性经典神经网络即可同样好地解决这些分类问题。
    Abstract Quantum machine learning has the potential to enable advances in artificial intelligence, such as solving problems intractable on classical computers. Some fundamental ideas behind quantum machine learning are similar to kernel methods in classical machine learning. Both process information by mapping it into high-dimensional vector spaces without explicitly calculating their numerical values. We explore a setup for performing classification on labeled classical datasets, consisting of a classical neural network connected to a quantum annealer. The neural network programs the quantum annealer's controls and thereby maps the annealer's initial states into new states in the Hilbert space. The neural network's parameters are optimized to maximize the distance of states corresponding to inputs from different classes and minimize the distance between quantum states corresponding to the same class. Recent literature showed that at least some of the "learning" is due to the quantum annealer, connecting a small linear network to a quantum annealer and using it to learn small and linearly inseparable datasets. In this study, we consider a similar but not quite the same case, where a classical fully-fledged neural network is connected with a small quantum annealer. In such a setting, the fully-fledged classical neural-network already has built-in nonlinearity and learning power, and can already handle the classification problem alone, we want to see whether an additional quantum layer could boost its performance. We simulate this system to learn several common datasets, including those for image and sound recognition. We conclude that adding a small quantum annealer does not provide a significant benefit over just using a regular (nonlinear) classical neural network.
    摘要 量子机器学习有潜力推动人工智能的发展,例如解决经典计算机难以处理的问题。量子机器学习的一些基本思想与经典机器学习中的核方法类似:二者都将信息映射到高维向量空间中,而无需显式计算其数值。我们研究了一种在带标注的经典数据集上进行分类的设置,由一个经典神经网络连接到一个量子退火器。神经网络编程量子退火器的控制参数,从而将退火器的初态映射到希尔伯特空间中的新状态。神经网络的参数经过优化,使不同类别输入对应的量子态之间距离最大化,同类输入对应的量子态之间距离最小化。已有文献表明,至少部分"学习"来自量子退火器本身:将一个小型线性网络连接到量子退火器,即可学习小规模且线性不可分的数据集。本研究考虑一个类似但不完全相同的情形:将一个功能完整的经典神经网络与小型量子退火器相连。在这种设置下,经典神经网络本身已具备非线性与学习能力,足以独立完成分类任务;我们想考察额外的量子层能否进一步提升性能。我们模拟该系统学习了若干常见数据集,包括图像与声音识别。结论是,与仅使用常规(非线性)经典神经网络相比,添加小型量子退火器并不能带来显著收益。

Py-Tetrad and RPy-Tetrad: A New Python Interface with R Support for Tetrad Causal Search

  • paper_url: http://arxiv.org/abs/2308.07346
  • repo_url: None
  • paper_authors: Joseph D. Ramsey, Bryan Andrews
  • for: 这个论文是为了提供一个新的Python和R接口来访问Tetrad项目的 causal 模型计算、搜索和估计。
  • methods: 这个论文使用了JPype和Reticulate两种新的接口方法来实现Python和R与Tetrad的交互,这些方法是直接解决现有的问题。
  • results: 该论文提供了一些简单的工具和一些工作示例,使用JPype和Reticulate来接口Python和R与Tetrad是直观的和易懂的。
    Abstract We give novel Python and R interfaces for the (Java) Tetrad project for causal modeling, search, and estimation. The Tetrad project is a mainstay in the literature, having been under consistent development for over 30 years. Some of its algorithms are now classics, like PC and FCI; others are recent developments. It is increasingly the case, however, that researchers need to access the underlying Java code from Python or R. Existing methods for doing this are inadequate. We provide new, up-to-date methods using the JPype Python-Java interface and the Reticulate Python-R interface, directly solving these issues. With the addition of some simple tools and the provision of working examples for both Python and R, using JPype and Reticulate to interface Python and R with Tetrad is straightforward and intuitive.
    摘要 我们为 Java Tetrad 项目提供了全新的 Python 和 R 接口,用于因果建模、搜索与估计。Tetrad 项目在文献中地位稳固,已持续开发超过 30 年;其中一些算法(如 PC 和 FCI)已成为经典,另一些则是近期成果。然而,研究人员越来越需要从 Python 或 R 访问其底层 Java 代码,而现有方法无法满足这一需求。我们基于 JPype 的 Python-Java 接口和 Reticulate 的 Python-R 接口提供了全新且与时俱进的方法,直接解决了这些问题。配合一些简单的工具以及 Python 和 R 的可运行示例,使用 JPype 和 Reticulate 将 Python、R 与 Tetrad 对接变得直观而简单。
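
To make the JPype route concrete, the sketch below starts a JVM and calls a Java class from Python. The jar name and the commented Tetrad class path are assumptions; only the java.util demo is guaranteed by the JPype API:

```python
# Minimal JPype sketch of the kind of bridge the paper describes. The jar
# path and the Tetrad class name are assumptions; consult the Tetrad
# distribution for the actual artifact and package names.
import jpype
import jpype.imports

if not jpype.isJVMStarted():
    jpype.startJVM(classpath=["tetrad-current.jar"])  # hypothetical jar name

# Any Java class on the classpath becomes reachable from Python:
ArrayList = jpype.JClass("java.util.ArrayList")
xs = ArrayList()
xs.add("PC")
xs.add("FCI")
print(list(xs))  # ['PC', 'FCI'] -- Java objects round-trip into Python

# A Tetrad search class would be loaded the same way, e.g. (name assumed):
# Pc = jpype.JClass("edu.cmu.tetrad.search.Pc")
```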

SAILOR: Structural Augmentation Based Tail Node Representation Learning

  • paper_url: http://arxiv.org/abs/2308.06801
  • repo_url: https://github.com/jie-re/sailor
  • paper_authors: Jie Liao, Jintang Li, Liang Chen, Bingzhe Wu, Yatao Bian, Zibin Zheng
  • for: 提高图神经网络(GNNs)对尾部节点的表示学习效果,增强 GNNs 在真实世界图数据上的表达能力。
  • methods: 提出了一种通用的基于结构增强的尾部节点表示学习框架 SAILOR,可以联合学习增强图结构并为尾部节点提取更有信息量的表示。
  • results: 在公共基准数据集上的大量实验表明,SAILOR 能显著改善尾部节点表示,并超越当前最优基线。
    Abstract Graph Neural Networks (GNNs) have achieved state-of-the-art performance in representation learning for graphs recently. However, the effectiveness of GNNs, which capitalize on the key operation of message propagation, highly depends on the quality of the topology structure. Most of the graphs in real-world scenarios follow a long-tailed distribution on their node degrees, that is, a vast majority of the nodes in the graph are tail nodes with only a few connected edges. GNNs produce inferior node representations for tail nodes since they lack structural information. In the pursuit of promoting the expressiveness of GNNs for tail nodes, we explore how the deficiency of structural information deteriorates the performance of tail nodes and propose a general Structural Augmentation based taIL nOde Representation learning framework, dubbed as SAILOR, which can jointly learn to augment the graph structure and extract more informative representations for tail nodes. Extensive experiments on public benchmark datasets demonstrate that SAILOR can significantly improve the tail node representations and outperform the state-of-the-art baselines.
    摘要 图神经网络(GNNs)近来在图表示学习中取得了最先进的性能。然而,GNNs 依赖消息传递这一关键操作,其效果高度取决于拓扑结构的质量。现实场景中的大多数图在节点度上服从长尾分布,即图中绝大多数节点是仅有少量连边的尾部节点。由于缺乏结构信息,GNNs 为尾部节点生成的表示质量较差。为提升 GNNs 对尾部节点的表达能力,我们分析了结构信息缺失如何损害尾部节点的性能,并提出了一个通用的基于结构增强的尾部节点表示学习框架 SAILOR,它能够联合学习增强图结构并为尾部节点提取更有信息量的表示。在公共基准数据集上的大量实验表明,SAILOR 能显著改善尾部节点表示,并超越当前最优基线。

cs.CL - 2023-08-14

Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI

  • paper_url: http://arxiv.org/abs/2308.07213
  • repo_url: None
  • paper_authors: Houjiang Liu, Anubrata Das, Alexander Boltz, Didi Zhou, Daisy Pinaroc, Matthew Lease, Min Kyung Lee
  • for: 本研究旨在通过与事实核查员共同设计 AI 工具,使其契合事实核查员的实践、价值观与需求,从而提升职业事实核查的效率与可扩展性。
  • methods: 研究采用"Matchmaking for AI"协同设计方法,让事实核查员、设计师和 NLP 研究人员共同探索哪些核查需求应由技术支持以及如何支持。
  • results: 与 22 名职业事实核查员的协同设计产生了 11 个新的设计想法,可帮助核查员更高效地完成信息搜索、处理与写作任务,主动应对未来的虚假信息,监测自身可能的偏见,并支持组织内部协作。
    Abstract A key challenge in professional fact-checking is its limited scalability in relation to the magnitude of false information. While many Natural Language Processing (NLP) tools have been proposed to enhance fact-checking efficiency and scalability, both academic research and fact-checking organizations report limited adoption of such tooling due to insufficient alignment with fact-checker practices, values, and needs. To address this gap, we investigate a co-design method, Matchmaking for AI, which facilitates fact-checkers, designers, and NLP researchers to collaboratively discover what fact-checker needs should be addressed by technology and how. Our co-design sessions with 22 professional fact-checkers yielded a set of 11 novel design ideas. They assist in information searching, processing, and writing tasks for efficient and personalized fact-checking; help fact-checkers proactively prepare for future misinformation; monitor their potential biases; and support internal organization collaboration. Our work offers implications for human-centered fact-checking research and practice and AI co-design research.
    摘要 一个主要挑战在职业事实核查是它的限定可扩展性,与假信息的规模相对。虽然许多自然语言处理(NLP)工具已经被提议用于增强事实核查效率和可扩展性,但是学术研究和事实核查组织都报告了这些工具的采用率很低,主要是因为技术不符合事实核查员的做法、价值和需求。为解决这个差距,我们调查了一种合作设计方法,即Matchmaking for AI,该方法使得事实核查员、设计师和NLP研究人员共同探索事实核查员需要技术解决的问题和如何解决。我们与22名职业事实核查员进行了合作设计会议,得到了11项新的设计想法。这些设计想法可以帮助事实核查员更有效地搜索、处理和撰写信息,以及帮助他们预先准备对未来的假信息进行核查;监测他们的潜在偏见;以及支持内部组织合作。我们的工作对人类中心的事实核查研究和实践以及AI合作研究具有启示性。

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

  • paper_url: http://arxiv.org/abs/2308.07201
  • repo_url: https://github.com/chanchimin/chateval
  • paper_authors: Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu
  • for: 这个论文的目的是提出一种多智能体评估方法,以取代人工评估,并使用多种语言模型合作来提高评估效果。
  • methods: 论文构建了名为 ChatEval 的多智能体评审团,让多个 LLM 通过多轮辩论协同讨论并评估不同模型生成回答的质量。
  • results: 实验结果表明,多智能体辩论式评估优于单智能体提示策略,其评估过程更接近人类评估。
    Abstract Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.
    摘要 文本评估历来充满挑战,往往需要大量人力与时间成本。随着大型语言模型(LLM)的出现,研究人员开始探索用 LLM 替代人工评估的可能性。尽管这些基于单智能体的方法展现出潜力,实验结果表明其与人类水平的评估质量之间仍有差距。考虑到人工评估的最佳实践通常由多名标注者协作完成,我们转向多智能体辩论框架,超越单智能体提示策略。多智能体方法让一组 LLM 与多个智能对手协同,发挥各自的能力与专长,从而更高效地处理复杂任务。在本文中,我们构建了名为 ChatEval 的多智能体评审团,自主讨论并评估不同模型在开放式问题及传统自然语言生成(NLG)任务上生成回答的质量。分析表明,ChatEval 不止于简单的文本打分,而是提供了一个模拟人类的评估过程,可给出可靠的评估结果。代码见 https://github.com/chanchimin/ChatEval。
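
A multi-agent debate loop of the kind described can be sketched in a few lines. The prompting, turn order, and scoring below are simplifications; agents are stubbed so the sketch runs without API access:

```python
from typing import Callable, List

def chat_debate(question: str, answer: str,
                agents: List[Callable[[str], str]], rounds: int = 2) -> List[str]:
    """One-by-one debate over the quality of `answer`: each agent sees the
    running transcript and appends its judgment. `agents` are hypothetical
    LLM callables (prompt -> reply); ChatEval's actual prompting and
    aggregation are richer than this sketch."""
    transcript: List[str] = []
    for r in range(rounds):
        for i, agent in enumerate(agents):
            prompt = (f"Question: {question}\nCandidate answer: {answer}\n"
                      "Debate so far:\n" + "\n".join(transcript) +
                      f"\nRound {r + 1}, you are referee {i + 1}. "
                      "Critique the answer and give a 1-10 score.")
            transcript.append(f"Referee {i + 1}: {agent(prompt)}")
    return transcript

# Stub agents so the sketch runs offline:
stub = lambda tag: (lambda p: f"({tag}) concise critique, score 7")
for line in chat_debate("What causes tides?", "The Moon's gravity.",
                        [stub("A"), stub("B")], rounds=1):
    print(line)
```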

Incorporating Annotator Uncertainty into Representations of Discourse Relations

  • paper_url: http://arxiv.org/abs/2308.07179
  • repo_url: None
  • paper_authors: S. Magalí López Cortez, Cassandra L. Jacobs
  • for: 本研究考察新手标注者在口语对话数据上标注话语关系时的不确定性。
  • methods: 研究以对话上下文(单个话轮、同一说话人内的话轮对、跨说话人的话轮对)预测标注信心分数;基于融合信心分数与对话上下文信息的共现统计计算话语关系的分布式表示,并进行层次聚类分析。
  • results: 研究发现,用信心与对话上下文信息对话语关系表示加权,能够一致地建模标注者对话语关系标签的不确定性。
    Abstract Annotation of discourse relations is a known difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty on the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators' uncertainty about discourse relation labels.
    摘要 话语关系标注是公认的困难任务,对非专家标注者尤其如此。本文研究新手标注者在口语对话数据上标注话语关系时的不确定性。我们发现对话上下文(单个话轮、同一说话人内的话轮对、跨说话人的话轮对)是信心分数的显著预测因素。我们从融合信心分数与对话上下文信息的共现统计中计算话语关系的分布式表示,并用这些表示进行层次聚类分析。结果表明,用信心与对话上下文信息对话语关系表示加权,能够一致地建模标注者对话语关系标签的不确定性。
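
The core construction, confidence-weighted distributed representations fed to hierarchical clustering, is easy to demonstrate. A minimal sketch with hypothetical co-occurrence counts and confidence values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy co-occurrence rows for 4 discourse relations over 3 context features,
# plus per-relation mean annotator confidence (all values hypothetical).
relations = ["elaboration", "contrast", "cause", "background"]
cooc = np.array([[8, 1, 1],
                 [1, 7, 2],
                 [2, 6, 2],
                 [5, 2, 3]], dtype=float)
confidence = np.array([0.9, 0.8, 0.4, 0.5])

# Weight each relation's distributed representation by annotator
# confidence: the paper's core idea in its simplest form.
reps = (cooc / cooc.sum(axis=1, keepdims=True)) * confidence[:, None]

labels = fcluster(linkage(reps, method="average"), t=2, criterion="maxclust")
print(dict(zip(relations, labels)))
```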

Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

  • paper_url: http://arxiv.org/abs/2308.07120
  • repo_url: None
  • paper_authors: Alexandra Sasha Luccioni, Anna Rogers
  • for: 本论文提出了大语言模型(LLMs)的定义,探讨了其功能和潜在应用,以及现有证据和反证据。
  • methods: 本论文使用了定义和探讨现有证据和反证据来探讨 LLMS 的功能和潜在应用。
  • results: 本论文提出了一个定义 LLMS,并探讨了现有证据和反证据,以及未来研究的可能性和方向。
    Abstract Much of the recent discourse within the NLP research community has been centered around Large Language Models (LLMs), their functionality and potential -- yet not only do we not have a working definition of LLMs, but much of this discourse relies on claims and assumptions that are worth re-examining. This position paper contributes a definition of LLMs, explicates some of the assumptions made regarding their functionality, and outlines the existing evidence for and against them. We conclude with suggestions for research directions and their framing in future work.
    摘要 近来 NLP 研究社区的大量讨论都围绕大语言模型(LLMs)及其功能与潜力展开,然而我们不仅缺乏 LLMs 的工作定义,这些讨论所依赖的许多主张与假设也值得重新审视。本立场论文给出了 LLMs 的定义,阐明了关于其功能的若干假设,并梳理了支持与反对这些假设的现有证据。最后,我们对未来工作的研究方向及其表述框架提出了建议。

Large Language Models for Information Retrieval: A Survey

  • paper_url: http://arxiv.org/abs/2308.07107
  • repo_url: https://github.com/ruc-nlpir/llm4ir-survey
  • paper_authors: Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, Ji-Rong Wen
  • for: This paper focuses on integrating large language models (LLMs) with information retrieval (IR) systems to improve their accuracy and efficiency.
  • methods: The survey covers ways of combining LLMs with IR components, including query rewriters, retrievers, rerankers, and readers, which leverage the language understanding and generation capabilities of LLMs.
  • results: The paper provides a comprehensive overview of the current state of research, highlights promising directions, and discusses challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses.
    Abstract As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions within this expanding field.
    摘要 作为获取信息的主要手段,信息检索(IR)系统(如搜索引擎)已融入我们的日常生活,并作为对话、问答与推荐系统的组成部分。IR 的发展轨迹从早期基于词项的方法,动态演进到与先进神经模型的结合。神经模型擅长捕捉复杂的上下文信号与语义细微差别,重塑了 IR 的格局,但仍面临数据稀缺、可解释性不足,以及生成语境上貌似合理却可能不准确的回答等挑战。这一演进需要将传统方法(如响应迅速的基于词项的稀疏检索)与现代神经架构(如具备强大语言理解能力的语言模型)结合起来。与此同时,以 ChatGPT 和 GPT-4 为代表的大型语言模型(LLMs)凭借卓越的语言理解、生成、泛化与推理能力,为自然语言处理带来了变革。因此,近期研究试图利用 LLMs 改进 IR 系统。鉴于该研究方向演进迅速,有必要对现有方法加以梳理并给出细致的洞察。在这篇综述中,我们深入探讨 LLMs 与 IR 系统的交汇,涵盖查询重写器、检索器、重排序器和阅读器等关键环节,并展望这一不断扩展的领域中的可行方向。

Temporal Sentence Grounding in Streaming Videos

  • paper_url: http://arxiv.org/abs/2308.07102
  • repo_url: https://github.com/sczwangxiao/tsgvs-mm2023
  • paper_authors: Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, Liqiang Nie
  • for: 本研究旨在解决一个新任务:流式视频中的时间句子定位(TSGSV),即评估视频流与给定句子查询之间的相关性。
  • methods: 我们提出了两种新方法:一是 TwinNet 双网络结构,使模型能够学习到即将发生事件的信息;二是语言引导的视觉特征压缩器,剔除与查询无关的冗余帧并强化相关帧。
  • results: 我们在 ActivityNet Captions、TACoS 和 MAD 数据集上进行了大量实验,结果表明所提方法具有优势,系统的消融实验也证明了其有效性。
    Abstract This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source, and are always desired to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.
    摘要 本文旨在解决一个新任务:流式视频中的时间句子定位(TSGSV),其目标是评估视频流与给定句子查询之间的相关性。与普通视频不同,流式视频从特定来源持续采集,在监控与直播分析等许多应用中需要即时处理。因此 TSGSV 极具挑战性:模型必须在看不到未来帧的情况下推理,并高效处理漫长的历史帧,这在早期方法中尚未被触及。针对上述挑战,我们提出两种新方法:(1)TwinNet 双网络结构,使模型能够学习即将发生事件的信息;(2)语言引导的视觉特征压缩器,剔除与查询无关的冗余帧并强化相关帧。我们在 ActivityNet Captions、TACoS 和 MAD 数据集上进行了大量实验,结果表明所提方法具有优势,系统的消融实验也证明了其有效性。

Aesthetics of Sanskrit Poetry from the Perspective of Computational Linguistics: A Case Study Analysis on Siksastaka

  • paper_url: http://arxiv.org/abs/2308.07081
  • repo_url: https://github.com/sanskritshala/shikshastakam
  • paper_authors: Jivnesh Sandhan, Amruta Barbadikar, Malay Maity, Pavankumar Satuluri, Tushar Sandhan, Ravi M. Gupta, Pawan Goyal, Laxmidhar Behera
  • for: 这篇论文旨在探讨梵语诗歌的计算语言学分析与分类方法,通过人工智能与专家的协作挖掘梵语诗歌中隐藏的美学价值。
  • methods: 论文提出了一个可解释框架,采用人在回路(human-in-the-loop)的方式:确定性的部分交由机器处理,深层语义则交由人类专家把握,用于分析和分类优秀诗歌的质量与特征。
  • results: 论文从 6 个著名诗学(kavyashastra)流派的视角对梵语诗歌 Siksastaka 进行了深入分析与标注,并提供了一个在线应用,便于后续研究。
    Abstract Sanskrit poetry has played a significant role in shaping the literary and cultural landscape of the Indian subcontinent for centuries. However, not much attention has been devoted to uncovering the hidden beauty of Sanskrit poetry in computational linguistics. This article explores the intersection of Sanskrit poetry and computational linguistics by proposing a roadmap of an interpretable framework to analyze and classify the qualities and characteristics of fine Sanskrit poetry. We discuss the rich tradition of Sanskrit poetry and the significance of computational linguistics in automatically identifying the characteristics of fine poetry. The proposed framework involves a human-in-the-loop approach that combines deterministic aspects delegated to machines and deep semantics left to human experts. We provide a deep analysis of Siksastaka, a Sanskrit poem, from the perspective of 6 prominent kavyashastra schools, to illustrate the proposed framework. Additionally, we provide compound, dependency, anvaya (prose order linearised form), meter, rasa (mood), alankar (figure of speech), and riti (writing style) annotations for Siksastaka and a web application to illustrate the poem's analysis and annotations. Our key contributions include the proposed framework, the analysis of Siksastaka, the annotations and the web application for future research. Link for interactive analysis: https://sanskritshala.github.io/shikshastakam/
    摘要 几个世纪以来,梵语诗歌在塑造印度次大陆的文学与文化版图方面发挥了重要作用,但计算语言学界对揭示梵语诗歌之美的关注尚少。本文探讨梵语诗歌与计算语言学的交叉,提出了一个可解释框架的路线图,用于分析与分类优秀梵语诗歌的品质与特征。我们讨论了梵语诗歌的深厚传统,以及计算语言学在自动识别优秀诗歌特征方面的意义。该框架采用人在回路的方式:确定性的部分交由机器处理,深层语义则留给人类专家。我们从 6 个著名诗学(kavyashastra)流派的视角对梵语诗歌 Siksastaka 进行了深入分析,并提供了其复合词、依存关系、anvaya(散文语序线性化形式)、格律、rasa(情感基调)、alankar(修辞手法)和 riti(写作风格)标注,以及一个展示该诗分析与标注的网页应用。我们的主要贡献包括所提框架、对 Siksastaka 的分析、标注数据以及供后续研究使用的网页应用。交互式分析链接:https://sanskritshala.github.io/shikshastakam/

Can Knowledge Graphs Simplify Text?

  • paper_url: http://arxiv.org/abs/2308.06975
  • repo_url: https://github.com/subhasmalik/Microsoft-azure-cognitive-services
  • paper_authors: Anthony Colas, Haodi Ma, Xuanli He, Yang Bai, Daisy Zhe Wang
  • for: 这篇论文是关于无监督文本简化的研究,旨在使用知识图(KG)技术来生成简洁的文本,保持原始文本的意义。
  • methods: 该论文提出了一种名为KGSimple的新方法,它通过迭代和采样KG-first的方式,利用知识图生成的技术来生成简洁的文本,并保持原始文本的意义。
  • results: 该论文在使用现有的KG-to-text dataset进行评估,并示出了KGSimple模型的效果比起无监督文本简化模型更好。 Code available on GitHub.
    Abstract Knowledge Graph (KG)-to-Text Generation has seen recent improvements in generating fluent and informative sentences which describe a given KG. As KGs are widespread across multiple domains and contain important entity-relation information, and as text simplification aims to reduce the complexity of a text while preserving the meaning of the original text, we propose KGSimple, a novel approach to unsupervised text simplification which infuses KG-established techniques in order to construct a simplified KG path and generate a concise text which preserves the original input's meaning. Through an iterative and sampling KG-first approach, our model is capable of simplifying text when starting from a KG by learning to keep important information while harnessing KG-to-text generation to output fluent and descriptive sentences. We evaluate various settings of the KGSimple model on currently-available KG-to-text datasets, demonstrating its effectiveness compared to unsupervised text simplification models which start with a given complex text. Our code is available on GitHub.
    摘要 知识图(KG)-to-文本生成技术在最近得到了改进,能够生成流畅、有信息的句子,描述给定的KG。由于KG广泛存在多个领域,含有重要的实体关系信息,而文本简化的目标是将文本简化到最小化复杂度,保持原始文本的意思,因此我们提议KGSimple,一种新的无监督文本简化方法,利用KG确立的技术来构建简化KG路径,生成简洁的文本,保持原始输入的意思。我们采用迭代和采样KG-first方法,使我们的模型能够从KG开始简化文本,学习保留重要信息,同时利用KG-to-文本生成技术输出流畅、描述性的句子。我们在当前可用的KG-to-文本 datasets上评估了不同的KGSimple模型设置,并证明其比无监督文本简化模型,从给定的复杂文本开始简化文本更有效。我们的代码可以在GitHub上找到。

EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce

  • paper_url: http://arxiv.org/abs/2308.06966
  • repo_url: None
  • paper_authors: Yangning Li, Shirong Ma, Xiaobin Wang, Shen Huang, Chengyue Jiang, Hai-Tao Zheng, Pengjun Xie, Fei Huang, Yong Jiang
  • for: 本研究旨在解决通用语言模型难以胜任电商任务的问题,提出了一个新的数据集和专门定制的模型 EcomGPT,以提升模型在这些任务上的表现。
  • methods: 研究构建了名为 EcomInstruct 的指令数据集,共含 250 万条指令数据,利用商品信息、用户评论等电商基础数据类型构建原子任务(链式任务),以扩大数据规模与任务多样性;并以 BLOOMZ 为骨干模型在该数据集上训练得到 EcomGPT。
  • results: 大量实验与人工评估表明,得益于链式任务带来的基础语义理解能力,EcomGPT 在电商任务上的跨数据集/跨任务泛化性能优于 ChatGPT。
    Abstract Recently, instruction-following Large Language Models (LLMs) , represented by ChatGPT, have exhibited exceptional performance in general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work, we proposed the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data. EcomInstruct scales up the data size and task diversity by constructing atomic tasks with E-commerce basic data types, such as product information, user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT with different parameter scales by training the backbone model BLOOMZ with the EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in term of cross-dataset/task generalization on E-commerce tasks.
    摘要 近来,以 ChatGPT 为代表的指令跟随大语言模型(LLMs)在通用自然语言处理(NLP)任务中表现卓越。然而,电商数据的独特性质给通用 LLMs 带来了巨大挑战,业界亟需一个专为电商场景定制、具备强大跨数据集/跨任务泛化能力的 LLM。为此,本文构建了首个电商指令数据集 EcomInstruct,共含 250 万条指令数据。EcomInstruct 利用商品信息、用户评论等电商基础数据类型构建原子任务,从而扩大数据规模与任务多样性;原子任务指在求解最终任务时隐式涉及的中间任务,我们也称之为链式任务(Chain-of-Task)。我们以 BLOOMZ 为骨干模型,在 EcomInstruct 上训练得到不同参数规模的 EcomGPT。得益于链式任务带来的基础语义理解能力,EcomGPT 展现出出色的零样本泛化能力。大量实验与人工评估表明,EcomGPT 在电商任务上的跨数据集/跨任务泛化性能优于 ChatGPT。

Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation

  • paper_url: http://arxiv.org/abs/2308.06953
  • repo_url: None
  • paper_authors: David Heineman, Yao Dou, Wei Xu
  • for: 本研究是为了提供一个可靠、可重复的方式来评估文本生成任务,如摘要、简化、机器翻译和新闻生成。
  • methods: 本研究提出了名为 Thresh 的统一、可定制、可部署平台,用户只需创建一个 YAML 配置文件,即可在一个浏览器窗口中于几分钟内搭建并测试面向任意框架的标注界面。
  • results: 本研究通过Thresh平台可以快速创建和测试多种NLP任务的评估界面,并且可以提供多种批处理和大规模评估的选项。
    Abstract Fine-grained, span-level human evaluation has emerged as a reliable and robust method for evaluating text generation tasks such as summarization, simplification, machine translation and news generation, and the derived annotations have been useful for training automatic metrics and improving language models. However, existing annotation tools implemented for these evaluation frameworks lack the adaptability to be extended to different domains or languages, or modify annotation settings according to user needs. And the absence of a unified annotated data format inhibits the research in multi-task learning. In this paper, we introduce Thresh, a unified, customizable and deployable platform for fine-grained evaluation. By simply creating a YAML configuration file, users can build and test an annotation interface for any framework within minutes -- all in one web browser window. To facilitate collaboration and sharing, Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations made and collected by the community, covering a wide range of NLP tasks. For deployment, Thresh offers multiple options for any scale of annotation projects from small manual inspections to large crowdsourcing ones. Additionally, we introduce a Python library to streamline the entire process from typology design and deployment to annotation processing. Thresh is publicly accessible at https://thresh.tools.
    摘要 细粒度、片段级的人工评估已成为评估摘要、简化、机器翻译和新闻生成等文本生成任务的可靠而稳健的方法,由此得到的标注可用于训练自动指标和改进语言模型。但现有为这些评估框架实现的标注工具缺乏向不同领域或语言扩展、或按用户需求调整标注设置的灵活性;统一标注数据格式的缺失也阻碍了多任务学习研究。本文提出 Thresh,一个统一、可定制、可部署的细粒度评估平台。用户只需创建一个 YAML 配置文件,即可在一个浏览器窗口中于几分钟内搭建并测试面向任意框架的标注界面。为便于协作与共享,Thresh 提供了社区平台,汇集了由社区构建和收集的细粒度评估框架及相应标注,覆盖广泛的 NLP 任务。在部署方面,Thresh 为从小规模人工检查到大规模众包的各类标注项目提供了多种选项。此外,我们还提供了一个 Python 库,以简化从类型体系设计、部署到标注处理的整个流程。Thresh 公开访问地址为 https://thresh.tools。

Automated Testing and Improvement of Named Entity Recognition Systems

  • paper_url: http://arxiv.org/abs/2308.07937
  • repo_url: None
  • paper_authors: Boxi Yu, Yiyan Hu, Qiuyang Mang, Wenhan Hu, Pinjia He
  • for: 提高Named Entity Recognition(NER)系统的可靠性和精度,使其在不同的自然语言处理应用中更可靠。
  • methods: 提出了一种新的、广泛适用的方法,可以自动测试和修复不同的NER系统。
  • results: 通过测试两个state-of-the-art(SOTA)NER模型和两个商业NER API,发现自动测试和修复可以高效地提高NER系统的精度和可靠性。
    Abstract Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.
    摘要 TIN的关键想法是在相似的上下文中,NER预测的相同名称应该是相同的。TIN的核心想法是在相似的上下文中,相似的名称应该有相同的NER预测。我们使用 TIN 测试了两个 SOTA NER 模型和两个商业 NER API,即 Azure NER 和 AWS NER。我们手动验证了 TIN 发现的 784 个可疑问题中的 702 个是错误的问题,得到了高精度(85.0%-93.4%)在四个NER错误类型中。对于自动修复,TIN 在四个系统上得到了高错误率(26.8%-50.6%),成功修复了 1,056 个reported NER 错误。

CausalLM is not optimal for in-context learning

  • paper_url: http://arxiv.org/abs/2308.06912
  • repo_url: None
  • paper_authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
  • for: 本研究旨在理解上下文学习(in-context learning)中前缀语言模型(prefixLM)与因果语言模型(causalLM)的性能差异。
  • methods: 本研究采用理论分析方法,在特定参数构造下分析 prefixLM 与 causalLM 的收敛行为,并通过合成与真实任务上的实验验证理论结论。
  • results: 结果显示,prefixLM 在线性回归问题中收敛到最优解,而 causalLM 的收敛动态类似在线梯度下降算法,即使样本数量无限增大也无法保证最优;实验表明 causalLM 在所有设置下均逊于 prefixLM。
    Abstract Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
    摘要 近期实证表明,基于 Transformer 的上下文学习在使用前缀语言模型(prefixLM)时表现更好:上下文样本之间可以相互关注;而因果语言模型(causalLM)使用自回归注意力,禁止上下文样本关注其后的样本。这一结果虽然符合直觉,但尚缺乏理论解释。本文从理论角度出发,在特定参数构造下分析了 prefixLM 与 causalLM 的收敛行为。分析表明,两类模型都以线性速率收敛到各自的驻点,但 prefixLM 收敛到线性回归的最优解,而 causalLM 的收敛动态遵循在线梯度下降算法,即使样本数量无限增大也无法保证最优。我们在合成与真实任务上、使用多种 Transformer 进行了实验,验证了上述理论结论;实验表明 causalLM 在所有设置下一致地逊于 prefixLM。
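
The prefixLM/causalLM distinction comes down to the attention mask: in a prefixLM, the in-context portion attends bidirectionally. A small sketch that prints both masks:

```python
import numpy as np

def causal_mask(n):
    """Token i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    """Positions inside the prefix (the in-context examples) attend to
    each other bidirectionally; the suffix stays causal."""
    m = causal_mask(n)
    m[:prefix_len, :prefix_len] = True
    return m

n, p = 6, 4  # 6 tokens, first 4 form the in-context prefix
print(causal_mask(n).astype(int))
print(prefix_mask(n, p).astype(int))
# Rows 0-3 of the prefix mask are fully connected among themselves,
# which is exactly the extra attention causalLM forbids.
```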

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

  • paper_url: http://arxiv.org/abs/2308.06911
  • repo_url: None
  • paper_authors: Pengfei Liu, Yiming Ren, Zhixiang Ren
  • for: 这篇论文旨在开发一种多模态大语言模型,以捕捉分子数据中丰富而复杂的信息。
  • methods: 论文提出了新的 GIT-Former 架构,可将所有模态映射到统一的潜空间。
  • results: 论文实现了一种创新的任意模态到语言的分子翻译策略,在分子描述上提升 10%-15%,在性质预测准确率上提升 5%-10%,在分子生成有效性上提升 20%。
    Abstract Large language models have made significant strides in natural language processing, paving the way for innovative applications including molecular representation and generation. However, most existing single-modality approaches cannot capture the abundant and complex information in molecular data. Here, we introduce GIT-Mol, a multi-modal large language model that integrates the structure Graph, Image, and Text information, including the Simplified Molecular Input Line Entry System (SMILES) and molecular captions. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of mapping all modalities into a unified latent space. Our study develops an innovative any-to-language molecular translation strategy and achieves a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity compared to baseline or single-modality models.
    摘要 大型语言模型在自然语言处理方面取得了重要进展,开创了包括分子表示与生成在内的创新应用。然而,现有的单模态方法大多无法捕捉分子数据中丰富而复杂的信息。我们提出 GIT-Mol,一个融合结构图(Graph)、图像(Image)与文本(Text)信息的多模态大语言模型,其中文本包括简化分子线性输入规范(SMILES)和分子描述。为实现多模态分子数据的整合,我们提出了 GIT-Former 架构,可将所有模态映射到统一的潜空间。本研究还提出了一种创新的任意模态到语言(any-to-language)的分子翻译策略,相比基线或单模态模型,在分子描述上提升 10%-15%,在性质预测准确率上提升 5%-10%,在分子生成有效性上提升 20%。

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

  • paper_url: http://arxiv.org/abs/2308.06873
  • repo_url: None
  • paper_authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
  • for: 这篇论文旨在探讨高质量零样本文本转语音(TTS)模型,并使其能够处理多种语音变换任务,包括噪声抑制、目标说话人提取、语音编辑等。
  • methods: 论文提出了名为 SpeechX 的多任务学习模型,将神经编解码语言模型与基于任务相关提示的多任务学习相结合,实现对各类语音任务的统一且可扩展的建模。
  • results: 实验结果表明,SpeechX 在零样本 TTS、噪声抑制、目标说话人提取、语音移除与语音编辑等任务上表现出色,性能与专用模型相当或更优。
    Abstract Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
    摘要 近期基于音频-文本提示的生成式语音模型取得了显著进展,实现了高质量的零样本文本转语音等创新。然而,现有模型在处理多样的音频-文本语音生成任务方面仍有局限,例如对输入语音进行变换,或处理在不利声学条件下采集的音频。本文提出 SpeechX,一个多功能语音生成模型,可完成零样本 TTS 及多种语音变换任务,并同时处理干净与含噪信号。SpeechX 将神经编解码语言模型与基于任务相关提示的多任务学习相结合,实现了统一且可扩展的建模,并为在语音增强与变换任务中利用文本输入提供了一致的方式。实验表明,SpeechX 在零样本 TTS、噪声抑制、目标说话人提取、语音移除,以及有无背景噪声下的语音编辑等各项任务中均表现出色,性能与专用模型相当或更优。示例见 https://aka.ms/speechx。

cs.LG - 2023-08-14

Distance Matters For Improving Performance Estimation Under Covariate Shift

  • paper_url: http://arxiv.org/abs/2308.07223
  • repo_url: https://github.com/melanibe/distance_matters_performance_estimation
  • paper_authors: Mélanie Roschewitz, Ben Glocker
  • for: 本文旨在提出一种基于测试样本与训练分布之间距离的性能估计方法,以便在协变量偏移(covariate shift)下安全地部署 AI 模型。
  • methods: 本文引入"距离检查",标记与预期训练分布相距过远的样本,避免在准确率估计环节依赖其不可信的模型输出。
  • results: 实验表明,该方法在 13 个图像分类任务上相对最优基线取得了统计显著的性能估计改进(中位相对 MAE 改善 27%),并在其中 10 个任务上达到最优。代码见 https://github.com/melanibe/distance_matters_performance_estimation。
    Abstract Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Precisely, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide-range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
    摘要 协变量偏移下的性能估计是安全部署 AI 模型的关键环节,对敏感应用尤甚。近来已有多种方案试图解决这一问题,大多利用模型预测或 softmax 置信度推导准确率估计;然而在数据集偏移下,若样本离训练分布过远,置信度分数可能失准。本文表明,考虑测试样本到其预期训练分布的距离,可以显著改进协变量偏移下的性能估计。具体而言,我们引入"距离检查",标记离预期分布过远的样本,避免在准确率估计环节依赖其不可信的模型输出。我们在 13 个图像分类任务、覆盖广泛的自然与合成分布偏移及数百个模型上验证了该方法的有效性:所有任务上相对最优基线的中位相对 MAE 改善达 27%,并在 13 个任务中的 10 个上达到最优。代码公开于 https://github.com/melanibe/distance_matters_performance_estimation。
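
The distance-check itself is simple to sketch. The Euclidean distance to the training-feature mean and the quantile threshold below are simplifying assumptions; the paper operates on model feature spaces:

```python
import numpy as np

def distance_check(train_feats, test_feats, quantile=0.99):
    """Flag test samples whose distance to the training-feature mean
    exceeds the `quantile` of training distances; flagged samples are
    excluded from confidence-based accuracy estimation."""
    mu = train_feats.mean(axis=0)
    train_d = np.linalg.norm(train_feats - mu, axis=1)
    threshold = np.quantile(train_d, quantile)
    test_d = np.linalg.norm(test_feats - mu, axis=1)
    return test_d > threshold   # True -> too far, do not trust its confidence

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(1000, 16))
test = np.vstack([rng.normal(0, 1, size=(5, 16)),      # in-distribution
                  rng.normal(6, 1, size=(5, 16))])     # shifted
print(distance_check(train, test))  # in-distribution False, shifted True
```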

AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes

  • paper_url: http://arxiv.org/abs/2308.07221
  • repo_url: https://github.com/LZH-0225/AudioFormer
  • paper_authors: Zhaohui Li, Haitao Wang, Xinghua Jiang
  • for: 本研究提出一种名为 AudioFormer 的方法,通过获取离散声学编码并在其基础上进行微调,为音频分类任务学习特征表示。
  • methods: 我们首先提出一个新视角:将音频分类任务视为一种自然语言理解(NLU)。随后利用现有的神经音频编解码模型生成离散声学编码,并以其训练掩码语言模型(MLM)以获得音频特征表示;此外还提出多正样本对比(MPC)学习方法,用于学习同一音频输入中多个离散声学编码之间的联合表示。
  • results: 实验中,我们将离散声学编码视为文本数据,以类似完形填空的方式训练掩码语言模型,最终获得高质量的音频表示;MPC 学习技术还能有效捕捉多个正样本之间的协同表示。
    Abstract We propose a method named AudioFormer,which learns audio feature representations through the acquisition of discrete acoustic codes and subsequently fine-tunes them for audio classification tasks. Initially,we introduce a novel perspective by considering the audio classification task as a form of natural language understanding (NLU). Leveraging an existing neural audio codec model,we generate discrete acoustic codes and utilize them to train a masked language model (MLM),thereby obtaining audio feature representations. Furthermore,we pioneer the integration of a Multi-Positive sample Contrastive (MPC) learning approach. This method enables the learning of joint representations among multiple discrete acoustic codes within the same audio input. In our experiments,we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology,ultimately deriving high-quality audio representations. Notably,the MPC learning technique effectively captures collaborative representations among distinct positive samples. Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models across multiple datasets,and even outperforms audio-visual multimodal classification models on select datasets. Specifically,our approach achieves remarkable results on datasets including AudioSet (2M,20K),and FSD50K,with performance scores of 53.9,45.1,and 65.6,respectively. We have openly shared both the code and models: https://github.com/LZH-0225/AudioFormer.git.
    摘要 我们提出一种名为 AudioFormer 的方法,通过获取离散声学编码来学习音频特征表示,随后针对音频分类任务进行微调。我们首先提出一个新视角:将音频分类任务视为一种自然语言理解(NLU)。利用现有的神经音频编解码模型,我们生成离散声学编码,并以其训练掩码语言模型(MLM),从而获得音频特征表示。此外,我们率先引入多正样本对比(MPC)学习方法,使模型能够学习同一音频输入中多个离散声学编码之间的联合表示。实验中,我们将离散声学编码视为文本数据,以类似完形填空的方式训练掩码语言模型,最终得到高质量的音频表示;MPC 学习还有效捕捉了不同正样本之间的协同表示。结果表明,AudioFormer 在多个数据集上显著优于主流单模态音频分类模型,甚至在部分数据集上超越音频-视觉多模态分类模型:在 AudioSet(2M、20K)和 FSD50K 上分别取得 53.9、45.1 与 65.6 的成绩。代码与模型已公开:https://github.com/LZH-0225/AudioFormer.git。

Generating Individual Trajectories Using GPT-2 Trained from Scratch on Encoded Spatiotemporal Data

  • paper_url: http://arxiv.org/abs/2308.07940
  • repo_url: None
  • paper_authors: Taizo Horikomi, Shouji Fujimoto, Atushi Ishikawa, Takayuki Mizuno
  • for: 本研究使用GPT-2语言模型来生成个人日常路径序列,以考虑环境因素和个人特征的影响。
  • methods: 研究人员使用了坐标转换技术将地理坐标表示为特定的位置符号,并将每天的路径序列表示为一系列这些位置符号。特定的时间间隔符号和环境因素符号也被添加到序列中,以便在GPT-2架构上进行训练。
  • results: 通过训练这些位置符号和时间间隔符号,研究人员可以生成受环境因素和个人特征影响的个人日常路径序列。
    Abstract Following Mizuno, Fujimoto, and Ishikawa's research (Front. Phys. 2022), we transpose geographical coordinates expressed in latitude and longitude into distinctive location tokens that embody positions across varied spatial scales. We encapsulate an individual daily trajectory as a sequence of tokens by adding unique time interval tokens to the location tokens. Using the architecture of an autoregressive language model, GPT-2, this sequence of tokens is trained from scratch, allowing us to construct a deep learning model that sequentially generates an individual daily trajectory. Environmental factors such as meteorological conditions and individual attributes such as gender and age are symbolized by unique special tokens, and by training these tokens and trajectories on the GPT-2 architecture, we can generate trajectories that are influenced by both environmental factors and individual attributes.
    摘要 根据米泽野、藤本和石川等人的研究(Front. Phys. 2022),我们将地理坐标表示为纬度和经度转换为特征化的位置标记,这些标记表示在不同的空间尺度上的位置。我们将每天的行走路径序列为一系列标记,并将具有特定时间间隔的唯一标记添加到位置标记中。使用GPT-2架构的自然语言模型,我们从头开始训练这些标记和路径,以生成基于环境因素和个人特征的各天行走路径。特殊的环境因素和个人特征被象化为唯一的特殊标记,通过训练这些标记和路径,我们可以生成受环境因素和个人特征影响的行走路径。
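
The encoding step, quantizing coordinates into location tokens and interleaving time-interval tokens, can be sketched as follows; the grid size, token format, and gap buckets are illustrative assumptions, not the paper's exact scheme:

```python
def location_token(lat, lon, cell_deg=0.01):
    """Quantize a coordinate onto a grid and emit a token string."""
    return f"L{int(lat // cell_deg)}_{int(lon // cell_deg)}"

def encode_day(points, gap_buckets=(0, 30, 60)):
    """Interleave location tokens with coarse time-interval tokens;
    `points` is a list of (minute_of_day, lat, lon) triples."""
    tokens, prev_t = [], None
    for t_min, lat, lon in points:
        if prev_t is not None:
            gap = t_min - prev_t
            bucket = max(g for g in gap_buckets if g <= gap)
            tokens.append(f"T{bucket}+")
        tokens.append(location_token(lat, lon))
        prev_t = t_min
    return tokens

day = [(480, 35.681, 139.767), (545, 35.690, 139.700), (1080, 35.681, 139.767)]
print(encode_day(day))
# Location tokens alternate with T60+ interval tokens; the resulting
# sequence is what the GPT-2 architecture is trained on from scratch.
```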

Automated Ensemble-Based Segmentation of Pediatric Brain Tumors: A Novel Approach Using the CBTN-CONNECT-ASNR-MICCAI BraTS-PEDs 2023 Challenge Data

  • paper_url: http://arxiv.org/abs/2308.07212
  • repo_url: None
  • paper_authors: Shashidhar Reddy Javaji, Sovesh Mohapatra, Advait Gosai, Gottfried Schlaug
  • for: 这项研究旨在发展深度学习技术,以改进儿童脑肿瘤的分割,支持诊断与治疗规划。
  • methods: 研究采用由 ONet 与改进版 UNet 组成的集成方法,并配合新的损失函数与数据增强。
  • results: 集成方法能够更好地捕捉特定特征并覆盖肿瘤区域,在增强肿瘤、肿瘤核心和整体肿瘤标签上分别取得 0.52、0.72 与 0.78 的逐病灶 Dice 分数。
    Abstract Brain tumors remain a critical global health challenge, necessitating advancements in diagnostic techniques and treatment methodologies. In response to the growing need for age-specific segmentation models, particularly for pediatric patients, this study explores the deployment of deep learning techniques using magnetic resonance imaging (MRI) modalities. By introducing a novel ensemble approach using ONet and modified versions of UNet, coupled with innovative loss functions, this study achieves a precise segmentation model for the BraTS-PEDs 2023 Challenge. Data augmentation, including both single and composite transformations, ensures model robustness and accuracy across different scanning protocols. The ensemble strategy, integrating the ONet and UNet models, shows greater effectiveness in capturing specific features and modeling diverse aspects of the MRI images which result in lesion_wise dice scores of 0.52, 0.72 and 0.78 for enhancing tumor, tumor core and whole tumor labels respectively. Visual comparisons further confirm the superiority of the ensemble method in accurate tumor region coverage. The results indicate that this advanced ensemble approach, building upon the unique strengths of individual models, offers promising prospects for enhanced diagnostic accuracy and effective treatment planning for brain tumors in pediatric brains.
    摘要 脑肿瘤仍是全球性的重大健康挑战,亟需诊断技术与治疗方法的进步。针对日益增长的年龄特异性分割模型需求,特别是儿童患者,本研究探索了基于磁共振成像(MRI)模态的深度学习技术。通过引入由 ONet 与改进版 UNet 组成的新型集成方法并配合创新的损失函数,本研究为 BraTS-PEDs 2023 挑战赛构建了精确的分割模型。包含单一与复合变换的数据增强确保了模型在不同扫描协议下的鲁棒性与准确性。融合 ONet 与 UNet 的集成策略在捕捉特定特征和建模 MRI 图像的多样化特性方面更为有效,在增强肿瘤、肿瘤核心和整体肿瘤标签上分别取得 0.52、0.72 与 0.78 的逐病灶(lesion-wise)Dice 分数。可视化对比进一步证实了集成方法在准确覆盖肿瘤区域方面的优势。结果表明,这一基于各模型独特优势的先进集成方法,有望提升儿童脑肿瘤的诊断准确性并支持有效的治疗规划。

Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

  • paper_url: http://arxiv.org/abs/2308.07209
  • repo_url: None
  • paper_authors: Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, Yong Liu
  • for: 结构化剪枝与量化可以减少神经网络的推理时间与内存占用,但多数现有方法需要原始训练集来微调模型,资源负担沉重,且不适用于数据敏感或保密的应用。
  • methods: 已有一些数据无关(data-free)方法被提出,但它们将数据无关剪枝与量化分开进行,未能发掘二者的互补性;本文提出的统一数据无关压缩(UDFC)框架可在无任何数据与微调过程的情况下同时完成剪枝与量化。
  • results: 在大规模图像分类任务上,UDFC 相比多种网络架构与压缩方法取得显著提升。例如,在 ImageNet 数据集上对 ResNet-34 进行 30% 剪枝和 6 比特量化后,相比最优方法取得 20.54% 的准确率提升。
    Abstract Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods are proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression (UDFC), which performs pruning and quantization simultaneously without any data and fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged (e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from the assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate the UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on ImageNet dataset compared to SOTA method with 30% pruning ratio and 6-bit quantization on ResNet-34.
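To make the core assumption concrete, here is a minimal numpy sketch of restoring a pruned channel's information as a linear combination of the remaining channels; the helper names are ours, and the paper's closed-form solution additionally covers quantization error.

```python
import numpy as np

def prune_channel_datafree(W_cur, W_next, pruned_idx):
    # W_cur: (C_out, C_in * k * k) filters of the current layer, one row per channel.
    # W_next: (C_next, C_out) weights of the following layer (1x1-conv view).
    kept = [i for i in range(W_cur.shape[0]) if i != pruned_idx]
    A = W_cur[kept].T                               # kept filters as columns
    b = W_cur[pruned_idx]                           # filter being pruned
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)   # pruned filter ~= A @ alpha
    # Fold the pruned channel's contribution into the next layer's weights:
    W_next_comp = W_next[:, kept] + np.outer(W_next[:, pruned_idx], alpha)
    return W_cur[kept], W_next_comp
```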

Algorithms for the Training of Neural Support Vector Machines

  • paper_url: http://arxiv.org/abs/2308.07204
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Lars Simon, Manuel Radons
  • for: The paper explores the design of neural support vector machine (NSVM) models, which allow domain knowledge to be incorporated into the model architecture, together with training algorithms based on the Pegasos algorithm.
  • methods: NSVM models are trained with a set of algorithms that leverage the Pegasos algorithm.
  • results: Feasibility is demonstrated as a proof of concept by solving a set of standard machine learning tasks.
    Abstract Neural support vector machines (NSVMs) allow for the incorporation of domain knowledge in the design of the model architecture. In this article we introduce a set of training algorithms for NSVMs that leverage the Pegasos algorithm and provide a proof of concept by solving a set of standard machine learning tasks.
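For context, a minimal sketch of the classical linear Pegasos update that the proposed NSVM training algorithms build on; in an NSVM the scores would come from network features rather than raw inputs, which this sketch does not show.

```python
import numpy as np

def pegasos(X, y, lam=0.01, T=1000, seed=0):
    # Stochastic subgradient descent on the regularized hinge loss; y in {-1, +1}.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)                  # Pegasos step-size schedule
        margin_violated = y[i] * (w @ X[i]) < 1
        w *= (1 - eta * lam)                   # shrink (regularization step)
        if margin_violated:
            w += eta * y[i] * X[i]             # hinge-loss subgradient step
    return w
```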

Neural Categorical Priors for Physics-Based Character Control

  • paper_url: http://arxiv.org/abs/2308.07200
  • repo_url: https://github.com/Tencent-RoboticsX/NCP
  • paper_authors: Qingxu Zhu, He Zhang, Mengting Lan, Lei Han
  • for: The goal is a new learning framework for controlling physics-based characters with substantially higher motion quality and diversity than existing state-of-the-art methods.
  • methods: Reinforcement learning (RL) is used to track and imitate life-like movements from unstructured motion clips, with a vector quantized variational autoencoder (VQ-VAE) compressing the most relevant information of the clips into a discrete latent space.
  • results: The proposed method controls characters to perform high-quality, diverse movements and performs well on two challenging downstream tasks, sword-and-shield striking and a two-player boxing game.
    Abstract Recent advances in learning reusable motion priors have demonstrated their effectiveness in generating naturalistic behaviors. In this paper, we propose a new learning framework in this paradigm for controlling physics-based characters with significantly improved motion quality and diversity over existing state-of-the-art methods. The proposed method uses reinforcement learning (RL) to initially track and imitate life-like movements from unstructured motion clips using the discrete information bottleneck, as adopted in the Vector Quantized Variational AutoEncoder (VQ-VAE). This structure compresses the most relevant information from the motion clips into a compact yet informative latent space, i.e., a discrete space over vector quantized codes. By sampling codes in the space from a trained categorical prior distribution, high-quality life-like behaviors can be generated, similar to the usage of VQ-VAE in computer vision. Although this prior distribution can be trained with the supervision of the encoder's output, it follows the original motion clip distribution in the dataset and could lead to imbalanced behaviors in our setting. To address the issue, we further propose a technique named prior shifting to adjust the prior distribution using curiosity-driven RL. The outcome distribution is demonstrated to offer sufficient behavioral diversity and significantly facilitates upper-level policy learning for downstream tasks. We conduct comprehensive experiments using humanoid characters on two challenging downstream tasks, sword-shield striking and two-player boxing game. Our results demonstrate that the proposed framework is capable of controlling the character to perform considerably high-quality movements in terms of behavioral strategies, diversity, and realism. Videos, codes, and data are available at https://tencent-roboticsx.github.io/NCP/.
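A minimal sketch of the VQ-VAE-style discrete bottleneck at the heart of the method: encoder outputs are snapped to their nearest codebook entries, with a straight-through estimator for gradients. This shows only the quantization step, not the categorical prior or the prior-shifting technique.

```python
import torch

def vector_quantize(z, codebook):
    # z: (B, D) encoder outputs; codebook: (K, D) learned codes.
    d = torch.cdist(z, codebook)        # (B, K) pairwise distances
    idx = d.argmin(dim=1)               # nearest code per latent
    z_q = codebook[idx]
    # Straight-through estimator: forward uses z_q, gradients flow to z.
    return z + (z_q - z).detach(), idx
```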

Explaining Black-Box Models through Counterfactuals

  • paper_url: http://arxiv.org/abs/2308.07198
  • repo_url: https://github.com/juliatrustworthyai/counterfactualexplanations.jl
  • paper_authors: Patrick Altmeyer, Arie van Deursen, Cynthia C. S. Liem
  • for: The paper targets explainable artificial intelligence (XAI).
  • methods: Counterfactual explanations (CE) and algorithmic recourse (AR) are used to explain the predictions of black-box models.
  • results: The paper delivers CounterfactualExplanations.jl, a Julia package that generates counterfactual explanations and algorithmic recourse for arbitrary black-box predictive models.
    Abstract We present CounterfactualExplanations.jl: a package for generating Counterfactual Explanations (CE) and Algorithmic Recourse (AR) for black-box models in Julia. CE explain how inputs into a model need to change to yield specific model predictions. Explanations that involve realistic and actionable changes can be used to provide AR: a set of proposed actions for individuals to change an undesirable outcome for the better. In this article, we discuss the usefulness of CE for Explainable Artificial Intelligence and demonstrate the functionality of our package. The package is straightforward to use and designed with a focus on customization and extensibility. We envision it to one day be the go-to place for explaining arbitrary predictive models in Julia through a diverse suite of counterfactual generators.
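While the package itself is written in Julia, the generic gradient-based counterfactual search it builds on can be sketched language-agnostically; the loss weights and step counts below are illustrative.

```python
import torch
import torch.nn.functional as F

def counterfactual(model, x, target, steps=200, lr=0.05, lam=0.1):
    # Wachter-style search: move x until the model predicts `target`,
    # while an L1 penalty keeps the counterfactual close to the original.
    x_cf = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x_cf), target) + lam * (x_cf - x).abs().sum()
        loss.backward()
        opt.step()
    return x_cf.detach()
```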

gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling

  • paper_url: http://arxiv.org/abs/2308.07192
  • repo_url: https://github.com/asash/gsasrec
  • paper_authors: Aleksandr Petrov, Craig Macdonald
  • for: The paper explains why the SASRec model underperforms BERT4Rec and proposes a novel generalised binary cross-entropy loss (gBCE) together with an improved model, gSASRec, to mitigate the overconfidence problem.
  • methods: SASRec and BERT4Rec are compared; gBCE is introduced and theoretically proven to mitigate overconfidence; gSASRec additionally deploys an increased number of negatives.
  • results: Detailed experiments on three datasets show that gSASRec does not exhibit overconfidence and can outperform BERT4Rec (e.g., +9.47% NDCG on MovieLens-1M) while requiring far less training time (e.g., -73% on MovieLens-1M).
    Abstract A large catalogue size is one of the central challenges in training recommendation models: a large number of items makes them memory and computationally inefficient to compute scores for all items during training, forcing these models to deploy negative sampling. However, negative sampling increases the proportion of positive interactions in the training data, and therefore models trained with negative sampling tend to overestimate the probabilities of positive interactions, a phenomenon we call overconfidence. While the absolute values of the predicted scores or probabilities are not important for the ranking of retrieved recommendations, overconfident models may fail to estimate nuanced differences in the top-ranked items, resulting in degraded performance. In this paper, we show that overconfidence explains why the popular SASRec model underperforms when compared to BERT4Rec. This is contrary to the BERT4Rec authors explanation that the difference in performance is due to the bi-directional attention mechanism. To mitigate overconfidence, we propose a novel Generalised Binary Cross-Entropy Loss function (gBCE) and theoretically prove that it can mitigate overconfidence. We further propose the gSASRec model, an improvement over SASRec that deploys an increased number of negatives and the gBCE loss. We show through detailed experiments on three datasets that gSASRec does not exhibit the overconfidence problem. As a result, gSASRec can outperform BERT4Rec (e.g. +9.47% NDCG on the MovieLens-1M dataset), while requiring less training time (e.g. -73% training time on MovieLens-1M). Moreover, in contrast to BERT4Rec, gSASRec is suitable for large datasets that contain more than 1 million items.
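A sketch of the gBCE idea, under the assumption that the positive-score probability is raised to a power beta calibrated from the negative-sampling rate (beta = 1 recovers standard BCE with sampled negatives); consult the paper for the exact calibration.

```python
import torch.nn.functional as F

def gbce_loss(pos_scores, neg_scores, beta):
    # pos_scores: (B,) logits of positives; neg_scores: (B, K) logits of sampled negatives.
    pos = -beta * F.logsigmoid(pos_scores)          # log sigma(s+)^beta
    neg = -F.logsigmoid(-neg_scores).sum(dim=-1)    # sum of log(1 - sigma(s-))
    return (pos + neg).mean()
```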

Improving ICD-based semantic similarity by accounting for varying degrees of comorbidity

  • paper_url: http://arxiv.org/abs/2308.07359
  • repo_url: None
  • paper_authors: Jan Janosch Schneider, Marius Adler, Christoph Ammer-Herrmenau, Alexander Otto König, Ulrich Sax, Jonas Hügel
  • for: To find similar patients, supporting treatment-outcome assessment and clinical decision making.
  • methods: Semantic similarity algorithms over ICD code sets, including level-based information content, Leacock & Chodorow concept similarity, and bipartite graph matching, extended with a scale term that accounts for varying degrees of documented comorbidity.
  • results: Accounting for comorbidity variance significantly improves performance; the best combination of level-based information content, Leacock & Chodorow concept similarity, and bipartite graph matching reaches a correlation of 0.75 with the expert-rated ground truth.
    Abstract Finding similar patients is a common objective in precision medicine, facilitating treatment outcome assessment and clinical decision support. Choosing widely-available patient features and appropriate mathematical methods for similarity calculations is crucial. International Statistical Classification of Diseases and Related Health Problems (ICD) codes are used worldwide to encode diseases and are available for nearly all patients. Aggregated as sets consisting of primary and secondary diagnoses they can display a degree of comorbidity and reveal comorbidity patterns. It is possible to compute the similarity of patients based on their ICD codes by using semantic similarity algorithms. These algorithms have been traditionally evaluated using a single-term, expert-rated data set. However, real-world patient data often display varying degrees of documented comorbidities that might impair algorithm performance. To account for this, we present a scale term that considers documented comorbidity-variance. In this work, we compared the performance of 80 combinations of established algorithms in terms of semantic similarity based on ICD-code sets. The sets have been extracted from patients with a C25.X (pancreatic cancer) primary diagnosis and provide a variety of different combinations of ICD-codes. Using our scale term we yielded the best results with a combination of level-based information content, Leacock & Chodorow concept similarity and bipartite graph matching for the set similarities reaching a correlation of 0.75 with our expert's ground truth. Our results highlight the importance of accounting for comorbidity variance while demonstrating how well current semantic similarity algorithms perform.
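A minimal sketch of the bipartite-matching set similarity used here: pairwise concept similarities between two ICD code sets form a matrix, and an optimal one-to-one matching is taken. The `concept_sim` callable (e.g., Leacock & Chodorow similarity) and the mean aggregation are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_similarity(codes_a, codes_b, concept_sim):
    # Score every (code_a, code_b) pair, then match codes one-to-one so that
    # the total similarity is maximal; return the mean of matched similarities.
    S = np.array([[concept_sim(a, b) for b in codes_b] for a in codes_a])
    rows, cols = linear_sum_assignment(-S)      # negate to maximize
    return S[rows, cols].mean()
```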

Conformal Predictions Enhanced Expert-guided Meshing with Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07358
  • repo_url: https://github.com/ahnobari/autosurf
  • paper_authors: Amin Heyrani Nobari, Justin Rey, Suhas Kodali, Matthew Jones, Faez Ahmed
  • for: This paper aims to develop a machine learning-based scheme for automatically generating high-quality meshes for computational fluid dynamics (CFD) simulations, with a focus on aircraft models.
  • methods: The proposed method utilizes graph neural networks (GNN) and expert guidance to generate CFD meshes. A new 3D segmentation algorithm is introduced, which outperforms two state-of-the-art models, PointNet++ and PointMLP, for surface classification. The conformal predictions method is used to project predictions from 3D mesh segmentation models to CAD surfaces, providing marginal statistical guarantees and robust uncertainty quantification and handling.
  • results: The proposed approach is demonstrated through a real-world case study, showing that the automatically generated mesh is comparable in quality to expert-generated meshes and enables the solver to converge and produce accurate results. Additionally, the approach is 5 times faster than adaptive remeshing in the overall simulation process. The code and data for this project are publicly available at https://github.com/ahnobari/AutoSurf.
    Abstract Computational Fluid Dynamics (CFD) is widely used in different engineering fields, but accurate simulations are dependent upon proper meshing of the simulation domain. While highly refined meshes may ensure precision, they come with high computational costs. Similarly, adaptive remeshing techniques require multiple simulations and come at a great computational cost. This means that the meshing process is reliant upon expert knowledge and years of experience. Automating mesh generation can save significant time and effort and lead to a faster and more efficient design process. This paper presents a machine learning-based scheme that utilizes Graph Neural Networks (GNN) and expert guidance to automatically generate CFD meshes for aircraft models. In this work, we introduce a new 3D segmentation algorithm that outperforms two state-of-the-art models, PointNet++ and PointMLP, for surface classification. We also present a novel approach to project predictions from 3D mesh segmentation models to CAD surfaces using the conformal predictions method, which provides marginal statistical guarantees and robust uncertainty quantification and handling. We demonstrate that the addition of conformal predictions effectively enables the model to avoid under-refinement, hence failure, in CFD meshing even for weak and less accurate models. Finally, we demonstrate the efficacy of our approach through a real-world case study that demonstrates that our automatically generated mesh is comparable in quality to expert-generated meshes and enables the solver to converge and produce accurate results. Furthermore, we compare our approach to the alternative of adaptive remeshing in the same case study and find that our method is 5 times faster in the overall process of simulation. The code and data for this project are made publicly available at https://github.com/ahnobari/AutoSurf.
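The split conformal recipe behind the "marginal statistical guarantees" can be sketched as follows (generic classification version; the paper applies it to 3D mesh segmentation, and the nonconformity score here is an assumption):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Calibrate a threshold so that prediction sets cover the true class
    # with probability >= 1 - alpha (marginally over calibration draws).
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return test_probs >= 1.0 - q        # boolean mask of classes kept per sample
```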

Efficient Learning of Quantum States Prepared With Few Non-Clifford Gates II: Single-Copy Measurements

  • paper_url: http://arxiv.org/abs/2308.07175
  • repo_url: None
  • paper_authors: Sabee Grewal, Vishnu Iyer, William Kretschmer, Daniel Liang
  • for: Learning $n$-qubit quantum states output by circuits with at most $t$ single-qubit non-Clifford gates, to trace distance $\epsilon$, in $\mathsf{poly}(n,2^t,1/\epsilon)$ time and samples.
  • methods: Single-copy measurements are used to learn this class of states, avoiding the entangled two-copy measurements required by prior algorithms.
  • results: A similarly efficient learning algorithm is achieved using single-copy measurements only.
    Abstract Recent work has shown that $n$-qubit quantum states output by circuits with at most $t$ single-qubit non-Clifford gates can be learned to trace distance $\epsilon$ using $\mathsf{poly}(n,2^t,1/\epsilon)$ time and samples. All prior algorithms achieving this runtime use entangled measurements across two copies of the input state. In this work, we give a similarly efficient algorithm that learns the same class of states using only single-copy measurements.

PitchNet: A Fully Convolutional Neural Network for Pitch Estimation

  • paper_url: http://arxiv.org/abs/2308.07170
  • repo_url: None
  • paper_authors: Jeremy Cochoy
  • for: Improving the accuracy of pitch extraction in music and sound processing.
  • methods: A convolutional neural network combined with autocorrelation is used to optimize pitch-detection accuracy.
  • results: Evaluation on datasets including synthetic sounds, opera recordings, and time-stretched vowels demonstrates improved pitch-extraction accuracy.
    Abstract In the domain of music and sound processing, pitch extraction plays a pivotal role. This research introduces "PitchNet", a convolutional neural network tailored for pitch extraction from the human singing voice, including acapella performances. Integrating autocorrelation with deep learning techniques, PitchNet aims to optimize the accuracy of pitch detection. Evaluation across datasets comprising synthetic sounds, opera recordings, and time-stretched vowels demonstrates its efficacy. This work paves the way for enhanced pitch extraction in both music and voice settings.
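The autocorrelation half of the hybrid can be sketched in a few lines; the frame handling and search bounds below are illustrative, not PitchNet's exact configuration.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=1000.0):
    # Pick the lag with maximal autocorrelation inside the plausible F0 range.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```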

SPEGTI: Structured Prediction for Efficient Generative Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2308.10997
  • repo_url: None
  • paper_authors: Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar
  • for: Improving the computational efficiency of text-to-image generation models so that images can be generated faster without loss of output quality.
  • methods: A Markov random field (MRF) model encodes the compatibility among image tokens at different spatial locations and is used together with the previously proposed Muse model to reduce the number of Muse prediction steps.
  • results: With the MRF model, inference becomes significantly cheaper at no cost in quality; the full model, SPEGTI, speeds up Muse by 1.5x.
    Abstract Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running inference multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. This method is shown to work in conjunction with the recently proposed Muse model. The MRF encodes the compatibility among image tokens at different spatial locations and enables us to significantly reduce the required number of Muse prediction steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, SPEGTI, uses this proposed MRF model to speed up Muse by 1.5X with no loss in output image quality.
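To illustrate what "compatibility among image tokens" means, here is a toy pairwise MRF energy over a token grid; the unary and pairwise tables are hypothetical stand-ins for what SPEGTI learns, and the inference-as-differentiable-layer part is not shown.

```python
import torch

def mrf_energy(tokens, unary, pairwise_h, pairwise_v):
    # tokens: (H, W) long tensor of token ids; unary: (H, W, K) per-cell scores;
    # pairwise_h / pairwise_v: (K, K) compatibility tables for adjacent pairs.
    H, W = tokens.shape
    e = unary[torch.arange(H)[:, None], torch.arange(W)[None, :], tokens].sum()
    e = e + pairwise_h[tokens[:, :-1], tokens[:, 1:]].sum()   # left-right pairs
    e = e + pairwise_v[tokens[:-1, :], tokens[1:, :]].sum()   # top-bottom pairs
    return e   # lower energy = more mutually compatible token layout
```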

Pairing interacting protein sequences using masked language modeling

  • paper_url: http://arxiv.org/abs/2308.07136
  • repo_url: https://github.com/bitbol-lab/diffpalm
  • paper_authors: Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol
  • for: The paper aims to predict which proteins interact together from their amino-acid sequences, which is an important task in protein structure prediction and function prediction.
  • methods: The paper develops a method called DiffPALM that leverages protein language models trained on multiple sequence alignments to pair interacting protein sequences. The method uses MSA Transformer and the EvoFormer module of AlphaFold to fill in masked amino acids in multiple sequence alignments and capture inter-chain coevolution.
  • results: The paper shows that DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments and achieves competitive performance with orthology-based pairing. Additionally, DiffPALM improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer without significantly deteriorating any of those tested.
    Abstract Predicting which proteins interact together from amino-acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called DiffPALM that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids. We show that it captures inter-chain coevolution, while it was trained on single-chain data, which means that it can be used out-of-distribution. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer, without significantly deteriorating any of those we tested. It also achieves competitive performance with using orthology-based pairing.

Natural Language is All a Graph Needs

  • paper_url: http://arxiv.org/abs/2308.07134
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, Yongfeng Zhang
  • for: The study investigates whether large language models (LLMs) can replace graph neural networks (GNNs) as the foundation model for graphs.
  • methods: InstructGLM (Instruction-finetuned Graph Language Model) designs highly scalable prompts from natural-language instructions and describes a graph's structure and node features in natural language for instruction tuning an LLM.
  • results: InstructGLM surpasses all competitive GNN baselines on the ogbn-arxiv, Cora, and PubMed datasets, demonstrating the effectiveness of the method and the potential of generative LLMs as foundation models for graph machine learning.
    Abstract The emergence of large-scale pre-trained language models, such as ChatGPT, has revolutionized various research fields in artificial intelligence. Transformers-based large language models (LLMs) have gradually replaced CNNs and RNNs to unify fields of computer vision and natural language processing. Compared with the data that exists relatively independently such as images, videos or texts, graph is a type of data that contains rich structural and relational information. Meanwhile, natural language, as one of the most expressive mediums, excels in describing complex structures. However, existing work on incorporating graph learning problems into the generative language modeling framework remains very limited. As the importance of large language models continues to grow, it becomes essential to explore whether LLMs can also replace GNNs as the foundation model for graphs. In this paper, we propose InstructGLM (Instruction-finetuned Graph Language Model), systematically design highly scalable prompts based on natural language instructions, and use natural language to describe the geometric structure and node features of the graph for instruction tuning an LLM to perform learning and inference on graphs in a generative manner. Our method exceeds all competitive GNN baselines on ogbn-arxiv, Cora and PubMed datasets, which demonstrates the effectiveness of our method and sheds light on generative large language models as the foundation model for graph machine learning.
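The key move is serializing graph structure into natural language. A toy version of such a prompt builder (the template wording is ours, not the paper's):

```python
def graph_to_prompt(node, features, neighbors):
    # Describe a node's features and 1-hop neighborhood in plain text so an
    # instruction-tuned LLM can answer node-level questions generatively.
    neigh = "; ".join(f"node {n} with features {features[n]}"
                      for n in neighbors[node])
    return (f"Node {node} has features {features[node]}. "
            f"It is connected to: {neigh}. "
            f"Question: which category does node {node} belong to?")
```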

Implementation of The Future of Drug Discovery: QuantumBased Machine Learning Simulation (QMLS)

  • paper_url: http://arxiv.org/abs/2308.08561
  • repo_url: None
  • paper_authors: Yew Kee Wong, Yifan Zhou, Yan Shing Liang, Haichuan Qiu, Yu Xi Wu, Bin He
  • for: The paper proposes a way to shorten the drug-development Research & Development (R&D) phase from years of work and far higher cost to only three to six months and fifty to eighty thousand USD.
  • methods: Machine Learning Molecule Generation (MLMG) generates possible hits from the molecular structure of the target protein, while Quantum Simulation (QS) filters molecules from the primary assay based on reaction and binding effectiveness with the target; Machine Learning Molecule Variation (MLMV) then produces molecular variations for lead optimization.
  • results: The combined machine-learning-plus-quantum-simulation pipeline (QMLS) shortens the R&D phase to three to six months at fifty to eighty thousand USD and can yield dozens of pre-clinical-trial-ready drug candidates.
    Abstract The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of the target protein while the Quantum Simulation (QS) filters molecules from the primary assay based on the reaction and binding effectiveness with the target protein. Then, for Lead Optimization, the resultant molecules generated and filtered from MLMG and QS are compared, and molecules that appear as a result of both processes will be made into dozens of molecular variations through Machine Learning Molecule Variation (MLMV), while others will only be made into a few variations. Lastly, all optimized molecules would undergo multiple rounds of QS filtering with a high standard for reaction effectiveness and safety, creating a few dozen pre-clinical-trial-ready drugs. This paper is based on our first paper, where we pitched the concept of machine learning combined with quantum simulations. In this paper we will go over the detailed design and framework of QMLS, including MLMG, MLMV, and QS.

A Time-aware tensor decomposition for tracking evolving patterns

  • paper_url: http://arxiv.org/abs/2308.07126
  • repo_url: None
  • paper_authors: Christos Chatzis, Max Pfeffer, Pedro Lind, Evrim Acar
  • for: The paper proposes a PARAFAC2-based method with temporal regularization for extracting gradually evolving patterns from temporal data.
  • methods: Temporal regularization prevents arbitrary reordering of time points while the PARAFAC2 tensor factorization captures the underlying patterns in the temporal data.
  • results: Extensive experiments on synthetic data show that tPARAFAC2 accurately captures evolving patterns and outperforms PARAFAC2 and coupled matrix factorization with temporal smoothness regularization.
    Abstract Time-evolving data sets can often be arranged as a higher-order tensor with one of the modes being the time mode. While tensor factorizations have been successfully used to capture the underlying patterns in such higher-order data sets, the temporal aspect is often ignored, allowing for the reordering of time points. In recent studies, temporal regularizers are incorporated in the time mode to tackle this issue. Nevertheless, existing approaches still do not allow underlying patterns to change in time (e.g., spatial changes in the brain, contextual changes in topics). In this paper, we propose temporal PARAFAC2 (tPARAFAC2): a PARAFAC2-based tensor factorization method with temporal regularization to extract gradually evolving patterns from temporal data. Through extensive experiments on synthetic data, we demonstrate that tPARAFAC2 can capture the underlying evolving patterns accurately performing better than PARAFAC2 and coupled matrix factorization with temporal smoothness regularization.
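In equation form, the idea can be paraphrased as a PARAFAC2 fit plus a temporal smoothness penalty on the evolving factors (our paraphrase of the objective; $\lambda$ trades off fit against smoothness, and the PARAFAC2 coupling constraint on the $B_k$ is omitted here):

```latex
\min_{A,\,\{B_k\},\,\{d_k\}} \;
\sum_{k=1}^{K} \big\| X_k - A \,\mathrm{diag}(d_k)\, B_k^{\top} \big\|_F^2
\;+\; \lambda \sum_{k=2}^{K} \big\| B_k - B_{k-1} \big\|_F^2
```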

Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers

  • paper_url: http://arxiv.org/abs/2308.07121
  • repo_url: None
  • paper_authors: Lukas Rauch, Raphael Schwinger, Moritz Wirth, Bernhard Sick, Sven Tomforde, Christoph Scholz
  • for: A shift toward end-to-end learning in bird sound monitoring, combining self-supervised learning (SSL) and deep active learning (DAL).
  • methods: Transformer models process raw audio directly, bypassing traditional spectrogram conversions.
  • results: SSL is expected to yield high-quality bird sound representations that could accelerate the assessment of environmental changes and decision-making processes for wind farms, while DAL exploits the wide variety of bird vocalizations to reduce reliance on extensively human-labeled datasets, improving the comparability and reproducibility of bioacoustic research.
    Abstract We propose a shift towards end-to-end learning in bird sound monitoring by combining self-supervised (SSL) and deep active learning (DAL). Leveraging transformer models, we aim to bypass traditional spectrogram conversions, enabling direct raw audio processing. ActiveBird2Vec is set to generate high-quality bird sound representations through SSL, potentially accelerating the assessment of environmental changes and decision-making processes for wind farms. Additionally, we seek to utilize the wide variety of bird vocalizations through DAL, reducing the reliance on extensively labeled datasets by human experts. We plan to curate a comprehensive set of tasks through Huggingface Datasets, enhancing future comparability and reproducibility of bioacoustic research. A comparative analysis between various transformer models will be conducted to evaluate their proficiency in bird sound recognition tasks. We aim to accelerate the progression of avian bioacoustic research and contribute to more effective conservation strategies.

Neural radiance fields in the industrial and robotics domain: applications, research opportunities and use cases

  • paper_url: http://arxiv.org/abs/2308.07118
  • repo_url: https://github.com/maftej/iisnerf
  • paper_authors: Eugen Šlapak, Enric Pardo, Matúš Dopiriak, Taras Maksymyuk, Juraj Gazda
  • for: The study surveys the application potential of neural radiance fields (NeRFs), which learn 3D scene representations from training images, across industrial subdomains and outlines directions for future research.
  • methods: NeRFs are used for 3D scene representation, with proof-of-concept experiments on video compression and 3D motion estimation for collision avoidance.
  • results: NeRF-based video compression achieves savings of up to 48% and 74% at resolutions of 1920x1080 and 300x168, respectively; for 3D motion estimation with D-NeRF trained on a 3D animation of a robotic arm, the disparity map reaches an average PSNR of 23 dB and an SSIM of 0.97.
    Abstract The proliferation of technologies, such as extended reality (XR), has increased the demand for high-quality three-dimensional (3D) graphical representations. Industrial 3D applications encompass computer-aided design (CAD), finite element analysis (FEA), scanning, and robotics. However, current methods employed for industrial 3D representations suffer from high implementation costs and reliance on manual human input for accurate 3D modeling. To address these challenges, neural radiance fields (NeRFs) have emerged as a promising approach for learning 3D scene representations based on provided training 2D images. Despite a growing interest in NeRFs, their potential applications in various industrial subdomains are still unexplored. In this paper, we deliver a comprehensive examination of NeRF industrial applications while also providing direction for future research endeavors. We also present a series of proof-of-concept experiments that demonstrate the potential of NeRFs in the industrial domain. These experiments include NeRF-based video compression techniques and using NeRFs for 3D motion estimation in the context of collision avoidance. In the video compression experiment, our results show compression savings up to 48\% and 74\% for resolutions of 1920x1080 and 300x168, respectively. The motion estimation experiment used a 3D animation of a robotic arm to train Dynamic-NeRF (D-NeRF) and achieved an average peak signal-to-noise ratio (PSNR) of disparity map with the value of 23 dB and an structural similarity index measure (SSIM) 0.97.

iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

  • paper_url: http://arxiv.org/abs/2308.07117
  • repo_url: None
  • paper_authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki
  • for: Fast, lightweight, high-fidelity speech synthesis.
  • methods: A fast and lightweight 1D CNN backbone with some neural processes replaced by the inverse STFT; iSTFTNet2 adds a 2D CNN that performs frequency upsampling in a few-frequency space so that high-dimensional spectrograms can be modeled without sacrificing speed.
  • results: iSTFTNet2 is faster and more lightweight than iSTFTNet with comparable speech quality.
    Abstract The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via temporal upsampling. However, this strategy compromises the potential to enhance the speed. Therefore, we propose iSTFTNet2, an improved variant of iSTFTNet with a 1D-2D CNN that employs 1D and 2D CNNs to model temporal and spectrogram structures, respectively. We designed a 2D CNN that performs frequency upsampling after conversion in a few-frequency space. This design facilitates the modeling of high-dimensional spectrograms without compromising the speed. The results demonstrated that iSTFTNet2 made iSTFTNet faster and more lightweight with comparable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet2/.
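The defining final stage shared by iSTFTNet-style vocoders can be sketched with torch.istft; the tiny FFT size reflects the reduced frequency dimension, and the exact values are illustrative.

```python
import torch

def istft_head(mag, phase, n_fft=16, hop=4):
    # The network predicts magnitude and phase; the waveform is recovered with a
    # differentiable inverse STFT instead of further neural upsampling layers.
    spec = torch.polar(mag, phase)          # complex spectrogram, (B, n_fft//2+1, T)
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)
```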

Ada-QPacknet – adaptive pruning with bit width reduction as an efficient continual learning method without forgetting

  • paper_url: http://arxiv.org/abs/2308.07939
  • repo_url: None
  • paper_authors: Marcin Pietroń, Dominik Żurek, Kamil Faber, Roberto Corizzo
  • for: The paper targets continual learning (CL) in dynamic and complex environments.
  • methods: An architecture-based CL method, Ada-QPacknet, uses pruning to extract a sub-network for each task and reduces model size with an efficient linear and nonlinear quantization approach that lowers the bit-width of the weight format.
  • results: Hybrid 8- and 4-bit quantization achieves accuracy similar to floating-point sub-networks, and the approach outperforms most CL strategies in task- and class-incremental scenarios.
    Abstract Continual Learning (CL) is a setting in which a huge gap remains between human and deep learning model efficiency. Many CL algorithms have been designed recently, yet most of them struggle to learn in dynamic and complex environments. In this work, a new architecture-based approach, Ada-QPacknet, is described. It uses pruning to extract a sub-network for each task. The crucial aspect of architecture-based CL methods is their capacity. In the presented method, the model size is reduced by an efficient linear and nonlinear quantisation approach that lowers the bit-width of the weight format. The results show that hybrid 8- and 4-bit quantisation achieves accuracy similar to the floating-point sub-network on well-known CL scenarios. To our knowledge, it is the first CL strategy that combines both compression techniques, pruning and quantisation, to generate task sub-networks. The presented algorithm was tested on well-known episode combinations and compared with the most popular algorithms. Results show that the proposed approach outperforms most CL strategies in task- and class-incremental scenarios.
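A toy sketch of the two compression steps combined: magnitude pruning to carve a task sub-network, followed by uniform low-bit quantization of the surviving weights. The adaptive selection of sparsity and bit-width that gives Ada-QPacknet its name is not shown.

```python
import torch

def prune_and_quantize(w, sparsity=0.5, bits=4):
    # Magnitude pruning: zero out the smallest weights to form a task mask.
    k = max(1, int(sparsity * w.numel()))
    thresh = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > thresh).float()
    # Symmetric uniform quantization of the remaining weights to `bits` bits.
    half = 2 ** (bits - 1) - 1
    scale = w.abs().max() / half
    w_q = torch.clamp(torch.round(w / scale), -half, half) * scale
    return w_q * mask, mask
```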

Age-Stratified Differences in Morphological Connectivity Patterns in ASD: An sMRI and Machine Learning Approach

  • paper_url: http://arxiv.org/abs/2308.07356
  • repo_url: None
  • paper_authors: Gokul Manoj, Sandeep Singh Sengar, Jac Fredo Agastinose Ronickom
  • for: To classify autism spectrum disorder (ASD) from morphological features (MF) and morphological connectivity features (MCF), and to compare classification performance across age groups.
  • methods: sMRI data from the public ABIDE-I and ABIDE-II databases were preprocessed with a standard pipeline and parcellated into 148 regions using the Destrieux atlas; area, thickness, volume, and mean curvature were extracted per region (592 MF and 10,878 MCF per subject), features were selected with a statistical t-test (p<0.05), and a random forest (RF) classifier was trained.
  • results: The 6-11 age group performed best, followed by 6-18 and 11-18, for both MF and MCF; MCF with RF in the 6-11 group achieved accuracy, F1 score, recall, and precision of 75.8%, 83.1%, 86%, and 80.4%, respectively, suggesting that morphological connectivity combined with age-stratified diagnostic models can effectively discriminate ASD.
    Abstract Purpose: Age biases have been identified as an essential factor in the diagnosis of ASD. The objective of this study was to compare the effect of different age groups in classifying ASD using morphological features (MF) and morphological connectivity features (MCF). Methods: The structural magnetic resonance imaging (sMRI) data for the study was obtained from the two publicly available databases, ABIDE-I and ABIDE-II. We considered three age groups, 6 to 11, 11 to 18, and 6 to 18, for our analysis. The sMRI data was pre-processed using a standard pipeline and was then parcellated into 148 different regions according to the Destrieux atlas. The area, thickness, volume, and mean curvature information was then extracted for each region which was used to create a total of 592 MF and 10,878 MCF for each subject. Significant features were identified using a statistical t-test (p<0.05) which was then used to train a random forest (RF) classifier. Results: The results of our study suggested that the performance of the 6 to 11 age group was the highest, followed by the 6 to 18 and 11 to 18 ages in both MF and MCF. Overall, the MCF with RF in the 6 to 11 age group performed better in the classification than the other groups and produced an accuracy, F1 score, recall, and precision of 75.8%, 83.1%, 86%, and 80.4%, respectively. Conclusion: Our study thus demonstrates that morphological connectivity and age-related diagnostic model could be an effective approach to discriminating ASD.
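The feature-selection-plus-classifier pipeline reads directly as a few lines of scikit-learn; the hyperparameters below are illustrative, not the paper's.

```python
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

def select_and_classify(X, y, alpha=0.05):
    # Keep features with a significant ASD-vs-control difference (p < alpha),
    # then train a random forest on the surviving morphological features.
    _, p = ttest_ind(X[y == 1], X[y == 0], axis=0)
    keep = p < alpha
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X[:, keep], y)
    return clf, keep
```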

#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

  • paper_url: http://arxiv.org/abs/2308.07074
  • repo_url: https://github.com/ofa-sys/instag
  • paper_authors: Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, Jingren Zhou
  • for: To improve the instruction-following ability of foundation models and to define instruction diversity and complexity quantitatively.
  • methods: InsTag, an open-set fine-grained tagger, labels samples in SFT datasets by semantics and intention (yielding 6.6K tags); diversity and complexity are defined in terms of the resulting tags, and a tag-based data selector picks diverse, complex samples.
  • results: Fine-tuning on 6K diverse and complex InsTag-selected samples yields TagLM models that markedly improve instruction following, outperforming open-source models trained on considerably larger SFT data as evaluated by MT-Bench.
    Abstract Foundation language models obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions and define instruction diversity and complexity regarding tags. We obtain 6.6K tags to describe comprehensive user queries. Then we analyze popular open-sourced SFT datasets and find that the model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag in https://github.com/OFA-Sys/InsTag.
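Once queries are tagged, the two quantities can be computed trivially; this sketch uses natural tag-coverage and tags-per-query readings of diversity and complexity, which may differ in detail from the paper's definitions.

```python
def instag_metrics(tagged_queries):
    # tagged_queries: list of tag sets, one per SFT sample.
    vocab = set()
    total = 0
    for tags in tagged_queries:
        vocab |= set(tags)
        total += len(tags)
    diversity = len(vocab)                       # coverage of the tag vocabulary
    complexity = total / len(tagged_queries)     # average tags per query
    return diversity, complexity
```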

Machine Unlearning: Solutions and Challenges

  • paper_url: http://arxiv.org/abs/2308.07061
  • repo_url: None
  • paper_authors: Jie Xu, Zihan Wu, Cong Wang, Xiaohua Jia
  • for: Addressing privacy and security concerns in machine learning by selectively removing specific training data points' influence from trained models.
  • methods: Existing machine unlearning research is categorized into exact unlearning, which algorithmically removes data influence entirely, and approximate unlearning, which efficiently minimizes influence through limited parameter updates; state-of-the-art solutions are reviewed with their advantages and limitations.
  • results: The survey proposes future research directions and encourages work on the open problems needed to establish machine unlearning as an essential capability for trustworthy and adaptive machine learning.
    Abstract Machine learning models may inadvertently memorize sensitive, unauthorized, or malicious data, posing risks of privacy violations, security breaches, and performance deterioration. To address these issues, machine unlearning has emerged as a critical technique to selectively remove specific training data points' influence on trained models. This paper provides a comprehensive taxonomy and analysis of machine unlearning research. We categorize existing research into exact unlearning that algorithmically removes data influence entirely and approximate unlearning that efficiently minimizes influence through limited parameter updates. By reviewing the state-of-the-art solutions, we critically discuss their advantages and limitations. Furthermore, we propose future directions to advance machine unlearning and establish it as an essential capability for trustworthy and adaptive machine learning. This paper provides researchers with a roadmap of open problems, encouraging impactful contributions to address real-world needs for selective data removal.

Diagnosis of Scalp Disorders using Machine Learning and Deep Learning Approach – A Review

  • paper_url: http://arxiv.org/abs/2308.07052
  • repo_url: None
  • paper_authors: Hrishabh Tiwari, Jatin Moolchandani, Shamla Mantri
  • for: Improving the accuracy and efficiency of scalp-disorder diagnosis.
  • methods: Deep learning models, including CNNs and FCNs, combined with an app for recognizing scalp and skin diseases.
  • results: The reviewed deep learning systems identify scalp and skin disorders accurately, with the best reported precision of 97.41%-99.09%.
    Abstract The morbidity of scalp diseases is minuscule compared to other diseases, but the impact on the patient's life is enormous. It is common for people to experience scalp problems that include Dandruff, Psoriasis, Tinea-Capitis, Alopecia and Atopic-Dermatitis. In accordance with WHO research, approximately 70% of adults have problems with their scalp. It has been demonstrated in descriptive research that hair quality is impaired by impaired scalp, but these impacts are reversible with early diagnosis and treatment. Deep Learning advances have demonstrated the effectiveness of CNN paired with FCN in diagnosing scalp and skin disorders. In one proposed Deep-Learning-based scalp inspection and diagnosis system, an imaging microscope and a trained model are combined with an app that classifies scalp disorders accurately with an average precision of 97.41%- 99.09%. Another research dealt with classifying the Psoriasis using the CNN with an accuracy of 82.9%. As part of another study, an ML based algorithm was also employed. It accurately classified the healthy scalp and alopecia areata with 91.4% and 88.9% accuracy with SVM and KNN algorithms. Using deep learning models to diagnose scalp related diseases has improved due to advancements i computation capabilities and computer vision, but there remains a wide horizon for further improvements.

Fourier neural operator for learning solutions to macroscopic traffic flow models: Application to the forward and inverse problems

  • paper_url: http://arxiv.org/abs/2308.07051
  • repo_url: None
  • paper_authors: Bilal Thonnam Thodi, Sai Venkata Ramana Ambadipudi, Saif Eddin Jabari
  • for: The study uses deep learning to solve nonlinear hyperbolic PDEs, learning the complete traffic state in macroscopic traffic flow models with a neural operator that maps sparse, heterogeneous traffic inputs to the full macroscopic state.
  • methods: A physics-informed Fourier neural operator (π-FNO) adds a physics loss based on a discrete conservation law during training to regularize the problem and improve shock predictions; training data are generated from random piecewise-constant inputs to systematically capture shock and rarefied solutions.
  • results: The operator accurately predicts density dynamics of a ring-road network and an urban signalized road; trained on simple dynamics (2-3 vehicle queues, 1-2 signal cycles), it generalizes to heterogeneous queue distributions and multiple (≥2) signal cycles with acceptable, sub-linearly growing extrapolation error, and the physics regularizer aids learning of long-term dynamics, especially with periodic boundary data.
    Abstract Deep learning methods are emerging as popular computational tools for solving forward and inverse problems in traffic flow. In this paper, we study a neural operator framework for learning solutions to nonlinear hyperbolic partial differential equations with applications in macroscopic traffic flow models. In this framework, an operator is trained to map heterogeneous and sparse traffic input data to the complete macroscopic traffic state in a supervised learning setting. We chose a physics-informed Fourier neural operator ($\pi$-FNO) as the operator, where an additional physics loss based on a discrete conservation law regularizes the problem during training to improve the shock predictions. We also propose to use training data generated from random piecewise constant input data to systematically capture the shock and rarefied solutions. From experiments using the LWR traffic flow model, we found superior accuracy in predicting the density dynamics of a ring-road network and urban signalized road. We also found that the operator can be trained using simple traffic density dynamics, e.g., consisting of $2-3$ vehicle queues and $1-2$ traffic signal cycles, and it can predict density dynamics for heterogeneous vehicle queue distributions and multiple traffic signal cycles $(\geq 2)$ with an acceptable error. The extrapolation error grew sub-linearly with input complexity for a proper choice of the model architecture and training data. Adding a physics regularizer aided in learning long-term traffic density dynamics, especially for problems with periodic boundary data.
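At the core of any FNO, including π-FNO, sits a spectral convolution layer; a minimal 1D PyTorch sketch follows (the physics loss based on the discrete conservation law is a separate training term and is not shown).

```python
import torch

class SpectralConv1d(torch.nn.Module):
    # FFT -> learned linear map on the lowest `modes` frequencies -> inverse FFT.
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.weight = torch.nn.Parameter(
            torch.randn(channels, channels, modes, dtype=torch.cfloat) / channels)

    def forward(self, x):                  # x: (batch, channels, grid)
        xf = torch.fft.rfft(x)
        out = torch.zeros_like(xf)
        out[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", xf[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out, n=x.size(-1))
```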

UIPC-MF: User-Item Prototype Connection Matrix Factorization for Explainable Collaborative Filtering

  • paper_url: http://arxiv.org/abs/2308.07048
  • repo_url: None
  • paper_authors: Lei Pan, Von-Wun Soo
  • for: Recommending items to potentially interested users with explainable reasoning.
  • methods: UIPC-MF, a prototype-based matrix factorization method, associates both users and items with sets of prototypes capturing general collaborative attributes and learns connection weights reflecting the associative relations between user and item prototypes.
  • results: UIPC-MF outperforms prototype-based baselines in Hit Ratio and Normalized Discounted Cumulative Gain on three datasets while providing better transparency.
    Abstract Recommending items to potentially interested users has been an important commercial task that faces two main challenges: accuracy and explainability. While most collaborative filtering models rely on statistical computations on a large scale of interaction data between users and items and can achieve high performance, they often lack clear explanatory power. We propose UIPC-MF, a prototype-based matrix factorization method for explainable collaborative filtering recommendations. In UIPC-MF, both users and items are associated with sets of prototypes, capturing general collaborative attributes. To enhance explainability, UIPC-MF learns connection weights that reflect the associative relations between user and item prototypes for recommendations. UIPC-MF outperforms other prototype-based baseline methods in terms of Hit Ratio and Normalized Discounted Cumulative Gain on three datasets, while also providing better transparency.
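A sketch of how prototype connections can produce both a score and an explanation; the exact parameterization (softmax similarities, a dense connection matrix C) is our assumption, not the paper's formulation.

```python
import torch

def uipc_score(u_emb, i_emb, user_protos, item_protos, C):
    # su / si: how strongly each user/item resembles each prototype;
    # C: learned connection weights between user and item prototypes.
    su = torch.softmax(u_emb @ user_protos.T, dim=-1)   # (B, Pu)
    si = torch.softmax(i_emb @ item_protos.T, dim=-1)   # (B, Pi)
    # The largest terms su[p] * C[p, q] * si[q] name the prototype pair
    # that explains the recommendation.
    return torch.einsum("bp,pq,bq->b", su, C, si)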

No Regularization is Needed: An Efficient and Effective Model for Incomplete Label Distribution Learning

  • paper_url: http://arxiv.org/abs/2308.07047
  • repo_url: None
  • paper_authors: Xiang Li, Songcan Chen
  • for: This paper focuses on addressing the problem of Incomplete Label Distribution Learning (InLDL), where the labels are incomplete or unobserved for some samples.
  • methods: The authors propose a new method that uses the prior of label distribution to solve the InLDL problem without any explicit regularization. They define a weighted empirical risk and derive upper bounds to reveal the implicit regularization role of weighting.
  • results: The proposed method has four advantages: 1) it is model selection free, 2) it has a closed form solution and is easy to implement, 3) it has linear computational complexity, and 4) it is competitive with state-of-the-art methods even without any explicit regularization.
    Abstract Label Distribution Learning (LDL) assigns soft labels, a.k.a. degrees, to a sample. In reality, it is always laborious to obtain complete degrees, giving birth to the Incomplete LDL (InLDL). However, InLDL often suffers from performance degeneration. To remedy it, existing methods need one or more explicit regularizations, leading to burdensome parameter tuning and extra computation. We argue that label distribution itself may provide useful prior, when used appropriately, the InLDL problem can be solved without any explicit regularization. In this paper, we offer a rational alternative to use such a prior. Our intuition is that large degrees are likely to get more concern, the small ones are easily overlooked, whereas the missing degrees are completely neglected in InLDL. To learn an accurate label distribution, it is crucial not to ignore the small observed degrees but to give them properly large weights, while gradually increasing the weights of the missing degrees. To this end, we first define a weighted empirical risk and derive upper bounds between the expected risk and the weighted empirical risk, which reveals in principle that weighting plays an implicit regularization role. Then, by using the prior of degrees, we design a weighted scheme and verify its effectiveness. To sum up, our model has four advantages, it is 1) model selection free, as no explicit regularization is imposed; 2) with closed form solution (sub-problem) and easy-to-implement (a few lines of codes); 3) with linear computational complexity in the number of samples, thus scalable to large datasets; 4) competitive with state-of-the-arts even without any explicit regularization.

Bayesian Flow Networks

  • paper_url: http://arxiv.org/abs/2308.07037
  • repo_url: https://github.com/stefanradev93/BayesFlow
  • paper_authors: Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, Faustino Gomez
  • for: This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in light of noisy data samples and then passed to a neural network that outputs a second, interdependent distribution.
  • methods: The generative procedure of BFNs resembles the reverse process of diffusion models but is conceptually simpler, requiring no forward process. The authors derive discrete- and continuous-time loss functions for continuous, discretised and discrete data, along with sample generation procedures.
  • results: Experiments show that BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.
    Abstract This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.
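The Bayesian inference step at the heart of a BFN has a closed form for continuous data. The toy sketch below shows the conjugate Gaussian update of the input distribution's mean and precision from noisy samples; the accuracy schedule and the network that produces the interdependent output distribution are deliberately omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_update(mu, rho, y, alpha):
    """Conjugate Gaussian update of the factorized input distribution.

    mu, rho : current mean and precision
    y       : noisy sample of the data sent with precision alpha
    """
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new

# Toy illustration (not the paper's schedule): observe noisy versions of
# a scalar datum x and watch the input distribution sharpen around it.
x, mu, rho = 0.7, 0.0, 1.0
for _ in range(10):
    alpha = 1.0                               # per-step accuracy, assumed constant
    y = x + rng.normal(scale=alpha ** -0.5)   # sender sample ~ N(x, 1/alpha)
    mu, rho = bayesian_update(mu, rho, y, alpha)
print(mu, rho)  # mu approaches x as precision accumulates
```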

S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields

  • paper_url: http://arxiv.org/abs/2308.07032
  • repo_url: https://github.com/madaoer/s3im_nerf
  • paper_authors: Zeke Xie, Xindi Yang, Yujie Yang, Qi Sun, Yixiang Jiang, Haoran Wang, Yunfeng Cai, Mingming Sun
  • For: The paper aims to improve the quality of Neural Radiance Field (NeRF) and related neural field methods for novel-view image synthesis and surface reconstruction tasks.
  • Methods: The paper introduces a nonlocal multiplex training paradigm for NeRF and related neural field methods, using a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of processing multiple inputs independently.
  • Results: The proposed S3IM loss leads to significant improvements in quality metrics for NeRF and neural surface representation, particularly for difficult tasks such as novel view synthesis and surface reconstruction. The improvements are robust even with sparse inputs, corrupted images, and dynamic scenes.
    Abstract Recently, Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation with only posed RGB images. NeRF and relevant neural field methods (e.g., neural surface representation) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and relevant neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of process multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representation for nearly free. The improvements of quality metrics can be particularly significant for those relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks; a 198% F-score gain and a 64% Chamfer $L_{1}$ distance reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes.
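A rough sketch of the S3IM mechanic: random pixels from a training batch are regrouped into pseudo-patches, and a structural similarity is computed over each patch as a whole rather than per pixel. The single-channel, unwindowed SSIM below is a simplification of the paper's kernel-based formulation:

```python
import torch

def s3im_loss(pred, target, patch_h=64, patch_w=64, repeats=10):
    """Stochastic Structural SIMilarity (S3IM) loss, minimal sketch.

    pred, target: flat (N,) batches of rendered / ground-truth pixel
    values (one channel for brevity). The same random index set is used
    for both tensors so pixel correspondence is preserved.
    """
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    m = patch_h * patch_w                     # pixels per pseudo-patch
    losses = []
    for _ in range(repeats):
        idx = torch.randint(0, pred.numel(), (m,))
        x, y = pred[idx], target[idx]
        mx, my = x.mean(), y.mean()
        vx, vy = x.var(unbiased=False), y.var(unbiased=False)
        cov = ((x - mx) * (y - my)).mean()
        ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
            (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
        losses.append(1.0 - ssim)             # maximize structural similarity
    return torch.stack(losses).mean()
```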

Bayesian Physics-Informed Neural Network for the Forward and Inverse Simulation of Engineered Nano-particles Mobility in a Contaminated Aquifer

  • paper_url: http://arxiv.org/abs/2308.07352
  • repo_url: None
  • paper_authors: Shikhar Nilabh, Fidel Grandia
  • for: This study aims to develop a predictive tool for the transport and retention behavior of engineered nanoparticles (ENPs) in contaminated aquifers, in support of effective groundwater remediation strategies.
  • methods: A Bayesian Physics-Informed Neural Network (B-PINN) framework models nanoparticle mobility within an aquifer: a forward model simulates ENP transport, and an inverse model infers the governing parameters from the model outputs.
  • results: The forward model accurately predicts ENP mobility and quantifies the associated uncertainty, and the inverse model identifies the governing parameters for ENP mobility in a small-scale aquifer, demonstrating that the tool can provide predictive insights for developing an efficient groundwater remediation strategy.
    Abstract Globally, there are many polluted groundwater sites that need an active remediation plan for the restoration of local ecosystem and environment. Engineered nanoparticles (ENPs) have proven to be an effective reactive agent for the in-situ degradation of pollutants in groundwater. While the performance of these ENPs has been highly promising on the laboratory scale, their application in real field case conditions is still limited. The complex transport and retention mechanisms of ENPs hinder the development of an efficient remediation strategy. Therefore, a predictive tool to comprehend the transport and retention behavior of ENPs is highly required. The existing tools in the literature are dominated with numerical simulators, which have limited flexibility and accuracy in the presence of sparse datasets and the aquifer heterogeneity. This work uses a Bayesian Physics-Informed Neural Network (B-PINN) framework to model the nano-particles mobility within an aquifer. The result from the forward model demonstrates the effective capability of B-PINN in accurately predicting the ENPs mobility and quantifying the uncertainty. The inverse model output is then used to predict the governing parameters for the ENPs mobility in a small-scale aquifer. The research demonstrates the capability of the tool to provide predictive insights for developing an efficient groundwater remediation strategy.

IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse

  • paper_url: http://arxiv.org/abs/2308.07351
  • repo_url: None
  • paper_authors: Siyuan Li, Hao Li, Jin Zhang, Zhen Wang, Peng Liu, Chongjie Zhang
  • for: This work addresses the challenge of selecting an appropriate source policy to guide target policy learning, proposing a novel transfer reinforcement learning method.
  • methods: The method uses the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. It integrates optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy.
  • results: The method surpasses state-of-the-art transfer RL baselines in benchmark tasks and improves final performance and knowledge transferability in continual learning scenarios; moreover, the optimization transfer technique is guaranteed to improve target policy learning.
    Abstract Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark tasks, and improves final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
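The Q-guided selection step described in the abstract can be sketched in a few lines: evaluate the critic at each source policy's action and keep the policy with the largest one-step improvement over the target. Here `q_fn`, `source_policies`, and `target_policy` are assumed callables standing in for the trained actor-critic components:

```python
def select_source_policy(state, source_policies, q_fn, target_policy):
    """Pick the policy with the largest one-step improvement, sketch.

    q_fn(state, action) is the critic; each policy maps a state to an
    action. Returns the best source policy if it beats the current
    target policy at this state, otherwise the target policy itself.
    """
    best_pi = target_policy
    best_q = q_fn(state, target_policy(state))
    for pi in source_policies:
        q = q_fn(state, pi(state))
        if q > best_q:
            best_pi, best_q = pi, q
    return best_pi   # used as the guidance policy for both transfer modes
```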

Efficient Neural PDE-Solvers using Quantization Aware Training

  • paper_url: http://arxiv.org/abs/2308.07350
  • repo_url: None
  • paper_authors: Winfried van den Dool, Tijmen Blankevoort, Max Welling, Yuki M. Asano
  • for: To reduce the computational cost of neural PDE solvers while maintaining performance.
  • methods: State-of-the-art quantization methods are applied to network weights and activations, trained with quantization-aware training.
  • results: Across four standard PDE datasets, three network architectures, and three orders of magnitude in FLOPs, quantization-aware training lowers the computational cost of inference while maintaining performance; Pareto-optimality of computational cost versus performance is almost always achieved only by incorporating quantization.
    Abstract In the past years, the application of neural networks as an alternative to classical numerical methods to solve Partial Differential Equations has emerged as a potential paradigm shift in this century-old mathematical field. However, in terms of practical applicability, computational cost remains a substantial bottleneck. Classical approaches try to mitigate this challenge by limiting the spatial resolution on which the PDEs are defined. For neural PDE solvers, we can do better: Here, we investigate the potential of state-of-the-art quantization methods on reducing computational costs. We show that quantizing the network weights and activations can successfully lower the computational cost of inference while maintaining performance. Our results on four standard PDE datasets and three network architectures show that quantization-aware training works across settings and three orders of FLOPs magnitudes. Finally, we empirically demonstrate that Pareto-optimality of computational cost vs performance is almost always achieved only by incorporating quantization.
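The standard mechanism behind quantization-aware training is fake quantization with a straight-through estimator, sketched below in PyTorch; the paper's exact quantizer, bit widths, and calibration may differ:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator.

    Forward rounds tensors onto a signed 8-bit grid; backward passes
    gradients through unchanged, so the network learns to tolerate the
    rounding it will face at inference time.
    """
    @staticmethod
    def forward(ctx, x, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through: d(quant)/dx ~= 1

x = torch.randn(4, requires_grad=True)
FakeQuant.apply(x, 8).sum().backward()
print(x.grad)   # all ones: gradients flow as if quantization were identity
```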

Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads

  • paper_url: http://arxiv.org/abs/2308.07013
  • repo_url: None
  • paper_authors: Dingheng Mo, Fanchao Chen, Siqiang Luo, Caihua Shan
  • for: Optimizing key-value store performance under dynamic workloads.
  • methods: Reinforcement learning is used to guide online LSM-tree transformations, together with a new LSM-tree design, the FLSM-tree, which enables efficient transitions between different compaction policies, the bottleneck of dynamic key-value stores.
  • results: RusKey achieves up to 4x better end-to-end performance than the RocksDB system across diverse workloads, without requiring prior workload knowledge.
    Abstract LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing the system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill the gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance under the context of dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient transition between different compaction policies -- the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness in diverse workloads, achieving up to 4x better end-to-end performance than the RocksDB system under various settings.

Greedy online change point detection

  • paper_url: http://arxiv.org/abs/2308.07012
  • repo_url: None
  • paper_authors: Jou-Hui Ho, Felipe Tobar
  • for: Improving the accuracy of online change point detection (CPD), where standard methods tend to have large false discovery rates.
  • methods: Greedy Online Change Point Detection (GOCPD) finds change points by maximizing the probability of the data coming from the (temporal) concatenation of two independent models. For time series with a single change point this objective is unimodal, so detection can be accelerated via ternary search with logarithmic complexity.
  • results: The effectiveness of GOCPD is demonstrated on synthetic data and validated in real-world univariate and multivariate settings.
    Abstract Standard online change point detection (CPD) methods tend to have large false discovery rates as their detections are sensitive to outliers. To overcome this drawback, we propose Greedy Online Change Point Detection (GOCPD), a computationally appealing method which finds change points by maximizing the probability of the data coming from the (temporal) concatenation of two independent models. We show that, for time series with a single change point, this objective is unimodal and thus CPD can be accelerated via ternary search with logarithmic complexity. We demonstrate the effectiveness of GOCPD on synthetic data and validate our findings on real-world univariate and multivariate settings.
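Because the objective is unimodal for a single change point, ternary search locates the maximizer with O(log n) likelihood evaluations. A minimal sketch, assuming Gaussian segment models (one simple choice; other segment models fit the same framework):

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of a segment under a fitted Gaussian model."""
    var = x.var() + 1e-8
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1.0)

def gocpd(x, min_seg=5):
    """Ternary search for the change point that maximizes the likelihood
    of the concatenation of two independent models."""
    def objective(t):
        return gaussian_loglik(x[:t]) + gaussian_loglik(x[t:])
    lo, hi = min_seg, len(x) - min_seg
    while hi - lo > 2:                    # logarithmic number of steps
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if objective(m1) < objective(m2):
            lo = m1 + 1                   # maximum lies right of m1
        else:
            hi = m2                       # maximum lies at or left of m2
    return max(range(lo, hi + 1), key=objective)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])
print(gocpd(x))   # close to the true change point at index 200
```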

Aggregating Intrinsic Information to Enhance BCI Performance through Federated Learning

  • paper_url: http://arxiv.org/abs/2308.11636
  • repo_url: None
  • paper_authors: Rui Liu, Yuanyuan Chen, Anran Li, Yi Ding, Han Yu, Cuntai Guan
  • for: This study tackles a long-standing challenge for brain-computer interfaces (BCI): building high-performance deep learning models when EEG data cannot easily be shared across sites because of device heterogeneity.
  • methods: A hierarchical personalized Federated Learning EEG decoding (FLEEG) framework is proposed. Each client is assigned a specific dataset and trains a hierarchical personalized model to manage diverse data formats and facilitate information exchange, while the server coordinates the training procedure to harness knowledge from all datasets and elevate overall performance.
  • results: Evaluated on motor imagery (MI) classification with nine EEG datasets collected by different devices but implementing the same MI task, the framework boosts classification performance by up to 16.7%, especially for smaller datasets. Visualizations further show that it helps local models maintain a stable focus on task-related areas.
    Abstract Insufficient data is a long-standing challenge for Brain-Computer Interface (BCI) to build a high-performance deep learning model. Though numerous research groups and institutes collect a multitude of EEG datasets for the same BCI task, sharing EEG data from multiple sites is still challenging due to the heterogeneity of devices. The significance of this challenge cannot be overstated, given the critical role of data diversity in fostering model robustness. However, existing works rarely discuss this issue, predominantly centering their attention on model training within a single dataset, often in the context of inter-subject or inter-session settings. In this work, we propose a hierarchical personalized Federated Learning EEG decoding (FLEEG) framework to surmount this challenge. This innovative framework heralds a new learning paradigm for BCI, enabling datasets with disparate data formats to collaborate in the model training process. Each client is assigned a specific dataset and trains a hierarchical personalized model to manage diverse data formats and facilitate information exchange. Meanwhile, the server coordinates the training procedure to harness knowledge gleaned from all datasets, thus elevating overall performance. The framework has been evaluated in Motor Imagery (MI) classification with nine EEG datasets collected by different devices but implementing the same MI task. Results demonstrate that the proposed frame can boost classification performance up to 16.7% by enabling knowledge sharing between multiple datasets, especially for smaller datasets. Visualization results also indicate that the proposed framework can empower the local models to put a stable focus on task-related areas, yielding better performance. To the best of our knowledge, this is the first end-to-end solution to address this important challenge.
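One way to picture the server-side knowledge fusion is sketched below: device-specific layers stay local while shared parameters are averaged across clients. The `shared.` name prefix and the exact split between personalized and shared layers are assumptions for illustration, not the paper's architecture:

```python
import torch

def aggregate_shared(client_states, shared_prefix="shared."):
    """Average only shared parameters across client state dicts, sketch.

    Each client keeps its device-specific (personalized) layers local;
    parameters whose names start with `shared_prefix` are averaged and
    broadcast back, exchanging knowledge between EEG datasets that have
    different channel counts and formats.
    """
    keys = [k for k in client_states[0] if k.startswith(shared_prefix)]
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
           for k in keys}
    for s in client_states:
        s.update(avg)                    # broadcast fused weights back
    return client_states
```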

Deep convolutional neural networks for cyclic sensor data

  • paper_url: http://arxiv.org/abs/2308.06987
  • repo_url: None
  • paper_authors: Payman Goodarzi, Yannick Robin, Andreas Schütze, Tizian Schneider
  • for: This study investigates sensor-based condition monitoring for predictive maintenance, applying deep learning techniques to a hydraulic system testbed dataset.
  • methods: Three models are compared: a baseline model employing conventional methods with late sensor fusion, a single CNN model with early sensor fusion, and a two-lane CNN model (2L-CNN) with late sensor fusion.
  • results: The baseline with late sensor fusion, where features are extracted individually for each sensor, achieves a 1% test error rate, whereas the single CNN struggles with the diverse sensor characteristics, yielding an error rate of 20.5%; training each sensor separately reveals variations in accuracy. The 2L-CNN reduces the error rate by 33% when combining the least and most optimal sensors, underscoring the complexities posed by multi-sensor systems.
    Abstract Predictive maintenance plays a critical role in ensuring the uninterrupted operation of industrial systems and mitigating the potential risks associated with system failures. This study focuses on sensor-based condition monitoring and explores the application of deep learning techniques using a hydraulic system testbed dataset. Our investigation involves comparing the performance of three models: a baseline model employing conventional methods, a single CNN model with early sensor fusion, and a two-lane CNN model (2L-CNN) with late sensor fusion. The baseline model achieves an impressive test error rate of 1% by employing late sensor fusion, where feature extraction is performed individually for each sensor. However, the CNN model encounters challenges due to the diverse sensor characteristics, resulting in an error rate of 20.5%. To further investigate this issue, we conduct separate training for each sensor and observe variations in accuracy. Additionally, we evaluate the performance of the 2L-CNN model, which demonstrates significant improvement by reducing the error rate by 33% when considering the combination of the least and most optimal sensors. This study underscores the importance of effectively addressing the complexities posed by multi-sensor systems in sensor-based condition monitoring.

pNNCLR: Stochastic Pseudo Neighborhoods for Contrastive Learning based Unsupervised Representation Learning Problems

  • paper_url: http://arxiv.org/abs/2308.06983
  • repo_url: None
  • paper_authors: Momojit Biswas, Himanshu Buckchash, Dilip K. Prasad
  • for: This work aims to improve nearest neighbor based self-supervised learning (SSL) for image recognition by controlling the quality of the support set that drives semantic variation.
  • methods: Pseudo nearest neighbors (pNN) are introduced to control support set quality: rather than sampling the nearest neighbors directly, samples are drawn in the vicinity of hard nearest neighbors by varying the magnitude of the resultant vector with a stochastic sampling strategy. A smooth-weight-update approach is additionally used to stabilize the uncertainty inherent in NN-based learning.
  • results: On multiple public image recognition and medical image recognition datasets, the proposed method performs up to 8% better than the baseline nearest neighbor method and is comparable to other previously proposed SSL methods.
    Abstract Nearest neighbor (NN) sampling provides more semantic variations than pre-defined transformations for self-supervised learning (SSL) based image recognition problems. However, its performance is restricted by the quality of the support set, which holds positive samples for the contrastive loss. In this work, we show that the quality of the support set plays a crucial role in any nearest neighbor based method for SSL. We then provide a refined baseline (pNNCLR) to the nearest neighbor based SSL approach (NNCLR). To this end, we introduce pseudo nearest neighbors (pNN) to control the quality of the support set, wherein, rather than sampling the nearest neighbors, we sample in the vicinity of hard nearest neighbors by varying the magnitude of the resultant vector and employing a stochastic sampling strategy to improve the performance. Additionally, to stabilize the effects of uncertainty in NN-based learning, we employ a smooth-weight-update approach for training the proposed network. Evaluation of the proposed method on multiple public image recognition and medical image recognition datasets shows that it performs up to 8 percent better than the baseline nearest neighbor method, and is comparable to other previously proposed SSL methods.
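A minimal sketch of pseudo nearest neighbor sampling: locate the hard nearest neighbor of each query in the support set, then sample in its vicinity by stochastically rescaling the magnitude of the resultant vector. The Gaussian scaling model below is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def pseudo_nearest_neighbor(query, support, sigma=0.1):
    """Sample a pseudo nearest neighbor (pNN) per query embedding.

    query   : (B, d) embeddings of the current batch
    support : (S, d) support set embeddings
    Returns perturbed neighbors used as positives in the contrastive loss.
    """
    q = F.normalize(query, dim=1)
    s = F.normalize(support, dim=1)
    idx = (q @ s.t()).argmax(dim=1)                # hard nearest neighbors
    nn = support[idx]
    scale = 1.0 + sigma * torch.randn(len(nn), 1)  # vary the magnitude
    return nn * scale
```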

Routing Recovery for UAV Networks with Deliberate Attacks: A Reinforcement Learning based Approach

  • paper_url: http://arxiv.org/abs/2308.06973
  • repo_url: None
  • paper_authors: Sijie He, Ziye Jia, Chao Dong, Wei Wang, Yilu Cao, Yang Yang, Qihui Wu
  • for: This work focuses on routing planning and recovery for UAV networks under deliberate attacks.
  • methods: A deliberate attack model based on node importance is designed, together with a node importance ranking mechanism that considers node degree and link importance. Since link connections change with UAV availability, an intelligent algorithm based on reinforcement learning is proposed to recover the routing path when UAVs are attacked.
  • results: Simulations show that the proposed mechanism performs better than other reference methods.
    Abstract The unmanned aerial vehicle (UAV) network is popular these years due to its various applications. In the UAV network, routing is significantly affected by the distributed network topology, leading to the issue that UAVs are vulnerable to deliberate damage. Hence, this paper focuses on the routing plan and recovery for UAV networks with attacks. In detail, a deliberate attack model based on the importance of nodes is designed to represent enemy attacks. Then, a node importance ranking mechanism is presented, considering the degree of nodes and link importance. However, it is intractable to handle the routing problem by traditional methods for UAV networks, since link connections change with the UAV availability. Hence, an intelligent algorithm based on reinforcement learning is proposed to recover the routing path when UAVs are attacked. Simulations are conducted and numerical results verify the proposed mechanism performs better than other referred methods.

AutoAssign+: Automatic Shared Embedding Assignment in Streaming Recommendation

  • paper_url: http://arxiv.org/abs/2308.06965
  • repo_url: https://github.com/Applied-Machine-Learning-Lab/AutoAssign-Plus
  • paper_authors: Ziru Liu, Kecheng Chen, Fengyi Song, Bo Chen, Xiangyu Zhao, Huifeng Guo, Ruiming Tang
  • For: The paper aims to address the challenges of randomly assigning initial ID embeddings in streaming recommender systems, which can result in suboptimal prediction performance for items or users with limited interactive data and leads to unnecessary memory consumption from a constantly expanding embedding table.
  • Methods: The paper proposes a reinforcement learning driven framework, AutoAssign+, whose Identity Agent both represents low-frequency IDs field-wise with a small set of shared embeddings and dynamically determines which ID features should be retained or eliminated in the embedding table; the agent's policy is optimized with the guidance of a critic network.
  • Results: AutoAssign+ significantly enhances recommendation performance by mitigating the cold-start problem and reduces memory usage by approximately 20-30%, verifying its practical effectiveness and efficiency for streaming recommender systems.
    Abstract In the domain of streaming recommender systems, conventional methods for addressing new user IDs or item IDs typically involve assigning initial ID embeddings randomly. However, this practice results in two practical challenges: (i) Items or users with limited interactive data may yield suboptimal prediction performance. (ii) Embedding new IDs or low-frequency IDs necessitates consistently expanding the embedding table, leading to unnecessary memory consumption. In light of these concerns, we introduce a reinforcement learning-driven framework, namely AutoAssign+, that facilitates Automatic Shared Embedding Assignment Plus. To be specific, AutoAssign+ utilizes an Identity Agent as an actor network, which plays a dual role: (i) Representing low-frequency IDs field-wise with a small set of shared embeddings to enhance the embedding initialization, and (ii) Dynamically determining which ID features should be retained or eliminated in the embedding table. The policy of the agent is optimized with the guidance of a critic network. To evaluate the effectiveness of our approach, we perform extensive experiments on three commonly used benchmark datasets. Our experiment results demonstrate that AutoAssign+ is capable of significantly enhancing recommendation performance by mitigating the cold-start problem. Furthermore, our framework yields a reduction in memory usage of approximately 20-30%, verifying its practical effectiveness and efficiency for streaming recommender systems.
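A simplified sketch of the shared-embedding idea: low-frequency IDs are routed to a small shared table instead of dedicated rows. A hash stands in for the assignment policy that AutoAssign+ actually learns with its Identity Agent and critic:

```python
import torch
import torch.nn as nn

class SharedEmbeddingTable(nn.Module):
    """Embedding lookup with shared slots for low-frequency IDs, sketch."""

    def __init__(self, num_ids, dim, num_shared=16):
        super().__init__()
        self.dedicated = nn.Embedding(num_ids, dim)
        self.shared = nn.Embedding(num_shared, dim)
        self.num_shared = num_shared

    def forward(self, ids, freq, threshold=10):
        # ids, freq: (B,) long tensors of ID indices and their frequencies
        low = (freq < threshold).unsqueeze(-1)   # which IDs are low-frequency
        shared_idx = ids % self.num_shared       # stand-in assignment policy
        return torch.where(low, self.shared(shared_idx), self.dedicated(ids))
```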

Graph Structural Residuals: A Learning Approach to Diagnosis

  • paper_url: http://arxiv.org/abs/2308.06961
  • repo_url: None
  • paper_authors: Jan Lukas Augustin, Oliver Niggemann
  • for: This paper proposes a novel framework for model-based diagnosis that combines concepts of model-based diagnosis with deep graph structure learning, aiming to facilitate a seamless integration of graph structure learning with model-based diagnosis.
  • methods: The proposed framework uses two distinct graph adjacency matrices to represent the system’s underlying structure and provide dynamic observations. Additionally, the paper introduces two versions of a self-supervised graph structure learning model architecture.
  • results: The authors demonstrate the potential of their data-driven diagnostic method through experiments on a system of coupled oscillators.
    Abstract Traditional model-based diagnosis relies on constructing explicit system models, a process that can be laborious and expertise-demanding. In this paper, we propose a novel framework that combines concepts of model-based diagnosis with deep graph structure learning. This data-driven approach leverages data to learn the system's underlying structure and provide dynamic observations, represented by two distinct graph adjacency matrices. Our work facilitates a seamless integration of graph structure learning with model-based diagnosis by making three main contributions: (i) redefining the constructs of system representation, observations, and faults (ii) introducing two distinct versions of a self-supervised graph structure learning model architecture and (iii) demonstrating the potential of our data-driven diagnostic method through experiments on a system of coupled oscillators.

Search to Fine-tune Pre-trained Graph Neural Networks for Graph-level Tasks

  • paper_url: http://arxiv.org/abs/2308.06960
  • repo_url: None
  • paper_authors: Zhili Wang, Shimin Di, Lei Chen, Xiaofang Zhou
  • for: This paper aims to design a better fine-tuning strategy for pre-trained graph neural networks (GNNs) so that knowledge transferred from large-scale unlabeled graphs improves performance on downstream tasks.
  • methods: Given a pre-trained GNN, the proposed S2PGNN searches for a suitable fine-tuning framework for the given labeled data on the downstream graph-level task, over a carefully summarized search space of fine-tuning strategies appropriate for GNNs.
  • results: S2PGNN can be implemented on top of 10 well-known pre-trained GNNs and consistently improves their performance, outperforming existing fine-tuning strategies both within and outside the GNN area. Code is available at https://anonymous.4open.science/r/code_icde2024-A9CB/.
    Abstract Recently, graph neural networks (GNNs) have shown its unprecedented success in many graph-related tasks. However, GNNs face the label scarcity issue as other neural networks do. Thus, recent efforts try to pre-train GNNs on a large-scale unlabeled graph and adapt the knowledge from the unlabeled graph to the target downstream task. The adaptation is generally achieved by fine-tuning the pre-trained GNNs with a limited number of labeled data. Despite the importance of fine-tuning, current GNNs pre-training works often ignore designing a good fine-tuning strategy to better leverage transferred knowledge and improve the performance on downstream tasks. Only few works start to investigate a better fine-tuning strategy for pre-trained GNNs. But their designs either have strong assumptions or overlook the data-aware issue for various downstream datasets. Therefore, we aim to design a better fine-tuning strategy for pre-trained GNNs to improve the model performance in this paper. Given a pre-trained GNN, we propose to search to fine-tune pre-trained graph neural networks for graph-level tasks (S2PGNN), which adaptively design a suitable fine-tuning framework for the given labeled data on the downstream task. To ensure the improvement brought by searching fine-tuning strategy, we carefully summarize a proper search space of fine-tuning framework that is suitable for GNNs. The empirical studies show that S2PGNN can be implemented on the top of 10 famous pre-trained GNNs and consistently improve their performance. Besides, S2PGNN achieves better performance than existing fine-tuning strategies within and outside the GNN area. Our code is publicly available at \url{https://anonymous.4open.science/r/code_icde2024-A9CB/}.

Data-Driven Allocation of Preventive Care With Application to Diabetes Mellitus Type II

  • paper_url: http://arxiv.org/abs/2308.06959
  • repo_url: None
  • paper_authors: Mathias Kraus, Stefan Feuerriegel, Maytal Saar-Tsechansky
  • for: Evaluating the effectiveness of preventive care and supporting allocation decisions under cost constraints.
  • methods: Counterfactual inference, machine learning, and optimization techniques are combined into a scalable data-driven decision model that exploits high-dimensional medical data, such as the data found in modern electronic health records.
  • results: Evaluated on electronic health records from 89,191 prediabetic patients, the data-driven allocation of preventive treatments (metformin) could yield annual savings of $1.1 billion if applied to the U.S. population, compared with current practice; cost-effectiveness is further analyzed under varying budget levels.
    Abstract Problem Definition. Increasing costs of healthcare highlight the importance of effective disease prevention. However, decision models for allocating preventive care are lacking. Methodology/Results. In this paper, we develop a data-driven decision model for determining a cost-effective allocation of preventive treatments to patients at risk. Specifically, we combine counterfactual inference, machine learning, and optimization techniques to build a scalable decision model that can exploit high-dimensional medical data, such as the data found in modern electronic health records. Our decision model is evaluated based on electronic health records from 89,191 prediabetic patients. We compare the allocation of preventive treatments (metformin) prescribed by our data-driven decision model with that of current practice. We find that if our approach is applied to the U.S. population, it can yield annual savings of $1.1 billion. Finally, we analyze the cost-effectiveness under varying budget levels. Managerial Implications. Our work supports decision-making in health management, with the goal of achieving effective disease prevention at lower costs. Importantly, our decision model is generic and can thus be used for effective allocation of preventive care for other preventable diseases.
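As a rough illustration of the allocation step, the sketch below treats patients greedily by estimated counterfactual benefit per dollar under a fixed budget. This greedy rule is only a stand-in for the paper's optimization component:

```python
import numpy as np

def allocate_preventive_care(uplift, cost, budget):
    """Budget-constrained treatment allocation, greedy sketch.

    uplift[i] : estimated counterfactual benefit of treating patient i
                (e.g., expected savings from prescribing metformin, as
                predicted by a machine-learned uplift model)
    cost[i]   : cost of treating patient i
    """
    order = np.argsort(-uplift / cost)          # best value per dollar first
    treated, spent = [], 0.0
    for i in order:
        if uplift[i] <= 0:
            break                               # no further expected benefit
        if spent + cost[i] <= budget:
            treated.append(int(i))
            spent += cost[i]
    return treated
```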

CEmb-SAM: Segment Anything Model with Condition Embedding for Joint Learning from Heterogeneous Datasets

  • paper_url: http://arxiv.org/abs/2308.06957
  • repo_url: None
  • paper_authors: Dongik Shin, Beomsuk Kim, Seungjun Baek
  • for: Assisting medical experts with diagnostic and therapeutic procedures through automated segmentation of ultrasound images.
  • methods: Heterogeneous ultrasound datasets are merged into one dataset whose component datasets are treated as subgroups, and a single segmentation model is trained to adapt to each subgroup. The Segment Anything model (SAM) is extended with a Condition Embedding block (CEmb-SAM) that encodes subgroup conditions and combines them with SAM's image embeddings.
  • results: Experiments show that CEmb-SAM adapts effectively to the different subgroups and outperforms the baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer, demonstrating effective joint learning from heterogeneous datasets.
    Abstract Automated segmentation of ultrasound images can assist medical experts with diagnostic and therapeutic procedures. Although using the common modality of ultrasound, one typically needs separate datasets in order to segment, for example, different anatomical structures or lesions with different levels of malignancy. In this paper, we consider the problem of jointly learning from heterogeneous datasets so that the model can improve generalization abilities by leveraging the inherent variability among datasets. We merge the heterogeneous datasets into one dataset and refer to each component dataset as a subgroup. We propose to train a single segmentation model so that the model can adapt to each sub-group. For robust segmentation, we leverage recently proposed Segment Anything model (SAM) in order to incorporate sub-group information into the model. We propose SAM with Condition Embedding block (CEmb-SAM) which encodes sub-group conditions and combines them with image embeddings from SAM. The conditional embedding block effectively adapts SAM to each image sub-group by incorporating dataset properties through learnable parameters for normalization. Experiments show that CEmb-SAM outperforms the baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer. The experiments highlight the effectiveness of Cemb-SAM in learning from heterogeneous datasets in medical image segmentation tasks.
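One plausible reading of the Condition Embedding block is a subgroup-conditioned normalization, sketched below: each subgroup's learnable embedding is projected to per-channel scale and shift parameters that modulate the normalized image features. Layer placement and dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class ConditionEmbedding(nn.Module):
    """Subgroup-conditioned feature normalization, minimal sketch."""

    def __init__(self, num_subgroups, channels):
        super().__init__()
        self.embed = nn.Embedding(num_subgroups, channels * 2)
        self.norm = nn.GroupNorm(1, channels, affine=False)

    def forward(self, feats, subgroup_id):
        # feats: (B, C, H, W) image embeddings; subgroup_id: (B,) long
        gamma, beta = self.embed(subgroup_id).chunk(2, dim=-1)
        feats = self.norm(feats)
        return feats * (1 + gamma[..., None, None]) + beta[..., None, None]
```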

Channel-Wise Contrastive Learning for Learning with Noisy Labels

  • paper_url: http://arxiv.org/abs/2308.06952
  • repo_url: None
  • paper_authors: Hui Kang, Sheng Liu, Huaxi Huang, Tongliang Liu
  • for: This work addresses learning with noisy labels (LNL): training a classifier that discerns the actual classes of given instances despite label noise.
  • methods: Channel-wise contrastive learning (CWCL) distinguishes authentic label information from noise by performing contrastive learning across diverse channels, yielding more nuanced and resilient features than conventional instance-wise contrastive learning (IWCL). The strategy first uses CWCL features to identify cleanly labeled samples and then progressively fine-tunes on them.
  • results: Evaluations on several benchmark datasets show that the method outperforms existing approaches.
    Abstract In real-world datasets, noisy labels are pervasive. The challenge of learning with noisy labels (LNL) is to train a classifier that discerns the actual classes from given instances. For this, the model must identify features indicative of the authentic labels. While research indicates that genuine label information is embedded in the learned features of even inaccurately labeled data, it's often intertwined with noise, complicating its direct application. Addressing this, we introduce channel-wise contrastive learning (CWCL). This method distinguishes authentic label information from noise by undertaking contrastive learning across diverse channels. Unlike conventional instance-wise contrastive learning (IWCL), CWCL tends to yield more nuanced and resilient features aligned with the authentic labels. Our strategy is twofold: firstly, using CWCL to extract pertinent features to identify cleanly labeled samples, and secondly, progressively fine-tuning using these samples. Evaluations on several benchmark datasets validate our method's superiority over existing approaches.
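A minimal sketch of the channel-wise idea: split the features of two augmented views into channel groups and compute an InfoNCE loss within each group, averaging across groups. The grouping granularity and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def channelwise_contrastive(z1, z2, groups=4, tau=0.5):
    """Channel-wise contrastive loss over two augmented views, sketch.

    z1, z2: (B, D) feature batches; each of the `groups` channel chunks
    is contrasted independently, so noise in one channel group does not
    dominate the whole representation.
    """
    losses = []
    for c1, c2 in zip(z1.chunk(groups, dim=1), z2.chunk(groups, dim=1)):
        c1, c2 = F.normalize(c1, dim=1), F.normalize(c2, dim=1)
        logits = c1 @ c2.t() / tau                           # (B, B) similarities
        labels = torch.arange(c1.size(0), device=c1.device)  # diagonal positives
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()
```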

Knowing Where to Focus: Event-aware Transformer for Video Grounding

  • paper_url: http://arxiv.org/abs/2308.06947
  • repo_url: https://github.com/jinhyunj/eatr
  • paper_authors: Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn
  • for: This paper aims to improve video grounding models by incorporating event-aware dynamic moment queries to better capture the temporal structure of videos and provide more accurate moment timestamps.
  • methods: The proposed method uses a slot attention mechanism for event reasoning and a gated fusion transformer layer for moment reasoning, which fuses the moment queries with the video-sentence representations to predict moment timestamps.
  • results: The proposed approach outperforms state-of-the-art video grounding models on several benchmarks, demonstrating its effectiveness and efficiency.
    Abstract Recent DETR-based video grounding models have made the model directly predict moment timestamps without any hand-crafted components, such as a pre-defined proposal or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook an intrinsic temporal structure of a video, providing limited positional information. In this paper, we formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) Event reasoning that captures distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning that fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.

Semantic-aware Network for Aerial-to-Ground Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.06945
  • repo_url: https://github.com/jinhyunj/sanet
  • paper_authors: Jinhyun Jang, Taeyong Song, Kwanghoon Sohn
  • for: This paper targets aerial-to-ground image synthesis, an emerging and challenging problem that aims to synthesize a ground image from an aerial image.
  • methods: The proposed framework imposes enhanced structural alignment and semantic awareness: a novel semantic-attentive feature transformation module reconstructs complex geographic structures by aligning aerial features to the ground layout, and semantic-aware loss functions leveraging a pre-trained segmentation network enforce realistic synthesis across classes by computing and balancing per-class losses.
  • results: Extensive experiments, including comparisons with previous methods and ablation studies, show the effectiveness of the proposed framework both qualitatively and quantitatively.
    Abstract Aerial-to-ground image synthesis is an emerging and challenging problem that aims to synthesize a ground image from an aerial image. Due to the highly different layout and object representation between the aerial and ground images, existing approaches usually fail to transfer the components of the aerial scene into the ground scene. In this paper, we propose a novel framework to explore the challenges by imposing enhanced structural alignment and semantic awareness. We introduce a novel semantic-attentive feature transformation module that allows to reconstruct the complex geographic structures by aligning the aerial feature to the ground layout. Furthermore, we propose semantic-aware loss functions by leveraging a pre-trained segmentation network. The network is enforced to synthesize realistic objects across various classes by separately calculating losses for different classes and balancing them. Extensive experiments including comparisons with previous methods and ablation studies show the effectiveness of the proposed framework both qualitatively and quantitatively.

Insurance pricing on price comparison websites via reinforcement learning

  • paper_url: http://arxiv.org/abs/2308.06935
  • repo_url: None
  • paper_authors: Tanut Treetanthiploet, Yufei Zhang, Lukasz Szpruch, Isaac Bowers-Barnard, Henrietta Ridley, James Hickey, Chris Pearce
  • for: This paper aims to address the challenges of formulating effective pricing strategies for insurers on price comparison websites (PCWs) by introducing a reinforcement learning (RL) framework that integrates model-based and model-free methods.
  • methods: The proposed methodology uses a model-based component to train agents in an offline setting, and model-free algorithms in a contextual bandit (CB) manner to dynamically update the pricing policy and maximize expected revenue.
  • results: The paper demonstrates the superiority of the proposed methodology over existing off-the-shelf RL/CB approaches using synthetic data, and shows that the hybrid agent outperforms benchmarks in terms of sample efficiency and cumulative reward.
    Abstract The emergence of price comparison websites (PCWs) has presented insurers with unique challenges in formulating effective pricing strategies. Operating on PCWs requires insurers to strike a delicate balance between competitive premiums and profitability, amidst obstacles such as low historical conversion rates, limited visibility of competitors' actions, and a dynamic market environment. In addition to this, the capital intensive nature of the business means pricing below the risk levels of customers can result in solvency issues for the insurer. To address these challenges, this paper introduces reinforcement learning (RL) framework that learns the optimal pricing policy by integrating model-based and model-free methods. The model-based component is used to train agents in an offline setting, avoiding cold-start issues, while model-free algorithms are then employed in a contextual bandit (CB) manner to dynamically update the pricing policy to maximise the expected revenue. This facilitates quick adaptation to evolving market dynamics and enhances algorithm efficiency and decision interpretability. The paper also highlights the importance of evaluating pricing policies using an offline dataset in a consistent fashion and demonstrates the superiority of the proposed methodology over existing off-the-shelf RL/CB approaches. We validate our methodology using synthetic data, generated to reflect private commercially available data within real-world insurers, and compare against 6 other benchmark approaches. Our hybrid agent outperforms these benchmarks in terms of sample efficiency and cumulative reward with the exception of an agent that has access to perfect market information which would not be available in a real-world set-up.
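The model-free half of the pipeline can be pictured as a contextual bandit over a discrete price grid that maximizes expected revenue, i.e., price times predicted conversion probability. The epsilon-greedy linear sketch below is illustrative; the paper's agent and its offline model-based pre-training are richer:

```python
import numpy as np

class PricingBandit:
    """Contextual-bandit premium selection, minimal sketch.

    Keeps one logistic conversion model per candidate price and quotes
    the price maximizing expected revenue, exploring epsilon-greedily.
    Offline model-based pre-training would initialize `weights` to avoid
    cold starts; they start at zero here for brevity.
    """
    def __init__(self, prices, dim, eps=0.1, lr=0.05):
        self.prices = np.asarray(prices, dtype=float)
        self.weights = np.zeros((len(prices), dim))
        self.eps, self.lr = eps, lr

    def act(self, context):
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.prices))   # explore
        p_convert = 1.0 / (1.0 + np.exp(-self.weights @ context))
        return int(np.argmax(self.prices * p_convert))   # expected revenue

    def update(self, context, arm, converted):
        p = 1.0 / (1.0 + np.exp(-self.weights[arm] @ context))
        self.weights[arm] += self.lr * (converted - p) * context
```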

Predicting Listing Prices In Dynamic Short Term Rental Markets Using Machine Learning Models

  • paper_url: http://arxiv.org/abs/2308.06929
  • repo_url: None
  • paper_authors: Sam Chapman, Seifey Mohammad, Kimberly Villegas
  • For: The paper aims to predict the prices of Airbnb rentals in Austin, Texas using a machine learning modeling approach; the primary objective is to construct an accurate model and the secondary objective is to identify the key factors that drive rental prices.
  • Methods: The paper takes a methodical machine learning approach and incorporates sentiment analysis into the feature engineering to gain a deeper understanding of periodic changes in Airbnb rental prices.
  • Results: The study seeks accurate predictions of Airbnb rental prices in Austin, Texas and identification of the key pricing factors, with attention to how these factors vary across locations and property types.
    Abstract Our research group wanted to take on the difficult task of predicting prices in a dynamic market. And short term rentals such as Airbnb listings seemed to be the perfect proving ground to do such a thing. Airbnb has revolutionized the travel industry by providing a platform for homeowners to rent out their properties to travelers. The pricing of Airbnb rentals is prone to high fluctuations, with prices changing frequently based on demand, seasonality, and other factors. Accurate prediction of Airbnb rental prices is crucial for hosts to optimize their revenue and for travelers to make informed booking decisions. In this project, we aim to predict the prices of Airbnb rentals using a machine learning modeling approach. Our project expands on earlier research in the area of analyzing Airbnb rental prices by taking a methodical machine learning approach as well as incorporating sentiment analysis into our feature engineering. We intend to gain a deeper understanding on periodic changes of Airbnb rental prices. The primary objective of this study is to construct an accurate machine learning model for predicting Airbnb rental prices specifically in Austin, Texas. Our project's secondary objective is to identify the key factors that drive Airbnb rental prices and to investigate how these factors vary across different locations and property types.

CBA: Improving Online Continual Learning via Continual Bias Adaptor

  • paper_url: http://arxiv.org/abs/2308.06925
  • repo_url: https://github.com/wqza/cba-online-cl
  • paper_authors: Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, Deyu Meng
  • for: Improving online continual learning (CL) on non-stationary data streams by counteracting catastrophic distribution shift.
  • methods: A Continual Bias Adaptor (CBA) module augments the classifier network during training so it can adapt to catastrophic distribution changes and learn a stable consolidation of previously learned tasks; at test time, CBA is removed, introducing no additional computation cost or memory overhead.
  • results: A theoretical analysis explains why the method alleviates catastrophic distribution shifts, and extensive experiments with four rehearsal-based baselines on three public continual learning benchmarks demonstrate its effectiveness.
    Abstract Online continual learning (CL) aims to learn new knowledge and consolidate previously learned knowledge from non-stationary data streams. Due to the time-varying training setting, the model learned from a changing distribution easily forgets the previously learned knowledge and biases toward the newly received task. To address this problem, we propose a Continual Bias Adaptor (CBA) module to augment the classifier network to adapt to catastrophic distribution change during training, such that the classifier network is able to learn a stable consolidation of previously learned tasks. In the testing stage, CBA can be removed which introduces no additional computation cost and memory overhead. We theoretically reveal the reason why the proposed method can effectively alleviate catastrophic distribution shifts, and empirically demonstrate its effectiveness through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks.

A Novel Ehanced Move Recognition Algorithm Based on Pre-trained Models with Positional Embeddings

  • paper_url: http://arxiv.org/abs/2308.10822
  • repo_url: None
  • paper_authors: Hao Wen, Jie Wang, Xiaodong Qiao
  • for: This study aims to improve move recognition accuracy for unstructured abstracts of Chinese scientific and technological papers.
  • methods: The algorithm first performs summary data segmentation and vocabulary training, then leverages the EP-ERNIE$\_$AT-GRU framework (an improved pre-trained model with a gated network and attention mechanism) to incorporate word positional information for deep semantic learning and targeted feature extraction.
  • results: The proposed algorithm achieves 13.37% higher accuracy on the split dataset than on the original dataset and a 7.55% improvement in accuracy over the basic comparison model.
    Abstract The recognition of abstracts is crucial for effectively locating the content and clarifying the article. Existing move recognition algorithms lack the ability to learn word position information to obtain contextual semantics. This paper proposes a novel enhanced move recognition algorithm with an improved pre-trained model and a gated network with attention mechanism for unstructured abstracts of Chinese scientific and technological papers. The proposed algorithm first performs summary data segmentation and vocabulary training. The EP-ERNIE$\_$AT-GRU framework is leveraged to incorporate word positional information, facilitating deep semantic learning and targeted feature extraction. Experimental results demonstrate that the proposed algorithm achieves 13.37$\%$ higher accuracy on the split dataset than on the original dataset and a 7.55$\%$ improvement in accuracy over the basic comparison model.

CausalLM is not optimal for in-context learning

  • paper_url: http://arxiv.org/abs/2308.06912
  • repo_url: None
  • paper_authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
  • for: This work seeks a theoretical understanding of how prefix language models (prefixLM) and causal language models (causalLM) differ for in-context learning.
  • methods: The convergence behavior of prefixLM and causalLM is analyzed under a certain parameter construction.
  • results: Both LM types converge to their stationary points at a linear rate, but prefixLM converges to the optimal solution of linear regression, whereas causalLM follows the dynamics of an online gradient descent algorithm that is not guaranteed to be optimal even as the number of samples grows infinitely. Experiments on synthetic and real tasks with various transformers confirm that causalLM consistently underperforms prefixLM.
    Abstract Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
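The architectural difference between the two LM types reduces to the attention mask, sketched below: causalLM is strictly lower-triangular, while prefixLM additionally allows full bidirectional attention within the prefix that holds the in-context samples:

```python
import torch

def attention_mask(seq_len, prefix_len, kind="prefix"):
    """Boolean attention masks for the two LM types (True = allowed).

    causalLM: position i attends only to positions j <= i.
    prefixLM: positions inside the prefix attend to each other freely;
    the suffix remains auto-regressive.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if kind == "prefix":
        mask[:prefix_len, :prefix_len] = True   # bidirectional prefix
    return mask

print(attention_mask(5, 3, "causal").int())
print(attention_mask(5, 3, "prefix").int())
```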

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

  • paper_url: http://arxiv.org/abs/2308.06911
  • repo_url: None
  • paper_authors: Pengfei Liu, Yiming Ren, Zhixiang Ren
  • for: This work develops a multi-modal large language model that captures the rich and complex information in molecular data.
  • methods: GIT-Mol integrates structure graph, image, and text information, including the Simplified Molecular Input Line Entry System (SMILES) and molecular captions; to integrate the multi-modal data, the authors propose GIT-Former, which maps all modalities into a unified latent space.
  • results: With an innovative any-to-language molecular translation strategy, the model achieves a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity over baseline or single-modality models.
    Abstract Large language models have made significant strides in natural language processing, paving the way for innovative applications including molecular representation and generation. However, most existing single-modality approaches cannot capture the abundant and complex information in molecular data. Here, we introduce GIT-Mol, a multi-modal large language model that integrates the structure Graph, Image, and Text information, including the Simplified Molecular Input Line Entry System (SMILES) and molecular captions. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of mapping all modalities into a unified latent space. Our study develops an innovative any-to-language molecular translation strategy and achieves a 10%-15% improvement in molecular captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost in molecule generation validity compared to baseline or single-modality models.
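
One common way to realize a unified latent space across modalities is a projection head per modality, trained so that paired samples land close together; a toy sketch under that assumption (GIT-Former itself is a transformer-based architecture and more involved):

```python
import torch.nn as nn
import torch.nn.functional as F

class UnifiedLatentProjector(nn.Module):
    """Toy sketch: one projection head per modality mapping pre-extracted
    graph/image/text features into a shared, unit-norm latent space."""
    def __init__(self, dims, latent=256):
        super().__init__()
        # dims: e.g. {"graph": 300, "image": 512, "text": 768}
        self.proj = nn.ModuleDict({m: nn.Linear(d, latent) for m, d in dims.items()})

    def forward(self, feats):
        # feats: modality name -> feature tensor of shape (batch, dims[m])
        return {m: F.normalize(self.proj[m](x), dim=-1) for m, x in feats.items()}
```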

Generative Interpretation

  • paper_url: http://arxiv.org/abs/2308.06907
  • repo_url: https://github.com/yonathanarbel/generativeinterpretation
  • paper_authors: Yonathan A. Arbel, David Hoffman
  • for: This paper aims to introduce a new approach to estimating contractual meaning using large language models.
  • methods: The paper uses grounded case studies to illustrate the capabilities of these novel tools in distinct ways, such as ascertaining ordinary meaning in context, quantifying ambiguity, filling gaps in parties’ agreements, and calculating the probative value of individual pieces of extrinsic evidence.
  • results: The paper shows that AI models can help factfinders accurately estimate what the parties intended, and that generative interpretation can unsettle the current interpretative stalemate between efficiency-minded textualists and justice-oriented contextualists.
    Abstract We introduce generative interpretation, a new approach to estimating contractual meaning using large language models. As AI triumphalism is the order of the day, we proceed by way of grounded case studies, each illustrating the capabilities of these novel tools in distinct ways. Taking well-known contracts opinions, and sourcing the actual agreements that they adjudicated, we show that AI models can help factfinders ascertain ordinary meaning in context, quantify ambiguity, and fill gaps in parties' agreements. We also illustrate how models can calculate the probative value of individual pieces of extrinsic evidence. After offering best practices for the use of these models given their limitations, we consider their implications for judicial practice and contract theory. Using LLMs permits courts to estimate what the parties intended cheaply and accurately, and as such generative interpretation unsettles the current interpretative stalemate. Their use responds to efficiency-minded textualists and justice-oriented contextualists, who argue about whether parties will prefer cost and certainty or accuracy and fairness. Parties--and courts--would prefer a middle path, in which adjudicators strive to predict what the contract really meant, admitting just enough context to approximate reality while avoiding unguided and biased assimilation of evidence. As generative interpretation offers this possibility, we argue it can become the new workhorse of contractual interpretation.

Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls

  • paper_url: http://arxiv.org/abs/2308.06895
  • repo_url: None
  • paper_authors: Saurav Prakash, Jin Sima, Chao Pan, Eli Chien, Olgica Milenkovic
  • for: This paper addresses federated learning in distributed, privacy-constrained settings, with a focus on classification in hyperbolic spaces.
  • methods: The paper proposes the first approach to federated classification in hyperbolic spaces, comprising distributed versions of convex SVM classifiers on Poincaré discs, integer $B_h$ sequences to resolve label switching, and a Poincaré-disc quantization method to limit data leakage and communication cost.
  • results: On diverse datasets, including hierarchical single-cell RNA-seq data, the method improves classification accuracy over its Euclidean-space counterpart.
    Abstract Hierarchical and tree-like data sets arise in many applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion. This problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincar\'e discs. In this setting, the information conveyed from clients to the global classifier are convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit the communication cost for the hulls, we propose a new quantization method for the Poincar\'e disc coupled with Reed-Solomon-like encoding. Fourth, at server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning. We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim 11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces.
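
For context, the geometry in play: the geodesic distance on the Poincaré disc, a standard formula independent of this paper's pipeline. Distances blow up near the boundary, which is what lets tree-like data embed with low distortion:

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points strictly inside the unit disc."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

origin = np.array([0.0, 0.0])
print(poincare_distance(origin, np.array([0.5, 0.0])))   # ~1.10
print(poincare_distance(origin, np.array([0.99, 0.0])))  # ~5.29, near-boundary blow-up
```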

Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders

  • paper_url: http://arxiv.org/abs/2308.06885
  • repo_url: None
  • paper_authors: Petr Kasalický, Rodrigo Alves, Pavel Kordík
  • for: Evaluating recommendation systems is complex: offline and online metrics are ambiguous with respect to a recommender's true objectives, and most recently published papers benchmark their methods with ill-posed offline methodology, which reduces the impact of academic research on industry.
  • methods: The authors investigate offline evaluation metrics that correlate with online performance, finding that penalizing popular items and accounting for transaction time during evaluation improves the ability to choose the best model for a live recommender system.
  • results: Results averaged over five large real-world live datasets show that the proposed offline metrics better reflect online performance, helping the academic community adopt offline evaluation and optimization criteria more relevant to real applications.
    Abstract The evaluation of recommendation systems is a complex task. The offline and online evaluation metrics for recommender systems are ambiguous in their true objectives. The majority of recently published papers benchmark their methods using ill-posed offline evaluation methodology that often fails to predict true online performance. Because of this, the impact that academic research has on the industry is reduced. The aim of our research is to investigate and compare the online performance of offline evaluation metrics. We show that penalizing popular items and considering the time of transactions during the evaluation significantly improves our ability to choose the best recommendation model for a live recommender system. Our results, averaged over five large-size real-world live data procured from recommenders, aim to help the academic community to understand better offline evaluation and optimization criteria that are more relevant for real applications of recommender systems.
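
A hedged sketch of what a popularity-penalized offline metric could look like (an illustration of the idea only, not the paper's exact metric; the inverse-log weighting scheme is our assumption):

```python
import numpy as np

def popularity_penalized_recall(recs, truth, item_pop, k=10):
    """Recall@k with each hit down-weighted by item popularity, so models
    that only surface blockbusters score lower. recs: ranked item lists per
    user; truth: held-out item sets per user; item_pop: item -> count."""
    scores = []
    for rec, t in zip(recs, truth):
        w = lambda i: 1.0 / np.log(2.0 + item_pop.get(i, 0))
        gain = sum(w(i) for i in rec[:k] if i in t)
        best = sum(w(i) for i in list(t)[:k])   # ideal: every held-out item hit
        scores.append(gain / best if best > 0 else 0.0)
    return float(np.mean(scores))
```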

Multi-Receiver Task-Oriented Communications via Multi-Task Deep Learning

  • paper_url: http://arxiv.org/abs/2308.06884
  • repo_url: None
  • paper_authors: Yalin E. Sagduyu, Tugba Erpek, Aylin Yener, Sennur Ulukus
  • for: This paper studies task-oriented communications in which a transmitter serves multiple receivers, each with its own task (e.g., image classification) on data available at the transmitter, training a shared encoder at the transmitter and a dedicated decoder at each receiver.
  • methods: A multi-task deep learning approach jointly optimizes multiple tasks and communication with multiple receivers; with efficient resource allocation at the edge of 6G networks, the system adapts to varying channel conditions while minimizing transmission overhead.
  • results: Experiments on MNIST, Fashion MNIST, and CIFAR-10 show that multi-receiver task-oriented communications outperform single-task-oriented systems in classification accuracy and resource utilization.
    Abstract This paper studies task-oriented, otherwise known as goal-oriented, communications, in a setting where a transmitter communicates with multiple receivers, each with its own task to complete on a dataset, e.g., images, available at the transmitter. A multi-task deep learning approach that involves training a common encoder at the transmitter and individual decoders at the receivers is presented for joint optimization of completing multiple tasks and communicating with multiple receivers. By providing efficient resource allocation at the edge of 6G networks, the proposed approach allows the communications system to adapt to varying channel conditions and achieves task-specific objectives while minimizing transmission overhead. Joint training of the encoder and decoders using multi-task learning captures shared information across tasks and optimizes the communication process accordingly. By leveraging the broadcast nature of wireless communications, multi-receiver task-oriented communications (MTOC) reduces the number of transmissions required to complete tasks at different receivers. Performance evaluation conducted on the MNIST, Fashion MNIST, and CIFAR-10 datasets (with image classification considered for different tasks) demonstrates the effectiveness of MTOC in terms of classification accuracy and resource utilization compared to single-task-oriented communication systems.
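
A minimal sketch of the shared-encoder, per-receiver-decoder layout over a noisy broadcast channel (all layer sizes and the AWGN channel model below are assumptions):

```python
import torch
import torch.nn as nn

class MTOCNet(nn.Module):
    """Sketch: one broadcast codeword from a shared encoder serves every
    receiver's task head; training sums the per-receiver task losses."""
    def __init__(self, in_dim=784, code_dim=16, num_receivers=2, classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                          nn.Linear(64, classes))
            for _ in range(num_receivers))

    def forward(self, x, snr_db=10.0):
        code = self.encoder(x)
        noise = torch.randn_like(code) * 10 ** (-snr_db / 20)  # AWGN channel
        rx = code + noise          # the same broadcast signal at all receivers
        return [dec(rx) for dec in self.decoders]

# joint multi-task loss: sum of per-receiver cross-entropies over the outputs
```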

Quantifying Outlierness of Funds from their Categories using Supervised Similarity

  • paper_url: http://arxiv.org/abs/2308.06882
  • repo_url: None
  • paper_authors: Dhruv Desai, Ashmita Dhiman, Tushar Sharma, Deepika Sharma, Dhagash Mehta, Stefano Pasquali
  • for: This study quantifies the effect of mutual fund miscategorization and proposes a machine-learning-based approach to detect it.
  • methods: Random-Forest-based distance metric learning is used to compute a class-wise outlier measure for each data point, identifying likely miscategorized funds.
  • results: The study finds a strong relationship between the outlier measures of funds and their future returns, and discusses the implications of this finding.
    Abstract Mutual fund categorization has become a standard tool for the investment management industry and is extensively used by allocators for portfolio construction and manager selection, as well as by fund managers for peer analysis and competitive positioning. As a result, a (unintended) miscategorization or lack of precision can significantly impact allocation decisions and investment fund managers. Here, we aim to quantify the effect of miscategorization of funds utilizing a machine learning based approach. We formulate the problem of miscategorization of funds as a distance-based outlier detection problem, where the outliers are the data-points that are far from the rest of the data-points in the given feature space. We implement and employ a Random Forest (RF) based method of distance metric learning, and compute the so-called class-wise outlier measures for each data-point to identify outliers in the data. We test our implementation on various publicly available data sets, and then apply it to mutual fund data. We show that there is a strong relationship between the outlier measures of the funds and their future returns and discuss the implications of our findings.
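
The class-wise outlier measure can be illustrated with the classic Random Forest proximity construction, which matches the description above; whether the paper follows this exact recipe is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classwise_outlier_measure(X, y, n_estimators=200, seed=0):
    """Breiman-style RF outlier score: proximity(i, j) = fraction of trees
    in which i and j share a leaf; outlierness of i = n_class / sum over
    same-class j of proximity(i, j)^2, robustly standardized per class."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed).fit(X, y)
    leaves = rf.apply(X)                                   # (n, n_trees) leaf ids
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    scores = np.zeros(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        raw = len(idx) / (prox[np.ix_(idx, idx)] ** 2).sum(axis=1)
        med = np.median(raw)
        mad = np.median(np.abs(raw - med)) + 1e-9
        scores[idx] = (raw - med) / mad                    # robust standardization
    return scores
```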

AutoSeqRec: Autoencoder for Efficient Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2308.06878
  • repo_url: https://github.com/sliu675/autoseqrec
  • paper_authors: Sijia Liu, Jiahao Liu, Hansu Gu, Dongsheng Li, Tun Lu, Peng Zhang, Ning Gu
  • for: This paper targets the sequential recommendation task, aiming to provide an efficient and robust incremental method.
  • methods: The method is built on autoencoders, with one encoder and three decoders that consider the user-item interaction matrix and the rows and columns of the item transition matrix.
  • results: AutoSeqRec achieves higher accuracy than competing methods while demonstrating robustness and efficiency.
    Abstract Sequential recommendation demonstrates the capability to recommend items by modeling the sequential behavior of users. Traditional methods typically treat users as sequences of items, overlooking the collaborative relationships among them. Graph-based methods incorporate collaborative information by utilizing the user-item interaction graph. However, these methods sometimes face challenges in terms of time complexity and computational efficiency. To address these limitations, this paper presents AutoSeqRec, an incremental recommendation model specifically designed for sequential recommendation tasks. AutoSeqRec is based on autoencoders and consists of an encoder and three decoders within the autoencoder architecture. These components consider both the user-item interaction matrix and the rows and columns of the item transition matrix. The reconstruction of the user-item interaction matrix captures user long-term preferences through collaborative filtering. In addition, the rows and columns of the item transition matrix represent the item out-degree and in-degree hopping behavior, which allows for modeling the user's short-term interests. When making incremental recommendations, only the input matrices need to be updated, without the need to update parameters, which makes AutoSeqRec very efficient. Comprehensive evaluations demonstrate that AutoSeqRec outperforms existing methods in terms of accuracy, while showcasing its robustness and efficiency.
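
A rough sketch of the one-encoder, three-decoder layout described above (dimensions and activation choices are assumptions):

```python
import torch
import torch.nn as nn

class AutoSeqRecSketch(nn.Module):
    """Sketch: the encoder compresses a user's interaction row; decoders
    reconstruct (1) the interaction row for long-term preference and the
    item-transition (2) out-rows and (3) in-columns for short-term hopping
    behavior. Incremental use only updates the input matrices."""
    def __init__(self, n_items, hidden=128):
        super().__init__()
        self.enc = nn.Linear(n_items, hidden)
        self.dec_interact = nn.Linear(hidden, n_items)   # user-item matrix
        self.dec_trans_out = nn.Linear(hidden, n_items)  # transition rows
        self.dec_trans_in = nn.Linear(hidden, n_items)   # transition columns

    def forward(self, user_row):
        z = torch.relu(self.enc(user_row))
        return (self.dec_interact(z),
                self.dec_trans_out(z),
                self.dec_trans_in(z))
```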

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

  • paper_url: http://arxiv.org/abs/2308.06873
  • repo_url: None
  • paper_authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
  • for: High-quality zero-shot text-to-speech and a variety of speech transformation tasks.
  • methods: A neural codec language model combined with multi-task learning and task-dependent prompting.
  • results: Strong performance across tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, matching or surpassing specialized models.
    Abstract Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

Semi-Supervised Dual-Stream Self-Attentive Adversarial Graph Contrastive Learning for Cross-Subject EEG-based Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.11635
  • repo_url: None
  • paper_authors: Weishan Ye, Zhiguo Zhang, Min Zhang, Fei Teng, Li Zhang, Linling Li, Gan Huang, Jianhong Wang, Dong Ni, Zhen Liang
  • for: This study addresses the scarcity of labeled data in EEG-based emotion recognition, aiming to improve its accuracy and reliability.
  • methods: The proposed semi-supervised Dual-stream Self-Attentive Adversarial Graph Contrastive framework (DS-AGC) runs two parallel streams over non-structural and structural EEG features: the non-structural stream applies semi-supervised multi-domain adaptation to reduce the distribution discrepancy among the labeled source, unlabeled source, and unknown target domains, while the structural stream uses graph contrastive learning to extract effective graph-based representations across multiple EEG channels; a self-attentive fusion module then performs feature fusion, sample selection, and emotion recognition, emphasizing EEG features more relevant to emotion and labeled source samples closer to the target domain.
  • results: Under a semi-supervised cross-subject leave-one-subject-out protocol with varying incomplete-label conditions, the model outperforms existing methods by an average of 5.83% on SEED and 6.99% on SEED-IV, demonstrating its effectiveness against label scarcity in cross-subject EEG-based emotion recognition.
    Abstract Electroencephalography (EEG) is an objective tool for emotion recognition with promising applications. However, the scarcity of labeled data remains a major challenge in this field, limiting the widespread use of EEG-based emotion recognition. In this paper, a semi-supervised Dual-stream Self-Attentive Adversarial Graph Contrastive learning framework (termed as DS-AGC) is proposed to tackle the challenge of limited labeled data in cross-subject EEG-based emotion recognition. The DS-AGC framework includes two parallel streams for extracting non-structural and structural EEG features. The non-structural stream incorporates a semi-supervised multi-domain adaptation method to alleviate distribution discrepancy among labeled source domain, unlabeled source domain, and unknown target domain. The structural stream develops a graph contrastive learning method to extract effective graph-based feature representation from multiple EEG channels in a semi-supervised manner. Further, a self-attentive fusion module is developed for feature fusion, sample selection, and emotion recognition, which highlights EEG features more relevant to emotions and data samples in the labeled source domain that are closer to the target domain. Extensive experiments conducted on two benchmark databases (SEED and SEED-IV) using a semi-supervised cross-subject leave-one-subject-out cross-validation evaluation scheme show that the proposed model outperforms existing methods under different incomplete label conditions (with an average improvement of 5.83% on SEED and 6.99% on SEED-IV), demonstrating its effectiveness in addressing the label scarcity problem in cross-subject EEG-based emotion recognition.

Effect of Choosing Loss Function when Using T-batching for Representation Learning on Dynamic Networks

  • paper_url: http://arxiv.org/abs/2308.06862
  • repo_url: https://github.com/erfanloghmani/effect-of-loss-function-tbatching
  • paper_authors: Erfan Loghmani, MohammadAmin Fazli
  • for: This paper examines training methods for representation learning on dynamic networks, aiming to improve training efficiency and accuracy.
  • methods: Two alternative training loss functions are proposed; mathematical analysis shows they overcome issues in the loss function used with t-batching, improving training performance.
  • results: Experiments show that training with the two alternative loss functions improves model efficiency and accuracy, especially on real-world dynamic networks.
    Abstract Representation learning methods have revolutionized machine learning on networks by converting discrete network structures into continuous domains. However, dynamic networks that evolve over time pose new challenges. To address this, dynamic representation learning methods have gained attention, offering benefits like reduced learning time and improved accuracy by utilizing temporal information. T-batching is a valuable technique for training dynamic network models that reduces training time while preserving vital conditions for accurate modeling. However, we have identified a limitation in the training loss function used with t-batching. Through mathematical analysis, we propose two alternative loss functions that overcome these issues, resulting in enhanced training performance. We extensively evaluate the proposed loss functions on synthetic and real-world dynamic networks. The results consistently demonstrate superior performance compared to the original loss function. Notably, in a real-world network characterized by diverse user interaction histories, the proposed loss functions achieved more than 26.9% enhancement in Mean Reciprocal Rank (MRR) and more than 11.8% improvement in Recall@10. These findings underscore the efficacy of the proposed loss functions in dynamic network modeling.

Optimizing Offensive Gameplan in the National Basketball Association with Machine Learning

  • paper_url: http://arxiv.org/abs/2308.06851
  • repo_url: None
  • paper_authors: Eamon Mukhopadhyay
  • for: The goal of this study is to verify the usefulness of basketball statistics and to relate them to NBA play types.
  • methods: Machine learning with a selected set of features: both a linear regression model and a neural network regression model are fit to test how well the statistics track play types.
  • results: Both models relate ORTG (Offensive Rating) to different NBA play types, with the neural network performing slightly better than linear regression; optimizing the model output over test examples reveals feature combinations for a highly functioning offense.
    Abstract Throughout the analytical revolution that has occurred in the NBA, the development of specific metrics and formulas has given teams, coaches, and players a new way to see the game. However - the question arises - how can we verify any metrics? One method would simply be eyeball approximation (trying out many different gameplans) and/or trial and error - an estimation-based and costly approach. Another approach is to try to model already existing metrics with a unique set of features using machine learning techniques. The key to this approach is that with these features that are selected, we can try to gauge the effectiveness of these features combined, rather than using individual analysis in simple metric evaluation. If we have an accurate model, it can particularly help us determine the specifics of gameplan execution. In this paper, the statistic ORTG (Offensive Rating, developed by Dean Oliver) was found to have a correlation with different NBA playtypes using both a linear regression model and a neural network regression model, although ultimately, a neural network worked slightly better than linear regression. Using the accuracy of the models as a justification, the next step was to optimize the output of the model with test examples, which would demonstrate the combination of features to best achieve a highly functioning offense.
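
To make the modeling setup concrete, here is a small scikit-learn comparison of linear regression against a neural network regressor on synthetic stand-in data (the paper's actual features and data are not reproduced; everything below is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Toy stand-in data: rows = team-seasons, columns = hypothetical play-type shares
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = 100 + X @ rng.normal(size=8) * 10 + rng.normal(scale=2, size=200)  # ORTG-like target

for name, model in [("linear", LinearRegression()),
                    ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                                         random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```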

When Monte-Carlo Dropout Meets Multi-Exit: Optimizing Bayesian Neural Networks on FPGA

  • paper_url: http://arxiv.org/abs/2308.06849
  • repo_url: https://github.com/os-hxfan/bayesnn_fpga
  • paper_authors: Hongxiang Fan, Hao Chen, Liam Castelli, Zhiqiang Que, He Li, Kenneth Long, Wayne Luk
  • for: Improving calibrated uncertainty prediction in safety-critical applications such as medical imaging and autonomous driving.
  • methods: A multi-exit Monte-Carlo Dropout (MCD)-based Bayesian neural network that achieves well-calibrated predictions at low algorithmic complexity, together with a transformation framework that generates FPGA-based accelerators for it.
  • results: The auto-generated accelerator achieves higher energy efficiency than CPU, GPU, and other state-of-the-art hardware implementations.
    Abstract Bayesian Neural Networks (BayesNNs) have demonstrated their capability of providing calibrated prediction for safety-critical applications such as medical imaging and autonomous driving. However, the high algorithmic complexity and the poor hardware performance of BayesNNs hinder their deployment in real-life applications. To bridge this gap, this paper proposes a novel multi-exit Monte-Carlo Dropout (MCD)-based BayesNN that achieves well-calibrated predictions with low algorithmic complexity. To further reduce the barrier to adopting BayesNNs, we propose a transformation framework that can generate FPGA-based accelerators for multi-exit MCD-based BayesNNs. Several novel optimization techniques are introduced to improve hardware performance. Our experiments demonstrate that our auto-generated accelerator achieves higher energy efficiency than CPU, GPU, and other state-of-the-art hardware implementations.
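
Monte-Carlo Dropout itself is simple to sketch: keep dropout active at inference and average repeated stochastic forward passes. A single-exit version follows (the paper's multi-exit variant adds intermediate prediction heads; sizes are assumptions):

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    """Plain MC-Dropout sketch: dropout stays on at inference, so repeated
    forward passes sample from an approximate posterior over predictions."""
    def __init__(self, in_dim=32, classes=10, p=0.2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Dropout(p),
                                 nn.Linear(64, classes))

    def predict(self, x, n_samples=20):
        self.train()                         # keep dropout active on purpose
        with torch.no_grad():
            probs = torch.stack([torch.softmax(self.net(x), -1)
                                 for _ in range(n_samples)])
        return probs.mean(0), probs.var(0)   # predictive mean and uncertainty
```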

Generalizing Topological Graph Neural Networks with Paths

  • paper_url: http://arxiv.org/abs/2308.06838
  • repo_url: None
  • paper_authors: Quang Truong, Peter Chin
  • for: This paper studies the expressiveness limits of Graph Neural Networks (GNNs) and how to overcome them.
  • methods: A path-centric approach is proposed that improves GNN performance without assumptions on graph sub-structures.
  • results: The method achieves state-of-the-art performance on several benchmarks.
    Abstract While Graph Neural Networks (GNNs) have made significant strides in diverse areas, they are hindered by a theoretical constraint known as the 1-Weisfeiler-Lehmann test. Even though latest advancements in higher-order GNNs can overcome this boundary, they typically center around certain graph components like cliques or cycles. However, our investigation goes a different route. We put emphasis on paths, which are inherent in every graph. We are able to construct a more general topological perspective and form a bridge to certain established theories about other topological domains. Interestingly, without any assumptions on graph sub-structures, our approach surpasses earlier techniques in this field, achieving state-of-the-art performance on several benchmarks.
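
As a toy illustration of a path-based structural descriptor (the paper's actual construction is more involved; this enumeration is exponential in general and is for intuition only):

```python
import networkx as nx

def node_path_profile(G: nx.Graph, v, cutoff=3):
    """Toy descriptor: the number of simple paths with 1..cutoff edges
    starting at node v."""
    counts = [0] * cutoff
    for target in G.nodes:
        if target == v:
            continue
        for p in nx.all_simple_paths(G, v, target, cutoff=cutoff):
            counts[len(p) - 2] += 1    # a path with k edges has k+1 nodes
    return counts

G = nx.cycle_graph(5)
print(node_path_profile(G, 0))   # [2, 2, 2]: every node of C5 looks the same
```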

InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

  • paper_url: http://arxiv.org/abs/2308.08500
  • repo_url: None
  • paper_authors: Kabir Nagrecha, Lingyi Liu, Pablo Delgado, Prasanna Padmanabhan
  • for: This paper examines the online data-ingestion bottleneck in deep learning recommendation model (DLRM) training and the challenges it poses in real-world pipelines.
  • methods: A reinforcement learning (RL) agent learns how to distribute the CPU resources of a trainer machine across the DLRM data pipeline to better parallelize data loading and improve throughput.
  • results: Experiments show that InTune builds an optimized data pipeline configuration within only a few minutes and integrates easily into existing training workflows; applied to a real-world cluster, it increases data ingestion throughput by up to 2.29x over state-of-the-art data pipeline optimizers while also improving CPU and GPU utilization and reducing idle time in model execution.
    Abstract Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time- saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling either yields sub-optimal performance, frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU & GPU utilization.

An Ensemble Approach to Question Classification: Integrating Electra Transformer, GloVe, and LSTM

  • paper_url: http://arxiv.org/abs/2308.06828
  • repo_url: None
  • paper_authors: Sanad Aburass, Osama Dorgham
  • for: This paper proposes a novel ensemble approach for the question classification task.
  • methods: The model integrates the state-of-the-art Electra, GloVe, and LSTM models to improve the accuracy and efficiency of question classification.
  • results: On the TREC question classification benchmark, the ensemble achieves an accuracy of 0.8 on the test set, demonstrating the advantage of ensembling for question classification and motivating further exploration of ensemble methods in natural language processing.
    Abstract This paper introduces a novel ensemble approach for question classification using state-of-the-art models -- Electra, GloVe, and LSTM. The proposed model is trained and evaluated on the TREC dataset, a well-established benchmark for question classification tasks. The ensemble model combines the strengths of Electra, a transformer-based model for language understanding, GloVe, a global vectors for word representation, and LSTM, a recurrent neural network variant, providing a robust and efficient solution for question classification. Extensive experiments were carried out to compare the performance of the proposed ensemble approach with other cutting-edge models, such as BERT, RoBERTa, and DistilBERT. Our results demonstrate that the ensemble model outperforms these models across all evaluation metrics, achieving an accuracy of 0.8 on the test set. These findings underscore the effectiveness of the ensemble approach in enhancing the performance of question classification tasks, and invite further exploration of ensemble methods in natural language processing.
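
A simple way to combine heterogeneous classifiers, such as a transformer head and a GloVe+LSTM head, is probability-level soft voting; whether the paper uses exactly this scheme is an assumption:

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Weighted average of per-model class probabilities, then argmax.
    prob_list: list of (n_samples, n_classes) arrays, one per model."""
    weights = weights or [1.0] * len(prob_list)
    stacked = np.stack([w * p for w, p in zip(weights, prob_list)])
    return stacked.sum(axis=0).argmax(axis=1)

# e.g. soft_vote([electra_probs, glove_lstm_probs], weights=[0.6, 0.4])
```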

Reinforcement Graph Clustering with Unknown Cluster Number

  • paper_url: http://arxiv.org/abs/2308.06827
  • repo_url: https://github.com/yueliu1999/awesome-deep-graph-clustering
  • paper_authors: Yue Liu, Ke Liang, Jun Xia, Xihong Yang, Sihang Zhou, Meng Liu, Xinwang Liu, Stan Z. Li
  • for: This paper proposes an unsupervised deep graph clustering method that works without a predefined cluster number, which is often unavailable in real-world scenarios.
  • methods: A reinforcement learning mechanism unifies cluster-number determination and unsupervised representation learning: discriminative representations are first learned with a contrastive pretext task; node and cluster states are considered to capture both local and global information; a quality network then evaluates candidate cluster numbers, and the cluster number is determined by a greedy action guided by a clustering-oriented reward that tightens same-cluster cohesion and separates different clusters.
  • results: Experiments demonstrate that the method clusters effectively and is more efficient than existing approaches; code and datasets are available on GitHub.
    Abstract Deep graph clustering, which aims to group nodes into disjoint clusters by neural networks in an unsupervised manner, has attracted great attention in recent years. Although the performance has been largely improved, the excellent performance of the existing methods heavily relies on an accurately predefined cluster number, which is not always available in the real-world scenario. To enable the deep graph clustering algorithms to work without the guidance of the predefined cluster number, we propose a new deep graph clustering method termed Reinforcement Graph Clustering (RGC). In our proposed method, cluster number determination and unsupervised representation learning are unified into a uniform framework by the reinforcement learning mechanism. Concretely, the discriminative node representations are first learned with the contrastive pretext task. Then, to capture the clustering state accurately with both local and global information in the graph, both node and cluster states are considered. Subsequently, at each state, the qualities of different cluster numbers are evaluated by the quality network, and the greedy action is executed to determine the cluster number. In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method. The source code of RGC is shared at https://github.com/yueliu1999/RGC and a collection (papers, codes and, datasets) of deep graph clustering is shared at https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering on Github.

Approximate and Weighted Data Reconstruction Attack in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.06822
  • repo_url: None
  • paper_authors: Ziqi Wang, Yongcun Song, Enrique Zuazua
  • for: This work attacks federated learning (FL) under the widely used horizontal Federated Averaging (FedAvg) setting, showing that an attacker can recover clients' training data even though the data are never shared.
  • methods: An interpolation-based approximation method generates the intermediate model updates of the clients' local training processes, making attacks on FedAvg feasible; a layer-wise weighted loss function, with weights tuned by Bayesian optimization, further improves reconstruction quality.
  • results: Experiments show that the proposed Approximate and Weighted Attack (AWA) delivers substantial improvements over other methods across evaluation metrics, particularly for image data reconstruction.
    Abstract Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserved by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data based on the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the data quality of reconstruction. We assign different weights to model updates in different layers concerning the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over the other state-of-the-art methods, as demonstrated by the substantial improvement in different evaluation metrics for image data reconstructions.
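
The interpolation step can be sketched as follows: given only the model a client received and the model it returned after several local steps, fabricate plausible intermediate snapshots that a per-step gradient-matching attack can then exploit (linear interpolation here is our assumption; the paper's scheme may differ):

```python
import torch

def interpolate_updates(w_start, w_end, n_steps):
    """Produce n_steps+1 parameter snapshots between the received model
    (w_start) and the returned model (w_end), both dicts of tensors, by
    linear interpolation along the presumed local-training trajectory."""
    return [
        {k: w_start[k] + (t / n_steps) * (w_end[k] - w_start[k])
         for k in w_start}
        for t in range(n_steps + 1)
    ]
```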

SoK: Realistic Adversarial Attacks and Defenses for Intelligent Network Intrusion Detection

  • paper_url: http://arxiv.org/abs/2308.06819
  • repo_url: None
  • paper_authors: João Vitorino, Isabel Praça, Eva Maia
  • for: This SoK consolidates the state of the art in adversarial learning for network intrusion detection, focusing on approaches that generate realistic adversarial examples.
  • methods: The study analyzes and evaluates a broad range of adversarial attack approaches, including black-box and gray-box settings and evasion attacks.
  • results: It identifies open challenges and future research directions, defines the fundamental properties required for an adversarial example to be realistic, and provides guidelines for experiments on real communication networks.
    Abstract Machine Learning (ML) can be incredibly valuable to automate anomaly detection and cyber-attack classification, improving the way that Network Intrusion Detection (NID) is performed. However, despite the benefits of ML models, they are highly susceptible to adversarial cyber-attack examples specifically crafted to exploit them. A wide range of adversarial attacks have been created and researchers have worked on various defense strategies to safeguard ML models, but most were not intended for the specific constraints of a communication network and its communication protocols, so they may lead to unrealistic examples in the NID domain. This Systematization of Knowledge (SoK) consolidates and summarizes the state-of-the-art adversarial learning approaches that can generate realistic examples and could be used in real ML development and deployment scenarios with real network traffic flows. This SoK also describes the open challenges regarding the use of adversarial ML in the NID domain, defines the fundamental properties that are required for an adversarial example to be realistic, and provides guidelines for researchers to ensure that their future experiments are adequate for a real communication network.

SAILOR: Structural Augmentation Based Tail Node Representation Learning

  • paper_url: http://arxiv.org/abs/2308.06801
  • repo_url: https://github.com/jie-re/sailor
  • paper_authors: Jie Liao, Jintang Li, Liang Chen, Bingzhe Wu, Yatao Bian, Zibin Zheng
  • for: This paper proposes SAILOR, a framework that improves the representation quality of tail nodes in graphs by jointly learning to augment the graph structure and extract more informative tail-node representations.
  • methods: The approach builds on message propagation and structural augmentation, both of which help improve the representations of tail nodes.
  • results: Experiments show that SAILOR significantly improves tail-node representations and outperforms state-of-the-art baselines.
    Abstract Graph Neural Networks (GNNs) have achieved state-of-the-art performance in representation learning for graphs recently. However, the effectiveness of GNNs, which capitalize on the key operation of message propagation, highly depends on the quality of the topology structure. Most of the graphs in real-world scenarios follow a long-tailed distribution on their node degrees, that is, a vast majority of the nodes in the graph are tail nodes with only a few connected edges. GNNs produce inferior node representations for tail nodes since they lack structural information. In the pursuit of promoting the expressiveness of GNNs for tail nodes, we explore how the deficiency of structural information deteriorates the performance of tail nodes and propose a general Structural Augmentation based taIL nOde Representation learning framework, dubbed as SAILOR, which can jointly learn to augment the graph structure and extract more informative representations for tail nodes. Extensive experiments on public benchmark datasets demonstrate that SAILOR can significantly improve the tail node representations and outperform the state-of-the-art baselines.