cs.CV - 2023-08-24

VNI-Net: Vector Neurons-based Rotation-Invariant Descriptor for LiDAR Place Recognition

  • paper_url: http://arxiv.org/abs/2308.12870
  • repo_url: None
  • paper_authors: Gengxuan Tian, Junqiao Zhao, Yingfeng Cai, Fenglin Zhang, Wenjie Mu, Chen Ye
  • for: Improving rotation invariance in LiDAR place recognition
  • methods: Uses a Vector Neurons Network (VNN) to achieve SO(3) rotation invariance, extracting rotation-equivariant features from neighboring points and mapping low-dimensional features to a high-dimensional space
  • results: Experiments on public datasets show that the method significantly outperforms other baseline methods implementing rotation invariance, while achieving results comparable to current state-of-the-art place recognition methods that do not consider rotation issues
    Abstract LiDAR-based place recognition plays a crucial role in Simultaneous Localization and Mapping (SLAM) and LiDAR localization. Despite the emergence of various deep learning-based and hand-crafting-based methods, rotation-induced place recognition failure remains a critical challenge. Existing studies address this limitation through specific training strategies or network structures. However, the former does not produce satisfactory results, while the latter focuses mainly on the reduced problem of SO(2) rotation invariance. Methods targeting SO(3) rotation invariance suffer from limitations in discrimination capability. In this paper, we propose a new method that employs Vector Neurons Network (VNN) to achieve SO(3) rotation invariance. We first extract rotation-equivariant features from neighboring points and map low-dimensional features to a high-dimensional space through VNN. Afterwards, we calculate the Euclidean and Cosine distance in the rotation-equivariant feature space as rotation-invariant feature descriptors. Finally, we aggregate the features using GeM pooling to obtain global descriptors. To address the significant information loss when formulating rotation-invariant descriptors, we propose computing distances between features at different layers within the Euclidean space neighborhood. This greatly improves the discriminability of the point cloud descriptors while ensuring computational efficiency. Experimental results on public datasets show that our approach significantly outperforms other baseline methods implementing rotation invariance, while achieving comparable results with current state-of-the-art place recognition methods that do not consider rotation issues.
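The invariance recipe in the abstract (Euclidean and cosine distances computed inside a rotation-equivariant vector-feature space, followed by GeM pooling) can be illustrated with a short sketch. The pairwise-channel distances, the cosine rescaling, and the GeM exponent below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rotation_invariant_descriptor(vn_feats, p=3.0):
    """Minimal sketch: turn rotation-equivariant vector-neuron features into a
    rotation-invariant global descriptor (assumed shapes, not the paper's code).

    vn_feats: (N, C, 3) -- for each of N local neighborhoods, C vector (3D)
    channels from a VNN. Rotating the input cloud by R maps every 3-vector v
    to v @ R.T, so norms and angles between channels do not change.
    """
    # Euclidean distances between vector channels are rotation-invariant.
    diff = vn_feats[:, :, None, :] - vn_feats[:, None, :, :]          # (N, C, C, 3)
    eucl = np.linalg.norm(diff, axis=-1)                               # (N, C, C)

    # Cosine similarities between vector channels are also invariant;
    # shifted to [0, 1] because GeM pooling expects non-negative activations.
    unit = vn_feats / (np.linalg.norm(vn_feats, axis=-1, keepdims=True) + 1e-8)
    cos = 0.5 * (1.0 + np.einsum('ncd,nkd->nck', unit, unit))          # (N, C, C)

    local = np.concatenate([eucl.reshape(len(vn_feats), -1),
                            cos.reshape(len(vn_feats), -1)], axis=1)   # (N, D)

    # GeM (generalized-mean) pooling over the N local descriptors.
    return (np.clip(local, 1e-6, None) ** p).mean(axis=0) ** (1.0 / p)  # (D,)
```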

ToonTalker: Cross-Domain Face Reenactment

  • paper_url: http://arxiv.org/abs/2308.12866
  • repo_url: None
  • paper_authors: Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang
  • for: Cross-domain face reenactment, i.e., driving a cartoon image with the video of a real person and vice versa
  • methods: A transformer-based framework with two domain-specific motion encoders and two learnable motion base memories; a source query transformer and a driving transformer project domain-specific motion into a common latent space, where motion transfer is performed via latent code addition, and a cross-domain training scheme with an analogy constraint removes the need for paired data
  • results: Extensive evaluations show the method outperforms competing approaches; a Disney-style cartoon dataset is also contributed
    Abstract We target cross-domain face reenactment in this paper, i.e., driving a cartoon image with the video of a real person and vice versa. Recently, many works have focused on one-shot talking face generation to drive a portrait with a real video, i.e., within-domain reenactment. Straightforwardly applying those methods to cross-domain animation will cause inaccurate expression transfer, blur effects, and even apparent artifacts due to the domain shift between cartoon and real faces. Only a few works attempt to settle cross-domain face reenactment. The most related work AnimeCeleb requires constructing a dataset with pose vector and cartoon image pairs by animating 3D characters, which makes it inapplicable anymore if no paired data is available. In this paper, we propose a novel method for cross-domain reenactment without paired data. Specifically, we propose a transformer-based framework to align the motions from different domains into a common latent space where motion transfer is conducted via latent code addition. Two domain-specific motion encoders and two learnable motion base memories are used to capture domain properties. A source query transformer and a driving one are exploited to project domain-specific motion to the canonical space. The edited motion is projected back to the domain of the source with a transformer. Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint. Besides, we contribute a cartoon dataset in Disney style. Extensive evaluations demonstrate the superiority of our method over competing methods.

SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection

  • paper_url: http://arxiv.org/abs/2308.12863
  • repo_url: None
  • paper_authors: Xinyu Zhang, Yan Gong, Zhiwei Li, Xin Gao, Dafeng Jin, Jun Li, Huaping Liu
  • For: This paper proposes a novel fusion architecture called SkipcrossNets for multi-modal fusion of LiDAR point clouds and camera images in autonomous driving tasks.
  • Methods: The SkipcrossNets architecture uses skip-cross connections to adaptively combine features from both modalities at each layer, without being bound to a specific fusion epoch. The network is divided into several blocks to reduce the complexity of feature fusion and the number of model parameters.
  • Results: The proposed SkipcrossNets architecture achieved a MaxF score of 96.85% on the KITTI dataset and an F1 score of 84.84% on the A2D2 dataset, with a memory requirement of only 2.33 MB and a speed of 68.24 FPS, making it viable for mobile terminals and embedded devices.
    Abstract Multi-modal fusion is increasingly being used for autonomous driving tasks, as images from different modalities provide unique information for feature extraction. However, the existing two-stream networks are only fused at a specific network layer, which requires a lot of manual attempts to set up. As the CNN goes deeper, the two modal features become more and more advanced and abstract, and the fusion occurs at the feature level with a large gap, which can easily hurt the performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which combines adaptively LiDAR point clouds and camera images without being bound to a certain fusion epoch. Specifically, skip-cross connects each layer to each layer in a feed-forward manner, and for each layer, the feature maps of all previous layers are used as input and its own feature maps are used as input to all subsequent layers for the other modality, enhancing feature propagation and multi-modal features fusion. This strategy facilitates selection of the most similar feature layers from two data pipelines, providing a complementary effect for sparse point cloud features during fusion processes. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion were demonstrated through application to the KITTI and A2D2 datasets, achieving a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters required only 2.33 MB of memory at a speed of 68.24 FPS, which could be viable for mobile terminals and embedded devices.
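As a rough illustration of the skip-cross idea (each layer consuming the feature maps of all previous layers of the other modality), here is a hedged PyTorch sketch; the layer definitions, channel counts, and concatenation order are assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class SkipCrossBlock(nn.Module):
    """Sketch of skip-cross fusion (an interpretation of the abstract, not the
    authors' code): each layer of one stream is fed, DenseNet-style, the
    feature maps of all earlier layers of the *other* modality."""

    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()

        def conv(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )

        # Layer i sees its own previous map plus i cross-modal maps.
        self.lidar_layers = nn.ModuleList([conv(channels * (i + 1)) for i in range(num_layers)])
        self.cam_layers = nn.ModuleList([conv(channels * (i + 1)) for i in range(num_layers)])

    def forward(self, lidar_feat, cam_feat):
        lidar_hist, cam_hist = [lidar_feat], [cam_feat]
        for l_layer, c_layer in zip(self.lidar_layers, self.cam_layers):
            # Concatenate own latest map with every earlier map of the other stream.
            l_in = torch.cat([lidar_hist[-1]] + cam_hist[:-1], dim=1)
            c_in = torch.cat([cam_hist[-1]] + lidar_hist[:-1], dim=1)
            lidar_hist.append(l_layer(l_in))
            cam_hist.append(c_layer(c_in))
        return lidar_hist[-1], cam_hist[-1]

# Example usage (assumed shapes): both modalities projected to 32 channels first.
# fused_lidar, fused_cam = SkipCrossBlock(32)(torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64))
```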

Learned Local Attention Maps for Synthesising Vessel Segmentations

  • paper_url: http://arxiv.org/abs/2308.12861
  • repo_url: None
  • paper_authors: Yash Deo, Rodrigo Bonazzola, Haoran Dou, Yan Xia, Tianyou Wei, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
  • for: Synthesising vessel segmentations of the main cerebral arteries from routinely acquired MR contrasts, providing a tool for diagnosis and for assessing the risk of adverse events such as haemorrhagic stroke
  • methods: An encoder-decoder model that synthesises segmentations of the circle of Willis (CoW) from only T2 MRI, trained with a two-phase multi-objective learning approach that captures both global and local features; learned local attention maps, generated by dilating the segmentation labels, force the network to extract only the T2 information relevant to synthesising the CoW
  • results: The synthetic vessel segmentations achieved a mean Dice score of $0.79 \pm 0.03$ in testing, higher than state-of-the-art segmentation networks such as transformer U-Net and nnU-Net while using only a fraction of their parameters; the main qualitative difference was the sharper resolution of the CoW vessel segments, especially in the posterior circulation
    Abstract Magnetic resonance angiography (MRA) is an imaging modality for visualising blood vessels. It is useful for several diagnostic applications and for assessing the risk of adverse events such as haemorrhagic stroke (resulting from the rupture of aneurysms in blood vessels). However, MRAs are not acquired routinely, hence, an approach to synthesise blood vessel segmentations from more routinely acquired MR contrasts such as T1 and T2, would be useful. We present an encoder-decoder model for synthesising segmentations of the main cerebral arteries in the circle of Willis (CoW) from only T2 MRI. We propose a two-phase multi-objective learning approach, which captures both global and local features. It uses learned local attention maps generated by dilating the segmentation labels, which forces the network to only extract information from the T2 MRI relevant to synthesising the CoW. Our synthetic vessel segmentations generated from only T2 MRI achieved a mean Dice score of $0.79 \pm 0.03$ in testing, compared to state-of-the-art segmentation networks such as transformer U-Net ($0.71 \pm 0.04$) and nnU-net($0.68 \pm 0.05$), while using only a fraction of the parameters. The main qualitative difference between our synthetic vessel segmentations and the comparative models was in the sharper resolution of the CoW vessel segments, especially in the posterior circulation.
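The "learned local attention maps generated by dilating the segmentation labels" can be approximated with a few lines; the dilation radius and how the mask is applied to the loss are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def local_attention_map(vessel_label, iterations=5):
    """Sketch: build a local attention mask by dilating a binary vessel label.

    vessel_label: binary array (e.g. a 3D circle-of-Willis ground-truth mask).
    Returns a float mask that is 1 near vessels and 0 elsewhere; it can weight
    a per-voxel loss so the network focuses on T2 content around the CoW.
    """
    dilated = binary_dilation(vessel_label.astype(bool), iterations=iterations)
    return dilated.astype(np.float32)

# Hypothetical usage: loss = (per_voxel_loss * local_attention_map(gt_label)).mean()
```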

Implicit Obstacle Map-driven Indoor Navigation Model for Robust Obstacle Avoidance

  • paper_url: http://arxiv.org/abs/2308.12845
  • repo_url: https://github.com/xwaiyy123/object-navigation
  • paper_authors: Wei Xie, Haobo Jiang, Shuo Gu, Jin Xie
  • for: Improving the robustness of obstacle avoidance in goal-driven indoor navigation, especially when obstacles are missing from the visual image or detections are missed
  • methods: An implicit obstacle map-driven indoor navigation framework in which the obstacle map is learned from historical trial-and-error experience rather than from the visual image; a non-local target memory aggregation module models the intrinsic relationship between target semantics and target orientation clues to mine the most target-correlated object clues for the navigation decision
  • results: On the AI2-Thor and RoboTHOR benchmarks, the proposed method shows excellent obstacle avoidance and navigation efficiency
    Abstract Robust obstacle avoidance is one of the critical steps for successful goal-driven indoor navigation tasks. Due to the obstacle missing in the visual image and the possible missed detection issue, visual image-based obstacle avoidance techniques still suffer from unsatisfactory robustness. To mitigate it, in this paper, we propose a novel implicit obstacle map-driven indoor navigation framework for robust obstacle avoidance, where an implicit obstacle map is learned based on the historical trial-and-error experience rather than the visual image. In order to further improve the navigation efficiency, a non-local target memory aggregation module is designed to leverage a non-local network to model the intrinsic relationship between the target semantic and the target orientation clues during the navigation process so as to mine the most target-correlated object clues for the navigation decision. Extensive experimental results on AI2-Thor and RoboTHOR benchmarks verify the excellent obstacle avoidance and navigation efficiency of our proposed method. The core source code is available at https://github.com/xwaiyy123/object-navigation.

EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting

  • paper_url: http://arxiv.org/abs/2308.12831
  • repo_url: None
  • paper_authors: Zitao Wang, Qiguang Miao, Yue Xi
  • for: Portrait matting: extracting an alpha matte with complete semantics and finely-detailed contours
  • methods: EFormer, an enhanced transformer whose self-attention, with its larger receptive field, captures long-range dependencies and low-frequency semantic information of a portrait; a semantic and contour detector (SCD) plus dedicated contour-edge extraction and semantic extraction branches refine contour features and complete semantic information before fusion and prediction
  • results: Improves the accuracy and completeness of the predicted portrait contours without requiring a trimap, and outperforms previous portrait matting methods on the VideoMatte240K-JPEGSD and AIM datasets
    Abstract The portrait matting task aims to extract an alpha matte with complete semantics and finely-detailed contours. In comparison to CNN-based approaches, transformers with self-attention allow a larger receptive field, enabling it to better capture long-range dependencies and low-frequency semantic information of a portrait. However, the recent research shows that self-attention mechanism struggle with modeling high-frequency information and capturing fine contour details, which can lead to bias while predicting the portrait's contours. To address the problem, we propose EFormer to enhance the model's attention towards semantic and contour features. Especially the latter, which is surrounded by a large amount of high-frequency details. We build a semantic and contour detector (SCD) to accurately capture the distribution of semantic and contour features. And we further design contour-edge extraction branch and semantic extraction branch for refining contour features and complete semantic information. Finally, we fuse the two kinds of features and leverage the segmentation head to generate the predicted portrait matte. Remarkably, EFormer is an end-to-end trimap-free method and boasts a simple structure. Experiments conducted on VideoMatte240K-JPEGSD and AIM datasets demonstrate that EFormer outperforms previous portrait matte methods.

Robotic Scene Segmentation with Memory Network for Runtime Surgical Context Inference

  • paper_url: http://arxiv.org/abs/2308.12789
  • repo_url: https://github.com/uva-dsa/runtime_robscene_seg_2context
  • paper_authors: Zongyu Li, Ian Reyes, Homa Alemzadeh
  • for: Addressing the challenge of runtime surgical context inference in robot-assisted surgery by improving the accuracy and temporal consistency of video segmentation
  • methods: Uses the Space Time Correspondence Network (STCN), a memory network that performs binary segmentation and minimizes the effects of class imbalance; its memory bank exploits past image and segmentation information to keep the predicted masks consistent over time
  • results: Experiments on the publicly available JIGSAWS dataset show that STCN achieves superior segmentation for hard-to-segment objects such as needle and thread and improves context inference; segmentation and context inference can also be performed at runtime without compromising performance
    Abstract Surgical context inference has recently garnered significant attention in robot-assisted surgery as it can facilitate workflow analysis, skill assessment, and error detection. However, runtime context inference is challenging since it requires timely and accurate detection of the interactions among the tools and objects in the surgical scene based on the segmentation of video data. On the other hand, existing state-of-the-art video segmentation methods are often biased against infrequent classes and fail to provide temporal consistency for segmented masks. This can negatively impact the context inference and accurate detection of critical states. In this study, we propose a solution to these challenges using a Space Time Correspondence Network (STCN). STCN is a memory network that performs binary segmentation and minimizes the effects of class imbalance. The use of a memory bank in STCN allows for the utilization of past image and segmentation information, thereby ensuring consistency of the masks. Our experiments using the publicly available JIGSAWS dataset demonstrate that STCN achieves superior segmentation performance for objects that are difficult to segment, such as needle and thread, and improves context inference compared to the state-of-the-art. We also demonstrate that segmentation and context inference can be performed at runtime without compromising performance.
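The memory-bank mechanism that gives STCN its temporal consistency boils down to attention between the current frame's keys and stored keys/values. The sketch below uses dot-product affinity and assumed tensor shapes for illustration (STCN itself uses a negative squared L2 affinity), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def memory_readout(query_key, memory_keys, memory_values):
    """Simplified sketch of a space-time correspondence memory read.

    query_key:     (C_k, H*W)    key features of the current frame
    memory_keys:   (C_k, T*H*W)  key features of stored past frames
    memory_values: (C_v, T*H*W)  value features carrying segmentation info

    Every query location attends over all memory locations, so masks stay
    consistent with what was segmented in earlier frames.
    """
    affinity = memory_keys.t() @ query_key                         # (T*H*W, H*W)
    affinity = F.softmax(affinity / query_key.shape[0] ** 0.5, dim=0)
    return memory_values @ affinity                                 # (C_v, H*W)
```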

On Offline Evaluation of 3D Object Detection for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.12779
  • repo_url: None
  • paper_authors: Tim Schreier, Katrin Renz, Andreas Geiger, Kashyap Chitta
  • for: Measuring how predictive offline 3D object detection metrics are of driving performance when detectors are integrated into a full self-driving stack
  • methods: Integrates 16 object detection models into a self-driving stack and runs extensive urban-driving experiments in the CARLA simulator to correlate different detection metrics with driving performance
  • results: The nuScenes Detection Score correlates with driving performance more strongly than the widely used average precision metric, and the results call for caution on exclusive reliance on the emerging class of 'planner-centric' metrics
    Abstract Prior work in 3D object detection evaluates models using offline metrics like average precision since closed-loop online evaluation on the downstream driving task is costly. However, it is unclear how indicative offline results are of driving performance. In this work, we perform the first empirical evaluation measuring how predictive different detection metrics are of driving performance when detectors are integrated into a full self-driving stack. We conduct extensive experiments on urban driving in the CARLA simulator using 16 object detection models. We find that the nuScenes Detection Score has a higher correlation to driving performance than the widely used average precision metric. In addition, our results call for caution on the exclusive reliance on the emerging class of `planner-centric' metrics.
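The core analysis is a correlation between an offline metric and closed-loop driving performance across detectors; a minimal sketch (with placeholder numbers, not values from the paper) could look like this:

```python
import numpy as np
from scipy.stats import pearsonr

# Hedged sketch of the kind of analysis described: correlate an offline
# detection metric with closed-loop driving performance across detectors.
offline_metric = np.array([0.61, 0.55, 0.70, 0.48, 0.66])   # e.g. NDS or AP per detector (placeholders)
driving_score  = np.array([72.0, 65.0, 80.0, 58.0, 75.0])   # e.g. CARLA driving score (placeholders)

r, p_value = pearsonr(offline_metric, driving_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```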

LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2308.12774
  • repo_url: None
  • paper_authors: Changxu Cheng, Peng Wang, Cheng Da, Qi Zheng, Cong Yao
  • for: Improving the ability of scene text recognition (STR) to recognize long text and to extrapolate to unseen text lengths
  • methods: The Length-Insensitive Scene TExt Recognizer (LISTER), consisting of a Neighbor Decoder and a Feature Enhancement Module: the Neighbor Decoder uses a novel neighbor matrix to obtain accurate character attention maps regardless of text length, while the Feature Enhancement Module models long-range dependencies at low computational cost and iterates with the decoder to progressively enhance the feature map
  • results: Experiments show that LISTER has a clear advantage in long-text recognition and length extrapolation, while comparing favourably with previous state-of-the-art methods on standard STR benchmarks (mainly short text)
    Abstract The diversity in length constitutes a significant characteristic of text. Due to the long-tail distribution of text lengths, most existing methods for scene text recognition (STR) only work well on short or seen-length text, lacking the capability of recognizing longer text or performing length extrapolation. This is a crucial issue, since the lengths of the text to be recognized are usually not given in advance in real-world applications, but it has not been adequately investigated in previous works. Therefore, we propose in this paper a method called Length-Insensitive Scene TExt Recognizer (LISTER), which remedies the limitation regarding the robustness to various text lengths. Specifically, a Neighbor Decoder is proposed to obtain accurate character attention maps with the assistance of a novel neighbor matrix regardless of the text lengths. Besides, a Feature Enhancement Module is devised to model the long-range dependency with low computation cost, which is able to perform iterations with the neighbor decoder to enhance the feature map progressively. To the best of our knowledge, we are the first to achieve effective length-insensitive scene text recognition. Extensive experiments demonstrate that the proposed LISTER algorithm exhibits obvious superiority on long text recognition and the ability for length extrapolation, while comparing favourably with the previous state-of-the-art methods on standard benchmarks for STR (mainly short text).

IP-UNet: Intensity Projection UNet Architecture for 3D Medical Volume Segmentation

  • paper_url: http://arxiv.org/abs/2308.12761
  • repo_url: None
  • paper_authors: Nyothiri Aung, Tahar Kechadi, Liming Chen, Sahraoui Dhelim
  • for: Automatic breast calcification detection
  • methods: The IP-UNet model, which performs multi-class segmentation on Intensity Projections (IP) of 3D volumetric data and trains within limited memory without losing the original 3D image resolution
  • results: IP-UNet achieves segmentation accuracy similar to 3D-UNet but with much better performance, reducing training time by 70% and memory consumption by 92%
    Abstract CNNs have been widely applied for medical image analysis. However, limited memory capacity is one of the most common drawbacks of processing high-resolution 3D volumetric data. 3D volumes are usually cropped or downsized first before processing, which can result in a loss of resolution, increase class imbalance, and affect the performance of the segmentation algorithms. In this paper, we propose an end-to-end deep learning approach called IP-UNet. IP-UNet is a UNet-based model that performs multi-class segmentation on Intensity Projection (IP) of 3D volumetric data instead of the memory-consuming 3D volumes. IP-UNet uses limited memory capability for training without losing the original 3D image resolution. We compare the performance of three models in terms of segmentation accuracy and computational cost: 1) Slice-by-slice 2D segmentation of the CT scan images using a conventional 2D UNet model. 2) IP-UNet that operates on data obtained by merging the extracted Maximum Intensity Projection (MIP), Closest Vessel Projection (CVP), and Average Intensity Projection (AvgIP) representations of the source 3D volumes, then applying the UNet model on the output IP images. 3) 3D-UNet model directly reads the 3D volumes constructed from a series of CT scan images and outputs the 3D volume of the predicted segmentation. We test the performance of these methods on 3D volumetric images for automatic breast calcification detection. Experimental results show that IP-Unet can achieve similar segmentation accuracy with 3D-Unet but with much better performance. It reduces the training time by 70\% and memory consumption by 92\%.
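The projection step that lets a 2D UNet replace a memory-hungry 3D one is straightforward to sketch; the axis choice and the pairing of projections are illustrative assumptions, and the Closest Vessel Projection used in the paper is omitted here:

```python
import numpy as np

def intensity_projections(volume, axis=0):
    """Sketch: collapse a 3D CT volume into 2D intensity projection images
    that a 2D UNet can segment.

    volume: 3D array (D, H, W) of CT intensities.
    Returns a (H, W, 2) stack of [MIP, AvgIP] projections along `axis`.
    """
    mip = volume.max(axis=axis)      # Maximum Intensity Projection
    avg = volume.mean(axis=axis)     # Average Intensity Projection
    return np.stack([mip, avg], axis=-1)
```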

PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.12757
  • repo_url: None
  • paper_authors: Mengya Han, Heliang Zheng, Chaoyue Wang, Yong Luo, Han Hu, Jing Zhang, Yonggang Wen
  • for: Few-shot part segmentation: segmenting the different parts of an unseen object using very few labeled examples
  • methods: PartSeg, a multimodal learning method built around part-aware prompt learning, which generates part-specific prompts so that the CLIP model better understands the concept of "part" and fully exploits its textual space; relationships between the same part across different object categories are established during prompt learning
  • results: Extensive experiments on the PartImageNet and Pascal$\_$Part datasets show that the proposed method achieves state-of-the-art performance
    Abstract In this work, we address the task of few-shot part segmentation, which aims to segment the different parts of an unseen object using very few labeled examples. It is found that leveraging the textual space of a powerful pre-trained image-language model (such as CLIP) can be beneficial in learning visual features. Therefore, we develop a novel method termed PartSeg for few-shot part segmentation based on multimodal learning. Specifically, we design a part-aware prompt learning method to generate part-specific prompts that enable the CLIP model to better understand the concept of ``part'' and fully utilize its textual space. Furthermore, since the concept of the same part under different object categories is general, we establish relationships between these parts during the prompt learning process. We conduct extensive experiments on the PartImageNet and Pascal$\_$Part datasets, and the experimental results demonstrated that our proposed method achieves state-of-the-art performance.
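To make the idea of part-specific prompting concrete, here is a hedged sketch using the OpenAI CLIP package; the text template and part names are placeholders, and the paper learns continuous prompt tokens rather than fixed strings like these:

```python
import torch
import clip  # OpenAI CLIP package

# Illustrative part-specific prompting (not the authors' learned prompts).
model, _ = clip.load("ViT-B/32", device="cpu")

object_cls, parts = "dog", ["head", "torso", "leg", "tail"]   # hypothetical example
prompts = [f"a photo of the {part} of a {object_cls}" for part in parts]

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts))     # (num_parts, 512)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Dense image features could then be matched against these per-part embeddings
# to produce part segmentation logits.
```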

Learning Heavily-Degraded Prior for Underwater Object Detection

  • paper_url: http://arxiv.org/abs/2308.12738
  • repo_url: https://github.com/xiaodetection/learning-heavily-degraed-prior
  • paper_authors: Chenping Fu, Xin Fan, Jiewen Xiao, Wanqi Yuan, Risheng Liu, Zhongxuan Luo
  • for: Addressing the image-quality degradations (haze-like effects, low visibility, color distortions) that hurt underwater object detection, by exploiting the feature-distribution gap between heavily degraded regions of detector-friendly and underwater images
  • methods: A residual feature transference module (RFTM) learns a mapping between deep representations of the heavily degraded patches of detector-friendly (DFUI) and underwater images; this mapping serves as a heavily degraded prior (HDP) that can be learned without semantic labels and plugged into popular CNN-based feature extraction networks
  • results: Evaluations on URPC2020 and UODD show that the method outperforms CNN-based detectors by a large margin and, with higher speed and fewer parameters, still performs better than transformer-based detectors
    Abstract Underwater object detection suffers from low detection performance because the distance and wavelength dependent imaging process yield evident image quality degradations such as haze-like effects, low visibility, and color distortions. Therefore, we commit to resolving the issue of underwater object detection with compounded environmental degradations. Typical approaches attempt to develop sophisticated deep architecture to generate high-quality images or features. However, these methods are only work for limited ranges because imaging factors are either unstable, too sensitive, or compounded. Unlike these approaches catering for high-quality images or features, this paper seeks transferable prior knowledge from detector-friendly images. The prior guides detectors removing degradations that interfere with detection. It is based on statistical observations that, the heavily degraded regions of detector-friendly (DFUI) and underwater images have evident feature distribution gaps while the lightly degraded regions of them overlap each other. Therefore, we propose a residual feature transference module (RFTM) to learn a mapping between deep representations of the heavily degraded patches of DFUI- and underwater- images, and make the mapping as a heavily degraded prior (HDP) for underwater detection. Since the statistical properties are independent to image content, HDP can be learned without the supervision of semantic labels and plugged into popular CNNbased feature extraction networks to improve their performance on underwater object detection. Without bells and whistles, evaluations on URPC2020 and UODD show that our methods outperform CNN-based detectors by a large margin. Our method with higher speeds and less parameters still performs better than transformer-based detectors. Our code and DFUI dataset can be found in https://github.com/xiaoDetection/Learning-Heavily-Degraed-Prior.

FastSurfer-HypVINN: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI

  • paper_url: http://arxiv.org/abs/2308.12736
  • repo_url: None
  • paper_authors: Santiago Estrada, David Kügler, Emad Bahrami, Peng Xu, Dilshad Mousa, Monique M. B. Breteler, N. Ahmad Aziz, Martin Reuter
  • for: Fully automated sub-segmentation of the hypothalamus and adjacent structures, enabling scalable and reproducible study of hypothalamic structure and function
  • methods: HypVINN, a fast, fully automated deep learning method that operates on 0.8 mm isotropic T1w and T2w brain MR images and is robust to missing modalities
  • results: High segmentation accuracy, generalizability, test-retest reliability, and sensitivity to hypothalamic volume effects (e.g. sex differences), validated on 1.0 mm scans from the Rhineland Study and the UK Biobank; segmentation takes less than a minute on a GPU and will be released in the open-source FastSurfer suite
    Abstract The hypothalamus plays a crucial role in the regulation of a broad range of physiological, behavioural, and cognitive functions. However, despite its importance, only a few small-scale neuroimaging studies have investigated its substructures, likely due to the lack of fully automated segmentation tools to address scalability and reproducibility issues of manual segmentation. While the only previous attempt to automatically sub-segment the hypothalamus with a neural network showed promise for 1.0 mm isotropic T1-weighted (T1w) MRI, there is a need for an automated tool to sub-segment also high-resolutional (HiRes) MR scans, as they are becoming widely available, and include structural detail also from multi-modal MRI. We, therefore, introduce a novel, fast, and fully automated deep learning method named HypVINN for sub-segmentation of the hypothalamus and adjacent structures on 0.8 mm isotropic T1w and T2w brain MR images that is robust to missing modalities. We extensively validate our model with respect to segmentation accuracy, generalizability, in-session test-retest reliability, and sensitivity to replicate hypothalamic volume effects (e.g. sex-differences). The proposed method exhibits high segmentation performance both for standalone T1w images as well as for T1w/T2w image pairs. Even with the additional capability to accept flexible inputs, our model matches or exceeds the performance of state-of-the-art methods with fixed inputs. We, further, demonstrate the generalizability of our method in experiments with 1.0 mm MR scans from both the Rhineland Study and the UK Biobank. Finally, HypVINN can perform the segmentation in less than a minute (GPU) and will be available in the open source FastSurfer neuroimaging software suite, offering a validated, efficient, and scalable solution for evaluating imaging-derived phenotypes of the hypothalamus.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

  • paper_url: http://arxiv.org/abs/2308.12712
  • repo_url: https://github.com/yqc123456/hkd_for_person_search
  • paper_authors: Shizhou Zhang, Qingchun Yang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang
  • for: Building a large-scale Ground-to-Aerial person search dataset (G2APS) for cross-platform intelligent surveillance applications
  • methods: Analyzes current two-step and end-to-end person search methods on the new benchmark and proposes a simple yet effective knowledge distillation scheme on the head of the ReID network to improve person search performance
  • results: The proposed approach achieves state-of-the-art performance on G2APS and on the two public person search datasets PRW and CUHK-SYSU
    Abstract In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding boxes for 2,644 identities appearing in both of the UAVs and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different location, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On basis of the G2APS benchmark dataset, we demonstrate detailed analysis about current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both of the G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code available on \url{https://github.com/yqc123456/HKD_for_person_search}.
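The "knowledge distillation scheme on the head of the ReID network" is not spelled out in the abstract; a standard distillation loss that such a head could use is sketched below, with the temperature and exact formulation as assumptions rather than the paper's design:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Sketch of a standard knowledge-distillation loss (Hinton-style) that
    could be applied on a ReID head; not the paper's exact scheme."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between softened teacher and student distributions.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```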

A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions

  • paper_url: http://arxiv.org/abs/2308.12700
  • repo_url: None
  • paper_authors: Jiawei Lin, Jiaqi Guo, Shizhao Sun, Weijiang Xu, Ting Liu, Jian-Guang Lou, Dongmei Zhang
  • for: Generating graphic layouts from textual descriptions (Text-to-Layout) to lower the barriers of graphic design
  • methods: A two-stage parse-then-place approach: the parse stage converts a textual description into an intermediate representation (IR) that turns the implicit constraints in the text into explicit ones, and the place stage generates layouts from the IR with a Transformer-based model, representing constraints and layouts as carefully designed sequences to handle combined and incomplete constraints and using a pretrain-then-finetune strategy on large-scale unlabeled layouts
  • results: On two newly constructed Text-to-Layout datasets, quantitative results, qualitative analysis, and user studies all demonstrate the effectiveness of the approach
    Abstract Creating layouts is a fundamental step in graphic design. In this work, we propose to use text as the guidance to create graphic layouts, i.e., Text-to-Layout, aiming to lower the design barriers. Text-to-Layout is a challenging task, because it needs to consider the implicit, combined, and incomplete layout constraints from text, each of which has not been studied in previous work. To address this, we present a two-stage approach, named parse-then-place. The approach introduces an intermediate representation (IR) between text and layout to represent diverse layout constraints. With IR, Text-to-Layout is decomposed into a parse stage and a place stage. The parse stage takes a textual description as input and generates an IR, in which the implicit constraints from the text are transformed into explicit ones. The place stage generates layouts based on the IR. To model combined and incomplete constraints, we use a Transformer-based layout generation model and carefully design a way to represent constraints and layouts as sequences. Besides, we adopt the pretrain-then-finetune strategy to boost the performance of the layout generation model with large-scale unlabeled layouts. To evaluate our approach, we construct two Text-to-Layout datasets and conduct experiments on them. Quantitative results, qualitative analysis, and user studies demonstrate the effectiveness of our approach.

A Continual Learning Approach for Cross-Domain White Blood Cell Classification

  • paper_url: http://arxiv.org/abs/2308.12679
  • repo_url: None
  • paper_authors: Ario Sadafi, Raheleh Salehi, Armin Gruber, Sayedali Shetab Boushehri, Pascal Giehr, Nassir Navab, Carsten Marr
  • for: Keeping white blood cell classification accurate as clinical settings, data sources, and disease classifications evolve, supporting the diagnosis of hematological diseases
  • methods: A rehearsal-based continual learning approach for class-incremental and domain-incremental scenarios that learns sequentially from incoming data streams without forgetting previously acquired knowledge; exemplars from previous tasks are chosen from the model's predictions, keeping both the most confident samples and the most challenging samples identified through uncertainty estimation
  • results: Across three white blood cell datasets differing in color, resolution, and class composition, including a long class-incremental experiment with both new domains and new classes, the approach outperforms established continual learning baselines such as iCaRL and EWC
    Abstract Accurate classification of white blood cells in peripheral blood is essential for diagnosing hematological diseases. Due to constantly evolving clinical settings, data sources, and disease classifications, it is necessary to update machine learning classification models regularly for practical real-world use. Such models significantly benefit from sequentially learning from incoming data streams without forgetting previously acquired knowledge. However, models can suffer from catastrophic forgetting, causing a drop in performance on previous tasks when fine-tuned on new data. Here, we propose a rehearsal-based continual learning approach for class incremental and domain incremental scenarios in white blood cell classification. To choose representative samples from previous tasks, we employ exemplar set selection based on the model's predictions. This involves selecting the most confident samples and the most challenging samples identified through uncertainty estimation of the model. We thoroughly evaluated our proposed approach on three white blood cell classification datasets that differ in color, resolution, and class composition, including scenarios where new domains or new classes are introduced to the model with every task. We also test a long class incremental experiment with both new domains and new classes. Our results demonstrate that our approach outperforms established baselines in continual learning, including existing iCaRL and EWC methods for classifying white blood cells in cross-domain environments.
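The exemplar selection the abstract describes (most confident plus most challenging samples) can be sketched in a few lines; using max softmax probability for confidence and predictive entropy for uncertainty is an assumption, not necessarily the paper's estimator:

```python
import numpy as np

def select_exemplars(probs, n_confident, n_challenging):
    """Sketch: keep the most confident and the most uncertain samples for rehearsal.

    probs: (N, num_classes) softmax outputs of the current model on task data.
    Returns indices of the selected exemplars.
    """
    confidence = probs.max(axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    confident_idx = np.argsort(-confidence)[:n_confident]    # easiest samples
    challenging_idx = np.argsort(-entropy)[:n_challenging]   # hardest samples
    return np.unique(np.concatenate([confident_idx, challenging_idx]))
```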

A Study of Age and Sex Bias in Multiple Instance Learning based Classification of Acute Myeloid Leukemia Subtypes

  • paper_url: http://arxiv.org/abs/2308.12675
  • repo_url: None
  • paper_authors: Ario Sadafi, Matthias Hehr, Nassir Navab, Carsten Marr
  • for: Investigating whether Acute Myeloid Leukemia (AML) subtype classification with Multiple Instance Learning (MIL) is affected by age and sex bias, to support reliable clinical decision-making and patient care
  • methods: Trains multiple MIL models using different levels of sex imbalance in the training set and excluding certain age groups, then evaluates performance on male and female test sets and on age groups underrepresented in the training data
  • results: Sex and age bias significantly affect AML subtype classification: females are more affected by sex-imbalanced training data, and certain age groups, such as patients aged 72 to 86 with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by age bias; ensuring inclusivity in the training data is therefore essential for reliable and equitable outcomes
    Abstract Accurate classification of Acute Myeloid Leukemia (AML) subtypes is crucial for clinical decision-making and patient care. In this study, we investigate the potential presence of age and sex bias in AML subtype classification using Multiple Instance Learning (MIL) architectures. To that end, we train multiple MIL models using different levels of sex imbalance in the training set and excluding certain age groups. To assess the sex bias, we evaluate the performance of the models on male and female test sets. For age bias, models are tested against underrepresented age groups in the training data. We find a significant effect of sex and age bias on the performance of the model for AML subtype classification. Specifically, we observe that females are more likely to be affected by sex imbalance dataset and certain age groups, such as patients with 72 to 86 years of age with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by an age bias present in the training data. Ensuring inclusivity in the training data is thus essential for generating reliable and equitable outcomes in AML genetic subtype classification, ultimately benefiting diverse patient populations.

Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition

  • paper_url: http://arxiv.org/abs/2308.12673
  • repo_url: None
  • paper_authors: Dimitrios Daskalakis, Nikolaos Gkalelis, Vasileios Mezaris
  • for: An unsupervised pre-training approach that improves the starting point and overall accuracy of bottom-up video event recognition models
  • methods: Masked Feature Modelling (MFM) uses a pretrained visual tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset; the pre-trained Graph Attention Network (GAT) block is then incorporated into the state-of-the-art bottom-up video event recognition architecture ViGAT
  • results: Experimental evaluations on the YLI-MED dataset demonstrate that MFM improves event recognition performance
    Abstract In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM utilizes a pretrained Visual Tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recognition architecture, ViGAT, to improve the model's starting point and overall accuracy. Experimental evaluations on the YLI-MED dataset demonstrate the effectiveness of MFM in improving event recognition performance.

An All Deep System for Badminton Game Analysis

  • paper_url: http://arxiv.org/abs/2308.12645
  • repo_url: None
  • paper_authors: Po-Yung Chou, Yu-Chun Lo, Bo-Zheng Xie, Cheng-Hung Lin, Yu-Yung Kao
  • for: automatic detection of events within badminton match videos, especially the shuttlecock
  • methods: modified TrackNet model and diverse data types to improve precision
  • results: score of 0.78 out of 1.0 in the challenge
    Abstract The CoachAI Badminton 2023 Track1 initiative aims to automatically detect events within badminton match videos. Detecting small objects, especially the shuttlecock, is of great importance and demands high precision within the challenge. Such detection is crucial for tasks like hit count, hitting time, and hitting location. However, even after revising the well-regarded shuttlecock detecting model, TrackNet, our object detection models still fall short of the desired accuracy. To address this issue, we've implemented various deep learning methods to tackle the problems arising from noisy detected data, leveraging diverse data types to improve precision. In this report, we detail the detection model modifications we've made and our approach to the 11 tasks. Notably, our system garnered a score of 0.78 out of 1.0 in the challenge.

Tag-Based Annotation for Avatar Face Creation

  • paper_url: http://arxiv.org/abs/2308.12642
  • repo_url: None
  • paper_authors: An Ngo, Daniel Phelps, Derrick Lai, Thanyared Wong, Lucas Mathias, Anish Shivamurthy, Mustafa Ajmal, Minghao Liu, James Davis
  • for: Automatically generating digital avatar faces from human images
  • methods: Uses tag-based annotation to train a model that produces avatars from human images; tags are designed for three facial features offered by Bitmoji
  • results: Tag-based annotation yields better annotator agreement, less noisy data, and higher quality model predictions
    Abstract Currently, digital avatars can be created manually using human images as reference. Systems such as Bitmoji are excellent producers of detailed avatar designs, with hundreds of choices for customization. A supervised learning model could be trained to generate avatars automatically, but the hundreds of possible options create difficulty in securing non-noisy data to train a model. As a solution, we train a model to produce avatars from human images using tag-based annotations. This method provides better annotator agreement, leading to less noisy data and higher quality model predictions. Our contribution is an application of tag-based annotation to train a model for avatar face creation. We design tags for 3 different facial features offered by Bitmoji, and train a model using tag-based annotation to predict the nose.

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2308.12609
  • repo_url: None
  • paper_authors: Songchun Zhang, Chunhui Zhao
  • for: Weakly supervised temporal action localization (WSTAL): localizing actions in untrimmed videos more accurately and efficiently using only video-level labels
  • methods: An end-to-end framework with a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module, which explore and exploit the contrast and consistency of cross-video action features to learn a more structured and compact embedding space, summarize and propagate representative cross-video action knowledge, and thereby reduce ambiguity in both classification learning and temporal localization
  • results: Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction show that the method outperforms state-of-the-art approaches and can be easily plugged into other WSTAL methods
    Abstract Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.

HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation

  • paper_url: http://arxiv.org/abs/2308.12608
  • repo_url: https://github.com/pipixin321/hr-pro
  • paper_authors: Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, Nong Sang
  • for: Point-supervised temporal action localization (PSTAL): improving label-efficient learning by exploiting the reliability of point annotations
  • methods: The Hierarchical Reliability Propagation (HR-Pro) framework, with two reliability-aware stages (snippet-level discrimination learning and instance-level completeness learning) that both propagate high-confidence cues from point annotations, using an online-updated memory of reliable snippet prototypes, a reliability-aware attention block, and point-based proposal generation
  • results: Multi-level reliability-aware learning yields more reliable confidence scores and more accurate temporal boundaries; HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an average mAP of 60.3% on THUMOS14
    Abstract Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet-level or the instance-level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning, both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representation. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro.

PoseSync: Robust pose based video synchronization

  • paper_url: http://arxiv.org/abs/2308.12600
  • repo_url: None
  • paper_authors: Rishit Javia, Falak Shah, Shivam Dave
  • for: 这篇论文提出了一个端到端管道,用于基于姿态的视频同步。
  • methods: 该管道先裁剪图像中包含人体的区域,再对裁剪后的图像进行姿态检测,最后对姿态关键点之间的角度/距离度量应用动态时间规整(DTW),从而得到一个对缩放和平移不变的姿态匹配管道。
  • results: 该管道可以帮助在多个领域,如游戏表现评估、编舞或导引运动员等,进行比较和评估人体动作。
    Abstract Pose based video sychronization can have applications in multiple domains such as gameplay performance evaluation, choreography or guiding athletes. The subject's actions could be compared and evaluated against those performed by professionals side by side. In this paper, we propose an end to end pipeline for synchronizing videos based on pose. The first step crops the region where the person present in the image followed by pose detection on the cropped image. This is followed by application of Dynamic Time Warping(DTW) on angle/ distance measures between the pose keypoints leading to a scale and shift invariant pose matching pipeline.
    摘要 基于姿态的视频同步可以应用于多个领域,如游戏表现评估、编舞或指导运动员,将拍摄对象的动作与专业人员的动作并排比较和评估。在这篇论文中,我们提出了一个基于姿态的端到端视频同步管道。首先裁剪图像中人物所在的区域,然后对裁剪后的图像进行姿态检测;接着对姿态关键点之间的角度/距离度量应用动态时间规整(DTW),从而得到一个对缩放和平移不变的姿态匹配管道。
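A minimal sketch of the scale- and shift-invariant matching step described above, assuming COCO-style 17-joint keypoints and a plain O(NM) DTW; the joint triplets and distance measure are illustrative, not the authors' exact implementation:

```python
import numpy as np

def joint_angles(keypoints, triplets):
    """Angles (radians) at the middle joint of each (a, b, c) triplet.
    Angles are invariant to scale and translation of the keypoints."""
    angles = []
    for a, b, c in triplets:
        v1 = keypoints[a] - keypoints[b]
        v2 = keypoints[c] - keypoints[b]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

def dtw(seq_a, seq_b):
    """Classic dynamic time warping over per-frame angle vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical triplets (indices into a 17-joint skeleton): elbow and knee angles.
TRIPLETS = [(5, 7, 9), (6, 8, 10), (11, 13, 15), (12, 14, 16)]

# poses_a, poses_b: lists of (17, 2) keypoint arrays from a pose detector.
# score = dtw([joint_angles(p, TRIPLETS) for p in poses_a],
#             [joint_angles(p, TRIPLETS) for p in poses_b])
```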

Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12595
  • repo_url: None
  • paper_authors: Chen Liang, Wenguan Wang, Jiaxu Miao, Yi Yang
  • for: 现有的半监督语义分割方法严重依赖伪标签(pseudo labeling)来弥补有限的标注数据,却忽略了语义概念之间宝贵的关系知识;本文旨在弥补这一不足,提高半监督语义分割的精度。
  • methods: 提出了 LogicDiag,一种基于神经逻辑学习框架的新方法,利用 pseudo label 中的冲突,通过逻辑检查和诊断,纠正 pseudo label,从而缓解 error accumulation 问题。
  • results: 在三个标准半监督语义分割基准上进行了广泛的实验,证明了 LogicDiag 的有效性和通用性。此外,LogicDiag 还展示了将符号推理系统地整合进主流的统计神经学习方法所带来的广阔前景。
    Abstract Recent advances in semi-supervised semantic segmentation have been heavily reliant on pseudo labeling to compensate for limited labeled data, disregarding the valuable relational knowledge among semantic concepts. To bridge this gap, we devise LogicDiag, a brand new neural-logic semi-supervised learning framework. Our key insight is that conflicts within pseudo labels, identified through symbolic knowledge, can serve as strong yet commonly ignored learning signals. LogicDiag resolves such conflicts via reasoning with logic-induced diagnoses, enabling the recovery of (potentially) erroneous pseudo labels, ultimately alleviating the notorious error accumulation problem. We showcase the practical application of LogicDiag in the data-hungry segmentation scenario, where we formalize the structured abstraction of semantic concepts as a set of logic rules. Extensive experiments on three standard semi-supervised semantic segmentation benchmarks demonstrate the effectiveness and generality of LogicDiag. Moreover, LogicDiag highlights the promising opportunities arising from the systematic integration of symbolic reasoning into the prevalent statistical, neural learning approaches.
    摘要 近期半监督语义分割领域的进步严重依赖伪标签来弥补有限的标注数据,却忽视了语义概念之间宝贵的关系知识。为弥补这一差距,我们提出了 LogicDiag,一种全新的神经-逻辑半监督学习框架。我们的关键发现是,借助符号知识识别出的伪标签内部冲突,可以作为强大却常被忽略的学习信号。LogicDiag 通过基于逻辑诊断的推理来解决这些冲突,从而恢复(可能)错误的伪标签,缓解臭名昭著的误差累积问题。我们在对数据需求极高的分割场景中展示了 LogicDiag 的实际应用,并将语义概念的结构化抽象形式化为一组逻辑规则。在三个标准半监督语义分割基准上的广泛实验证明了 LogicDiag 的有效性和通用性。此外,LogicDiag 还揭示了将符号推理系统地整合进主流的统计神经学习方法所带来的广阔机遇。

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects

  • paper_url: http://arxiv.org/abs/2308.12590
  • repo_url: None
  • paper_authors: Baowen Zhang, Jiahe Li, Xiaoming Deng, Yinda Zhang, Cuixia Ma, Hongan Wang
  • for: 针对可变形物体,学习带有稠密对应关系的3D形状表示。
  • methods: 提出了一种新的自监督方法,利用符号距离场学习神经隐式形状表示,无需骨架和蒙皮权重等先验。
  • results: 实验表明,该方法可以表示大幅变形的形状,并能支持纹理迁移和形状编辑等应用,性能具有竞争力。
    Abstract Learning 3D shape representation with dense correspondence for deformable objects is a fundamental problem in computer vision. Existing approaches often need additional annotations of specific semantic domain, e.g., skeleton poses for human bodies or animals, which require extra annotation effort and suffer from error accumulation, and they are limited to specific domain. In this paper, we propose a novel self-supervised approach to learn neural implicit shape representation for deformable objects, which can represent shapes with a template shape and dense correspondence in 3D. Our method does not require the priors of skeleton and skinning weight, and only requires a collection of shapes represented in signed distance fields. To handle the large deformation, we constrain the learned template shape in the same latent space with the training shapes, design a new formulation of local rigid constraint that enforces rigid transformation in local region and addresses local reflection issue, and present a new hierarchical rigid constraint to reduce the ambiguity due to the joint learning of template shape and correspondences. Extensive experiments show that our model can represent shapes with large deformations. We also show that our shape representation can support two typical applications, such as texture transfer and shape editing, with competitive performance. The code and models are available at https://iscas3dv.github.io/deformshape
    摘要 针对可变形物体学习带有稠密对应关系的3D形状表示是计算机视觉中的基本问题。现有方法通常需要特定语义领域的额外标注,例如人体或动物的骨架姿态,这不仅增加标注成本、容易产生误差累积,而且只适用于特定领域。在这篇论文中,我们提出了一种新的自监督方法,用于学习可变形物体的神经隐式形状表示,它可以用一个模板形状加上3D稠密对应来表示形状。我们的方法不需要骨架和蒙皮权重等先验,只需要一组以符号距离场表示的形状。为了处理大幅变形,我们将学习到的模板形状约束在与训练形状相同的潜在空间中,设计了一种新的局部刚性约束,在局部区域强制刚性变换并解决局部反射问题;此外,我们还提出了一种新的层次刚性约束,以减少模板形状与对应关系联合学习所带来的歧义。广泛的实验表明我们的模型可以表示大幅变形的形状。我们还展示了该形状表示可以支持纹理迁移和形状编辑这两类典型应用,并具有竞争力的性能。代码和模型可以在 https://iscas3dv.github.io/deformshape 获取。

Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2308.12587
  • repo_url: https://github.com/csir1996/vln-gela
  • paper_authors: Yibo Cui, Liang Xie, Yakun Zhang, Meishan Zhang, Ye Yan, Erwei Yin
  • for: 本研究的目的是解决视觉语言导航(VLN)中的跨模态Alignment问题。
  • methods: 我们提出了一种新的Grounded Entity-Landmark Adaptive(GELA)预训练方法,通过引入基于实体和Landmark的 annotated数据(GEL-R2R),并采用三种基于实体和Landmark的适应预训练目标来强制学习细致的跨模态Alignment。
  • results: 我们的 GELA 模型在两个下游任务(R2R 和 CVDN)上均取得了最先进的结果,证明了其有效性和通用性。
    Abstract Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). The comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
    摘要 跨模态对齐是视觉语言导航(VLN)中的一个关键挑战。大多数现有研究专注于将全局指令或单一子指令映射到相应的路径上,而另一个重要问题——在实体层面实现细粒度对齐——却很少被考虑。为了解决这个问题,我们为 VLN 任务提出了一种新的接地实体-地标自适应(GELA)预训练范式。为实现该范式,我们首先将接地的实体-地标人工标注引入 Room-to-Room(R2R)数据集,命名为 GEL-R2R;其次,我们采用三种接地实体-地标自适应预训练目标:1)实体短语预测,2)地标边界框预测,3)实体-地标语义对齐,显式地监督实体短语与环境地标之间细粒度跨模态对齐的学习。最后,我们在两个下游基准上验证模型:带描述性指令的 VLN(R2R)和对话式指令的 VLN(CVDN)。全面的实验表明,我们的 GELA 模型在两个任务上均取得了最先进的结果,证明了其有效性和通用性。

LORD: Leveraging Open-Set Recognition with Unknown Data

  • paper_url: http://arxiv.org/abs/2308.12584
  • repo_url: None
  • paper_authors: Tobias Koch, Christian Riess, Thomas Köhler
  • for: 这篇论文研究如何利用未知数据,使部署后的分类器能够更好地识别训练中未出现的类别。
  • methods: 这篇论文提出了名为 LORD 的框架,该框架在分类器训练过程中显式地建模开放空间,并提供了一套系统的评估方法。
  • results: 经过多种评估协议的测试,这篇论文表明了对未知数据识别性能的提升,并通过使用 mixup 作为数据生成技术,减轻了对大量且昂贵的背景数据的依赖。
    Abstract Handling entirely unknown data is a challenge for any deployed classifier. Classification models are typically trained on a static pre-defined dataset and are kept in the dark for the open unassigned feature space. As a result, they struggle to deal with out-of-distribution data during inference. Addressing this task on the class-level is termed open-set recognition (OSR). However, most OSR methods are inherently limited, as they train closed-set classifiers and only adapt the downstream predictions to OSR. This work presents LORD, a framework to Leverage Open-set Recognition by exploiting unknown Data. LORD explicitly models open space during classifier training and provides a systematic evaluation for such approaches. We identify three model-agnostic training strategies that exploit background data and applied them to well-established classifiers. Due to LORD's extensive evaluation protocol, we consistently demonstrate improved recognition of unknown data. The benchmarks facilitate in-depth analysis across various requirement levels. To mitigate dependency on extensive and costly background datasets, we explore mixup as an off-the-shelf data generation technique. Our experiments highlight mixup's effectiveness as a substitute for background datasets. Lightweight constraints on mixup synthesis further improve OSR performance.
    摘要 处理完全未知的数据是任何已部署分类器面临的挑战。分类模型通常在静态、预先定义的数据集上训练,对开放的未分配特征空间一无所知,因此在推理过程中难以处理分布外数据。在类别层面解决这一任务被称为开放集识别(OSR)。然而,大多数 OSR 方法本质上受限:它们训练闭集分类器,仅在下游预测阶段适配 OSR。本文提出了 LORD 框架,通过利用未知数据来增强开放集识别。LORD 在分类器训练过程中显式地建模开放空间,并为此类方法提供了系统的评估。我们确定了三种与模型无关、利用背景数据的训练策略,并将其应用到了成熟的分类器上。得益于 LORD 的全面评估协议,我们一致地展示了对未知数据识别能力的提升,这些基准也便于在不同需求水平上进行深入分析。为了减少对大量且昂贵的背景数据集的依赖,我们探索了 mixup 这一现成的数据生成技术。实验表明,mixup 可以有效替代背景数据集;对 mixup 合成施加轻量级约束还能进一步提升 OSR 性能。
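A minimal sketch of how mixup could stand in for a background dataset when training an open-set classifier, assuming a uniform soft label over known classes for the blended samples; the exact constraints LORD places on mixup synthesis are not reproduced here:

```python
import torch

def mixup_unknowns(images, labels, num_classes, alpha=1.0):
    """Blend pairs of known-class images to synthesize pseudo-unknown samples.
    Giving the blends a uniform soft label (an assumed convention) teaches the
    classifier to be uncertain on open-space inputs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = min(max(lam, 0.3), 0.7)          # keep blends ambiguous: a lightweight constraint
    perm = torch.randperm(images.size(0))
    keep = labels != labels[perm]          # only mix across different classes
    mixed = lam * images[keep] + (1.0 - lam) * images[perm][keep]
    soft_unknown = torch.full((mixed.size(0), num_classes), 1.0 / num_classes)
    return mixed, soft_unknown
```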

StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.12570
  • repo_url: None
  • paper_authors: Tianyuan Yuan, Yicheng Liu, Yue Wang, Yilun Wang, Hang Zhao
  • for: 高精(HD)地图是自动驾驶系统的关键组成部分。StreamMapNet 提供了一种新的在线地图构建管线,能够利用长时序信息,提高了稳定性和性能。
  • methods: StreamMapNet 使用多点注意力和时间信息来建立大范围的本地高清晰地图,并且可以处理复杂的场景,如 occlusion。
  • results: StreamMapNet 在所有设置下都与现有方法进行比较,表现出色,并且可以在 $14.2$ FPS 的在线推理速度下保持稳定性和高性能。
    Abstract High-Definition (HD) maps are essential for the safety of autonomous driving systems. While existing techniques employ camera images and onboard sensors to generate vectorized high-precision maps, they are constrained by their reliance on single-frame input. This approach limits their stability and performance in complex scenarios such as occlusions, largely due to the absence of temporal information. Moreover, their performance diminishes when applied to broader perception ranges. In this paper, we present StreamMapNet, a novel online mapping pipeline adept at long-sequence temporal modeling of videos. StreamMapNet employs multi-point attention and temporal information which empowers the construction of large-range local HD maps with high stability and further addresses the limitations of existing methods. Furthermore, we critically examine widely used online HD Map construction benchmark and datasets, Argoverse2 and nuScenes, revealing significant bias in the existing evaluation protocols. We propose to resplit the benchmarks according to geographical spans, promoting fair and precise evaluations. Experimental results validate that StreamMapNet significantly outperforms existing methods across all settings while maintaining an online inference speed of $14.2$ FPS.
    摘要 高清定义(HD)地图是自动驾驶系统的关键。现有技术使用摄像头图像和车辆上的感知器来生成 вектор化高精度地图,但这些技术受到单帧输入的限制,导致它们在复杂的情况下表现不稳定,主要是因为缺乏时间信息。此外,它们在扩大观察范围时表现下降。在这篇论文中,我们提出了StreamMapNet,一种新的在线地图生成管道,可以长时间序列模型视频。StreamMapNet使用多点注意力和时间信息,使得在大范围本地高清定义地图的建构中具有高稳定性,并解决了现有方法的局限性。此外,我们严格检查了 Argoverse2 和 nuScenes 等在线 HD 地图建构标准和数据集,发现这些标准存在偏见。我们提议将标准按地理范围重新分割,以便更公正和精确的评估。实验结果表明,StreamMapNet 在所有设置下与现有方法进行比较,并且保持在线推理速度为14.2帧/秒。

NOVA: NOvel View Augmentation for Neural Composition of Dynamic Objects

  • paper_url: http://arxiv.org/abs/2308.12560
  • repo_url: https://github.com/dakshitagrawal/nova
  • paper_authors: Dakshit Agrawal, Jiajie Xu, Siva Karthik Mustikovela, Ioannis Gkioulekas, Ashish Shrivastava, Yuning Chai
  • for: trains NeRFs for photo-realistic 3D composition of dynamic objects in a static scene
  • methods: uses a novel-view augmentation (NOVA) strategy
  • results: reduces blending artifacts, achieves comparable PSNR without additional ground truth modalities, and provides ease, flexibility, and scalability in neural composition.
    Abstract We propose a novel-view augmentation (NOVA) strategy to train NeRFs for photo-realistic 3D composition of dynamic objects in a static scene. Compared to prior work, our framework significantly reduces blending artifacts when inserting multiple dynamic objects into a 3D scene at novel views and times; achieves comparable PSNR without the need for additional ground truth modalities like optical flow; and overall provides ease, flexibility, and scalability in neural composition. Our codebase is on GitHub.
    摘要 我们提出了一种新视图增强策略(NOVA),用于训练 NeRF,在静态场景中对动态对象进行照片级真实感的3D合成。与先前的工作相比,我们的框架在新视角和新时刻插入多个动态对象时显著减少了融合伪影;在不需要光流等额外真值模态的情况下达到了相当的 PSNR;并且整体上为神经合成提供了易用性、灵活性和可扩展性。我们的代码库已在 GitHub 上公开。

Hyperbolic Audio-visual Zero-shot Learning

  • paper_url: http://arxiv.org/abs/2308.12558
  • repo_url: None
  • paper_authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson
  • for: 这篇论文的目的是探讨利用双曲变换实现零样本学习,以便更好地处理具有复杂层次结构的数据。
  • methods: 该方法使用了一种新的损失函数,在双曲空间中对视频和音频特征进行跨模态对齐;此外,还探索了使用多个自适应曲率进行双曲投影。
  • results: 实验结果表明,我们提出的双曲方法在三个数据集(VGGSound-GZSL、UCF-GZSL 和 ActivityNet-GZSL)上相对于现有最佳方法,调和均值(HM)分别提升了约3.0%、7.0%和5.3%。
    Abstract Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. The experimental results on this very challenging task demonstrate that our proposed hyperbolic approach for zero-shot learning outperforms the SOTA method on three datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL achieving a harmonic mean (HM) improvement of around 3.0%, 7.0%, and 5.3%, respectively.
    摘要 视听零样本学习的目标是对由对应的音频和视频序列组成的样本进行分类,而这些类别在训练时并未出现。对视听数据的分析显示出很高的双曲性,这表明使用双曲变换进行曲率感知的几何学习可能带来收益,从而为该任务探索更复杂的层次化数据结构。所提出的方法采用了一种新的损失函数,在双曲空间中对视频和音频特征进行跨模态对齐;此外,我们还探索了使用多个自适应曲率进行双曲投影。在这一极具挑战性的任务上的实验结果表明,我们提出的双曲零样本学习方法在三个数据集(VGGSound-GZSL、UCF-GZSL 和 ActivityNet-GZSL)上超越了最先进方法,调和均值(HM)分别提升了约3.0%、7.0%和5.3%。
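For reference, a minimal sketch of projecting Euclidean features onto a Poincaré ball and measuring their geodesic distance with a learnable curvature; the exponential map and distance formulas are standard, while the alignment loss and curvature handling below are assumptions rather than the paper's exact design:

```python
import torch

def expmap0(x, c):
    """Exponential map at the origin of a Poincare ball with curvature c > 0."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(torch.sqrt(c) * norm) * x / (torch.sqrt(c) * norm)

def poincare_dist(u, v, c):
    """Geodesic distance between points u, v inside the ball of curvature c."""
    sqrt_c = torch.sqrt(c)
    diff2 = (u - v).pow(2).sum(-1)
    denom = (1 - c * u.pow(2).sum(-1)) * (1 - c * v.pow(2).sum(-1))
    arg = 1 + 2 * c * diff2 / denom.clamp_min(1e-7)
    return (2.0 / sqrt_c) * torch.acosh(arg.clamp_min(1.0 + 1e-7))

# Hypothetical cross-modal alignment: project video and audio embeddings with a
# learnable curvature and penalize their hyperbolic distance.
c = torch.tensor(1.0, requires_grad=True)
video_feat = torch.randn(8, 128)
audio_feat = torch.randn(8, 128)
align_loss = poincare_dist(expmap0(video_feat, c), expmap0(audio_feat, c), c).mean()
```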

Hybrid Models for Facial Emotion Recognition in Children

  • paper_url: http://arxiv.org/abs/2308.12547
  • repo_url: None
  • paper_authors: Rafael Zimmer, Marcos Sobral, Helio Azevedo
  • for: 这项研究旨在利用情绪识别技术,帮助儿童心理师在远程机器人操作的治疗会话中为儿童提供治疗。
  • methods: 该研究使用具身会话代理(Embodied Conversational Agents, ECA)作为中介工具,帮助专业人员与儿童交流,特别是患有注意缺陷多动障碍(ADHD)、自闭症谱系障碍(ASD),或因战争、自然灾害等原因无法进行面对面会谈的儿童。情绪识别技术作为反馈工具,能够帮助心理师更好地了解儿童的情绪状态。
  • results: 该研究首先对儿童情绪识别领域进行了文献综述,梳理了社区广泛使用的算法和数据集;随后利用稠密光流特征(dense optical flow features)提高了非受控环境下儿童情绪识别的能力。所提出的 HybridCNNFusion 模型在卷积神经网络的基础上,将两路中间特征融合后再送入最终分类器。最后,该研究使用巴西儿童的数据集给出了情绪识别的初步结果。
    Abstract This paper focuses on the use of emotion recognition techniques to assist psychologists in performing children's therapy through remotely robot operated sessions. In the field of psychology, the use of agent-mediated therapy is growing increasingly given recent advances in robotics and computer science. Specifically, the use of Embodied Conversational Agents (ECA) as an intermediary tool can help professionals connect with children who face social challenges such as Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorder (ASD) or even who are physically unavailable due to being in regions of armed conflict, natural disasters, or other circumstances. In this context, emotion recognition represents an important feedback for the psychotherapist. In this article, we initially present the result of a bibliographical research associated with emotion recognition in children. This research revealed an initial overview on algorithms and datasets widely used by the community. Then, based on the analysis carried out on the results of the bibliographical research, we used the technique of dense optical flow features to improve the ability of identifying emotions in children in uncontrolled environments. From the output of a hybrid model of Convolutional Neural Network, two intermediary features are fused before being processed by a final classifier. The proposed architecture was called HybridCNNFusion. Finally, we present the initial results achieved in the recognition of children's emotions using a dataset of Brazilian children.
    摘要 In this article, we first present the results of a bibliographical research on emotion recognition in children. This research provided an initial overview of the algorithms and datasets commonly used by the community. Based on the analysis of the results, we improved the ability to identify emotions in children in uncontrolled environments using the technique of dense optical flow features. A hybrid model of Convolutional Neural Network (CNN) was used, which fused two intermediary features before being processed by a final classifier. The proposed architecture was called HybridCNNFusion. Finally, we present the initial results achieved in recognizing children's emotions using a dataset of Brazilian children.
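A minimal sketch of extracting dense optical flow features between consecutive frames with OpenCV's Farneback method, as one plausible way to feed motion information into the hybrid CNN; the parameter values and the magnitude/angle encoding are illustrative:

```python
import cv2
import numpy as np

def dense_flow_features(prev_bgr, next_bgr):
    """Dense optical flow (Farneback) between two consecutive face crops,
    returned as a 2-channel (magnitude, angle) map that can be stacked with
    RGB frames as extra input channels for the CNN."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.stack([mag, ang], axis=-1).astype(np.float32)
```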

Mutual-Guided Dynamic Network for Image Fusion

  • paper_url: http://arxiv.org/abs/2308.12538
  • repo_url: https://github.com/guanys-dar/mgdn
  • paper_authors: Yuanshen Guan, Ruikang Xu, Mingde Yao, Lizhi Wang, Zhiwei Xiong
  • for: This paper proposes a novel mutual-guided dynamic network (MGDN) for image fusion, which aims to generate high-quality images from multiple inputs captured under varying conditions.
  • methods: The proposed MGDN utilizes a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, which incorporates additional guidance from different inputs and generates spatial-variant kernels for different locations. Additionally, a parallel feature fusion (PFF) module is introduced to effectively fuse local and global information of the extracted features.
  • results: Experimental results on five benchmark datasets demonstrate that the proposed MGDN outperforms existing methods on four image fusion tasks, showcasing its effectiveness in preserving complementary information while filtering out irrelevant information for the fused result.
    Abstract Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem by leveraging static convolutional neural networks (CNNs), suffering two inherent limitations during feature extraction, i.e., being unable to handle spatial-variant contents and lacking guidance from multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from different inputs and the latter generates spatial-variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse local and global information of the extracted features. To further reduce the redundancy among the extracted features while simultaneously preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at: https://github.com/Guanys-dar/MGDN.
    摘要 图像融合的目标是从在不同条件下拍摄的多张图像中生成一张高质量图像。该任务的关键问题是在融合结果中保留互补信息并过滤无关信息。然而,现有方法通常利用静态卷积神经网络(CNN)来解决这一问题,在特征提取时存在两个内在的局限:无法处理空间变化的内容,以及缺乏来自多个输入的引导。在这篇论文中,我们提出了一种新的互导动态网络(MGDN)用于图像融合,能够在不同位置和不同输入之间有效地利用信息。具体来说,我们设计了一个互导动态滤波器(MGDF)用于自适应特征提取,它由互导交叉注意力(MGCA)模块和动态滤波预测器组成:前者引入来自不同输入的额外引导,后者为不同位置生成空间变化的卷积核。此外,我们引入了并行特征融合(PFF)模块,以有效融合所提取特征的局部和全局信息。为了进一步减少所提取特征之间的冗余、同时保留它们共享的结构信息,我们设计了一种新的损失函数,将归一化互信息(NMI)的最小化与估计的梯度掩码相结合。在五个基准数据集上的实验结果表明,我们提出的方法在四个图像融合任务上优于现有方法。代码和模型已公开:https://github.com/Guanys-dar/MGDN 。

HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks

  • paper_url: http://arxiv.org/abs/2308.12537
  • repo_url: https://github.com/dzcgaara/HuBo-VLM
  • paper_authors: Zichao Dong, Weikun Zhang, Xufeng Huang, Hang Ji, Xin Zhan, Junbo Chen
  • for: 这篇论文旨在提出一种基于 Transformer 的视觉语言模型,用于人机交互,帮助机器人理解人类的自然语言指令并完成相关任务。
  • methods: 该论文提出了一种基于 transformer 视觉语言模型的人机交互模型,包括对象检测和视觉定位。
  • results: Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of the proposed approach.
    Abstract Human robot interaction is an exciting task, which aimed to guide robots following instructions from human. Since huge gap lies between human natural language and machine codes, end to end human robot interaction models is fair challenging. Further, visual information receiving from sensors of robot is also a hard language for robot to perceive. In this work, HuBo-VLM is proposed to tackle perception tasks associated with human robot interaction including object detection and visual grounding by a unified transformer based vision language model. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. Code would be publicly available in https://github.com/dzcgaara/HuBo-VLM.
    摘要 人机交互是一项有趣的任务,旨在引导机器人按照人类的指令行动。由于人类自然语言与机器代码之间存在巨大差距,构建端到端的人机交互模型相当困难;此外,机器人传感器接收到的视觉信息对机器人而言也是一种难以理解的“语言”。在这项工作中,我们提出 HuBo-VLM,利用统一的基于 Transformer 的视觉语言模型来解决人机交互相关的感知任务,包括目标检测和视觉定位。在 Talk2Car 基准上的大量实验证明了我们方法的有效性。代码将在 https://github.com/dzcgaara/HuBo-VLM 上公开。

SCP: Spherical-Coordinate-based Learned Point Cloud Compression

  • paper_url: http://arxiv.org/abs/2308.12535
  • repo_url: https://github.com/luoao-kddi/SCP
  • paper_authors: Ao Luo, Linxin Song, Keisuke Nonaka, Kyohei Unno, Heming Sun, Masayuki Goto, Jiro Katto
  • for: 本研究针对学习式点云压缩,特别是具有圆形结构和方位角不变性特征的旋转 LiDAR 点云。
  • methods: 该方法基于Spherical-Coordinate-based learned Point cloud compression (SCP),利用了上述特征,并提出了多级Octree来降低远区域重建误差。
  • results: 实验结果显示,SCP比前一代方法提高了29.14%的点到点PSNR BD-Rate。
    Abstract In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, the spinning LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to leverage the aforementioned features fully. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
    摘要 近年来,学习式点云压缩任务日益受到关注。旋转激光雷达点云是一类重要的点云,由车辆上旋转的激光雷达生成,这一过程使点云中呈现出大量圆形结构和方位角不变性特征。然而,这两个特征在此前的方法中大多被忽略。在这篇论文中,我们提出了一种与模型无关的方法——基于球坐标的学习式点云压缩(SCP),旨在充分利用上述特征。此外,我们为 SCP 提出了一种多级八叉树,以减轻球坐标八叉树中远距离区域的重建误差。SCP 具有良好的通用性,可以应用于多种学习式点云压缩技术。实验结果表明,SCP 在点到点 PSNR BD-Rate 上比此前的最先进方法最多提升了29.14%。
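A minimal sketch of the Cartesian-to-spherical transform that underlies a spherical-coordinate representation of spinning LiDAR points; the subsequent octree construction and learned entropy coding are not shown:

```python
import numpy as np

def cartesian_to_spherical(xyz):
    """Map LiDAR points (x, y, z) to (radius, azimuth, elevation).
    Spinning LiDAR sweeps become near-regular grids in azimuth, which is the
    circular / azimuth-invariant structure a spherical-coordinate codec exploits."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                       # in [-pi, pi)
    elevation = np.arcsin(z / np.clip(r, 1e-9, None))
    return np.stack([r, azimuth, elevation], axis=1)

def spherical_to_cartesian(rae):
    """Inverse transform used after decoding."""
    r, az, el = rae[:, 0], rae[:, 1], rae[:, 2]
    return np.stack([r * np.cos(el) * np.cos(az),
                     r * np.cos(el) * np.sin(az),
                     r * np.sin(el)], axis=1)
```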

Channel and Spatial Relation-Propagation Network for RGB-Thermal Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12534
  • repo_url: None
  • paper_authors: Zikun Zhou, Shukun Wu, Guoqing Zhu, Hongpeng Wang, Zhenyu He
  • for: 这个论文的目的是提出一个 Channel and Spatial Relation-Propagation Network (CSRPNet),用于RGB-T semantic segmentation,以利用两 modalities 之间的共同特征来提高 semantic segmentation 的精度。
  • methods: 这篇论文使用名为 Channel and Spatial Relation-Propagation Network (CSRPNet) 的网络,首先在通道和空间维度上进行关系传播,以捕捉两种模态之间共享的特征;随后,将从一种模态捕捉到的模态共享特征与另一种模态的输入特征聚合,在不引入模态特有信息污染的前提下增强输入特征。
  • results: 实验结果显示,CSRPNet 可以与现有的方法相比,在RGB-T semantic segmentation 中表现出色。
    Abstract RGB-Thermal (RGB-T) semantic segmentation has shown great potential in handling low-light conditions where RGB-based segmentation is hindered by poor RGB imaging quality. The key to RGB-T semantic segmentation is to effectively leverage the complementarity nature of RGB and thermal images. Most existing algorithms fuse RGB and thermal information in feature space via concatenation, element-wise summation, or attention operations in either unidirectional enhancement or bidirectional aggregation manners. However, they usually overlook the modality gap between RGB and thermal images during feature fusion, resulting in modality-specific information from one modality contaminating the other. In this paper, we propose a Channel and Spatial Relation-Propagation Network (CSRPNet) for RGB-T semantic segmentation, which propagates only modality-shared information across different modalities and alleviates the modality-specific information contamination issue. Our CSRPNet first performs relation-propagation in channel and spatial dimensions to capture the modality-shared features from the RGB and thermal features. CSRPNet then aggregates the modality-shared features captured from one modality with the input feature from the other modality to enhance the input feature without the contamination issue. While being fused together, the enhanced RGB and thermal features will be also fed into the subsequent RGB or thermal feature extraction layers for interactive feature fusion, respectively. We also introduce a dual-path cascaded feature refinement module that aggregates multi-layer features to produce two refined features for semantic and boundary prediction. Extensive experimental results demonstrate that CSRPNet performs favorably against state-of-the-art algorithms.

SieveNet: Selecting Point-Based Features for Mesh Networks

  • paper_url: http://arxiv.org/abs/2308.12530
  • repo_url: https://github.com/sievenet/sievenet.github.io
  • paper_authors: Shengchao Yuan, Yishun Dou, Rui Shi, Bingbing Ni, Zhong Zheng
  • for: 网格在3D计算机视觉和图形领域应用广泛,但其不规则拓扑限制了在现有神经网络架构中的应用;本文旨在解决这一问题。
  • methods: 提出了一种新范式 SieveNet,同时利用重网格化(remeshing)得到的结构化网格拓扑,以及在原始网格表面上进行失真感知点采样所获得的精确几何信息,从而兼顾规则结构与准确几何。
  • results: 在分类和分割任务上的大量实验表明,所提出的 SieveNet 方法有效且具有优势,并且无需手工特征工程。
    Abstract Meshes are widely used in 3D computer vision and graphics, but their irregular topology poses challenges in applying them to existing neural network architectures. Recent advances in mesh neural networks turn to remeshing and push the boundary of pioneer methods that solely take the raw meshes as input. Although the remeshing offers a regular topology that significantly facilitates the design of mesh network architectures, features extracted from such remeshed proxies may struggle to retain the underlying geometry faithfully, limiting the subsequent neural network's capacity. To address this issue, we propose SieveNet, a novel paradigm that takes into account both the regular topology and the exact geometry. Specifically, this method utilizes structured mesh topology from remeshing and accurate geometric information from distortion-aware point sampling on the surface of the original mesh. Furthermore, our method eliminates the need for hand-crafted feature engineering and can leverage off-the-shelf network architectures such as the vision transformer. Comprehensive experimental results on classification and segmentation tasks well demonstrate the effectiveness and superiority of our method.
    摘要 网格广泛应用于3D计算机视觉和图形领域,但其不规则拓扑给现有神经网络架构的应用带来了挑战。近期的网格神经网络转向重网格化,突破了仅以原始网格为输入的早期方法的局限。尽管重网格化提供的规则拓扑极大地方便了网格网络架构的设计,但从这些重网格化代理中提取的特征可能难以忠实地保留底层几何,限制了后续神经网络的能力。为了解决这一问题,我们提出了 SieveNet,一种同时兼顾规则拓扑和精确几何的新范式。具体而言,该方法利用重网格化得到的结构化网格拓扑,以及在原始网格表面进行失真感知点采样所获得的精确几何信息。此外,我们的方法无需手工特征工程,并可直接利用视觉 Transformer 等现成的网络架构。在分类和分割任务上的全面实验充分证明了该方法的有效性和优越性。

Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

  • paper_url: http://arxiv.org/abs/2308.12522
  • repo_url: None
  • paper_authors: Siming Fu, Xiaoxuan He, Xinpeng Ding, Yuchen Cao, Hualiang Wang
  • for: 这项研究旨在解决长尾识别任务中的类别不均衡问题:当训练数据类别分布失衡时,特征空间会被扭曲,模型会偏向头部类别。
  • methods: 我们提出了一个均匀类别原型引导的视觉语言框架:先生成一组均匀分布在超球面上的类别原型,再引导不同类别的特征向这些彼此区分且均匀分布的原型收敛,使特征空间的分布趋于均匀;此外,我们还提出了无关文本过滤与属性增强模块,让模型忽略无关的噪声文本,更加关注关键属性信息。
  • results: 我们的方法在长尾识别任务上大幅超越了此前的视觉语言方法,在识别精度上较先前方法提升了约20%,同时在尾部类别上保持了较高的稳定性。
    Abstract Recently, large-scale pre-trained vision-language models have presented benefits for alleviating class imbalance in long-tailed recognition. However, the long-tailed data distribution can corrupt the representation space, where the distance between head and tail categories is much larger than the distance between two tail categories. This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance. To address these challenges, we propose the uniformly category prototype-guided vision-language framework to effectively mitigate feature space bias caused by data imbalance. Especially, we generate a set of category prototypes uniformly distributed on a hypersphere. Category prototype-guided mechanism for image-text matching makes the features of different classes converge to these distinct and uniformly distributed category prototypes, which maintain a uniform distribution in the feature space, and improve class boundaries. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore irrelevant noisy text and focus more on key attribute information, thereby enhancing the robustness of our framework. In the image recognition fine-tuning stage, to address the positive bias problem of the learnable classifier, we design the class feature prototype-guided classifier, which compensates for the performance of tail classes while maintaining the performance of head classes. Our method outperforms previous vision-language methods for long-tailed learning work by a large margin and achieves state-of-the-art performance.
    摘要 近期,大规模预训练视觉语言模型已经显示出对长尾识别中类别不均衡问题的缓解效果。然而,长尾数据分布会破坏表征空间:头部类别与尾部类别之间的距离远大于两个尾部类别之间的距离。这种不均匀的特征空间分布导致模型在均匀分布的测试集上表现出模糊、难以区分的决策边界,从而降低性能。为了解决这些挑战,我们提出了均匀类别原型引导的视觉语言框架,以有效缓解数据不均衡带来的特征空间偏差。具体来说,我们生成了一组均匀分布在超球面上的类别原型。类别原型引导的图文匹配机制使不同类别的特征收敛到这些彼此区分且均匀分布的类别原型上,从而在特征空间中保持均匀分布并改善类别边界。此外,我们提出的无关文本过滤与属性增强模块使模型能够忽略无关的噪声文本,更加关注关键属性信息,从而提升框架的鲁棒性。在图像识别微调阶段,为了解决可学习分类器的正向偏差问题,我们设计了类别特征原型引导的分类器,在保持头部类别性能的同时补偿尾部类别的性能。我们的方法在长尾学习任务上大幅超越了此前的视觉语言方法,达到了最先进的性能。
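A minimal sketch of one way to generate near-uniformly distributed class prototypes on the unit hypersphere, by minimizing the largest pairwise cosine similarity; the optimization objective and hyperparameters are assumptions, not the paper's exact recipe:

```python
import torch

def uniform_prototypes(num_classes, dim, steps=2000, lr=0.1):
    """Optimize unit vectors so classes are spread as evenly as possible on the
    hypersphere (push down the largest pairwise cosine similarity)."""
    protos = torch.nn.functional.normalize(torch.randn(num_classes, dim), dim=1)
    protos.requires_grad_(True)
    opt = torch.optim.SGD([protos], lr=lr)
    for _ in range(steps):
        p = torch.nn.functional.normalize(protos, dim=1)
        sim = p @ p.t() - 2.0 * torch.eye(num_classes)   # mask out self-similarity
        loss = sim.max(dim=1).values.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.nn.functional.normalize(protos.detach(), dim=1)

# Features of each class can then be pulled toward their assigned prototype,
# keeping the overall feature distribution uniform on the sphere.
```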

Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.12509
  • repo_url: None
  • paper_authors: Yuan Yuan, Yang Zhan, Zhitong Xiong
  • for: 本研究旨在提出一种高效且实用的视觉-语言迁移学习方法,以便在实际应用中处理不断更新的大规模遥感数据。
  • methods: 该研究以预训练的 CLIP 模型为基础,设计了一个多模态遥感适配器,以及一个混合多模态对比(HMMC)学习目标;针对遥感数据中模态内相似度过高的问题,我们还提出了一种简单而有效的 HMMC 损失函数。
  • results: 研究表明,参数高效迁移学习(PETL)方法可以有效地将视觉-语言知识从自然图像领域迁移到遥感领域,并大幅降低训练成本与环境开销。我们的模型仅含 0.16M 可训练参数,相比全量微调实现了 98.9% 的参数缩减;检索性能超越传统方法 7-13%,并达到与全量微调相当或更优的水平。
    摘要 Recently, vision-and-language pre-training (VLP) 模型在不同领域中得到了广泛的应用。通过特定数据集的精细调整,VLP模型在各种任务中表现出了显著的性能提升。然而,全量调整VLP模型不仅需要巨量的计算资源,还会对环境产生巨大的影响。此外,随着Remote Sensing(RS)数据不断更新,全量调整可能无法适应实际应用中的需求。为此,本文提出了参数有效传播学习(PETL)方法,以有效地和高效地将视觉语言知识从自然领域传播到RS领域中的图文检索任务上。为此,我们做了以下贡献:1. 我们建立了一个新的和复杂的PETL框架 дляRS图文检索任务,包括预训练的CLIP模型、多模态RS适配器和混合多模态对比(HMMC)学习目标;2. 为RS数据中高内模态相似性问题而设计了一个简单 yet effective的HMMC损失函数;3. 我们提供了RS图文检索的广泛的实验研究。我们的结果表明,我们提出的方法具有扎实的推荐和实际应用的潜在价值。4. 我们对现有的PETL方法进行了广泛的比较研究,并发现我们的提出的模型只需0.16M参数进行训练,相比涵盖所有模型,可以实现参数减少98.9%,减少训练成本。我们的检索性能高于传统方法7-13%,并且与全量调整的性能相当或更好。本文可以提供新的想法和有用的意见 дляRS视觉语言任务。
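A minimal sketch of the general parameter-efficient idea: freeze the pretrained backbone and train only small residual adapters; the paper's multimodal remote sensing adapter and HMMC objective are more elaborate than this illustrative bottleneck module:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Tiny residual adapter: only these weights are trained, while the
    pretrained vision-language backbone stays frozen."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        hidden = max(dim // reduction, 8)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Usage sketch (hypothetical names): freeze the backbone, train only adapters.
# for p in clip_model.parameters():
#     p.requires_grad = False
# adapter = BottleneckAdapter(dim=512)   # 512 matches a CLIP ViT-B/32 embedding
```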

FFEINR: Flow Feature-Enhanced Implicit Neural Representation for Spatio-temporal Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.12508
  • repo_url: None
  • paper_authors: Chenyue Jiao, Chongke Bi, Lu Yang
  • for: 提高流体动力学数据的空间和时间分辨率
  • methods: 基于特征增强的隐式神经表示(FFEINR),采用带周期激活函数的全连接网络,并在输入层引入特征增强。
  • results: 取得了优于三线性插值方法的结果。
    Abstract Large-scale numerical simulations are capable of generating data up to terabytes or even petabytes. As a promising method of data reduction, super-resolution (SR) has been widely studied in the scientific visualization community. However, most of them are based on deep convolutional neural networks (CNNs) or generative adversarial networks (GANs) and the scale factor needs to be determined before constructing the network. As a result, a single training session only supports a fixed factor and has poor generalization ability. To address these problems, this paper proposes a Feature-Enhanced Implicit Neural Representation (FFEINR) for spatio-temporal super-resolution of flow field data. It can take full advantage of the implicit neural representation in terms of model structure and sampling resolution. The neural representation is based on a fully connected network with periodic activation functions, which enables us to obtain lightweight models. The learned continuous representation can decode the low-resolution flow field input data to arbitrary spatial and temporal resolutions, allowing for flexible upsampling. The training process of FFEINR is facilitated by introducing feature enhancements for the input layer, which complements the contextual information of the flow field.To demonstrate the effectiveness of the proposed method, a series of experiments are conducted on different datasets by setting different hyperparameters. The results show that FFEINR achieves significantly better results than the trilinear interpolation method.
    摘要 大规模数值模拟可以产生 TB 级甚至 PB 级的数据。作为一种有前景的数据缩减方法,超分辨率(SR)在科学可视化领域得到了广泛研究。然而,大多数方法基于深度卷积神经网络(CNN)或生成对抗网络(GAN),并且需要在构建网络之前确定放大倍数,这意味着一次训练只支持固定的倍数,泛化能力较差。为了解决这些问题,本文提出了特征增强的隐式神经表示(FFEINR),用于流场数据的时空超分辨率。它能够在模型结构和采样分辨率方面充分发挥隐式神经表示的优势。该神经表示基于带周期激活函数的全连接网络,因而模型十分轻量。学习到的连续表示可以将低分辨率流场输入解码到任意的空间和时间分辨率,从而实现灵活的上采样。通过在输入层引入特征增强来补充流场的上下文信息,FFEINR 的训练过程也得以简化。为了验证所提方法的有效性,我们在不同数据集上设置不同超参数进行了一系列实验。结果表明,FFEINR 取得了明显优于三线性插值方法的结果。
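A minimal sketch of a fully connected network with periodic (sine) activations that maps continuous space-time coordinates to flow values, in the spirit of the implicit representation described above; the initialization follows the published SIREN recipe, and the feature-enhancement module is omitted:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(omega * x), initialized per the SIREN recipe."""
    def __init__(self, in_dim, out_dim, omega=30.0, first=False):
        super().__init__()
        self.omega = omega
        self.linear = nn.Linear(in_dim, out_dim)
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / omega
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.omega * self.linear(x))

class FlowINR(nn.Module):
    """Continuous representation: (x, y, z, t) -> flow value, queryable at
    arbitrary spatial and temporal resolution for flexible upsampling."""
    def __init__(self, in_dim=4, hidden=256, out_dim=3, depth=4):
        super().__init__()
        layers = [SineLayer(in_dim, hidden, first=True)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 1)]
        self.net = nn.Sequential(*layers, nn.Linear(hidden, out_dim))

    def forward(self, coords):
        return self.net(coords)
```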

DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition

  • paper_url: http://arxiv.org/abs/2308.12501
  • repo_url: https://github.com/shiyin-lc/dd-gcn
  • paper_authors: Chang Li, Qian Huang, Yingchi Mao
  • for: 这篇论文是为了提高skeleton-based human action recognition中的Graph Convolutional Networks(GCNs)性能而写的。
  • methods: 该论文使用有向扩散图卷积网络(DD-GCN):构建有向扩散图进行动作建模,引入活动分区策略来优化图卷积核的权重共享机制;此外,还提出了时空同步编码器来嵌入同步的时空语义。
  • results: 实验结果表明,该方法在三个公共数据集(NTU-RGB+D、NTU-RGB+D 120、NW-UCLA)上达到了当前最佳性能。
    Abstract Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition. In GCN-based methods, the spatio-temporal graph is fundamental for capturing motion patterns. However, existing approaches ignore the physical dependency and synchronized spatio-temporal correlations between joints, which limits the representation capability of GCNs. To solve these problems, we construct the directed diffusion graph for action modeling and introduce the activity partition strategy to optimize the weight sharing mechanism of graph convolution kernels. In addition, we present the spatio-temporal synchronization encoder to embed synchronized spatio-temporal semantics. Finally, we propose Directed Diffusion Graph Convolutional Network (DD-GCN) for action recognition, and the experiments on three public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA, demonstrate the state-of-the-art performance of our method.
    摘要 图卷积网络(GCN)在基于骨架的人体动作识别中得到了广泛应用。在基于 GCN 的方法中,时空图是捕捉运动模式的关键。然而,现有方法忽略了关节之间的物理依赖性和同步的时空相关性,限制了 GCN 的表示能力。为解决这些问题,我们构建了用于动作建模的有向扩散图,并引入活动分区策略来优化图卷积核的权重共享机制。此外,我们提出了时空同步编码器,以嵌入同步的时空语义。最后,我们提出了用于动作识别的有向扩散图卷积网络(DD-GCN),并在三个公共数据集(NTU-RGB+D、NTU-RGB+D 120、NW-UCLA)上进行了实验,结果表明其达到了当前最佳性能。

MOFA: A Model Simplification Roadmap for Image Restoration on Mobile Devices

  • paper_url: http://arxiv.org/abs/2308.12494
  • repo_url: None
  • paper_authors: Xiangyu Chen, Ruiwen Zhen, Shuai Li, Xiaotian Li, Guanghui Wang
  • for: restore high-quality images from degraded counterparts and improve the efficiency of image restoration models on mobile devices.
  • methods: add more parameters to partial convolutions on FLOPs non-sensitive layers, apply partial depthwise convolution coupled with decoupling upsampling/downsampling layers.
  • results: decrease runtime by up to 13%, reduce the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets.
    Abstract Image restoration aims to restore high-quality images from degraded counterparts and has seen significant advancements through deep learning techniques. The technique has been widely applied to mobile devices for tasks such as mobile photography. Given the resource limitations on mobile devices, such as memory constraints and runtime requirements, the efficiency of models during deployment becomes paramount. Nevertheless, most previous works have primarily concentrated on analyzing the efficiency of single modules and improving them individually. This paper examines the efficiency across different layers. We propose a roadmap that can be applied to further accelerate image restoration models prior to deployment while simultaneously increasing PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). The roadmap first increases the model capacity by adding more parameters to partial convolutions on FLOPs non-sensitive layers. Then, it applies partial depthwise convolution coupled with decoupling upsampling/downsampling layers to accelerate the model speed. Extensive experiments demonstrate that our approach decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets. Source Code of our method is available at \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA}.
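A minimal sketch of a partial convolution block of the kind the roadmap manipulates: only a fraction of channels is convolved and the rest passes through unchanged, which keeps FLOPs low on large feature maps; the split ratio and kernel size are illustrative:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Apply a 3x3 convolution to the first `ratio` fraction of channels and
    leave the remaining channels untouched, cutting FLOPs on non-sensitive layers."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.conv_ch = max(int(channels * ratio), 1)
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size=3, padding=1)

    def forward(self, x):
        head, tail = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat([self.conv(head), tail], dim=1)
```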

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

  • paper_url: http://arxiv.org/abs/2308.12469
  • repo_url: None
  • paper_authors: Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco
  • for: 为图像生成高质量的零样本分割掩码,解决计算机视觉中的这一基本问题。
  • methods: 利用 Stable Diffusion 模型中的自注意力层,通过衡量注意力图之间的 KL 散度,将其迭代合并为有效的分割掩码。
  • results: 在 COCO-Stuff-27 上,我们的方法超越了此前无监督零样本的最先进方法,像素精度绝对提升 26%,平均 IoU 提升 17%。
    Abstract Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.
    摘要 为图像生成高质量的分割掩码是计算机视觉中的基本问题。近期研究探索了大规模监督训练以实现对几乎任意图像风格的零样本分割,以及无需稠密标注的无监督训练。然而,构建一个无需任何标注即可零样本分割任意内容的模型仍然充满挑战。在这篇论文中,我们提出利用 Stable Diffusion 模型中的自注意力层来实现这一目标,因为预训练的 Stable Diffusion 模型已经在其注意力层中学习到了物体的内在概念。具体来说,我们提出了一种简单而有效的迭代合并过程:通过衡量注意力图之间的 KL 散度,将其合并为有效的分割掩码。所提方法无需任何训练,也不依赖语言,即可为任意图像提取高质量分割。在 COCO-Stuff-27 上,我们的方法超越了此前无监督零样本的最先进方法,像素精度绝对提升 26%,平均 IoU 提升 17%。
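A minimal sketch of the iterative merging idea: measure a symmetric KL divergence between (normalized) attention maps and repeatedly merge the most similar pair; the merging rule and stopping threshold are assumptions, not the paper's exact procedure:

```python
import torch

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two attention maps normalized to sum to 1."""
    p = p.flatten() + eps
    q = q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * ((p * (p / q).log()).sum() + (q * (q / p).log()).sum())

def merge_attention_maps(maps, threshold=1.0):
    """Iteratively merge the most similar pair of maps (lowest KL) until every
    remaining pair differs by more than `threshold`; survivors act as mask proposals."""
    maps = [m.clone() for m in maps]
    while len(maps) > 1:
        dists = [(sym_kl(maps[i], maps[j]), i, j)
                 for i in range(len(maps)) for j in range(i + 1, len(maps))]
        d, i, j = min(dists, key=lambda t: t[0])
        if d > threshold:
            break
        merged = 0.5 * (maps[i] + maps[j])
        maps = [m for k, m in enumerate(maps) if k not in (i, j)] + [merged]
    return maps  # thresholding / per-pixel argmax over these yields segmentation masks
```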

InverseSR: 3D Brain MRI Super-Resolution Using a Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.12465
  • repo_url: https://github.com/biomedai-ucsc/inversesr
  • paper_authors: Jueqi Wang, Jacob Levman, Walter Hugo Lopez Pinaya, Petru-Daniel Tudosiu, M. Jorge Cardoso, Razvan Marinescu
  • for: 这个论文的目的是提出一种基于深度学习的MRI超分辨(SR)方法,以提高临床MRI扫描的分辨率。
  • methods: 该方法利用一个最先进的 3D 脑部生成模型——在 UK BioBank 上训练的潜在扩散模型(LDM),作为生成先验来提升临床 MRI 扫描的分辨率。
  • results: 该方法可以在多种不同的 MRI SR 设置下提升分辨率,并针对不同稀疏程度的设置给出了相应的求解策略。
    Abstract High-resolution (HR) MRI scans obtained from research-grade medical centers provide precise information about imaged tissues. However, routine clinical MRI scans are typically in low-resolution (LR) and vary greatly in contrast and spatial resolution due to the adjustments of the scanning parameters to the local needs of the medical center. End-to-end deep learning methods for MRI super-resolution (SR) have been proposed, but they require re-training each time there is a shift in the input distribution. To address this issue, we propose a novel approach that leverages a state-of-the-art 3D brain generative model, the latent diffusion model (LDM) trained on UK BioBank, to increase the resolution of clinical MRI scans. The LDM acts as a generative prior, which has the ability to capture the prior distribution of 3D T1-weighted brain MRI. Based on the architecture of the brain LDM, we find that different methods are suitable for different settings of MRI SR, and thus propose two novel strategies: 1) for SR with more sparsity, we invert through both the decoder of the LDM and also through a deterministic Denoising Diffusion Implicit Models (DDIM), an approach we will call InverseSR(LDM); 2) for SR with less sparsity, we invert only through the LDM decoder, an approach we will call InverseSR(Decoder). These two approaches search different latent spaces in the LDM model to find the optimal latent code to map the given LR MRI into HR. The training process of the generative model is independent of the MRI under-sampling process, ensuring the generalization of our method to many MRI SR problems with different input measurements. We validate our method on over 100 brain T1w MRIs from the IXI dataset. Our method can demonstrate that powerful priors given by LDM can be used for MRI reconstruction.
    摘要 来自研究级医疗中心的高分辨率(HR)MRI 扫描能够提供成像组织的精确信息。然而,日常临床 MRI 扫描通常是低分辨率(LR)的,并且由于扫描参数需适应医疗机构的本地需求,其对比度和空间分辨率差异很大。已有研究提出了端到端的深度学习 MRI 超分辨率(SR)方法,但每当输入分布发生变化时都需要重新训练。为解决这一问题,我们提出了一种新方法,利用在 UK BioBank 上训练的最先进 3D 脑部生成模型——潜在扩散模型(LDM)来提升临床 MRI 扫描的分辨率。LDM 作为生成先验,能够刻画 3D T1 加权脑部 MRI 的先验分布。基于脑部 LDM 的架构,我们发现不同的方法适用于不同的 MRI SR 设置,因此提出了两种新策略:1)对于更稀疏的 SR,我们同时通过 LDM 的解码器和确定性的去噪扩散隐式模型(DDIM)进行反演,称为 InverseSR(LDM);2)对于稀疏程度较低的 SR,我们只通过 LDM 解码器进行反演,称为 InverseSR(Decoder)。这两种方法在 LDM 模型的不同潜在空间中搜索最优的潜在编码,将给定的 LR MRI 映射为 HR。生成模型的训练过程与 MRI 欠采样过程无关,从而保证了我们的方法可以推广到具有不同输入测量方式的多种 MRI SR 问题。我们在 IXI 数据集的 100 多例脑部 T1w MRI 上验证了该方法,结果表明 LDM 提供的强大先验可以用于 MRI 重建。

Overcoming General Knowledge Loss with Selective Parameter Finetuning

  • paper_url: http://arxiv.org/abs/2308.12462
  • repo_url: None
  • paper_authors: Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohamed Elhoseiny
  • for: 提高基础模型的更新能力,以适应新的信息和维护原有知识。
  • methods: 本文提出了一种新的方法,通过对一小部分参数进行局部修改来实现基础模型的持续更新。该方法基于先前对基础模型的分析,首先定位需要细化的特定层,然后引入重要性评分机制,只更新最关键的权重。
  • results: 对基础视觉语言模型进行了广泛评估,证明该方法在多种持续学习任务上改进了现有的持续学习方法 0.5%-10%,并将预训练知识的损失从约 5% 降低到 0.97%。
    Abstract Foundation models encompass an extensive knowledge base and offer remarkable transferability. However, this knowledge becomes outdated or insufficient over time. The challenge lies in updating foundation models to accommodate novel information while retaining their original ability. In this paper, we present a novel approach to achieving continual model updates by effecting localized modifications to a small subset of parameters. Guided by insights gleaned from prior analyses of foundational models, we first localize a specific layer for model refinement and then introduce an importance scoring mechanism designed to update only the most crucial weights. Our method is exhaustively evaluated on foundational vision-language models, measuring its efficacy in both learning new information and preserving pre-established knowledge across a diverse spectrum of continual learning tasks, including Aircraft, Birdsnap CIFAR-100, CUB, Cars, and GTSRB. The results show that our method improves the existing continual learning methods by 0.5\% - 10\% on average, and reduces the loss of pre-trained knowledge from around 5\% to 0.97\%. Comprehensive ablation studies substantiate our method design, shedding light on the contributions of each component to controllably learning new knowledge and mitigating the forgetting of pre-trained knowledge.
    摘要 基础模型包含广泛的知识库,并具有出色的可迁移性。然而,这些知识会随时间推移而过时或不足。挑战在于既要更新基础模型以容纳新信息,又要保留其原有能力。在这篇论文中,我们提出了一种新方法,通过对一小部分参数进行局部修改来实现模型的持续更新。以先前对基础模型的分析所得的认识为指导,我们首先定位需要细化的特定层,然后引入重要性评分机制,只更新最关键的权重。我们在基础视觉语言模型上对该方法进行了全面评估,测试其在多种持续学习任务(包括 Aircraft、Birdsnap、CIFAR-100、CUB、Cars 和 GTSRB)上学习新信息并保留既有知识的能力。结果表明,我们的方法将现有持续学习方法平均提升了 0.5%-10%,并将预训练知识的损失从约 5% 降低到 0.97%。全面的消融研究验证了我们的方法设计,阐明了各组成部分在可控地学习新知识和缓解预训练知识遗忘方面的贡献。
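A minimal sketch of selective parameter finetuning: score the weights of one localized layer by accumulated gradient magnitude on the new task and keep only the top fraction trainable; the scoring function and update projection are assumptions about how such an importance mechanism could be realized:

```python
import torch

def importance_mask(model, layer_name, data_loader, loss_fn, top_frac=0.1):
    """Score each weight in one chosen layer by accumulated |gradient| on the
    new task, then keep only the top fraction trainable via a binary mask."""
    param = dict(model.named_parameters())[layer_name]
    scores = torch.zeros_like(param)
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        scores += param.grad.abs()
    k = max(int(scores.numel() * top_frac), 1)
    threshold = scores.flatten().topk(k).values.min()
    return (scores >= threshold).float()

# During finetuning, after each optimizer step on `param`, project the update so
# that only masked weights move (a hypothetical realization):
#   param.data = mask * param.data + (1 - mask) * original_param.data
```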

ARF-Plus: Controlling Perceptual Factors in Artistic Radiance Fields for 3D Scene Stylization

  • paper_url: http://arxiv.org/abs/2308.12452
  • repo_url: None
  • paper_authors: Wenzhao Li, Tianhao Wu, Fangcheng Zhong, Cengiz Oztireli
  • for: 用于三维场景样式传递
  • methods: 使用3D神经辐射场进行样式传递,并提供四种控制方法:色彩保持控制、纹理尺度控制、空间选择性风格控制和深度增强控制
  • results: 通过实际数据集的量化和质量评估,表明ARF-Plus框架在三维场景样式传递中提供了有效的控制功能,并且可以同时应用多种样式效果,创造出独特和引人注目的风格效果。
    Abstract The radiance fields style transfer is an emerging field that has recently gained popularity as a means of 3D scene stylization, thanks to the outstanding performance of neural radiance fields in 3D reconstruction and view synthesis. We highlight a research gap in radiance fields style transfer, the lack of sufficient perceptual controllability, motivated by the existing concept in the 2D image style transfer. In this paper, we present ARF-Plus, a 3D neural style transfer framework offering manageable control over perceptual factors, to systematically explore the perceptual controllability in 3D scene stylization. Four distinct types of controls - color preservation control, (style pattern) scale control, spatial (selective stylization area) control, and depth enhancement control - are proposed and integrated into this framework. Results from real-world datasets, both quantitative and qualitative, show that the four types of controls in our ARF-Plus framework successfully accomplish their corresponding perceptual controls when stylizing 3D scenes. These techniques work well for individual style inputs as well as for the simultaneous application of multiple styles within a scene. This unlocks a realm of limitless possibilities, allowing customized modifications of stylization effects and flexible merging of the strengths of different styles, ultimately enabling the creation of novel and eye-catching stylistic effects on 3D scenes.
    摘要 辐射场风格迁移是一个新兴领域,得益于神经辐射场在3D重建和视图合成中的出色表现,它最近作为一种3D场景风格化手段而受到欢迎。受2D图像风格迁移中已有概念的启发,我们指出了辐射场风格迁移研究中的一个空白:缺乏足够的感知可控性。在这篇论文中,我们提出 ARF-Plus,一个提供可控感知因素的3D神经风格迁移框架,用于系统地探索3D场景风格化中的感知可控性。我们提出并集成了四种不同类型的控制:颜色保持控制、(风格图案)尺度控制、空间(选择性风格化区域)控制和深度增强控制。在真实数据集上的定量与定性结果表明,ARF-Plus 框架中的这四类控制在风格化3D场景时都能成功实现相应的感知控制。这些技术既适用于单一风格输入,也适用于在同一场景中同时应用多种风格,从而支持对风格化效果的定制修改和对不同风格优势的灵活融合,最终能够在3D场景上创造出新颖且引人注目的风格效果。

MOFO: MOtion FOcused Self-Supervision for Video Understanding

  • paper_url: http://arxiv.org/abs/2308.12447
  • repo_url: None
  • paper_authors: Mona Ahmadian, Frank Guerin, Andrew Gilbert
  • for: 本研究的目的是提高视频中动作识别的性能,通过对视频中动作区域进行自我监督学习,以改进视频中动作的表征学习。
  • methods: 我们提出了一种新的自我监督学习方法,称为 MOFO(动作区域关注),它可以自动检测视频中的动作区域,并使用这些区域来引导自我监督任务。我们使用掩码自编码器随机遮盖输入序列中较高比例的内容,并强制其中一定比例的掩码位于动作区域内部,其余部分来自区域外部。此外,我们还在下游微调阶段加入动作信息,以强调动作的表征。
  • results: 我们的研究表明,我们的动作区域关注技术可以明显提高当前最佳自我监督学习方法(VideoMAE)的动作识别性能。我们在 Epic-Kitchens 的动词、名词和动作分类任务上分别提高了 2.6%、2.1% 和 1.3% 的精度,并在 Something-Something V2 动作分类任务上提高了 4.7% 的精度。这表明,在自我监督学习中显式地编码动作是非常重要的。
    Abstract Self-supervised learning (SSL) techniques have recently produced outstanding results in learning visual representations from unlabeled videos. Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos. To address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for focusing representation learning on the motion area of a video, for action recognition. MOFO automatically detects motion areas in videos and uses these to guide the self-supervision task. We use a masked autoencoder which randomly masks out a high proportion of the input sequence; we force a specified percentage of the inside of the motion area to be masked and the remainder from outside. We further incorporate motion information into the finetuning step to emphasise motion in the downstream task. We demonstrate that our motion-focused innovations can significantly boost the performance of the currently leading SSL method (VideoMAE) for action recognition. Our method improves the recent self-supervised Vision Transformer (ViT), VideoMAE, by achieving +2.6%, +2.1%, +1.3% accuracy on Epic-Kitchens verb, noun and action classification, respectively, and +4.7% accuracy on Something-Something V2 action classification. Our proposed approach significantly improves the performance of the current SSL method for action recognition, indicating the importance of explicitly encoding motion in SSL.
    摘要 自顾学(SSL)技术在无标注视频中学习视觉表示方面最近取得了出色的结果。尽管动作认知中的运动信息在指导学习过程中非常重要,但SSL方法通常不直接考虑视频中的运动信息。为解决这个问题,我们提议MOFO(运动区域关注)方法,它是一种新的SSL方法,用于在视频中注意力集中在运动区域上,以提高动作认知。MOFO方法自动检测视频中的运动区域,并使用这些区域来引导自我超vision任务。我们使用一个随机屏蔽输入序列的masked autoencoder,其中高比例的输入序列会被随机屏蔽,而在运动区域内部则强制屏蔽一定比例。此外,我们还在下游任务中注入运动信息,以强调运动在下游任务中的作用。我们示出,我们的运动关注创新可以显著提高现有的SSL方法(VideoMAE)对动作认知的性能。我们的方法可以在Epic-Kitchens动词、名词和动作分类中提高VideoMAE的性能,分别提高+2.6%、+2.1%和+1.3%的精度。此外,我们还在Something-Something V2动作分类中提高了+4.7%的精度。这表明,在SSL中显式编码运动的重要性。
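
A small sketch of the motion-biased masking idea described above (the masking ratio, the inside/outside split, and the way motion areas are detected are illustrative assumptions, not MOFO's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def motion_focused_mask(motion_area, mask_ratio=0.9, inside_share=0.75):
    """motion_area: boolean array over patch positions (True = moving patch)."""
    flat = motion_area.reshape(-1)
    inside, outside = np.where(flat)[0], np.where(~flat)[0]
    n_total = int(mask_ratio * flat.size)
    n_inside = min(int(inside_share * n_total), inside.size)     # masks forced inside motion area
    n_outside = min(n_total - n_inside, outside.size)            # remainder from outside
    chosen = np.concatenate([rng.choice(inside, n_inside, replace=False),
                             rng.choice(outside, n_outside, replace=False)])
    mask = np.zeros(flat.size, dtype=bool)
    mask[chosen] = True
    return mask.reshape(motion_area.shape)

# Example: a 14x14 patch grid with a moving region in the centre.
area = np.zeros((14, 14), dtype=bool)
area[4:10, 4:10] = True
print(motion_focused_mask(area).mean())   # close to the 0.9 overall masking ratio
```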

TAI-GAN: Temporally and Anatomically Informed GAN for early-to-late frame conversion in dynamic cardiac PET motion correction

  • paper_url: http://arxiv.org/abs/2308.12443
  • repo_url: https://github.com/gxq1998/tai-gan
  • paper_authors: Xueqi Guo, Luyao Shi, Xiongchao Chen, Bo Zhou, Qiong Liu, Huidong Xie, Yi-Hwa Liu, Richard Palyo, Edward J. Miller, Albert J. Sinusas, Bruce Spottiswoode, Chi Liu, Nicha C. Dvornek
  • for: 这篇论文主要关注的是动脉心PET图像中的快速追踪器动力学和各帧分布的高变化,以及这些变化对插入动作 corrections 的影响。
  • methods: 该论文提出了一种使用生成方法处理 tracer 分布变化以帮助现有的注册方法。具体来说,我们提出了一种 Temporally and Anatomically Informed Generative Adversarial Network (TAI-GAN),用于在早期帧中将 tracer 分布变化转换为late reference frame中的图像。
  • results: 我们在临床 $^{82}$Rb PET数据集上验证了我们的提议方法,并发现我们的 TAI-GAN 可以生成高质量的转换图像,与参照帧图像相似。经过 TAI-GAN 转换后,运动估计精度和临床心肌血流量(MBF)的量化均得到了改善。
    Abstract The rapid tracer kinetics of rubidium-82 ($^{82}$Rb) and high variation of cross-frame distribution in dynamic cardiac positron emission tomography (PET) raise significant challenges for inter-frame motion correction, particularly for the early frames where conventional intensity-based image registration techniques are not applicable. Alternatively, a promising approach utilizes generative methods to handle the tracer distribution changes to assist existing registration methods. To improve frame-wise registration and parametric quantification, we propose a Temporally and Anatomically Informed Generative Adversarial Network (TAI-GAN) to transform the early frames into the late reference frame using an all-to-one mapping. Specifically, a feature-wise linear modulation layer encodes channel-wise parameters generated from temporal tracer kinetics information, and rough cardiac segmentations with local shifts serve as the anatomical information. We validated our proposed method on a clinical $^{82}$Rb PET dataset and found that our TAI-GAN can produce converted early frames with high image quality, comparable to the real reference frames. After TAI-GAN conversion, motion estimation accuracy and clinical myocardial blood flow (MBF) quantification were improved compared to using the original frames. Our code is published at https://github.com/gxq1998/TAI-GAN.
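
A minimal sketch of the feature-wise linear modulation (FiLM) block mentioned in the abstract, with illustrative sizes; the conditioning vector stands in for temporal tracer-kinetics features:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat, cond):
        # feat: (B, C, H, W) feature map, cond: (B, cond_dim) kinetics descriptor
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift             # channel-wise modulation

film = FiLM(cond_dim=8, num_channels=64)
feat = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 8)           # e.g. time-activity-curve features per frame (assumed)
print(film(feat, cond).shape)      # torch.Size([2, 64, 32, 32])
```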

HNAS-reg: hierarchical neural architecture search for deformable medical image registration

  • paper_url: http://arxiv.org/abs/2308.12440
  • repo_url: None
  • paper_authors: Jiong Wu, Yong Fan
  • for: 这篇论文是为了找出最佳的深度学习模型,用于医疗影像注册。
  • methods: 这篇论文使用了一个内在的 NAS 框架 (HNAS-Reg),包括了扩散操作搜索和网络架构搜索,以找到最佳的网络架构。具体来说,这个框架使用了一种参数化的搜索方法,以找到最佳的扩散操作和网络架构。
  • results: 实验结果显示,提议的方法可以建立一个具有更高影像注册精度和较小的模型大小的深度学习模型,比过去的影像注册方法更好。具体来说,在三个数据集上(包括 636 个 T1-调试磁共振成像(MRI)),提议的方法可以建立一个深度学习模型,并且与其他两个Unsupervised Learning-based方法相比,具有更高的影像注册精度和较小的模型大小。
    Abstract Convolutional neural networks (CNNs) have been widely used to build deep learning models for medical image registration, but manually designed network architectures are not necessarily optimal. This paper presents a hierarchical NAS framework (HNAS-Reg), consisting of both convolutional operation search and network topology search, to identify the optimal network architecture for deformable medical image registration. To mitigate the computational overhead and memory constraints, a partial channel strategy is utilized without losing optimization quality. Experiments on three datasets, consisting of 636 T1-weighted magnetic resonance images (MRIs), have demonstrated that the proposal method can build a deep learning model with improved image registration accuracy and reduced model size, compared with state-of-the-art image registration approaches, including one representative traditional approach and two unsupervised learning-based approaches.
    摘要 卷积神经网络(CNN)已经广泛用于深度学习模型的医学图像注册,但是手动设计的网络架构可能不是最佳的。这篇论文提出了一种层次 NAS 框架(HNAS-Reg),包括卷积操作搜索和网络架构搜索,以确定最佳的医学图像注册网络架构。为了减少计算负担和内存限制,该方法使用了部分通道策略,而不失去优化质量。在三个数据集上,包括 636 个 T1 束缚磁共振成像(MRI),实验表明,提议方法可以建立一个具有提高图像注册精度和减少模型大小的深度学习模型,相比之下一个代表性的传统方法和两个无监督学习方法。

Characterising representation dynamics in recurrent neural networks for object recognition

  • paper_url: http://arxiv.org/abs/2308.12435
  • repo_url: None
  • paper_authors: Sushrut Thorat, Adrien Doerig, Tim C. Kietzmann
  • for: 这种研究旨在理解Recurrent Neural Networks (RNNs) 在复杂视觉任务中的表征动态,特别是大规模视觉模型中的计算。
  • methods: 研究者使用了MiniEcoset,一个新的子集,来训练 RNNs 进行物体分类。他们还使用了“读取区”来描述计算轨迹的活动排序。
  • results: 研究者发现,在推断时,表征在正确分类之后仍会继续演化,这表明 RNNs 没有“完成分类”的概念。此外,研究者发现,被错误分类的表征具有较低的 L2 范数激活,并位于“读取区”中更外围的位置;这种排布有助于错误分类的表征随时间推移移动到正确的区域。这些发现可以推广到其他类型的 RNNs(包括具有侧向和自上而下连接的网络),并有助于理解灵长类视觉中的表征动态。
    Abstract Recurrent neural networks (RNNs) have yielded promising results for both recognizing objects in challenging conditions and modeling aspects of primate vision. However, the representational dynamics of recurrent computations remain poorly understood, especially in large-scale visual models. Here, we studied such dynamics in RNNs trained for object classification on MiniEcoset, a novel subset of ecoset. We report two main insights. First, upon inference, representations continued to evolve after correct classification, suggesting a lack of the notion of being ``done with classification''. Second, focusing on ``readout zones'' as a way to characterize the activation trajectories, we observe that misclassified representations exhibit activation patterns with lower L2 norm, and are positioned more peripherally in the readout zones. Such arrangements help the misclassified representations move into the correct zones as time progresses. Our findings generalize to networks with lateral and top-down connections, and include both additive and multiplicative interactions with the bottom-up sweep. The results therefore contribute to a general understanding of RNN dynamics in naturalistic tasks. We hope that the analysis framework will aid future investigations of other types of RNNs, including understanding of representational dynamics in primate vision.
    摘要 recurrent neural networks (RNNs) 已经在具有挑战性的条件下识别对象以及模型Primates的视觉方面显示了promising的结果。然而,RNNs中的表达动力学 Dynamics 仍未得到了充分的理解,特别是在大规模的视觉模型中。在这里,我们对RNNs在MiniEcoset上进行了对象分类训练。我们发现了两个主要的发现:首先,在推理时,表达还在继续进行改变,表明没有“完成分类”的概念。第二,我们将“读取区”作为表达轨迹的特征进行分析,发现了在读取区中的表达方式具有更低的L2范数,并且位于读取区的更外围位置。这种排列可以帮助错误的表达移动到正确的区域,并在时间的推移中进行改变。我们的发现涵盖了具有 Lateral 和上下 Connection 的网络,并包括了加法和乘法交互。这些结果因此对 RNN 动力学在自然任务中的一般理解做出了贡献,并且可以帮助未来对其他类型的 RNN 进行更深入的研究,包括理解primates 视觉中的表达动力学。
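
A tiny sketch of the kind of readout-zone measurement described above, comparing the L2 norm of representations for correctly and incorrectly classified items; the arrays are random stand-ins for recorded RNN activations:

```python
import numpy as np

reps = np.random.randn(1000, 128)                 # last-timestep representations (placeholder)
labels = np.random.randint(0, 10, size=1000)
preds = np.random.randint(0, 10, size=1000)       # stand-in model predictions

norms = np.linalg.norm(reps, axis=1)
correct = preds == labels
print("mean L2 norm, correct:      ", norms[correct].mean())
print("mean L2 norm, misclassified:", norms[~correct].mean())
```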

A Spatiotemporal Correspondence Approach to Unsupervised LiDAR Segmentation with Traffic Applications

  • paper_url: http://arxiv.org/abs/2308.12433
  • repo_url: None
  • paper_authors: Xiao Li, Pan He, Aotian Wu, Sanjay Ranka, Anand Rangarajan
  • for: 这个研究旨在解决室外LiDAR点云Sequence中的无监督Semantic Segmentation问题,尤其是在自动驾驶和交叉基建中的多种交通情况下。
  • methods: 本研究利用Point cloud sequence的空间时间特性,并在多帧框架之间建立强大的对应关系,以提高Semantic Segmentation的精度。研究将 clustering和pseudo-label学习结合,将点 cloud分组成Semantic groups,并使用点 clouds的pseudo-spatiotemporal标签进行模型优化。
  • results: 研究在Semantic-KITTI、SemanticPOSS和FLORIDAbenchmark dataset上得到了竞争性的Semantic Segmentation性能,与许多现有的对照学习方法相比。这个通用框架可以带来LiDAR点云Sequence中的统一表现学习方法,并结合对领域知识的导入。
    Abstract We address the problem of unsupervised semantic segmentation of outdoor LiDAR point clouds in diverse traffic scenarios. The key idea is to leverage the spatiotemporal nature of a dynamic point cloud sequence and introduce drastically stronger augmentation by establishing spatiotemporal correspondences across multiple frames. We dovetail clustering and pseudo-label learning in this work. Essentially, we alternate between clustering points into semantic groups and optimizing models using point-wise pseudo-spatiotemporal labels with a simple learning objective. Therefore, our method can learn discriminative features in an unsupervised learning fashion. We show promising segmentation performance on Semantic-KITTI, SemanticPOSS, and FLORIDA benchmark datasets covering scenarios in autonomous vehicle and intersection infrastructure, which is competitive when compared against many existing fully supervised learning methods. This general framework can lead to a unified representation learning approach for LiDAR point clouds incorporating domain knowledge.
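
A schematic of the alternation between clustering and pseudo-label learning described in the abstract (heavily simplified: real point features, spatiotemporal correspondences, and augmentations are omitted):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

points = torch.randn(2048, 16)                    # stand-in per-point features
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for round_idx in range(3):                        # alternate clustering and training
    with torch.no_grad():
        emb = model[0](points)                    # embeddings from the first layer
    pseudo = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb.numpy())
    pseudo = torch.from_numpy(pseudo).long()      # pseudo semantic groups
    for _ in range(20):                           # optimize against the pseudo-labels
        loss = nn.functional.cross_entropy(model(points), pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
    print(round_idx, float(loss))
```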

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

  • paper_url: http://arxiv.org/abs/2308.12408
  • repo_url: None
  • paper_authors: Matthew Martel, Jackson Wagner
  • for: 这个论文的目的是开发一种基于深度学习的框架,用于生成电影和其他媒体中的真实的音效。
  • methods: 这个论文使用了多种不同的模型建立,包括深度融合CNN、扩展Wavenet CNN以及Transformer结构。这些模型都将视频上下文和先前生成的音频融合在一起,以生成真实的音效。
  • results: 研究发现,使用Transformer结构可以匹配视频中的低频信号,但是无法生成更加复杂的波形。
    Abstract Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art. Foley artists create sounds with common objects (e.g., boxing gloves, broken glass) in time with video as it is playing to generate captivating audio tracks. In this work, we aim to develop a deep-learning based framework that does much the same - observes video in it's natural sequence and generates realistic audio to accompany it. Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs (e.g., Wavenet conditioned on text). We explore several different model architectures to accomplish this task that process both previously-generated audio and video context. These include deep-fusion CNN, dilated Wavenet CNN with visual context, and transformer-based architectures. We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively, but failing to generate more nuanced waveforms.

FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features

  • paper_url: http://arxiv.org/abs/2308.12380
  • repo_url: https://github.com/ihp-lab/fg-net
  • paper_authors: Yufeng Yin, Di Chang, Guoxian Song, Shen Sang, Tiancheng Zhi, Jing Liu, Linjie Luo, Mohammad Soleymani
  • for: 该文章目的是提出一种通用的表情动作单元检测方法,以优化对 facial expression 的 объектив分析。
  • methods: 该方法使用 StyleGAN2 模型预训练在大型和多样化的面孔图像集上,然后使用 Pyramid CNN Interpreter 检测表情动作单元。
  • results: 对于 DISFA 和 BP4D datasets,提出的方法在跨域和同域检测中均达到了优于预先的状态艺术,同时在1000个样本上进行训练并且可以达到竞争性的性能。
    Abstract Automatic detection of facial Action Units (AUs) allows for objective facial expression analysis. Due to the high cost of AU labeling and the limited size of existing benchmarks, previous AU detection methods tend to overfit the dataset, resulting in a significant performance loss when evaluated across corpora. To address this problem, we propose FG-Net for generalizable facial action unit detection. Specifically, FG-Net extracts feature maps from a StyleGAN2 model pre-trained on a large and diverse face image dataset. Then, these features are used to detect AUs with a Pyramid CNN Interpreter, making the training efficient and capturing essential local features. The proposed FG-Net achieves a strong generalization ability for heatmap-based AU detection thanks to the generalizable and semantic-rich features extracted from the pre-trained generative model. Extensive experiments are conducted to evaluate within- and cross-corpus AU detection with the widely-used DISFA and BP4D datasets. Compared with the state-of-the-art, the proposed method achieves superior cross-domain performance while maintaining competitive within-domain performance. In addition, FG-Net is data-efficient and achieves competitive performance even when trained on 1000 samples. Our code will be released at \url{https://github.com/ihp-lab/FG-Net}
    摘要 自动检测人脸动作单元(AU)可以实现 объектив的人脸表达分析。由于AU标注的高成本和现有 benchmark 的有限大小,前一代AU检测方法往往会适应数据集,导致在 corpora 中表现不佳。为解决这个问题,我们提出了 FG-Net,一种通用的人脸动作单元检测方法。具体来说,FG-Net 从 StyleGAN2 模型在大量和多样的人脸图像数据集上预训练后的特征图进行检测AU。然后,这些特征图被 Pyramid CNN Interpreter 使用,以实现高效的训练和捕捉本地特征。我们提出的 FG-Net 在热图基于 AU 检测中实现了强大的总结能力,因为它可以从预训练的生成模型中提取通用和含义 Rich 的特征。我们进行了广泛的实验,以评估在 DISFA 和 BP4D 数据集上的在 corpora 和 across-corpus 中的 AU 检测性能。与当前状态的方法相比,我们的方法在跨频谱上实现了superior 的横跨频谱性能,同时保持竞争的在频谱内性能。此外,FG-Net 是数据效率的,可以在1000个样本上实现竞争性的表现。我们的代码将在 \url{https://github.com/ihp-lab/FG-Net} 上发布。

AdVerb: Visually Guided Audio Dereverberation

  • paper_url: http://arxiv.org/abs/2308.12370
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
  • for: 提高混响音频的质量,使其更加清晰和可识别。
  • methods: 结合视觉线索与混响音频,通过一种新的几何感知跨模态 Transformer 架构,捕捉场景几何与音视频跨模态关系,生成复杂理想比例掩码,用于恢复干净的音频。
  • results: 与仅音频和音频+视觉两种基线相比,在 LibriSpeech test-clean 集上实现了 18%-82% 的相对提升;同时在 AVSpeech 数据集上也取得了令人满意的 RT60 误差分数。
    Abstract We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.
    摘要 我们介绍了AdVerb,一种新的音频-视觉减振框架,该框架利用视觉信号以及干扰音频来估算清晰音频。虽然音频只的减振框架已经广泛研究过,但我们的方法具有较好的场景准确性和音频视觉跨模态关系,可以更好地进行音频减振。给出了环境中录制的干扰音频的图像,AdVerb使用了一种新的场景意识的cross-modal transformer架构,捕捉场景准确性和音频视觉跨模态关系,生成复杂的理想比例面积,当应用于干扰音频时,可以预测清晰音频。我们的方法的效果得到了广泛的量化和质量评估。与传统的音频只和音频视觉基线相比,我们的方法在三个下游任务中表现出了显著的改善,即speech enhancement、speech recognition和speaker verification,改善比例在0.18-0.82之间。此外,我们在AVSpeech dataset上也实现了高度满意的RT60错误分布。
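
A sketch of the final masking step described above: a predicted complex ideal ratio mask is applied to the STFT of the reverberant signal and inverted back to a waveform. The mask here is a random placeholder for a network output:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
reverberant = np.random.randn(fs * 2)                 # 2 s of audio (placeholder)
f, t, spec = stft(reverberant, fs=fs, nperseg=512)

cirm = np.random.randn(*spec.shape) + 1j * np.random.randn(*spec.shape)  # stand-in network output
dereverbed_spec = cirm * spec                         # complex ratio masking
_, dereverbed = istft(dereverbed_spec, fs=fs, nperseg=512)
print(dereverbed.shape)
```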

Continual Zero-Shot Learning through Semantically Guided Generative Random Walks

  • paper_url: http://arxiv.org/abs/2308.12366
  • repo_url: https://github.com/wx-zhang/igczsl
  • paper_authors: Wenxuan Zhang, Paul Janson, Kai Yi, Ivan Skorokhodov, Mohamed Elhoseiny
  • for: 本研究旨在模型人类在生活中不断学习和应用新知识,以及将之应用于未来任务中。
  • methods: 本研究使用生成模型,通过学习seen类的质量表示来提高对未经训练的视觉空间的生成理解。
  • results: 提出了一种基于生成模型的 continual zero-shot learning 算法,在 AWA1、AWA2、CUB 和 SUN 数据集上达到了状态之 arts 性能,比现有的 CZSL 方法高出 3-7%。
    Abstract Learning novel concepts, remembering previous knowledge, and adapting it to future tasks occur simultaneously throughout a human's lifetime. To model such comprehensive abilities, continual zero-shot learning (CZSL) has recently been introduced. However, most existing methods overused unseen semantic information that may not be continually accessible in realistic settings. In this paper, we address the challenge of continual zero-shot learning where unseen information is not provided during training, by leveraging generative modeling. The heart of the generative-based methods is to learn quality representations from seen classes to improve the generative understanding of the unseen visual space. Motivated by this, we introduce generalization-bound tools and provide the first theoretical explanation for the benefits of generative modeling to CZSL tasks. Guided by the theoretical analysis, we then propose our learning algorithm that employs a novel semantically guided Generative Random Walk (GRW) loss. The GRW loss augments the training by continually encouraging the model to generate realistic and characterized samples to represent the unseen space. Our algorithm achieves state-of-the-art performance on AWA1, AWA2, CUB, and SUN datasets, surpassing existing CZSL methods by 3-7\%. The code has been made available here \url{https://github.com/wx-zhang/IGCZSL}
    摘要 人类生命中,同时学习新概念,记忆过去知识,并将其应用到未来任务中发生。为模型这种全面能力,最近才提出了无限 zero-shot learning(CZSL)。然而,现有的方法往往过度利用无法在实际场景中 continually 获得的无序 semantic information。在这篇论文中,我们解决了 CZSL 任务中无法在训练中提供无序信息的挑战,通过使用生成模型。生成模型的核心是学习seen类型的高质量表示,以改善对未seen visual空间的生成理解。这些基于的概念工具,我们提供了第一个理论解释,描述了生成模型对 CZSL 任务的优势。受理论分析的指导,我们然后提出了我们的学习算法,该算法使用了一种新的semantically guided Generative Random Walk(GRW)损失函数。GRW损失函数在训练中不断地鼓励模型生成真实、特征化的样本,以表示未seen空间。我们的算法在 AWA1、AWA2、CUB 和 SUN 数据集上达到了状态机器人的性能,超过了现有的 CZSL 方法3-7\%。我们的代码已经在 GitHub 上公开,访问地址为 \url{https://github.com/wx-zhang/IGCZSL}。

Saliency-based Video Summarization for Face Anti-spoofing

  • paper_url: http://arxiv.org/abs/2308.12364
  • repo_url: https://github.com/Usman1021/Saliency
  • paper_authors: Usman Muhammad, Mourad Oussalah, Md Ziaul Hoque, Jorma Laaksonen
  • for: 提高面部骗取检测器的性能和效率,使用视觉吸引力理论来增强深度学习模型的表现。
  • methods: 提出了一种视频概要方法,通过提取源图像的视觉吸引力信息,对每帧图像进行分解,并使用重要性映射来线性组合源图像,创建一个代表整个视频的单一图像。
  • results: 实验结果表明,该方法在五个具有挑战性的面部防伪检测数据集上达到了最先进(state-of-the-art)的性能,并且比传统方法具有更好的性能和效率。
    Abstract Due to the growing availability of face anti-spoofing databases, researchers are increasingly focusing on video-based methods that use hundreds to thousands of images to assess their impact on performance. However, there is no clear consensus on the exact number of frames in a video required to improve the performance of face anti-spoofing tasks. Inspired by the visual saliency theory, we present a video summarization method for face anti-spoofing tasks that aims to enhance the performance and efficiency of deep learning models by leveraging visual saliency. In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images, enabling identification of the most visually salient regions within each frame. Subsequently, the source images are decomposed into base and detail layers, enhancing representation of important information. The weighting maps are then computed based on the saliency information, indicating the importance of each pixel in the image. By linearly combining the base and detail layers using the weighting maps, the method fuses the source images to create a single representative image that summarizes the entire video. The key contribution of our proposed method lies in demonstrating how visual saliency can be used as a data-centric approach to improve the performance and efficiency of face presentation attack detection models. By focusing on the most salient images or regions within the images, a more representative and diverse training set can be created, potentially leading to more effective models. To validate the method's effectiveness, a simple deep learning architecture (CNN-RNN) was used, and the experimental results showcased state-of-the-art performance on five challenging face anti-spoofing datasets.
    摘要 由于面对面骗降库的可用性不断增长,研究人员正在更加关注视频基于方法,使用数百到千个图像来评估其影响性。然而,没有明确的共识,关于视频中帧数所需要提高面对面骗降模型的性能。我们根据视觉吸引力理论,提出了一种面对面骗降视频 summarization方法,以提高深度学习模型的性能和效率。具体来说,该方法使用源图像的差分 Laplacian 和 Wiener 滤波器输出来提取视觉吸引力信息,并在每帧中标识最有吸引力的区域。然后,源图像被分解成基层和详细层,从而增强图像的重要信息表示。最后,根据视觉吸引力信息,计算weighting map,以指示每个像素的重要性。通过线性组合基层和详细层,方法将源图像总结为整个视频的代表图像。我们的提案的关键在于,通过使用视觉吸引力来为面对面骗降模型提高性能和效率。通过关注图像中最有吸引力的部分或区域,可以创建更加代表和多样的训练集,可能导致更有效的模型。为验证方法的效果,我们使用了一种简单的深度学习架构(CNN-RNN),并在五个面对面骗降数据集上获得了状态艺术性的实验结果。
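
A simplified two-frame sketch of the summarization steps described above (the full method fuses all frames of a video, and the filter sizes here are assumptions): saliency from the difference of Laplacian- and Wiener-filtered images, a base/detail split, and a saliency-weighted fusion:

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter
from scipy.signal import wiener

def saliency(frame):
    # saliency from the difference of Laplacian and Wiener filter outputs
    return np.abs(laplace(frame) - wiener(frame, mysize=5))

def fuse(frame_a, frame_b):
    s_a, s_b = saliency(frame_a), saliency(frame_b)
    w_a = s_a / (s_a + s_b + 1e-8)                   # pixel-wise weighting map
    base_a, base_b = uniform_filter(frame_a, 15), uniform_filter(frame_b, 15)
    detail_a, detail_b = frame_a - base_a, frame_b - base_b
    base = w_a * base_a + (1 - w_a) * base_b
    detail = w_a * detail_a + (1 - w_a) * detail_b
    return base + detail                             # one representative image

frames = np.random.rand(2, 128, 128)                 # two grayscale face frames (placeholder)
print(fuse(frames[0], frames[1]).shape)
```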

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.12350
  • repo_url: None
  • paper_authors: Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu
  • for: 提高跨域图像翻译中的语义一致性,用于域自适应语义分割
  • methods: 在图像翻译过程中使用源域标签作为显式引导
  • results: 实验结果表明,该方法优于现有最先进方法
    Abstract Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods.
    摘要 通常,域适应semantic segmentation(DASS)中将源域图像翻译到目标域图像是一种常见的策略。然而,现有方法仍然困难保持 semantic consistency的本地细节 между原始图像和翻译图像。在这种情况下,我们提出了一种创新的方法,通过在翻译过程中使用源域标签作为直接导航来解决这个挑战。具体来说,我们将cross-domain image translation表示为干扰扩散过程,并使用一种新的Semantic Gradient Guidance(SGG)方法来约束翻译过程,将其受到像素级source标签的控制。此外,我们还提出了一种Progressive Translation Learning(PTL)策略,以确保 SGG 方法在不同域的大差下可靠地工作。广泛的实验证明了我们的方法在现有方法之上表现出了superiority。

A Generative Approach for Image Registration of Visible-Thermal (VT) Cancer Faces

  • paper_url: http://arxiv.org/abs/2308.12271
  • repo_url: None
  • paper_authors: Catherine Ordun, Alexandra Cha, Edward Raff, Sanjay Purushotham, Karen Kwok, Mason Rule, James Gulley
  • for: 这项研究旨在提高人工智能下的疼痛研究,使用可见光和热成像图像进行对比。
  • methods: 该研究使用生成式对齐算法进行图像配准,以解决可见光和热成像图像之间的偏移问题。
  • results: 研究发现,通过对可见光和热成像图像进行配准,可以在可见光到热成像(V2T)图像翻译任务中提高生成热图像的质量,最高提升可达 52.5%。
    Abstract Since thermal imagery offers a unique modality to investigate pain, the U.S. National Institutes of Health (NIH) has collected a large and diverse set of cancer patient facial thermograms for AI-based pain research. However, differing angles from camera capture between thermal and visible sensors has led to misalignment between Visible-Thermal (VT) images. We modernize the classic computer vision task of image registration by applying and modifying a generative alignment algorithm to register VT cancer faces, without the need for a reference or alignment parameters. By registering VT faces, we demonstrate that the quality of thermal images produced in the generative AI downstream task of Visible-to-Thermal (V2T) image translation significantly improves up to 52.5\%, than without registration. Images in this paper have been approved by the NIH NCI for public dissemination.
    摘要 由于热影像可以提供一种独特的方式来研究疼痛,美国国家医学研究院(NIH)已经收集了大量和多样化的癌症患者脸部热影像,用于人工智能基于痛症研究。然而,相机捕捉的角度差异导致热影像和可见感器拍摄的图像不一致,这导致了可见热图像的注册问题。我们使用和修改生成对齐算法,以无需参考或对齐参数,对热照相机拍摄的癌症脸部进行注册。通过注册热照相机拍摄,我们证明了在生成AI下渠道任务中,将可见图像翻译成热图像的质量显著提高,比无注册情况提高至52.5%。图像在本文中已经获得了NIH NCI的批准,可以公开发布。

MolGrapher: Graph-based Visual Recognition of Chemical Structures

  • paper_url: http://arxiv.org/abs/2308.12234
  • repo_url: https://github.com/ds4sd/molgrapher
  • paper_authors: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu
  • for: 本研究旨在提高化学文献自动分析的效率,以促进新材料和药物的发现。
  • methods: 本研究使用了深度键点检测器和图学神经网络来自动识别化学结构。
  • results: 对五个数据集进行了广泛的实验,结果表明,我们的方法在大多数情况下与经典和学习基于方法相比,有显著的优异表现。
    Abstract The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.
    摘要 自动分析化学文献的潜在可能性非常大,可以加速发现新材料和药物。文献中大量关键信息都集中在图像中,其中包括分子结构。然而,自动解析图像中的具体化学结构是一项具有挑战性的任务,原因在于图像中的信息量、绘制风格的多样性以及需要训练数据。在这项工作中,我们介绍了MolGrapher,一种可视化化学结构的识别算法。首先,我们使用深度关键点检测器检测原子。其次,我们将所有候选原子和键视为图像中的节点,并将它们建立成一个图。这种构建方式允许自然地表示分子的图像。最后,我们使用图 neural network 来分类原子和键节点。因为缺乏真实的训练数据,我们提出了一个生成 sintetic 数据的管道,以生成多样化和真实的结果。此外,我们还介绍了一个大规模的注释实验室, USPTO-30K,以促进这一重要领域的研究。我们在五个数据集上进行了广泛的实验,结果显示,我们的方法在大多数情况下与 класси方法和学习型方法相比,表现出了显著的优势。代码、模型和数据集都可以获得。

SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12231
  • repo_url: https://github.com/xq141839/sppnet
  • paper_authors: Qing Xu, Wenwei Kuang, Zeyu Zhang, Xueyao Bao, Haoran Chen, Wenting Duan
  • for: 这个研究旨在提出一个单点提示网络(SPPNet),用于核仁像分类,以解决目前的模型存在大量参数和训练成本的问题。
  • methods: 该模型使用轻量级视觉 Transformer(ViT)取代原始的图像编码器,并并联加入一个有效的卷积模块,以提取图像中的低层语义信息,弥补小型编码器带来的性能损失。
  • results: 研究显示 SPPNet 优于现有的 U 形架构,且训练收敛更快。与 Segment Anything 模型相比,SPPNet 的推理速度约快 20 倍,参数量和计算成本仅约为其 1/70。此外,该模型在训练和推理阶段都只需要一组提示点,更适合临床应用。
    Abstract Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model exists two major issues for cell segmentation: (1) the image encoder of the segment anything model involves a large number of parameters. Retraining or even fine-tuning the model still requires expensive computational resources. (2) in point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The result demonstrated that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 parameters and computational cost. Particularly, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at https://github.com/xq141839/SPPNet.
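
A hedged sketch of single-point prompt sampling with a Gaussian kernel; the exact formulation in SPPNet may differ, and here the prompt is drawn from the ground-truth nucleus mask with probabilities given by a Gaussian centred at the mask centroid:

```python
import numpy as np

def sample_prompt_point(mask, sigma=3.0, rng=None):
    """Draw one prompt point from a boolean nucleus mask, biased toward its centre."""
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    weights = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    idx = rng.choice(len(ys), p=weights / weights.sum())
    return int(ys[idx]), int(xs[idx])

mask = np.zeros((64, 64), dtype=bool)
mask[20:35, 25:40] = True                       # toy nucleus mask
print(sample_prompt_point(mask))                # one (row, col) prompt per nucleus
```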

cs.AI - 2023-08-24

FaceTouch: Detecting hand-to-face touch with supervised contrastive learning to assist in tracing infectious disease

  • paper_url: http://arxiv.org/abs/2308.12840
  • repo_url: None
  • paper_authors: Mohamed R. Ibrahim, Terry Lyons
  • for: 本研究旨在提出一种基于深度学习的计算机视觉框架,以探索在复杂的城市场景中自动检测人员之间的手带面接触。
  • methods: 该框架基于深度学习的两个子模型,一个用于检测人员,另一个用于分析人员的动作。 FaceTouch 使用RGB图像来检测手带面接触,并利用人体姿势such as arm movement来减少部分遮挡。
  • results: 研究表明,FaceTouch 在 Complex urban scenes 中能够准确检测手带面接触,并在未经过其他数据集训练的情况下显示了强验证能力。
    Abstract Through our respiratory system, many viruses and diseases frequently spread and pass from one person to another. Covid-19 served as an example of how crucial it is to track down and cut back on contacts to stop its spread. There is a clear gap in finding automatic methods that can detect hand-to-face contact in complex urban scenes or indoors. In this paper, we introduce a computer vision framework, called FaceTouch, based on deep learning. It comprises deep sub-models to detect humans and analyse their actions. FaceTouch seeks to detect hand-to-face touches in the wild, such as through video chats, bus footage, or CCTV feeds. Despite partial occlusion of faces, the introduced system learns to detect face touches from the RGB representation of a given scene by utilising the representation of the body gestures such as arm movement. This has been demonstrated to be useful in complex urban scenarios beyond simply identifying hand movement and its closeness to faces. Relying on Supervised Contrastive Learning, the introduced model is trained on our collected dataset, given the absence of other benchmark datasets. The framework shows a strong validation in unseen datasets which opens the door for potential deployment.
    摘要 许多病毒和疾病经常通过呼吸系统从一个人传播到另一个人。COVID-19 就是一个例子,说明了追踪并减少接触以阻断其传播的重要性。然而,在复杂的城市场景或室内环境中,目前缺乏能够自动检测手部与面部接触的方法。在这篇文章中,我们介绍了一个基于深度学习的计算机视觉框架,称为 FaceTouch。这个框架包括多个深度子模型,用于检测人并分析其动作。FaceTouch 旨在在真实场景(如视频聊天、公交车录像或 CCTV 画面)中检测手部与面部的接触。即使人脸部分被遮挡,该系统也能从场景的 RGB 表示中检测手部与面部的接触,并利用人体姿势(如手臂运动)作为线索。这在复杂的城市场景中被证明是有用的,而不仅仅是识别手部运动及其与面部的距离。由于缺乏其他基准数据集,我们基于 Supervised Contrastive Learning,在自行收集的数据集上训练该模型。该模型在未见过的数据集上表现出很强的验证效果,为实际部署打开了大门。
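
A minimal, simplified supervised contrastive loss of the kind the framework is trained with (a per-pair variant of the Khosla et al. formulation); the encoder and batch construction are omitted and the embeddings below are random placeholders:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    # features: (N, D) embeddings, labels: (N,) class ids
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                  # pairwise similarities
    mask_self = torch.eye(len(labels), dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self    # same-class pairs
    log_prob = sim - torch.logsumexp(sim.masked_fill(mask_self, -1e9), dim=1, keepdim=True)
    return -(log_prob[pos]).sum() / pos.sum().clamp(min=1)

emb = torch.randn(32, 128)                 # encoder outputs for person/hand crops (placeholder)
lab = torch.randint(0, 2, (32,))           # 1 = hand-to-face touch, 0 = no touch
print(supervised_contrastive_loss(emb, lab))
```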

Short Run Transit Route Planning Decision Support System Using a Deep Learning-Based Weighted Graph

  • paper_url: http://arxiv.org/abs/2308.12828
  • repo_url: None
  • paper_authors: Nadav Shalit, Michael Fire, Dima Kagan, Eran Ben-Elia
  • for: 提高公共交通服务的效率和可靠性,帮助公共交通规划者快速找到更好的短期路线改进方案。
  • methods: 使用深度学习技术构建决策支持系统,对多种数据来源(如 GTFS 和智能卡数据)进行处理和建模,并采用自监督方式训练模型以预测路段延迟值。这些延迟值被用作交通图的边权重,以便高效地搜索路径。
  • results: 在特拉维夫(Tel Aviv)的评估中,我们在超过 9% 的线路(包括市内和郊区线路)上缩短了行程时间,体现了该模型在改进公共交通服务方面的通用性和有效性。
    Abstract Public transport routing plays a crucial role in transit network design, ensuring a satisfactory level of service for passengers. However, current routing solutions rely on traditional operational research heuristics, which can be time-consuming to implement and lack the ability to provide quick solutions. Here, we propose a novel deep learning-based methodology for a decision support system that enables public transport (PT) planners to identify short-term route improvements rapidly. By seamlessly adjusting specific sections of routes between two stops during specific times of the day, our method effectively reduces times and enhances PT services. Leveraging diverse data sources such as GTFS and smart card data, we extract features and model the transportation network as a directed graph. Using self-supervision, we train a deep learning model for predicting lateness values for road segments. These lateness values are then utilized as edge weights in the transportation graph, enabling efficient path searching. Through evaluating the method on Tel Aviv, we are able to reduce times on more than 9\% of the routes. The improved routes included both intraurban and suburban routes showcasing a fact highlighting the model's versatility. The findings emphasize the potential of our data-driven decision support system to enhance public transport and city logistics, promoting greater efficiency and reliability in PT services.
    摘要 公共交通路径规划在公共交通网络设计中发挥重要作用,确保乘客获得满意的服务水平。然而,当前的路径解决方案通常基于传统的操作研究策略,可能需要较长时间来实现并且缺乏快速解决方案。在这里,我们提出了一种基于深度学习的决策支持系统,可以帮助公共交通(PT)规划人员在短时间内迅速地提高路径。通过在两个停站之间的特定路段进行轻量级调整,我们的方法可以减少时间并提高PT服务质量。我们利用了多种数据源,如GTFS和智能卡数据,提取特征并将交通网络模型为指定图。使用无监督学习,我们训练了一个深度学习模型,可以预测路段延迟值。这些延迟值然后被用作路径搜索的边重量,使得路径搜索更加高效。通过对特拉维夫进行评估,我们可以减少路线的时间超过9%。改进的路线包括城市内和郊区路线,这一结果表明模型的 universality。这些发现强调了我们数据驱动的决策支持系统的潜在能力,推动公共交通和城市物流的更高效和可靠性。
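
A toy sketch of the last step described above: predicted segment delays become edge weights of a directed stop graph and candidate routes are compared by shortest-path search. The stops and delay values are made up for illustration:

```python
import networkx as nx

predicted_delay = {("A", "B"): 4.0, ("B", "C"): 9.5, ("A", "D"): 3.0,
                   ("D", "C"): 5.0, ("C", "E"): 2.0}   # minutes, as if output by the DL model

g = nx.DiGraph()
for (u, v), delay in predicted_delay.items():
    g.add_edge(u, v, weight=delay)

path = nx.shortest_path(g, "A", "E", weight="weight")
cost = nx.shortest_path_length(g, "A", "E", weight="weight")
print(path, cost)    # ['A', 'D', 'C', 'E'] 10.0 -> suggests rerouting the A-C section via D
```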

Job Shop Scheduling Benchmark: Environments and Instances for Learning and Non-learning Methods

  • paper_url: http://arxiv.org/abs/2308.12794
  • repo_url: https://github.com/ai-for-decision-making-tue/job_shop_scheduling_benchmark
  • paper_authors: Robbert Reijnen, Kjell van Straaten, Zaharah Bukhsh, Yingqian Zhang
  • for: 为研究人员、从业者和爱好者提供一个集中的基准平台,以解决机器调度问题。
  • methods: 通过开源的 GitHub 仓库提供丰富的基准测试环境与算例,涵盖各类机器调度问题,包括 Job Shop Scheduling (JSP)、Flow Shop Scheduling (FSP)、Flexible Job Shop Scheduling (FJSP)、FJSP with Assembly constraints (FAJSP)、FJSP with Sequence-Dependent Setup Times (FJSP-SDST) 和在线 FJSP(作业在线到达)。
  • results: 为研究人员、从业者和爱好者提供了一个集中解决机器调度挑战的平台。
    Abstract We introduce an open-source GitHub repository containing comprehensive benchmarks for a wide range of machine scheduling problems, including Job Shop Scheduling (JSP), Flow Shop Scheduling (FSP), Flexible Job Shop Scheduling (FJSP), FJSP with Assembly constraints (FAJSP), FJSP with Sequence-Dependent Setup Times (FJSP-SDST), and the online FJSP (with online job arrivals). Our primary goal is to provide a centralized hub for researchers, practitioners, and enthusiasts interested in tackling machine scheduling challenges.
    摘要 我们介绍一个开源的GitHub存储库,包含了各种机器调度问题的完整的benchmark,包括作业shop调度(JSP)、流shop调度(FSP)、可变作业shop调度(FJSP)、FJSP具有组装约束(FAJSP)、FJSP具有时间序列依赖的设置(FJSP-SDST)以及在线FJSP。我们的主要目标是为研究人员、实践者和爱好者提供一个中心化的平台,以便他们可以解决机器调度挑战。

Acquiring Qualitative Explainable Graphs for Automated Driving Scene Interpretation

  • paper_url: http://arxiv.org/abs/2308.12755
  • repo_url: None
  • paper_authors: Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar, Helge Spieker
  • for: 这篇论文旨在提出一种新的自动驾驶场景表示方法,以便更好地解释自动驾驶的决策。
  • methods: 该方法基于 Qualitative Constraint Acquisition paradigm,可以快速计算出自动驾驶场景的Qualitative eXplainable Graph。
  • results: 实验结果表明,这种方法可以在实时计算和快速存储的情况下构建自动驾驶场景的Qualitative eXplainable Graph,这使得它成为可能有用的工具 для提高自动驾驶的识别和控制过程。
    Abstract The future of automated driving (AD) is rooted in the development of robust, fair and explainable artificial intelligence methods. Upon request, automated vehicles must be able to explain their decisions to the driver and the car passengers, to the pedestrians and other vulnerable road users and potentially to external auditors in case of accidents. However, nowadays, most explainable methods still rely on quantitative analysis of the AD scene representations captured by multiple sensors. This paper proposes a novel representation of AD scenes, called Qualitative eXplainable Graph (QXG), dedicated to qualitative spatiotemporal reasoning of long-term scenes. The construction of this graph exploits the recent Qualitative Constraint Acquisition paradigm. Our experimental results on NuScenes, an open real-world multi-modal dataset, show that the qualitative eXplainable graph of an AD scene composed of 40 frames can be computed in real-time and light in space storage which makes it a potentially interesting tool for improved and more trustworthy perception and control processes in AD.
    摘要

Motion In-Betweening with Phase Manifolds

  • paper_url: http://arxiv.org/abs/2308.12751
  • repo_url: https://github.com/pauzii/phasebetweener
  • paper_authors: Paul Starke, Sebastian Starke, Taku Komura, Frank Steinicke
  • for: This paper introduces a novel data-driven motion in-betweening system to reach target poses of characters.
  • methods: The paper uses a mixture-of-experts neural network model, a Periodic Autoencoder, and a learned bi-directional control scheme to generate smooth and realistic character movements.
  • results: The proposed framework can compete with popular state-of-the-art methods for motion in-betweening in terms of motion quality and generalization, especially in the existence of long transition durations, and can also synthesize more challenging movements beyond locomotion behaviors. Additionally, style control is enabled between given target keyframes.
    Abstract This paper introduces a novel data-driven motion in-betweening system to reach target poses of characters by making use of phases variables learned by a Periodic Autoencoder. Our approach utilizes a mixture-of-experts neural network model, in which the phases cluster movements in both space and time with different expert weights. Each generated set of weights then produces a sequence of poses in an autoregressive manner between the current and target state of the character. In addition, to satisfy poses which are manually modified by the animators or where certain end effectors serve as constraints to be reached by the animation, a learned bi-directional control scheme is implemented to satisfy such constraints. The results demonstrate that using phases for motion in-betweening tasks sharpen the interpolated movements, and furthermore stabilizes the learning process. Moreover, using phases for motion in-betweening tasks can also synthesize more challenging movements beyond locomotion behaviors. Additionally, style control is enabled between given target keyframes. Our proposed framework can compete with popular state-of-the-art methods for motion in-betweening in terms of motion quality and generalization, especially in the existence of long transition durations. Our framework contributes to faster prototyping workflows for creating animated character sequences, which is of enormous interest for the game and film industry.
    摘要

Separating the Human Touch from AI-Generated Text using Higher Criticism: An Information-Theoretic Approach

  • paper_url: http://arxiv.org/abs/2308.12747
  • repo_url: None
  • paper_authors: Alon Kipnis
  • for: 本研究旨在判断一篇文章是否完全由生成语言模型编写,或者另一种情况下,文章包含了一些重要的人工编辑。
  • methods: 本研究使用多种困惑测试来评估文章中各句的起源,并将这些测试结果组合使用高等批判(HC)。这种方法可以同时判断各句的起源是否为生成语言模型所致,以及哪些句子可能有人工编辑。
  • results: 研究使用实际数据进行了证明,并分析了影响方法效果的因素。这种分析提出了一些有趣的开放挑战,解决这些挑战可能会提高方法的效果。
    Abstract We propose a method to determine whether a given article was entirely written by a generative language model versus an alternative situation in which the article includes some significant edits by a different author, possibly a human. Our process involves many perplexity tests for the origin of individual sentences or other text atoms, combining these multiple tests using Higher Criticism (HC). As a by-product, the method identifies parts suspected to be edited. The method is motivated by the convergence of the log-perplexity to the cross-entropy rate and by a statistical model for edited text saying that sentences are mostly generated by the language model, except perhaps for a few sentences that might have originated via a different mechanism. We demonstrate the effectiveness of our method using real data and analyze the factors affecting its success. This analysis raises several interesting open challenges whose resolution may improve the method's effectiveness.
    摘要 我们提出了一种方法,用于判断一篇文章是否完全由生成语言模型写成,或者是一种另外的情况,文章包含了一些重要的人工修改。我们的过程包括多种plexity测试,对各个句子或其他文本元素的起源进行组合使用高等批判(HC)。这种方法可以标识可能有人工修改的部分。我们的方法受到了log-plexity converging到cross-entropy rate的统计模型,以及一种编辑文本的统计模型,即句子主要由语言模型生成,除了一些句子可能通过不同的机制生成。我们使用实际数据进行示例,并分析了这种方法的成功因素。这种分析提出了一些有趣的开放挑战,解决这些挑战可能会提高方法的效果。
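
A sketch of the Higher Criticism combination step described above; each sentence is assumed to already have a p-value from a perplexity-based test of whether it was generated by the language model:

```python
import numpy as np

def higher_criticism(pvalues, gamma=0.4):
    """Standard HC statistic over sorted per-sentence p-values."""
    p = np.sort(np.asarray(pvalues))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    k = max(1, int(gamma * n))            # search only the smallest gamma*n p-values
    return hc[:k].max()

# Mostly model-like sentences (uniform p-values) plus a few human-edited outliers.
pvals = np.concatenate([np.random.uniform(size=50), [1e-4, 5e-4, 1e-3]])
print(higher_criticism(pvals))            # large value -> evidence that some sentences were edited
```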

Human Comprehensible Active Learning of Genome-Scale Metabolic Networks

  • paper_url: http://arxiv.org/abs/2308.12740
  • repo_url: None
  • paper_authors: Lun Ai, Shi-Shun Liang, Wang-Zhou Dai, Liam Hallett, Stephen H. Muggleton, Geoff S. Baldwin
  • for: 这篇论文旨在改进合成生物学中设计-构建-测试-学习(DBTL)循环的实验设计并降低实验成本。
  • methods: 该论文提出了一种基于归纳逻辑编程(ILP)的新机器学习框架 ILP-iML1515,该框架通过溯因逻辑推理,并从营养缺陷型突变株实验中学习新的逻辑结构来更新模型。
  • results: ILP-iML1515可以高速进行模拟和活动地选择实验,并且可以在测试新功能蛋白时降低实验成本。
    Abstract An important application of Synthetic Biology is the engineering of the host cell system to yield useful products. However, an increase in the scale of the host system leads to huge design space and requires a large number of validation trials with high experimental costs. A comprehensible machine learning approach that efficiently explores the hypothesis space and guides experimental design is urgently needed for the Design-Build-Test-Learn (DBTL) cycle of the host cell system. We introduce a novel machine learning framework ILP-iML1515 based on Inductive Logic Programming (ILP) that performs abductive logical reasoning and actively learns from training examples. In contrast to numerical models, ILP-iML1515 is built on comprehensible logical representations of a genome-scale metabolic model and can update the model by learning new logical structures from auxotrophic mutant trials. The ILP-iML1515 framework 1) allows high-throughput simulations and 2) actively selects experiments that reduce the experimental cost of learning gene functions in comparison to randomly selected experiments.
    摘要 合成生物学的重要应用之一是通过改造宿主细胞系统来生产有用产品。然而,随着宿主系统规模的扩大,设计空间急剧增长,需要大量验证实验,实验成本很高。我们迫切需要一种可理解的机器学习方法,能够有效地探索假设空间并指导实验设计,以支持宿主细胞系统的 DBTL(设计-构建-测试-学习)循环。我们提出了一种基于归纳逻辑编程(ILP)的机器学习框架 ILP-iML1515,它能够进行溯因逻辑推理,并从训练样本中主动学习。与数值模型不同,ILP-iML1515 基于可理解的基因组规模代谢模型的逻辑表示,可以通过从营养缺陷型突变株实验中学习新的逻辑结构来更新模型。此外,ILP-iML1515 框架支持高通量模拟,并能主动选择实验,相比随机选择实验,可降低学习基因功能的实验成本。

Asymmetric Co-Training with Explainable Cell Graph Ensembling for Histopathological Image Classification

  • paper_url: http://arxiv.org/abs/2308.12737
  • repo_url: None
  • paper_authors: Ziqi Yang, Zhongyu Li, Chen Liu, Xiangde Luo, Xingguang Wang, Dou Xu, Chaoqun Li, Xiaoying Qin, Meng Yang, Long Jin
  • for: This paper focuses on multi-class histopathological image classification, with the goal of improving explainability and performance.
  • methods: The proposed method combines a deep graph convolutional network and a convolutional neural network, with an asymmetric co-training framework to dynamically integrate pixel-level and cell-level information.
  • results: The proposed method achieves superior performance, explainability, and generalizability in multi-class histopathological image classification, as demonstrated on private and public datasets.
    Abstract Convolutional neural networks excel in histopathological image classification, yet their pixel-level focus hampers explainability. Conversely, emerging graph convolutional networks spotlight cell-level features and medical implications. However, limited by their shallowness and suboptimal use of high-dimensional pixel data, GCNs underperform in multi-class histopathological image classification. To make full use of pixel-level and cell-level features dynamically, we propose an asymmetric co-training framework combining a deep graph convolutional network and a convolutional neural network for multi-class histopathological image classification. To improve the explainability of the entire framework by embedding morphological and topological distribution of cells, we build a 14-layer deep graph convolutional network to handle cell graph data. For the further utilization and dynamic interactions between pixel-level and cell-level information, we also design a co-training strategy to integrate the two asymmetric branches. Notably, we collect a private clinically acquired dataset termed LUAD7C, including seven subtypes of lung adenocarcinoma, which is rare and more challenging. We evaluated our approach on the private LUAD7C and public colorectal cancer datasets, showcasing its superior performance, explainability, and generalizability in multi-class histopathological image classification.
    摘要 convolutional neural networks 在 Histopathological 图像分类中表现出色,但是它们的像素级别关注使得解释性受限。相反,出现在的图像 convolutional neural networks 注重 cell 级别特征和医学意义。然而,由于它们的浅度和高维像素数据的不佳使用,GCNs在多类 Histopathological 图像分类中表现不佳。为了在动态地使用像素级别和 cell 级别特征,我们提议一种不对称 co-training 框架, combining a deep graph convolutional network 和一个 convolutional neural network для多类 Histopathological 图像分类。为了提高整个框架的解释性,我们建立了一个 14 层深的 graph convolutional network 来处理 cell graph 数据。此外,我们还设计了一种 co-training 策略,以实现像素级别和 cell 级别信息之间的动态交互。值得一提的是,我们收集了一个私有的临床获得的数据集 termed LUAD7C,包括七种肺adenocarcinoma 的亚型,这种数据集是罕见且更加挑战。我们对私人 LUAD7C 和公共的 colorectal cancer 数据集进行评估,展示了我们的方法在多类 Histopathological 图像分类中的优秀表现、解释性和普适性。

DeepLOC: Deep Learning-based Bone Pathology Localization and Classification in Wrist X-ray Images

  • paper_url: http://arxiv.org/abs/2308.12727
  • repo_url: https://github.com/olegrgv/DeepLOC
  • paper_authors: Razan Dibo, Andrey Galichin, Pavel Astashev, Dmitry V. Dylov, Oleg Y. Rogov
  • for: 该论文旨在帮助放射学家更加准确和高效地分析骨病变图像。
  • methods: 该方法结合了YOLO和Shifted Window Transformer(Swin),并提出了一个新的块来解决骨病变图像分类和定位的两大挑战。YOLO框架用于检测和定位骨病变,利用其实时物体检测功能;而Swin则用于从定位区域中提取 Contextual information,以准确地分类骨病变。
  • results: 该方法可以准确地定位和分类骨病变,并且可以提高放射学家的分析效率和准确率。
    Abstract In recent years, computer-aided diagnosis systems have shown great potential in assisting radiologists with accurate and efficient medical image analysis. This paper presents a novel approach for bone pathology localization and classification in wrist X-ray images using a combination of YOLO (You Only Look Once) and the Shifted Window Transformer (Swin) with a newly proposed block. The proposed methodology addresses two critical challenges in wrist X-ray analysis: accurate localization of bone pathologies and precise classification of abnormalities. The YOLO framework is employed to detect and localize bone pathologies, leveraging its real-time object detection capabilities. Additionally, the Swin, a transformer-based module, is utilized to extract contextual information from the localized regions of interest (ROIs) for accurate classification.

Continuous Reinforcement Learning-based Dynamic Difficulty Adjustment in a Visual Working Memory Game

  • paper_url: http://arxiv.org/abs/2308.12726
  • repo_url: None
  • paper_authors: Masoud Rahimi, Hadi Moradi, Abdol-hossein Vahabie, Hamed Kebriaei
  • for: The paper proposes a reinforcement learning-based dynamic difficulty adjustment method to improve the player's game experience.
  • methods: The method uses continuous reinforcement learning (RL) and a visual working memory (VWM) game to handle the complex search space of memorization difficulty.
  • results: In a within-subject experiment with 52 participants, the method yielded a significantly better game experience in terms of competence, tension, and negative and positive affect. Players also achieved higher scores and win rates, and the difficulty adjustment led to a significantly smaller decline in score over a 20-trial session.
    Abstract Dynamic Difficulty Adjustment (DDA) is a viable approach to enhance a player's experience in video games. Recently, Reinforcement Learning (RL) methods have been employed for DDA in non-competitive games; nevertheless, they rely solely on discrete state-action space with a small search space. In this paper, we propose a continuous RL-based DDA methodology for a visual working memory (VWM) game to handle the complex search space for the difficulty of memorization. The proposed RL-based DDA tailors game difficulty based on the player's score and game difficulty in the last trial. We defined a continuous metric for the difficulty of memorization. Then, we consider the task difficulty and the vector of difficulty-score as the RL's action and state, respectively. We evaluated the proposed method through a within-subject experiment involving 52 subjects. The proposed approach was compared with two rule-based difficulty adjustment methods in terms of player's score and game experience measured by a questionnaire. The proposed RL-based approach resulted in a significantly better game experience in terms of competence, tension, and negative and positive affect. Players also achieved higher scores and win rates. Furthermore, the proposed RL-based DDA led to a significantly less decline in the score in a 20-trial session.
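
To make the RL formulation above concrete, the sketch below casts one game session as a loop where the state is the previous (difficulty, score) pair and the action is the next continuous difficulty. Both `play_trial` and `policy` are toy placeholders, not the paper's VWM game or trained agent.

```python
import random

def play_trial(difficulty):
    """Placeholder for the memorization game: score drops with difficulty plus
    noise (purely illustrative, not the paper's actual game)."""
    return max(0.0, min(1.0, 1.0 - 0.8 * difficulty + random.uniform(-0.1, 0.1)))

def policy(state):
    """Stand-in for the trained continuous-action RL policy.
    state = (previous difficulty, previous score); action = next difficulty."""
    prev_difficulty, prev_score = state
    # Nudge difficulty so the player's score stays near a target of ~0.7.
    return max(0.0, min(1.0, prev_difficulty + 0.3 * (prev_score - 0.7)))

state = (0.5, 0.5)
for trial in range(20):
    difficulty = policy(state)      # RL action: continuous difficulty
    score = play_trial(difficulty)  # player outcome for this trial
    state = (difficulty, score)     # RL state fed back to the policy
    print(f"trial {trial:2d}  difficulty={difficulty:.2f}  score={score:.2f}")
```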

VIGC: Visual Instruction Generation and Correction

  • paper_url: http://arxiv.org/abs/2308.12714
  • repo_url: None
  • paper_authors: Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He
  • for: The paper aims to address the challenge of obtaining high-quality instruction-tuning data for vision-language tasks, specifically by utilizing multimodal large language models (MLLMs) to generate such data.
  • methods: The proposed framework, Visual Instruction Generation and Correction (VIGC), consists of two main components: Visual Instruction Generation (VIG), which guides the vision-language model to generate diverse instruction-tuning data, and Visual Instruction Correction (VIC), which corrects inaccuracies in the generated data through an iterative update mechanism.
  • results: VIGC effectively enhances the quality of instruction-tuning data; experiments show improved benchmark performance compared to language-only data generation methods.
    Abstract The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes, suffering from understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models (MLLMs) to generate instruction data for vision-language tasks. However, it's worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. As a solution for addressing the current issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code will be made publicly available.

SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge

  • paper_url: http://arxiv.org/abs/2308.12682
  • repo_url: None
  • paper_authors: Rishi Hazra, Pedro Zuidberg Dos Martires, Luc De Raedt
  • for: The paper aims to combine large language models (LLMs) with heuristic planning to produce feasible and cost-effective plans.
  • methods: The proposed method uses the LLM to generate candidate actions (Say), evaluates their feasibility (Can) and long-term reward/payoff (Pay) with learnable domain knowledge, and uses heuristic search to select the best action sequence.
  • results: In evaluations, the model outperforms other LLM planning approaches and generates more feasible and cost-effective plans.
    Abstract Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length), remains a challenge, despite recent progress. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say) guided by learnable domain knowledge, that evaluates actions' feasibility (Can) and long-term reward/payoff (Pay), and heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches.
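
A minimal way to picture the Say/Can/Pay decomposition is to score each candidate action with three separate estimates and combine them before selection. The sketch below does this greedily for a single step; all three scoring functions are hypothetical placeholders (the paper learns Can and Pay models and searches over action sequences).

```python
import math

def say_score(history, action):
    """Placeholder for the LLM's log-probability of proposing `action`."""
    return -len(action) * 0.01  # assumption: shorter actions slightly preferred

def can_score(history, action):
    """Placeholder feasibility estimate (probability the action is executable)."""
    return 0.9 if "pick" in action or "go" in action else 0.4

def pay_score(history, action):
    """Placeholder long-term payoff estimate toward the goal."""
    return 0.8 if "goal" in action else 0.3

def select_action(history, candidates):
    # Combine the three estimates; the real method searches over sequences,
    # here we greedily pick a single next action for illustration.
    def combined(a):
        return (say_score(history, a)
                + math.log(can_score(history, a))
                + math.log(pay_score(history, a)))
    return max(candidates, key=combined)

print(select_action(["start"], ["go to goal room", "pick up key", "wave arms"]))
```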

LR-XFL: Logical Reasoning-based Explainable Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12681
  • repo_url: None
  • paper_authors: Yanci Zhang, Han Yu
  • for: This paper aims to improve the transparency and explainability of federated learning (FL) models by incorporating logic-based explanations into the FL framework.
  • methods: The proposed Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach involves FL clients creating local logic rules based on their local data and sending them to the FL server, which connects the local logic rules through a proper logical connector without requiring access to the raw data. The server aggregates the local model updates with weight values determined by the quality of the clients’ local data as reflected by their uploaded logic rules.
  • results: The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model’s robustness to errors.
    Abstract Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients' local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model's robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important.

Improving Translation Faithfulness of Large Language Models via Augmenting Instructions

  • paper_url: http://arxiv.org/abs/2308.12674
  • repo_url: https://github.com/pppa2019/swie_overmiss_llm4mt
  • paper_authors: Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
  • for: To improve the specialized capabilities of large language models (LLMs), such as machine translation, through low-cost instruction tuning.
  • methods: Proposes Segment-Weighted Instruction Embedding (SWIE) and the OVERMISS instruction-following dataset to address the limited local focus of LLM attention, which makes instructions easy to forget during decoding.
  • results: Applied to two mainstream open-source LLMs (BLOOM and LLaMA), experiments show that SWIE improves translation performance, especially for zero-shot and long-text translation, while OVERMISS improves translation performance and word-alignment-based faithfulness.
    Abstract Large Language Models (LLMs) present strong general capabilities, and a current compelling challenge is stimulating their specialized capabilities, such as machine translation, through low-cost instruction tuning. The standard instruction-following data is sequentially organized as the concatenation of an instruction, an input, and a response. As the attention mechanism of LLMs has limitations on local focus, LLMs tend to focus more on the words or sentences nearby at each position. This leads to a high risk of instruction forgetting during decoding. To alleviate the above issues, We propose SWIE (Segment-Weighted Instruction Embedding) and an instruction-following dataset OVERMISS. SWIE improves the model instruction understanding by adding a global instruction representation on the following input and response representations. OVERMISS improves model faithfulness by comparing over-translation and miss-translation results with the correct translation. We apply our methods to two main-stream open-source LLMs, BLOOM and LLaMA. The experimental results demonstrate significant improvements in translation performance with SWIE based on BLOOMZ-3b, particularly in zero-shot and long text translations due to reduced instruction forgetting risk. Additionally, OVERMISS outperforms the baseline in translation performance (e.g. an increase in BLEU scores from 0.69 to 3.12 and an average improvement of 0.48 percentage comet scores for LLaMA-7b) with further enhancements seen in models combining OVERMISS and SWIE (e.g. the BLUE scores increase up to 0.56 from English to German across three different backbones), and both exhibit improvements in the faithfulness metric based on word alignment.
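
The core of SWIE, as described above, is injecting a global representation of the instruction back into the input and response segments. The following PyTorch sketch illustrates one plausible realization with mean-pooled instruction embeddings and a per-segment weight; the pooling, weighting scheme, and `alpha` value are assumptions, not the paper's exact design.

```python
import torch

def apply_swie(token_embeds, segment_ids, alpha=0.1):
    """Sketch of Segment-Weighted Instruction Embedding: a pooled global
    representation of the instruction tokens is added back onto the input and
    response segments so the instruction is less easily "forgotten" far from
    the prompt. Pooling and weighting here are assumptions, not the exact scheme.
    token_embeds: (batch, seq, dim); segment_ids: 0=instruction, 1=input, 2=response."""
    instr_mask = (segment_ids == 0).float().unsqueeze(-1)          # (batch, seq, 1)
    global_instr = (token_embeds * instr_mask).sum(1) / instr_mask.sum(1).clamp(min=1.0)
    seg_weight = (segment_ids != 0).float().unsqueeze(-1) * alpha  # 0 on the instruction itself
    return token_embeds + seg_weight * global_instr.unsqueeze(1)

embeds = torch.randn(2, 6, 16)
segments = torch.tensor([[0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 2]])
print(apply_swie(embeds, segments).shape)  # torch.Size([2, 6, 16])
```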

Don’t Look into the Sun: Adversarial Solarization Attacks on Image Classifiers

  • paper_url: http://arxiv.org/abs/2308.12661
  • repo_url: https://github.com/paulgavrikov/adversarial_solarization
  • paper_authors: Paul Gavrikov, Janis Keuper
  • for: The paper aims to assess the robustness of deep neural networks against out-of-distribution inputs, particularly in autonomous driving and safety systems where malicious actors can digitally alter inputs to circumvent safety guards.
  • methods: Proposes an attack based on image solarization, which is conceptually straightforward yet does not jeopardize the global structure of natural images. Comprehensive evaluations on multiple ImageNet models show that the attack degrades accuracy significantly, provided solarization is not part of the training augmentations.
  • results: Even when solarization is included in training augmentations, no full immunity to accuracy deterioration is achieved. The attack can often be simplified into a black-box attack with model-independent parameters, and defenses against other corruptions do not consistently transfer, indicating that robustness evaluation of deep networks remains an open problem.
    Abstract Assessing the robustness of deep neural networks against out-of-distribution inputs is crucial, especially in safety-critical domains like autonomous driving, but also in safety systems where malicious actors can digitally alter inputs to circumvent safety guards. However, designing effective out-of-distribution tests that encompass all possible scenarios while preserving accurate label information is a challenging task. Existing methodologies often entail a compromise between variety and constraint levels for attacks and sometimes even both. In a first step towards a more holistic robustness evaluation of image classification models, we introduce an attack method based on image solarization that is conceptually straightforward yet avoids jeopardizing the global structure of natural images independent of the intensity. Through comprehensive evaluations of multiple ImageNet models, we demonstrate the attack's capacity to degrade accuracy significantly, provided it is not integrated into the training augmentations. Interestingly, even then, no full immunity to accuracy deterioration is achieved. In other settings, the attack can often be simplified into a black-box attack with model-independent parameters. Defenses against other corruptions do not consistently extend to be effective against our specific attack. Project website: https://github.com/paulgavrikov/adversarial_solarization
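
Solarization inverts all pixel values above a threshold, which is why the attack preserves the global image structure. A minimal black-box version of the idea, sweeping thresholds with Pillow's `ImageOps.solarize`, might look like the sketch below; the `classify` stand-in and the threshold grid are illustrative assumptions.

```python
import numpy as np
from PIL import Image, ImageOps

def classify(image):
    """Hypothetical stand-in for an ImageNet classifier returning a label id.
    In practice this would be a pretrained model."""
    return int(np.asarray(image).mean()) % 10

def solarization_attack(image, true_label, thresholds=range(0, 256, 8)):
    """Black-box solarization attack (sketch): sweep the solarization threshold
    and return the first solarized image that changes the model's prediction.
    Pixels above the threshold are inverted, so global structure is preserved."""
    for t in thresholds:
        candidate = ImageOps.solarize(image, threshold=t)
        if classify(candidate) != true_label:
            return candidate, t
    return None, None  # no threshold fooled the model

img = Image.fromarray(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
adv, t = solarization_attack(img, true_label=classify(img))
print("successful threshold:", t)
```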

kTrans: Knowledge-Aware Transformer for Binary Code Embedding

  • paper_url: http://arxiv.org/abs/2308.12659
  • repo_url: https://github.com/learner0x5a/ktrans-release
  • paper_authors: Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao, Chao Zhang
  • for: The paper proposes kTrans, a Transformer-based, knowledge-aware binary code embedding, to improve performance on downstream tasks.
  • methods: The method feeds explicit knowledge as additional inputs to a Transformer and fuses implicit knowledge through a novel pre-training task.
  • results: On three downstream tasks (binary code similarity detection, function type recovery, and indirect call recognition), kTrans produces high-quality binary code embeddings and outperforms state-of-the-art approaches by 5.2%, 6.8%, and 12.6%, respectively.
    Abstract Binary Code Embedding (BCE) has important applications in various reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery and data-flow analysis. Recent studies have shown that the Transformer model can comprehend the semantics of binary code to support downstream tasks. However, existing models overlooked the prior knowledge of assembly language. In this paper, we propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding. By feeding explicit knowledge as additional inputs to the Transformer, and fusing implicit knowledge with a novel pre-training task, kTrans provides a new perspective to incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR). Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at: https://github.com/Learner0x5a/kTrans-release

APART: Diverse Skill Discovery using All Pairs with Ascending Reward and DropouT

  • paper_url: http://arxiv.org/abs/2308.12649
  • repo_url: None
  • paper_authors: Hadar Schreiber Galler, Tom Zahavy, Guillaume Desjardins, Alon Cohen
  • for: The study targets diverse skill discovery in reward-free environments, aiming to discover all possible skills in simple grid-world environments where prior methods have struggled to succeed.
  • methods: The proposed APART method replaces the standard one-vs-all (softmax) discriminator with an all-pairs (one-vs-one) discriminator and combines it with a novel intrinsic reward function and a dropout regularization technique.
  • results: Experiments show that APART discovers all possible skills in grid worlds with far fewer samples than previous works. A simpler algorithm that alters VIC, rescales its intrinsic reward, and tunes the temperature of its softmax discriminator also achieves the maximum number of skills. These findings shed light on the factors underlying the success of skill discovery algorithms in reinforcement learning.
    Abstract We study diverse skill discovery in reward-free environments, aiming to discover all possible skills in simple grid-world environments where prior methods have struggled to succeed. This problem is formulated as mutual training of skills using an intrinsic reward and a discriminator trained to predict a skill given its trajectory. Our initial solution replaces the standard one-vs-all (softmax) discriminator with a one-vs-one (all pairs) discriminator and combines it with a novel intrinsic reward function and a dropout regularization technique. The combined approach is named APART: Diverse Skill Discovery using All Pairs with Ascending Reward and Dropout. We demonstrate that APART discovers all the possible skills in grid worlds with remarkably fewer samples than previous works. Motivated by the empirical success of APART, we further investigate an even simpler algorithm that achieves maximum skills by altering VIC, rescaling its intrinsic reward, and tuning the temperature of its softmax discriminator. We believe our findings shed light on the crucial factors underlying success of skill discovery algorithms in reinforcement learning.

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

  • paper_url: http://arxiv.org/abs/2308.12635
  • repo_url: https://github.com/huspacy/huspacy
  • paper_authors: György Orosz, Gergő Szabó, Péter Berkecz, Zsolt Szántó, Richárd Farkas
  • for: The paper presents industrial-grade Hungarian text processing models that balance resource efficiency and accuracy while achieving near state-of-the-art performance.
  • methods: The models are implemented in the spaCy framework, extending the HuSpaCy toolkit with several architectural improvements; the pipelines cover tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition.
  • results: The proposed enhancements are thoroughly evaluated and compared with existing NLP tools for Hungarian, demonstrating competitive performance in all text preprocessing steps.
    Abstract This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
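
Since the pipelines are distributed as spaCy models, using them follows the standard spaCy API. The snippet below is a hedged usage sketch based on the HuSpaCy project's documented workflow; the exact download/load helpers and the default model name are assumptions that should be checked against the repository README.

```python
# pip install huspacy
import huspacy

# Download and load the default Hungarian pipeline (helper names and the default
# model are assumed from the project docs; consult the README for current options).
huspacy.download()
nlp = huspacy.load()

doc = nlp("Budapest Magyarország fővárosa és egyben legnagyobb városa.")
for token in doc:
    # Tokenization, POS tags, lemmas and dependency labels via the standard spaCy API.
    print(token.text, token.pos_, token.lemma_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)  # named entities
```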

Towards Hierarchical Regional Transformer-based Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2308.12634
  • repo_url: None
  • paper_authors: Josef Cersovsky, Sadegh Mohammadi, Dagmar Kainmueller, Johannes Hoehne
  • for: The paper addresses the classification of gigapixel histopathology images to meet the needs of digital pathology and precision medicine.
  • methods: A transformer-based multiple instance learning model replaces the traditional learned attention mechanism with a regional, Vision Transformer-inspired self-attention mechanism. The method fuses regional patch information to derive slide-level predictions and stacks this regional aggregation to hierarchically process features at different distance levels.
  • results: Evaluated on two histopathology datasets, the approach outperforms the baseline, particularly for datasets with small, local morphological features. A method that focuses image processing on high-attention regions during inference further improves predictive accuracy.
    Abstract The classification of gigapixel histopathology images with deep multiple instance learning models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based multiple instance learning approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions and show how this regional aggregation can be stacked to hierarchically process features on different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method to focus the image processing on high attention regions during inference. Our approach is able to significantly improve performance over the baseline on two histopathology datasets and points towards promising directions for further research.

A Greedy Approach for Offering to Telecom Subscribers

  • paper_url: http://arxiv.org/abs/2308.12606
  • repo_url: None
  • paper_authors: Piyush Kanti Bhunre, Tanmay Sen, Arijit Sarkar
  • for: This paper is written for telecom operators to optimize offer campaigns for customer retention and churn prevention.
  • methods: The paper proposes a novel combinatorial algorithm for solving offer optimization under heterogeneous offers by maximizing expected revenue under the scenario of subscriber churn.
  • results: The proposed algorithm is efficient and accurate even for a very large subscriber base.
    Abstract Customer retention or churn prevention is a challenging task of a telecom operator. One of the effective approaches is to offer some attractive incentive or additional services or money to the subscribers for keeping them engaged and make sure they stay in the operator's network for longer time. Often, operators allocate certain amount of monetary budget to carry out the offer campaign. The difficult part of this campaign is the selection of a set of customers from a large subscriber-base and deciding the amount that should be offered to an individual so that operator's objective is achieved. There may be multiple objectives (e.g., maximizing revenue, minimizing number of churns) for selection of subscriber and selection of an offer to the selected subscriber. Apart from monetary benefit, offers may include additional data, SMS, hots-spot tethering, and many more. This problem is known as offer optimization. In this paper, we propose a novel combinatorial algorithm for solving offer optimization under heterogeneous offers by maximizing expected revenue under the scenario of subscriber churn, which is, in general, seen in telecom domain. The proposed algorithm is efficient and accurate even for a very large subscriber-base.
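
A common way to frame this kind of budgeted offer assignment is as a knapsack-style problem solved greedily by expected gain per unit cost. The sketch below illustrates that framing on toy data; the data schema, the expected-revenue formula, and the one-offer-per-subscriber rule are assumptions, not the paper's exact combinatorial algorithm.

```python
def greedy_offer_allocation(subscribers, offers, budget):
    """Greedy sketch: repeatedly assign the (subscriber, offer) pair with the best
    expected revenue gain per unit of cost until the budget is spent.
    subscribers: list of dicts with 'revenue' and per-offer 'retain_prob'.
    offers: dict offer_name -> cost."""
    candidates = []
    for i, sub in enumerate(subscribers):
        for offer, cost in offers.items():
            gain = sub["revenue"] * sub["retain_prob"][offer]  # expected saved revenue
            candidates.append((gain / cost, gain, cost, i, offer))
    candidates.sort(reverse=True)

    chosen, assigned, spent = [], set(), 0.0
    for ratio, gain, cost, i, offer in candidates:
        if i in assigned or spent + cost > budget:
            continue
        chosen.append((i, offer, gain))
        assigned.add(i)
        spent += cost
    return chosen, spent

subs = [{"revenue": 30.0, "retain_prob": {"data": 0.4, "cash": 0.6}},
        {"revenue": 12.0, "retain_prob": {"data": 0.5, "cash": 0.2}}]
print(greedy_offer_allocation(subs, {"data": 2.0, "cash": 5.0}, budget=6.0))
```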

APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

  • paper_url: http://arxiv.org/abs/2308.12605
  • repo_url: None
  • paper_authors: Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng
  • for: The paper proposes a diffusion-based text-to-video (T2V) generation network structure to address the lack of local consistency across frames in conventional diffusion models for video generation.
  • methods: The approach introduces an additional compact network, the Video Generation Transformer (VGT), to extract perturbations from the inherent information contained within the input, and uses a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, improving consistency across frames.
  • results: Experiments demonstrate a noticeable improvement in the consistency of generated videos, both qualitatively and quantitatively.
    Abstract Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively.

SICNN: Soft Interference Cancellation Inspired Neural Network Equalizers

  • paper_url: http://arxiv.org/abs/2308.12591
  • repo_url: None
  • paper_authors: Stefan Baumgartner, Oliver Lang, Mario Huemer
  • for: The paper proposes neural network-based equalization methods to address the high computational complexity and performance degradation of model-based equalization.
  • methods: Two neural network equalizer variants are proposed by deep unfolding of iterative soft interference cancellation: SICNNv1, tailored for single-carrier frequency-domain equalization systems, and SICNNv2, a more universal variant applicable to any communication system with a block-based data transmission scheme.
  • results: SICNNv1 outperforms the compared methods in bit error ratio and performs well across different training set sizes; a complexity analysis further demonstrates the practicality of the neural network-based equalizers.
    Abstract Equalization is an important task at the receiver side of a digital wireless communication system, which is traditionally conducted with model-based estimation methods. Among the numerous options for model-based equalization, iterative soft interference cancellation (SIC) is a well-performing approach since error propagation caused by hard decision data symbol estimation during the iterative estimation procedure is avoided. However, the model-based method suffers from high computational complexity and performance degradation due to required approximations. In this work, we propose a novel neural network (NN-)based equalization approach, referred to as SICNN, which is designed by deep unfolding of a model-based iterative SIC method, eliminating the main disadvantages of its model-based counterpart. We present different variants of SICNN. SICNNv1 is very similar to the model-based method, and is specifically tailored for single carrier frequency domain equalization systems, which is the communication system we regard in this work. The second variant, SICNNv2, is more universal, and is applicable as an equalizer in any communication system with a block-based data transmission scheme. We highlight the pros and cons of both variants. Moreover, for both SICNNv1 and SICNNv2 we present a version with a highly reduced number of learnable parameters. We compare the achieved bit error ratio performance of the proposed NN-based equalizers with state-of-the-art model-based and NN-based approaches, highlighting the superiority of SICNNv1 over all other methods. Also, we present a thorough complexity analysis of the proposed NN-based equalization approaches, and we investigate the influence of the training set size on the performance of NN-based equalizers.

A Huber Loss Minimization Approach to Byzantine Robust Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12581
  • repo_url: None
  • paper_authors: Puning Zhao, Fei Yu, Zhiguo Wan
  • for: To defend federated learning systems against Byzantine (adversarial) attacks, the paper proposes a novel aggregator based on Huber loss minimization and provides a comprehensive theoretical analysis.
  • methods: The aggregator minimizes a Huber loss over client updates and does not require precise knowledge of the ratio of attacked clients (epsilon).
  • results: Under the i.i.d. assumption, the method has optimal dependence on epsilon, allows clients to have unequal data sizes, and extends to non-i.i.d. data where clients have slightly different distributions.
    Abstract Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under independent and identically distributed (i.i.d) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on $\epsilon$, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of $\epsilon$. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d data, such that clients have slightly different distributions.
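
One standard way to minimize a sum of Huber losses over client updates is an iteratively reweighted averaging scheme in which far-away (potentially Byzantine) updates receive weight delta/distance instead of 1. The NumPy sketch below illustrates that idea; the Weiszfeld-style solver and the toy attack are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def huber_aggregate(client_updates, delta=1.0, iters=50):
    """Sketch of a Huber-loss-minimizing aggregator: find z minimizing
    sum_i huber(||z - x_i||) by iteratively reweighted averaging. Clients whose
    updates lie far from the consensus are down-weighted, limiting Byzantine
    influence."""
    x = np.asarray(client_updates, dtype=float)  # (n_clients, dim)
    z = x.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(x - z, axis=1)
        # Huber weights: 1 inside the quadratic region, delta/dist outside it.
        w = np.where(dist <= delta, 1.0, delta / np.maximum(dist, 1e-12))
        z = (w[:, None] * x).sum(axis=0) / w.sum()
    return z

honest = np.random.normal(0.0, 0.1, size=(8, 5))
byzantine = np.full((2, 5), 10.0)  # attacked clients send junk updates
print(huber_aggregate(np.vstack([honest, byzantine]), delta=0.5))
```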

Mind vs. Mouth: On Measuring Re-judge Inconsistency of Social Bias in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12578
  • repo_url: None
  • paper_authors: Yachao Zhao, Bo Wang, Dongming Zhao, Kun Huang, Yan Wang, Ruifang He, Yuexian Hou
  • for: The study investigates cognitive constructs in large language models (LLMs), in particular their similarity to the explicit and implicit social bias observed in humans.
  • methods: A two-stage approach is used: the LLM first automatically completes statements, then re-judges the statements it generated.
  • results: LLMs exhibit a "re-judge inconsistency": after automatically completing a statement that may incorporate implicit social bias, the model re-judges and contradicts the biased statement it generated itself. This phenomenon may parallel the inconsistency between humans' unconscious implicit social bias and their conscious explicit social bias.
    Abstract Recent researches indicate that Pre-trained Large Language Models (LLMs) possess cognitive constructs similar to those observed in humans, prompting researchers to investigate the cognitive aspects of LLMs. This paper focuses on explicit and implicit social bias, a distinctive two-level cognitive construct in psychology. It posits that individuals' explicit social bias, which is their conscious expression of bias in the statements, may differ from their implicit social bias, which represents their unconscious bias. We propose a two-stage approach and discover a parallel phenomenon in LLMs known as "re-judge inconsistency" in social bias. In the initial stage, the LLM is tasked with automatically completing statements, potentially incorporating implicit social bias. However, in the subsequent stage, the same LLM re-judges the biased statement generated by itself but contradicts it. We propose that this re-judge inconsistency can be similar to the inconsistency between human's unaware implicit social bias and their aware explicit social bias. Experimental investigations on ChatGPT and GPT-4 concerning common gender biases examined in psychology corroborate the highly stable nature of the re-judge inconsistency. This finding may suggest that diverse cognitive constructs emerge as LLMs' capabilities strengthen. Consequently, leveraging psychological theories can provide enhanced insights into the underlying mechanisms governing the expressions of explicit and implicit constructs in LLMs.

REB: Reducing Biases in Representation for Industrial Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.12577
  • repo_url: https://github.com/shuailyu/reb
  • paper_authors: Shuai Lyu, Dongmei Mo, Waikeung Wong
  • for: To improve industrial anomaly detection by reducing domain bias and local density bias in feature representations.
  • methods: Proposes Reducing Biases (REB), which addresses the domain bias of the pre-trained model through a self-supervised learning task and a defect generation strategy (DefectMaker), together with a local density KNN (LDKNN) to reduce the local density bias in feature space.
  • results: Achieves 99.5% AUROC on the widely used MVTec AD benchmark and 88.0% AUROC on the challenging MVTec LOCO AD dataset, improving on the previous state of the art. The results are obtained with smaller backbone networks (Vgg11 and Resnet18), indicating the effectiveness and efficiency of REB for practical industrial applications.
    Abstract Existing K-nearest neighbor (KNN) retrieval-based methods usually conduct industrial anomaly detection in two stages: obtain feature representations with a pre-trained CNN model and perform distance measures for defect detection. However, the features are not fully exploited as they ignore domain bias and the difference of local density in feature space, which limits the detection performance. In this paper, we propose Reducing Biases (REB) in representation by considering the domain bias of the pre-trained model and building a self-supervised learning task for better domain adaption with a defect generation strategy (DefectMaker) imitating the natural defects. Additionally, we propose a local density KNN (LDKNN) to reduce the local density bias and obtain effective anomaly detection. We achieve a promising result of 99.5\% AUROC on the widely used MVTec AD benchmark. We also achieve 88.0\% AUROC on the challenging MVTec LOCO AD dataset and bring an improvement of 4.7\% AUROC to the state-of-the-art result. All results are obtained with smaller backbone networks such as Vgg11 and Resnet18, which indicates the effectiveness and efficiency of REB for practical industrial applications.
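
The LDKNN idea can be pictured as a kNN anomaly score whose distance is rescaled by how densely the retrieved neighbors are packed, so sparse regions of the feature bank do not automatically look anomalous. The sketch below is an illustrative approximation of that idea, not the paper's exact formulation.

```python
import numpy as np

def ldknn_score(query_feats, memory_bank, k=5):
    """Local-density-aware kNN anomaly score (sketch). The raw kNN distance of a
    test patch feature is divided by the average neighbor-to-neighbor distance
    among its nearest training features, so dense and sparse regions of the
    memory bank are scored on a comparable scale."""
    scores = []
    for q in query_feats:
        d = np.linalg.norm(memory_bank - q, axis=1)
        nn_idx = np.argsort(d)[:k]
        knn_dist = d[nn_idx].mean()
        # Local density proxy: mean pairwise distance among the retrieved neighbors.
        neigh = memory_bank[nn_idx]
        pair = np.linalg.norm(neigh[:, None, :] - neigh[None, :, :], axis=-1)
        local_scale = pair.sum() / (k * (k - 1) + 1e-12)
        scores.append(knn_dist / (local_scale + 1e-6))
    return np.array(scores)

bank = np.random.normal(0, 1, size=(200, 32))   # normal training patch features
normal_q = np.random.normal(0, 1, size=(3, 32))
defect_q = np.random.normal(4, 1, size=(3, 32))
print(ldknn_score(normal_q, bank), ldknn_score(defect_q, bank))
```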

Exploring the Integration Strategies of Retriever and Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12574
  • repo_url: None
  • paper_authors: Ye Liu, Semih Yavuz, Rui Meng, Meghana Moorthy, Shafiq Joty, Caiming Xiong, Yingbo Zhou
  • for: The paper aims to improve open-domain question answering by investigating how retrieved passages are best combined with LLMs to enhance answer generation.
  • methods: Four strategies for integrating retrieved passages with LLMs are explored: two single-round methods that use chain-of-thought reasoning and two multi-round strategies that incorporate feedback loops.
  • results: Comprehensive analyses and experiments provide insights into how to effectively leverage retrieved passages to enhance the answer generation capability of LLMs.
    Abstract The integration of retrieved passages and large language models (LLMs), such as ChatGPTs, has significantly contributed to improving open-domain question answering. However, there is still a lack of exploration regarding the optimal approach for incorporating retrieved passages into the answer generation process. This paper aims to fill this gap by investigating different methods of combining retrieved passages with LLMs to enhance answer generation. We begin by examining the limitations of a commonly-used concatenation approach. Surprisingly, this approach often results in generating "unknown" outputs, even when the correct document is among the top-k retrieved passages. To address this issue, we explore four alternative strategies for integrating the retrieved passages with the LLMs. These strategies include two single-round methods that utilize chain-of-thought reasoning and two multi-round strategies that incorporate feedback loops. Through comprehensive analyses and experiments, we provide insightful observations on how to effectively leverage retrieved passages to enhance the answer generation capability of LLMs.

Conditional Kernel Imitation Learning for Continuous State Environments

  • paper_url: http://arxiv.org/abs/2308.12573
  • repo_url: None
  • paper_authors: Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar
  • for: The paper addresses imitation learning from observed behavior alone, without access to transition dynamics, reward structure, or any additional interactions with the environment.
  • methods: The framework is built on the Markov balance equation and conditional kernel density estimation of the environment's transition dynamics, and the estimators are shown to satisfy basic asymptotic consistency requirements.
  • results: Numerical experiments on continuous state benchmark environments show consistently superior empirical performance over many state-of-the-art imitation learning algorithms.
    Abstract Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most of RL, it does not assume availability of reward-feedback. Reward inference and shaping are known to be difficult and error-prone methods particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel conditional kernel density estimation-based imitation learning framework. It involves estimating the environment's transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
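
The key estimation step is a conditional kernel density estimate of the transition density p(s' | s, a) built directly from demonstration tuples. A generic Nadaraya-Watson-style version with Gaussian kernels is sketched below; the bandwidths and the toy 1-D demonstrations are illustrative assumptions.

```python
import numpy as np

def conditional_kde(s_query, a_query, dataset, h_sa=0.5, h_next=0.5):
    """Conditional kernel density estimate of p(s' | s, a) from demonstration
    tuples (s, a, s') using Gaussian kernels. Returns a function evaluating the
    estimated density at a candidate next state."""
    S = np.array([d[0] for d in dataset])
    A = np.array([d[1] for d in dataset])
    S_next = np.array([d[2] for d in dataset])
    # Kernel weights on the conditioning variables (s, a).
    w = np.exp(-0.5 * (np.linalg.norm(S - s_query, axis=1) ** 2
                       + np.linalg.norm(A - a_query, axis=1) ** 2) / h_sa ** 2)
    w = w / (w.sum() + 1e-12)

    def density(s_next):
        k = np.exp(-0.5 * np.linalg.norm(S_next - s_next, axis=1) ** 2 / h_next ** 2)
        k /= (np.sqrt(2 * np.pi) * h_next) ** s_next.shape[0]
        return float((w * k).sum())

    return density

# Toy 1-D demonstrations where s' = s + a plus noise.
data = [(np.array([s]), np.array([a]), np.array([s + a + np.random.normal(0, 0.1)]))
        for s in np.linspace(0, 1, 20) for a in (-0.1, 0.1)]
p = conditional_kde(np.array([0.5]), np.array([0.1]), data)
print(p(np.array([0.6])), p(np.array([0.0])))  # likely vs. unlikely next state
```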

A Co-training Approach for Noisy Time Series Learning

  • paper_url: http://arxiv.org/abs/2308.12551
  • repo_url: None
  • paper_authors: Weiqi Zhang, Jianfeng Zhang, Jia Li, Fugee Tsung
  • for: The work focuses on robust time series representation learning, assuming that real-world time series are noisy and that complementary information from different views of the same series plays an important role.
  • methods: Two different encoders create two views of the input time series, and the encoders are learned iteratively through co-training-based contrastive learning.
  • results: Experiments on four time series benchmarks in unsupervised and semi-supervised settings show that TS-CoT mitigates the impact of data noise and corruption, outperforms existing methods, and learns representations that transfer well to downstream tasks through fine-tuning.
    Abstract In this work, we focus on robust time series representation learning. Our assumption is that real-world time series is noisy and complementary information from different views of the same time series plays an important role while analyzing noisy input. Based on this, we create two views for the input time series through two different encoders. We conduct co-training based contrastive learning iteratively to learn the encoders. Our experiments demonstrate that this co-training approach leads to a significant improvement in performance. Especially, by leveraging the complementary information from different views, our proposed TS-CoT method can mitigate the impact of data noise and corruption. Empirical evaluations on four time series benchmarks in unsupervised and semi-supervised settings reveal that TS-CoT outperforms existing methods. Furthermore, the representations learned by TS-CoT can transfer well to downstream tasks through fine-tuning.
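
The co-training step pairs the two encoders' embeddings of the same series as positives in a contrastive objective. A standard InfoNCE-style cross-view loss capturing that idea is sketched below; TS-CoT's actual view construction and loss details may differ.

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(z1, z2, temperature=0.1):
    """Sketch of the co-training contrastive objective: embeddings of the same
    time series produced by two different encoders form positive pairs, while
    other series in the batch act as negatives (a standard InfoNCE form)."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) cross-view similarities
    targets = torch.arange(z1.size(0))  # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: two encoder views of a batch of 8 series, 64-dim embeddings.
z_view1 = torch.randn(8, 64)
z_view2 = torch.randn(8, 64)
print(cross_view_contrastive_loss(z_view1, z_view2))
```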

Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking

  • paper_url: http://arxiv.org/abs/2308.12549
  • repo_url: None
  • paper_authors: Teli Ma, Mengmeng Wang, Jimin Xiao, Huifeng Wu, Yong Liu
  • for: The paper proposes SyncTrack, a novel single-branch framework for 3D LiDAR object tracking.
  • methods: SyncTrack abandons the conventional Siamese paradigm in favor of a single-branch encoder that synchronizes feature extraction and matching, avoiding a second encoder pass for the template and search regions; it also introduces an Attentive Points-Sampling strategy in the Transformer layers in place of random/farthest point sampling.
  • results: Experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in real-time tracking.
    Abstract Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive Points-Sampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in real-time tracking.

CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

  • paper_url: http://arxiv.org/abs/2308.12539
  • repo_url: https://github.com/vipulgupta1011/calm
  • paper_authors: Vipul Gupta, Pranav Narayanan Venkit, Hugo Laurençon, Shomir Wilson, Rebecca J. Passonneau
  • for: To quantify and compare sociodemographic bias in language models (LMs) and its potential for harm.
  • methods: Introduces the Comprehensive Assessment of Language Model bias (CALM), a reliable benchmark that quantifies LM bias across three tasks by integrating 16 existing datasets from different domains, filtering 224 templates, and constructing a dataset of 78,400 examples.
  • results: Compared with prior datasets, CALM is more diverse and more robust to small perturbations, making it better suited to evaluating model bias. Evaluating 20 large language models shows that, in the OPT and Bloom series, larger models are more biased than smaller ones, the T0 series is the least biased, and some series exhibit a tradeoff between gender and racial bias as model size increases.
    Abstract As language models (LMs) become increasingly powerful, it is important to quantify and compare them for sociodemographic bias with potential for harm. Prior bias measurement datasets are sensitive to perturbations in their manually designed templates, therefore unreliable. To achieve reliability, we introduce the Comprehensive Assessment of Language Model bias (CALM), a benchmark dataset to quantify bias in LMs across three tasks. We integrate 16 existing datasets across different domains, such as Wikipedia and news articles, to filter 224 templates from which we construct a dataset of 78,400 examples. We compare the diversity of CALM with prior datasets on metrics such as average semantic similarity, and variation in template length, and test the sensitivity to small perturbations. We show that our dataset is more diverse and reliable than previous datasets, thus better capture the breadth of linguistic variation required to reliably evaluate model bias. We evaluate 20 large language models including six prominent families of LMs such as Llama-2. In two LM series, OPT and Bloom, we found that larger parameter models are more biased than lower parameter models. We found the T0 series of models to be the least biased. Furthermore, we noticed a tradeoff between gender and racial bias with increasing model size in some model series. The code is available at https://github.com/vipulgupta1011/CALM.

FedSoL: Bridging Global Alignment and Local Generality in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12532
  • repo_url: None
  • paper_authors: Gihun Lee, Minchan Jeong, Sangmook Kim, Jaehoon Oh, Se-Young Yun
  • for: To improve federated learning (FL) performance when client data distributions are heterogeneous.
  • methods: Federated Stability on Learning (FedSoL) combines the concepts of global alignment and local generality by having local learning seek a parameter region robust against proximal perturbations.
  • results: Experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.
    Abstract Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among different local objectives of the clients. Yet, it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines both the concepts of global alignment and local generality. In FedSoL, the local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter update. Our experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.
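
The "parameter region robust against proximal perturbations" can be read as a sharpness-aware-style local step: perturb the weights along the local-to-global difference, take the gradient of the unchanged local loss at that point, and apply it. The PyTorch sketch below illustrates that reading; the normalization, the fixed `rho`, and the plain SGD step are assumptions rather than the exact FedSoL update.

```python
import torch

def fedsol_local_step(model, global_params, batch, loss_fn, lr=0.01, rho=0.05):
    """One FedSoL-style local update (sketch): evaluate the local loss at weights
    perturbed along the proximal direction (toward the global model), then apply
    that gradient. The local objective itself is unchanged, but the update becomes
    robust to proximal perturbations."""
    x, y = batch
    params = list(model.parameters())

    # Proximal perturbation: direction of grad_w ||w - w_global||^2, scaled to norm rho.
    with torch.no_grad():
        diffs = [p - g for p, g in zip(params, global_params)]
        norm = torch.sqrt(sum((d ** 2).sum() for d in diffs)) + 1e-12
        for p, d in zip(params, diffs):
            p.add_(rho * d / norm)           # move to the perturbed point

    loss = loss_fn(model(x), y)              # local loss at the perturbed weights
    grads = torch.autograd.grad(loss, params)

    with torch.no_grad():
        for p, d, g in zip(params, diffs, grads):
            p.sub_(rho * d / norm)           # undo the perturbation
            p.sub_(lr * g)                   # SGD step with the robust gradient
    return loss.item()

model = torch.nn.Linear(4, 2)
global_params = [p.detach().clone() for p in model.parameters()]
batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
print(fedsol_local_step(model, global_params, batch, torch.nn.functional.cross_entropy))
```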

Not Only Rewards But Also Constraints: Applications on Legged Robot Locomotion

  • paper_url: http://arxiv.org/abs/2308.12517
  • repo_url: None
  • paper_authors: Yunho Kim, Hyunsik Oh, Jeonghyun Lee, Jinhyeok Choi, Gwanghyeon Ji, Moonkyu Jung, Donghoon Youm, Jemin Hwangbo
  • for: The study develops a reinforcement learning-based controller training framework that achieves a natural motion style and high task performance in complex robotic systems.
  • methods: The framework trains controllers with both rewards and constraints; two constraint types and an efficient policy optimization algorithm let engineers appropriately reflect their intent in constraints and handle them with minimal computational overhead.
  • results: In extensive simulation and real-world experiments, performant locomotion controllers for several legged robots are trained with significantly less reward engineering, by tuning only a single reward coefficient; the interpretability and generalizability of constraints also enable a more straightforward and intuitive engineering process.
    Abstract Several earlier studies have shown impressive control performance in complex robotic systems by designing the controller using a neural network and training it with model-free reinforcement learning. However, these outstanding controllers with natural motion style and high task performance are developed through extensive reward engineering, which is a highly laborious and time-consuming process of designing numerous reward terms and determining suitable reward coefficients. In this work, we propose a novel reinforcement learning framework for training neural network controllers for complex robotic systems consisting of both rewards and constraints. To let the engineers appropriately reflect their intent to constraints and handle them with minimal computation overhead, two constraint types and an efficient policy optimization algorithm are suggested. The learning framework is applied to train locomotion controllers for several legged robots with different morphology and physical attributes to traverse challenging terrains. Extensive simulation and real-world experiments demonstrate that performant controllers can be trained with significantly less reward engineering, by tuning only a single reward coefficient. Furthermore, a more straightforward and intuitive engineering process can be utilized, thanks to the interpretability and generalizability of constraints. The summary video is available at https://youtu.be/KAlm3yskhvM.

I3DOD: Towards Incremental 3D Object Detection via Prompting

  • paper_url: http://arxiv.org/abs/2308.12512
  • repo_url: None
  • paper_authors: Wenqi Liang, Gan Sun, Chenxi Liu, Jiahua Dong, Kangru Wang
  • for: This work proposes a prompting-guided incremental 3D object detection framework (I3DOD) to address the catastrophic forgetting of old classes in existing class-incremental 3D object detection methods.
  • methods: The method introduces a task-shared prompt mechanism that learns the matching relationship between object localization information and category semantic information, together with a reliable distillation strategy comprising a reliable dynamic distillation scheme and a relation feature that captures response relations in feature space.
  • results: Comprehensive experiments on two benchmark datasets show that the method outperforms state-of-the-art object detection methods by 0.6% - 2.7% in terms of mAP@0.25.
    Abstract 3D object detection has achieved significant performance in many fields, e.g., robotics system, autonomous driving, and augmented reality. However, most existing methods could cause catastrophic forgetting of old classes when performing on the class-incremental scenarios. Meanwhile, the current class-incremental 3D object detection methods neglect the relationships between the object localization information and category semantic information and assume all the knowledge of old model is reliable. To address the above challenge, we present a novel Incremental 3D Object Detection framework with the guidance of prompting, i.e., I3DOD. Specifically, we propose a task-shared prompts mechanism to learn the matching relationships between the object localization information and category semantic information. After training on the current task, these prompts will be stored in our prompt pool, and perform the relationship of old classes in the next task. Moreover, we design a reliable distillation strategy to transfer knowledge from two aspects: a reliable dynamic distillation is developed to filter out the negative knowledge and transfer the reliable 3D knowledge to new detection model; the relation feature is proposed to capture the responses relation in feature space and protect plasticity of the model when learning novel 3D classes. To the end, we conduct comprehensive experiments on two benchmark datasets and our method outperforms the state-of-the-art object detection methods by 0.6% - 2.7% in terms of mAP@0.25.
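As a hedged illustration of the "reliable dynamic distillation" idea, the sketch below filters the distillation term by the old detector's confidence so that unreliable old knowledge is not transferred; the threshold and the KL form are assumptions, not I3DOD's exact formulation:

```python
import numpy as np

def reliable_distillation_loss(old_logits, new_logits, conf_thresh=0.7):
    """Confidence-filtered distillation sketch: only predictions where the
    old detector is confident contribute to the distillation term, so
    negative (unreliable) old knowledge is not transferred."""
    old_prob = np.exp(old_logits) / np.exp(old_logits).sum(axis=1, keepdims=True)
    new_prob = np.exp(new_logits) / np.exp(new_logits).sum(axis=1, keepdims=True)
    keep = old_prob.max(axis=1) > conf_thresh            # reliability mask
    if not keep.any():
        return 0.0
    kl = (old_prob[keep] * (np.log(old_prob[keep] + 1e-12)
                            - np.log(new_prob[keep] + 1e-12))).sum(axis=1)
    return float(kl.mean())

old = np.random.randn(16, 5)     # logits from the frozen old detector
new = np.random.randn(16, 5)     # logits from the new detector being trained
print(reliable_distillation_loss(old, new))
```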

Masked Autoencoders are Efficient Class Incremental Learners

  • paper_url: http://arxiv.org/abs/2308.12510
  • repo_url: https://github.com/scok30/mae-cil
  • paper_authors: Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng
  • for: The paper targets class-incremental learning: sequentially learning new classes while avoiding the erasure of previously acquired knowledge.
  • methods: The authors propose Masked Autoencoders (MAEs) as efficient learners; MAEs obtain useful representations through reconstructive unsupervised learning and integrate easily with a supervised classification loss. A bilateral MAE framework further learns from image-level and embedding-level fusion.
  • results: The method outperforms state-of-the-art approaches on CIFAR-100, ImageNet-Subset, and ImageNet-Full, confirming its effectiveness.
    Abstract Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL .
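A rough sketch of how MAE-style exemplar storage could work: only a random subset of patches (plus their positions) is stored per past-task image, and an MAE decoder would later reconstruct the full exemplar from these visible patches. The patch size and keep ratio below are illustrative assumptions:

```python
import numpy as np

def store_patch_exemplar(image, patch=16, keep_ratio=0.25, rng=None):
    """Keep only a random subset of patches (and their indices) as an exemplar,
    so past-task exemplars cost far less memory than full images."""
    if rng is None:
        rng = np.random.default_rng(0)
    c, h, w = image.shape
    patches = image.reshape(c, h // patch, patch, w // patch, patch)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, c * patch * patch)
    n_keep = max(1, int(keep_ratio * len(patches)))
    idx = rng.choice(len(patches), size=n_keep, replace=False)
    return idx, patches[idx]          # positions + visible patch contents

idx, kept = store_patch_exemplar(np.random.rand(3, 64, 64))
print(len(kept), "of", (64 // 16) ** 2, "patches stored")
```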

CGMI: Configurable General Multi-Agent Interaction Framework

  • paper_url: http://arxiv.org/abs/2308.12503
  • repo_url: None
  • paper_authors: Shi Jinxin, Zhao Jiabao, Wang Yilei, Wu Xingjiao, Li Jiawen, He Liang
  • for: The paper presents a multi-agent system based on large language models for simulating human interactions and solving domain-specific tasks.
  • methods: The system uses a tree-structured methodology for assigning, detecting, and maintaining agent personality, together with an ACT*-based cognitive architecture comprising memory, reflection, and planning modules.
  • results: Simulating teacher-student classroom interactions in a virtual environment, the system reproduces many aspects of real classrooms, such as teaching methodology, curriculum, and student performance.
    Abstract Benefiting from the powerful capabilities of large language models (LLMs), agents based on LLMs have shown the potential to address domain-specific tasks and emulate human behaviors. However, the content generated by these agents remains somewhat superficial, owing to their limited domain expertise and the absence of an effective cognitive architecture. To address this, we present the Configurable General Multi-Agent Interaction (CGMI) framework, designed to replicate human interactions in real-world scenarios. Specifically, we propose a tree-structured methodology for the assignment, detection, and maintenance of agent personality. Additionally, we designed a cognitive architecture equipped with a skill library based on the ACT* model, which contains memory, reflection, and planning modules. We have also integrated general agents to augment the virtual environment's realism. Using the CGMI framework, we simulated numerous classroom interactions between teacher and students. The experiments indicate that aspects such as the teaching methodology, curriculum, and student performance closely mirror real classroom settings. We will open source our work.

Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis

  • paper_url: http://arxiv.org/abs/2308.12495
  • repo_url: https://github.com/yqfang9199/scda
  • paper_authors: Yuqi Fang, Jinjian Wu, Qianqian Wang, Shijun Qiu, Andrea Bozoki, Huaicheng Yan, Mingxia Liu
  • for: This study provides a source-free domain adaptation method that alleviates cross-site heterogeneity in functional MRI (fMRI) data, improving the accuracy and reproducibility of brain functional analysis.
  • methods: The method is built on a multi-perspective feature enrichment (MFE) module with multiple collaborative branches, each containing a data-feeding module, a spatiotemporal feature encoder, and a class predictor; a mutual-consistency constraint encourages robust feature representations across branches.
  • results: Experiments on three public datasets and one private dataset demonstrate effectiveness in cross-scanner and cross-study prediction tasks; a model pretrained on large-scale rs-fMRI data is also publicly released.
    Abstract Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis. Existing studies usually suffer from significant cross-site/domain data heterogeneity caused by site effects such as differences in scanners/protocols. Many methods have been proposed to reduce fMRI heterogeneity between source and target domains, heavily relying on the availability of source data. But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies. To this end, we design a source-free collaborative domain adaptation (SCDA) framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible. Specifically, a multi-perspective feature enrichment method (MFE) is developed for target fMRI analysis, consisting of multiple collaborative branches to dynamically capture fMRI features of unlabeled target data from multiple views. Each branch has a data-feeding module, a spatiotemporal feature encoder, and a class predictor. A mutual-consistency constraint is designed to encourage pair-wise consistency of latent features of the same input generated from these branches for robust representation learning. To facilitate efficient cross-domain knowledge transfer without source data, we initialize MFE using parameters of a pretrained source model. We also introduce an unsupervised pretraining strategy using 3,806 unlabeled fMRIs from three large-scale auxiliary databases, aiming to obtain a general feature encoder. Experimental results on three public datasets and one private dataset demonstrate the efficacy of our method in cross-scanner and cross-study prediction tasks. The model pretrained on large-scale rs-fMRI data has been released to the public.
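A small sketch of the mutual-consistency constraint: latent features produced by different collaborative branches for the same input are encouraged to agree, here via a mean pairwise squared distance (the exact consistency term in the paper may differ):

```python
import numpy as np

def mutual_consistency_loss(branch_embeddings):
    """Pairwise consistency across collaborative branches: the mean squared
    distance between latent features produced by every pair of branches
    for the same input batch."""
    loss, n_pairs = 0.0, 0
    for i in range(len(branch_embeddings)):
        for j in range(i + 1, len(branch_embeddings)):
            loss += np.mean((branch_embeddings[i] - branch_embeddings[j]) ** 2)
            n_pairs += 1
    return loss / max(n_pairs, 1)

z = [np.random.randn(8, 128) for _ in range(3)]   # three branches, batch of 8
print(mutual_consistency_loss(z))
```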

GPTEval: A Survey on Assessments of ChatGPT and GPT-4

  • paper_url: http://arxiv.org/abs/2308.12488
  • repo_url: None
  • paper_authors: Rui Mao, Guanyi Chen, Xulang Zhang, Frank Guerin, Erik Cambria
  • for: This survey aims to comprehensively review and analyze the collective assessment findings of ChatGPT and GPT-4 in various tasks and disciplines, focusing on their language and reasoning abilities, scientific knowledge, and ethical considerations.
  • methods: The survey examines prior evaluations of ChatGPT and GPT-4, including their language and reasoning abilities, scientific knowledge, and ethical considerations.
  • results: The survey provides a comprehensive assessment of the collective findings of prior evaluations, offering several recommendations for future research in evaluating large language models.
    Abstract The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance in different domains. There have been many studies evaluating the ability of ChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive review summarizing the collective assessment findings is lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on its language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, an examination of the existing evaluation methods is conducted, offering several recommendations for future research in evaluating large language models.

A Model of Sequential Learning based on Non-Axiomatic Logic

  • paper_url: http://arxiv.org/abs/2308.12486
  • repo_url: None
  • paper_authors: Bowen Xu
  • for: This technical report studies the sequential learning capability of an intelligent agent.
  • methods: The learning procedure is interpreted through Non-Axiomatic Logic and consists of three steps: hypothesizing, revising, and recycling; it can operate under the Assumption of Insufficient Knowledge and Resources.
  • results: Although the current design has limitations, the model has proven effective in some simple cases.
    Abstract Sequential learning is a fundamental function of an intelligent agent. This technical report introduces a model of sequential learning, which is interpretable through Non-Axiomatic Logic. The learning procedure includes three steps, hypothesizing, revising, and recycling, and can work under the Assumption of Insufficient Knowledge and Resources. Although there are limitations for the current design, the model has been proven effective in some simple cases.

Attention-Based Acoustic Feature Fusion Network for Depression Detection

  • paper_url: http://arxiv.org/abs/2308.12478
  • repo_url: https://github.com/xuxiaoooo/abafnet
  • paper_authors: Xiao Xu, Yang Wang, Xinru Wei, Fei Wang, Xizhe Zhang
  • for: The paper proposes a new acoustic-data-based detection method for the early identification of depression.
  • methods: The method builds on advanced machine learning and fuses four different acoustic features to improve depression detection accuracy.
  • results: Extensive validation on two clinical speech databases shows improved performance over previous methods in both depression detection and subtype classification.
    Abstract Depression, a common mental disorder, significantly influences individuals and imposes considerable societal impacts. The complexity and heterogeneity of the disorder necessitate prompt and effective detection, which, nonetheless, poses a difficult challenge. This situation highlights an urgent requirement for improved detection methods. Exploiting auditory data through advanced machine learning paradigms presents promising research directions. Yet, existing techniques mainly rely on single-dimensional feature models, potentially neglecting the abundance of information hidden in various speech characteristics. To rectify this, we present the novel Attention-Based Acoustic Feature Fusion Network (ABAFnet) for depression detection. ABAFnet combines four different acoustic features into a comprehensive deep learning model, thereby effectively integrating and blending multi-tiered features. We present a novel weight adjustment module for late fusion that boosts performance by efficaciously synthesizing these features. The effectiveness of our approach is confirmed via extensive validation on two clinical speech databases, CNRAC and CS-NRAC, thereby outperforming previous methods in depression detection and subtype classification. Further in-depth analysis confirms the key role of each feature and highlights the importance of MFCC-related features in speech-based depression detection.
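A minimal sketch of weight-adjusted late fusion over four acoustic feature streams; the softmax-weighted blending below is an assumption about the general shape of such a module, not ABAFnet's exact design:

```python
import numpy as np

def weighted_late_fusion(feature_logits, weight_logits):
    """Late fusion with a learned weight-adjustment step (sketch): each of the
    four acoustic feature streams yields class logits, and a softmax over
    learnable weights blends them into one prediction."""
    w = np.exp(weight_logits) / np.exp(weight_logits).sum()
    return sum(wi * li for wi, li in zip(w, feature_logits))

streams = [np.random.randn(4, 2) for _ in range(4)]   # 4 feature types, batch 4, 2 classes
fused = weighted_late_fusion(streams, np.zeros(4))    # equal weights before training
print(fused.shape)
```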

Are ChatGPT and GPT-4 Good Poker Players? – A Pre-Flop Analysis

  • paper_url: http://arxiv.org/abs/2308.12466
  • repo_url: None
  • paper_authors: Akshat Gupta
  • for: The paper evaluates how ChatGPT and GPT-4 perform at the game of poker.
  • methods: The author runs a series of pre-flop experiments with ChatGPT and GPT-4 to assess their poker play.
  • results: While both ChatGPT and GPT-4 display an advanced understanding of poker, neither is a game theory optimal (GTO) poker player; GPT-4 plays more aggressively than ChatGPT, but neither strategy is GTO.
    Abstract Since the introduction of ChatGPT and GPT-4, these models have been tested across a large number of tasks. Their adeptness across domains is evident, but their aptitude in playing games and specifically their aptitude in the realm of poker has remained unexplored. Poker is a game that requires decision making under uncertainty and incomplete information. In this paper, we put ChatGPT and GPT-4 through the poker test and evaluate their poker skills. Our findings reveal that while both models display an advanced understanding of poker, encompassing concepts like the valuation of starting hands, playing positions and other intricacies of game theory optimal (GTO) poker, both ChatGPT and GPT-4 are NOT game theory optimal poker players. Through a series of experiments, we first discover the characteristics of optimal prompts and model parameters for playing poker with these models. Our observations then unveil the distinct playing personas of the two models. We first conclude that GPT-4 is a more advanced poker player than ChatGPT. This exploration then sheds light on the divergent poker tactics of the two models: ChatGPT's conservativeness juxtaposed against GPT-4's aggression. In poker vernacular, when tasked to play GTO poker, ChatGPT plays like a Nit, which means that it has a propensity to only engage with premium hands and folds a majority of hands. When subjected to the same directive, GPT-4 plays like a maniac, showcasing a loose and aggressive style of play. Both strategies, although relatively advanced, are not game theory optimal.

PFL-GAN: When Client Heterogeneity Meets Generative Models in Personalized Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12454
  • repo_url: None
  • paper_authors: Achintha Wijesinghe, Songyang Zhang, Zhi Ding
  • for: To improve personalized federated learning (PFL) in scenarios where client data are heterogeneous.
  • methods: A generative adversarial network (GAN)-based PFL model that learns the similarity among clients and then performs weighted collaborative data aggregation.
  • results: Rigorous experiments on several well-known datasets demonstrate the effectiveness of PFL-GAN.
    Abstract Recent advances of generative learning models are accompanied by the growing interest in federated learning (FL) based on generative adversarial network (GAN) models. In the context of FL, GAN can capture the underlying client data structure, and regenerate samples resembling the original data distribution without compromising the private raw data. Although most existing GAN-based FL works focus on training a global model, Personalized FL (PFL) sometimes can be more effective in view of client data heterogeneity in terms of distinct data sample distributions, feature spaces, and labels. To cope with client heterogeneity in GAN-based FL, we propose a novel GAN sharing and aggregation strategy for PFL. The proposed PFL-GAN addresses the client heterogeneity in different scenarios. More specifically, we first learn the similarity among clients and then develop a weighted collaborative data aggregation. The empirical results through the rigorous experimentation on several well-known datasets demonstrate the effectiveness of PFL-GAN.
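A toy sketch of similarity-weighted collaborative aggregation: per-client summary statistics (an assumed proxy such as label histograms or feature means) give a similarity matrix, and each client's personalized parameters are a similarity-weighted average over all clients:

```python
import numpy as np

def client_similarity(stats):
    """Cosine similarity between per-client summary statistics."""
    s = stats / (np.linalg.norm(stats, axis=1, keepdims=True) + 1e-12)
    return s @ s.T

def weighted_aggregate(params, sim_row):
    """Personalized aggregation for one client: other clients' parameters
    are averaged with weights proportional to their similarity."""
    w = sim_row / sim_row.sum()
    return np.tensordot(w, params, axes=1)

stats = np.random.rand(5, 10)                 # 5 clients, 10-dim summaries
params = np.random.randn(5, 100)              # flattened generator weights per client
S = client_similarity(stats)
personalized_0 = weighted_aggregate(params, S[0])
print(personalized_0.shape)
```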

Augmenting medical image classifiers with synthetic data from latent diffusion models

  • paper_url: http://arxiv.org/abs/2308.12453
  • repo_url: None
  • paper_authors: Luke W. Sagers, James A. Diao, Luke Melas-Kyriazi, Matthew Groh, Pranav Rajpurkar, Adewole S. Adamson, Veronica Rotemberg, Roxana Daneshjou, Arjun K. Manrai
  • for: The paper examines the use of artificial intelligence (AI) algorithms in medicine, using skin disease as a case study.
  • methods: Latent diffusion models are used to generate synthetic images of skin disease, and the authors evaluate whether these data improve the performance of medical AI classifiers.
  • results: Augmenting model training with synthetic images improves performance in data-limited settings, but the gains saturate at a synthetic-to-real image ratio of about 10:1 and are substantially smaller than the gains obtained from adding real images.
    Abstract While hundreds of artificial intelligence (AI) algorithms are now approved or cleared by the US Food and Drugs Administration (FDA), many studies have shown inconsistent generalization or latent bias, particularly for underrepresented populations. Some have proposed that generative AI could reduce the need for real data, but its utility in model development remains unclear. Skin disease serves as a useful case study in synthetic image generation due to the diversity of disease appearance, particularly across the protected attribute of skin tone. Here we show that latent diffusion models can scalably generate images of skin disease and that augmenting model training with these data improves performance in data-limited settings. These performance gains saturate at synthetic-to-real image ratios above 10:1 and are substantially smaller than the gains obtained from adding real images. As part of our analysis, we generate and analyze a new dataset of 458,920 synthetic images produced using several generation strategies. Our results suggest that synthetic data could serve as a force-multiplier for model development, but the collection of diverse real-world data remains the most important step to improve medical AI algorithms.

An Intentional Forgetting-Driven Self-Healing Method For Deep Reinforcement Learning Systems

  • paper_url: http://arxiv.org/abs/2308.12445
  • repo_url: https://github.com/ahmedhajyahmed/drdrl
  • paper_authors: Ahmed Haj Yahmed, Rached Bouchoucha, Houssem Ben Braiek, Foutse Khomh
  • for: This work proposes an effective self-healing approach for Deep Reinforcement Learning (DRL) systems facing environmental drifts in large-scale production settings.
  • methods: The approach augments vanilla Continual Learning (CL) with an intentional forgetting mechanism to overcome CL's main issues, such as catastrophic forgetting, warm-starting failure, and slow convergence.
  • results: Compared with vanilla CL, Dr. DRL reduces the average healing time and the number of fine-tuning episodes, adapts to 19.63% of drifted environments left unsolved by vanilla CL, and maintains or even enhances the obtained rewards by up to 45% on environments solved by both approaches.
    Abstract Deep reinforcement learning (DRL) is increasingly applied in large-scale productions like Netflix and Facebook. As with most data-driven systems, DRL systems can exhibit undesirable behaviors due to environmental drifts, which often occur in constantly-changing production settings. Continual Learning (CL) is the inherent self-healing approach for adapting the DRL agent in response to the environment's conditions shifts. However, successive shifts of considerable magnitude may cause the production environment to drift from its original state. Recent studies have shown that these environmental drifts tend to drive CL into long, or even unsuccessful, healing cycles, which arise from inefficiencies such as catastrophic forgetting, warm-starting failure, and slow convergence. In this paper, we propose Dr. DRL, an effective self-healing approach for DRL systems that integrates a novel mechanism of intentional forgetting into vanilla CL to overcome its main issues. Dr. DRL deliberately erases the DRL system's minor behaviors to systematically prioritize the adaptation of the key problem-solving skills. Using well-established DRL algorithms, Dr. DRL is compared with vanilla CL on various drifted environments. Dr. DRL is able to reduce, on average, the healing time and fine-tuning episodes by, respectively, 18.74% and 17.72%. Dr. DRL successfully helps agents to adapt to 19.63% of drifted environments left unsolved by vanilla CL while maintaining and even enhancing by up to 45% the obtained rewards for drifted environments that are resolved by both approaches.
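One plausible reading of intentional forgetting, shown only as a toy illustration: the smallest-magnitude weights (assumed to encode minor behaviors) are reset before continual fine-tuning so adaptation capacity is freed for the key problem-solving skills. Dr. DRL's actual mechanism for identifying and erasing minor behaviors may differ:

```python
import numpy as np

def intentional_forgetting(weights, forget_ratio=0.2, rng=None):
    """Reset the smallest-magnitude weights to small random values before
    continual fine-tuning (illustrative assumption about the mechanism)."""
    if rng is None:
        rng = np.random.default_rng(0)
    flat = weights.ravel().copy()
    k = int(forget_ratio * flat.size)
    idx = np.argsort(np.abs(flat))[:k]        # indices of "minor" weights
    flat[idx] = rng.normal(scale=0.01, size=k)
    return flat.reshape(weights.shape)

w = np.random.randn(64, 64)
w_forgotten = intentional_forgetting(w)
print(np.abs(w_forgotten).min())
```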

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

  • paper_url: http://arxiv.org/abs/2308.12439
  • repo_url: None
  • paper_authors: Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
  • for: This work aims to defend Deep Neural Networks (DNNs) against backdoor attacks, in which adversaries covertly implant malicious behaviors (backdoors) into DNNs.
  • methods: The defense is a post-development defense, meaning it operates independently of how the model was generated. It builds on a novel reverse-engineering approach that directly extracts the backdoor functionality of a backdoored model into a so-called backdoor expert model.
  • results: The defense filters out backdoor inputs with high accuracy while minimally impacting clean utility; it is validated on multiple datasets (CIFAR10, GTSRB, and ImageNet) and across model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer).
    Abstract We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).
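A hedged sketch of two ingredients described in the abstract: intentionally mislabeling a small clean set (so fine-tuning unlearns normal functionality while the backdoor shortcut survives), and a simple agreement-based rule for flagging backdoor inputs; the real BaDExpert score and its ensemble with a fine-tuned auxiliary model are more elaborate:

```python
import numpy as np

def mislabel(labels, num_classes):
    """Intentionally mislabel clean samples by shifting every label; fine-tuning
    the backdoored model on these samples unlearns normal behavior."""
    return (labels + 1) % num_classes

def backdoor_input_score(orig_pred, expert_pred):
    """Flag inputs on which the backdoor-expert model agrees with the original
    model, since the expert is assumed to retain only backdoor behavior
    (a simplified stand-in for the paper's detection score)."""
    return (orig_pred == expert_pred).astype(float)

y = np.array([0, 1, 2, 3])
print(mislabel(y, num_classes=10))                                # [1 2 3 4]
print(backdoor_input_score(np.array([5, 2]), np.array([5, 7])))   # [1. 0.]
```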

Deploying Deep Reinforcement Learning Systems: A Taxonomy of Challenges

  • paper_url: http://arxiv.org/abs/2308.12438
  • repo_url: https://github.com/drldeploymentchallenges-icsme2023/replicationpackage
  • paper_authors: Ahmed Haj Yahmed, Altaf Allah Abbassi, Amin Nikanjam, Heng Li, Foutse Khomh
  • for: This paper aims to understand the challenges that practitioners face when deploying deep reinforcement learning (DRL) systems, and to identify the most common and difficult challenges in deploying DRL to different platforms.
  • methods: The paper uses an empirical study on Stack Overflow (SO), the most popular Q&A forum for developers, to uncover and understand the challenges practitioners faced when deploying DRL systems. The study categorizes relevant SO posts by deployment platforms and investigates the current state and challenges related to deploying DRL systems.
  • results: The study finds that the general interest in DRL deployment is growing, confirming the study's relevance and importance. It also finds that DRL deployment is more difficult than other DRL issues, that RL environment-related challenges are the most popular, and that communication-related challenges are the most difficult among practitioners. The study identifies a taxonomy of 31 unique challenges in deploying DRL to different platforms.
    Abstract Deep reinforcement learning (DRL), leveraging Deep Learning (DL) in reinforcement learning, has shown significant potential in achieving human-level autonomy in a wide range of domains, including robotics, computer vision, and computer games. This potential justifies the enthusiasm and growing interest in DRL in both academia and industry. However, the community currently focuses mostly on the development phase of DRL systems, with little attention devoted to DRL deployment. In this paper, we propose an empirical study on Stack Overflow (SO), the most popular Q&A forum for developers, to uncover and understand the challenges practitioners faced when deploying DRL systems. Specifically, we categorized relevant SO posts by deployment platforms: server/cloud, mobile/embedded system, browser, and game engine. After filtering and manual analysis, we examined 357 SO posts about DRL deployment, investigated the current state, and identified the challenges related to deploying DRL systems. Then, we investigate the prevalence and difficulty of these challenges. Results show that the general interest in DRL deployment is growing, confirming the study's relevance and importance. Results also show that DRL deployment is more difficult than other DRL issues. Additionally, we built a taxonomy of 31 unique challenges in deploying DRL to different platforms. On all platforms, RL environment-related challenges are the most popular, and communication-related challenges are the most difficult among practitioners. We hope our study inspires future research and helps the community overcome the most common and difficult challenges practitioners face when deploying DRL systems.

Reframing the Brain Age Prediction Problem to a More Interpretable and Quantitative Approach

  • paper_url: http://arxiv.org/abs/2308.12416
  • repo_url: None
  • paper_authors: Neha Gianchandani, Mahsa Dibaji, Mariana Bento, Ethan MacDonald, Roberto Souza
  • for: This paper uses deep learning models to predict brain age from magnetic resonance images while providing a more interpretable account of the predictions.
  • methods: An image-to-image regression model estimates the brain age of every brain voxel; voxel-wise age prediction models are compared against global age prediction models and their corresponding saliency maps.
  • results: Voxel-wise prediction models prove more interpretable, since they provide spatial information about the brain aging process, and they benefit from being quantitative.
    Abstract Deep learning models have achieved state-of-the-art results in estimating brain age, which is an important brain health biomarker, from magnetic resonance (MR) images. However, most of these models only provide a global age prediction, and rely on techniques, such as saliency maps to interpret their results. These saliency maps highlight regions in the input image that were significant for the model's predictions, but they are hard to be interpreted, and saliency map values are not directly comparable across different samples. In this work, we reframe the age prediction problem from MR images to an image-to-image regression problem where we estimate the brain age for each brain voxel in MR images. We compare voxel-wise age prediction models against global age prediction models and their corresponding saliency maps. The results indicate that voxel-wise age prediction models are more interpretable, since they provide spatial information about the brain aging process, and they benefit from being quantitative.

Benchmarking Causal Study to Interpret Large Language Models for Source Code

  • paper_url: http://arxiv.org/abs/2308.12415
  • repo_url: None
  • paper_authors: Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, Denys Poshyvanyk
  • for: This work provides a causal-inference-based evaluation strategy for code-generating LLMs, helping researchers better interpret model performance.
  • methods: The Galeras benchmarking strategy includes curated testbeds for three SE tasks (code completion, code summarization, and commit generation) to aid the interpretation of LLM performance.
  • results: A case study on ChatGPT shows a positive causal effect of prompt semantics on generative performance (an average treatment effect of roughly 3%) and a strong correlation between confounders such as prompt size and the accuracy metrics (about 0.412). These results indicate that causal-inference-based evaluation reduces confounding bias and yields more reliable assessments of LLM performance.
    Abstract One of the most common solutions adopted by software researchers to address code generation is by training Large Language Models (LLMs) on massive amounts of source code. Although a number of studies have shown that LLMs have been effectively evaluated on popular accuracy metrics (e.g., BLEU, CodeBleu), previous research has largely overlooked the role of Causal Inference as a fundamental component of the interpretability of LLMs' performance. Existing benchmarks and datasets are meant to highlight the difference between the expected and the generated outcome, but do not take into account confounding variables (e.g., lines of code, prompt size) that equally influence the accuracy metrics. The fact remains that, when dealing with generative software tasks by LLMs, no benchmark is available to tell researchers how to quantify neither the causal effect of SE-based treatments nor the correlation of confounders to the model's performance. In an effort to bring statistical rigor to the evaluation of LLMs, this paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks (i.e., code completion, code summarization, and commit generation) to help aid the interpretation of LLMs' performance. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods. The results of the case study demonstrate the positive causal influence of prompt semantics on ChatGPT's generative performance by an average treatment effect of $\approx 3\%$. Moreover, it was found that confounders such as prompt size are highly correlated with accuracy metrics ($\approx 0.412\%$). The end result of our case study is to showcase causal inference evaluations, in practice, to reduce confounding bias. By reducing the bias, we offer an interpretable solution for the accuracy metric under analysis.
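As a simplified stand-in for the causal analysis, the sketch below estimates an average treatment effect by linear regression adjustment for a confounder such as prompt size; this is a textbook estimator, not Galeras' exact pipeline, and the simulated numbers are illustrative only:

```python
import numpy as np

def ate_regression_adjustment(y, treatment, confounders):
    """Average treatment effect via linear regression adjustment:
    y ~ b0 + b1 * treatment + B * confounders; b1 estimates the causal
    effect of the prompt treatment once confounders are controlled for."""
    X = np.column_stack([np.ones_like(y), treatment, confounders])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
prompt_size = rng.normal(100, 20, 500)                  # confounder
treated = (rng.random(500) < 0.5).astype(float)         # semantic prompt yes/no
accuracy = 0.5 + 0.03 * treated + 0.001 * prompt_size + rng.normal(0, 0.02, 500)
print(ate_regression_adjustment(accuracy, treated, prompt_size))  # close to 0.03
```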

A Theory of Intelligences: Concepts, Models, Implications

  • paper_url: http://arxiv.org/abs/2308.12411
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Michael E. Hochberg
  • for: The paper is written to propose a theory of intelligence based on first principles, with the goal of understanding intelligence in humans and machines, and to explain endeavors that do not necessarily affect Darwinian fitness.
  • methods: The paper uses a variety of methods, including discussing key features of intelligence, presenting a framework for a first principles Theory of Intelligence, and proposing a compact mathematical form of surprisal and difficulty.
  • results: The paper presents several conceptual advances, including the prediction that paths to a goal not only function to accurately achieve goals, but also lead to higher probabilities for future attainable goals and increased breadth to enter new goal spaces.
    Abstract Intelligence is a human construct to represent the ability to achieve goals. Given this wide berth, intelligence has been defined countless times, studied in a variety of ways and quantified using numerous measures. Understanding intelligence ultimately requires theory and quantification, both of which are elusive. My main objectives are to identify some of the central elements in and surrounding intelligence, discuss some of its challenges and propose a theory based on first principles. I focus on intelligence as defined by and for humans, frequently in comparison to machines, with the intention of setting the stage for more general characterizations in life, collectives, human designs such as AI and in non-designed physical and chemical systems. I discuss key features of intelligence, including path efficiency and goal accuracy, intelligence as a Black Box, environmental influences, flexibility to deal with surprisal, the regress of intelligence, the relativistic nature of intelligence and difficulty, and temporal changes in intelligence including its evolution. I present a framework for a first principles Theory of IntelligenceS (TIS), based on the quantifiable macro-scale system features of difficulty, surprisal and goal resolution accuracy. The proposed partitioning of uncertainty/solving and accuracy/understanding is particularly novel since it predicts that paths to a goal not only function to accurately achieve goals, but as experimentations leading to higher probabilities for future attainable goals and increased breadth to enter new goal spaces. TIS can therefore explain endeavors that do not necessarily affect Darwinian fitness, such as leisure, politics, games and art. I conclude with several conceptual advances of TIS including a compact mathematical form of surprisal and difficulty, the theoretical basis of TIS, and open questions.

Self-Supervised Learning for Endoscopic Video Analysis

  • paper_url: http://arxiv.org/abs/2308.12394
  • repo_url: https://github.com/royhirsch/endossl
  • paper_authors: Roy Hirsch, Mathilde Caron, Regev Cohen, Amir Livne, Ron Shapiro, Tomer Golany, Roman Goldenberg, Daniel Freedman, Ehud Rivlin
  • for: This paper explores self-supervised learning (SSL) in the medical domain, in particular for endoscopic video analysis such as colonoscopy and laparoscopy.
  • methods: Masked Siamese Networks (MSNs) are used as the SSL framework and trained on sizable unlabeled endoscopic video datasets.
  • results: The approach achieves state-of-the-art performance on endoscopic benchmarks, including surgical phase recognition during laparoscopy and colonoscopic polyp characterization, and reduces the annotated data size by 50% without sacrificing performance, showing that SSL can substantially cut the need for annotated data in this domain.
    Abstract Self-supervised learning (SSL) has led to important breakthroughs in computer vision by allowing learning from large amounts of unlabeled data. As such, it might have a pivotal role to play in biomedicine where annotating data requires a highly specialized expertise. Yet, there are many healthcare domains for which SSL has not been extensively explored. One such domain is endoscopy, minimally invasive procedures which are commonly used to detect and treat infections, chronic inflammatory diseases or cancer. In this work, we study the use of a leading SSL framework, namely Masked Siamese Networks (MSNs), for endoscopic video analysis such as colonoscopy and laparoscopy. To fully exploit the power of SSL, we create sizable unlabeled endoscopic video datasets for training MSNs. These strong image representations serve as a foundation for secondary training with limited annotated datasets, resulting in state-of-the-art performance in endoscopic benchmarks like surgical phase recognition during laparoscopy and colonoscopic polyp characterization. Additionally, we achieve a 50% reduction in annotated data size without sacrificing performance. Thus, our work provides evidence that SSL can dramatically reduce the need of annotated data in endoscopy.

With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

  • paper_url: http://arxiv.org/abs/2308.12383
  • repo_url: https://github.com/aimagelab/pma-net
  • paper_authors: Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
  • for: This work improves Transformer-based image captioning by exploiting information from other training samples to increase caption accuracy.
  • methods: A prototypical-memory-based attention mechanism lets the model attend over activations obtained while processing other training samples, sharing knowledge across the dataset.
  • results: Compared with carefully designed baselines and state-of-the-art approaches on the COCO dataset, the proposal improves an encoder-decoder Transformer by 3.7 CIDEr points.
    Abstract Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net.
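A compact sketch of attention over a prototypical memory: queries from the current sample attend to prototype keys and values distilled from activations of other training samples; how the prototypes are built and kept discriminative and compact is omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_memory_attention(queries, proto_keys, proto_values):
    """Scaled dot-product attention of current queries over prototype vectors
    summarizing past keys/values, so knowledge is shared across samples."""
    d = queries.shape[-1]
    scores = queries @ proto_keys.T / np.sqrt(d)
    return softmax(scores) @ proto_values

q = np.random.randn(10, 64)        # 10 tokens of the current sample
K = np.random.randn(32, 64)        # 32 prototype keys from past activations
V = np.random.randn(32, 64)
print(prototype_memory_attention(q, K, V).shape)   # (10, 64)
```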

Open-set Face Recognition with Neural Ensemble, Maximal Entropy Loss and Feature Augmentation

  • paper_url: http://arxiv.org/abs/2308.12371
  • repo_url: None
  • paper_authors: Rafael Henrique Vareto, Manuel Günther, William Robson Schwartz
  • for: This work improves the accuracy and robustness of face recognition systems in open-set scenarios, where faces of unregistered subjects must not be identified as previously enrolled identities.
  • methods: A novel method combines an ensemble of compact neural networks with a margin-based cost function; supplementary negative samples are obtained from external databases or synthesized at the representation level during training with a new mix-up feature augmentation approach.
  • results: Experiments on the LFW and IJB-C datasets show that the approach boosts both closed- and open-set identification rates.
    Abstract Open-set face recognition refers to a scenario in which biometric systems have incomplete knowledge of all existing subjects. Therefore, they are expected to prevent face samples of unregistered subjects from being identified as previously enrolled identities. This watchlist context adds an arduous requirement that calls for the dismissal of irrelevant faces by focusing mainly on subjects of interest. As a response, this work introduces a novel method that associates an ensemble of compact neural networks with a margin-based cost function that explores additional samples. Supplementary negative samples can be obtained from external databases or synthetically built at the representation level in training time with a new mix-up feature augmentation approach. Deep neural networks pre-trained on large face datasets serve as the preliminary feature extraction module. We carry out experiments on well-known LFW and IJB-C datasets where results show that the approach is able to boost closed and open-set identification rates.
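A small sketch of mix-up feature augmentation at the representation level: deep features of two different identities are convexly mixed to synthesize additional negative ("unknown") samples; the Beta-distribution parameter is an illustrative assumption:

```python
import numpy as np

def mixup_negative_features(feat_a, feat_b, alpha=0.4, rng=None):
    """Synthesize extra negative samples by convexly mixing deep features of
    two different identities; the mixed vectors are treated as unknowns."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha, size=(feat_a.shape[0], 1))
    return lam * feat_a + (1.0 - lam) * feat_b

fa, fb = np.random.randn(16, 512), np.random.randn(16, 512)
print(mixup_negative_features(fa, fb).shape)       # (16, 512)
```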

SafeAR: Towards Safer Algorithmic Recourse by Risk-Aware Policies

  • paper_url: http://arxiv.org/abs/2308.12367
  • repo_url: None
  • paper_authors: Haochen Wu, Shubham Sharma, Sunandita Patra, Sriram Gopalakrishnan
  • for: This paper focuses on providing recourse for individuals adversely affected by machine learning (ML) models in critical domains like finance and healthcare. The goal is to empower people to choose a recourse based on their risk tolerance, considering the risk of higher costs.
  • methods: The paper proposes a method called Safer Algorithmic Recourse (SafeAR) that computes recourse policies with risk considerations. It connects the algorithmic recourse literature with risk-sensitive reinforcement learning and adopts financial measures like Value at Risk and Conditional Value at Risk to summarize risk concisely.
  • results: The paper compares policies with different levels of risk-aversion using risk measures and recourse desiderata (sparsity and proximity) on two real-world datasets. The results show that SafeAR can provide more risk-sensitive recourse recommendations than existing methods, enabling individuals to make more informed decisions based on their risk tolerance.
    Abstract With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse for those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receive a favorable decision. Prior work on sequential algorithmic recourse -- which recommends a series of changes -- focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher than average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call the recourse computed with such risk considerations as Safer Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost and connect algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt measures ``Value at Risk'' and ``Conditional Value at Risk'' from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different levels of risk-aversion using risk measures and recourse desiderata (sparsity and proximity).
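The two financial risk measures adopted by SafeAR have standard definitions; the sketch below computes them from a sample of recourse-cost outcomes (the simulated cost distribution is only for illustration):

```python
import numpy as np

def value_at_risk(costs, alpha=0.9):
    """VaR_alpha: the cost level that is exceeded with probability 1 - alpha."""
    return np.quantile(costs, alpha)

def conditional_value_at_risk(costs, alpha=0.9):
    """CVaR_alpha: the expected cost in the worst (1 - alpha) tail."""
    var = value_at_risk(costs, alpha)
    return costs[costs >= var].mean()

# Simulated recourse-cost outcomes for one candidate policy
costs = np.random.default_rng(0).lognormal(mean=1.0, sigma=0.6, size=10_000)
print(value_at_risk(costs), conditional_value_at_risk(costs))
```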

CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

  • paper_url: http://arxiv.org/abs/2308.12288
  • repo_url: None
  • paper_authors: Sookwan Han, Hanbyul Joo
  • for: The goal of this work is to teach machines to understand and model the 3D spatial common sense underlying human-object interactions.
  • methods: The method leverages a generative model that produces high-quality 2D images of human-object interactions from multiple viewpoints, serving as an unbounded data generator.
  • results: The synthesized multi-view 2D images are shown to be sufficient for learning 3D human-object spatial relations; the work also contributes strategies such as leveraging a generative image model for 3D human-object spatial relation learning and reasoning about inconsistent 2D cues via 3D occupancy reasoning with pose canonicalization.
    Abstract We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the annotating task of 3D interactions difficult and hard to scale, which limits the potential to reason about that in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is by showing multiple 2D images captured from different viewpoints when humans interact with the same type of objects. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an "unbounded" data generator with effective controllability and view diversity. Despite its imperfection of the image quality over real images, we demonstrate that the synthesized images are sufficient to learn the 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

  • paper_url: http://arxiv.org/abs/2308.12284
  • repo_url: None
  • paper_authors: Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
  • for: The paper targets improving the pretraining efficiency and downstream performance of large language models (LLMs).
  • methods: Careful data selection is performed on top of de-duplicated data using pretrained model embeddings.
  • results: The selection speeds up pretraining (20% efficiency gains) and improves average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale; intelligently repeating data consistently outperforms baseline training, whereas repeating randomly selected data performs worse than the baseline.
    Abstract Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improves average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, calls into question the common practice of training for a single epoch on as much data as possible, and demonstrates a path to keep improving our models past the limits of randomly sampling web data.
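A rough sketch, under assumed thresholds, of embedding-based de-duplication followed by diversification: near-duplicate documents are dropped first, then documents are greedily selected to be far from what has already been chosen; the paper's actual selection criteria may differ:

```python
import numpy as np

def dedup_and_diversify(emb, n_select, dup_thresh=0.95):
    """Drop documents whose cosine similarity to an already kept document
    exceeds dup_thresh, then greedily add the document least similar to
    everything selected so far."""
    e = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    kept = []
    for i in range(len(e)):                                  # de-duplication pass
        if all(float(e[i] @ e[j]) < dup_thresh for j in kept):
            kept.append(i)
    selected = [kept[0]]
    while len(selected) < min(n_select, len(kept)):          # diversification pass
        candidates = [i for i in kept if i not in selected]
        max_sim = [max(float(e[i] @ e[j]) for j in selected) for i in candidates]
        selected.append(candidates[int(np.argmin(max_sim))])
    return selected

docs = np.random.default_rng(0).normal(size=(200, 32))       # stand-in embeddings
print(dedup_and_diversify(docs, n_select=10))
```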

Simple is Better and Large is Not Enough: Towards Ensembling of Foundational Language Models

  • paper_url: http://arxiv.org/abs/2308.12272
  • repo_url: None
  • paper_authors: Nancy Tyagi, Aidin Shiri, Surjodeep Sarkar, Abhishek Kumar Umrawal, Manas Gaur
  • for: This paper aims to explore the potential of smaller foundational language models (FLMs) and their ensembling on benchmark and real-world datasets, and to investigate the influence of ensemble on the individualistic attention of FLMs.
  • methods: The authors use three ensemble techniques: Shallow, Semi, and Deep, and introduce a knowledge-guided reinforcement learning approach in the Deep-Ensemble.
  • results: The suggested Deep-Ensemble BERT outperforms its larger variant, i.e., BERT-large, by a factor of many times on datasets that demonstrate the usefulness of NLP in sensitive fields such as mental health.
  • for: 本研究旨在探讨小型基础语言模型(FLM)的潜力及其集成在标准和实际数据集上的表现,并研究集成对 FLM 个体注意力的影响。
  • methods: 作者使用三种集成技术:浅层、半深和深层集成,并在深层集成中引入知识引导的强化学习方法。
  • results: 所提出的深层集成 BERT 在展示 NLP 于敏感领域(如心理健康)中实用性的数据集上,表现大幅超越其大型变体 BERT-large。
    Abstract Foundational Language Models (FLMs) have advanced natural language processing (NLP) research. Current researchers are developing larger FLMs (e.g., XLNet, T5) to enable contextualized language representation, classification, and generation. While developing larger FLMs has been of significant advantage, it is also a liability concerning hallucination and predictive uncertainty. Fundamentally, larger FLMs are built on the same foundations as smaller FLMs (e.g., BERT); hence, one must recognize the potential of smaller FLMs which can be realized through an ensemble. In the current research, we perform a reality check on FLMs and their ensemble on benchmark and real-world datasets. We hypothesize that the ensembling of FLMs can influence the individualistic attention of FLMs and unravel the strength of coordination and cooperation of different FLMs. We utilize BERT and define three other ensemble techniques: {Shallow, Semi, and Deep}, wherein the Deep-Ensemble introduces a knowledge-guided reinforcement learning approach. We discovered that the suggested Deep-Ensemble BERT outperforms its large variation i.e. BERTlarge, by a factor of many times using datasets that show the usefulness of NLP in sensitive fields, such as mental health.
    摘要 基础语言模型(FLM)在自然语言处理(NLP)研究中进步了大量。当前研究人员在开发更大的FLM(如XLNet、T5)以实现语言表示、分类和生成上的上下文化化能力。虽然开发更大的FLM有了显著的优势,但也存在幻觉和预测不确定性的问题。基本上,更大的FLM都是基于小型FLM(如BERT)的基础上建立的,因此需要认可小型FLM的潜在能力,并通过集成来实现。在当前的研究中,我们对FLM和其集成的现实检查,并对标准和实际数据集上进行评估。我们假设 ensemble FLM 可以影响它们的个人注意力,并探索不同 FLM 之间的协作和合作的强度。我们使用 BERT 定义三种集成技术:{浅、半、深},其中深度集成还使用了知识导向的强化学习方法。我们发现,我们提议的深度集成 BERT 在使用敏感领域中的 NLP 数据集上,比其大版本 BERTlarge 高得多。

Language Reward Modulation for Pretraining Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12270
  • repo_url: https://github.com/ademiadeniji/lamp
  • paper_authors: Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel
  • for: 本研究回顾了使用学习奖励函数(LRF)来解决稀缺奖励强化学习(RL)任务的进展,并提出疑问:今天的 LRF 是否适合直接替换任务奖励?相反,我们提议利用 LRF 作为 RL 预训练的信号。
  • methods: 我们提出了 $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining(LAMP)方法,利用冻结的预训练视觉语言模型(VLM),通过计算多样化的语言指令与智能体在预训练环境中的图像观察之间的对比对齐,生成带噪但有形状的探索奖励,并将其与标准的新颖性探索奖励一起用强化学习优化,从而获得语言条件的预训练策略。
  • results: 我们的 LAMP 方法可以在 RLBench 的机器人操作任务上实现样本高效的热启动学习。
    Abstract Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded some steady progress in task-complexity through the years. In this work, we question whether today's LRFs are best-suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP) which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards with reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.
    摘要 LAMP leverages a frozen, pretrained VLM to generate noisy, yet shaped exploration rewards by computing the contrastive alignment between a diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards with reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which departs from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.
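The shaping-reward idea can be sketched as follows: a frozen CLIP-style VLM scores the agreement between an image observation and a sampled language instruction, and the (noisy) similarity serves as the exploration reward. The checkpoint name and preprocessing are assumptions, not the authors' exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen VLM used purely as a reward model (assumed checkpoint; LAMP's actual VLM may differ).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def vlm_shaping_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine alignment between an observation and a language instruction, usable as a noisy exploration reward."""
    inputs = processor(text=[instruction], images=frame, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

# Usage: total reward = novelty bonus + language-modulated shaping term, e.g.
# r_t = r_novelty(s_t) + lambda * vlm_shaping_reward(obs_t, sampled_instruction)
```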

FECoM: A Step towards Fine-Grained Energy Measurement for Deep Learning

  • paper_url: http://arxiv.org/abs/2308.12264
  • repo_url: None
  • paper_authors: Saurabhsingh Rajput, Tim Widmayer, Ziyuan Shang, Maria Kechagia, Federica Sarro, Tushar Sharma
  • for: 本研究旨在提高深度学习(DL)模型的能源消耗量,并且提出了一种精细化能源消耗量测量的框架(FECoM),以便帮助研究人员和开发者更好地了解DL系统的能源消耗量。
  • methods: FECoM使用静态实rumentation技术,考虑了多种因素,包括计算负荷和温度稳定性,以减少测量精细化能源消耗量的挑战。
  • results: 通过使用FECoM,我们对TensorFlow框架的能源消耗量进行了精细化测量,并研究了参数大小和执行时间对能源消耗量的影响。这些结果可以帮助我们更好地理解TensorFlow API的能源资源。此外,我们还讨论了设计和实现精细化能源消耗量测量工具的一些考虑因素和挑战。
    Abstract With the increasing usage, scale, and complexity of Deep Learning (DL) models, their rapidly growing energy consumption has become a critical concern. Promoting green development and energy awareness at different granularities is the need of the hour to limit carbon emissions of DL systems. However, the lack of standard and repeatable tools to accurately measure and optimize energy consumption at a fine granularity (e.g., at method level) hinders progress in this area. In this paper, we introduce FECoM (Fine-grained Energy Consumption Meter), a framework for fine-grained DL energy consumption measurement. Specifically, FECoM provides researchers and developers a mechanism to profile DL APIs. FECoM addresses the challenges of measuring energy consumption at fine-grained level by using static instrumentation and considering various factors, including computational load and temperature stability. We assess FECoM's capability to measure fine-grained energy consumption for one of the most popular open-source DL frameworks, namely TensorFlow. Using FECoM, we also investigate the impact of parameter size and execution time on energy consumption, enriching our understanding of TensorFlow APIs' energy profiles. Furthermore, we elaborate on the considerations, issues, and challenges that one needs to consider while designing and implementing a fine-grained energy consumption measurement tool. We hope this work will facilitate further advances in DL energy measurement and the development of energy-aware practices for DL systems.
    摘要 In this paper, we introduce Fine-grained Energy Consumption Meter (FECoM), a framework for measuring DL energy consumption at a fine granularity. FECoM provides researchers and developers with a mechanism to profile DL APIs. FECoM addresses the challenges of measuring energy consumption at a fine-grained level by using static instrumentation and considering various factors, including computational load and temperature stability.We assess FECoM's capability to measure fine-grained energy consumption for one of the most popular open-source DL frameworks, TensorFlow. Using FECoM, we also investigate the impact of parameter size and execution time on energy consumption, enriching our understanding of TensorFlow APIs' energy profiles. Furthermore, we discuss the considerations, issues, and challenges that need to be considered when designing and implementing a fine-grained energy consumption measurement tool.We hope that this work will facilitate further advances in DL energy measurement and the development of energy-aware practices for DL systems.
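As an illustration of method-level instrumentation, the sketch below wraps a call, reads a CPU energy counter before and after, and waits briefly as a crude stand-in for stability checks. The RAPL sysfs path is a Linux/Intel assumption; FECoM's actual implementation (static instrumentation of TensorFlow APIs, GPU/RAM meters, temperature stability) is considerably more involved.

```python
import time, functools

RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # Linux Intel RAPL counter (assumed available)

def _read_energy_uj() -> int:
    with open(RAPL_FILE) as f:
        return int(f.read().strip())

def measure_energy(settle_seconds: float = 1.0):
    """Decorator that reports approximate CPU energy (Joules) consumed by one call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(settle_seconds)          # crude stand-in for FECoM's stability checks
            e0, t0 = _read_energy_uj(), time.perf_counter()
            result = fn(*args, **kwargs)
            e1, t1 = _read_energy_uj(), time.perf_counter()
            joules = (e1 - e0) / 1e6            # counter is in microjoules (may wrap on long runs)
            print(f"{fn.__name__}: {joules:.3f} J over {t1 - t0:.3f} s")
            return result
        return wrapper
    return decorator

@measure_energy()
def matmul_demo(n=2000):
    import numpy as np
    a = np.random.rand(n, n)
    return a @ a

# matmul_demo()  # prints the energy reading on machines that expose the RAPL counter
```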

Multi-Objective Optimization for Sparse Deep Neural Network Training

  • paper_url: http://arxiv.org/abs/2308.12243
  • repo_url: https://github.com/salomonhotegni/mdmtn
  • paper_authors: S. S. Hotegni, S. Peitz, M. Berkemeier
  • for: 这篇论文目的是为了训练深度学习网络(DNNs),并使其能够同时完成多个任务(Multi-Task Learning)。
  • methods: 这篇论文使用了一种 modificated Weighted Chebyshev scalarization 技术,将多个任务转换为一个单一的问题,并使用 Augmented Lagrangian 方法来解决。
  • results: 这篇论文的实验结果显示,这种方法可以在训练 DNNs 时,逐步简化网络结构,并且可以适应不同任务的适应率,无需对网络结构进行大量修改。
    Abstract Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. Code is available at https://github.com/salomonhotegni/MDMTN.
    摘要 不同的冲突优化标准在深度学习场景中自然出现。这些标准可以用于不同的主任务(例如在多任务学习设置中),也可以用于主任务和次任务之间的冲突,如损失最小化和稀疏化。通常的方法是将这些标准简单地权衡,但这只能在凸Setting中有效。在这篇论文中,我们提出了一种多目标优化算法,使用修改后的Weighted ChebyshevScalarization来训练深度神经网络(DNNs)对多个任务进行训练。通过使用这种Scalarization技术,算法可以找到原始问题的所有优化解决方案,并将其减少到一个序列中的单个目标问题。这些简化后的问题然后可以使用 Augmented Lagrangian 方法解决,使用流行的优化技术such as Adam和Stochastic Gradient Descent,同时有效地处理约束。我们的工作旨在解决深度神经网络模型的(经济和生态)可持续性问题,尤其是深度多任务模型,这些模型通常具有很多权重,以便在多个任务上具有相同的性能。通过对两个机器学习数据集进行实验,我们示出了在训练过程中适应性减少模型的可能性,只要愿意在任务特定的网络权重上应用适应。代码可以在 https://github.com/salomonhotegni/MDMTN 上获取。
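To make the scalarization step concrete, here is a small sketch of a weighted Chebyshev scalarization of two objectives, e.g. task loss versus a sparsity penalty. The reference point, weights, and the plain SGD loop are illustrative assumptions rather than the paper's exact formulation, which additionally uses an Augmented Lagrangian to handle constraints.

```python
import torch

def weighted_chebyshev(losses, weights, reference):
    """max_i w_i * (f_i - z_i): minimizing this traces out trade-offs between the objectives."""
    terms = torch.stack([w * (f - z) for f, w, z in zip(losses, weights, reference)])
    return terms.max()

model = torch.nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(128, 20), torch.randn(128, 1)
weights, reference = (0.7, 0.3), (0.0, 0.0)   # illustrative choices

for step in range(200):
    task_loss = torch.nn.functional.mse_loss(model(x), y)
    sparsity = sum(p.abs().mean() for p in model.parameters())  # L1 proxy for sparsity
    loss = weighted_chebyshev([task_loss, sparsity], weights, reference)
    opt.zero_grad()
    loss.backward()
    opt.step()
```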

LLMRec: Benchmarking Large Language Models on Recommendation Task

  • paper_url: http://arxiv.org/abs/2308.12241
  • repo_url: https://github.com/williamliujl/llmrec
  • paper_authors: Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, Philip S. Yu
  • for: 本研究旨在探索大型语言模型(LLM)在推荐领域的应用。
  • methods: 本研究对多种常用的现成 LLM(如 ChatGPT、LLaMA、ChatGLM)在五种推荐任务上进行了基准测试,包括评分预测、顺序推荐、直接推荐、解释生成和评论摘要。此外,我们还研究了监督微调在提高 LLM 指令遵循能力方面的有效性。
  • results: 结果表明 LLM 在基于准确性的任务中仅表现出中等能力,但在基于可解释性的任务中与当前最先进方法相当。此外,我们还进行了质量评估,发现 LLM 能真正理解所提供的信息,并生成更清晰、更合理的结果。
    Abstract Recently, the fast development of Large Language Models (LLMs) such as ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. However, the application of LLMs in the recommendation domain has not been thoroughly investigated. To bridge this gap, we propose LLMRec, a LLM-based recommender system designed for benchmarking LLMs on various recommendation tasks. Specifically, we benchmark several popular off-the-shelf LLMs, such as ChatGPT, LLaMA, ChatGLM, on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Furthermore, we investigate the effectiveness of supervised finetuning to improve LLMs' instruction compliance ability. The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation. However, they demonstrated comparable performance to state-of-the-art methods in explainability-based tasks. We also conduct qualitative evaluations to further evaluate the quality of contents generated by different models, and the results show that LLMs can truly understand the provided information and generate clearer and more reasonable results. We aspire that this benchmark will serve as an inspiration for researchers to delve deeper into the potential of LLMs in enhancing recommendation performance. Our codes, processed data and benchmark results are available at https://github.com/williamliujl/LLMRec.
    摘要 近些时候,大型语言模型(LLM)如ChatGPT的快速发展已经对自然语言处理(NLP)任务提供了显著改进。然而,LLM在推荐领域的应用还未得到了全面的探索。为了填补这一空白,我们提出了LLMRec,一个基于LLM的推荐系统,用于对不同的推荐任务进行比较。具体来说,我们对几种流行的准备好的LLM,如ChatGPT、LLaMA、ChatGLM进行了多种推荐任务的评估,包括评分预测、顺序推荐、直接推荐、解释生成和评论概要。此外,我们还 investigate了LLM的指导遵从能力是否可以通过监督微调来改进。 benchmark结果显示,LLM在精度基于任务中只示 moderate 的能力,但在可解释性基于任务中表现和当前领先方法相当。我们还进行了质量评估,以评估不同模型生成的内容质量,结果表明LLM可以真正理解提供的信息,并生成更清晰和合理的结果。我们希望这个 benchmark 能够激励研究人员更深入研究LLM在提高推荐性能方面的潜在力量。我们的代码、处理数据和 benchmark 结果可以在 中找到。
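The benchmark boils each recommendation task down to a text prompt sent to an off-the-shelf LLM. The template and parsing below are a hypothetical illustration of the rating-prediction setup, not the exact prompts released in the LLMRec repository.

```python
import re

def build_rating_prompt(user_history, candidate_item):
    """Hypothetical rating-prediction prompt in the spirit of LLMRec-style benchmarks."""
    history = "\n".join(f"- {title}: {rating}/5" for title, rating in user_history)
    return (
        "You are a recommender system. Given the user's past ratings, "
        "predict their rating for the candidate item on a 1-5 scale. "
        "Answer with a single number.\n\n"
        f"Past ratings:\n{history}\n\nCandidate item: {candidate_item}\nPredicted rating:"
    )

def parse_rating(llm_output: str, default: float = 3.0) -> float:
    match = re.search(r"[1-5](?:\.\d+)?", llm_output)
    return float(match.group()) if match else default

prompt = build_rating_prompt([("The Matrix", 5), ("Titanic", 2)], "Blade Runner")
# response = some_llm_api(prompt)   # e.g. a ChatGPT / LLaMA / ChatGLM call, omitted here
print(parse_rating("I would estimate 4 out of 5."))  # -> 4.0
```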

Enhancing cardiovascular risk prediction through AI-enabled calcium-omics

  • paper_url: http://arxiv.org/abs/2308.12224
  • repo_url: None
  • paper_authors: Ammar Hoori, Sadeer Al-Kindi, Tao Hu, Yingnan Song, Hao Wu, Juhwan Lee, Nour Tashtish, Pingfu Fu, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson
  • for: The paper aims to determine whether AI methods using detailed calcification features (calcium-omics) can improve the prediction of major adverse cardiovascular events (MACE).
  • methods: The study uses a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) exams enriched for MACE events, employs sampling techniques to enhance model training, and investigates Cox models with selected features to identify explainable high-risk characteristics.
  • results: The proposed calcium-omics model with modified synthetic down-sampling and up-sampling gave a higher C-index and two-year AUC than the Agatston score. Numbers of calcifications, LAD mass, and diffusivity were important determinants of increased risk, while dense calcification was associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high-risk group in a held-out test, with a categorical net-reclassification index of NRI=0.153.
    Abstract Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). Traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease. Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction. Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTri-als.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics. Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153. Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.
    摘要 背景:肺动脉 calcification(CAC)是一个强大的预测主要冠状疾病事件(MACE)的预测因素。传统的阿加顿分数简单地总计calcification,却不是线性的,留下了更好的calcification评估方法来更全面地捕捉疾病的程度。目标:确定AI方法使用细节calcification特征(i.e., calcium-omics)可以提高MACE预测。方法:我们调查了calcification的更多特征,包括质量、体积、密度、空间分布、领域等。我们使用了一个Cox模型与栅格 regularization,对于2457个CT calcification score(CTCS)中的MACE事件进行了大规模的no-cost CLARIFY计划(ClinicalTrials.gov Identifier: NCT04075162)。我们使用了采样技术来增强模型训练。我们还使用了Cox模型选择特征来 Identify可解释的高风险特征。结果:我们的提议的calcification-omics模型与修改后的同步下采样和上采样给C-index(80.5%/71.6%)和两年AUC(82.4%/74.8%),分别在(80:20,训练/测试)。结果与阿加顿相比,给C-index(71.3%/70.3%)和AUC(71.8%/68.8%)。 Among calcification-omics features,numbers of calcifications、LAD mass和diffusivity(一种度量空间分布)是风险增加的重要决定因素,而dense calcification(>1000HU)与lower risk相关。calcification-omics模型在一个保留测试中重新分类了63%的MACE患者。NRI=0.153。结论:AI对肺动脉calcification的分析可以导致更好的结果,相比阿加顿分数。我们的发现表明calcification-omics的使用可以提高预测风险的准确性。
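A minimal example of the statistical core described above: a Cox proportional-hazards model with elastic-net regularization fit on calcium-omics-style features using the `lifelines` library. The feature names, penalty strength, and synthetic survival data are assumptions for illustration, not the study's actual cohort or settings.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "num_calcifications": rng.poisson(4, n),
    "lad_mass": rng.gamma(2.0, 30.0, n),
    "diffusivity": rng.normal(0.5, 0.1, n),
    "dense_calc_frac": rng.beta(2, 8, n),
})
risk = 0.15 * df["num_calcifications"] + 0.01 * df["lad_mass"] - 1.0 * df["dense_calc_frac"]
df["duration"] = rng.exponential((1.0 / np.exp(risk - risk.mean())).to_numpy())  # synthetic times
df["event"] = rng.binomial(1, 0.3, n)                                            # 1 = MACE observed

# Elastic-net penalized Cox model (penalizer / l1_ratio chosen arbitrarily here).
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph.fit(df, duration_col="duration", event_col="event")
print(cph.concordance_index_)                    # C-index on the training data
print(cph.summary[["coef", "exp(coef)"]])
```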

Critical Learning Periods Emerge Even in Deep Linear Networks

  • paper_url: http://arxiv.org/abs/2308.12221
  • repo_url: None
  • paper_authors: Michael Kleinman, Alessandro Achille, Stefano Soatto
  • for: 本研究探讨了深度网络中的关键学习期(critical learning periods),即学习早期的暂时感知缺陷可能对后续学习和表征产生持久影响的阶段。
  • methods: 本研究使用深度线性网络模型,通过理论分析和实验刻画关键学习期的成因与依赖关系。
  • results: 研究发现,关键学习期同样存在于深度线性网络中,且与模型深度和数据分布结构有关。此外,研究还分析了多任务学习中预训练对迁移性能的影响,并提出了一种基于特征来源之间竞争的解释。
    Abstract Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology. Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks.
    摘要 “重要学习期是在发育早期的一些时期,这些时期的暂时感知缺陷可能会导致永久性的行为和学习表征的改变。尽管生物和人工网络之间有很大差异,但critical periods仍然在两种系统中被观察到。这表明critical periods可能是学习的基本特征,而不是生物学意外现象。然而,critical periods在哪里emerge仍然是一个开放的问题,尤其是不知道critical periods在两种系统中是否受到特定的建筑或优化细节的影响。为了孤立关键因素,我们将注重深度Linear Network模型,并显示了这些网络在生物和人工网络中显示了大量的行为,同时可以进行分析处理。我们发现critical periods与模型的深度和数据分布结构有关,并且在分析和 simulations中表明了学习特征的吸引是与多源竞争相关。最后,我们扩展我们的分析到多任务学习,并显示了在某些任务上进行预训练可能会对新任务的传输性能产生负面影响,并且如何这种影响与任务之间的关系和预训练阶段的长度有关。到目前为止,我们的工作提供了第一个可分析的模型,可以解释critical learning periods在生物和人工网络中的出现。”
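The phenomenon can be probed with a few lines of code: train a deep linear network on a regression task, corrupt the inputs (a simulated "sensory deficit") for part of training, and compare the final clean-input error across deficit schedules. The architecture, deficit type, and schedule below are illustrative assumptions, not the paper's exact experimental protocol.

```python
import torch

def run(deficit_window=(0, 0), total_epochs=600, depth=6, d=30, seed=0):
    """Train a deep linear network; corrupt inputs during `deficit_window` to simulate a temporary deficit."""
    torch.manual_seed(seed)
    teacher = torch.randn(d, d) / d ** 0.5
    x = torch.randn(2048, d)
    y = x @ teacher.T

    net = torch.nn.Sequential(*[torch.nn.Linear(d, d, bias=False) for _ in range(depth)])
    opt = torch.optim.SGD(net.parameters(), lr=5e-3)
    start, end = deficit_window

    for epoch in range(total_epochs):
        inputs = x + 2.0 * torch.randn_like(x) if start <= epoch < end else x
        loss = torch.nn.functional.mse_loss(net(inputs), y)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():
        return torch.nn.functional.mse_loss(net(x), y).item()  # final error on clean inputs

print("control       :", run())
print("early deficit :", run(deficit_window=(0, 200)))    # deficit applied in early epochs
print("late deficit  :", run(deficit_window=(300, 500)))  # same-length deficit applied later
```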

Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning

  • paper_url: http://arxiv.org/abs/2308.12219
  • repo_url: https://github.com/yegcjs/diffusionllm
  • paper_authors: Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Quanquan Gu
  • for: 这篇论文旨在探讨 diffusion 语言模型是否可以解决通用语言任务,并证明可以通过扩大数据、大小和任务来使 diffusion 语言模型成为强大的语言学习模型。
  • methods: 作者首先通过大量数据的隐藏语言模型预训练获得知识,然后通过填充式适应来转化预训练的masked语言模型为 diffusion 语言模型,并在不同任务上进行任务特定的训练和指令特化训练来解锁其多样性。
  • results: 实验显示,随着 diffusion 语言模型的扩大,其表现在下游语言任务中逐渐提高,并且在不同任务上具有zero-shot和几少shot在场景学习能力,可以根据自然语言指令来解决许多未看过任务。
    Abstract The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic models and the scalable capabilities of large language models. Despite their potential, it remains elusive whether diffusion language models can solve general language tasks comparable to their autoregressive counterparts. This paper demonstrates that scaling diffusion models w.r.t. data, sizes, and tasks can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining thanks to their intrinsic connections. We then reprogram pretrained masked language models into diffusion language models via diffusive adaptation, wherein task-specific finetuning and instruction finetuning are explored to unlock their versatility in solving general language tasks. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks. We further discover that instruction finetuning can elicit zero-shot and few-shot in-context learning abilities that help tackle many unseen tasks by following natural language instructions, and show promise in advanced and challenging abilities such as reasoning
    摘要 最近的生成AI冲击浪潮得以归功于扩散概率模型的生成能力和大语言模型的可扩展性。 despite their potential, it remains unclear whether diffusion language models can solve general language tasks comparable to their autoregressive counterparts. This paper demonstrates that scaling diffusion models with respect to data, sizes, and tasks can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining, thanks to their intrinsic connections. We then reprogram pretrained masked language models into diffusion language models via diffusive adaptation, wherein task-specific finetuning and instruction finetuning are explored to unlock their versatility in solving general language tasks. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks. We further discover that instruction finetuning can elicit zero-shot and few-shot in-context learning abilities that help tackle many unseen tasks by following natural language instructions, and show promise in advanced and challenging abilities such as reasoning.

cs.CL - 2023-08-24

Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques

  • paper_url: http://arxiv.org/abs/2308.12842
  • repo_url: None
  • paper_authors: Sagar Kulkarni, Sharvari Govilkar, Dhiraj Amin
  • for: This paper targets the problem of plagiarism in image content, such as figures, graphs, and tables, and proposes a system to detect plagiarism in such content.
  • methods: The proposed system uses a combination of statistical algorithms, including Jaccard and Cosine, and semantic algorithms, such as LSA, BERT, and WordNet, to detect plagiarism in image content.
  • results: The system outperformed in detecting efficient and accurate plagiarism in image content, demonstrating its effectiveness in addressing the challenge of plagiarism in this area.
    Abstract Plagiarism detection is one of the most researched areas among the Natural Language Processing(NLP) community. A good plagiarism detection covers all the NLP methods including semantics, named entities, paraphrases etc. and produces detailed plagiarism reports. Detection of Cross Lingual Plagiarism requires deep knowledge of various advanced methods and algorithms to perform effective text similarity checking. Nowadays the plagiarists are also advancing themselves from hiding the identity from being catch in such offense. The plagiarists are bypassed from being detected with techniques like paraphrasing, synonym replacement, mismatching citations, translating one language to another. Image Content Plagiarism Detection (ICPD) has gained importance, utilizing advanced image content processing to identify instances of plagiarism to ensure the integrity of image content. The issue of plagiarism extends beyond textual content, as images such as figures, graphs, and tables also have the potential to be plagiarized. However, image content plagiarism detection remains an unaddressed challenge. Therefore, there is a critical need to develop methods and systems for detecting plagiarism in image content. In this paper, the system has been implemented to detect plagiarism form contents of Images such as Figures, Graphs, Tables etc. Along with statistical algorithms such as Jaccard and Cosine, introducing semantic algorithms such as LSA, BERT, WordNet outperformed in detecting efficient and accurate plagiarism.
    摘要 “抄袭探测是自然语言处理(NLP)社区中最受欢迎的研究领域之一。一个好的抄袭探测系统应包括所有NLP方法,包括语意、名称实体、重复文本等,并生成详细的抄袭报告。跨语言抄袭探测需要深厚的多种高级方法和算法,以进行有效的文本相似性检查。现在,抄袭者也在不断地提高自己的隐身技巧,以避免被检测。抄袭者会使用技巧如重写、词汇替换、不一致的引用、翻译语言等。图像内容抄袭探测(ICPD)已经获得了重要性,通过进阶的图像内容处理技术来确保图像内容的完整性。但是,图像内容抄袭探测仍然是一个未解决的挑战。因此,有一个急需开发方法和系统来检测图像内容中的抄袭。在这篇文章中,我们已经实现了对图像内容中的内容进行抄袭探测,包括 figura、图表、グラフ等。我们还使用了统计算法如Jaccard和Cosine,以及语义算法如LSA、BERT和WordNet,它们在检测效率和准确性方面表现出色。”
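For reference, the two statistical measures mentioned in the abstract can be computed as follows over text extracted (e.g. via OCR) from two figures or tables. The whitespace tokenization is a simplifying assumption rather than the paper's exact preprocessing.

```python
from collections import Counter
from math import sqrt

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in set(ca) & set(cb))
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

ocr_text_1 = "average accuracy of the proposed model on benchmark datasets"
ocr_text_2 = "accuracy of the proposed model averaged over benchmark datasets"
print(jaccard(ocr_text_1, ocr_text_2), cosine(ocr_text_1, ocr_text_2))
```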

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities

  • paper_url: http://arxiv.org/abs/2308.12833
  • repo_url: None
  • paper_authors: Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
  • for: 本研究旨在提高大语言模型(LLM)的安全性和安全性问题的认识,包括恶意使用、人工欺诈和生成恶意软件等问题。
  • methods: 本研究使用了现有的科学研究来识别和解决LLM所存在的威胁和漏洞。我们还提出了一个概念体系,描述了威胁、预防措施和预防措施的漏洞之间的关系。
  • results: 本研究希望通过提高开发者和实践者对LLM安全性问题的认识,提高LLM在安全性方面的可靠性和可靠性。
    Abstract Spurred by the recent rapid increase in the development and distribution of large language models (LLMs) across industry and academia, much recent work has drawn attention to safety- and security-related threats and vulnerabilities of LLMs, including in the context of potentially criminal activities. Specifically, it has been shown that LLMs can be misused for fraud, impersonation, and the generation of malware; while other authors have considered the more general problem of AI alignment. It is important that developers and practitioners alike are aware of security-related problems with such models. In this paper, we provide an overview of existing - predominantly scientific - efforts on identifying and mitigating threats and vulnerabilities arising from LLMs. We present a taxonomy describing the relationship between threats caused by the generative capabilities of LLMs, prevention measures intended to address such threats, and vulnerabilities arising from imperfect prevention measures. With our work, we hope to raise awareness of the limitations of LLMs in light of such security concerns, among both experienced developers and novel users of such technologies.
    摘要 受到大语言模型(LLM)的快速发展和散布的影响,近期的许多研究都集中在LLM的安全和安全性问题上,特别是在涉及到可犯罪活动的情况下。据显示,LLM可以被滥用于诈骗、人身伪造和生成恶意软件等; 而其他作者则考虑了更一般的人工智能对齐问题。在这篇论文中,我们提供了现有的主要科学努力,以识别和解决由LLM引起的威胁和漏洞。我们提出了一种分类方案,描述了由LLM生成能力引起的威胁、防范措施和不完全防范措施所导致的漏洞之间的关系。我们希望通过这种工作,让开发者和实践者都意识到LLM的安全限制,以及相关的安全问题。

WavMark: Watermarking for Audio Generation

  • paper_url: http://arxiv.org/abs/2308.12770
  • repo_url: None
  • paper_authors: Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei
  • for: 这篇论文旨在提出一种新的声音水印框架,以增强声音水印的鲁棒性和可靠性。
  • methods: 该框架使用了1秒钟的音频抽象,并在不可见的情况下编码了32位水印。它还可以组合多个水印段以实现更高的鲁棒性和容量。
  • results: 该框架在10-20秒的音频上实现了0.48%的比特错误率,比现有的水印工具减少了超过2800%的比特错误率。
    Abstract Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48\% across ten common attacks, a remarkable reduction of over 2800\% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
    摘要 最近的零频讲话突破口已经使得可以通过几秒钟的录音来模仿说话者的声音,同时保持高度的真实感。然而,这项强大技术也带来了明显的风险,包括声音骗财和说话者模仿。不同于传统的仅仅依靠静止方法来检测合成数据,水印技术提供了一种积极和坚强的防御机制。这篇论文介绍了一种创新的音频水印框架,可以在1秒钟的音频截取中编码Up to 32位的水印,人类感知不到。这个水印具有强大的抗击攻击特性,可以作为合成声音的标识符,并且有广泛的应用前途在音频版权保护方面。此外,该框架具有高灵活性,可以将多个水印段组合以实现更高的坚强性和扩展性。使用10-20秒的音频作为主机,我们的方法在十种常见攻击中显示了平均的比特错误率(BER)为0.48%,相比之下,当前的水印工具的BER下降了超过2800%。请参考https://aka.ms/wavmark了解我们的工作。

Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

  • paper_url: http://arxiv.org/abs/2308.12734
  • repo_url: None
  • paper_authors: Jordan J. Bird, Ahmad Lotfi
  • for: 这个研究旨在探讨深度复制声音技术,以探索它们对于声音识别和伪造的潜在影响。
  • methods: 这个研究使用了Retrieval-based Voice Conversion技术生成了DEEP-VOICE数据集,包含了八位知名人士的真实声音,并将其转换为对方的声音。研究以binary分类问题的形式进行分析,使用了时间 audio 特征的 Statistical Analysis,发现真实声音和AI生成声音之间存在显著的不同分布。
  • results: 这个研究发现,使用Extreme Gradient Boosting模型可以实现平均分类精度为99.3%,并且可以在0.004毫秒运行,即一秒钟的声音。所有的数据都公开发布,以便未来的研究人员对于AI声音检测进行更多的研究。
    Abstract There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is generated in this study, comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Presenting as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that there are significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
    摘要 “有着增长的对话AI生成技术的含义,可以让语音变为别人的语音。这种技术可能会导致隐私泄露和误导,因此需要实时检测AI生成的语音。为解决这些问题,本研究中提出了DEEP-VOICE数据集,包括8名知名人士的真实语音和使用检索式语音转换后转换为别人的语音。这被视为一个二分类问题,即语音是真实的或者是AI生成的。通过统计Audio特征的时间分布,通过t检测发现了不同的分布。对于机器学习模型来源的标识,进行了参数优化。经过208个个人机器学习模型的10次横向分割训练,发现了使用极限梯度提升模型,可以在0.004毫秒内将语音分类为真实或AI生成,并且在1秒钟语音时,平均分类精度达99.3%。所有用于本研究的数据都公开发布,以便未来关于AI语音检测的研究。”
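A minimal sketch of the classification setup described above: summary statistics of temporal audio features feed a gradient-boosted classifier. The MFCC-based feature choice, the synthetic stand-in clips, and the hyperparameters are assumptions; the released DEEP-VOICE dataset and feature set may differ.

```python
import numpy as np
import librosa
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

SR = 16000

def temporal_features(audio: np.ndarray, sr: int = SR) -> np.ndarray:
    """Per-clip summary statistics of MFCCs over a one-second window (an assumed feature set)."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Stand-in clips: in practice these would be 1-second excerpts of real vs. voice-converted speech.
rng = np.random.default_rng(0)
t = np.arange(SR) / SR
real = [np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(SR) for _ in range(20)]
fake = [np.sign(np.sin(2 * np.pi * 220 * t)) + 0.05 * rng.standard_normal(SR) for _ in range(20)]

X = np.stack([temporal_features(a.astype(np.float32)) for a in real + fake])
y = np.array([0] * 20 + [1] * 20)   # 0 = real, 1 = AI-generated

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
print(cross_val_score(clf, X, y, cv=5).mean())   # the paper reports ~99.3% with 10-fold CV on DEEP-VOICE
```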

Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models

  • paper_url: http://arxiv.org/abs/2308.12711
  • repo_url: None
  • paper_authors: Yue Wang, Xinrui Wang, Juntao Li, Jinxiong Chang, Qishen Zhang, Zhongyi Liu, Guannan Zhang, Min Zhang
  • for: This paper aims to explore alternative methods for generating high-quality instruction data for training large language models, without relying on closed-source models.
  • methods: The authors investigate various existing instruction generation methods and integrate the most efficient variant with two novel strategies to enhance quality.
  • results: The generated instruction data outperforms Alpaca, a method reliant on closed-source models, as demonstrated by evaluation results from two benchmarks and the GPT-4 model.
  • for: 本研究目的是探讨不使用关闭源模型的方法生成高质量的指令数据,用于训练大语言模型。
  • methods: 作者们研究了多种现有的指令生成方法,并将最高效的变体与两种新策略相结合,以进一步提高质量。
  • results: 从两个基准和 GPT-4 模型的评估结果来看,生成的指令数据优于依赖闭源模型的 Alpaca 方法。
    Abstract Instruction tuning is instrumental in enabling Large Language Models~(LLMs) to follow user instructions to complete various open-domain tasks. The success of instruction tuning depends on the availability of high-quality instruction data. Owing to the exorbitant cost and substandard quality of human annotation, recent works have been deeply engaged in the exploration of the utilization of powerful closed-source models to generate instruction data automatically. However, these methods carry potential risks arising from the usage requirements of powerful closed-source models, which strictly forbid the utilization of their outputs to develop machine learning models. To deal with this problem, in this work, we explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models. Our exploration includes an investigation of various existing instruction generation methods, culminating in the integration of the most efficient variant with two novel strategies to enhance the quality further. Evaluation results from two benchmarks and the GPT-4 model demonstrate the effectiveness of our generated instruction data, which can outperform Alpaca, a method reliant on closed-source models. We hope that more progress can be achieved in generating high-quality instruction data without using closed-source models.
    摘要 大型语言模型(LLM)的 instrucion tuning 是实现用户指令完成各种开放领域任务的关键因素。 instrucion tuning 的成功取决于高质量的 instrucion 数据的可用性。由于人工标注的成本高昂且质量不高,现有的研究专注于自动生成 instrucion 数据的方法。但这些方法存在受到强大关闭源模型的使用需求的风险。为了解决这个问题,这个工作寻找不靠强大关闭源模型的替代方法来生成高质量的 instrucion 数据。我们的探索包括评估多种现有的 instrucion 生成方法,并将最高效的variant与两个新的策略相互融合,以进一步提高 instrucion 数据的质量。实验结果显示,我们所生成的 instrucion 数据能够超越 Alpaca,一种基于关闭源模型的方法。我们希望这个领域能够获得更多的进步,以生成高质量的 instrucion 数据,不靠强大关闭源模型。

From Chatter to Matter: Addressing Critical Steps of Emotion Recognition Learning in Task-oriented Dialogue

  • paper_url: http://arxiv.org/abs/2308.12648
  • repo_url: None
  • paper_authors: Shutong Feng, Nurul Lubis, Benjamin Ruppik, Christian Geishauser, Michael Heck, Hsien-chin Lin, Carel van Niekerk, Renato Vukovic, Milica Gašić
  • for: 提高对话情感识别(Emotion Recognition in Conversations,ERC)的性能,特别是在任务导向对话(Task-Oriented Dialogues,ToDs)中。
  • methods: 将面向闲聊对话(chit-chat)的 ERC 模型转化为面向任务导向对话的模型,涉及三个关键方面:数据、特征和目标。首先,我们提出了两种增强罕见情感的方法;其次,我们使用对话状态作为辅助特征,以纳入用户目标信息;最后,我们利用 ToD 中的多方面情感定义,设计了多任务学习目标和一种新的情感距离加权损失函数。
  • results: 在 EmoWOZ 大规模数据集上,我们的框架能够提升多种面向闲聊对话的 ERC 模型的性能。此外,我们还研究了最佳模型在不同 ToD 数据集上预测用户满意度的泛化能力,与监督基线的比较显示出较强的零样本能力,表明该框架可用于更广泛的场景。
    Abstract Emotion recognition in conversations (ERC) is a crucial task for building human-like conversational agents. While substantial efforts have been devoted to ERC for chit-chat dialogues, the task-oriented counterpart is largely left unattended. Directly applying chit-chat ERC models to task-oriented dialogues (ToDs) results in suboptimal performance as these models overlook key features such as the correlation between emotions and task completion in ToDs. In this paper, we propose a framework that turns a chit-chat ERC model into a task-oriented one, addressing three critical aspects: data, features and objective. First, we devise two ways of augmenting rare emotions to improve ERC performance. Second, we use dialogue states as auxiliary features to incorporate key information from the goal of the user. Lastly, we leverage a multi-aspect emotion definition in ToDs to devise a multi-task learning objective and a novel emotion-distance weighted loss function. Our framework yields significant improvements for a range of chit-chat ERC models on EmoWOZ, a large-scale dataset for user emotion in ToDs. We further investigate the generalisability of the best resulting model to predict user satisfaction in different ToD datasets. A comparison with supervised baselines shows a strong zero-shot capability, highlighting the potential usage of our framework in wider scenarios.
    摘要 情感认知在对话中(ERC)是创建人类化对话代理的关键任务。虽然有大量努力投入到了ERC的普通对话(Chit-chat)中,但相对的任务导向对话(ToD)尚未得到了足够的注意。直接将Chit-chat ERC模型应用到ToD中会导致性能下降,因为这些模型忽略了对话完成任务的关键特征。在这篇论文中,我们提出了一个框架,把Chit-chat ERC模型转换成任务导向的模型,解决了三个关键方面:数据、特征和目标。首先,我们提出了两种增强罕见情感的方法,以提高ERC性能。其次,我们使用对话状态作为助记特征,以包含用户的目标信息。最后,我们采用了多方面情感定义和多任务学习目标,并提出了一种新的情感距离权重损失函数。我们的框架在EmoWOZ大规模数据集上实现了显著的改进,并且我们进一步调查了最佳模型在不同的ToD数据集中预测用户满意度的能力。与超级vised基准相比,我们的模型在零上shot情况下具有强大的能力, highlighting the potential usage of our framework in wider scenarios.

Probabilistic Method of Measuring Linguistic Productivity

  • paper_url: http://arxiv.org/abs/2308.12643
  • repo_url: None
  • paper_authors: Sergei Monakhov
  • for: 这个论文旨在提出一种新的语言产率测量方法,以评估词根在拓展新词汇时的能力,而不是直接依赖于token频率。
  • methods: 该方法认为语言产率可以视为词根与随机基本单词的概率相互结合。这些优点包括:首先,token频率不会直接影响产率测量;其次,我们不仅是计数已知词类具有词根,而是通过模拟词类的构造并检查它们是否存在于词库中来评估词根的产率。最后,基于词库和随机设计,新词和旧词都有平等的机会被选择。
  • results: 在英语和俄语数据上测试了该算法,结果显示,语言产率与词类数量和token频率之间存在一定的关系。具体来说,语言产率首先增加高频项目,然后才增加低频项目。
    Abstract In this paper I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words and, unlike other popular measures, is not directly dependent upon token frequency. Specifically, I suggest that linguistic productivity may be viewed as the probability of an affix to combine with a random base. The advantages of this approach include the following. First, token frequency does not dominate the productivity measure but naturally influences the sampling of bases. Second, we are not just counting attested word types with an affix but rather simulating the construction of these types and then checking whether they are attested in the corpus. Third, a corpus-based approach and randomised design assure that true neologisms and words coined long ago have equal chances to be selected. The proposed algorithm is evaluated both on English and Russian data. The obtained results provide some valuable insights into the relation of linguistic productivity to the number of types and tokens. It looks like burgeoning linguistic productivity manifests itself in an increasing number of types. However, this process unfolds in two stages: first comes the increase in high-frequency items, and only then follows the increase in low-frequency items.
    摘要 在这篇论文中,我提出了一种新的语言产率测量方法,该方法对象ively评估一个词缀的使用能力,并不直接受到各种各样的token频率的影响。 Specifically, 我建议将语言产率视为一个词缀与随机基础的概率。这些方法的优点包括以下几点:首先,token频率不会控制产率测量,而是自然地影响采样的基础。其次,我们不仅是统计已知的单词类型,而是通过模拟这些类型的构建,然后检查它们是否存在于词库中。最后,基于词库和随机设计,所有新词和昔日的词都有平等的机会被选择。我们使用的算法在英语和俄语数据上进行了评估,得到的结果提供了一些有价值的发现,关于语言产率与单元和token之间的关系。看来,语言产率的增长 manifested itself in an increasing number of types,但是这个过程发展在两个阶段:首先是高频项的增长,然后才是低频项的增长。
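The proposed measure can be sketched as a simple Monte-Carlo estimate: sample random bases, form candidate coinages with the affix, and check whether they are attested in the corpus vocabulary. The base sampling distribution and the naive string concatenation are simplifications of the paper's procedure.

```python
import random

def productivity(affix: str, bases: list[str], vocabulary: set[str],
                 n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate P(affix combines with a random base) as the attested fraction of sampled coinages."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        base = rng.choice(bases)        # bases may instead be sampled according to corpus frequency
        candidate = base + affix        # naive concatenation; real morphology needs allomorphy rules
        hits += candidate in vocabulary
    return hits / n_samples

vocabulary = {"readable", "drinkable", "workable", "runner", "reader", "worker", "drinker"}
bases = ["read", "drink", "work", "run", "sleep", "think"]
print("-able:", productivity("able", bases, vocabulary))
print("-er:  ", productivity("er", bases, vocabulary))
```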

PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

  • paper_url: http://arxiv.org/abs/2308.12604
  • repo_url: None
  • paper_authors: Haibo Jin, Haoxuan Che, Yi Lin, Hao Chen
  • for: The paper aims to improve the accuracy of automatic medical report generation (MRG) by proposing a novel framework called PromptMRG.
  • methods: PromptMRG uses an encoder-decoder architecture with an extra disease classification branch, and incorporates cross-modal feature enhancement and an adaptive logit-adjusted loss to address the challenges of disease imbalance and precise clinical understanding.
  • results: The proposed method achieves state-of-the-art clinical efficacy performance on two MRG benchmarks, demonstrating its effectiveness in improving the accuracy of MRG.
    Abstract Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and the identification of clinical findings. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnostic performance unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalanced issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.
    摘要 自动医疗报告生成(MRG)具有很大的研究价值,因为它有可能减轻医生对报告写作的重荷。尽管最近有所进步,但准确的MRG仍然是一项挑战,因为需要精准的临床理解和病理发现的识别。此外,疾病的分布不均,使得报告生成的性能变得更加困难,因为罕见疾病在训练数据中的表现不充分,导致报告生成的准确性受到影响。为 Address these challenges, we propose 诊断驱动的报告生成PromptMRG,一种新的框架,用于提高MRG的诊断准确性,通过诊断意识的指导来Explicitly guide the generation process。具体来说,PromptMRG采用encoder-decoder架构,并添加了疾病分类分支。当生成报告时,从分支中获取的诊断结果会被转换为token提示,以直接导引生成过程。为了进一步提高诊断准确性,我们设计了跨模态特征增强,该方法可以通过在数据库中检索相似报告,帮助诊断查询图像的疾病诊断,并利用预训练的CLIP来增强报告生成的准确性。此外,我们还解决了疾病分布不均的问题,通过应用适应的逻辑调整损失,根据每种疾病的学习状态来调整分支的损失。实验结果表明,提案的方法在两个MRG标准测试集上表现出色,在两个 dataset上都达到了状态的诊断效果。
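One ingredient that translates directly into code is the logit-adjusted classification loss used to counter disease imbalance. The sketch below shows a standard logit adjustment (shifting logits by scaled log class priors before cross-entropy); PromptMRG's adaptive per-disease variant based on learning status is more elaborate, so treat this as an illustrative baseline with made-up class counts.

```python
import torch
import torch.nn.functional as F

class LogitAdjustedLoss(torch.nn.Module):
    """Cross-entropy with logits shifted by tau * log(prior), favouring recall on rare classes."""
    def __init__(self, class_counts, tau: float = 1.0):
        super().__init__()
        priors = torch.as_tensor(class_counts, dtype=torch.float)
        priors = priors / priors.sum()
        self.register_buffer("adjustment", tau * priors.log())

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        return F.cross_entropy(logits + self.adjustment, targets)

# Four hypothetical disease classes with a heavy imbalance (counts are made up).
criterion = LogitAdjustedLoss(class_counts=[5000, 1200, 300, 40], tau=1.0)
logits = torch.randn(8, 4, requires_grad=True)   # classification-branch outputs for a batch of images
targets = torch.randint(0, 4, (8,))
loss = criterion(logits, targets)
loss.backward()
```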

A Small and Fast BERT for Chinese Medical Punctuation Restoration

  • paper_url: http://arxiv.org/abs/2308.12568
  • repo_url: None
  • paper_authors: Tongtao Ling, Chen Liao, Zhipeng Yu, Lei Chen, Shilei Huang, Yi Liu
  • for: 提高自动语音识别(ASR)输出文本中标点的准确性和可读性,使医疗报告更加准确、易于理解。
  • methods: 基于“预训练和微调”范式,提出了一种快速轻量级的预训练模型,并通过监督对比学习和一种新的辅助预训练任务(标点符号预测)使其适应医疗文本的标点恢复。
  • results: 我们的实验表明,该模型可以达到 95% 的性能水平,而模型体积仅为最先进的中文 RoBERTa 模型的 10%。
    Abstract In clinical dictation, utterances after automatic speech recognition (ASR) without explicit punctuation marks may lead to the misunderstanding of dictated reports. To give a precise and understandable clinical report with ASR, automatic punctuation restoration is required. Considering a practical scenario, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make it well-suited for punctuation restoration. Our experiments on various distilled models reveal that our model can achieve 95% performance while 10% model size relative to state-of-the-art Chinese RoBERTa.
    摘要 在临床口述场景中,自动语音识别(ASR)输出的语句若缺少显式标点符号,可能导致对口述报告的误解。为了基于 ASR 生成准确且易于理解的临床报告,自动标点恢复是必要的。面向实际应用场景,我们提出了一种基于“预训练和微调”范式的快速轻量级预训练模型:通过引入监督对比学习和一个新的辅助预训练任务(标点符号预测)对预训练模型进行蒸馏,使其更适合标点恢复任务。我们在多种蒸馏模型上的实验表明,该模型可以达到 95% 的性能,而模型体积仅为最先进的中文 RoBERTa 的 10%。

CARE: Co-Attention Network for Joint Entity and Relation Extraction

  • paper_url: http://arxiv.org/abs/2308.12531
  • repo_url: None
  • paper_authors: Wenjun Kong, Yamei Xia
  • for: 本文旨在提高联合(joint)信息抽取的性能,即同时提取实体和关系信息。
  • methods: 该方法使用 Co-Attention 网络,为两个子任务分别学习表示,以避免特征混淆。其核心是两个子任务之间的协同注意力模块,使模型可以利用实体信息来预测关系,反之亦然,从而实现相互增强。
  • results: 大量实验表明,所提出的模型在三个联合实体-关系抽取基准数据集(NYT、WebNLG 和 SciERC)上表现出色,超过现有基线模型。
    Abstract Joint entity and relation extraction is the fundamental task of information extraction, consisting of two subtasks: named entity recognition and relation extraction. Most existing joint extraction methods suffer from issues of feature confusion or inadequate interaction between two subtasks. In this work, we propose a Co-Attention network for joint entity and Relation Extraction (CARE). Our approach involves learning separate representations for each subtask, aiming to avoid feature overlap. At the core of our approach is the co-attention module that captures two-way interaction between two subtasks, allowing the model to leverage entity information for relation prediction and vice versa, thus promoting mutual enhancement. Extensive experiments on three joint entity-relation extraction benchmark datasets (NYT, WebNLG and SciERC) show that our proposed model achieves superior performance, surpassing existing baseline models.
    摘要 共同实体和关系抽取是信息抽取的基本任务,包括两个子任务:命名实体识别和关系抽取。现有的大多数共同抽取方法受到特征混乱或两个子任务之间不足的互动的问题。在这种情况下,我们提出了一种共同注意力网络 для共同实体和关系抽取(CARE)。我们的方法是学习每个子任务的独立表示,以避免特征重叠。我们的核心方法是两个子任务之间的共同注意力模块,允许模型利用实体信息来预测关系,并 vice versa,从而促进互助。我们在三个共同实体-关系抽取 benchmark 数据集(NYT、WebNLG 和 SciERC)进行了广泛的实验,结果显示,我们提出的模型在比较存在的基准模型上表现出色,超越了现有的基准模型。
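A compact sketch of the two-way interaction described above: entity-task and relation-task token representations attend to each other with standard multi-head cross-attention, so each branch can condition on the other before its own prediction head. The dimensions and residual wiring are illustrative choices, not CARE's exact architecture.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Two-way cross-attention between entity and relation representations."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.ent_from_rel = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rel_from_ent = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_ent = nn.LayerNorm(d_model)
        self.norm_rel = nn.LayerNorm(d_model)

    def forward(self, h_ent: torch.Tensor, h_rel: torch.Tensor):
        # Entity branch queries relation features, and vice versa (mutual enhancement).
        ent_ctx, _ = self.ent_from_rel(query=h_ent, key=h_rel, value=h_rel)
        rel_ctx, _ = self.rel_from_ent(query=h_rel, key=h_ent, value=h_ent)
        return self.norm_ent(h_ent + ent_ctx), self.norm_rel(h_rel + rel_ctx)

h_ent = torch.randn(2, 40, 256)   # [batch, seq_len, dim] entity-specific token representations
h_rel = torch.randn(2, 40, 256)   # relation-specific token representations
ent_out, rel_out = CoAttention()(h_ent, h_rel)
print(ent_out.shape, rel_out.shape)
```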

Large Language Model as Autonomous Decision Maker

  • paper_url: http://arxiv.org/abs/2308.12519
  • repo_url: None
  • paper_authors: Yining Ye, Xin Cong, Yujia Qin, Yankai Lin, Zhiyuan Liu, Maosong Sun
  • for: This paper aims to enable large language models (LLMs) to make autonomous decisions by endowing them with self-judgment ability, allowing them to explore and judge decision steps based on their values and utilities.
  • methods: The proposed approach, called JuDec, uses an Elo-based Self-Judgment Mechanism to assign Elo scores to decision steps, judging their values and utilities via pairwise comparisons between two solutions, and guiding the decision-searching process toward the optimal solution.
  • results: Experimental results on the ToolBench dataset show that JuDec achieves over 10% improvement in Pass Rate on diverse tasks, offering higher-quality solutions and reducing costs (ChatGPT API calls), demonstrating its effectiveness and efficiency.
    Abstract While large language models (LLMs) exhibit impressive language understanding and in-context learning abilities, their decision-making ability still heavily relies on the guidance of task-specific expert knowledge when solving real-world tasks. To unleash the potential of LLMs as autonomous decision makers, this paper presents an approach JuDec to endow LLMs with the self-judgment ability, enabling LLMs to achieve autonomous judgment and exploration for decision making. Specifically, in JuDec, Elo-based Self-Judgment Mechanism is designed to assign Elo scores to decision steps to judge their values and utilities via pairwise comparisons between two solutions and then guide the decision-searching process toward the optimal solution accordingly. Experimental results on the ToolBench dataset demonstrate JuDec's superiority over baselines, achieving over 10% improvement in Pass Rate on diverse tasks. It offers higher-quality solutions and reduces costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
    摘要 大型语言模型(LLM)具有吸引人的语言理解和Contextual learning能力,但它们的决策能力仍然受到专业知识的导引。为了让 LLM 成为独立决策者,这篇论文提出了 JuDec approach,旨在赋予 LLM 自我评价能力,使其能够达到自主评估和探索决策。具体来说,JuDec 使用 Elo 分数机制来评估决策步骤的价值和用于导引决策搜索过程。实验结果表明 JuDec 在 ToolBench 数据集上表现出优于基eline,实现了多达 10% 的提升率,并且可以提供更高质量的解决方案,降低 ChatGPT API 调用成本, highlighting 其效率和可行性。
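The Elo-based self-judgment can be illustrated with the standard Elo update: whenever the LLM compares two candidate decision steps pairwise and picks a winner, both steps' scores are adjusted, and search is steered toward high-scoring steps. The K-factor and initial rating below are conventional defaults, not necessarily the paper's settings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    ea = expected_score(r_a, r_b)
    sa = 1.0 if a_wins else 0.0
    return r_a + k * (sa - ea), r_b + k * ((1.0 - sa) - (1.0 - ea))

# Decision steps start at a common baseline rating; pairwise LLM judgments update them.
ratings = {"step_A": 1000.0, "step_B": 1000.0, "step_C": 1000.0}
for winner, loser in [("step_A", "step_B"), ("step_A", "step_C"), ("step_C", "step_B")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], a_wins=True)

best = max(ratings, key=ratings.get)   # expand/continue the highest-rated decision step
print(ratings, "->", best)
```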

MultiPA: a multi-task speech pronunciation assessment system for a closed and open response scenario

  • paper_url: http://arxiv.org/abs/2308.12490
  • repo_url: None
  • paper_authors: Yu-Wen Chen, Zhou Yu, Julia Hirschberg
  • for: 这个研究旨在开发一种多任务语音发音评估系统,以提供更加精准和全面的发音技巧评估。
  • methods: 这个系统使用多任务学习方法,包括卷积神经网络和长期循环神经网络,以实现在关闭和开放响应场景下的发音评估。
  • results: 实验结果表明,该系统在封闭响应场景下可达到与现有系统相当的性能,并在直接用于开放响应场景时保持更为稳健的表现。
    Abstract The design of automatic speech pronunciation assessment can be categorized into closed and open response scenarios, each with strengths and limitations. A system with the ability to function in both scenarios can cater to diverse learning needs and provide a more precise and holistic assessment of pronunciation skills. In this study, we propose a Multi-task Pronunciation Assessment model called MultiPA. MultiPA provides an alternative to Kaldi-based systems in that it has simpler format requirements and better compatibility with other neural network models. Compared with previous open response systems, MultiPA provides a wider range of evaluations, encompassing assessments at both the sentence and word-level. Our experimental results show that MultiPA achieves comparable performance when working in closed response scenarios and maintains more robust performance when directly used for open responses.
    摘要

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

  • paper_url: http://arxiv.org/abs/2308.12477
  • repo_url: https://github.com/dell-research-harvard/americanstories
  • paper_authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D’Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
  • for: The paper is written to extract full article texts from newspaper images in the Library of Congress’s public domain Chronicling America collection, with the goal of providing high-quality data for pre-training a large language model and improving historical information accessibility.
  • methods: The paper develops a novel, deep learning pipeline for extracting full article texts from newspaper images, including layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. The pipeline is designed for mobile phones to achieve high scalability.
  • results: The resulting American Stories dataset provides high-quality, structured article texts that can be used for pre-training a large language model, topic classification, detection of reproduced content, and news story clustering. The dataset also facilitates innovation in multimodal layout analysis models and other multimodal applications.
    Abstract Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
    摘要 现有的美国公共领域报纸全文数据集不能识别报纸扫描图像中的复杂布局,因此扫描的内容会混乱,包括文章、标题、标签、广告和其他布局区域。此研究开发了一个新的深度学习管道,用于从报纸图像中提取全文,并应用到美国国会图书馆的公共领域Chronicling America收藏中的 nearly 20 万扫描。该管道包括布局检测、可读性分类、自定义 OCR 和文章тексты横跨多个 bounding box 的关联。为实现高可扩展性,它采用了高效的建筑设计 для移动电话。结果的美国故事数据集提供了高质量的数据,可以用于预训练大型自然语言模型,以达到更好地理解历史英语和历史世界知识。此数据集还可以添加到一个外部数据库中,以便通过检索增强的语言模型来访问历史信息,从 interpretations of political events 到人们祖先的生活细节。此外,结构化的文章文本可以使用 transformer 类型的方法进行流行的社会科学应用,如新闻故事归一化、检测复制内容和主题分类。最后,美国故事提供了一个庞大的高质量银屑数据集,用于创新多模态布局分析模型和其他多模态应用。

Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature

  • paper_url: http://arxiv.org/abs/2308.12420
  • repo_url: None
  • paper_authors: Walter Hernandez, Kamil Tylinski, Alastair Moore, Niall Roche, Nikhil Vadgama, Horst Treiblmaier, Jiangbo Shangguan, Paolo Tasca, Jiahua Xu
  • for: 本研究旨在提供一种机器学习驱动的系统性文献综述方法,用于考察分布式账本技术(DLT)的多种组成部分。
  • methods: 研究采用了 transformer-based 语言模型,使用自己标注的数据集进行Named Entity Recognition (NER) 任务的 fine-tuning,并使用 Temporal Graph Analysis (TGA) 进行文献综述。
  • results: 研究发现了一个包含 505 个关键论文的核心论文集,这些论文具有关于 DLT 的 Environmental, Sustainability, and Governance (ESG) 方面的内容。同时,研究还提供了一个包含 54,808 个名实体的 NER 数据集,可供 DLT 和 ESG-相关的探索。
    Abstract Distributed Ledger Technologies (DLTs) have rapidly evolved, necessitating comprehensive insights into their diverse components. However, a systematic literature review that emphasizes the Environmental, Sustainability, and Governance (ESG) components of DLT remains lacking. To bridge this gap, we selected 107 seed papers to build a citation network of 63,083 references and refined it to a corpus of 24,539 publications for analysis. Then, we labeled the named entities in 46 papers according to twelve top-level categories derived from an established technology taxonomy and enhanced the taxonomy by pinpointing DLT's ESG elements. Leveraging transformer-based language models, we fine-tuned a pre-trained language model for a Named Entity Recognition (NER) task using our labeled dataset. We used our fine-tuned language model to distill the corpus to 505 key papers, facilitating a literature review via named entities and temporal graph analysis on DLT evolution in the context of ESG. Our contributions are a methodology to conduct a machine learning-driven systematic literature review in the DLT field, placing a special emphasis on ESG aspects. Furthermore, we present a first-of-its-kind NER dataset, composed of 54,808 named entities, designed for DLT and ESG-related explorations.
    摘要 分布式笔记技术(DLT)在发展中,需要全面的掌握其多种组成部分。然而,一篇系统性的文献评议,强调环境、可持续发展和管理(ESG)方面的分析,仍然缺失。为了填补这一空白,我们选择了107个种子论文,建立了63,083个参考文献的公共网络,并从中缩放到24,539篇文献进行分析。然后,我们将46篇论文中的名称实体标注为12个顶级类别,根据已有的技术分类标准,并将DLT的ESG元素细化。通过使用基于转换器的自然语言模型,我们精细了一个预训练的语言模型,以进行名称实体识别(NER)任务。我们使用我们精细化的语言模型,对文献库进行筛选,得到505份关键论文,可以进行基于名称实体和时间图分析,对DLT在ESG方面的进化。我们的贡献包括在DLT领域进行机器学习驱动的系统性文献评议方法,以及一个特有的NER数据集,包含54,808个名称实体,适用于DLT和ESG相关的探索。

Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

  • paper_url: http://arxiv.org/abs/2308.12419
  • repo_url: None
  • paper_authors: Bowen Shi
  • for: 本论文旨在研究自然环境中的自动手语处理,使用来自互联网的手语视频。
  • methods: 本论文贡献了新的大规模野外 ASL 数据集,以及一系列新的任务和方法;其中大部分章节关注手语中的指拼(fingerspelling)识别,它是手语的重要组成部分,但此前研究较少。
  • results: 本论文提出了一种基于迭代注意力的端到端方法,可以直接从原始视频识别指拼;并表明使用 Conformer 网络同时建模手形和口型可以接近人类水平的性能。此外,论文还提出了面向实际应用的两个任务:指拼检测和指拼搜索。
    Abstract Sign language, which conveys meaning through gestures, is the chief means of communication among deaf people. Recognizing sign language in natural settings presents significant challenges due to factors such as lighting, background clutter, and variations in signer characteristics. In this thesis, I study automatic sign language processing in the wild, using signing videos collected from the Internet. This thesis contributes new datasets, tasks, and methods. Most chapters of this thesis address tasks related to fingerspelling, an important component of sign language and yet has not been studied widely by prior work. I present three new large-scale ASL datasets in the wild: ChicagoFSWild, ChicagoFSWild+, and OpenASL. Using ChicagoFSWild and ChicagoFSWild+, I address fingerspelling recognition, which consists of transcribing fingerspelling sequences into text. I propose an end-to-end approach based on iterative attention that allows recognition from a raw video without explicit hand detection. I further show that using a Conformer-based network jointly modeling handshape and mouthing can bring performance close to that of humans. Next, I propose two tasks for building real-world fingerspelling-based applications: fingerspelling detection and search. For fingerspelling detection, I introduce a suite of evaluation metrics and a new detection model via multi-task training. To address the problem of searching for fingerspelled keywords in raw sign language videos, we propose a novel method that jointly localizes and matches fingerspelling segments to text. Finally, I will describe a benchmark for large-vocabulary open-domain sign language translation based on OpenASL. To address the challenges of sign language translation in realistic settings, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features.
    摘要 手语通过手势表达意义,是聋人的主要交流方式。在自然环境中识别手语面临光照、背景干扰和打手语者个体差异等诸多挑战。本论文研究野外环境下的自动手语处理,使用从互联网收集的手语视频,贡献了新的数据集、任务和方法。论文的大部分章节关注指拼(fingerspelling),它是手语的重要组成部分,但此前研究较少。我提供了三个野外采集的大规模 ASL 数据集:ChicagoFSWild、ChicagoFSWild+ 和 OpenASL。基于 ChicagoFSWild 和 ChicagoFSWild+,我研究了指拼识别问题,即把指拼序列转写为文本,并提出了一种基于迭代注意力的端到端方法,无需显式手部检测即可从原始视频进行识别;进一步表明,使用同时建模手形和口型的 Conformer 网络可以接近人类水平的性能。接着,我提出了面向实际应用的两个任务:指拼检测与指拼搜索。针对指拼检测,我引入了一套评估指标,并通过多任务训练提出了新的检测模型;针对在原始手语视频中搜索指拼关键词的问题,我们提出了一种联合定位并匹配指拼片段与文本的新方法。最后,我介绍了一个基于 OpenASL 的大词汇开放领域手语翻译基准;为应对真实场景下手语翻译的挑战,我们提出了一组技术,包括把手语搜索作为预训练的前置任务,以及融合口型与手形特征。

Vision Transformer Adapters for Generalizable Multitask Learning

  • paper_url: http://arxiv.org/abs/2308.12372
  • repo_url: https://github.com/IVRL/VTAGML
  • paper_authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
  • for: 这篇论文旨在提出一种基于视Transformer的多任务适应器,可以将任务相似性学习到新任务和领域中,无需重新训练或微调。
  • methods: 该方法基于视Transformer的底层结构,并将多个稠密视图任务集成到一起,通过一种任务相似性学习机制来学习通用任务相似性。
  • results: 作者表明,对于多个稠密视图任务,该方法可以在参数效率的情况下同时解决多个任务,并且在零shot任务转移、无监督领域适应和不需要微调到新领域中达到更高的性能。
    Abstract We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at \url{https://ivrl.github.io/VTAGML}.
    摘要 我们提出了首个多任务视觉 transformer 适配器,它能学习可泛化的任务亲和性,并可应用于新任务和新领域。将适配器集成到现成的视觉 transformer 主干中,即可以参数高效的方式同时求解多个密集视觉任务,这与参数开销高昂的现有多任务 transformer 不同。与同期方法相比,我们在加入新任务或新领域时无需重新训练或微调。我们在适配器框架中引入了任务自适应注意力机制,将基于梯度的任务相似度与基于注意力的任务相似度相结合。学习到的任务亲和性可以推广到以下设置:零样本任务迁移、无监督领域自适应,以及无需微调即可泛化到新领域。实验表明,我们的方法不仅优于现有的基于卷积神经网络的多任务方法,也优于基于视觉 transformer 的多任务方法。项目页面: \url{https://ivrl.github.io/VTAGML}。

Prompt2Model: Generating Deployable Models from Natural Language Instructions

  • paper_url: http://arxiv.org/abs/2308.12261
  • repo_url: https://github.com/neulab/prompt2model
  • paper_authors: Vijay Viswanathan, Chenyang Zhao, Amanda Bertsch, Tongshuang Wu, Graham Neubig
  • for: 这 paper 是为了探讨如何使用 prompt 来训练特殊用途 NLP 模型,以提高模型的性能和可用性。
  • methods: 这 paper 使用了一种多步骤的方法,包括数据 retrieve 和 pretrained 模型,数据生成 using LLMs,以及supervised fine-tuning 这些获取和生成的数据。
  • results: 在三个任务上,这 paper 展示了Prompt2Model 可以使用同样的几个示例prompt来训练模型,并取得比 gpt-3.5-turbo 强的平均提升20%,而且模型的大小可以减少到700倍。此外,这 paper 还表明这些数据可以用于获取可靠的模型性能估计,帮助模型开发者在部署前评估模型可靠性。
    Abstract Large language models (LLMs) enable system builders today to create competent NLP systems through prompting, where they only need to describe the task in natural language and provide a few examples. However, in other ways, LLMs are a step backward from traditional special-purpose NLP models; they require extensive computational resources for deployment and can be gated behind APIs. In this paper, we propose Prompt2Model, a general-purpose method that takes a natural language task description like the prompts provided to LLMs, and uses it to train a special-purpose model that is conducive to deployment. This is done through a multi-step process of retrieval of existing datasets and pretrained models, dataset generation using LLMs, and supervised fine-tuning on these retrieved and generated datasets. Over three tasks, we demonstrate that given the same few-shot prompt as input, Prompt2Model trains models that outperform the results of a strong LLM, gpt-3.5-turbo, by an average of 20% while being up to 700 times smaller. We also show that this data can be used to obtain reliable performance estimates of model performance, enabling model developers to assess model reliability before deployment. Prompt2Model is available open-source at https://github.com/neulab/prompt2model.
    摘要 大型语言模型(LLM)使系统构建者只需用自然语言描述任务并给出少量示例,即可通过提示构建出能力不错的 NLP 系统;但 LLM 部署需要大量计算资源,且常被 API 所限制。为此,我们提出通用方法 Prompt2Model:它接受与提示 LLM 时相同的自然语言任务描述,并训练出便于部署的专用模型。Prompt2Model 通过多个步骤实现这一目标:首先,检索现有的数据集和预训练模型;其次,使用 LLM 生成新的数据集;最后,在这些检索和生成的数据上进行有监督微调。在三个任务上我们证明,给定相同的少样本提示,Prompt2Model 训练出的模型平均比强大的 LLM(gpt-3.5-turbo)高出 20%,而模型体积最多可小 700 倍。此外,这些数据还可用于获得可靠的模型性能估计,帮助开发者在部署前评估模型可靠性。Prompt2Model 已在 https://github.com/neulab/prompt2model 开源。

  • paper_url: http://arxiv.org/abs/2308.12247
  • repo_url: None
  • paper_authors: Timothy Chu, Zhao Song, Chiwun Yang
  • for: 本研究旨在防止大语言模型(LLMs)生成版权数据。
  • methods: 本研究使用了一种新的方法,即视为softmax回归问题进行大语言模型训练和优化。
  • results: 研究表明,通过这种方法可以有效避免生成版权数据,从而实现了训练大语言模型的理论基础。
    Abstract Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. In this paper, we show that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyright data. This establishes a theoretical method of training large language models in a way that avoids generating copyright data.
    摘要 大型语言模型(LLM)和生成式 AI 在计算机研究与应用中发挥了变革性作用。然而,围绕这些模型是否会输出受版权保护的数据产生了争议:如果训练数据本身受版权保护,就可能出现这种情况。LLM 建立在 transformer 神经网络架构之上,而该架构依赖于使用 softmax 函数的注意力(Attention)计算。在本文中,我们证明大型语言模型的训练和优化可以被看作一个 softmax 回归问题,并进而给出了一种高效求解 softmax 回归、同时防止回归函数生成版权数据的方法。这为以避免生成版权数据的方式训练大型语言模型提供了理论基础。

cs.LG - 2023-08-24

Easy attention: A simple self-attention mechanism for Transformers

  • paper_url: http://arxiv.org/abs/2308.12874
  • repo_url: None
  • paper_authors: Marcial Sanchis-Agudo, Yuning Wang, Karthik Duraisamy, Ricardo Vinuesa
  • for: 预测混沌系统的时间动态特性
  • methods: 提出一种名为“易注意”(easy attention)的注意力机制,通过对 softmax 注意力分数进行奇异值分解来更好地捕捉长期相关性
  • results: 与自注意和LSTM网络相比,该方法具有更高的稳定性和较低的复杂性,并且在重建和预测混沌系统的时间动态特性方面获得了优秀的结果
    Abstract To improve the robustness of transformer neural networks used for temporal-dynamics prediction of chaotic systems, we propose a novel attention mechanism called easy attention. Due to the fact that self attention only makes usage of the inner product of queries and keys, it is demonstrated that the keys, queries and softmax are not necessary for obtaining the attention score required to capture long-term dependencies in temporal sequences. Through implementing singular-value decomposition (SVD) on the softmax attention score, we further observe that the self attention compresses contribution from both queries and keys in the spanned space of the attention score. Therefore, our proposed easy-attention method directly treats the attention scores as learnable parameters. This approach produces excellent results when reconstructing and predicting the temporal dynamics of chaotic systems exhibiting more robustness and less complexity than the self attention or the widely-used long short-term memory (LSTM) network. Our results show great potential for applications in more complex high-dimensional dynamical systems.
    摘要 为提高用于混沌系统时间动力学预测的 transformer 神经网络的稳健性,我们提出了一种名为“易注意”(easy attention)的新注意力机制。由于自注意力只利用 query 与 key 的内积,我们证明获取捕捉时间序列长期依赖所需的注意力分数并不需要 key、query 和 softmax。通过对 softmax 注意力分数进行奇异值分解(SVD),我们进一步观察到自注意力把 query 和 key 的贡献压缩到注意力分数张成的空间中。因此,我们提出的易注意方法直接把注意力分数作为可学习参数。该方法在重建和预测混沌系统时间动力学时表现出色,比自注意力或广泛使用的长短期记忆(LSTM)网络更加稳健且更简单。我们的结果表明该方法在更复杂的高维动力系统中具有很大的应用潜力。
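As a concrete illustration of the idea summarized above, treating the attention scores themselves as learnable parameters instead of computing softmax(QK^T), the following is a minimal PyTorch sketch. It assumes a fixed sequence length and keeps only a value projection; the class name, layer sizes, and initialization are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class EasyAttention(nn.Module):
    """Attention layer whose score matrix is a learnable parameter instead of
    softmax(Q K^T / sqrt(d)) -- a rough reading of the easy-attention idea
    for sequences of a fixed length."""

    def __init__(self, seq_len: int, d_model: int):
        super().__init__()
        # Learnable attention scores, one weight per (query position, key position) pair.
        self.scores = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        # A value projection is kept; query and key projections are not needed.
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        v = self.value(x)
        mixed = torch.einsum("ts,bsd->btd", self.scores, v)  # mix positions with learned scores
        return self.out(mixed)

if __name__ == "__main__":
    layer = EasyAttention(seq_len=64, d_model=32)
    y = layer(torch.randn(8, 64, 32))
    print(y.shape)  # torch.Size([8, 64, 32])
```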

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

  • paper_url: http://arxiv.org/abs/2308.12871
  • repo_url: None
  • paper_authors: Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi
  • for: 提高深度学习推理系统的快速、准确和Cost-effective推理
  • methods: 使用Integer Programming动态配置批处理大小、复制和模型变体,以优化准确率、降低成本并满足用户定义的响应时间SLAs
  • results: 在Kubernetes实现上,对五个实际推理管道进行了广泛的实验,发现IPA可以提高normalized准确率达35%,而成本增加仅为5%以下。
    Abstract Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of accuracy and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling accuracy and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
    摘要 在机器学习生产系统中,由于端到端延迟要求严格,如何高效地优化多模型推理管道,以实现快速、准确且低成本的推理,是一个关键挑战。为简化推理管道中精度与成本之间庞大而复杂的权衡空间,服务提供方往往只考虑其中一个目标,而难点恰恰在于同时协调精度与成本。为应对这一挑战并高效管理推理管道中的模型变体,我们提出了 IPA,一个在线深度学习推理管道自适应系统,它为每个深度学习任务高效地利用模型变体。模型变体是同一任务的预训练模型的不同版本,在资源需求、延迟和精度上各不相同。IPA 使用整数规划动态配置批大小、副本数和模型变体,在满足用户定义的延迟 SLA 的同时优化精度并最小化成本;它支持多目标设置,可在精度与成本之间实现不同的权衡,并能适应变化的工作负载和动态流量模式。在 Kubernetes 实现上对五个真实推理管道进行的大量实验表明,IPA 在成本增加不到 5% 的情况下,将归一化精度提升最多 35%。
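The abstract describes choosing batch size, replication, and model variant with an integer program under a latency SLA. The sketch below illustrates the same trade-off with a brute-force search over a toy configuration space; the variant catalogue, latency model, and scoring function are all made up for illustration and are not IPA's actual formulation.

```python
from itertools import product

# Toy catalogue of model variants: accuracy, per-item latency (ms), cost per replica.
# All numbers are fabricated for illustration.
VARIANTS = {
    "small":  {"acc": 0.80, "lat_ms": 20.0, "cost": 1.0},
    "medium": {"acc": 0.86, "lat_ms": 45.0, "cost": 2.5},
    "large":  {"acc": 0.91, "lat_ms": 90.0, "cost": 6.0},
}
BATCH_SIZES = [1, 4, 8]
REPLICAS = [1, 2, 4]

def best_config(sla_ms: float, demand_qps: float, acc_weight: float = 10.0):
    """Exhaustively search (variant, batch size, replicas); a stand-in for the
    integer program that trades accuracy against cost under a latency SLA."""
    best, best_score = None, float("-inf")
    for name, bs, rep in product(VARIANTS, BATCH_SIZES, REPLICAS):
        v = VARIANTS[name]
        # Crude latency model: waiting to fill a batch plus batched execution time.
        latency = bs / demand_qps * 1000.0 + v["lat_ms"] * (0.5 + 0.5 * bs)
        throughput = rep * bs / (v["lat_ms"] * (0.5 + 0.5 * bs) / 1000.0)
        if latency > sla_ms or throughput < demand_qps:
            continue  # infeasible: violates the SLA or cannot keep up with demand
        score = acc_weight * v["acc"] - v["cost"] * rep
        if score > best_score:
            best, best_score = (name, bs, rep), score
    return best

print(best_config(sla_ms=200.0, demand_qps=50.0))
```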

Auto-weighted Bayesian Physics-Informed Neural Networks and robust estimations for multitask inverse problems in pore-scale imaging of dissolution

  • paper_url: http://arxiv.org/abs/2308.12864
  • repo_url: None
  • paper_authors: Sarah Perez, Philippe Poncet
  • for: 提出一种孔隙尺度成像中的新数据同化策略,用于稳健求解包含不确定性量化(UQ)的反应性反问题。
  • methods: 将反应性反问题表述为多任务形式,结合数据驱动与物理约束技术(贝叶斯物理信息神经网络),对方解石溶解过程中的孔隙度形态不确定性和反应参数范围进行估计。
  • results: 该方法能给出可靠的不确定性量化和反应参数范围估计,并在基于合成 microCT 图像的 1D+时间 和 2D+时间 方解石溶解问题上成功完成贝叶斯推断。
    Abstract In this article, we present a novel data assimilation strategy in pore-scale imaging and demonstrate that this makes it possible to robustly address reactive inverse problems incorporating Uncertainty Quantification (UQ). Pore-scale modeling of reactive flow offers a valuable opportunity to investigate the evolution of macro-scale properties subject to dynamic processes. Yet, they suffer from imaging limitations arising from the associated X-ray microtomography (X-ray microCT) process, which induces discrepancies in the properties estimates. Assessment of the kinetic parameters also raises challenges, as reactive coefficients are critical parameters that can cover a wide range of values. We account for these two issues and ensure reliable calibration of pore-scale modeling, based on dynamical microCT images, by integrating uncertainty quantification in the workflow. The present method is based on a multitasking formulation of reactive inverse problems combining data-driven and physics-informed techniques in calcite dissolution. This allows quantifying morphological uncertainties on the porosity field and estimating reactive parameter ranges through prescribed PDE models with a latent concentration field and dynamical microCT. The data assimilation strategy relies on sequential reinforcement incorporating successively additional PDE constraints. We guarantee robust and unbiased uncertainty quantification by straightforward adaptive weighting of Bayesian Physics-Informed Neural Networks (BPINNs), ensuring reliable micro-porosity changes during geochemical transformations. We demonstrate successful Bayesian Inference in 1D+Time and 2D+Time calcite dissolution based on synthetic microCT images with meaningful posterior distribution on the reactive parameters and dimensionless numbers.
    摘要 本文提出了一种孔隙尺度成像中的新数据同化策略,并证明它能够稳健地求解包含不确定性量化(UQ)的反应性反问题。孔隙尺度的反应流建模为研究宏观性质在动态过程中的演化提供了宝贵机会,但受限于 X 射线微计算机断层成像(X-ray microCT)过程带来的成像局限,性质估计会出现偏差;反应系数等动力学参数取值范围很宽,其评估同样具有挑战性。我们同时考虑这两个问题,通过在工作流程中引入不确定性量化,基于动态 microCT 图像保证孔隙尺度建模的可靠标定。该方法基于反应性反问题的多任务形式,在方解石溶解问题中结合数据驱动与物理约束技术,借助给定的 PDE 模型、潜在浓度场和动态 microCT,对孔隙度场的形态不确定性进行量化并估计反应参数范围。数据同化策略依赖逐步强化,依次加入额外的 PDE 约束;通过对贝叶斯物理信息神经网络(BPINN)进行简单的自适应加权,我们保证了稳健且无偏的不确定性量化,从而可靠地刻画地球化学转化过程中的微观孔隙度变化。我们在基于合成 microCT 图像的 1D+时间 和 2D+时间 方解石溶解问题上成功完成了贝叶斯推断,得到了反应参数和无量纲数的有意义后验分布。

Towards Automated Animal Density Estimation with Acoustic Spatial Capture-Recapture

  • paper_url: http://arxiv.org/abs/2308.12859
  • repo_url: None
  • paper_authors: Yuheng Wang, Juan Ye, David L. Borchers
  • for: 监测声学上活跃但难以目视调查的野生动物种群。
  • methods: 使用机器学习方法识别目标物种的鸣声,并将每个鸣声的个体级置信度纳入声学空间捕获-再捕获推断的似然中。
  • results: 与忽略假阳性的方法相比,能更准确地估计野生动物种群,并考虑了鸣声识别的不确定性。
    Abstract Passive acoustic monitoring can be an effective way of monitoring wildlife populations that are acoustically active but difficult to survey visually. Digital recorders allow surveyors to gather large volumes of data at low cost, but identifying target species vocalisations in these data is non-trivial. Machine learning (ML) methods are often used to do the identification. They can process large volumes of data quickly, but they do not detect all vocalisations and they do generate some false positives (vocalisations that are not from the target species). Existing wildlife abundance survey methods have been designed specifically to deal with the first of these mistakes, but current methods of dealing with false positives are not well-developed. They do not take account of features of individual vocalisations, some of which are more likely to be false positives than others. We propose three methods for acoustic spatial capture-recapture inference that integrate individual-level measures of confidence from ML vocalisation identification into the likelihood and hence integrate ML uncertainty into inference. The methods include a mixture model in which species identity is a latent variable. We test the methods by simulation and find that in a scenario based on acoustic data from Hainan gibbons, in which ignoring false positives results in 17% positive bias, our methods give negligible bias and coverage probabilities that are close to the nominal 95% level.
    摘要 被动声学监测是监测那些声学上活跃但难以目视观察的野生动物种群的有效手段。数字录音设备使调查者能够以低成本收集大量数据,但在这些数据中识别目标物种的鸣声并非易事。机器学习(ML)方法常被用于识别:它们可以快速处理海量数据,但无法检测到所有鸣声,而且会产生一些假阳性(并非来自目标物种的鸣声)。现有的野生动物丰度调查方法专门针对第一类错误而设计,而处理假阳性的方法尚不完善,它们没有考虑单个鸣声的特征,而某些鸣声比其他鸣声更可能是假阳性。我们提出了三种声学空间捕获-再捕获推断方法,将 ML 鸣声识别给出的个体级置信度整合进似然,从而把 ML 的不确定性纳入推断;其中包括一种把物种身份作为潜变量的混合模型。我们通过模拟检验了这些方法:在基于海南长臂猿声学数据的场景中,忽略假阳性会带来 17% 的正向偏差,而我们的方法几乎没有偏差,覆盖概率也接近名义的 95% 水平。

Fast Adversarial Training with Smooth Convergence

  • paper_url: http://arxiv.org/abs/2308.12857
  • repo_url: https://github.com/fat-cs/convergesmooth
  • paper_authors: Mengnan Zhao, Lihe Zhang, Yuqiu Kong, Baocai Yin
  • for: 提高神经网络的对抗鲁棒性
  • methods: 提出了一种新的振荡约束(ConvergeSmooth),以保证训练过程的稳定和平滑,从而避免极端过拟合问题。
  • results: 经过EXTENSIVE EXPERIMENTS ON POPULAR DATASETS的测试,提出的方法能够高效地避免极端过拟合问题,并且超过所有之前的FAT方法的性能。
    Abstract Fast adversarial training (FAT) is beneficial for improving the adversarial robustness of neural networks. However, previous FAT work has encountered a significant issue known as catastrophic overfitting when dealing with large perturbation budgets, \ie the adversarial robustness of models declines to near zero during training. To address this, we analyze the training process of prior FAT work and observe that catastrophic overfitting is accompanied by the appearance of loss convergence outliers. Therefore, we argue a moderately smooth loss convergence process will be a stable FAT process that solves catastrophic overfitting. To obtain a smooth loss convergence process, we propose a novel oscillatory constraint (dubbed ConvergeSmooth) to limit the loss difference between adjacent epochs. The convergence stride of ConvergeSmooth is introduced to balance convergence and smoothing. Likewise, we design weight centralization without introducing additional hyperparameters other than the loss balance coefficient. Our proposed methods are attack-agnostic and thus can improve the training stability of various FAT techniques. Extensive experiments on popular datasets show that the proposed methods efficiently avoid catastrophic overfitting and outperform all previous FAT methods. Code is available at \url{https://github.com/FAT-CS/ConvergeSmooth}.
    摘要 快速对抗训练(FAT)有助于提高神经网络的对抗鲁棒性。然而,在扰动预算较大时,先前的 FAT 工作会遇到一个严重问题,即灾难性过拟合:模型的对抗鲁棒性在训练过程中跌落到接近零。为解决该问题,我们分析了先前 FAT 工作的训练过程,观察到灾难性过拟合伴随着损失收敛的离群点出现。因此,我们认为一个适度平滑的损失收敛过程才是能够避免灾难性过拟合的稳定 FAT 过程。为获得平滑的损失收敛过程,我们提出了一种新的振荡约束(称为 ConvergeSmooth),用于限制相邻训练轮次之间的损失差,并引入收敛步幅来平衡收敛与平滑;同样地,我们设计了权重中心化,除损失平衡系数外不引入额外超参数。所提出的方法与攻击方式无关,因此能够改善多种 FAT 技术的训练稳定性。在常用数据集上的大量实验表明,我们的方法能有效避免灾难性过拟合,并超越此前所有 FAT 方法。代码见 \url{https://github.com/FAT-CS/ConvergeSmooth}。
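A rough reading of the "limit the loss difference between adjacent epochs" idea is a penalty that activates when the loss jumps by more than a convergence stride. The hypothetical helper below shows where such a term could enter a training loop; the stride, weight, and loss values are illustrative, not the paper's.

```python
def converge_smooth_loss(adv_loss, prev_epoch_loss, stride=0.1, weight=1.0):
    """Penalize adversarial-training loss changes that exceed a 'convergence
    stride' relative to the previous epoch, a sketch of constraining the loss
    difference between adjacent epochs. `stride` and `weight` are illustrative."""
    if prev_epoch_loss is None:          # first epoch: nothing to compare against
        return adv_loss
    excess = abs(adv_loss - prev_epoch_loss) - stride
    penalty = weight * max(excess, 0.0)  # only penalize jumps larger than the stride
    return adv_loss + penalty

# Toy training loop showing where the constraint would enter (fabricated loss curve).
prev = None
for epoch, raw_loss in enumerate([2.0, 1.2, 0.3, 1.9, 0.4]):
    total = converge_smooth_loss(raw_loss, prev, stride=0.3, weight=2.0)
    print(f"epoch {epoch}: raw={raw_loss:.2f} regularized={total:.2f}")
    prev = raw_loss
```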

Probabilistic load forecasting with Reservoir Computing

  • paper_url: http://arxiv.org/abs/2308.12844
  • repo_url: https://github.com/MicheleUIT/Probabilistic-load-forecasting-with-Reservoir-Computing
  • paper_authors: Michele Guerra, Simone Scardapane, Filippo Maria Bianchi
  • for: 预测电力负荷的准确性和不确定性评估
  • methods: 使用储备池计算(reservoir computing)方法进行时间序列预测,并评估各种不确定性量化方法与该设置的兼容性和效果
  • results: 对多种不确定性评估方法进行比较,并基于精心选择的性能指标进行评估,以确定最佳的不确定性评估方法。
    Abstract Some applications of deep learning require not only to provide accurate results but also to quantify the amount of confidence in their prediction. The management of an electric power grid is one of these cases: to avoid risky scenarios, decision-makers need both precise and reliable forecasts of, for example, power loads. For this reason, point forecasts are not enough hence it is necessary to adopt methods that provide an uncertainty quantification. This work focuses on reservoir computing as the core time series forecasting method, due to its computational efficiency and effectiveness in predicting time series. While the RC literature mostly focused on point forecasting, this work explores the compatibility of some popular uncertainty quantification methods with the reservoir setting. Both Bayesian and deterministic approaches to uncertainty assessment are evaluated and compared in terms of their prediction accuracy, computational resource efficiency and reliability of the estimated uncertainty, based on a set of carefully chosen performance metrics.
    摘要 某些深度学习应用需要不仅提供准确的结果,还需要衡量预测结果的信度。电力系统管理是这种情况之一:为了避免危险场景,决策者需要准确且可靠的电力负荷预测。因此,点预测不足,需要采用能够衡量不确定性的方法。这项工作选择了储池计算作为核心时间序列预测方法,因为它在计算效率和时间序列预测效果方面具有优势。而 RC 文献中大多数研究都集中在点预测方面,这项工作则探讨了储池设置下的不确定性评估方法的可行性。本研究对不确定性评估方法进行了bayesian和束定的两种方法的评估和比较,基于一组精心选择的性能指标进行评估。
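For readers unfamiliar with reservoir computing, the following is a minimal echo state network for one-step-ahead forecasting of a periodic toy series. It shows only the deterministic point-forecast core; the uncertainty-quantification methods compared in the paper are not reproduced here, and all sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class ESN:
    """Minimal echo state network: fixed random reservoir, ridge-regression readout."""
    def __init__(self, n_in=1, n_res=200, rho=0.9, leak=0.3, ridge=1e-6):
        self.win = rng.uniform(-0.5, 0.5, (n_res, n_in + 1))      # input + bias weights
        w = rng.uniform(-0.5, 0.5, (n_res, n_res))
        w *= rho / max(abs(np.linalg.eigvals(w)))                  # rescale spectral radius
        self.w, self.leak, self.ridge = w, leak, ridge

    def _states(self, u):
        x, states = np.zeros(self.w.shape[0]), []
        for ut in u:
            pre = self.win @ np.append(ut, 1.0) + self.w @ x
            x = (1 - self.leak) * x + self.leak * np.tanh(pre)     # leaky reservoir update
            states.append(x.copy())
        return np.array(states)

    def fit(self, u, y):
        s = self._states(u)
        a = s.T @ s + self.ridge * np.eye(s.shape[1])
        self.wout = np.linalg.solve(a, s.T @ y)                    # ridge readout

    def predict(self, u):
        return self._states(u) @ self.wout

# Toy "load" series: daily-like periodic signal plus noise, predicted one step ahead.
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
u, y = series[:-1, None], series[1:]
model = ESN()
model.fit(u[:1500], y[:1500])
pred = model.predict(u[1500:])
print("test MSE:", float(np.mean((pred - y[1500:]) ** 2)))
```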

Actuator Trajectory Planning for UAVs with Overhead Manipulator using Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12843
  • repo_url: None
  • paper_authors: Hazim Alzorgan, Abolfazl Razi, Ata Jahangir Moshayedi
  • for: 研究一种空中机械臂系统,即装备了两自由度可控机械臂的无人机(UAV),用于在飞行中执行作业任务。
  • methods: 使用 Q-learning 方法控制机械臂末端执行器的轨迹,并开发了基于碰撞时间(Time To Collision, TTC)的运动规划模型,使四旋翼无人机在绕开障碍物的同时保证机械臂的可达性。
  • results: 该系统可在难以到达且高风险的环境中完成多种作业任务,如高空焊接、结构监测与维修、电池更换、排水槽清理、摩天大楼清洁和电力线路维护等,并保持与飞行控制固件的兼容性。基于强化学习的控制机制得到了能够应对无人机运动不确定性的稳健控制策略:使用 Q-learning 训练 15,000 个回合后,平均位移误差(即目标轨迹点与实际轨迹点之间的平均距离)对应的准确率达到 92%。
    Abstract In this paper, we investigate the operation of an aerial manipulator system, namely an Unmanned Aerial Vehicle (UAV) equipped with a controllable arm with two degrees of freedom to carry out actuation tasks on the fly. Our solution is based on employing a Q-learning method to control the trajectory of the tip of the arm, also called \textit{end-effector}. More specifically, we develop a motion planning model based on Time To Collision (TTC), which enables a quadrotor UAV to navigate around obstacles while ensuring the manipulator's reachability. Additionally, we utilize a model-based Q-learning model to independently track and control the desired trajectory of the manipulator's end-effector, given an arbitrary baseline trajectory for the UAV platform. Such a combination enables a variety of actuation tasks such as high-altitude welding, structural monitoring and repair, battery replacement, gutter cleaning, sky scrapper cleaning, and power line maintenance in hard-to-reach and risky environments while retaining compatibility with flight control firmware. Our RL-based control mechanism results in a robust control strategy that can handle uncertainties in the motion of the UAV, offering promising performance. Specifically, our method achieves 92\% accuracy in terms of average displacement error (i.e. the mean distance between the target and obtained trajectory points) using Q-learning with 15,000 episodes
    摘要 在这篇论文中,我们研究了一种无人飞行器系统,即一架有控制可能的无人飞行器(UAV),其中包括两个自由度的臂部,用于在飞行过程中完成 actuation 任务。我们的解决方案基于 employing Q-learning 方法控制臂部的 trajectory,具体来说,我们开发了一种基于 Time To Collision(TTC)的运动规划模型,使得一架 quadrotor UAV 在避免障碍物时能够自由 navigation。此外,我们利用一种基于模型的 Q-learning 模型来独立地跟踪和控制 manipulator 的 end-effector 的愿望轨迹, givien an arbitrary baseline trajectory for the UAV platform。这种组合使得可以完成多种 actuation 任务,如高空抛射、结构监测和维护、电池更换、排水管理、高层建筑物清洁和维护、电缆维护等,而且保持与飞行控制软件的兼容性。我们的 RL-based 控制机制得到了一个可靠的控制策略,可以处理无人飞行器的运动不确定性,提供了有前途的性能。具体来说,我们的方法实现了 92% 的准确率(即target和实际轨迹点之间的平均距离)使用 Q-learning WITH 15,000 集。
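The entry above relies on Q-learning to track the end-effector trajectory. The sketch below shows only the generic tabular Q-learning update on a toy discretized state space standing in for tracking error; the dynamics, reward, and hyperparameters are fabricated for illustration and do not reflect the paper's model-based setup.

```python
import numpy as np

rng = np.random.default_rng(1)

n_states, n_actions = 50, 4          # e.g. discretized end-effector error vs. joint commands
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1   # illustrative hyperparameters

def step(state, action):
    """Toy stand-in for the arm/UAV dynamics: a real system would return the next
    discretized tracking-error state and a reward for approaching the target."""
    nxt = min(n_states - 1, max(0, state + (1 if action == 0 else -1) + rng.integers(-1, 2)))
    reward = -abs(nxt)               # states closer to 0 (zero tracking error) are better
    return nxt, reward

for episode in range(200):
    s = rng.integers(n_states)
    for _ in range(100):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Standard Q-learning temporal-difference update.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print("greedy action in state 10:", int(np.argmax(Q[10])))
```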

Short Run Transit Route Planning Decision Support System Using a Deep Learning-Based Weighted Graph

  • paper_url: http://arxiv.org/abs/2308.12828
  • repo_url: None
  • paper_authors: Nadav Shalit, Michael Fire, Dima Kagan, Eran Ben-Elia
  • for: 提高公共交通服务的效率和可靠性
  • methods: 使用深度学习方法,利用多种数据源,自动调整路线段,快速实现路线改进
  • results: 在 Tel Aviv 测试,可以降低路线时间超过 9%,包括城市内和郊区路线,表明模型的 universality 和可靠性。
    Abstract Public transport routing plays a crucial role in transit network design, ensuring a satisfactory level of service for passengers. However, current routing solutions rely on traditional operational research heuristics, which can be time-consuming to implement and lack the ability to provide quick solutions. Here, we propose a novel deep learning-based methodology for a decision support system that enables public transport (PT) planners to identify short-term route improvements rapidly. By seamlessly adjusting specific sections of routes between two stops during specific times of the day, our method effectively reduces times and enhances PT services. Leveraging diverse data sources such as GTFS and smart card data, we extract features and model the transportation network as a directed graph. Using self-supervision, we train a deep learning model for predicting lateness values for road segments. These lateness values are then utilized as edge weights in the transportation graph, enabling efficient path searching. Through evaluating the method on Tel Aviv, we are able to reduce times on more than 9\% of the routes. The improved routes included both intraurban and suburban routes showcasing a fact highlighting the model's versatility. The findings emphasize the potential of our data-driven decision support system to enhance public transport and city logistics, promoting greater efficiency and reliability in PT services.
    摘要 公共交通线路规划在公交网络设计中起着关键作用,保障乘客获得令人满意的服务水平。然而,现有的线路规划方案依赖传统的运筹学启发式方法,实施耗时,难以快速给出解决方案。为此,我们提出了一种新的基于深度学习的决策支持系统,帮助公共交通(PT)规划者快速识别短期线路改进方案:通过在一天中的特定时段对两个站点之间的特定线路区段进行无缝调整,有效缩短行程时间并提升公交服务。我们利用 GTFS 和智能卡数据等多种数据源提取特征,并将交通网络建模为有向图;通过自监督方式训练深度学习模型来预测路段的延误值,再把这些延误值用作交通图的边权,从而实现高效的路径搜索。在特拉维夫的评估中,我们在超过 9% 的线路上缩短了行程时间;改进的线路既包括市区线路也包括郊区线路,体现了模型的通用性。这些发现凸显了我们的数据驱动决策支持系统在提升公共交通与城市物流方面的潜力,有助于提高公交服务的效率与可靠性。
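Once a model predicts a lateness value for every road segment, route search reduces to a shortest-path problem over a graph whose edge weights are those predictions. A self-contained Dijkstra sketch over a toy stop graph is shown below; the weights are hard-coded stand-ins for model outputs.

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path on a directed graph whose edge weights are predicted
    lateness values (seconds), in the spirit of using a learned model's
    per-segment predictions as edge weights for route search."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[target]

# Toy stop graph; in practice the weights would come from the lateness model.
graph = {
    "A": [("B", 30.0), ("C", 90.0)],
    "B": [("C", 20.0), ("D", 120.0)],
    "C": [("D", 40.0)],
}
print(dijkstra(graph, "A", "D"))   # (['A', 'B', 'C', 'D'], 90.0)
```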

Prediction without Preclusion: Recourse Verification with Reachable Sets

  • paper_url: http://arxiv.org/abs/2308.12820
  • repo_url: None
  • paper_authors: Avni Kothari, Bogdan Kulynych, Tsui-Wei Weng, Berk Ustun
  • for: 这个论文旨在提出一种正式的测试过程,用于检查机器学习模型是否能够提供可 revertible 的预测结果,以确保用户可以根据自己的需求来修改模型的决策。
  • methods: 该论文使用了一种新的测试方法,可以判断模型是否能够提供回退性的预测结果,并且可以根据用户提供的行动可能性来确定模型的可 revertibility。
  • results: 研究人员使用了这种测试方法,对实际的借款数据集进行了研究,发现一些模型可能会在预测结果中固化用户的状态,从而导致用户无法回退。这些结果表明了机器学习模型在决策中的缺陷,并且提供了一种新的方法来解决这个问题。
    Abstract Machine learning models are often used to decide who will receive a loan, a job interview, or a public benefit. Standard techniques to build these models use features about people but overlook their actionability. In turn, models can assign predictions that are fixed, meaning that consumers who are denied loans, interviews, or benefits may be permanently locked out from access to credit, employment, or assistance. In this work, we introduce a formal testing procedure to flag models that assign fixed predictions that we call recourse verification. We develop machinery to reliably determine if a given model can provide recourse to its decision subjects from a set of user-specified actionability constraints. We demonstrate how our tools can ensure recourse and adversarial robustness in real-world datasets and use them to study the infeasibility of recourse in real-world lending datasets. Our results highlight how models can inadvertently assign fixed predictions that permanently bar access, and we provide tools to design algorithms that account for actionability when developing models.
    摘要 机器学习模型常被用来决定谁能获得贷款、面试机会或公共福利。构建这些模型的标准技术使用与个人相关的特征,却忽略了这些特征是否可被当事人改变。结果是,模型可能给出无法更改的固定预测,使被拒绝的消费者永久失去获得信贷、就业或援助的机会。在这项工作中,我们提出了一种正式的测试流程,即追索验证(recourse verification),用于标记给出固定预测的模型;并开发了相应的工具,能够在用户指定的可行动性约束下可靠地判断一个模型是否能为其决策对象提供追索途径。我们展示了这些工具如何在真实数据集中保证追索与对抗鲁棒性,并用它们研究了真实借贷数据集中追索不可行的情况。结果表明,模型可能在无意间给出永久排除准入的固定预测;我们也为在建模时考虑可行动性提供了相应的工具。
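One way to picture recourse verification is to enumerate the set of feature vectors reachable under user-specified actionability constraints and check whether any of them flips the model's decision. The toy sketch below does exactly that with a made-up linear scorer and made-up constraints; it is a simplified stand-in for the paper's reachable-set machinery, not its actual formulation.

```python
from itertools import product

def predict(x):
    """Stand-in scoring model: a fixed linear classifier over
    (income, open accounts, prior default). Weights are made up."""
    w, b = [0.004, 0.3, -2.0], -2.5
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def has_recourse(x, actionable_deltas):
    """Enumerate the reachable set defined by user-specified actionability
    constraints and check whether any reachable point flips the prediction."""
    if predict(x) == 1:
        return True                      # already approved, nothing to verify
    for deltas in product(*actionable_deltas):
        x_new = [xi + d for xi, d in zip(x, deltas)]
        if predict(x_new) == 1:
            return True                  # some feasible action changes the outcome
    return False                         # prediction is fixed: no recourse

applicant = [400, 1, 1]                  # income (in $10s), open accounts, prior default
reachable = [
    [0, 50, 100, 200],                   # income can only increase, by bounded amounts
    [0, 1, 2],                           # can open up to two new accounts
    [0],                                 # prior default is immutable
]
print("recourse exists:", has_recourse(applicant, reachable))  # False: prediction is fixed
```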

Job Shop Scheduling Benchmark: Environments and Instances for Learning and Non-learning Methods

  • paper_url: http://arxiv.org/abs/2308.12794
  • repo_url: https://github.com/ai-for-decision-making-tue/job_shop_scheduling_benchmark
  • paper_authors: Robbert Reijnen, Kjell van Straaten, Zaharah Bukhsh, Yingqian Zhang
  • for: 本研究的目的是提供一个中心化的Hub,用于研究者、实践者和爱好者们在机器计划问题上进行探索和解决。
  • methods: 本研究使用了多种机器计划问题的benchmark,包括Job Shop Scheduling (JSP)、Flow Shop Scheduling (FSP)、Flexible Job Shop Scheduling (FJSP)、FJSP with Assembly constraints (FAJSP)、FJSP with Sequence-Dependent Setup Times (FJSP-SDST)和在线FJSP。
  • results: 本研究提供了一个开源的GitHub存储库,包含了各种机器计划问题的完整的benchmark,以便研究者和实践者可以使用这些benchmark进行研究和解决。
    Abstract We introduce an open-source GitHub repository containing comprehensive benchmarks for a wide range of machine scheduling problems, including Job Shop Scheduling (JSP), Flow Shop Scheduling (FSP), Flexible Job Shop Scheduling (FJSP), FJSP with Assembly constraints (FAJSP), FJSP with Sequence-Dependent Setup Times (FJSP-SDST), and the online FJSP (with online job arrivals). Our primary goal is to provide a centralized hub for researchers, practitioners, and enthusiasts interested in tackling machine scheduling challenges.
    摘要 我们介绍了一个开源的 GitHub 存储库,包含了广泛的机器调度问题的benchmark,包括作业shop调度问题(JSP)、流shop调度问题(FSP)、可变作业shop调度问题(FJSP)、FJSP中Assembly约束(FAJSP)、FJSP中sequence-dependent设置时间(FJSP-SDST)以及在线FJSP。我们的主要目标是提供一个中央集中的平台,为研究人员、实践者和爱好者提供机器调度挑战的机会。

Single-shot Bayesian approximation for neural networks

  • paper_url: http://arxiv.org/abs/2308.12785
  • repo_url: https://github.com/kaibrach/Moment-Propagation
  • paper_authors: Kai Brach, Beate Sick, Oliver Dürr
  • for: The paper aims to develop a single-shot Monte Carlo (MC) dropout approximation for Bayesian neural networks (BNNs) that can provide uncertainty measures and high prediction performance, while being as fast as traditional neural networks (NNs).
  • methods: The proposed method is based on moment propagation (MP) and can analytically approximate the expected value and variance of the MC dropout signal for commonly used layers in NNs. The approach does not require re-training and can convert an NN into a BNN for uncertainty estimation.
  • results: The paper demonstrates that the proposed single-shot MC dropout approximation resembles the point estimate and uncertainty estimate of the predictive distribution achieved with MC methods, while being fast enough for real-time deployments. Additionally, combining the MP approach with deep ensemble techniques further improves the uncertainty measures.
    Abstract Deep neural networks (NNs) are known for their high-prediction performances. However, NNs are prone to yield unreliable predictions when encountering completely new situations without indicating their uncertainty. Bayesian variants of NNs (BNNs), such as Monte Carlo (MC) dropout BNNs, do provide uncertainty measures and simultaneously increase the prediction performance. The only disadvantage of BNNs is their higher computation time during test time because they rely on a sampling approach. Here we present a single-shot MC dropout approximation that preserves the advantages of BNNs while being as fast as NNs. Our approach is based on moment propagation (MP) and allows to analytically approximate the expected value and the variance of the MC dropout signal for commonly used layers in NNs, i.e. convolution, max pooling, dense, softmax, and dropout layers. The MP approach can convert an NN into a BNN without re-training given the NN has been trained with standard dropout. We evaluate our approach on different benchmark datasets and a simulated toy example in a classification and regression setting. We demonstrate that our single-shot MC dropout approximation resembles the point estimate and the uncertainty estimate of the predictive distribution that is achieved with an MC approach, while being fast enough for real-time deployments of BNNs. We show that using part of the saved time to combine our MP approach with deep ensemble techniques does further improve the uncertainty measures.
    摘要 Here we present a single-shot MC dropout approximation that preserves the advantages of BNNs while being as fast as NNs. Our approach is based on moment propagation (MP) and allows to analytically approximate the expected value and the variance of the MC dropout signal for commonly used layers in NNs, such as convolution, max pooling, dense, softmax, and dropout layers. The MP approach can convert an NN into a BNN without re-training, given the NN has been trained with standard dropout.We evaluate our approach on different benchmark datasets and a simulated toy example in a classification and regression setting. We demonstrate that our single-shot MC dropout approximation resembles the point estimate and the uncertainty estimate of the predictive distribution that is achieved with an MC approach, while being fast enough for real-time deployments of BNNs. Additionally, we show that using part of the saved time to combine our MP approach with deep ensemble techniques further improves the uncertainty measures.
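To make the moment-propagation idea concrete, the sketch below analytically propagates a mean and variance through inverted dropout followed by a dense layer and compares the result against explicit Monte Carlo dropout sampling. Only these two layer types are covered (the paper also treats convolution, pooling and softmax), and the derivation assumes independent activations; all weights and inputs are random placeholders.

```python
import numpy as np

def dense_moments(mean, var, W, b):
    """Propagate mean and variance through a linear layer, assuming
    (approximately) independent inputs: E[Wx + b] = W m + b, Var = W^2 v."""
    return W @ mean + b, (W ** 2) @ var

def dropout_moments(mean, var, p):
    """Moments of inverted dropout y = x * Bernoulli(1 - p) / (1 - p):
    the mean is unchanged, the variance grows by a factor involving p."""
    new_var = var / (1 - p) + (p / (1 - p)) * mean ** 2
    return mean, new_var

rng = np.random.default_rng(0)
W, b, p = rng.standard_normal((3, 4)), np.zeros(3), 0.2
x = rng.standard_normal(4)

# Single-shot analytic moments for dropout followed by a dense layer.
m, v = dropout_moments(x, np.zeros(4), p)
m, v = dense_moments(m, v, W, b)

# Monte Carlo reference with explicit dropout sampling.
samples = np.array([
    W @ (x * (rng.random(4) > p) / (1 - p)) + b for _ in range(100000)
])
print("analytic mean:", np.round(m, 3), " MC mean:", np.round(samples.mean(0), 3))
print("analytic var :", np.round(v, 3), " MC var :", np.round(samples.var(0), 3))
```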

Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward

  • paper_url: http://arxiv.org/abs/2308.12772
  • repo_url: None
  • paper_authors: Taisuke Kobayashi
  • for: This paper addresses the problem of unintentional overestimation in reinforcement learning, specifically in the context of temporal-difference (TD) learning.
  • methods: The proposed method intentionally underestimates the value after termination to avoid learning failures due to unintentional overestimation. The degree of underestimation is adjusted according to the degree of stationarity at termination.
  • results: The proposed method was tested in simulations and real robot experiments, and it was able to stably obtain the optimal policies for various tasks and reward designs.
    Abstract Robot control using reinforcement learning has become popular, but its learning process generally terminates halfway through an episode for safety and time-saving reasons. This study addresses the problem of the most popular exception handling that temporal-difference (TD) learning performs at such termination. That is, by forcibly assuming zero value after termination, unintentionally implicit underestimation or overestimation occurs, depending on the reward design in the normal states. When the episode is terminated due to task failure, the failure may be highly valued with the unintentional overestimation, and the wrong policy may be acquired. Although this problem can be avoided by paying attention to the reward design, it is essential in practical use of TD learning to review the exception handling at termination. This paper therefore proposes a method to intentionally underestimate the value after termination to avoid learning failures due to the unintentional overestimation. In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, thereby preventing excessive exploration due to the intentional underestimation. Simulations and real robot experiments showed that the proposed method can stably obtain the optimal policies for various tasks and reward designs. https://youtu.be/AxXr8uFOe7M

On the Consistency of Average Embeddings for Item Recommendation

  • paper_url: http://arxiv.org/abs/2308.12767
  • repo_url: https://github.com/deezer/consistency
  • paper_authors: Walid Bendada, Guillaume Salha-Galvan, Romain Hennequin, Thomas Bouabça, Tristan Cazenave
  • for: 本研究探讨了averaging item embeddings的做法是否有效。
  • methods: 本研究提出了一个预期精度分数,用于衡量averaging item embeddings的一致性。
  • results: 实验结果表明,真实数据中的平均嵌入在推荐任务中的一致性较差,这为后续研究如何让真实嵌入更符合理论假设奠定了基础。
    Abstract A prevalent practice in recommender systems consists of averaging item embeddings to represent users or higher-level concepts in the same embedding space. This paper investigates the relevance of such a practice. For this purpose, we propose an expected precision score, designed to measure the consistency of an average embedding relative to the items used for its construction. We subsequently analyze the mathematical expression of this score in a theoretical setting with specific assumptions, as well as its empirical behavior on real-world data from music streaming services. Our results emphasize that real-world averages are less consistent for recommendation, which paves the way for future research to better align real-world embeddings with assumptions from our theoretical setting.
    摘要 一种常见的做法在推荐系统中是将item embedding平均化以表示用户或更高级概念在同一个嵌入空间中。这篇论文检查这种做法的 relevance。为此,我们提出了一个预期精度分数,用于衡量average embedding的一致性与构建它的 item 的一致性。我们随后对这个分数的数学表述在具体的假设下进行分析,以及它在实际数据上的实际行为。我们的结果表明,实际中的均值更不一致,这为未来的研究提供了更好地对准实际嵌入与假设中的假设进行调整的方向。
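The exact definition of the expected precision score is in the paper; the snippet below shows one plausible, simplified consistency check in the same spirit: average a set of item embeddings, rank the catalogue by cosine similarity to that average, and measure how many of the original items appear in the top positions. The embeddings and the item set are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def precision_of_average(item_set, catalogue):
    """Simplified consistency check (not the paper's exact score): average the
    set's embeddings, rank the catalogue by cosine similarity to that average,
    and report precision at |set|."""
    avg = catalogue[item_set].mean(axis=0)
    sims = catalogue @ avg / (np.linalg.norm(catalogue, axis=1) * np.linalg.norm(avg) + 1e-12)
    top = np.argsort(-sims)[: len(item_set)]
    return len(set(top) & set(item_set)) / len(item_set)

catalogue = rng.standard_normal((1000, 32))           # fake item embeddings
listened = rng.choice(1000, size=20, replace=False)   # items attributed to one user
print("precision@20 of the average embedding:", precision_of_average(listened, catalogue))
```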

IP-UNet: Intensity Projection UNet Architecture for 3D Medical Volume Segmentation

  • paper_url: http://arxiv.org/abs/2308.12761
  • repo_url: None
  • paper_authors: Nyothiri Aung, Tahar Kechadi, Liming Chen, Sahraoui Dhelim
  • for: 这篇论文的目的是提出一个可以进行多类别分类的深度学习方法,并且能够运用有限的内存容量进行训练,而不会对原始3D影像的分辨率造成影响。
  • methods: 这篇论文提出了一个名为IP-UNet的深度学习模型,该模型使用了Intensity Projection(IP)的方法来将3D类别资料转换为2D影像,并且使用了有限的内存容量进行训练。
  • results: 实验结果显示,IP-UNet 可以达到与 3D-UNet 相近的分割精度,但性能更好:训练时间减少 70%,内存消耗降低 92%。
    Abstract CNNs have been widely applied for medical image analysis. However, limited memory capacity is one of the most common drawbacks of processing high-resolution 3D volumetric data. 3D volumes are usually cropped or downsized first before processing, which can result in a loss of resolution, increase class imbalance, and affect the performance of the segmentation algorithms. In this paper, we propose an end-to-end deep learning approach called IP-UNet. IP-UNet is a UNet-based model that performs multi-class segmentation on Intensity Projection (IP) of 3D volumetric data instead of the memory-consuming 3D volumes. IP-UNet uses limited memory capability for training without losing the original 3D image resolution. We compare the performance of three models in terms of segmentation accuracy and computational cost: 1) Slice-by-slice 2D segmentation of the CT scan images using a conventional 2D UNet model. 2) IP-UNet that operates on data obtained by merging the extracted Maximum Intensity Projection (MIP), Closest Vessel Projection (CVP), and Average Intensity Projection (AvgIP) representations of the source 3D volumes, then applying the UNet model on the output IP images. 3) 3D-UNet model directly reads the 3D volumes constructed from a series of CT scan images and outputs the 3D volume of the predicted segmentation. We test the performance of these methods on 3D volumetric images for automatic breast calcification detection. Experimental results show that IP-Unet can achieve similar segmentation accuracy with 3D-Unet but with much better performance. It reduces the training time by 70\% and memory consumption by 92\%.
    摘要 卷积神经网络(CNN)已被广泛应用于医学图像分析。然而,处理高分辨率 3D 体数据时,有限的显存容量是最常见的瓶颈之一。3D 体数据通常需要先裁剪或降采样再处理,这会损失分辨率、加剧类别不平衡,并影响分割算法的性能。本文提出一种端到端深度学习方法 IP-UNet。IP-UNet 是基于 UNet 的模型,它在 3D 体数据的强度投影(Intensity Projection, IP)上进行多类分割,而不是直接处理占用大量显存的 3D 体数据,因而能够在有限显存下训练,同时不损失原始 3D 图像的分辨率。我们从分割精度和计算开销两方面比较了三种模型:1)使用常规 2D UNet 对 CT 扫描图像逐层进行 2D 分割;2)IP-UNet:先把源 3D 体数据的最大强度投影(MIP)、最近血管投影(CVP)和平均强度投影(AvgIP)表示合并,再对得到的 IP 图像应用 UNet;3)3D-UNet:直接读取由一系列 CT 扫描图像构成的 3D 体数据,输出预测分割的 3D 体。我们在用于乳腺钙化自动检测的 3D 体数据上测试了这些方法。实验结果表明,IP-UNet 能达到与 3D-UNet 相近的分割精度,但性能更好:训练时间减少 70%,内存消耗降低 92%。
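The projections IP-UNet works on are simple reductions of the 3D volume. A NumPy sketch of the maximum and average intensity projections is given below (the closest-vessel projection is omitted); the random array stands in for a real CT stack.

```python
import numpy as np

def intensity_projections(volume, axis=0):
    """Collapse a 3D volume into 2D images along one axis:
    MIP keeps the brightest voxel per ray, AvgIP the mean intensity."""
    mip = volume.max(axis=axis)
    avgip = volume.mean(axis=axis)
    return mip, avgip

rng = np.random.default_rng(0)
ct_like = rng.random((64, 256, 256))          # fake volume standing in for a CT stack
mip, avgip = intensity_projections(ct_like, axis=0)
print(mip.shape, avgip.shape)                 # (256, 256) (256, 256)

# The 2D projections (optionally stacked as channels) are what a 2D UNet
# would segment instead of the full memory-hungry 3D volume.
stacked = np.stack([mip, avgip], axis=0)      # (2, 256, 256) pseudo-multichannel input
```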

Motion In-Betweening with Phase Manifolds

  • paper_url: http://arxiv.org/abs/2308.12751
  • repo_url: https://github.com/pauzii/phasebetweener
  • paper_authors: Paul Starke, Sebastian Starke, Taku Komura, Frank Steinicke
  • for: 这篇论文提出了一种基于数据驱动的动作 interpolating 系统,用于实现人物的目标姿态。
  • methods: 该方法使用 periodic autoencoder 学习 phase 变量,并使用 mixture-of-experts 神经网络模型,以生成动作序列。在满足特定的约束条件下,使用学习的 bi-directional 控制方案来满足这些约束。
  • results: 结果表明,使用 phase 变量进行动作 interpolating 可以增强 interpolated 运动的精度,并且可以在长 transition 时间下稳定学习过程。此外,该方法还可以synthesize 更加复杂的运动行为,并具有样式控制功能。与现有的状态静态方法相比,该方法在动作质量和泛化性方面具有竞争力。
    Abstract This paper introduces a novel data-driven motion in-betweening system to reach target poses of characters by making use of phases variables learned by a Periodic Autoencoder. Our approach utilizes a mixture-of-experts neural network model, in which the phases cluster movements in both space and time with different expert weights. Each generated set of weights then produces a sequence of poses in an autoregressive manner between the current and target state of the character. In addition, to satisfy poses which are manually modified by the animators or where certain end effectors serve as constraints to be reached by the animation, a learned bi-directional control scheme is implemented to satisfy such constraints. The results demonstrate that using phases for motion in-betweening tasks sharpen the interpolated movements, and furthermore stabilizes the learning process. Moreover, using phases for motion in-betweening tasks can also synthesize more challenging movements beyond locomotion behaviors. Additionally, style control is enabled between given target keyframes. Our proposed framework can compete with popular state-of-the-art methods for motion in-betweening in terms of motion quality and generalization, especially in the existence of long transition durations. Our framework contributes to faster prototyping workflows for creating animated character sequences, which is of enormous interest for the game and film industry.

Human Comprehensible Active Learning of Genome-Scale Metabolic Networks

  • paper_url: http://arxiv.org/abs/2308.12740
  • repo_url: None
  • paper_authors: Lun Ai, Shi-Shun Liang, Wang-Zhou Dai, Liam Hallett, Stephen H. Muggleton, Geoff S. Baldwin
  • for: engineering of host cell systems to yield useful products
  • methods: Inductive Logic Programming (ILP) and active learning from training examples
  • results: high-throughput simulations and reduced experimental cost of learning gene functions compared to randomly selected experiments
    Abstract An important application of Synthetic Biology is the engineering of the host cell system to yield useful products. However, an increase in the scale of the host system leads to huge design space and requires a large number of validation trials with high experimental costs. A comprehensible machine learning approach that efficiently explores the hypothesis space and guides experimental design is urgently needed for the Design-Build-Test-Learn (DBTL) cycle of the host cell system. We introduce a novel machine learning framework ILP-iML1515 based on Inductive Logic Programming (ILP) that performs abductive logical reasoning and actively learns from training examples. In contrast to numerical models, ILP-iML1515 is built on comprehensible logical representations of a genome-scale metabolic model and can update the model by learning new logical structures from auxotrophic mutant trials. The ILP-iML1515 framework 1) allows high-throughput simulations and 2) actively selects experiments that reduce the experimental cost of learning gene functions in comparison to randomly selected experiments.
    摘要 An important application of synthetic biology is engineering the host cell system to yield useful products. However, as the scale of the host system increases, the design space grows exponentially, and a large number of validation trials with high experimental costs are required. A comprehensible machine learning approach that efficiently explores the hypothesis space and guides experimental design is urgently needed for the Design-Build-Test-Learn (DBTL) cycle of the host cell system. We propose a novel machine learning framework ILP-iML1515 based on Inductive Logic Programming (ILP) that performs abductive logical reasoning and actively learns from training examples. Unlike numerical models, ILP-iML1515 is built on comprehensible logical representations of a genome-scale metabolic model and can update the model by learning new logical structures from auxotrophic mutant trials. The ILP-iML1515 framework 1) enables high-throughput simulations and 2) actively selects experiments that reduce the experimental cost of learning gene functions compared to randomly selected experiments.

Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

  • paper_url: http://arxiv.org/abs/2308.12734
  • repo_url: None
  • paper_authors: Jordan J. Bird, Ahmad Lotfi
  • for: 这个研究的目的是检测AI生成的speech,以防止深伪语音转换。
  • methods: 研究使用Retrieval-based Voice Conversion生成DEEP-VOICE dataset,并通过统计分析时间声学特征进行 binary 分类。
  • results: 对 208 个单独的机器学习模型进行 10 折交叉验证后,Extreme Gradient Boosting 模型取得 99.3% 的平均分类精度,对 1 秒长的语音片段可在约 0.004 毫秒内完成分类。
    Abstract There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is generated in this study, comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Presenting as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that there are significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
    摘要 生成式 AI 在语音领域的影响日益扩大,使语音克隆以及把一个人的语音实时转换为另一个人的语音成为可能。这项技术带来了重大的伦理风险,可能导致隐私泄露和身份冒用,因此迫切需要针对 DeepFake 语音转换实时检测 AI 生成的语音。为解决上述问题,本研究构建了 DEEP-VOICE 数据集,包含 8 位知名人士的真实语音,以及使用基于检索的语音转换(Retrieval-based Voice Conversion)相互转换得到的语音。问题被设定为判断语音是真实还是 AI 生成的二分类任务;对时间域声学特征进行 t 检验的统计分析表明两类语音的分布存在显著差异。随后对机器学习模型进行超参数优化以识别语音来源,并在 10 折交叉验证中训练了 208 个独立的机器学习模型。结果显示,Extreme Gradient Boosting 模型可达到 99.3% 的平均分类精度,并能实时分类语音:对 1 秒长的语音约只需 0.004 毫秒。本研究生成的所有数据均已公开发布,供未来 AI 语音检测研究使用。
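A minimal sketch of the classification stage described above: 10-fold cross-validation of a gradient-boosting classifier on per-clip acoustic feature vectors. Here sklearn's GradientBoostingClassifier stands in for the XGBoost model, and the features are random placeholders rather than real temporal acoustic statistics.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in features: each row is a vector of temporal acoustic statistics
# (e.g. means/variances of spectral features over a one-second clip).
# Values are random here; in practice they would be extracted from audio.
n_clips, n_features = 400, 26
X_real = rng.normal(0.0, 1.0, (n_clips, n_features))
X_fake = rng.normal(0.3, 1.1, (n_clips, n_features))   # fabricated shift for the demo
X = np.vstack([X_real, X_fake])
y = np.array([0] * n_clips + [1] * n_clips)             # 0 = human, 1 = AI-generated

clf = GradientBoostingClassifier()                       # stand-in for the XGBoost model
scores = cross_val_score(clf, X, y, cv=10)               # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```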

Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection

  • paper_url: http://arxiv.org/abs/2308.12729
  • repo_url: None
  • paper_authors: Shijie Zhang, Xin Yan, Xuejiao Yang, Binfeng Jia, Shuangyang Wang
  • for: 本研究旨在提出一个统一的多任务框架,同时完成客户终身价值(LTV)预测和游戏鲸鱼(高付费玩家)检测。
  • methods: 研究基于深度神经网络设计了一个游戏鲸鱼检测器,不仅能按付费金额推断用户的内在排序,还能精确区分高付费者(游戏鲸鱼)与低付费者;然后把该检测器作为门控网络,决定 LTV 专家模型的不同混合方式,从而充分利用共享信息与场景特定信息(即游戏鲸鱼建模和低付费者建模);最后,不再为两个任务分别设计付费率估计器,而是设计一个共享估计器以保留任务间的内在关联。
  • results: 实验结果显示,ExpLTV 能帮助游戏发行商优化用户获取阶段的广告投入,并提高客户终身价值预测的准确性。
    Abstract Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede the practicality of existing LTV prediction models, since game whales' purchase behaviours always exhibit varied distribution from general users. Consequently, identifying game whales can open up new opportunities to improve the accuracy of LTV prediction models. However, little attention has been paid to applying game whale detection in LTV prediction, and existing works are mainly specialized for the long-term LTV prediction with the assumption that the high-quality user features are available, which is not applicable in the UA stage. In this paper, we propose ExpLTV, a novel multi-task framework to perform LTV prediction and game whale detection in a unified way. In ExpLTV, we first innovatively design a deep neural network-based game whale detector that can not only infer the intrinsic order in accordance with monetary value, but also precisely identify high spenders (i.e., game whales) and low spenders. Then, by treating the game whale detector as a gating network to decide the different mixture patterns of LTV experts assembling, we can thoroughly leverage the shared information and scenario-specific information (i.e., game whales modelling and low spenders modelling). Finally, instead of separately designing a purchase rate estimator for two tasks, we design a shared estimator that can preserve the inner task relationships. The superiority of ExpLTV is further validated via extensive experiments on three industrial datasets.
    摘要 客户生命值(LTV)预测是移动游戏发布商需要优化每个用户获取的广告投资基于预测的价值。在移动游戏中,部署微交易是一种简单而有效的营收化策略,吸引了一些游戏鲸鱼,他们在游戏中购买各种付加值。鲸鱼的购买行为与普通用户存在很大差异,因此存在鲸鱼的存在可能会降低现有的 LTV 预测模型的实用性。然而,关于在 LTV 预测中应用游戏鲸鱼检测的研究得到了少量的关注,现有的工作主要关注于长期 LTV 预测,假设高质量的用户特征可以获得,这不适用于 UA 阶段。本文提出了 ExpLTV,一种新的多任务框架,用于同时进行 LTV 预测和游戏鲸鱼检测。在 ExpLTV 中,我们首先创新地设计了深度神经网络基于的游戏鲸鱼检测器,可以不仅掌握游戏鲸鱼的内在顺序,还能准确地识别高支付者(即游戏鲸鱼)和低支付者。然后,我们将游戏鲸鱼检测器作为分配网络来决定不同的混合模式,以便充分利用共同信息和场景特定信息(即游戏鲸鱼模型和低支付者模型)。最后,而不是分别设计两个任务的购买率估计器,我们设计了一个共享估计器,可以保持内部任务之间的关系。ExpLTV 的优势得到了EXTENSIVE EXPERIMENTS 的验证,并在三个 industrialdatasets 上进行了 validate。
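A highly simplified view of using a whale detector as a gating network: mix a high-spender expert and a low-spender expert by the detector's probability. The gate and experts below are toy closed-form functions invented for illustration, not the paper's neural architecture.

```python
import numpy as np

def expected_ltv(features, whale_prob_fn, whale_expert, low_expert):
    """Gate two LTV experts by the whale-detector probability, a simplified view
    of using a game-whale detector to mix expert predictions."""
    p_whale = whale_prob_fn(features)
    return p_whale * whale_expert(features) + (1.0 - p_whale) * low_expert(features)

# Toy stand-ins: a logistic gate on "early spend" and two linear experts.
whale_prob = lambda x: 1.0 / (1.0 + np.exp(-(x["early_spend"] - 50.0) / 10.0))
whale_ltv  = lambda x: 20.0 * x["early_spend"] + 500.0 * x["sessions"]
low_ltv    = lambda x: 1.5 * x["early_spend"] + 3.0 * x["sessions"]

user = {"early_spend": 60.0, "sessions": 12.0}
print("expected LTV:", round(expected_ltv(user, whale_prob, whale_ltv, low_ltv), 2))
```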

Continuous Reinforcement Learning-based Dynamic Difficulty Adjustment in a Visual Working Memory Game

  • paper_url: http://arxiv.org/abs/2308.12726
  • repo_url: None
  • paper_authors: Masoud Rahimi, Hadi Moradi, Abdol-hossein Vahabie, Hamed Kebriaei
  • for: 提高玩家的游戏体验 (enhance the game experience)
  • methods: 使用 continuous reinforcement learning (RL) 方法调整游戏Difficulty (adjust game difficulty)
  • results: 比较rule-based方法,RL-based方法可以提高玩家的得分和胜率,同时减少游戏session中的得分下降 (compared to rule-based methods, the RL-based method can improve the player’s scores and win rates, while reducing the decline in scores over the course of a 20-trial session)
    Abstract Dynamic Difficulty Adjustment (DDA) is a viable approach to enhance a player's experience in video games. Recently, Reinforcement Learning (RL) methods have been employed for DDA in non-competitive games; nevertheless, they rely solely on discrete state-action space with a small search space. In this paper, we propose a continuous RL-based DDA methodology for a visual working memory (VWM) game to handle the complex search space for the difficulty of memorization. The proposed RL-based DDA tailors game difficulty based on the player's score and game difficulty in the last trial. We defined a continuous metric for the difficulty of memorization. Then, we consider the task difficulty and the vector of difficulty-score as the RL's action and state, respectively. We evaluated the proposed method through a within-subject experiment involving 52 subjects. The proposed approach was compared with two rule-based difficulty adjustment methods in terms of player's score and game experience measured by a questionnaire. The proposed RL-based approach resulted in a significantly better game experience in terms of competence, tension, and negative and positive affect. Players also achieved higher scores and win rates. Furthermore, the proposed RL-based DDA led to a significantly less decline in the score in a 20-trial session.
    摘要 “智能难度调整(DDA)是一种有效的方法来提高玩家的游戏体验。最近,人工智能学习(RL)方法已经在非竞争游戏中应用于DDA,但它们仅仅基于精确的状态动作空间和小搜索空间。在这篇论文中,我们提出了基于RL的连续Difficulty Adjustment方法,用于处理视觉工作记忆游戏中的复杂搜索空间。我们定义了一个连续指标来衡量记忆难度,然后将任务难度和游戏难度的向量作为RL的动作和状态。我们通过在52名参与者的内部实验中评估了我们的方法,并与两种规则基于的难度调整方法进行比较。我们的RL基于方法对于玩家的得分和游戏体验得分(问卷评估)具有显著更好的效果,包括竞技感、紧张感和负面情感。玩家也获得了更高的得分和胜率。此外,我们的RL基于方法在20场游戏会议中的得分下降幅度也显著减少。”

Solving Forward and Inverse Problems of Contact Mechanics using Physics-Informed Neural Networks

  • paper_url: http://arxiv.org/abs/2308.12716
  • repo_url: None
  • paper_authors: T. Sahin, M. von Danwitz, A. Popp
  • for: 解决小塑性弹性物理中的前向和反向问题,使用物理学 informed neural networks (PINNs)。
  • methods: 采用混合变量表示并结合输出变换,把 Dirichlet 和 Neumann 边界条件作为硬约束强制满足;接触问题的不等式约束(KKT 条件)则作为软约束纳入训练损失函数
  • results: PINNs 可以作为纯 PDE 解决方案、数据增强前向模型、参数标定 inverse 问题,以及快速评估的减少模型。
    Abstract This paper explores the ability of physics-informed neural networks (PINNs) to solve forward and inverse problems of contact mechanics for small deformation elasticity. We deploy PINNs in a mixed-variable formulation enhanced by output transformation to enforce Dirichlet and Neumann boundary conditions as hard constraints. Inequality constraints of contact problems, namely Karush-Kuhn-Tucker (KKT) type conditions, are enforced as soft constraints by incorporating them into the loss function during network training. To formulate the loss function contribution of KKT constraints, existing approaches applied to elastoplasticity problems are investigated and we explore a nonlinear complementarity problem (NCP) function, namely Fischer-Burmeister, which possesses advantageous characteristics in terms of optimization. Based on the Hertzian contact problem, we show that PINNs can serve as pure partial differential equation (PDE) solver, as data-enhanced forward model, as inverse solver for parameter identification, and as fast-to-evaluate surrogate model. Furthermore, we demonstrate the importance of choosing proper hyperparameters, e.g. loss weights, and a combination of Adam and L-BFGS-B optimizers aiming for better results in terms of accuracy and training time.
    摘要 本文研究物理信息神经网络(PINNs)求解小变形弹性接触力学正问题和反问题的能力。我们采用混合变量形式并结合输出变换,将 Dirichlet 和 Neumann 边界条件作为硬约束施加;接触问题的不等式约束(即 KKT 型条件)则作为软约束纳入训练损失函数。在构造 KKT 约束的损失项时,我们考察了已应用于弹塑性问题的现有方法,并探索了在优化方面具有优良特性的非线性互补问题(NCP)函数 Fischer-Burmeister。以 Hertz 接触问题为例,我们表明 PINNs 可以作为纯 PDE 求解器、数据增强的正向模型、用于参数辨识的反问题求解器以及可快速评估的代理模型。此外,我们还说明了合理选择超参数(如损失权重)以及结合使用 Adam 与 L-BFGS-B 优化器对于提升精度和缩短训练时间的重要性。
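The Fischer-Burmeister NCP function mentioned in the abstract turns the KKT complementarity conditions into a smooth residual that can be penalized in the training loss. A minimal numpy sketch of that ingredient, with illustrative gap/pressure values and an assumed squared-residual penalty, is shown below.

```python
# Fischer-Burmeister NCP residual used as a soft KKT-complementarity penalty
# (illustrative sketch; variable names and weights are assumptions, not the paper's code).
import numpy as np

def fischer_burmeister(a, b):
    # phi(a, b) = 0  <=>  a >= 0, b >= 0, a * b = 0
    return a + b - np.sqrt(a**2 + b**2)

# Example: gap g >= 0, contact pressure p >= 0, g * p = 0 (Hertzian contact KKT).
gap      = np.array([0.0, 0.3, 0.1])   # last entry violates complementarity
pressure = np.array([5.0, 0.0, 2.0])
residual = fischer_burmeister(gap, pressure)
loss_kkt = np.mean(residual**2)        # added to the PDE/BC loss with some weight
print(residual, loss_kkt)
```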

Disentanglement Learning via Topology

  • paper_url: http://arxiv.org/abs/2308.12696
  • repo_url: None
  • paper_authors: Nikita Balabin, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov
  • for: 学习分离表示,即数据表示的解释性和强健性的基础。
  • methods: 使用多尺度拓扑学损失函数来学习分离表示。
  • results: 对比之前的状态艺术方法,我们的方法在分离度指标(MIG、FactorVAE分数、SAP分数和DCI分离度)上显著提高了表示分离度。我们的方法可以无监督地应用于无标注因素变化的问题。此外,我们还示出了如何使用我们的拓扑学损失函数来找出已经训练的GAN中分离的方向。
    Abstract We propose TopDis (Topological Disentanglement), a method for learning disentangled representations via adding multi-scale topological loss term. Disentanglement is a crucial property of data representations substantial for the explainability and robustness of deep learning models and a step towards high-level cognition. The state-of-the-art method based on VAE minimizes the total correlation of the joint distribution of latent variables. We take a different perspective on disentanglement by analyzing topological properties of data manifolds. In particular, we optimize the topological similarity for data manifolds traversals. To the best of our knowledge, our paper is the first one to propose a differentiable topological loss for disentanglement. Our experiments have shown that the proposed topological loss improves disentanglement scores such as MIG, FactorVAE score, SAP score and DCI disentanglement score with respect to state-of-the-art results. Our method works in an unsupervised manner, permitting to apply it for problems without labeled factors of variation. Additionally, we show how to use the proposed topological loss to find disentangled directions in a trained GAN.
    摘要 我们提出了 TopDis(Topological Disentanglement)方法,该方法通过添加多尺度拓扑损失项来学习解耦表示。解耦是数据表示的一项关键性质,对深度学习模型的可解释性和鲁棒性至关重要,也是迈向高级认知的一步。当前最先进的方法基于 VAE,通过最小化潜变量联合分布的总相关性(total correlation)来实现解耦。我们则从分析数据流形的拓扑性质这一不同角度看待解耦问题:具体来说,我们优化数据流形遍历(traversal)之间的拓扑相似性。据我们所知,本文是第一篇为解耦提出可微拓扑损失的工作。实验表明,所提出的拓扑损失相对于最先进的结果提升了 MIG、FactorVAE 分数、SAP 分数和 DCI 解耦分数等指标。我们的方法以无监督方式工作,可应用于没有变差因素标注的问题。此外,我们还展示了如何使用该拓扑损失在已训练的 GAN 中找到解耦方向。

An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression

  • paper_url: http://arxiv.org/abs/2308.12691
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Bohan Lyu, Jianzhong Li
  • for: 这篇论文提出了一种新的大数据分析方法,使用一种新定义的多模型线性回归(MMLR)模型,可以将输入数据集分成多个子集并建立本地线性回归模型。
  • methods: 该论文提出了一种新的近似算法来构建MMLR模型,基于($\epsilon$, $\delta$)-估计器,并给出了数学证明MMLR算法的正确性和效率。
  • results: 该论文通过实验证明,MMLR算法在许多情况下和现有回归方法有相当的性能,而且它的计算时间几乎是最短的。
    Abstract This paper introduces a new data analysis method for big data using a newly defined regression model named multiple model linear regression(MMLR), which separates input datasets into subsets and construct local linear regression models of them. The proposed data analysis method is shown to be more efficient and flexible than other regression based methods. This paper also proposes an approximate algorithm to construct MMLR models based on $(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness and efficiency of MMLR algorithm, of which the time complexity is linear with respect to the size of input datasets. This paper also empirically implements the method on both synthetic and real-world datasets, the algorithm shows to have comparable performance to existing regression methods in many cases, while it takes almost the shortest time to provide a high prediction accuracy.
    摘要 这篇论文介绍了一种新的大数据分析方法,使用新定义的多模型线性回归(MMLR)模型,将输入数据集分解成子集,并在每个子集上构建本地线性回归模型。所提出的数据分析方法比其他基于回归的方法更高效和灵活。本文还提出了一种基于($\epsilon$, $\delta$)-估计器构建 MMLR 模型的近似算法,并给出了 MMLR 算法正确性和效率的数学证明,其时间复杂度与输入数据集规模呈线性关系。此外,文章还在合成数据集和真实数据集上对该方法进行了实验,结果表明,该算法在许多情况下与现有回归方法性能相当,且几乎以最短的时间提供较高的预测精度。
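To make the multiple-model idea concrete, here is a toy numpy sketch that partitions the input range into fixed subsets and fits one linear model per subset; the actual MMLR algorithm chooses the partition via the $(\epsilon,\delta)$-estimator rather than fixed bins, so this is only an illustration of the overall structure.

```python
# Toy multiple-model linear regression: fixed partition + one local linear fit per subset.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = np.where(x < 5, 2 * x + 1, -x + 16) + rng.normal(0, 0.3, size=x.shape)

edges = np.linspace(0, 10, 5)            # 4 subsets; the paper selects splits adaptively
models = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x <= hi)
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    models.append((lo, hi, slope, intercept))

def predict(x_new):
    for lo, hi, slope, intercept in models:
        if lo <= x_new <= hi:
            return slope * x_new + intercept
    raise ValueError("x_new outside fitted range")

print(predict(2.0), predict(8.0))        # roughly 5.0 and 8.0
```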

Match-And-Deform: Time Series Domain Adaptation through Optimal Transport and Temporal Alignment

  • paper_url: http://arxiv.org/abs/2308.12686
  • repo_url: https://github.com/rtavenar/MatchAndDeform
  • paper_authors: François Painblanc, Laetitia Chapel, Nicolas Courty, Chloé Friguet, Charlotte Pelletier, Romain Tavenard
  • for: 本文旨在利用源域的标签对相关但不同的目标域中的时间序列进行分类,同时处理时间上的偏移问题。
  • methods: 提出的匹配与形变(Match-And-Deform,MAD)方法在源域与目标域的时间序列之间寻找对应关系,并通过最优传输损失与动态时间规整(DTW)对时间轴进行对齐。
  • results: 在基准数据集和遥感数据上的实验表明,MAD 能够给出有意义的样本配对与时间偏移估计,分类性能与最先进的深度时间序列领域自适应方法相当或更好。
    Abstract While large volumes of unlabeled data are usually available, associated labels are often scarce. The unsupervised domain adaptation problem aims at exploiting labels from a source domain to classify data from a related, yet different, target domain. When time series are at stake, new difficulties arise as temporal shifts may appear in addition to the standard feature distribution shift. In this paper, we introduce the Match-And-Deform (MAD) approach that aims at finding correspondences between the source and target time series while allowing temporal distortions. The associated optimization problem simultaneously aligns the series thanks to an optimal transport loss and the time stamps through dynamic time warping. When embedded into a deep neural network, MAD helps learning new representations of time series that both align the domains and maximize the discriminative power of the network. Empirical studies on benchmark datasets and remote sensing data demonstrate that MAD makes meaningful sample-to-sample pairing and time shift estimation, reaching similar or better classification performance than state-of-the-art deep time series domain adaptation strategies.
    摘要 大量无标签数据通常容易获得,但相应的标签往往稀缺。无监督领域自适应问题旨在利用源域的标签对相关但不同的目标域数据进行分类。当涉及时间序列时,除了常见的特征分布偏移之外,还可能出现时间上的偏移,带来新的困难。本文提出 Match-And-Deform(MAD)方法,在寻找源与目标时间序列之间对应关系的同时允许时间形变。相应的优化问题借助最优传输损失对序列进行对齐,并通过动态时间规整对时间戳进行对齐。当嵌入到深度神经网络中时,MAD 有助于学习既能对齐两个域、又能最大化网络判别能力的时间序列新表示。在基准数据集和遥感数据上的实证研究表明,MAD 能够给出有意义的样本配对和时间偏移估计,分类性能与最先进的深度时间序列领域自适应策略相当或更好。
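Dynamic time warping is the temporal-alignment ingredient that MAD combines with an optimal-transport loss. The standalone DTW sketch below (the classic dynamic-programming recursion, not the MAD code) shows how a small alignment cost can be obtained despite a temporal shift.

```python
# Plain dynamic time warping (DTW) between two 1-D series.
import numpy as np

def dtw_cost(a, b):
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

series_a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
series_b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape, shifted in time
print(dtw_cost(series_a, series_b))                   # zero despite the temporal shift
```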

LR-XFL: Logical Reasoning-based Explainable Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12681
  • repo_url: None
  • paper_authors: Yanci Zhang, Han Yu
  • for: 这篇论文旨在探讨如何在联合学习中保持数据隐私,同时提高模型的透明度和解释性。
  • methods: 这篇论文提出了逻辑推理基于联合学习(LR-XFL)方法,该方法通过在客户端上创建本地逻辑规则,并将其传输到联合学习服务器。服务器通过将本地逻辑规则连接成一个合理的逻辑连接,而不需要访问原始数据,以提高模型的透明度和解释性。
  • results: 实验结果表明,LR-XFL 在分类精度、规则精度和规则忠实度上分别比最相关的基线方法提高了 1.19%、5.81% 和 5.41%。此外,LR-XFL 还可以提高全局联邦学习模型的鲁棒性和透明度,为医疗和金融等数据隐私与可解释性都很重要的领域带来潜在改进。
    Abstract Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients' local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model's robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important.
    摘要 联邦学习(FL)是一种在保护数据隐私的前提下协同训练机器学习模型的新兴方法。由于隐私保护的需求,FL 模型难以实现全局的透明性和可解释性。为了解决这一限制,我们提出了基于逻辑推理的可解释联邦学习(LR-XFL)方法,将基于逻辑的解释引入 FL。在 LR-XFL 中,FL 客户端基于本地数据创建本地逻辑规则,并将其连同模型更新一起发送给 FL 服务器。服务器根据客户端数据的性质推导出合适的逻辑连接词,在无需访问原始数据的情况下连接各本地逻辑规则;同时,服务器按上传逻辑规则所反映的客户端本地数据质量确定权重,对本地模型更新进行聚合。结果显示,LR-XFL 在分类精度、规则精度和规则忠实度方面分别比最相关的基线方法高 1.19%、5.81% 和 5.41%。LR-XFL 中显式的规则评估与表达使人类专家能够在服务器端验证和修正规则,从而提高全局 FL 模型对错误的鲁棒性。这有望提升 FL 模型在医疗和金融等数据隐私与可解释性都很重要的领域中的透明度。
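A hypothetical sketch of the server-side weighting step described in the abstract: each client's model update is weighted by a quality score derived from its uploaded logic rules. The scores, update vectors, and the simple normalized weighting below are assumptions for illustration only.

```python
# Hypothetical rule-quality-weighted aggregation of client updates.
import numpy as np

client_updates = [np.array([0.1, -0.2]), np.array([0.3, 0.0]), np.array([0.2, -0.1])]
rule_quality   = np.array([0.9, 0.5, 0.7])   # e.g., accuracy of each client's logic rules

weights = rule_quality / rule_quality.sum()
global_update = sum(w * u for w, u in zip(weights, client_updates))
print(weights, global_update)
```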

Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints

  • paper_url: http://arxiv.org/abs/2308.12680
  • repo_url: https://github.com/huanghanchi/master-slave-algorithm-for-top-k-bandits
  • paper_authors: Hanchi Huang, Li Shen, Deheng Ye, Wei Liu
  • for: 求解带非线性反馈和多样性约束的 top-$K$ 组合多臂老虎机问题;据作者所知,这是首个在赌博机反馈下考虑多样性约束的组合老虎机设定。
  • methods: 引入六个各具优点的从(slave)模型,以生成在奖励、约束与效率之间取得平衡的多样化样本,并使用基于教师学习的优化和策略协同训练技术来提升从模型的表现。
  • results: 在合成数据集和真实推荐数据集上,该方法显著超越了现有的最先进算法,并能在不同的约束与奖励要求之间灵活权衡。
    Abstract We propose a novel master-slave architecture to solve the top-$K$ combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints, which, to the best of our knowledge, is the first combinatorial bandits setting considering diversity constraints under bandit feedback. Specifically, to efficiently explore the combinatorial and constrained action space, we introduce six slave models with distinguished merits to generate diversified samples well balancing rewards and constraints as well as efficiency. Moreover, we propose teacher learning based optimization and the policy co-training technique to boost the performance of the multiple slave models. The master model then collects the elite samples provided by the slave models and selects the best sample estimated by a neural contextual UCB-based network to make a decision with a trade-off between exploration and exploitation. Thanks to the elaborate design of slave models, the co-training mechanism among slave models, and the novel interactions between the master and slave models, our approach significantly surpasses existing state-of-the-art algorithms in both synthetic and real datasets for recommendation tasks. The code is available at: \url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
    摘要 我们提出了一种新的主从(master-slave)架构来求解带非线性反馈和多样性约束的 top-$K$ 组合多臂老虎机问题;据我们所知,这是首个在赌博机反馈下考虑多样性约束的组合老虎机设定。具体而言,为了高效地探索组合且受约束的动作空间,我们引入了六个各具优点的从模型,以生成在奖励与约束之间取得平衡且高效的多样化样本。此外,我们提出了基于教师学习的优化和策略协同训练技术来提升多个从模型的性能。主模型随后收集从模型提供的精英样本,并使用基于神经上下文 UCB 的网络估计并选出最佳样本,以在探索与利用之间取得平衡。得益于从模型的精心设计、从模型之间的协同训练机制以及主从模型之间的新型交互,我们的方法在合成数据集和真实推荐数据集上显著超越了现有的最先进算法。代码见:https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits 。

A Continual Learning Approach for Cross-Domain White Blood Cell Classification

  • paper_url: http://arxiv.org/abs/2308.12679
  • repo_url: None
  • paper_authors: Ario Sadafi, Raheleh Salehi, Armin Gruber, Sayedali Shetab Boushehri, Pascal Giehr, Nassir Navab, Carsten Marr
  • for: 静脉血白细胞分类是诊断血液疾病的关键,因此需要定期更新机器学习分类模型以适应不断变化的临床环境、数据源和疾病分类。
  • methods: 我们提出了一种基于复习(rehearsal)的持续学习方法,用于白细胞分类中的类增量与领域增量场景。我们根据模型预测结果选择代表性样本(最有信心的样本以及通过不确定性估计识别出的最难样本),以避免遗忘先前学习的知识。
  • results: 我们在三个在颜色、分辨率和类别组成上各不相同的白细胞分类数据集上进行了全面评测,包括每个任务引入新领域或新类别的场景。我们的方法在跨领域持续学习中表现出色,超越了 iCaRL 和 EWC 等既有基线方法。
    Abstract Accurate classification of white blood cells in peripheral blood is essential for diagnosing hematological diseases. Due to constantly evolving clinical settings, data sources, and disease classifications, it is necessary to update machine learning classification models regularly for practical real-world use. Such models significantly benefit from sequentially learning from incoming data streams without forgetting previously acquired knowledge. However, models can suffer from catastrophic forgetting, causing a drop in performance on previous tasks when fine-tuned on new data. Here, we propose a rehearsal-based continual learning approach for class incremental and domain incremental scenarios in white blood cell classification. To choose representative samples from previous tasks, we employ exemplar set selection based on the model's predictions. This involves selecting the most confident samples and the most challenging samples identified through uncertainty estimation of the model. We thoroughly evaluated our proposed approach on three white blood cell classification datasets that differ in color, resolution, and class composition, including scenarios where new domains or new classes are introduced to the model with every task. We also test a long class incremental experiment with both new domains and new classes. Our results demonstrate that our approach outperforms established baselines in continual learning, including existing iCaRL and EWC methods for classifying white blood cells in cross-domain environments.
    摘要 准确分类白血球在周围血液中是诊断血液疾病的关键。由于临床设定不断变化,数据来源和疾病分类不断更新,因此需要定期更新机器学习分类模型以适应实际世界中的应用。但是,模型可能会出现悬峰现象,导致在新数据上练习后表现下降。为此,我们提出了基于复习的不间断学习方法,适用于白血球分类的类增量和领域增量场景。我们使用模型预测结果来选择先前任务中的表现最好和最难的样本,以增强模型的稳定性和鲁棒性。我们对三个不同的白血球分类数据集进行了全面的评估,包括新领域和新类的引入。我们还进行了长期类增量实验,以测试我们的方法在跨领域环境中的表现。结果表明,我们的方法在不间断学习中超过了现有的iCaRL和EWC方法,用于分类白血球。
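The exemplar-selection idea (keep the most confident and the most challenging samples per class, the latter via uncertainty estimation) can be sketched as follows; the probabilities, entropy-based uncertainty, and set sizes are illustrative stand-ins rather than the paper's exact procedure.

```python
# Illustrative rehearsal exemplar selection from predicted class probabilities.
import numpy as np

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(5), size=200)      # stand-in for softmax outputs
labels = probs.argmax(axis=1)

def select_exemplars(class_id, k_conf=5, k_hard=5):
    idx = np.where(labels == class_id)[0]
    confidence = probs[idx, class_id]
    # Entropy as a simple uncertainty estimate.
    entropy = -(probs[idx] * np.log(probs[idx] + 1e-12)).sum(axis=1)
    most_confident = idx[np.argsort(-confidence)[:k_conf]]
    most_uncertain = idx[np.argsort(-entropy)[:k_hard]]
    return np.union1d(most_confident, most_uncertain)

print(select_exemplars(class_id=0))
```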

Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition

  • paper_url: http://arxiv.org/abs/2308.12673
  • repo_url: None
  • paper_authors: Dimitrios Daskalakis, Nikolaos Gkalelis, Vasileios Mezaris
  • for: 这 paper 是为了提高视频事件识别性能而设计的。
  • methods: 这 paper 使用了一种新的隐藏特征模型(Masked Feature Modelling,MFM),该模型利用预训练的视觉化器来重建视频中对象的隐藏特征,并将这些特征与一个已经预训练的 Graph Attention Network(GAT)块结合使用。
  • results: 实验表明,使用 MFM 可以提高视频事件识别性能。
    Abstract In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM utilizes a pretrained Visual Tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recognition architecture, ViGAT, to improve the model's starting point and overall accuracy. Experimental evaluations on the YLI-MED dataset demonstrate the effectiveness of MFM in improving event recognition performance.
    摘要 在这篇论文中,我们介绍了一种新的无监督预训练方法,即Masked Feature Modelling(MFM),用于提高视频事件识别性能。MFM利用一个预训练的视觉词法器来重建视频中对象的做了遮盲特征,基于MiniKinetics dataset。然后,我们将预训练的GAT块integrated到了现有的顶部向下推导视频事件识别架构ViGAT中,以提高模型的起点和总性能。实验评估在YLI-MED数据集上,证明MFM可以提高事件识别性能。

Optimal data pooling for shared learning in maintenance operations

  • paper_url: http://arxiv.org/abs/2308.12670
  • repo_url: None
  • paper_authors: Collin Drent, Melvin Drent, Geert-Jan van Houtum
  • for: 本研究探讨了 Shared Learning 在维护操作中的优势。
  • methods: 我们使用了一种归纳数据的方法,可以将高维 Markov 决策过程(MDP)转化为两维 MDP,以便进行结构分析和计算。
  • results: 我们的研究表明,通过归纳数据,可以实现cost reduction,比不归纳的情况更好。
    Abstract This paper addresses the benefits of pooling data for shared learning in maintenance operations. We consider a set of systems subject to Poisson degradation that are coupled through an a-priori unknown rate. Decision problems involving these systems are high-dimensional Markov decision processes (MDPs). We present a decomposition result that reduces such an MDP to two-dimensional MDPs, enabling structural analyses and computations. We leverage this decomposition to demonstrate that pooling data can lead to significant cost reductions compared to not pooling.
    摘要 本文探讨了在维护运营中合并数据以进行共享学习的好处。我们考虑一组经历泊松退化的系统,它们通过一个先验未知的退化速率相互耦合。涉及这些系统的决策问题是高维马尔可夫决策过程(MDP)。我们给出了一个分解结果,可将此类 MDP 归约为若干二维 MDP,从而便于结构分析与计算。基于这一分解,我们证明与不合并数据相比,合并数据能够带来显著的成本降低。

Geodesic Mode Connectivity

  • paper_url: http://arxiv.org/abs/2308.12666
  • repo_url: https://github.com/char-tan/geodesic-mode-connectivity
  • paper_authors: Charlie Tan, Theodore Long, Sarah Zhao, Rudolf Laine
  • for: 研究模型之间的连接性,即训练过的模型可以通过低损失的路径相连。
  • methods: 使用信息几何来研究神经网络,即神经网络可视为参数化分布的空间,其几何结构具有弯曲性。提出使用曲线来近似地odesics,以实现模式连接性。
  • results: 验证了曲线方法可以实现模式连接性,并且可以用来优化模型的性能。
    Abstract Mode connectivity is a phenomenon where trained models are connected by a path of low loss. We reframe this in the context of Information Geometry, where neural networks are studied as spaces of parameterized distributions with curved geometry. We hypothesize that shortest paths in these spaces, known as geodesics, correspond to mode-connecting paths in the loss landscape. We propose an algorithm to approximate geodesics and demonstrate that they achieve mode connectivity.
    摘要 模式连接性(mode connectivity)是指训练好的模型之间由一条低损失路径相连的现象。我们将这一现象置于信息几何的框架中重新审视:神经网络被视为具有弯曲几何结构的参数化分布空间。我们假设这些空间中的最短路径(即测地线)对应于损失地形中的模式连接路径。我们提出了一种近似测地线的算法,并证明它们能够实现模式连接。
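As a toy illustration of the mechanics, the snippet below evaluates the loss at points interpolated between two trained parameter vectors, i.e. a straight-line path baseline on a convex least-squares problem. The paper instead treats networks as parameterized distributions and approximates geodesics under an information-geometric metric, so this only shows how a candidate connecting path is probed.

```python
# Probe the loss along a straight-line path between two trained parameter vectors.
import numpy as np

def loss(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta_a = np.linalg.lstsq(X, y, rcond=None)[0] + rng.normal(0, 0.01, 3)
theta_b = np.linalg.lstsq(X, y, rcond=None)[0] + rng.normal(0, 0.01, 3)

for t in np.linspace(0.0, 1.0, 5):
    theta_t = (1 - t) * theta_a + t * theta_b
    print(f"t={t:.2f}  loss={loss(theta_t, X, y):.4f}")
```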

Don’t Look into the Sun: Adversarial Solarization Attacks on Image Classifiers

  • paper_url: http://arxiv.org/abs/2308.12661
  • repo_url: https://github.com/paulgavrikov/adversarial_solarization
  • paper_authors: Paul Gavrikov, Janis Keuper
  • for: This paper aims to evaluate the robustness of deep neural networks against out-of-distribution inputs, specifically through the use of image solarization attacks.
  • methods: The paper introduces a new attack method based on image solarization, which is a straightforward yet effective way to degrade the accuracy of image classification models.
  • results: The authors demonstrate the attack’s capacity to significantly degrade accuracy, and show that existing defenses are not consistently effective against this specific attack.
    Abstract Assessing the robustness of deep neural networks against out-of-distribution inputs is crucial, especially in safety-critical domains like autonomous driving, but also in safety systems where malicious actors can digitally alter inputs to circumvent safety guards. However, designing effective out-of-distribution tests that encompass all possible scenarios while preserving accurate label information is a challenging task. Existing methodologies often entail a compromise between variety and constraint levels for attacks and sometimes even both. In a first step towards a more holistic robustness evaluation of image classification models, we introduce an attack method based on image solarization that is conceptually straightforward yet avoids jeopardizing the global structure of natural images independent of the intensity. Through comprehensive evaluations of multiple ImageNet models, we demonstrate the attack's capacity to degrade accuracy significantly, provided it is not integrated into the training augmentations. Interestingly, even then, no full immunity to accuracy deterioration is achieved. In other settings, the attack can often be simplified into a black-box attack with model-independent parameters. Defenses against other corruptions do not consistently extend to be effective against our specific attack. Project website: https://github.com/paulgavrikov/adversarial_solarization
    摘要 评估深度神经网络对分布外输入的鲁棒性至关重要,尤其是在自动驾驶等安全关键领域,以及攻击者可以通过数字化修改输入来绕过安全防护的安全系统中。然而,设计既能覆盖所有可能场景、又能保留准确标签信息的分布外测试十分困难。现有方法往往需要在攻击的多样性与约束程度之间做出折衷,有时两者都要妥协。作为迈向更全面的图像分类模型鲁棒性评估的第一步,我们提出了一种基于图像曝光(solarization)的攻击方法:其概念简单,且无论强度如何都不会破坏自然图像的整体结构。通过对多个 ImageNet 模型进行全面评估,我们展示了这种攻击能显著降低准确率,只要它未被纳入训练数据增强;即便纳入,也无法完全避免准确率下降。在其他设置下,该攻击通常可以简化为参数与模型无关的黑盒攻击。针对其他损坏类型的防御并不总能对我们这种特定攻击有效。项目网站:https://github.com/paulgavrikov/adversarial_solarization
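Solarization itself is a one-line image operation (invert pixels above a threshold), and the attack can be run black-box by searching for the threshold that maximizes the model's loss. The sketch below uses a toy loss so it stays self-contained; the threshold grid and loss are assumptions, not the repository's code.

```python
# Minimal solarization attack sketch: invert pixels above a threshold and pick the
# threshold that maximizes a given loss (black-box, model-independent).
import numpy as np

def solarize(img, threshold):
    # img in [0, 1]; pixels at or above the threshold are inverted.
    return np.where(img >= threshold, 1.0 - img, img)

def attack(img, loss_fn, thresholds=np.linspace(0.05, 0.95, 19)):
    candidates = [solarize(img, t) for t in thresholds]
    losses = [loss_fn(c) for c in candidates]
    return candidates[int(np.argmax(losses))]

# Toy example: the "loss" just measures deviation from the clean image.
rng = np.random.default_rng(0)
clean = rng.uniform(size=(8, 8))
adv = attack(clean, loss_fn=lambda x: np.abs(x - clean).mean())
print(np.abs(adv - clean).mean())
```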

APART: Diverse Skill Discovery using All Pairs with Ascending Reward and DropouT

  • paper_url: http://arxiv.org/abs/2308.12649
  • repo_url: None
  • paper_authors: Hadar Schreiber Galler, Tom Zahavy, Guillaume Desjardins, Alon Cohen
  • for: 这个论文旨在解决无奖励环境中多种技能发现问题,目标是在简单的网格世界环境中发现所有可能的技能。
  • methods: 该论文使用了一种名为APART的方法,即多对多(所有对对)拟合器和一种新的内在奖励函数,以及一种排除误差技术。
  • results: 论文表明,APART在网格世界环境中能够寻找所有可能的技能,并且需要更少的样本数据 than previous works。此外,论文还提出了一种简化版的算法,可以达到最大技能数量。
    Abstract We study diverse skill discovery in reward-free environments, aiming to discover all possible skills in simple grid-world environments where prior methods have struggled to succeed. This problem is formulated as mutual training of skills using an intrinsic reward and a discriminator trained to predict a skill given its trajectory. Our initial solution replaces the standard one-vs-all (softmax) discriminator with a one-vs-one (all pairs) discriminator and combines it with a novel intrinsic reward function and a dropout regularization technique. The combined approach is named APART: Diverse Skill Discovery using All Pairs with Ascending Reward and Dropout. We demonstrate that APART discovers all the possible skills in grid worlds with remarkably fewer samples than previous works. Motivated by the empirical success of APART, we further investigate an even simpler algorithm that achieves maximum skills by altering VIC, rescaling its intrinsic reward, and tuning the temperature of its softmax discriminator. We believe our findings shed light on the crucial factors underlying success of skill discovery algorithms in reinforcement learning.
    摘要 我们研究了无奖环境中多样化技能发现,旨在发现所有可能的技能在简单的格子世界环境中。这个问题被形式化为用内生奖和预测技能的探测器进行互训练技能。我们的初始解决方案是将标准的一对一(所有对)探测器取代一个一对多(softmax)探测器,并将其与一种新的内生奖函数和掉帽正则化技术相结合。这种结合方法被称为APART:多样化技能发现使用所有对with Ascending奖和掉帽。我们示出了APART在格子世界中可以很快地发现所有可能的技能,远远少于之前的实验成果。受APART的实验成功的激发,我们进一步调查了一种最简单的算法,通过修改VIC、调整其内生奖、并调整探测器的温度来实现最大技能。我们认为我们的发现可以透视到涉及到奖励学习算法成功的关键因素。

The GENEA Challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings

  • paper_url: http://arxiv.org/abs/2308.12646
  • repo_url: None
  • paper_authors: Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter
  • for: 这个研究报告描述了2023年的GENEA挑战,参与者们构建了基于语音的手势生成系统,并进行了共同评估。
  • methods: 参与者们使用了同一个语音和动作数据集,并使用了语音和动作的同时评估。
  • results: 研究发现了参与者们的动作human-likeness有很大的差异,只有一些系统被评估为近似人工捕捉数据。并且,适用性问题尚未得到解决,大多数提交系统只能在有限的范围内 slightly above chance 水平。
    Abstract This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year's challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent's own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent's own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor. Additional material is available via the project website at https://svito-zar.github.io/GENEAchallenge2023/ .
    摘要 本文报告了 GENEA Challenge 2023:各参赛队伍使用相同的语音与动作数据集构建语音驱动的手势生成系统,随后进行联合评测。今年的挑战提供了双人交互双方的数据,参赛队伍需要根据智能体自身的语音(文本与音频)以及对话者的语音和动作,为该智能体生成全身动作。我们在多项大规模用户研究中对 12 个提交系统、2 个基线系统以及保留的动作捕捉数据进行了评测。研究关注三个方面:1)动作的拟人程度;2)在控制拟人程度的条件下,动作与智能体自身语音的匹配程度;3)在同时控制拟人程度和智能体自身语音的条件下,动作与对话者交互行为的匹配程度。我们发现各提交系统在拟人程度上差异很大,只有少数系统接近真实动作捕捉数据的水平。匹配程度问题远未解决,大多数系统仅略高于随机水平,远逊于自然动作;对话者行为的影响更为微弱,提交系统最多也只是勉强高于随机水平。有趣的是,一个双人系统对智能体自身语音高度匹配,并不意味着它对对话者行为同样匹配。更多材料见项目网站:https://svito-zar.github.io/GENEAchallenge2023/ 。

Towards Hierarchical Regional Transformer-based Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2308.12634
  • repo_url: None
  • paper_authors: Josef Cersovsky, Sadegh Mohammadi, Dagmar Kainmueller, Johannes Hoehne
  • for: 这个研究旨在提高大比例 Histopathology 图像的分类 task 的性能,使用深度多实例学习模型。
  • methods: 该方法使用 Transformer 基于自注意机制,并将 regional 自注意机制应用于 Vision Transformer 中。方法还利用区域融合来Derive slide-level 预测,并可以在不同距离水平上堆叠进行特征处理。
  • results: 该方法在两个 Histopathology 数据集上显著提高了性能,特别是对于具有小本地形态特征的数据集。
    Abstract The classification of gigapixel histopathology images with deep multiple instance learning models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based multiple instance learning approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions and show how this regional aggregation can be stacked to hierarchically process features on different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method to focus the image processing on high attention regions during inference. Our approach is able to significantly improve performance over the baseline on two histopathology datasets and points towards promising directions for further research.
    摘要 在数字病理学和精准医学中,利用深度多实例学习模型对十亿像素级组织病理图像进行分类已成为一项关键任务。在这项工作中,我们提出了一种基于 Transformer 的多实例学习方法,用受 Vision Transformer 启发的区域自注意力机制取代传统的可学习注意力机制。我们提出了一种融合区域图块信息以得到切片级预测的方法,并展示了这种区域聚合可以逐级堆叠,从而在不同距离尺度上分层处理特征。为了提高预测精度,尤其是在具有细小局部形态特征的数据集上,我们引入了一种在推理时将图像处理聚焦于高注意力区域的方法。我们的方法在两个组织病理数据集上显著超越了基线,并为进一步研究指出了有前景的方向。

Uncertainty and Explainable Analysis of Machine Learning Model for Reconstruction of Sonic Slowness Logs

  • paper_url: http://arxiv.org/abs/2308.12625
  • repo_url: None
  • paper_authors: Hua Wang, Yuqiong Wu, Yushun Zhang, Fuqiang Lai, Zhou Feng, Bing Xie, Ailin Zhao
  • for: 这个论文的目的是用机器学习算法预测在垂直或老井中缺失的压缩波慢速度和剪切波慢速度记录,以便在钻井Field应用中减少缺失的问题。
  • methods: 这个论文使用了2020年的机器学习竞赛数据,并使用NGBoost算法构建了一个ensemble学习模型,以预测缺失的压缩波慢速度和剪切波慢速度记录。此外,使用SHAP方法来探究机器学习模型的可解性。
  • results: 研究发现,NGBoost模型在测试集中表现良好,可以提供预测结果的概率分布。此外,对预测结果的变化进行了评估,并发现预测结果的变化与 neutron气压和γ射线的大小有关,这与石油物理模型的认知相符。此外,机器学习模型还捕捉了钻井尺寸的变化对 slowness的影响,这种影响是复杂的,不易建立直接关系。这些发现与物理原理相符。
    Abstract Logs are valuable information for oil and gas fields as they help to determine the lithology of the formations surrounding the borehole and the location and reserves of subsurface oil and gas reservoirs. However, important logs are often missing in horizontal or old wells, which poses a challenge in field applications. In this paper, we utilize data from the 2020 machine learning competition of the SPWLA, which aims to predict the missing compressional wave slowness and shear wave slowness logs using other logs in the same borehole. We employ the NGBoost algorithm to construct an Ensemble Learning model that can predicate the results as well as their uncertainty. Furthermore, we combine the SHAP method to investigate the interpretability of the machine learning model. We compare the performance of the NGBosst model with four other commonly used Ensemble Learning methods, including Random Forest, GBDT, XGBoost, LightGBM. The results show that the NGBoost model performs well in the testing set and can provide a probability distribution for the prediction results. In addition, the variance of the probability distribution of the predicted log can be used to justify the quality of the constructed log. Using the SHAP explainable machine learning model, we calculate the importance of each input log to the predicted results as well as the coupling relationship among input logs. Our findings reveal that the NGBoost model tends to provide greater slowness prediction results when the neutron porosity and gamma ray are large, which is consistent with the cognition of petrophysical models. Furthermore, the machine learning model can capture the influence of the changing borehole caliper on slowness, where the influence of borehole caliper on slowness is complex and not easy to establish a direct relationship. These findings are in line with the physical principle of borehole acoustics.
    摘要 批处是钻井场中的重要信息,它们可以帮助确定附近钻井的地层学特性和油气储量。然而,有些重要的批处在水平或老钻井中缺失,这会对钻井场的应用带来挑战。在这篇论文中,我们使用2020年机器学习竞赛的SPWLA数据,以预测缺失的压缩波慢速和剪切波慢速批处。我们使用NGBoost算法构建了一个ensemble学习模型,可以预测结果以及其不确定性。此外,我们使用SHAP方法来调查机器学习模型的可解释性。我们与四种常用的ensemble学习方法进行比较,包括Random Forest、GBDT、XGBoost和LightGBM。结果显示,NGBoost模型在测试集中表现良好,并可以提供预测结果的概率分布。此外,预测结果的不确定性的方差可以用来评估模型的质量。使用SHAP可解释机器学习模型,我们计算了每个输入批处对预测结果的重要性以及输入批处之间的相互关系。我们的发现表明,NGBoost模型在大于钻井内 neutron气压和γ射线时表现出更好的慢速预测结果,这与岩石物理模型的认知一致。此外,机器学习模型可以捕捉随着钻井压力的变化,钻井压力与批处之间的复杂关系。这与物理原理的钻井声学有关。

Try with Simpler – An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.12612
  • repo_url: None
  • paper_authors: Lin Yang, Junjie Chen, Zhihao Gong, Shutao Gao, Hongyu Zhang, Yue Kang, Huaan Li
  • for: 这个研究的目的是强化传统机器学习和数据挖掘技术,以提高logs中的异常探测效能。
  • methods: 本研究使用了优化的无supervised PCA(主成分分析)技术,将logs中的 semantic-based representation与数据分析相结合,以解决无法在训练数据中看到的问题。
  • results: 结果显示,优化的PCA技术与进阶的supervised/semi-supervised深度学习方法的效能相似,并且在有限的训练数据和资源下更稳定。
    Abstract The rapid growth of deep learning (DL) has spurred interest in enhancing log-based anomaly detection. This approach aims to extract meaning from log events (log message templates) and develop advanced DL models for anomaly detection. However, these DL methods face challenges like heavy reliance on training data, labels, and computational resources due to model complexity. In contrast, traditional machine learning and data mining techniques are less data-dependent and more efficient but less effective than DL. To make log-based anomaly detection more practical, the goal is to enhance traditional techniques to match DL's effectiveness. Previous research in a different domain (linking questions on Stack Overflow) suggests that optimized traditional techniques can rival state-of-the-art DL methods. Drawing inspiration from this concept, we conducted an empirical study. We optimized the unsupervised PCA (Principal Component Analysis), a traditional technique, by incorporating lightweight semantic-based log representation. This addresses the issue of unseen log events in training data, enhancing log representation. Our study compared seven log-based anomaly detection methods, including four DL-based, two traditional, and the optimized PCA technique, using public and industrial datasets. Results indicate that the optimized unsupervised PCA technique achieves similar effectiveness to advanced supervised/semi-supervised DL methods while being more stable with limited training data and resource-efficient. This demonstrates the adaptability and strength of traditional techniques through small yet impactful adaptations.
    摘要 深度学习(DL)的快速发展激发了对日志基本异常检测的改进。这种方法的目标是从日志事件模板中提取意义并开发高级DL模型进行异常检测。然而,这些DL方法面临着强依赖于训练数据、标签和计算资源的挑战,因为模型的复杂性。与此相反,传统的机器学习和数据挖掘技术更加不依赖于数据,更加高效,但也更加效率。为了让日志基本异常检测更加实用,目标是提高传统技术,使其与DL的效果相匹配。前一个研究(在Stack Overflow上的问题链接)表明,优化传统技术可以与当前DL方法相当有效。以这个概念为发想,我们进行了一个实验研究。我们对不带标签的PCA(主成分分析)进行了优化,通过 incorporating lightweight semantic-based log representation来解决训练数据中未见的日志事件问题。我们对公共和工业 dataset 进行了七种日志基本异常检测方法的比较,其中包括四种DL基本、两种传统、优化PCA技术。结果表明,优化的无标签PCA技术与高级指导/半指导DL方法相当有效,同时更加稳定,资源更加有效。这种示例展示了传统技术的适应性和强大性,通过小 yet 有影响的改进。
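A generic sketch of the unsupervised PCA idea: fit PCA on vectors derived from normal log sequences and flag test sequences with a large reconstruction error. The random vectors, component count, and percentile threshold are placeholders; the paper's contribution lies in the lightweight semantic-based log representation fed into this kind of pipeline.

```python
# Unsupervised PCA anomaly scoring via reconstruction error (generic sketch).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 20))           # stand-in for log sequence vectors
anomalous = rng.normal(4, 1, size=(5, 20))
X_train, X_test = normal[:400], np.vstack([normal[400:], anomalous])

pca = PCA(n_components=5).fit(X_train)
recon = pca.inverse_transform(pca.transform(X_test))
scores = np.linalg.norm(X_test - recon, axis=1)
threshold = np.percentile(scores[:100], 99)          # calibrate on held-out normal data
print("flagged:", int((scores > threshold).sum()), "of", len(scores))
```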

A Greedy Approach for Offering to Telecom Subscribers

  • paper_url: http://arxiv.org/abs/2308.12606
  • repo_url: None
  • paper_authors: Piyush Kanti Bhunre, Tanmay Sen, Arijit Sarkar
  • for: The paper is written for telecom operators to optimize offer campaigns to retain subscribers and prevent churn.
  • methods: The paper proposes a novel combinatorial algorithm to solve offer optimization under heterogeneous offers by maximizing expected revenue under the scenario of subscriber churn.
  • results: The proposed algorithm is efficient and accurate even for a very large subscriber-base.
    Abstract Customer retention or churn prevention is a challenging task of a telecom operator. One of the effective approaches is to offer some attractive incentive or additional services or money to the subscribers for keeping them engaged and make sure they stay in the operator's network for longer time. Often, operators allocate certain amount of monetary budget to carry out the offer campaign. The difficult part of this campaign is the selection of a set of customers from a large subscriber-base and deciding the amount that should be offered to an individual so that operator's objective is achieved. There may be multiple objectives (e.g., maximizing revenue, minimizing number of churns) for selection of subscriber and selection of an offer to the selected subscriber. Apart from monetary benefit, offers may include additional data, SMS, hots-spot tethering, and many more. This problem is known as offer optimization. In this paper, we propose a novel combinatorial algorithm for solving offer optimization under heterogeneous offers by maximizing expected revenue under the scenario of subscriber churn, which is, in general, seen in telecom domain. The proposed algorithm is efficient and accurate even for a very large subscriber-base.
    摘要 客户退订或防退是电信运营商面临的挑战之一。一种有效的方法是向用户提供吸引人的折扣或附加服务,以保持用户的兴趣和使他们尽量长时间留在运营商的网络中。经常,运营商会分配一定的财务预算来实施优惠活动。选择一个来自大量用户基数的 subset 并决定每个用户所需的金额是Operator的目标实现的困难部分。有多个目标(例如,最大化收入,最小化退订数)可以用来选择用户和选择优惠给选择的用户。除了金钱的利益外,优惠可能包括额外数据、SMS、热点终端等多种服务。这个问题被称为优惠优化。在这篇论文中,我们提出了一种新的 combinatorial 算法,用于在不同类型的优惠下对客户优惠进行优化,以达到预期收入的最大化,这是在通信领域中一般存在的退订问题。提议的算法是高效和准确,即使用户基数非常大。
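The paper's combinatorial algorithm is not spelled out in the abstract, so the following is only a generic greedy sketch of the underlying allocation problem: pick subscriber-offer pairs by expected revenue gain per unit cost until the campaign budget is exhausted. All numbers and the uplift estimates are hypothetical.

```python
# Greedy budgeted offer allocation (illustrative, not the paper's algorithm).
import heapq

subscribers = {
    "s1": {"offer_a": (120.0, 10.0), "offer_b": (150.0, 25.0)},  # (expected gain, cost)
    "s2": {"offer_a": (40.0, 10.0)},
    "s3": {"offer_a": (90.0, 10.0), "offer_b": (95.0, 25.0)},
}
budget = 30.0

heap = [(-gain / cost, sub, offer, gain, cost)
        for sub, offers in subscribers.items()
        for offer, (gain, cost) in offers.items()]
heapq.heapify(heap)

chosen, spent = {}, 0.0
while heap:
    _, sub, offer, gain, cost = heapq.heappop(heap)
    if sub in chosen or spent + cost > budget:
        continue            # one offer per subscriber, stay within budget
    chosen[sub] = offer
    spent += cost

print(chosen, spent)        # three cheap offers fit within the 30-unit budget
```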

Exploiting Time-Frequency Conformers for Music Audio Enhancement

  • paper_url: http://arxiv.org/abs/2308.12599
  • repo_url: None
  • paper_authors: Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee
  • for: 提高网络视频平台上音乐表演录音质量
  • methods: 基于Conformer架构,利用注意机制进行音乐提升
  • results: 实现单音轨提升和多轨混合音乐提升,达到领先水平
    Abstract With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.
    摘要

LORD: Leveraging Open-Set Recognition with Unknown Data

  • paper_url: http://arxiv.org/abs/2308.12584
  • repo_url: None
  • paper_authors: Tobias Koch, Christian Riess, Thomas Köhler
  • for: 本研究旨在提高类фика器对未知数据的识别能力, addresses the challenge of handling entirely unknown data for deployed classifiers.
  • methods: 本研究提出了一种名为LORD的框架, Leverage Open-set Recognition by exploiting unknown Data。LORD在类ifier培训过程中直接模型开放空间,并提供了一系列模型独立的训练策略。
  • results: 根据研究表明,LORD可以提高未知数据的识别率,并且可以避免依赖于大量和昂贵的背景数据。此外,研究还发现了一种名为mixup的数据生成技术,可以作为背景数据的替代品,并且可以further improve OSR performance。
    Abstract Handling entirely unknown data is a challenge for any deployed classifier. Classification models are typically trained on a static pre-defined dataset and are kept in the dark for the open unassigned feature space. As a result, they struggle to deal with out-of-distribution data during inference. Addressing this task on the class-level is termed open-set recognition (OSR). However, most OSR methods are inherently limited, as they train closed-set classifiers and only adapt the downstream predictions to OSR. This work presents LORD, a framework to Leverage Open-set Recognition by exploiting unknown Data. LORD explicitly models open space during classifier training and provides a systematic evaluation for such approaches. We identify three model-agnostic training strategies that exploit background data and applied them to well-established classifiers. Due to LORD's extensive evaluation protocol, we consistently demonstrate improved recognition of unknown data. The benchmarks facilitate in-depth analysis across various requirement levels. To mitigate dependency on extensive and costly background datasets, we explore mixup as an off-the-shelf data generation technique. Our experiments highlight mixup's effectiveness as a substitute for background datasets. Lightweight constraints on mixup synthesis further improve OSR performance.
    摘要 处理完全未知数据是任何部署分类器的挑战。分类模型通常在静态预先定义的数据集上训练,因此在推理时难以处理不同步数据。为解决这个问题,我们提出了开放集 recognition(OSR)技术。然而,大多数OSR方法都受限于它们只是在closed-set分类器上进行适应,而不是直接训练开放集分类器。本文介绍了LORD框架,它可以在分类器训练过程中显式地模型开放空间,并提供了一种系统的评估方法。我们确定了三种模型不依赖的训练策略,并应用于已成熟的分类器。由于LORD的广泛评估协议,我们在不同的需求水平上 consistently 示出了对未知数据的更好的识别。这些标准化的协议使得可以进行深入的分析。为了减少依赖于费时和成本高的背景数据集,我们探索了mixup作为一种可用的数据生成技术。我们的实验表明,mixup可以作为背景数据集的替代品。进一步的轻量级约束可以进一步提高OSR性能。
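Mixup as a substitute for background data can be sketched in a few lines: convex combinations of known-class samples serve as synthetic unknowns during open-set training. This is a generic rendering of the technique, not LORD's exact training recipe or constraint schedule.

```python
# Mixup-style synthesis of "unknown-like" samples from known-class images.
import numpy as np

def mixup_unknowns(x_batch, alpha=1.0, rng=np.random.default_rng(0)):
    lam = rng.beta(alpha, alpha, size=len(x_batch))
    perm = rng.permutation(len(x_batch))
    lam_b = lam.reshape(-1, *([1] * (x_batch.ndim - 1)))
    return lam_b * x_batch + (1.0 - lam_b) * x_batch[perm]

images = np.random.default_rng(1).uniform(size=(16, 3, 32, 32)).astype(np.float32)
unknown_like = mixup_unknowns(images)
print(unknown_like.shape, unknown_like.min() >= 0, unknown_like.max() <= 1)
```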

Persistent learning signals and working memory without continuous attractors

  • paper_url: http://arxiv.org/abs/2308.12585
  • repo_url: None
  • paper_authors: Il Memming Park, Ábel Ságodi, Piotr Aleksander Sokół
  • for: 这个论文探讨了神经动力系统中稳定吸引结构,如点吸引器和连续吸引器,是否能够支持有用的时间学习信号,以适应环境中的时间结构变化。
  • methods: 作者使用了Periodic和 quasi-Periodic吸引器来支持学习无限长的时间关系。与Continuous吸引器不同, quasi-Periodic吸引器具有免疫细调问题,使其更适合学习生成时间结构。
  • results: 作者发现,Periodic和 quasi-Periodic吸引器可以支持学习无限长的时间关系,并且不受细调问题的影响。此外,作者还提出了一种新的初始化方案,可以使 искусствен neural network在学习时间动力学任务时表现出更好的性能。最后,作者还提出了一种Robust recurrent memory机制,可以在缺少环形吸引器的情况下,将头向量维护和 инте格。
    Abstract Neural dynamical systems with stable attractor structures, such as point attractors and continuous attractors, are hypothesized to underlie meaningful temporal behavior that requires working memory. However, working memory may not support useful learning signals necessary to adapt to changes in the temporal structure of the environment. We show that in addition to the continuous attractors that are widely implicated, periodic and quasi-periodic attractors can also support learning arbitrarily long temporal relationships. Unlike the continuous attractors that suffer from the fine-tuning problem, the less explored quasi-periodic attractors are uniquely qualified for learning to produce temporally structured behavior. Our theory has broad implications for the design of artificial learning systems and makes predictions about observable signatures of biological neural dynamics that can support temporal dependence learning and working memory. Based on our theory, we developed a new initialization scheme for artificial recurrent neural networks that outperforms standard methods for tasks that require learning temporal dynamics. Moreover, we propose a robust recurrent memory mechanism for integrating and maintaining head direction without a ring attractor.
    摘要 神经动力系统with稳定吸引结构,如点吸引器和连续吸引器,被假设在有用的时间行为中存在。然而,工作记忆可能无法提供有用的学习信号,以适应环境中的时间结构变化。我们表明,除了广泛被推荐的连续吸引器之外, periodic和 quasi-periodic吸引器也可以支持学习无限长的时间关系。与连续吸引器相比,quasi-periodic吸引器具有独特优势,可以学习生成时间结构化的行为。我们的理论具有广泛的应用前景,可以设计人工学习系统,并预测生物神经动力学中可以支持时间依赖学习和工作记忆的可观察特征。根据我们的理论,我们开发了一种新的初始化方案,可以超过标准方法在需要学习时间动力学任务中表现更好。此外,我们提出了一种可靠的回忆机制,可以融合和维护方向指向,而不需要环形吸引器。

A Huber Loss Minimization Approach to Byzantine Robust Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12581
  • repo_url: None
  • paper_authors: Puning Zhao, Fei Yu, Zhiguo Wan
  • for: 强制学习系统受到敌意攻击的威胁,我们提出一种基于哈伯损函整合方法,并进行了全面的理论分析。
  • methods: 我们的方法基于哈伯损函整合,并且在独立同分布(i.i.d)假设下有以下优点:首先,它具有优化的 $\epsilon$ 依赖性,其中 $\epsilon$ 表示攻击客户端的比率;其次,我们的方法不需要准确地知道 $\epsilon$;最后,它允许客户端有不同的数据大小。
  • results: 我们扩展了我们的分析至非i.i.d数据,包括客户端有轻微不同的分布。
    Abstract Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under independent and identically distributed (i.i.d) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on $\epsilon$, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of $\epsilon$. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d data, such that clients have slightly different distributions.
    摘要 联邦学习系统容易受到对抗攻击。为此,我们提出了一种基于 Huber 损失最小化的新型聚合器,并给出了全面的理论分析。在独立同分布(i.i.d.)假设下,与现有方法相比,我们的方法具有以下优点:首先,它对受攻击客户端比例 $\epsilon$ 具有最优的依赖关系;其次,我们的方法不需要精确知道 $\epsilon$;第三,它允许各客户端拥有不等的数据量。随后,我们将分析扩展到非独立同分布数据,即各客户端的分布存在轻微差异的情形。
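The abstract specifies an aggregator based on Huber loss minimization but not a particular solver. The numpy sketch below minimizes the sum of Huber losses of the distances to the client updates by plain gradient descent, which down-weights outlying (potentially Byzantine) updates; the step size, delta, and the toy Byzantine update are illustrative assumptions.

```python
# Robust aggregation: minimize sum_i Huber(||z - x_i||) over the aggregate z.
import numpy as np

def huber_aggregate(updates, delta=1.0, lr=0.1, steps=200):
    z = updates.mean(axis=0)                        # warm start at the plain average
    for _ in range(steps):
        diff = z - updates                          # (n_clients, dim)
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12
        # Gradient of Huber(||z - x_i||): linear inside delta, clipped outside.
        grad = np.where(dist <= delta, diff, delta * diff / dist).sum(axis=0)
        z -= lr * grad / len(updates)
    return z

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(9, 4))
byzantine = np.full((1, 4), 50.0)                   # one attacker sends a huge update
agg = huber_aggregate(np.vstack([honest, byzantine]))
print(np.round(agg, 2))                             # stays close to [1, 1, 1, 1]
```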

Hypergraph Convolutional Networks for Fine-grained ICU Patient Similarity Analysis and Risk Prediction

  • paper_url: http://arxiv.org/abs/2308.12575
  • repo_url: None
  • paper_authors: Yuxi Liu, Zhenhao Zhang, Shaowen Qin, Flora D. Salim, Antonio Jimeno Yepes, Jun Shen
  • For: 预测患者死亡风险* Methods: 使用 Hypergraph Convolutional Network represent 非对应关系(如诊断代码),以捕捉隐藏特征结构,计算细化患者相似性* Results: 在使用 eICU Collaborative Research Database 评估中,方法与状态艺前模型相比,实现了更高的死亡风险预测性能,并通过多个案例研究,表明图网络可以提供良好的透明度和可靠性在决策中。Here’s the translation in Simplified Chinese:* For: 预测患者死亡风险* Methods: 使用 Hypergraph Convolutional Network represent 非对应关系(如诊断代码),以捕捉隐藏特征结构,计算细化患者相似性* Results: 在使用 eICU Collaborative Research Database 评估中,方法与状态艺前模型相比,实现了更高的死亡风险预测性能,并通过多个案例研究,表明图网络可以提供良好的透明度和可靠性在决策中。
    Abstract The Intensive Care Unit (ICU) is one of the most important parts of a hospital, which admits critically ill patients and provides continuous monitoring and treatment. Various patient outcome prediction methods have been attempted to assist healthcare professionals in clinical decision-making. Existing methods focus on measuring the similarity between patients using deep neural networks to capture the hidden feature structures. However, the higher-order relationships are ignored, such as patient characteristics (e.g., diagnosis codes) and their causal effects on downstream clinical predictions. In this paper, we propose a novel Hypergraph Convolutional Network that allows the representation of non-pairwise relationships among diagnosis codes in a hypergraph to capture the hidden feature structures so that fine-grained patient similarity can be calculated for personalized mortality risk prediction. Evaluation using a publicly available eICU Collaborative Research Database indicates that our method achieves superior performance over the state-of-the-art models on mortality risk prediction. Moreover, the results of several case studies demonstrated the effectiveness of constructing graph networks in providing good transparency and robustness in decision-making.
    摘要 医院重症监护室(ICU)是医院中最重要的部分之一, admit 重症病人并提供连续监测和治疗。不同的患者结果预测方法已经被尝试以协助医疗专业人员进行临床决策。现有方法主要是通过深度神经网络捕捉患者特征结构的隐藏关系,但是忽略了患者特征(例如诊断代码)和其影响下游临床预测的 causal 关系。在本文中,我们提出了一种新的 Hypergraph Convolutional Network,允许在幂图中表示诊断代码之间的非对比关系,以捕捉隐藏特征结构,从而计算出细致的患者相似性,用于个性化死亡风险预测。经过使用公共可用的 eICU Collaborative Research Database 评估,我们的方法在死亡风险预测中超过了当前状态的模型性能。此外,多个案例研究表明,在做出决策时,建立图网络可以提供良好的透明度和可靠性。
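The paper's exact architecture is not given here, but a standard hypergraph-convolution layer illustrates how non-pairwise relations such as shared diagnosis codes propagate information. The incidence matrix and dimensions below are toy values; the update follows the common formulation X' = sigma(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta).

```python
# One hypergraph-convolution layer in matrix form (standard formulation, toy sizes).
import numpy as np

rng = np.random.default_rng(0)
# 6 patients, 3 hyperedges (e.g., shared diagnosis codes); H is the incidence matrix.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
X = rng.normal(size=(6, 4))            # patient features
Theta = rng.normal(size=(4, 2))        # learnable projection
W = np.eye(3)                          # hyperedge weights

Dv = np.diag(H @ W @ np.ones(3))       # node degrees
De = np.diag(H.sum(axis=0))            # hyperedge degrees
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Dv)))
De_inv = np.diag(1.0 / np.diag(De))

X_next = np.maximum(Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta, 0)
print(X_next.shape)                    # (6, 2)
```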

Conditional Kernel Imitation Learning for Continuous State Environments

  • paper_url: http://arxiv.org/abs/2308.12573
  • repo_url: None
  • paper_authors: Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar
  • for: 本研究的目的是在离散状态空间环境中进行模仿学习,无需transition dynamics信息、奖励结构或任何额外交互。
  • methods: 我们的方法基于Markov balance equation,使用conditional kernel density estimator来估计环境的转移动力学,并尝试满足环境的 probabilistic balance equations。
  • results: 我们通过对离散状态 benchmark环境的数字实验表明,我们的方法在empirical性能方面具有显著优势,常常高于现有的IL算法。
    Abstract Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most of RL, it does not assume availability of reward-feedback. Reward inference and shaping are known to be difficult and error-prone methods particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel conditional kernel density estimation-based imitation learning framework. It involves estimating the environment's transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
    摘要 模仿学习(Imitation Learning,IL)是强化学习(RL)方法体系中的一个重要范式。与大多数 RL 方法不同,IL 不假设能够获得奖励反馈。奖励推断与奖励塑造是困难且易出错的方法,尤其是当示范数据来自人类专家时。行为克隆(Behavioral Cloning)和逆强化学习(Inverse Reinforcement Learning)等经典方法对估计误差高度敏感,这一问题在连续状态空间中尤为突出。而当前最先进的 IL 算法通常将行为策略学习问题转化为分布匹配问题,往往需要额外的在线交互数据才能奏效。在本文中,我们研究仅基于观测到的行为、在连续状态空间环境中进行模仿学习的问题,不需要转移动力学信息、奖励结构,更重要的是不需要任何额外的环境交互。我们的方法基于马尔可夫平衡方程,提出了一种新的基于条件核密度估计的模仿学习框架:使用条件核密度估计器来估计环境的转移动力学,并力求满足环境的概率平衡方程。我们证明了所用估计器满足基本的渐近一致性要求。通过在连续状态基准环境上的一系列数值实验,我们展示了相对于许多最先进 IL 算法持续更优的实证性能。
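A generic Nadaraya-Watson-style conditional kernel density estimate of p(s' | s, a) from observed transitions, in the spirit of the estimator the abstract describes; the toy dynamics, bandwidths, and Gaussian kernels are illustrative choices rather than the paper's construction.

```python
# Conditional kernel density estimate of the transition density p(s' | s, a).
import numpy as np

def gaussian_kernel(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def conditional_density(s_next, s, a, data, h_sa=0.1, h_next=0.2):
    S, A, S_next = data                              # arrays of observed transitions
    w = gaussian_kernel(S - s, h_sa) * gaussian_kernel(A - a, h_sa)
    k = gaussian_kernel(S_next - s_next, h_next)
    return float((w * k).sum() / (w.sum() + 1e-12))

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, 5000)
A = rng.choice([-0.1, 0.1], size=5000)
S_next = S + A + rng.normal(0, 0.05, 5000)           # simple noisy toy dynamics

data = (S, A, S_next)
# Landing near 0.6 after taking a = +0.1 in s = 0.5 should be far more likely than 0.0.
print(conditional_density(0.6, s=0.5, a=0.1, data=data),
      conditional_density(0.0, s=0.5, a=0.1, data=data))
```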

Multivariate Time-Series Anomaly Detection with Contaminated Data: Application to Physiological Signals

  • paper_url: http://arxiv.org/abs/2308.12563
  • repo_url: None
  • paper_authors: Thi Kieu Khanh Ho, Narges Armanfard
  • for: 本研究旨在提出一种实用的无监督时间序列异常检测方法(TSAD),可以在受到噪声的训练数据下进行异常检测。
  • methods: 该方法包括三个模块:一个去噪模块,可以 rectify the anomalies(即噪声)在训练数据中;一个变量依赖模型模块,可以捕捉长期内部和间部变量之间的依赖关系,用作净正常数据的代表;以及一个异常分数模块,用于检测异常。
  • results: 经过广泛的实验表明,该方法在三个通用的生理学数据集上的性能都超过了现有的方法,因此成功地建立了新的领先性。
    Abstract Mainstream unsupervised anomaly detection algorithms often excel in academic datasets, yet their real-world performance is restricted due to the controlled experimental conditions involving clean training data. Addressing the challenge of training with noise, a prevalent issue in practical anomaly detection, is frequently overlooked. In a pioneering endeavor, this study delves into the realm of label-level noise within sensory time-series anomaly detection (TSAD). This paper presents a novel and practical end-to-end unsupervised TSAD when the training data are contaminated with anomalies. The introduced approach, called TSAD-C, is devoid of access to abnormality labels during the training phase. TSAD-C encompasses three modules: a Decontaminator to rectify the abnormalities (aka noise) present in the training data, a Variable Dependency Modeling module to capture both long-term intra- and inter-variable dependencies within the decontaminated data that can be considered as a surrogate of the pure normal data, and an Anomaly Scoring module to detect anomalies. Our extensive experiments conducted on three widely used physiological datasets conclusively demonstrate that our approach surpasses existing methodologies, thus establishing a new state-of-the-art performance in the field.
    摘要 主流无监督异常检测算法在学术数据集中表现出色,然而在实际应用中它们的性能受到干扰。因为干扰是实际异常检测中的一个普遍存在的问题,而这种干扰在训练数据中的存在 frequently overlooked。本研究探索了标签水平干扰在感知时序异常检测(TSAD)中的挑战。本文提出了一种新的实用无监督TSAD方法,称为TSAD-C,它在训练数据中存在异常时不具备对异常标签的访问。TSAD-C包括三个模块:一个Rectifier来修正训练数据中的异常(即干扰),一个变量依赖模型来捕捉训练数据中的长期内部和间部变量关系,以及一个异常检测模块来检测异常。我们对三个广泛使用的生理数据集进行了广泛的实验,结果表明,我们的方法超越了现有方法,因此在该领域成立了新的状态态-of-the-art表现。

Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions

  • paper_url: http://arxiv.org/abs/2308.12562
  • repo_url: None
  • paper_authors: Kwan Ho Ryan Chan, Aditya Chattopadhyay, Benjamin David Haeffele, Rene Vidal
  • for: 这个研究的目的是扩展Variational Information Pursuit(V-IP)框架,使其能够在更大规模的任务上进行透明预测。
  • methods: 该研究使用两步进程,首先使用大型自然语言模型(LLM)生成足够多的任务相关和可解释性概念集,然后使用大型多Modal模型对每个数据样本进行 semantic similarity 标注。
  • results: 研究表明,使用LM+V-IP方法可以在测试性能和透明度之间做出平衡,并且在其他可解释性框架 such as Concept Bottleneck Models(CBMs)中使用更少的概念/查询可以达到类似的测试性能。
    Abstract Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design by sequentially selecting a short chain of task-relevant, user-defined and interpretable queries about the data that are most informative for the task. While this allows for built-in interpretability in predictive models, applying V-IP to any task requires data samples with dense concept-labeling by domain experts, limiting the application of V-IP to small-scale tasks where manual data annotation is feasible. In this work, we extend the V-IP framework with Foundational Models (FMs) to address this limitation. More specifically, we use a two-step process, by first leveraging Large Language Models (LLMs) to generate a sufficiently large candidate set of task-relevant interpretable concepts, then using Large Multimodal Models to annotate each data sample by semantic similarity with each concept in the generated concept set. While other interpretable-by-design frameworks such as Concept Bottleneck Models (CBMs) require an additional step of removing repetitive and non-discriminative concepts to have good interpretability and test performance, we mathematically and empirically justify that, with a sufficiently informative and task-relevant query (concept) set, the proposed FM+V-IP method does not require any type of concept filtering. In addition, we show that FM+V-IP with LLM generated concepts can achieve better test performance than V-IP with human annotated concepts, demonstrating the effectiveness of LLMs at generating efficient query sets. Finally, when compared to other interpretable-by-design frameworks such as CBMs, FM+V-IP can achieve competitive test performance using fewer number of concepts/queries in both cases with filtered or unfiltered concept sets.
    摘要 变分信息追寻(Variational Information Pursuit,V-IP)是一个通过设计实现可解释预测的框架:它按顺序选择一小串与任务相关、由用户定义且可解释的数据查询,这些查询对任务最具信息量。这使预测模型内置了可解释性,但将 V-IP 应用到任何任务都需要由领域专家对数据样本进行密集的概念标注,因而 V-IP 只能用于人工标注可行的小规模任务。在这项工作中,我们用基础模型(FM)扩展 V-IP 框架以解决这一限制。具体而言,我们采用两步流程:首先利用大型语言模型(LLM)生成足够大的任务相关可解释概念候选集,然后使用大型多模态模型按语义相似度为每个数据样本标注概念集中的各个概念。与概念瓶颈模型(CBM)等其他可解释设计框架需要额外步骤去除重复和无判别力的概念才能获得良好的可解释性与测试性能不同,我们从数学和实验两方面论证:只要查询(概念)集足够有信息量且与任务相关,所提出的 FM+V-IP 方法不需要任何形式的概念过滤。此外,我们表明使用 LLM 生成概念的 FM+V-IP 可以取得比使用人工标注概念的 V-IP 更好的测试性能,说明 LLM 能够生成高效的查询集。最后,与 CBM 等其他可解释设计框架相比,无论概念集是否经过过滤,FM+V-IP 都能用更少的概念/查询取得有竞争力的测试性能。

Deep Reinforcement Learning-driven Cross-Community Energy Interaction Optimal Scheduling

  • paper_url: http://arxiv.org/abs/2308.12554
  • repo_url: None
  • paper_authors: Yang Li, Fanjin Bu, Zhen Yang, Bin Wang, Meng Han
  • for: 这篇论文是为了协调不同社区之间的能源交互和多种能源互补系统内部的能源转换,以及在不确定条件下实现整体能源系统的优化和调度。
  • methods: 该论文提出了一种基于多智能深度学习算法的全面调度模型,利用不同社区的负荷特征来做出决策。在该模型中,整体能源系统的调度问题被转化为一个Markov决策过程,并使用数据驱动的深度学习算法来解决。这种方法不需要模拟复杂的能源协同关系 между多个社区和多种能源互补系统。
  • results: 对于实验结果,提出的方法能够准确捕捉不同社区的负荷特征,并利用这些特征进行合理的能源交互协调。这导致风力浪费率从16.3%降至0%,并将总运行成本降低为5445.6元,表现出了明显的经济和环保效益。
    Abstract In order to coordinate energy interactions among various communities and energy conversions among multi-energy subsystems within the multi-community integrated energy system under uncertain conditions, and achieve overall optimization and scheduling of the comprehensive energy system, this paper proposes a comprehensive scheduling model that utilizes a multi-agent deep reinforcement learning algorithm to learn load characteristics of different communities and make decisions based on this knowledge. In this model, the scheduling problem of the integrated energy system is transformed into a Markov decision process and solved using a data-driven deep reinforcement learning algorithm, which avoids the need for modeling complex energy coupling relationships between multi-communities and multi-energy subsystems. The simulation results show that the proposed method effectively captures the load characteristics of different communities and utilizes their complementary features to coordinate reasonable energy interactions among them. This leads to a reduction in wind curtailment rate from 16.3% to 0% and lowers the overall operating cost by 5445.6 Yuan, demonstrating significant economic and environmental benefits.
    摘要 为了在不确定条件下协调多社区综合能源系统中不同社区之间的能源交互以及多能源子系统之间的能量转换,并实现综合能源系统的整体优化与调度,本文提出了一种综合调度模型,利用多智能体深度强化学习算法学习不同社区的负荷特性,并据此做出决策。在该模型中,综合能源系统的调度问题被转化为马尔可夫决策过程,并用数据驱动的深度强化学习算法求解,从而避免了对多社区与多能源子系统之间复杂能量耦合关系的建模。仿真结果表明,所提方法能有效捕捉不同社区的负荷特性,并利用其互补性协调合理的能源交互,使弃风率从 16.3% 降至 0%,总运行成本降低 5445.6 元,显示出显著的经济与环境效益。

Don’t blame Dataset Shift! Shortcut Learning due to Gradients and Cross Entropy

  • paper_url: http://arxiv.org/abs/2308.12553
  • repo_url: None
  • paper_authors: Aahlad Puli, Lily Zhang, Yoav Wald, Rajesh Ranganath
  • for: 这篇论文研究了 Default-ERM 算法在感知任务中的缺点,以及如何通过改变 inductive bias 来解决这个问题。
  • methods: 作者使用了一种 linear perception task 来研究 Default-ERM 的行为,并发现了 Default-ERM 在这种任务中的缺点。
  • results: 作者发现了一种基于 uniform margin 的 loss function,可以避免 Default-ERM 的短cut learning问题,并在多种视觉和语言任务上进行了实验,证明了这种 inductive bias 的有效性。
    Abstract Common explanations for shortcut learning assume that the shortcut improves prediction under the training distribution but not in the test distribution. Thus, models trained via the typical gradient-based optimization of cross-entropy, which we call default-ERM, utilize the shortcut. However, even when the stable feature determines the label in the training distribution and the shortcut does not provide any additional information, like in perception tasks, default-ERM still exhibits shortcut learning. Why are such solutions preferred when the loss for default-ERM can be driven to zero using the stable feature alone? By studying a linear perception task, we show that default-ERM's preference for maximizing the margin leads to models that depend more on the shortcut than the stable feature, even without overparameterization. This insight suggests that default-ERM's implicit inductive bias towards max-margin is unsuitable for perception tasks. Instead, we develop an inductive bias toward uniform margins and show that this bias guarantees dependence only on the perfect stable feature in the linear perception task. We develop loss functions that encourage uniform-margin solutions, called margin control (MARG-CTRL). MARG-CTRL mitigates shortcut learning on a variety of vision and language tasks, showing that better inductive biases can remove the need for expensive two-stage shortcut-mitigating methods in perception tasks.
    摘要 对捷径学习(shortcut learning)的常见解释假设:捷径特征在训练分布下能改进预测,但在测试分布下则不能。因此,通过典型的基于交叉熵的梯度优化训练的模型(我们称之为 default-ERM)会利用捷径。然而,即使在训练分布中稳定特征完全决定标签、捷径不提供任何额外信息(如感知任务中),default-ERM 仍然表现出捷径学习。既然仅凭稳定特征就能将 default-ERM 的损失降到零,为什么这类解依然被偏好?通过研究一个线性感知任务,我们表明 default-ERM 倾向于最大化间隔(margin),这使得模型即使在没有过参数化的情况下也更依赖捷径而非稳定特征。这一洞见表明 default-ERM 隐含的最大间隔归纳偏置并不适合感知任务。为此,我们提出了一种趋向均匀间隔的归纳偏置,并证明在线性感知任务中它能保证模型只依赖完美的稳定特征。我们设计了鼓励均匀间隔解的损失函数,称为 margin control(MARG-CTRL)。MARG-CTRL 在多种视觉和语言任务上缓解了捷径学习,表明更好的归纳偏置可以免去感知任务中昂贵的两阶段捷径缓解方法。
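
The exact MARG-CTRL losses are defined in the paper; as a rough, hypothetical illustration of the uniform-margin idea only, the sketch below adds a penalty that pushes each sample's logit margin toward a shared target value on top of cross-entropy. The function name, penalty form, and coefficients are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def uniform_margin_loss(logits, labels, target_margin=1.0, weight=0.1):
    """Cross-entropy plus a penalty that pushes every sample's logit margin
    (true-class logit minus best other-class logit) toward a common target.
    Illustrative stand-in for a uniform-margin objective, not MARG-CTRL itself."""
    ce = F.cross_entropy(logits, labels)
    true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Mask out the true class before taking the max over the remaining classes.
    masked = logits.masked_fill(F.one_hot(labels, logits.size(1)).bool(), float("-inf"))
    margin = true_logit - masked.max(dim=1).values
    return ce + weight * ((margin - target_margin) ** 2).mean()

# Usage: loss = uniform_margin_loss(model(x), y)
```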

A Co-training Approach for Noisy Time Series Learning

  • paper_url: http://arxiv.org/abs/2308.12551
  • repo_url: None
  • paper_authors: Weiqi Zhang, Jianfeng Zhang, Jia Li, Fugee Tsung
  • for: 本研究强调鲁棒时间序列表示学习。
  • methods: 我们采用了两个视图的encoder创建两个不同的视图,然后通过协同对照学习来学习encoder。
  • results: 我们的TS-CoT方法在四个时间序列benchmark上进行了实验,结果显示TS-CoT方法可以减轻数据噪声和损害的影响,并且可以 Transfer learning到下游任务。
    Abstract In this work, we focus on robust time series representation learning. Our assumption is that real-world time series is noisy and complementary information from different views of the same time series plays an important role while analyzing noisy input. Based on this, we create two views for the input time series through two different encoders. We conduct co-training based contrastive learning iteratively to learn the encoders. Our experiments demonstrate that this co-training approach leads to a significant improvement in performance. Especially, by leveraging the complementary information from different views, our proposed TS-CoT method can mitigate the impact of data noise and corruption. Empirical evaluations on four time series benchmarks in unsupervised and semi-supervised settings reveal that TS-CoT outperforms existing methods. Furthermore, the representations learned by TS-CoT can transfer well to downstream tasks through fine-tuning.
    摘要 在这项工作中,我们关注鲁棒的时间序列表示学习。我们假设真实世界中的时间序列带有噪声,而同一时间序列不同视图中的互补信息在分析含噪输入时发挥重要作用。基于这一假设,我们通过两个不同的编码器为输入时间序列创建两个视图,并迭代地进行基于协同训练的对比学习来学习这两个编码器。实验表明,这种协同训练方法能够显著提升性能。尤其是通过利用不同视图中的互补信息,我们提出的TS-CoT方法可以减轻数据噪声和损坏的影响。我们在四个时间序列基准数据集上进行了无监督和半监督设置下的实验,结果表明TS-CoT优于现有方法。此外,TS-CoT学习到的表示可以通过微调很好地迁移到下游任务。
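
As a minimal sketch of the co-training idea, the snippet below shows a symmetric contrastive (InfoNCE-style) loss between embeddings produced by two different encoders for the same batch of time series. The encoders themselves, the iterative co-training schedule, and the semi-supervised variant from the paper are not reproduced; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def two_view_contrastive_loss(z1, z2, temperature=0.1):
    """Symmetric InfoNCE-style loss between two views of the same batch.
    z1, z2: (batch, dim) embeddings from two different encoders.
    Illustrative only; TS-CoT's actual training procedure is more involved."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Each sample's positive is the same index in the other view.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```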

CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

  • paper_url: http://arxiv.org/abs/2308.12539
  • repo_url: https://github.com/vipulgupta1011/calm
  • paper_authors: Vipul Gupta, Pranav Narayanan Venkit, Hugo Laurençon, Shomir Wilson, Rebecca J. Passonneau
  • for: 这个论文的目的是量化和比较语言模型(LM)的社会经济偏见,以及这些偏见的可能性导致的危害。
  • methods: 这个论文使用了一个新的benchmarkdataset,称为Comprehensive Assessment of Language Model bias(CALM),来量化LM的偏见。它 integrate了16个不同领域的数据集,并从中过滤了224个模板,然后构建了一个包含78,400个例子的dataset。
  • results: 研究发现,与先前的数据集不同,CALM dataset更加多样化和可靠,并且可以更好地评估LM的偏见。在测试20个大型语言模型时,研究发现,一些模型系列的大型模型更加偏见,而一些模型系列的小型模型更加不偏见。此外,研究还发现了一些模型系列中的人种和性别偏见之间的负相关性。
    Abstract As language models (LMs) become increasingly powerful, it is important to quantify and compare them for sociodemographic bias with potential for harm. Prior bias measurement datasets are sensitive to perturbations in their manually designed templates, therefore unreliable. To achieve reliability, we introduce the Comprehensive Assessment of Language Model bias (CALM), a benchmark dataset to quantify bias in LMs across three tasks. We integrate 16 existing datasets across different domains, such as Wikipedia and news articles, to filter 224 templates from which we construct a dataset of 78,400 examples. We compare the diversity of CALM with prior datasets on metrics such as average semantic similarity, and variation in template length, and test the sensitivity to small perturbations. We show that our dataset is more diverse and reliable than previous datasets, thus better capture the breadth of linguistic variation required to reliably evaluate model bias. We evaluate 20 large language models including six prominent families of LMs such as Llama-2. In two LM series, OPT and Bloom, we found that larger parameter models are more biased than lower parameter models. We found the T0 series of models to be the least biased. Furthermore, we noticed a tradeoff between gender and racial bias with increasing model size in some model series. The code is available at https://github.com/vipulgupta1011/CALM.
    摘要 随着语言模型(LM)能力日益增强,量化并比较其可能造成危害的社会人口学偏见变得十分重要。先前的偏见测量数据集对其人工设计模板的扰动十分敏感,因此并不可靠。为实现可靠性,我们提出了 Comprehensive Assessment of Language Model bias(CALM),这是一个用于在三类任务上量化LM偏见的基准数据集。我们整合了来自维基百科和新闻文章等不同领域的16个现有数据集,从中筛选出224个模板,并据此构建了包含78,400个样例的数据集。我们在平均语义相似度、模板长度变化等指标上比较了CALM与先前数据集的多样性,并测试了其对微小扰动的敏感性。结果表明,我们的数据集比以往数据集更加多样且更可靠,因而能更好地覆盖可靠评估模型偏见所需的语言变化广度。我们评估了20个大型语言模型,其中包括Llama-2等六个主要模型系列。在OPT和Bloom两个模型系列中,我们发现参数量更大的模型比参数量较小的模型更有偏见;T0系列模型的偏见最小。此外,在某些模型系列中,随着模型规模增大,性别偏见与种族偏见之间存在权衡。代码见 https://github.com/vipulgupta1011/CALM。

FedSoL: Bridging Global Alignment and Local Generality in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12532
  • repo_url: None
  • paper_authors: Gihun Lee, Minchan Jeong, Sangmook Kim, Jaehoon Oh, Se-Young Yun
  • for: 提高 Federated Learning 性能在不同客户端数据分布情况下
  • methods: combining global alignment和本地通用性,通过在本地学习中寻找Parameter region robust against proximal perturbations
  • results: experiments show that FedSoL consistently achieves state-of-the-art performance on various setups. Here's the full text in Simplified Chinese:
  • for: 本研究旨在提高 Federated Learning 性能在不同客户端数据分布情况下
  • methods: Federated Stability on Learning (FedSoL) combines global alignment和本地通用性,通过在本地学习中寻找Parameter region robust against proximal perturbations
  • results: experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.
    Abstract Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among different local objectives of the clients. Yet, it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines both the concepts of global alignment and local generality. In FedSoL, the local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter update. Our experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.
    摘要 联合学习(FL)将个别客户的本地训练模型聚合成全域模型。FL虽可保持资料隐私,但当客户资料分布不均时经常出现明显的性能下降。许多先前的FL算法通过引入各种近端(proximal)限制来解决这个问题:这些限制旨在约束本地学习偏离全域目标的程度,以鼓励全域对齐,但也因此干扰了原本的本地目标,限制了本地学习。最近出现了另一种提高本地学习通用性的方法:通过在平滑的损失地形中获得本地模型,来缓解不同客户本地目标之间的冲突。然而,由于本地学习没有考虑全域目标,这种方法无法保证稳定的全域对齐。在这项研究中,我们提出了 Federated Stability on Learning(FedSoL),结合全域对齐与本地通用性两个概念。在FedSoL中,本地学习寻找一个对近端扰动具有鲁棒性的参数区域。这一策略在本地学习中引入隐式的近端限制效应,同时保持原始本地目标用于参数更新。实验显示,FedSoL在各种设置下均能取得最先进的性能。
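
One plausible reading of "seeking a parameter region robust against proximal perturbations" is a SAM-style step whose perturbation direction comes from the proximal (distance-to-global) term, while the actual update uses the unmodified local objective. The sketch below follows that reading; it is an assumption for illustration, not the authors' exact algorithm, and the hyperparameters and the `global_params` interface are placeholders.

```python
import torch

def fedsol_like_local_step(model, global_params, batch, loss_fn, lr=0.01, rho=0.05):
    """One illustrative local update: perturb the weights along the gradient of the
    proximal term ||w - w_global||^2, then take the local-loss gradient at the
    perturbed point and apply it to the original weights. Sketch only.
    `global_params` is assumed to be a list of tensors aligned with model.parameters()."""
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the proximal term is simply 2 * (w - w_global).
    prox_grads = [2.0 * (p.detach() - g.detach()) for p, g in zip(params, global_params)]
    norm = torch.sqrt(sum((g ** 2).sum() for g in prox_grads)) + 1e-12
    eps = [rho * g / norm for g in prox_grads]

    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                      # move to the proximally perturbed point
    loss = loss_fn(model(x), y)            # unmodified local objective
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads):
            p.sub_(e)                      # undo the perturbation
            p.sub_(lr * g)                 # gradient step on the original weights
    return loss.item()
```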

SieveNet: Selecting Point-Based Features for Mesh Networks

  • paper_url: http://arxiv.org/abs/2308.12530
  • repo_url: https://github.com/sievenet/sievenet.github.io
  • paper_authors: Shengchao Yuan, Yishun Dou, Rui Shi, Bingbing Ni, Zhong Zheng
  • for: This paper aims to address the challenges of applying mesh neural networks to existing architectures due to the irregular topology of meshes.
  • methods: The proposed method, SieveNet, utilizes both the regular topology from remeshing and accurate geometric information from distortion-aware point sampling on the surface of the original mesh.
  • results: The proposed method achieves effective and superior performance on classification and segmentation tasks, eliminating the need for hand-crafted feature engineering and leveraging off-the-shelf network architectures such as the vision transformer. Here is the text in Simplified Chinese:
  • for: 这篇论文目标是解决将三维计算机视觉和图形中的网格应用于现有架构所存在的挑战。
  • methods: 提议的方法是使用结构化网格topology从重新排序和精确地从原始网格表面上的点抽取 geometric information。
  • results: 提议的方法在分类和分割任务中取得了有效和超越性的表现,不需要手动设计特征工程和可以利用现有的网格架构such as 视Transformer。
    Abstract Meshes are widely used in 3D computer vision and graphics, but their irregular topology poses challenges in applying them to existing neural network architectures. Recent advances in mesh neural networks turn to remeshing and push the boundary of pioneer methods that solely take the raw meshes as input. Although the remeshing offers a regular topology that significantly facilitates the design of mesh network architectures, features extracted from such remeshed proxies may struggle to retain the underlying geometry faithfully, limiting the subsequent neural network's capacity. To address this issue, we propose SieveNet, a novel paradigm that takes into account both the regular topology and the exact geometry. Specifically, this method utilizes structured mesh topology from remeshing and accurate geometric information from distortion-aware point sampling on the surface of the original mesh. Furthermore, our method eliminates the need for hand-crafted feature engineering and can leverage off-the-shelf network architectures such as the vision transformer. Comprehensive experimental results on classification and segmentation tasks well demonstrate the effectiveness and superiority of our method.
    摘要 网格(mesh)在三维计算机视觉和图形学中被广泛使用,但其不规则的拓扑结构给现有神经网络架构的应用带来挑战。近期的网格神经网络转向重新网格化(remeshing),突破了仅以原始网格为输入的早期方法的局限。尽管重新网格化提供了规则的拓扑,大大简化了网格网络架构的设计,但从这些重新网格化代理中提取的特征可能难以忠实保留底层几何,从而限制了后续神经网络的能力。为解决这一问题,我们提出了SieveNet,一种同时考虑规则拓扑与精确几何的新范式。具体而言,该方法利用重新网格化得到的结构化网格拓扑,以及在原始网格表面上进行失真感知点采样得到的精确几何信息。此外,该方法无需人工特征工程,并可直接利用视觉Transformer等现成的网络架构。在分类与分割任务上的大量实验结果充分证明了该方法的有效性与优越性。

UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023

  • paper_url: http://arxiv.org/abs/2308.12526
  • repo_url: None
  • paper_authors: Yu Zheng, Yajun Zhang, Chuanying Niu, Yibin Zhan, Yanhua Long, Dongxing Xu
  • for: 本文是对VoxCeleb Speaker Recognition Challenge 2023(VoxSRC 2023)的论文提交,包括Track 1和Track 2。
  • methods: 该系统使用了大规模ResNet和RepVGG架构,并提出了一种稳定性 aware的分数均衡方法(CMF),以提高对话音频印痕的稳定性。
  • results: 该系统通过将六个模型进行融合,在VoxSRC 2023中获得了Track 1的第一名和Track 2的第二名,其minDCF为0.0855,EER为1.5880%。
    Abstract This report describes the UNISOUND submission for Track1 and Track2 of VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity score by a Consistency Measure Factor (CMF). CMF brings a huge performance boost in this challenge. Our final system is a fusion of six models and achieves the first place in Track 1 and second place in Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855 and the EER is 1.5880%.
    摘要 这份报告描述了我们在VoxCeleb Speaker Recognition Challenge 2023(VoxSRC 2023)的Track 1和Track 2中的UNISOUND提交。我们在Track 1和Track 2上提交了同一套系统,仅使用VoxCeleb2-dev进行训练,并为本次挑战开发了大规模ResNet和RepVGG架构。我们提出了一种一致性感知的分数校准方法,通过一致性度量因子(Consistency Measure Factor,CMF)利用音频声纹在相似度分数中的稳定性,该方法在本次挑战中带来了巨大的性能提升。我们的最终系统由六个模型融合而成,在Track 1中获得第一名,在Track 2中获得第二名,minDCF为0.0855,EER为1.5880%。

Not Only Rewards But Also Constraints: Applications on Legged Robot Locomotion

  • paper_url: http://arxiv.org/abs/2308.12517
  • repo_url: None
  • paper_authors: Yunho Kim, Hyunsik Oh, Jeonghyun Lee, Jinhyeok Choi, Gwanghyeon Ji, Moonkyu Jung, Donghoon Youm, Jemin Hwangbo
  • for: 这个论文的目的是提出一种新的强化学习框架,用于训练神经网络控制器,以实现复杂的机器人系统的高性能控制。
  • methods: 这种框架使用了两种约束类型和一种高效的政策优化算法,以便让工程师在极少的计算开销下,准确地反映他们的意图和处理约束。
  • results: 在 simulate 和实际实验中,这种学习框架可以让控制器在不同的四足机器人系统中提供高性能和自然的运动样式,并且只需要调整单个奖励系数,可以减少奖励工程的努力和时间。
    Abstract Several earlier studies have shown impressive control performance in complex robotic systems by designing the controller using a neural network and training it with model-free reinforcement learning. However, these outstanding controllers with natural motion style and high task performance are developed through extensive reward engineering, which is a highly laborious and time-consuming process of designing numerous reward terms and determining suitable reward coefficients. In this work, we propose a novel reinforcement learning framework for training neural network controllers for complex robotic systems consisting of both rewards and constraints. To let the engineers appropriately reflect their intent to constraints and handle them with minimal computation overhead, two constraint types and an efficient policy optimization algorithm are suggested. The learning framework is applied to train locomotion controllers for several legged robots with different morphology and physical attributes to traverse challenging terrains. Extensive simulation and real-world experiments demonstrate that performant controllers can be trained with significantly less reward engineering, by tuning only a single reward coefficient. Furthermore, a more straightforward and intuitive engineering process can be utilized, thanks to the interpretability and generalizability of constraints. The summary video is available at https://youtu.be/KAlm3yskhvM.
    摘要 先前的一些研究已经表明,使用神经网络设计控制器并通过无模型强化学习进行训练,可以在复杂机器人系统中实现出色的控制性能。然而,这些兼具自然运动风格和高任务性能的控制器依赖大量的奖励工程,即设计众多奖励项并确定合适奖励系数的过程,非常费时费力。在这项工作中,我们提出了一种同时包含奖励与约束的强化学习框架,用于训练复杂机器人系统的神经网络控制器。为了让工程师能够将其意图恰当地体现在约束中,并以极小的计算开销处理这些约束,我们提出了两种约束类型和一种高效的策略优化算法。该学习框架被用于训练多种具有不同形态和物理属性的足式机器人在复杂地形上的运动控制器。大量的仿真和真实世界实验表明,只需调整单个奖励系数即可训练出高性能控制器,显著减少了奖励工程的工作量。此外,得益于约束的可解释性和泛化性,还可以采用更直接、更直观的工程流程。研究的摘要视频见 https://youtu.be/KAlm3yskhvM。
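
The paper specifies its own constraint types and policy-optimization algorithm; as a generic illustration of training with rewards plus constraints, the sketch below uses a Lagrangian relaxation whose multiplier is adapted by dual ascent. This is an assumption for illustration, not the authors' method, and the cost limit and learning rate are placeholders.

```python
import torch

class LagrangianConstraint:
    """Reward-plus-constraint training via Lagrangian relaxation: keep the
    expected constraint cost below a limit by adapting a multiplier with dual
    ascent. Illustrative only."""
    def __init__(self, cost_limit=0.0, multiplier_lr=0.01):
        self.cost_limit = cost_limit
        self.multiplier_lr = multiplier_lr
        self.lam = 0.0  # Lagrange multiplier, kept non-negative

    def update_multiplier(self, mean_episode_cost):
        # Dual ascent: grow lambda while the constraint is violated, shrink otherwise.
        gap = float(mean_episode_cost) - self.cost_limit
        self.lam = max(0.0, self.lam + self.multiplier_lr * gap)

    def penalized_loss(self, policy_loss, cost_tensor):
        # cost_tensor: differentiable estimate of the constraint cost for the batch.
        return policy_loss + self.lam * cost_tensor.mean()

# Inside a policy-gradient update one could use:
#   constraint.update_multiplier(batch_costs.mean())
#   loss = constraint.penalized_loss(-policy_objective, batch_costs)
```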

Masked Autoencoders are Efficient Class Incremental Learners

  • paper_url: http://arxiv.org/abs/2308.12510
  • repo_url: https://github.com/scok30/mae-cil
  • paper_authors: Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng
  • for: 这篇论文旨在Sequential Learning new classes while avoiding catastrophic forgetting of previous knowledge.
  • methods: 使用Masked Autoencoders (MAEs) as efficient learners for CIL, 并且通过组合supervised loss for classification.
  • results: 实验结果显示,我们的方法比顶对照方法在CIFAR-100, ImageNet-Subset, 和 ImageNet-Full 的表现更好.
    Abstract Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL .
    摘要 类增量学习(CIL)的目标是在不对先前知识造成灾难性遗忘的前提下,逐步学习新的类别。我们提议使用掩码自编码器(MAE)作为CIL的高效学习器。MAE 最初被设计用于通过重建式无监督学习获得有用的表示,并且可以轻松地与分类的监督损失结合。此外,MAE 能够从随机选择的图像块(patch)可靠地重建原始输入图像,我们利用这一点更高效地存储过去任务中的范例(exemplar)。我们还提出了双边 MAE 框架,从图像级和嵌入级融合中学习,以生成更高质量的重建图像和更稳定的表示。实验表明,我们的方法在 CIFAR-100、ImageNet-Subset 和 ImageNet-Full 上优于当前最先进方法。代码可以在 https://github.com/scok30/MAE-CIL 上获取。
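
As an illustrative sketch of pairing masked-patch reconstruction with a supervised classification loss, the snippet below masks random patches, reconstructs them, and classifies from the pooled latent. The `encoder`, `decoder`, and `classifier` modules and their output shapes are assumptions, and the paper's bilateral MAE and exemplar storage are omitted.

```python
import torch
import torch.nn.functional as F

def mae_cil_style_loss(encoder, decoder, classifier, patches, labels, mask_ratio=0.75):
    """Joint reconstruction + classification objective in the spirit of using a
    masked autoencoder for class-incremental learning.
    `patches`: (batch, num_patches, patch_dim). Simplified sketch only."""
    b, n, d = patches.shape
    num_keep = int(n * (1.0 - mask_ratio))
    # Randomly keep a subset of patches per sample.
    idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))

    latent = encoder(visible)                 # assumed shape: (b, num_keep, d_model)
    recon = decoder(latent)                   # assumed to predict all patches: (b, n, d)
    rec_loss = F.mse_loss(recon, patches)

    logits = classifier(latent.mean(dim=1))   # pooled representation -> class logits
    cls_loss = F.cross_entropy(logits, labels)
    return rec_loss + cls_loss
```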

False Information, Bots and Malicious Campaigns: Demystifying Elements of Social Media Manipulations

  • paper_url: http://arxiv.org/abs/2308.12497
  • repo_url: None
  • paper_authors: Mohammad Majid Akhtar, Rahat Masood, Muhammad Ikram, Salil S. Kanhere
  • for: This paper aims to provide a comprehensive analysis of the manipulation landscape on online social networks (OSNs), including false information, bots, and malicious campaigns.
  • methods: The paper synthesizes insights from various disciplines and integrates primary elements of social media manipulation (SMM) to extensively examine each SMM element.
  • results: The findings highlight the urgent need for interdisciplinary research to effectively combat social media manipulations, and provide valuable insights for OSN providers to ensure the safety and integrity of their platforms. Here are the three points in Simplified Chinese text:
  • for: 这篇论文目标是为online社交网络(OSN)提供全面的欺诈景观,包括假信息、机器人和恶意运动。
  • methods: 这篇论文将不同领域的知识融合,并将社交媒体欺诈(SMM)的主要元素集成,进行广泛的研究。
  • results: 发现表明需要跨学科研究,以有效应对社交媒体欺诈,并为社交媒体平台提供安全和完整性。
    Abstract The rapid spread of false information and persistent manipulation attacks on online social networks (OSNs), often for political, ideological, or financial gain, has affected the openness of OSNs. While researchers from various disciplines have investigated different manipulation-triggering elements of OSNs (such as understanding information diffusion on OSNs or detecting automated behavior of accounts), these works have not been consolidated to present a comprehensive overview of the interconnections among these elements. Notably, user psychology, the prevalence of bots, and their tactics in relation to false information detection have been overlooked in previous research. To address this research gap, this paper synthesizes insights from various disciplines to provide a comprehensive analysis of the manipulation landscape. By integrating the primary elements of social media manipulation (SMM), including false information, bots, and malicious campaigns, we extensively examine each SMM element. Through a systematic investigation of prior research, we identify commonalities, highlight existing gaps, and extract valuable insights in the field. Our findings underscore the urgent need for interdisciplinary research to effectively combat social media manipulations, and our systematization can guide future research efforts and assist OSN providers in ensuring the safety and integrity of their platforms.
    摘要 在社交媒体网络(OSN)上,快速传播的假信息和持续的操纵攻击已经影响了OSN的开放性。虽然来自不同领域的研究人员已经调查了OSN上的不同攻击触发元素(如了解信息传播在OSN上或检测账户的自动行为),但这些研究没有被集成来提供全面的概念概述。特别是用户心理学、灵活的机器人和它们与假信息检测之间的关系尚未得到过去研究的关注。为了解决这个研究漏洞,本文将从不同领域的视角 integrates 社交媒体攻击(SMM)的主要元素,包括假信息、机器人和恶意运动。我们对每个SMM元素进行了系统性的调查,并通过对先前研究的系统性分析,找到了共同点、突出了现有的漏洞、提取了有价值的发现。我们的发现表明,需要跨学科研究,以有效地抗击社交媒体攻击,而我们的系统化分析可以引导未来的研究努力和帮助OSN提供者保持平台的安全和完整性。

Optimizing Neural Network Scale for ECG Classification

  • paper_url: http://arxiv.org/abs/2308.12492
  • repo_url: None
  • paper_authors: Byeong Tak Lee, Yong-Yeon Jo, Joon-Myoung Kwon
  • for: 这个论文旨在研究用于分析电cardiogram(ECG)的卷积神经网络(CNN),特指Residual神经网络(ResNet)。
  • methods: 该论文使用了CNN模型,并对不同参数进行了探索和分析,以优化网络缩放。
  • results: 研究发现,采用更浅的网络结构、更多的通道数和更小的核心大小可以提高ECG分类的性能。结果表明,针对不同目标任务,可以根据我们的发现来获得更高效和准确的模型,即使使用更少的计算资源或时间。在实践中,我们示例了一种基于我们发现的窄搜索空间可以提高性能。
    Abstract We study scaling convolutional neural networks (CNNs), specifically targeting Residual neural networks (ResNet), for analyzing electrocardiograms (ECGs). Although ECG signals are time-series data, CNN-based models have been shown to outperform other neural networks with different architectures in ECG analysis. However, most previous studies in ECG analysis have overlooked the importance of network scaling optimization, which significantly improves performance. We explored and demonstrated an efficient approach to scale ResNet by examining the effects of crucial parameters, including layer depth, the number of channels, and the convolution kernel size. Through extensive experiments, we found that a shallower network, a larger number of channels, and smaller kernel sizes result in better performance for ECG classifications. The optimal network scale might differ depending on the target task, but our findings provide insight into obtaining more efficient and accurate models with fewer computing resources or less time. In practice, we demonstrate that a narrower search space based on our findings leads to higher performance.
    摘要 我们研究了卷积神经网络(CNN),特别是残差神经网络(ResNet),在心电图(ECG)分析中的网络规模缩放问题。尽管ECG信号属于时间序列数据,但基于CNN的模型已被证明在ECG分析中优于其他不同架构的神经网络。然而,大多数先前的ECG分析研究忽视了网络规模优化的重要性,而这能显著提升性能。我们探索并展示了一种高效的ResNet缩放方法,考察了层深度、通道数和卷积核大小等关键参数的影响。通过大量实验我们发现,更浅的网络、更多的通道和更小的卷积核能为ECG分类带来更好的表现。最优的网络规模可能因目标任务而异,但我们的发现有助于在更少的计算资源或时间下获得更高效、更准确的模型。在实践中,我们展示了基于这些发现构建的更窄搜索空间能够带来更高的性能。
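
A minimal sketch of the kind of scaling sweep described here: a 1-D ResNet whose depth, channel count, and kernel size are exposed as knobs. The default values, lead count, and class count are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BasicBlock1d(nn.Module):
    """A 1-D residual block; kernel size and channel width are the knobs the
    scaling study varies. Odd kernel sizes keep the temporal length so the
    residual addition stays shape-compatible."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)

def make_ecg_resnet(depth=4, channels=128, kernel_size=7, in_leads=12, num_classes=5):
    """Builds a small 1-D ResNet whose depth, width and kernel size can be swept."""
    layers = [nn.Conv1d(in_leads, channels, kernel_size, padding=kernel_size // 2),
              nn.BatchNorm1d(channels), nn.ReLU()]
    layers += [BasicBlock1d(channels, kernel_size) for _ in range(depth)]
    layers += [nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(channels, num_classes)]
    return nn.Sequential(*layers)
```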

Fall Detection using Knowledge Distillation Based Long short-term memory for Offline Embedded and Low Power Devices

  • paper_url: http://arxiv.org/abs/2308.12481
  • repo_url: None
  • paper_authors: Hannah Zhou, Allison Chen, Celine Buer, Emily Chen, Kayleen Tang, Lauryn Gong, Zhiqi Liu, Jianbin Tang
  • for: 这篇论文旨在提出一种低功耗、成本效果的滑落探测方法,通过知识传授基于LSTM模型优化精确性。
  • methods: 本论文使用时间序列数据集合发展知识传授基于LSTM模型,并评估不同传感器的滑落探测模型精确性。此外, authors 还使用知识传授技术优化模型的精确性,以降低功耗消耗。
  • results: 本论文的结果显示,这种基于LSTM模型的滑落探测方法可以实现实时探测,并且可以提高滑落探测精确性。此外, authors 发现知识传授技术可以优化模型的精确性,并降低功耗消耗。
    Abstract This paper presents a cost-effective, low-power approach to unintentional fall detection using knowledge distillation-based LSTM (Long Short-Term Memory) models to significantly improve accuracy. With a primary focus on analyzing time-series data collected from various sensors, the solution offers real-time detection capabilities, ensuring prompt and reliable identification of falls. The authors investigate fall detection models that are based on different sensors, comparing their accuracy rates and performance. Furthermore, they employ the technique of knowledge distillation to enhance the models' precision, resulting in refined accurate configurations that consume lower power. As a result, this proposed solution presents a compelling avenue for the development of energy-efficient fall detection systems for future advancements in this critical domain.
    摘要 本文提出了一种低功耗、低成本的意外跌倒检测方法,采用基于知识蒸馏的LSTM(长短期记忆)模型来显著提升准确率。该方案主要分析来自多种传感器的时间序列数据,具备实时检测能力,可及时、可靠地识别跌倒事件。作者研究了基于不同传感器的跌倒检测模型,比较了它们的准确率与性能,并进一步利用知识蒸馏技术提升模型精度,从而得到在功耗更低的同时保持高准确率的精简配置。因此,该方案为开发节能的跌倒检测系统提供了一条有吸引力的途径,有助于这一关键领域的未来发展。
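
As a sketch of the generic knowledge-distillation objective referenced above, the snippet below combines a temperature-softened KL term toward a teacher with ordinary cross-entropy. The teacher/student LSTM architectures and the hyperparameter values used in the paper are not specified here; the defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Standard knowledge-distillation objective: temperature-softened KL toward
    the teacher plus cross-entropy on the ground-truth labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```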

Zero-delay Consistent Signal Reconstruction from Streamed Multivariate Time Series

  • paper_url: http://arxiv.org/abs/2308.12459
  • repo_url: None
  • paper_authors: Emilio Ruiz-Moreno, Luis Miguel López-Ramos, Baltasar Beferull-Lozano
  • for: 本研究旨在提出一种能够在数据流中逐步重建数字化的实时分析信号的方法,以实现零延迟响应。
  • methods: 本方法基于循环神经网络学习多变量时间序列的空间时间相关性,以降低重建过程中的干扰。
  • results: 实验结果显示,提议的方法可以在采样率增加的情况下实现逐步重建,并且与非一致重建相比,实现较好的误差衰减。
    Abstract Digitalizing real-world analog signals typically involves sampling in time and discretizing in amplitude. Subsequent signal reconstructions inevitably incur an error that depends on the amplitude resolution and the temporal density of the acquired samples. From an implementation viewpoint, consistent signal reconstruction methods have proven a profitable error-rate decay as the sampling rate increases. Despite that, these results are obtained under offline settings. Therefore, a research gap exists regarding methods for consistent signal reconstruction from data streams. This paper presents a method that consistently reconstructs streamed multivariate time series of quantization intervals under a zero-delay response requirement. On the other hand, previous work has shown that the temporal dependencies within univariate time series can be exploited to reduce the roughness of zero-delay signal reconstructions. This work shows that the spatiotemporal dependencies within multivariate time series can also be exploited to achieve improved results. Specifically, the spatiotemporal dependencies of the multivariate time series are learned, with the assistance of a recurrent neural network, to reduce the roughness of the signal reconstruction on average while ensuring consistency. Our experiments show that our proposed method achieves a favorable error-rate decay with the sampling rate compared to a similar but non-consistent reconstruction.
    摘要 将现实世界的模拟信号数字化通常需要在时间上采样并在幅度上离散化。随后的信号重建不可避免地会产生误差,其大小取决于幅度分辨率和所采样本的时间密度。从实现角度来看,一致性信号重建方法已被证明随采样率提高而呈现良好的误差率衰减。然而,这些结果是在离线设置下获得的,因此关于如何从数据流中进行一致性信号重建仍存在研究空白。本文提出了一种方法,能够在零延迟响应要求下,对流式传入的多变量时间序列量化区间进行一致性重建。此前的工作已表明,可以利用单变量时间序列内部的时间相关性来降低零延迟信号重建的粗糙度;本文则表明,多变量时间序列中的时空相关性同样可以被利用以获得更好的结果。具体而言,我们借助循环神经网络学习多变量时间序列的时空相关性,在保证一致性的同时平均降低重建信号的粗糙度。实验表明,与类似但非一致的重建方法相比,所提方法随采样率提高表现出更有利的误差率衰减。

PFL-GAN: When Client Heterogeneity Meets Generative Models in Personalized Federated Learning

  • paper_url: http://arxiv.org/abs/2308.12454
  • repo_url: None
  • paper_authors: Achintha Wijesinghe, Songyang Zhang, Zhi Ding
  • for: 强调在多 Client 环境下实现对应的 Federated Learning (FL) 案例,特别是在客户数据不同性下实现更好的学习效果。
  • methods: 基于 Generative Adversarial Network (GAN) 模型,提出了一个 novel GAN sharing and aggregation strategy for Personalized Federated Learning (PFL),包括客户相似性学习和权重联合数据聚合。
  • results: 透过严谨的实验评估在多个知名数据集上,证明 PFL-GAN 能够在不同客户数据不同性下实现更好的学习效果。
    Abstract Recent advances of generative learning models are accompanied by the growing interest in federated learning (FL) based on generative adversarial network (GAN) models. In the context of FL, GAN can capture the underlying client data structure, and regenerate samples resembling the original data distribution without compromising the private raw data. Although most existing GAN-based FL works focus on training a global model, Personalized FL (PFL) sometimes can be more effective in view of client data heterogeneity in terms of distinct data sample distributions, feature spaces, and labels. To cope with client heterogeneity in GAN-based FL, we propose a novel GAN sharing and aggregation strategy for PFL. The proposed PFL-GAN addresses the client heterogeneity in different scenarios. More specially, we first learn the similarity among clients and then develop an weighted collaborative data aggregation. The empirical results through the rigorous experimentation on several well-known datasets demonstrate the effectiveness of PFL-GAN.
    摘要 生成学习模型的最新进展引发了人们对基于生成对抗网络(GAN)的联邦学习(FL)日益浓厚的兴趣。在FL中,GAN可以捕捉客户端数据的底层结构,并在不泄露私有原始数据的前提下,重新生成符合原始数据分布的样本。尽管大多数现有的基于GAN的FL工作都集中于训练一个全局模型,但考虑到各客户端在样本分布、特征空间和标签上的数据异质性,个性化联邦学习(PFL)有时会更加有效。为了应对基于GAN的FL中的客户端异质性,我们提出了一种面向PFL的新型GAN共享与聚合策略。所提出的PFL-GAN能够应对不同场景下的客户端异质性:我们首先学习客户端之间的相似性,然后据此开发一种加权的协同数据聚合。在多个知名数据集上进行的严谨实验结果证明了PFL-GAN的有效性。
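
A minimal sketch of similarity-weighted aggregation for one target client: generator weights from all clients are averaged with weights given by learned client similarities. The similarity-learning step itself and the full sharing strategy are not reproduced; this is an illustrative assumption, not the paper's exact procedure.

```python
import torch

def similarity_weighted_aggregation(client_state_dicts, similarity_row):
    """Aggregate generator weights for one target client as a similarity-weighted
    average over all clients. `similarity_row[i]` is the (assumed already learned)
    similarity between the target client and client i."""
    weights = torch.tensor(similarity_row, dtype=torch.float32)
    weights = weights / weights.sum()
    aggregated = {}
    for key in client_state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in client_state_dicts], dim=0)
        # Broadcast the per-client weights over the parameter dimensions.
        w = weights.view(-1, *([1] * (stacked.dim() - 1)))
        aggregated[key] = (w * stacked).sum(dim=0)
    return aggregated
```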

Augmenting medical image classifiers with synthetic data from latent diffusion models

  • paper_url: http://arxiv.org/abs/2308.12453
  • repo_url: None
  • paper_authors: Luke W. Sagers, James A. Diao, Luke Melas-Kyriazi, Matthew Groh, Pranav Rajpurkar, Adewole S. Adamson, Veronica Rotemberg, Roxana Daneshjou, Arjun K. Manrai
  • for: 这个研究旨在测试generative AI可以帮助医疗人员开发更好的医疗AI算法,特别是在资料有限的情况下。
  • methods: 研究使用了latent diffusion模型,并与现实影像进行混合训练,以提高模型的表现。
  • results: 研究发现,使用生成的影像可以帮助提高医疗AI模型的表现,但是这些表现 improvements尚未到达显著的水平。另外,研究发现了一个新的数据集,包含458,920帧生成的影像。
    Abstract While hundreds of artificial intelligence (AI) algorithms are now approved or cleared by the US Food and Drugs Administration (FDA), many studies have shown inconsistent generalization or latent bias, particularly for underrepresented populations. Some have proposed that generative AI could reduce the need for real data, but its utility in model development remains unclear. Skin disease serves as a useful case study in synthetic image generation due to the diversity of disease appearance, particularly across the protected attribute of skin tone. Here we show that latent diffusion models can scalably generate images of skin disease and that augmenting model training with these data improves performance in data-limited settings. These performance gains saturate at synthetic-to-real image ratios above 10:1 and are substantially smaller than the gains obtained from adding real images. As part of our analysis, we generate and analyze a new dataset of 458,920 synthetic images produced using several generation strategies. Our results suggest that synthetic data could serve as a force-multiplier for model development, but the collection of diverse real-world data remains the most important step to improve medical AI algorithms.
    摘要 尽管美国食品药品监督管理局(FDA)已批准或许可了数百种人工智能(AI)算法,但许多研究表明这些算法存在不一致的泛化能力或潜在偏见,对代表性不足的人群尤为明显。有人提出生成式AI可以降低对真实数据的需求,但其在模型开发中的实际效用尚不明确。由于疾病外观的多样性,特别是在肤色这一受保护属性上的差异,皮肤病成为研究合成图像生成的有用案例。在本研究中,我们展示了潜在扩散模型能够可扩展地生成皮肤病图像,并且在数据有限的情况下,用这些合成数据增强模型训练可以提升性能。这些性能提升在合成图像与真实图像比例超过10:1后趋于饱和,且明显小于添加真实图像所带来的提升。作为分析的一部分,我们采用多种生成策略生成并分析了一个包含458,920张合成图像的新数据集。我们的结果表明,合成数据可以在模型开发中起到放大器的作用,但收集多样化的真实世界数据仍然是改进医疗AI算法的最重要一步。
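
As a small illustration of the synthetic-to-real ratio experiments described above, the helper below builds a training list with a fixed ratio of synthetic to real items. The 10:1 default mirrors the saturation point reported in the abstract; the function name and item format are assumptions.

```python
import random

def mix_synthetic_and_real(real_items, synthetic_items, synth_to_real_ratio=10, seed=0):
    """Build a training list that augments real items with synthetic ones at a
    fixed synthetic-to-real ratio. Items can be (path, label) pairs or any records."""
    rng = random.Random(seed)
    n_synth = min(len(synthetic_items), synth_to_real_ratio * len(real_items))
    mixed = list(real_items) + rng.sample(list(synthetic_items), n_synth)
    rng.shuffle(mixed)
    return mixed
```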

An Intentional Forgetting-Driven Self-Healing Method For Deep Reinforcement Learning Systems

  • paper_url: http://arxiv.org/abs/2308.12445
  • repo_url: https://github.com/ahmedhajyahmed/drdrl
  • paper_authors: Ahmed Haj Yahmed, Rached Bouchoucha, Houssem Ben Braiek, Foutse Khomh
  • for: 这篇论文是针对大规模生产中的深度强化学习(DRL)系统进行应用,并解决DRL系统在环境变化中导致的不适用行为问题。
  • methods: 这篇论文提出了一种具有自我疗愈能力的DRL系统,称为Dr. DRL,它通过新的忘记机制来解决传统的CL潜在问题,例如 catastrophic forgetting、warm-starting failure 和 slow convergence。
  • results: 相比传统CL,Dr. DRL能将疗愈时间和微调回合数平均分别降低18.74%和17.72%。此外,Dr. DRL能帮助智能体适应19.63%的传统CL未能解决的漂移环境,并在两种方法都能解决的漂移环境中保持甚至最高提升45%的所获回报。
    Abstract Deep reinforcement learning (DRL) is increasingly applied in large-scale productions like Netflix and Facebook. As with most data-driven systems, DRL systems can exhibit undesirable behaviors due to environmental drifts, which often occur in constantly-changing production settings. Continual Learning (CL) is the inherent self-healing approach for adapting the DRL agent in response to the environment's conditions shifts. However, successive shifts of considerable magnitude may cause the production environment to drift from its original state. Recent studies have shown that these environmental drifts tend to drive CL into long, or even unsuccessful, healing cycles, which arise from inefficiencies such as catastrophic forgetting, warm-starting failure, and slow convergence. In this paper, we propose Dr. DRL, an effective self-healing approach for DRL systems that integrates a novel mechanism of intentional forgetting into vanilla CL to overcome its main issues. Dr. DRL deliberately erases the DRL system's minor behaviors to systematically prioritize the adaptation of the key problem-solving skills. Using well-established DRL algorithms, Dr. DRL is compared with vanilla CL on various drifted environments. Dr. DRL is able to reduce, on average, the healing time and fine-tuning episodes by, respectively, 18.74% and 17.72%. Dr. DRL successfully helps agents to adapt to 19.63% of drifted environments left unsolved by vanilla CL while maintaining and even enhancing by up to 45% the obtained rewards for drifted environments that are resolved by both approaches.
    摘要 深度强化学习(DRL)正越来越多地应用于Netflix和Facebook等大规模生产系统中。与大多数数据驱动系统一样,DRL系统可能因环境漂移而出现不理想的行为,而这种漂移在不断变化的生产环境中经常发生。持续学习(Continual Learning,CL)是DRL智能体针对环境条件变化进行自我修复的固有方法。然而,连续出现的较大幅度漂移可能使生产环境偏离其原始状态。近期研究表明,这类环境漂移往往使CL陷入漫长甚至失败的修复循环,其根源在于灾难性遗忘、热启动失败和收敛缓慢等低效问题。在这篇论文中,我们提出了Dr. DRL,一种有效的DRL系统自我修复方法,它在普通CL中引入了一种新颖的有意遗忘机制,以克服上述主要问题。Dr. DRL有意地擦除DRL系统的次要行为,从而系统性地优先适应关键的问题求解技能。基于多种成熟的DRL算法,我们在各种漂移环境中将Dr. DRL与普通CL进行了比较。Dr. DRL能将修复时间和微调回合数平均分别降低18.74%和17.72%,成功帮助智能体适应19.63%的普通CL无法解决的漂移环境,并在两种方法都能解决的漂移环境中保持甚至最高提升45%的所获回报。

TAI-GAN: Temporally and Anatomically Informed GAN for early-to-late frame conversion in dynamic cardiac PET motion correction

  • paper_url: http://arxiv.org/abs/2308.12443
  • repo_url: https://github.com/gxq1998/tai-gan
  • paper_authors: Xueqi Guo, Luyao Shi, Xiongchao Chen, Bo Zhou, Qiong Liu, Huidong Xie, Yi-Hwa Liu, Richard Palyo, Edward J. Miller, Albert J. Sinusas, Bruce Spottiswoode, Chi Liu, Nicha C. Dvornek
  • for: 这个论文主要关注的是 Dynamic cardiac positron emission tomography (PET) 图像序列中的迅速跟踪器动态分布和高异常性,尤其是在早期帧中,常见的INTENSITY-based image registration技术不能适用。
  • methods: 作者提出了一种使用生成方法处理 tracer 分布变化,以帮助现有的 registration 方法进行框架匹配。特别是,作者提出了一种 Temporally and Anatomically Informed Generative Adversarial Network (TAI-GAN),用于将早期帧转换成参照帧中的图像,通过一个 all-to-one 映射。
  • results: 作者验证了他们的提议在临床 $^{82}$Rb PET 数据集上,并发现他们的 TAI-GAN 可以生成高质量的转换图像,与参照帧的真实图像相似。经过 TAI-GAN 转换后,运动估计精度和临床血液流量(MBF)的量化也有所改善,与原始帧相比。
    Abstract The rapid tracer kinetics of rubidium-82 ($^{82}$Rb) and high variation of cross-frame distribution in dynamic cardiac positron emission tomography (PET) raise significant challenges for inter-frame motion correction, particularly for the early frames where conventional intensity-based image registration techniques are not applicable. Alternatively, a promising approach utilizes generative methods to handle the tracer distribution changes to assist existing registration methods. To improve frame-wise registration and parametric quantification, we propose a Temporally and Anatomically Informed Generative Adversarial Network (TAI-GAN) to transform the early frames into the late reference frame using an all-to-one mapping. Specifically, a feature-wise linear modulation layer encodes channel-wise parameters generated from temporal tracer kinetics information, and rough cardiac segmentations with local shifts serve as the anatomical information. We validated our proposed method on a clinical $^{82}$Rb PET dataset and found that our TAI-GAN can produce converted early frames with high image quality, comparable to the real reference frames. After TAI-GAN conversion, motion estimation accuracy and clinical myocardial blood flow (MBF) quantification were improved compared to using the original frames. Our code is published at https://github.com/gxq1998/TAI-GAN.
    摘要 铷-82($^{82}$Rb)的快速示踪剂动力学以及动态心脏正电子发射断层扫描(PET)中跨帧分布的高度变化,给帧间运动校正带来了巨大挑战,尤其是在常规基于强度的图像配准技术不适用的早期帧。一种有前景的替代思路是利用生成方法来处理示踪剂分布的变化,从而辅助现有的配准方法。为改进逐帧配准和参数定量,我们提出了时间与解剖信息引导的生成对抗网络(TAI-GAN),通过 all-to-one 映射将早期帧转换为晚期参考帧。具体而言,一个逐特征线性调制(feature-wise linear modulation)层编码由示踪剂时间动力学信息生成的逐通道参数,带有局部偏移的粗略心脏分割则作为解剖信息。我们在临床 $^{82}$Rb PET 数据集上验证了所提方法,发现 TAI-GAN 能够生成图像质量与真实参考帧相当的转换早期帧。经 TAI-GAN 转换后,与使用原始帧相比,运动估计精度和临床心肌血流量(MBF)定量均得到改善。代码发布于 https://github.com/gxq1998/TAI-GAN。
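
The abstract mentions a feature-wise linear modulation (FiLM) layer whose channel-wise parameters are predicted from temporal tracer-kinetics information. The sketch below is a generic FiLM layer of that kind, not the paper's exact architecture; the conditioning dimension and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: per-channel scale and shift predicted from
    a conditioning vector (standing in here for tracer-kinetics information)."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feature_map, condition):
        # feature_map: (B, C, ...spatial dims...); condition: (B, cond_dim)
        scale, shift = self.to_scale_shift(condition).chunk(2, dim=1)
        extra_dims = feature_map.dim() - 2
        shape = (feature_map.size(0), feature_map.size(1)) + (1,) * extra_dims
        return feature_map * (1 + scale.view(shape)) + shift.view(shape)
```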

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

  • paper_url: http://arxiv.org/abs/2308.12439
  • repo_url: None
  • paper_authors: Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
  • for: 防止深度神经网络(DNNs)中的后门攻击(backdoor attacks)。
  • methods: 基于反工程技术,从backdoored模型中提取出后门功能,并将其转化为高精度的后门输入检测器。
  • results: 对16种State-of-the-Art(SOTA)后门攻击进行了有效防御,而无需干扰清洁功能。验证在多个 datasets(CIFAR10、GTSRB和ImageNet)和不同的模型架构(ResNet、VGG、MobileNetV2和Vision Transformer)上。
    Abstract We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).
    摘要 我们提出了一种针对深度神经网络(DNN)后门攻击的新型防御机制;在这类攻击中,攻击者将恶意行为(后门)暗中植入DNN。我们的防御属于开发后(post-development)防御,即独立于模型的生成方式进行防御。该方法基于一种新的逆向工程思路,能够直接将给定后门模型中的后门功能提取到一个"后门专家模型"(backdoor expert model)中。做法十分直接:在一小批刻意错误标注的清洁样本上对后门模型进行微调,使其遗忘正常功能但仍保留后门功能,从而得到一个只能识别后门输入的模型。基于提取出的后门专家模型,我们展示了如何设计高准确度的后门输入检测器,在模型推理期间过滤掉后门输入。再结合一个经微调的辅助模型组成的集成(ensemble)策略,我们的防御方法 BaDExpert(Backdoor Input Detection with Backdoor Expert)能够有效缓解16种SOTA后门攻击,同时对清洁数据的可用性影响极小。BaDExpert 的有效性已在多个数据集(CIFAR10、GTSRB 和 ImageNet)和多种模型架构(ResNet、VGG、MobileNetV2 和 Vision Transformer)上得到验证。
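
A rough sketch of the extraction step described in the abstract: fine-tune a copy of the suspect model on a small set of clean samples with intentionally randomized labels so it unlearns normal behavior, then flag inputs on which the copy still agrees with the original model. The hyperparameters, the agreement-based detector, and the data-loader interface are illustrative assumptions; the paper's ensemble and calibration details are omitted.

```python
import copy
import torch
import torch.nn.functional as F

def extract_backdoor_expert(backdoored_model, clean_loader, num_classes, lr=1e-4, steps=100):
    """Fine-tune a copy of the suspect model on intentionally mislabeled clean
    samples so it unlearns normal behavior while (empirically) keeping the
    backdoor mapping. Sketch of the idea from the abstract only."""
    expert = copy.deepcopy(backdoored_model)
    opt = torch.optim.SGD(expert.parameters(), lr=lr)
    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, y = next(data_iter)
        # Shift each label by a random nonzero offset so it is guaranteed wrong.
        wrong_y = (y + torch.randint(1, num_classes, y.shape)) % num_classes
        loss = F.cross_entropy(expert(x), wrong_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return expert

def looks_like_backdoor(model, expert, x):
    """Heuristic detector: flag inputs where the original model and the backdoor
    expert agree, since the expert should only still 'recognize' backdoor inputs."""
    with torch.no_grad():
        return model(x).argmax(1) == expert(x).argmax(1)
```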

Deploying Deep Reinforcement Learning Systems: A Taxonomy of Challenges

  • paper_url: http://arxiv.org/abs/2308.12438
  • repo_url: https://github.com/drldeploymentchallenges-icsme2023/replicationpackage
  • paper_authors: Ahmed Haj Yahmed, Altaf Allah Abbassi, Amin Nikanjam, Heng Li, Foutse Khomh
  • for: This paper aims to investigate the challenges faced by practitioners when deploying deep reinforcement learning (DRL) systems, specifically on Stack Overflow (SO), the most popular Q&A forum for developers.
  • methods: The authors conducted an empirical study on SO to identify and understand the challenges related to deploying DRL systems. They categorized relevant SO posts by deployment platforms and manually analyzed 357 posts to investigate the current state and prevalence of these challenges.
  • results: The study found that the general interest in DRL deployment is growing, and DRL deployment is more difficult than other DRL issues. The authors also built a taxonomy of 31 unique challenges in deploying DRL to different platforms, with RL environment-related challenges being the most popular and communication-related challenges being the most difficult among practitioners.
    Abstract Deep reinforcement learning (DRL), leveraging Deep Learning (DL) in reinforcement learning, has shown significant potential in achieving human-level autonomy in a wide range of domains, including robotics, computer vision, and computer games. This potential justifies the enthusiasm and growing interest in DRL in both academia and industry. However, the community currently focuses mostly on the development phase of DRL systems, with little attention devoted to DRL deployment. In this paper, we propose an empirical study on Stack Overflow (SO), the most popular Q&A forum for developers, to uncover and understand the challenges practitioners faced when deploying DRL systems. Specifically, we categorized relevant SO posts by deployment platforms: server/cloud, mobile/embedded system, browser, and game engine. After filtering and manual analysis, we examined 357 SO posts about DRL deployment, investigated the current state, and identified the challenges related to deploying DRL systems. Then, we investigate the prevalence and difficulty of these challenges. Results show that the general interest in DRL deployment is growing, confirming the study's relevance and importance. Results also show that DRL deployment is more difficult than other DRL issues. Additionally, we built a taxonomy of 31 unique challenges in deploying DRL to different platforms. On all platforms, RL environment-related challenges are the most popular, and communication-related challenges are the most difficult among practitioners. We hope our study inspires future research and helps the community overcome the most common and difficult challenges practitioners face when deploying DRL systems.
    摘要 研究结果表明,人们对DRL部署的总体兴趣正在增长,这印证了本研究的相关性和重要性;同时,DRL部署比其他DRL问题更加困难。此外,我们构建了一个包含31个独特挑战的DRL部署分类体系,覆盖不同的部署平台:在所有平台上,与RL环境相关的挑战最为常见,而与通信相关的挑战对实践者来说最为困难。我们希望这项研究能够激发未来的研究,并帮助社区克服实践者在部署DRL系统时面临的最常见、最困难的挑战。

Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature

  • paper_url: http://arxiv.org/abs/2308.12420
  • repo_url: None
  • paper_authors: Walter Hernandez, Kamil Tylinski, Alastair Moore, Niall Roche, Nikhil Vadgama, Horst Treiblmaier, Jiangbo Shangguan, Paolo Tasca, Jiahua Xu
  • for: 本研究的目的是提供一种基于机器学习的系统性文献综述方法,用于探讨分布式账本技术(DLT)领域中环境、可持续发展和治理(ESG)方面的多个组成部分。
  • methods: 本研究使用107篇种子论文建立了包含63,083条参考文献的引文网络,并将其缩减到24,539篇文献进行分析。然后,对46篇论文中的命名实体按12个顶层类别进行标注,并细化了DLT的ESG元素。随后使用基于Transformer的自然语言处理模型,针对命名实体识别(NER)任务进行了微调。
  • results: 本研究借助微调后的NER模型,将论文库提炼为505篇关键论文,并通过命名实体和时间图分析,对DLT在ESG背景下的发展进行了系统性文献综述。本研究的贡献包括一种面向DLT领域、侧重ESG方面的机器学习驱动系统性文献综述方法,以及一个包含54,808个命名实体、面向DLT与ESG探索的首创NER数据集。
    Abstract Distributed Ledger Technologies (DLTs) have rapidly evolved, necessitating comprehensive insights into their diverse components. However, a systematic literature review that emphasizes the Environmental, Sustainability, and Governance (ESG) components of DLT remains lacking. To bridge this gap, we selected 107 seed papers to build a citation network of 63,083 references and refined it to a corpus of 24,539 publications for analysis. Then, we labeled the named entities in 46 papers according to twelve top-level categories derived from an established technology taxonomy and enhanced the taxonomy by pinpointing DLT's ESG elements. Leveraging transformer-based language models, we fine-tuned a pre-trained language model for a Named Entity Recognition (NER) task using our labeled dataset. We used our fine-tuned language model to distill the corpus to 505 key papers, facilitating a literature review via named entities and temporal graph analysis on DLT evolution in the context of ESG. Our contributions are a methodology to conduct a machine learning-driven systematic literature review in the DLT field, placing a special emphasis on ESG aspects. Furthermore, we present a first-of-its-kind NER dataset, composed of 54,808 named entities, designed for DLT and ESG-related explorations.
    摘要 分布式账本技术(DLT)发展迅速,亟需对其多样化组成部分进行全面审视。然而,目前仍缺乏一篇强调DLT的环境、可持续发展与治理(ESG)组成部分的系统性文献综述。为填补这一空白,我们选取了107篇种子论文,构建了包含63,083条参考文献的引文网络,并将其提炼为24,539篇文献进行分析。随后,我们依据既有的技术分类体系,将46篇论文中的命名实体标注为12个顶层类别,并通过确定DLT的ESG元素对该分类体系加以扩充。借助基于Transformer的语言模型,我们使用标注数据对预训练语言模型进行了命名实体识别(NER)任务的微调,并利用微调后的模型将语料库提炼为505篇关键论文,从而通过命名实体和时间图分析,对DLT在ESG背景下的演进进行文献综述。我们的贡献包括一种在DLT领域开展、特别强调ESG方面的机器学习驱动系统性文献综述方法,以及一个首创的、包含54,808个命名实体、面向DLT与ESG相关探索的NER数据集。

Machine learning in parameter estimation of nonlinear systems

  • paper_url: http://arxiv.org/abs/2308.12393
  • repo_url: None
  • paper_authors: Kaushal Kumar
  • for: 这篇论文旨在探讨一种基于神经网络的参数估测方法,用于处理复杂的非线性系统。
  • methods: 这篇论文使用一个具有哈伯损失函数的神经网络,来探索非线性系统中参数的潜在行为。
  • results: 这篇论文透过训练神经网络使用噪音时间序数据,实现了参数的精确估测,并证明了这种方法的稳定性和灵活性。
    Abstract Accurately estimating parameters in complex nonlinear systems is crucial across scientific and engineering fields. We present a novel approach for parameter estimation using a neural network with the Huber loss function. This method taps into deep learning's abilities to uncover parameters governing intricate behaviors in nonlinear equations. We validate our approach using synthetic data and predefined functions that model system dynamics. By training the neural network with noisy time series data, it fine-tunes the Huber loss function to converge to accurate parameters. We apply our method to damped oscillators, Van der Pol oscillators, Lotka-Volterra systems, and Lorenz systems under multiplicative noise. The trained neural network accurately estimates parameters, evident from closely matching latent dynamics. Comparing true and estimated trajectories visually reinforces our method's precision and robustness. Our study underscores the Huber loss-guided neural network as a versatile tool for parameter estimation, effectively uncovering complex relationships in nonlinear systems. The method navigates noise and uncertainty adeptly, showcasing its adaptability to real-world challenges.
    摘要 在科学和工程领域,准确估计复杂非线性系统中的参数至关重要。我们提出了一种使用带Huber损失函数的神经网络进行参数估计的新方法,利用深度学习的能力来揭示支配非线性方程复杂行为的参数。我们使用合成数据和描述系统动力学的预定义函数来验证该方法:通过在含噪时间序列数据上训练神经网络,借助Huber损失函数收敛到准确的参数。我们将该方法应用于乘性噪声下的阻尼振荡器、范德波振荡器、洛特卡-沃尔泰拉系统和洛伦兹系统。训练后的神经网络能够准确估计参数,这从其潜在动力学与真实系统高度吻合可以看出;真实轨迹与估计轨迹的可视化对比进一步印证了方法的精度和稳健性。我们的研究表明,由Huber损失引导的神经网络是一种通用的参数估计工具,能够有效揭示非线性系统中的复杂关系,并能自适应地应对噪声与不确定性,展现出其面向真实世界挑战的适应能力。
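
A self-contained toy version of the described setup: noisy damped-oscillator trajectories are generated from known parameters, and a small network trained with a Huber loss maps each trajectory to its (damping, frequency) pair. The architecture, data ranges, and hyperparameters are placeholders, not the paper's.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic damped-oscillator data: x(t) = exp(-gamma * t) * cos(omega * t) + noise.
t = np.linspace(0.0, 10.0, 200)

def trajectory(gamma, omega, noise=0.05, rng=np.random.default_rng(0)):
    return np.exp(-gamma * t) * np.cos(omega * t) + noise * rng.standard_normal(t.size)

params = np.random.default_rng(1).uniform([0.1, 1.0], [1.0, 5.0], size=(1024, 2))
series = np.stack([trajectory(g, w) for g, w in params]).astype(np.float32)

# Small MLP maps a noisy trajectory to (gamma, omega); Huber loss is robust to the noise.
net = nn.Sequential(nn.Linear(t.size, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.HuberLoss()

X = torch.from_numpy(series)
Y = torch.from_numpy(params.astype(np.float32))
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(net(X), Y)
    loss.backward()
    opt.step()
```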

FOSA: Full Information Maximum Likelihood (FIML) Optimized Self-Attention Imputation for Missing Data

  • paper_url: http://arxiv.org/abs/2308.12388
  • repo_url: https://github.com/oudeng/fosa
  • paper_authors: Ou Deng, Qun Jin
  • for: 填充缺失数据,特别是复杂的数据集中的缺失值。
  • methods: 融合FIML估计和自注意力神经网络的FIML优化自注意力(FOSA)框架。
  • results: FOSA的实验表明,它在模拟数据和实际数据集上具有优于传统FIML技术的优势,包括准确性、计算效率和数据结构的适应性。即使SEM可能不准确,FOSA的自注意力结构仍能修复和优化投入值。FOSA在40%随机缺失情况下也能提供优秀的预测结果,证明其 Robustness 和广泛应用的潜力。
    Abstract In data imputation, effectively addressing missing values is pivotal, especially in intricate datasets. This paper delves into the FIML Optimized Self-attention (FOSA) framework, an innovative approach that amalgamates the strengths of Full Information Maximum Likelihood (FIML) estimation with the capabilities of self-attention neural networks. Our methodology commences with an initial estimation of missing values via FIML, subsequently refining these estimates by leveraging the self-attention mechanism. Our comprehensive experiments on both simulated and real-world datasets underscore FOSA's pronounced advantages over traditional FIML techniques, encapsulating facets of accuracy, computational efficiency, and adaptability to diverse data structures. Intriguingly, even in scenarios where the Structural Equation Model (SEM) might be mis-specified, leading to suboptimal FIML estimates, the robust architecture of FOSA's self-attention component adeptly rectifies and optimizes the imputation outcomes. Our empirical tests reveal that FOSA consistently delivers commendable predictions, even in the face of up to 40% random missingness, highlighting its robustness and potential for wide-scale applications in data imputation.
    摘要 在数据填充中,有效地处理缺失值是非常重要,特别是在复杂的数据集中。这篇论文探讨了FIML优化自注意(FOSA)框架,这是一种将FIML估计的优点与自注意神经网络的能力相结合的创新方法。我们的方法开始于初步估计缺失值via FIML,然后通过自注意机制来进一步改进这些估计。我们的全面实验表明,FOSA在 simulate 和实际数据集上具有明显的优势,包括精度、计算效率和适应不同数据结构。即使SEM可能是错误的,导致FIML估计不佳,FOSA的自注意结构仍能够正确地修正和优化填充结果。我们的实验表明,FOSA在40%的随机缺失情况下仍然能够提供优秀的预测结果,这表明其Robustness和广泛应用的潜力。

Open-set Face Recognition with Neural Ensemble, Maximal Entropy Loss and Feature Augmentation

  • paper_url: http://arxiv.org/abs/2308.12371
  • repo_url: None
  • paper_authors: Rafael Henrique Vareto, Manuel Günther, William Robson Schwartz
  • for: 开放集face认证问题中,生物 metric系统缺乏所有已注册的主体的完整知识,因此需要避免未注册的主体的面征amples被识别为先前注册的标识体。
  • methods: 该研究提出一种新的方法,即将 ensemble of 精简神经网络与边缘基于的成本函数结合,通过采用外部数据库或在训练时间使用新的混合特征生成方法来获取补充的负样本。
  • results: 研究在well-known LFW和IJB-C数据集上进行了实验,结果显示该方法能够提高closed和开放集标识率。
    Abstract Open-set face recognition refers to a scenario in which biometric systems have incomplete knowledge of all existing subjects. Therefore, they are expected to prevent face samples of unregistered subjects from being identified as previously enrolled identities. This watchlist context adds an arduous requirement that calls for the dismissal of irrelevant faces by focusing mainly on subjects of interest. As a response, this work introduces a novel method that associates an ensemble of compact neural networks with a margin-based cost function that explores additional samples. Supplementary negative samples can be obtained from external databases or synthetically built at the representation level in training time with a new mix-up feature augmentation approach. Deep neural networks pre-trained on large face datasets serve as the preliminary feature extraction module. We carry out experiments on well-known LFW and IJB-C datasets where results show that the approach is able to boost closed and open-set identification rates.
    摘要 开放集人脸识别指的是生物特征识别系统对所有可能出现的对象掌握的信息并不完整的场景。因此,系统需要防止未注册人员的人脸样本被识别为先前已注册的身份。这种观察名单(watchlist)场景提出了一项严苛的要求:系统需要主要关注目标人员,并摒弃无关人脸。为此,本研究提出了一种新方法,将一组紧凑神经网络的集成与一种基于间隔(margin)的代价函数相结合,该代价函数还会利用额外样本;补充的负样本可以来自外部数据库,也可以在训练时通过一种新的混合(mix-up)特征增强方法在表示层合成构建。在大规模人脸数据集上预训练的深度神经网络则作为初始特征提取模块。我们在知名的LFW和IJB-C数据集上进行了实验,结果表明该方法能够同时提升闭集与开放集识别率。

SafeAR: Towards Safer Algorithmic Recourse by Risk-Aware Policies

  • paper_url: http://arxiv.org/abs/2308.12367
  • repo_url: None
  • paper_authors: Haochen Wu, Shubham Sharma, Sunandita Patra, Sriram Gopalakrishnan
  • for: 提供了一种基于机器学习模型的决策中的抗议和改善机制,以便在决策中帮助人们更好地处理不利的结果。
  • methods: 使用了sequential algorithmic recourse的方法,考虑了变量的不确定性和风险,并使用了金融领域的Value at Risk和Conditional Value at Risk等风险措施来衡量风险。
  • results: 通过应用该方法于两个实际数据集,发现了不同风险偏好的策略之间的区别,并且在使用不同的抗议措施时,可以更好地满足用户的需求。
    Abstract With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse for those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receive a favorable decision. Prior work on sequential algorithmic recourse -- which recommends a series of changes -- focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher than average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call the recourse computed with such risk considerations as Safer Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost and connect algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt measures ``Value at Risk'' and ``Conditional Value at Risk'' from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different levels of risk-aversion using risk measures and recourse desiderata (sparsity and proximity).
    摘要 随着机器学习(ML)模型在金融和医疗等关键领域的应用不断增长,为受到ML模型决策不利影响的人提供救济(recourse)变得愈发重要:应向个人提供改善自身状况的行动建议,使其最终获得有利的决策。先前关于序列算法救济(即推荐一系列改变)的工作主要关注行动的可行性,并以特征变化的邻近程度来确定行动成本,但没有考虑特征变化的不确定性以及救济成本高于平均水平的风险。如果某个救济有一定概率导致更糟糕的处境,而从中恢复又需要极高的成本,这是不可取的。因此,在计算和评估救济时必须纳入风险。我们将这种考虑了风险的救济称为Safer Algorithmic Recourse(SafeAR),其目标是让人们依据自身的风险承受度来选择救济。在这项工作中,我们讨论并展示了现有的救济准则可能无法捕捉成本偏高的风险;我们提出了一种计算救济策略的方法,该方法考虑成本的变异性,并将算法救济文献与风险敏感的强化学习联系起来。我们还借用金融领域的"Value at Risk"(在险价值)和"Conditional Value at Risk"(条件在险价值)来简洁地概括风险。我们将该方法应用于两个真实数据集,并使用风险度量和救济准则(稀疏性与邻近性)比较了具有不同风险规避程度的策略。
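
Value at Risk and Conditional Value at Risk are the standard financial tail-risk measures the paper adopts; the snippet below shows their empirical computation over sampled recourse costs. The example cost vectors are made up for illustration.

```python
import numpy as np

def value_at_risk(costs, alpha=0.95):
    """Empirical Value at Risk: the alpha-quantile of the recourse-cost samples."""
    return float(np.quantile(costs, alpha))

def conditional_value_at_risk(costs, alpha=0.95):
    """Empirical CVaR: the mean cost in the worst (1 - alpha) tail, i.e. the
    expected cost given that the cost is at least VaR."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)
    return float(costs[costs >= var].mean())

# Example: comparing two recourse policies by their tail risk.
policy_a_costs = np.array([1.0, 1.2, 1.1, 0.9, 5.0])
policy_b_costs = np.array([1.5, 1.6, 1.4, 1.7, 1.8])
print(value_at_risk(policy_a_costs), conditional_value_at_risk(policy_a_costs))
print(value_at_risk(policy_b_costs), conditional_value_at_risk(policy_b_costs))
```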

Renormalizing Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.12355
  • repo_url: None
  • paper_authors: Jordan Cotler, Semon Rezchikov
  • for: 学习 inverse renormalization group flows of statistical and quantum field theories
  • methods: 使用 diffusion models 学习 inverse 过程,并将其与物理的 renormalization group schemes 结合起来
  • results: 提出了一种基于机器学习的方法来研究场论,并实现了在 lattice field theory 中使用 adaptive bridge(或 parallel tempering)抽样器

Here's a more detailed explanation of each point:

  1. for: The paper is written for studying the inverse renormalization group flows of statistical and quantum field theories, using machine learning techniques.
  2. methods: The paper uses diffusion models to learn the inverse process of renormalization group schemes, which are explicitly specified in the context of field theories. The models are combined with adaptive bridge (or parallel tempering) samplers to efficiently explore the space of fields.
  3. results: The paper provides a new approach to studying field theories using machine learning, and demonstrates the effectiveness of the method by applying it to numerically find renormalization group flows of interacting statistical field theories. Additionally, the paper provides explicit prescriptions for comparing results derived from models associated with different renormalization group schemes, and discusses the use of diffusion models in a variational method to find ground states of quantum systems.
    Abstract We explain how to use diffusion models to learn inverse renormalization group flows of statistical and quantum field theories. Diffusion models are a class of machine learning models which have been used to generate samples from complex distributions, such as the distribution of natural images, by learning the inverse process to a diffusion process which adds noise to the data until the distribution of the data is pure noise. Nonperturbative renormalization group schemes can naturally be written as diffusion processes in the space of fields. We combine these observations in a concrete framework for building ML-based models for studying field theories, in which the models learn the inverse process to an explicitly-specified renormalization group scheme. We detail how these models define a class of adaptive bridge (or parallel tempering) samplers for lattice field theory. Because renormalization group schemes have a physical meaning, we provide explicit prescriptions for how to compare results derived from models associated to several different renormalization group schemes of interest. We also explain how to use diffusion models in a variational method to find ground states of quantum systems. We apply some of our methods to numerically find RG flows of interacting statistical field theories. From the perspective of machine learning, our work provides an interpretation of multiscale diffusion models, and gives physically-inspired suggestions for diffusion models which should have novel properties.
    摘要 我们说明如何使用扩散模型来学习统计场论与量子场论的逆重整化群流。扩散模型是一类机器学习模型,它通过学习“向数据不断加入噪声直至其分布变为纯噪声”这一扩散过程的逆过程,来从复杂分布(例如自然图像分布)中生成样本。非微扰的重整化群方案可以自然地写成场空间中的扩散过程。我们将这两点结合起来,给出一个具体框架,用基于机器学习的模型来研究场论:模型学习的是某个被显式指定的重整化群方案的逆过程。我们详细说明这些模型如何为格点场论定义一类自适应桥接(或并行回火)抽样器。由于重整化群方案具有物理意义,我们给出了明确的方法,用于比较由不同重整化群方案对应的模型所得到的结果。我们还说明了如何在变分方法中使用扩散模型来寻找量子系统的基态,并将其中一些方法应用于数值求解相互作用统计场论的重整化群流。从机器学习的角度看,这项工作为多尺度扩散模型提供了一种解释,并给出了受物理启发、有望具备新性质的扩散模型设计建议。

Improving Generative Model-based Unfolding with Schrödinger Bridges

  • paper_url: http://arxiv.org/abs/2308.12351
  • repo_url: None
  • paper_authors: Sascha Diefenbacher, Guan-Horng Liu, Vinicius Mikuni, Benjamin Nachman, Weili Nie
  • for: 这篇论文探讨基于机器学习的 unfolding 方法,用于无分箱、高维的微分截面测量。
  • methods: 这个论文使用了 Schroedinger Bridges 和 diffusion models 等方法。
  • results: 论文表明,SBUnfold 方法可以在 Synthetic Z+jets 数据集上达到优秀的性能。
    Abstract Machine learning-based unfolding has enabled unbinned and high-dimensional differential cross section measurements. Two main approaches have emerged in this research area: one based on discriminative models and one based on generative models. The main advantage of discriminative models is that they learn a small correction to a starting simulation while generative models scale better to regions of phase space with little data. We propose to use Schroedinger Bridges and diffusion models to create SBUnfold, an unfolding approach that combines the strengths of both discriminative and generative models. The key feature of SBUnfold is that its generative model maps one set of events into another without having to go through a known probability density as is the case for normalizing flows and standard diffusion models. We show that SBUnfold achieves excellent performance compared to state of the art methods on a synthetic Z+jets dataset.
    摘要 基于机器学习的 unfolding 技术使得无分箱、高维的微分截面测量成为可能。该研究方向形成了两类主要方法:基于判别模型的方法和基于生成模型的方法。判别模型的主要优点是只需学习对初始模拟的一个小修正,而生成模型则在数据稀少的相空间区域中具有更好的扩展性。我们提出使用 Schrödinger Bridge 与扩散模型来构建 SBUnfold,一种结合判别模型与生成模型优点的 unfolding 方法。SBUnfold 的关键特点是其生成模型可以将一组事件直接映射为另一组事件,而无需像 normalizing flow 和标准扩散模型那样经过一个已知的概率密度。我们在合成的 Z+jets 数据集上展示了 SBUnfold 相对于最先进方法的出色性能。

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

  • paper_url: http://arxiv.org/abs/2308.12284
  • repo_url: None
  • paper_authors: Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
  • for: 这篇论文的目的是探讨大型自然语言模型(LLM)的预训练和下游性能如何受到数据选择的影响。
  • methods: 这篇论文使用了预训练模型embeddings进行数据选择,并证明了智能重复数据可以提高预训练速度(20%的效率提升)和下游任务的均值准确率(最高达2%)。
  • results: 这篇论文的结果表明,智能数据选择可以显著提高LLM预训练的性能,并质疑了 randomly sampling web data 的常见做法。
    Abstract Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improves average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, calls into question the common practice of training for a single epoch on as much data as possible, and demonstrates a path to keep improving our models past the limits of randomly sampling web data.
    摘要 近年来,人们将越来越多的计算和数据投入到大型语言模型(LLM)的训练中,通常的做法是在从大规模网络语料中随机选取的尽可能多的 token 上进行单轮(one-pass)训练。尽管在越来越大的互联网数据上训练能带来持续的性能提升,但提升幅度随规模递减;并且除了 MinHash 等简单的去重方法之外,很少有工作研究数据选择对预训练和下游性能的影响。本文表明,在去重数据的基础上,借助预训练模型的 embedding 进行精心的数据选择,可以加快训练速度(约 20% 的效率提升),并在 6.7B 规模下将 16 个 NLP 任务的平均下游准确率提升最多 2%。此外,我们还表明,有策略地重复数据能够稳定地优于基线训练(而随机重复数据则劣于基线训练)。我们的结果说明,聪明的数据选择可以显著改进 LLM 预训练,质疑了“在尽可能多的数据上训练一个 epoch”这一常见做法,并展示了一条在随机采样网络数据的极限之外继续改进模型的路径。
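D4's central idea is choosing pre-training documents with pre-trained-model embeddings instead of random sampling. The NumPy sketch below shows one plausible reading of that idea (greedy removal of near-duplicates by cosine similarity, then diversity-aware sampling away from the pool centroid); the thresholds and the greedy scheme are illustrative assumptions and not the paper's exact recipe.

```python
import numpy as np

def select_documents(embeddings, dup_threshold=0.95, k=500, rng=None):
    """Toy document selection: drop near-duplicates, then favour diversity.

    embeddings: (N, d) array of L2-normalised document embeddings from a
                pre-trained model. All thresholds are assumptions.
    """
    rng = rng or np.random.default_rng()
    keep = []
    for i in range(len(embeddings)):
        # Near-duplicate check against already-kept documents (cosine sim).
        if keep and np.max(embeddings[keep] @ embeddings[i]) > dup_threshold:
            continue
        keep.append(i)
    keep = np.array(keep)
    # Diversify: prefer documents far from the centroid of the kept pool.
    centroid = embeddings[keep].mean(axis=0)
    dist = np.linalg.norm(embeddings[keep] - centroid, axis=1)
    probs = dist / dist.sum()
    return rng.choice(keep, size=min(k, len(keep)), replace=False, p=probs)

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(len(select_documents(emb, k=500, rng=rng)))
```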

Extended Linear Regression: A Kalman Filter Approach for Minimizing Loss via Area Under the Curve

  • paper_url: http://arxiv.org/abs/2308.12280
  • repo_url: None
  • paper_authors: Gokulprasath R
  • for: 增强线性回归模型,使用kalman filter和分析曲线面积来降低损失。
  • methods: 使用随机梯度下降(SGD)更新参数,并使用kalman filter来预测下一个融合参数。
  • results: 实现了一个优化的线性回归方程,并且可以避免常量参数更新和使用完整数据集。但需要考虑计算复杂性。
    Abstract This research enhances linear regression models by integrating a Kalman filter and analysing curve areas to minimize loss. The goal is to develop an optimal linear regression equation using stochastic gradient descent (SGD) for weight updating. Our approach involves a stepwise process, starting with user-defined parameters. The linear regression model is trained using SGD, tracking weights and loss separately and zipping them finally. A Kalman filter is then trained based on weight and loss arrays to predict the next consolidated weights. Predictions result from multiplying input averages with weights, evaluated for loss to form a weight-versus-loss curve. The curve's equation is derived using the two-point formula, and area under the curve is calculated via integration. The linear regression equation with minimum area becomes the optimal curve for prediction. Benefits include avoiding constant weight updates via gradient descent and working with partial datasets, unlike methods needing the entire set. However, computational complexity should be considered. The Kalman filter's accuracy might diminish beyond a certain prediction range.
    摘要 本研究通过引入卡尔曼滤波器并分析曲线下面积来改进线性回归模型,以最小化损失。目标是利用随机梯度下降(SGD)更新权重,得到最优的线性回归方程。我们的方法是一个逐步流程:从用户定义的参数出发,用 SGD 训练线性回归模型,分别记录权重和损失,最后将二者配对;随后基于权重与损失序列训练卡尔曼滤波器,预测下一组整合后的权重。预测值由输入均值与权重相乘得到,并依据其损失构成“权重—损失”曲线;曲线方程由两点式推导,曲线下面积通过积分计算。面积最小的线性回归方程即为用于预测的最优方程。该方法的优点包括无需通过梯度下降不断更新权重,且可以只使用部分数据集,而不像某些方法那样需要完整数据集;但计算复杂度需要加以考虑,并且卡尔曼滤波器的精度在超出一定预测范围后可能下降。

On-Manifold Projected Gradient Descent

  • paper_url: http://arxiv.org/abs/2308.12279
  • repo_url: https://github.com/JonasGrabbe/GradientDecentOnManifolds
  • paper_authors: Aaron Mahler, Tyrus Berry, Tom Stephens, Harbir Antil, Michael Merritt, Jeanie Schreiber, Ioannis Kevrekidis
  • for: 这项研究的目的是为高维数据的类别流形提供可计算、直接且数学上严格的微分几何近似,以及从输入空间到这些类别流形的非线性投影。
  • methods: 这项研究使用共形不变扩散映射(CIDM)在扩散坐标中近似类别流形,并发展 Nyström 投影将新样本点投影到类别流形上;此外还利用谱外微积分(SEC)确定流形上的几何量,例如切向量。
  • results: 这项研究得到了一种能够生成位于类别流形之上、却仍能欺骗分类器的对抗样本的方法;通过将这些流形上的对抗样本表示在流形的语义基底中,相应的误分类可以用人类可理解的数据操作来解释。
    Abstract This work provides a computable, direct, and mathematically rigorous approximation to the differential geometry of class manifolds for high-dimensional data, along with nonlinear projections from input space onto these class manifolds. The tools are applied to the setting of neural network image classifiers, where we generate novel, on-manifold data samples, and implement a projected gradient descent algorithm for on-manifold adversarial training. The susceptibility of neural networks (NNs) to adversarial attack highlights the brittle nature of NN decision boundaries in input space. Introducing adversarial examples during training has been shown to reduce the susceptibility of NNs to adversarial attack; however, it has also been shown to reduce the accuracy of the classifier if the examples are not valid examples for that class. Realistic "on-manifold" examples have been previously generated from class manifolds in the latent of an autoencoder. Our work explores these phenomena in a geometric and computational setting that is much closer to the raw, high-dimensional input space than can be provided by VAE or other black box dimensionality reductions. We employ conformally invariant diffusion maps (CIDM) to approximate class manifolds in diffusion coordinates, and develop the Nystr\"{o}m projection to project novel points onto class manifolds in this setting. On top of the manifold approximation, we leverage the spectral exterior calculus (SEC) to determine geometric quantities such as tangent vectors of the manifold. We use these tools to obtain adversarial examples that reside on a class manifold, yet fool a classifier. These misclassifications then become explainable in terms of human-understandable manipulations within the data, by expressing the on-manifold adversary in the semantic basis on the manifold.
    摘要 本工作为高维数据的类别流形提供了一种可计算、直接且数学上严格的微分几何近似,以及从输入空间到这些类别流形的非线性投影。我们将这些工具应用于神经网络图像分类器:生成新的、位于流形上的数据样本,并实现一种用于流形上对抗训练的投影梯度下降算法。神经网络(NN)对对抗攻击的敏感性凸显了其决策边界在输入空间中的脆弱性。在训练中引入对抗样本可以降低网络对对抗攻击的敏感性,但如果这些样本并非该类别的有效样本,也会降低分类器的准确率。以往的工作曾在自编码器的隐空间中由类别流形生成较为真实的“流形上”样本;而我们的工作在一个更接近原始高维输入空间的几何与计算框架中研究这些现象,这是 VAE 或其他黑箱降维方法难以提供的。我们采用共形不变扩散映射(CIDM)在扩散坐标中近似类别流形,并发展 Nyström 投影将新样本点投影到类别流形上;在流形近似之上,我们利用谱外微积分(SEC)确定流形的几何量(如切向量)。借助这些工具,我们得到了位于类别流形上却能欺骗分类器的对抗样本;通过在流形的语义基底中表示这些流形上的对抗样本,相应的误分类能够以人类可理解的数据操作来解释。

LCANets++: Robust Audio Classification using Multi-layer Neural Networks with Lateral Competition

  • paper_url: http://arxiv.org/abs/2308.12882
  • repo_url: None
  • paper_authors: Sayanton V. Dibbo, Juston S. Moore, Garrett T. Kenyon, Michael A. Teti
  • for: Audio classification aims to recognize audio signals, including speech commands or sound events, but current audio classifiers are vulnerable to perturbations and adversarial attacks.
  • methods: To address these challenges, the paper introduces LCANets++, which are CNNs that perform sparse coding in multiple layers via the Locally Competitive Algorithm (LCA).
  • results: LCANets++ are more robust than standard CNNs and LCANets against perturbations and adversarial attacks, such as background noise and black-box and white-box attacks.
    Abstract Audio classification aims at recognizing audio signals, including speech commands or sound events. However, current audio classifiers are susceptible to perturbations and adversarial attacks. In addition, real-world audio classification tasks often suffer from limited labeled data. To help bridge these gaps, previous work developed neuro-inspired convolutional neural networks (CNNs) with sparse coding via the Locally Competitive Algorithm (LCA) in the first layer (i.e., LCANets) for computer vision. LCANets learn in a combination of supervised and unsupervised learning, reducing dependency on labeled samples. Motivated by the fact that auditory cortex is also sparse, we extend LCANets to audio recognition tasks and introduce LCANets++, which are CNNs that perform sparse coding in multiple layers via LCA. We demonstrate that LCANets++ are more robust than standard CNNs and LCANets against perturbations, e.g., background noise, as well as black-box and white-box attacks, e.g., evasion and fast gradient sign (FGSM) attacks.
    摘要 音频分类的目标是识别音频信号,包括语音命令或声音事件。然而,当前的音频分类器容易受到扰动和对抗攻击的影响;此外,现实世界中的音频分类任务往往只有有限的标注数据。为弥补这些差距,先前的工作针对计算机视觉提出了受神经科学启发的卷积神经网络(CNN),在第一层通过局部竞争算法(LCA)进行稀疏编码,称为 LCANets。LCANets 结合监督学习与无监督学习,从而降低了对标注样本的依赖。鉴于听觉皮层同样是稀疏的,我们将 LCANets 扩展到音频识别任务,并提出 LCANets++,即在多个层中通过 LCA 进行稀疏编码的 CNN。实验表明,LCANets++ 相比标准 CNN 和 LCANets,对扰动(例如背景噪声)以及黑盒与白盒攻击(例如规避攻击和快速梯度符号攻击 FGSM)都更加鲁棒。
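LCANets++ build on the Locally Competitive Algorithm (LCA) for sparse coding in several layers. Below is a compact NumPy sketch of single-layer LCA dynamics with a soft threshold; the dictionary, step size, and threshold values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def lca_sparse_code(x, dictionary, lam=0.1, tau=0.1, n_steps=200):
    """Locally Competitive Algorithm: find a sparse code a with x ≈ D a.

    x:          (d,) input vector (e.g. a spectrogram patch).
    dictionary: (d, k) matrix D with unit-norm columns.
    Membrane potentials u evolve with lateral competition between units;
    the code a is the soft-thresholded potential.
    """
    drive = dictionary.T @ x                     # feed-forward input b = Dᵀx
    gram = dictionary.T @ dictionary             # lateral interactions G = DᵀD
    inhibit = gram - np.eye(gram.shape[0])       # no self-inhibition
    u = np.zeros(dictionary.shape[1])
    for _ in range(n_steps):
        a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # soft threshold
        u = u + tau * (drive - u - inhibit @ a)              # leaky dynamics
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0, keepdims=True)
a = lca_sparse_code(rng.normal(size=64), D)
print(f"{np.count_nonzero(a)} of {a.size} coefficients are active")
```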

Language Reward Modulation for Pretraining Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12270
  • repo_url: https://github.com/ademiadeniji/lamp
  • paper_authors: Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel
  • for: 本文重新审视了使用学习奖励函数(LRF)来解决稀疏奖励强化学习(RL)任务的做法。
  • methods: 提出将视觉-语言模型(VLM)作为 RL 的预训练信号而非下游任务奖励来使用:利用冻结的预训练 VLM,通过计算多样化语言指令与智能体在预训练环境中的图像观测之间的对比对齐,规模化地生成带噪声但有形状的探索奖励,并与常规的新颖性探索奖励一起用强化学习进行优化。
  • results: 得到的以语言为条件的预训练策略能够在 RLBench 的机器人操作任务上为样本高效学习提供热启动。
    Abstract Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded some steady progress in task-complexity through the years. In this work, we question whether today's LRFs are best-suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP) which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards with reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.
    摘要 多年来,使用学习奖励函数(LRF)来解决稀疏奖励强化学习(RL)任务,使得可处理的任务复杂度稳步提升。本文质疑当前的 LRF 是否最适合直接替代任务奖励,转而提出将 LRF 的能力用作 RL 的预训练信号。具体而言,我们提出 LAnguage Reward Modulated Pretraining(LAMP),将视觉-语言模型(VLM)的零样本能力用作 RL 的预训练工具,而非下游任务奖励。LAMP 使用一个冻结的预训练 VLM,通过计算高度多样化的语言指令集合与智能体在预训练环境中的图像观测之间的对比对齐,规模化地生成带噪声但有形状的探索奖励;并在强化学习中将这些奖励与标准的新颖性探索奖励一起优化,从而获得一个以语言为条件的预训练策略。与以往直接利用 LRF 的尝试不同,我们的 VLM 预训练方法能够在 RLBench 的机器人操作任务上为样本高效学习提供热启动。
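LAMP shapes exploration rewards from the contrastive alignment between language instructions and image observations produced by a frozen VLM. The sketch below only illustrates that reward computation; `encode_image` and `encode_text` are hypothetical placeholders standing in for a pretrained CLIP-style encoder, and taking the maximum over instructions is an illustrative choice rather than LAMP's exact rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(obs):
    """Placeholder for a frozen VLM image encoder (assumption, not LAMP's API)."""
    return rng.normal(size=512)

def encode_text(instruction):
    """Placeholder for the matching frozen text encoder (assumption)."""
    return rng.normal(size=512)

def language_shaped_reward(obs, instructions):
    """Noisy exploration reward: cosine alignment between the current
    observation embedding and a diverse batch of language instructions."""
    img = encode_image(obs)
    img = img / np.linalg.norm(img)
    txt = np.stack([encode_text(t) for t in instructions])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    scores = txt @ img                      # cosine similarities per instruction
    return float(scores.max())              # illustrative aggregation choice

instructions = ["pick up the red block", "open the drawer", "push the button"]
print(language_shaped_reward(obs=None, instructions=instructions))
```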

FECoM: A Step towards Fine-Grained Energy Measurement for Deep Learning

  • paper_url: http://arxiv.org/abs/2308.12264
  • repo_url: None
  • paper_authors: Saurabhsingh Rajput, Tim Widmayer, Ziyuan Shang, Maria Kechagia, Federica Sarro, Tushar Sharma
  • for: 这篇论文的目的是来提高深度学习(Deep Learning)模型的能源消耗量的测量和优化。
  • methods: 这篇论文使用了精确的能源消耗量测量方法,即 Fine-grained Energy Consumption Meter(FECoM),并考虑了不同因素,例如计算负载和温度稳定性。
  • results: 这篇论文使用 FECoM 来测量 TensorFlow 框架中 API 的能源消耗量,并 investigate 了参数大小和执行时间对能源消耗量的影响。
    Abstract With the increasing usage, scale, and complexity of Deep Learning (DL) models, their rapidly growing energy consumption has become a critical concern. Promoting green development and energy awareness at different granularities is the need of the hour to limit carbon emissions of DL systems. However, the lack of standard and repeatable tools to accurately measure and optimize energy consumption at a fine granularity (e.g., at method level) hinders progress in this area. In this paper, we introduce FECoM (Fine-grained Energy Consumption Meter), a framework for fine-grained DL energy consumption measurement. Specifically, FECoM provides researchers and developers a mechanism to profile DL APIs. FECoM addresses the challenges of measuring energy consumption at fine-grained level by using static instrumentation and considering various factors, including computational load and temperature stability. We assess FECoM's capability to measure fine-grained energy consumption for one of the most popular open-source DL frameworks, namely TensorFlow. Using FECoM, we also investigate the impact of parameter size and execution time on energy consumption, enriching our understanding of TensorFlow APIs' energy profiles. Furthermore, we elaborate on the considerations, issues, and challenges that one needs to consider while designing and implementing a fine-grained energy consumption measurement tool. We hope this work will facilitate further advances in DL energy measurement and the development of energy-aware practices for DL systems.
    摘要 随着深度学习(DL)模型的使用规模和复杂度不断增长,其快速上升的能耗已成为一个关键问题。在不同粒度上推动绿色发展与能耗意识,是限制 DL 系统碳排放的当务之急。然而,目前缺乏能够在细粒度(例如方法级别)上准确测量并优化能耗的标准化、可复现工具,阻碍了该领域的进展。本文提出 FECoM(Fine-grained Energy Consumption Meter),一个用于细粒度测量 DL 能耗的框架,为研究人员和开发者提供剖析 DL API 能耗的机制。FECoM 通过静态插桩并考虑计算负载、温度稳定性等多种因素,来应对细粒度能耗测量的挑战。我们用 FECoM 测量了最流行的开源 DL 框架之一 TensorFlow 的 API 能耗,并研究了参数规模与执行时间对能耗的影响,从而加深了对 TensorFlow API 能耗特征的理解。此外,我们还讨论了设计与实现细粒度能耗测量工具时需要考虑的问题与挑战。希望这项工作能够推动 DL 能耗测量的进一步发展,并促进面向 DL 系统的能耗感知实践。
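FECoM profiles energy at the level of individual DL API calls via static instrumentation. The decorator below is a highly simplified stand-in that only records wall-clock time and two GPU power samples through `nvidia-smi`; FECoM's actual probes, stability checks, and energy model are not reproduced, and the crude trapezoid power-times-time estimate is an assumption for illustration only.

```python
import functools
import subprocess
import time

def gpu_power_watts():
    """Read instantaneous GPU power draw via nvidia-smi; None if unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"], text=True)
        return float(out.strip().splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None

def measure_energy(fn):
    """Sketch of method-level measurement: sample power before and after the
    call and approximate energy as mean power times elapsed time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        p0, t0 = gpu_power_watts(), time.perf_counter()
        result = fn(*args, **kwargs)
        t1, p1 = time.perf_counter(), gpu_power_watts()
        elapsed = t1 - t0
        if p0 is not None and p1 is not None:
            energy_j = 0.5 * (p0 + p1) * elapsed      # crude trapezoid estimate
            print(f"{fn.__name__}: {elapsed:.4f}s, ~{energy_j:.2f} J")
        else:
            print(f"{fn.__name__}: {elapsed:.4f}s (no GPU power reading)")
        return result
    return wrapper

@measure_energy
def dummy_api_call(n=100_000):
    return sum(i * i for i in range(n))

dummy_api_call()
```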

Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders

  • paper_url: http://arxiv.org/abs/2308.12256
  • repo_url: None
  • paper_authors: Yueqi Wang, Yoni Halpern, Shuo Chang, Jingchen Feng, Elaine Ya Le, Longfei Li, Xujian Liang, Min-Cheng Huang, Shane Li, Alex Beutel, Yaping Zhang, Shuchao Bi
  • for: 这个论文主要是为了提高sequential retrieval模型中对负反馈的学习和应用。
  • methods: 这篇论文将显式和隐式负反馈纳入 sequential retrieval 模型的训练目标,使用一个 “not-to-recommend” 损失函数来优化“不推荐带有负反馈物品”的对数似然。
  • results: 实验结果表明,通过这种方法可以提高sequential retrieval模型的应对负反馈性能,并且通过对不同用户行为进行对照分析,提高了推荐器的反应性。
    Abstract Sequential recommenders have been widely used in industry due to their strength in modeling user preferences. While these models excel at learning a user's positive interests, less attention has been paid to learning from negative user feedback. Negative user feedback is an important lever of user control, and comes with an expectation that recommenders should respond quickly and reduce similar recommendations to the user. However, negative feedback signals are often ignored in the training objective of sequential retrieval models, which primarily aim at predicting positive user interactions. In this work, we incorporate explicit and implicit negative user feedback into the training objective of sequential recommenders in the retrieval stage using a "not-to-recommend" loss function that optimizes for the log-likelihood of not recommending items with negative feedback. We demonstrate the effectiveness of this approach using live experiments on a large-scale industrial recommender system. Furthermore, we address a challenge in measuring recommender responsiveness to negative feedback by developing a counterfactual simulation framework to compare recommender responses between different user actions, showing improved responsiveness from the modeling change.
    摘要 序列推荐器因其对用户偏好的建模能力而在工业界得到广泛应用。这类模型擅长学习用户的正向兴趣,但对从负向用户反馈中学习的关注较少。负反馈是用户控制的重要手段,用户期望推荐系统能快速响应并减少类似的推荐;然而,序列检索模型的训练目标通常只预测正向的用户交互,负反馈信号往往被忽略。在这项工作中,我们在检索阶段将显式与隐式负反馈纳入序列推荐器的训练目标,使用一个“not-to-recommend”损失函数来优化“不推荐带有负反馈物品”的对数似然。我们在一个大规模工业推荐系统上通过线上实验验证了该方法的有效性。此外,针对“如何度量推荐器对负反馈的响应性”这一难题,我们构建了一个反事实模拟框架,用于比较不同用户行为下的推荐响应,结果显示建模上的改动带来了更好的响应性。
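The training change described above adds a "not-to-recommend" term that maximises the log-likelihood of not retrieving items with negative feedback. A minimal NumPy sketch of such a loss on top of a softmax retrieval head follows; the relative weighting of positive and negative terms is an assumption, not the paper's setting.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def retrieval_loss(logits, positive_id, negative_ids, neg_weight=1.0):
    """Combine the usual positive log-likelihood with a not-to-recommend term.

    logits:       (num_items,) scores from the sequential retrieval model.
    positive_id:  item the user actually engaged with.
    negative_ids: items that received explicit/implicit negative feedback.
    """
    p = softmax(logits)
    pos_term = -np.log(p[positive_id] + 1e-12)                 # recommend this
    neg_term = -np.log(1.0 - p[negative_ids] + 1e-12).sum()    # do NOT recommend these
    return pos_term + neg_weight * neg_term

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
print(retrieval_loss(logits, positive_id=3, negative_ids=np.array([7, 42])))
```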

How Safe Am I Given What I See? Calibrated Prediction of Safety Chances for Image-Controlled Autonomy

  • paper_url: http://arxiv.org/abs/2308.12252
  • repo_url: https://github.com/maozj6/hsai-predictor
  • paper_authors: Zhenjiang Mao, Carson Sobolewski, Ivan Ruchkin
  • for: 这篇论文是为了解决自动化系统的安全验证问题而写的。
  • methods: 该论文提出了一族基于生成式世界模型的可配置学习管道,不需要低维状态;它解决了预测引入的数据分布偏移问题,并基于 conformal prediction 为安全概率预测提供统计校准保证。
  • results: 在两个图像控制系统(赛车与倒立摆)的案例研究中,对所提出的学习管道进行了广泛评估,其安全概率预测带有统计校准保证。
    Abstract End-to-end learning has emerged as a major paradigm for developing autonomous systems. Unfortunately, with its performance and convenience comes an even greater challenge of safety assurance. A key factor of this challenge is the absence of the notion of a low-dimensional and interpretable dynamical state, around which traditional assurance methods revolve. Focusing on the online safety prediction problem, this paper proposes a configurable family of learning pipelines based on generative world models, which do not require low-dimensional states. To implement these pipelines, we overcome the challenges of learning safety-informed latent representations and missing safety labels under prediction-induced distribution shift. These pipelines come with statistical calibration guarantees on their safety chance predictions based on conformal prediction. We perform an extensive evaluation of the proposed learning pipelines on two case studies of image-controlled systems: a racing car and a cartpole.
    摘要 端到端学习已经成为开发自主系统的主要范式。然而,伴随其性能与便利而来的是更为艰巨的安全保障挑战,其中的一个关键因素是缺少低维且可解释的动力学状态,而传统的安全保障方法正是围绕这类状态展开的。针对在线安全预测问题,本文提出了一族可配置的、基于生成式世界模型的学习管道,它们不需要低维状态。为实现这些管道,我们解决了在预测引入的分布偏移下学习蕴含安全信息的隐表示、以及安全标签缺失的问题。这些管道基于 conformal prediction,为其安全概率预测提供统计校准保证。我们在两个图像控制系统(一辆赛车和一个倒立摆)的案例研究中对所提出的学习管道进行了广泛评估。
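The pipelines above attach statistical calibration guarantees to their safety-chance predictions via conformal prediction. Here is a small NumPy sketch of split-conformal calibration for a probability-like safety score; the nonconformity score (absolute error on a held-out calibration set) is a common textbook choice and an assumption about the specifics, not the paper's exact construction.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.05):
    """Split-conformal prediction intervals around predicted safety chances.

    cal_pred, cal_true: predictions and observed safety outcomes (in [0, 1])
                        on a held-out calibration set.
    test_pred:          new predictions to wrap with (1 - alpha) intervals.
    """
    scores = np.abs(cal_pred - cal_true)                   # nonconformity scores
    n = len(scores)
    # Finite-sample corrected quantile used in split conformal prediction.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    lo = np.clip(test_pred - q, 0.0, 1.0)
    hi = np.clip(test_pred + q, 0.0, 1.0)
    return lo, hi

rng = np.random.default_rng(0)
cal_true = rng.uniform(size=500)
cal_pred = np.clip(cal_true + rng.normal(scale=0.05, size=500), 0, 1)
lo, hi = conformal_interval(cal_pred, cal_true, test_pred=np.array([0.9, 0.4]))
print(list(zip(lo.round(3), hi.round(3))))
```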

How to Protect Copyright Data in Optimization of Large Language Models?

  • paper_url: http://arxiv.org/abs/2308.12247
  • repo_url: None
  • paper_authors: Timothy Chu, Zhao Song, Chiwun Yang
  • for: 本研究旨在解决大语言模型(LLMs)训练和优化过程中是否生成版权数据的问题。
  • methods: 本研究使用了softmax回归问题来解决大语言模型训练和优化问题,并提出了一种有效地实现softmax回归的方法,以避免生成版权数据。
  • results: 本研究显示,可以通过视为softmax回归问题来有效地训练和优化大语言模型,并避免生成版权数据。这种方法提供了一种理论上的训练大语言模型的方法,以避免生成版权数据。
    Abstract Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. In this paper, we show that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyright data. This establishes a theoretical method of training large language models in a way that avoids generating copyright data.
    摘要 大型语言模型(LLM)与生成式 AI 在计算机研究和应用中发挥了变革性的作用。与此同时也出现了争议:如果模型的训练数据受版权保护,模型的输出就可能包含受版权保护的数据。LLM 建立在 Transformer 神经网络架构之上,而 Transformer 依赖一种名为 Attention 的数学计算,其中使用了 softmax 函数。本文表明,大型语言模型的训练与优化可以被视为一个 softmax 回归问题;在此基础上,我们给出了一种高效执行 softmax 回归、同时防止回归函数生成受版权保护数据的方法,从而在理论上建立了一种避免生成版权数据的大型语言模型训练方式。
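The paper frames LLM training and optimization as a softmax regression problem. For reference, here is a minimal NumPy sketch of plain softmax regression (cross-entropy objective and its gradient); the paper's copyright-avoiding regularised variant is not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_regression_step(X, Y, W, lr=0.1):
    """One gradient step on the cross-entropy loss of softmax regression.

    X: (n, d) inputs, Y: (n, k) one-hot targets, W: (d, k) weights.
    """
    P = softmax(X @ W)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    grad = X.T @ (P - Y) / X.shape[0]
    return W - lr * grad, loss

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
labels = rng.integers(0, 3, size=256)
Y = np.eye(3)[labels]
W = np.zeros((10, 3))
for step in range(200):
    W, loss = softmax_regression_step(X, Y, W)
print(f"final training loss: {loss:.3f}")
```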

Multi-Objective Optimization for Sparse Deep Neural Network Training

  • paper_url: http://arxiv.org/abs/2308.12243
  • repo_url: https://github.com/salomonhotegni/mdmtn
  • paper_authors: S. S. Hotegni, S. Peitz, M. Berkemeier
  • for: 本研究旨在提出一种多目标优化算法,用于在深度学习中训练多任务模型。
  • methods: 本研究使用修改后的Weighted Chebyshev scalarization方法,将多任务问题转化为一系列单目标问题,然后使用扩展拉格朗日方法解决。
  • results: 实验结果表明,通过在训练过程中动态减少模型中的参数数量,可以降低模型的计算成本,同时不会影响模型的性能。
    Abstract Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. Code is available at https://github.com/salomonhotegni/MDMTN.
    摘要 在各种深度学习场景中会自然地出现相互冲突的优化准则:它们既可能对应不同的主要任务(如多任务学习),也可能对应主要任务与次要目标,例如损失最小化与稀疏化。通常的做法是对这些准则做简单加权,而这在形式上只在凸设定下成立。本文提出一种多目标优化算法,使用改进的加权 Chebyshev 标量化(Weighted Chebyshev scalarization)来针对多个任务训练深度神经网络(DNN)。借助这种标量化技术,算法能够找到原问题的所有最优解,同时将问题的复杂度降为一系列单目标问题;这些简化后的问题再用增广拉格朗日方法求解,从而既能使用 Adam、随机梯度下降等流行的优化技术,又能有效处理约束。我们的工作旨在应对 DNN 模型的(经济与生态)可持续性问题,尤其是通常带有大量权重、需要在多个任务上同样表现良好的深度多任务模型。在两个机器学习数据集上的实验表明,只要允许对网络权重做任务相关的调整,就可以在训练过程中自适应地稀疏化模型,而不会明显影响其性能。代码可在 https://github.com/salomonhotegni/MDMTN 获取。
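The training scheme relies on a (modified) weighted Chebyshev scalarization to turn the vector of task losses into a sequence of single-objective problems. Below is a plain NumPy sketch of the standard weighted Chebyshev scalarization; the paper's modification and its augmented-Lagrangian constraint handling are not reproduced.

```python
import numpy as np

def weighted_chebyshev(losses, weights, ideal_point):
    """Standard weighted Chebyshev scalarization of a loss vector.

    losses:      (m,) current values of the m objectives (e.g. task losses).
    weights:     (m,) positive preference weights, typically summing to 1.
    ideal_point: (m,) per-objective utopia/reference values z*.
    Returns the scalar max_i w_i * (f_i - z*_i).
    """
    return np.max(weights * (np.asarray(losses) - np.asarray(ideal_point)))

# Sweeping the weights traces out (weakly) Pareto-optimal trade-offs.
losses = np.array([0.8, 0.3])            # e.g. [task loss, sparsity penalty]
ideal = np.zeros(2)
for w1 in (0.2, 0.5, 0.8):
    w = np.array([w1, 1.0 - w1])
    print(f"w={w}, scalarized objective={weighted_chebyshev(losses, w, ideal):.3f}")
```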

Critical Learning Periods Emerge Even in Deep Linear Networks

  • paper_url: http://arxiv.org/abs/2308.12221
  • repo_url: None
  • paper_authors: Michael Kleinman, Alessandro Achille, Stefano Soatto
  • for: 这篇论文探讨了深度网络中的关键学习期(critical learning periods),并解释了这些时期出现的原因。
  • methods: 作者使用深度线性网络模型,并通过理论分析与实验研究关键学习期的特点及其影响。
  • results: 研究发现,关键学习期的出现与网络深度及数据分布结构有关,特征学习还与不同信息来源之间的竞争相联系。此外,在多任务学习中,对某些任务进行预训练可能损害新任务上的迁移性能,其程度取决于任务之间的关系以及预训练阶段的时长。
    Abstract Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology. Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks.
    摘要 关键学习期是发育早期的一段时期,在此期间暂时性的感官缺失可能对行为和学得的表示产生永久性影响。尽管生物网络与人工网络存在根本差异,关键学习期却在两类系统中均被实证观察到,这表明关键学习期可能是学习本身的基本属性,而非生物学上的偶然。然而,深度网络中为何会出现关键学习期仍是一个悬而未决的问题,尤其是尚不清楚两类系统中观察到的关键学习期是否依赖于特定的网络结构或优化细节。为分离其中的关键因素,我们聚焦于深度线性网络模型,并发现令人意外的是,这类网络同样表现出生物网络和人工网络中的大部分相关行为,同时又便于进行解析处理。我们证明关键学习期取决于模型的深度和数据分布的结构,并通过解析与仿真表明特征的学习与不同信息来源之间的竞争紧密相关。最后,我们将分析推广到多任务学习,说明在某些任务上进行预训练可能损害在新任务上的迁移性能,且其影响取决于任务之间的关系以及预训练阶段的时长。据我们所知,这项工作给出了第一个可解析处理的模型,用以阐明生物网络和人工网络中为何会出现关键学习期。

Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning

  • paper_url: http://arxiv.org/abs/2308.12219
  • repo_url: https://github.com/yegcjs/diffusionllm
  • paper_authors: Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Quanquan Gu
  • for: 这篇论文主要目标是探讨Diffusion Probabilistic Models是否能够解决通用的语言任务,并证明可以通过扩大数据、大小和任务来使Diffusion Models成为强大的语言学习模型。
  • methods: 该论文使用了Diffusion Models和大语言模型的混合,通过预训练和特定任务的精度适应来转换预训练模型为Diffusion Models,并通过自适应和指令精度适应来解锁其多样化的语言任务能力。
  • results: 实验显示,随着Diffusion Models的扩大,其表现在下游语言任务中得到了重大提升,而 instruciton finetuning 还能够启动零shot和几shot在Context learning 的能力,并且在进一步的和复杂的任务,如推理,表现出了扎实的能力。
    Abstract The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic models and the scalable capabilities of large language models. Despite their potential, it remains elusive whether diffusion language models can solve general language tasks comparable to their autoregressive counterparts. This paper demonstrates that scaling diffusion models w.r.t. data, sizes, and tasks can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining thanks to their intrinsic connections. We then reprogram pretrained masked language models into diffusion language models via diffusive adaptation, wherein task-specific finetuning and instruction finetuning are explored to unlock their versatility in solving general language tasks. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks. We further discover that instruction finetuning can elicit zero-shot and few-shot in-context learning abilities that help tackle many unseen tasks by following natural language instructions, and show promise in advanced and challenging abilities such as reasoning
    摘要 近期生成式 AI 的浪潮由扩散概率模型的生成能力与大语言模型的可扩展能力共同推动。尽管潜力巨大,扩散语言模型能否像自回归模型那样解决通用语言任务仍不明朗。本文表明,在数据、模型规模和任务三个维度上扩展扩散模型,可以使其成为强大的语言学习者。我们首先利用掩码语言建模预训练从海量数据中获取知识,再通过“扩散式适配”(diffusive adaptation)将预训练的掩码语言模型重新编程为扩散语言模型,并进一步探索任务特定微调与指令微调,以释放其解决通用语言任务的多样能力。实验显示,扩展扩散语言模型能够持续提升下游语言任务的表现;指令微调还能激发零样本与少样本的上下文学习能力,使模型能够按照自然语言指令处理许多未见过的任务,并在推理等更高阶、更具挑战性的能力上展现出潜力。

eess.IV - 2023-08-24

Learned Local Attention Maps for Synthesising Vessel Segmentations

  • paper_url: http://arxiv.org/abs/2308.12861
  • repo_url: None
  • paper_authors: Yash Deo, Rodrigo Bonazzola, Haoran Dou, Yan Xia, Tianyou Wei, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
  • for: 这个论文主要用于图像识别领域,具体来说是用于synthesizing主要血管分割图像,从而帮助更好地诊断血管疾病。
  • methods: 这个论文使用了encoder-decoder模型,并提出了一种两阶段多目标学习方法,使用了学习的本地注意力地图,以提高synthesizing的精度。
  • results: 测试结果表明,这个方法可以在只使用T2 MRI数据时,Synthesize主要血管分割图像,并达到了state-of-the-art segmentation网络的水平,包括transformer U-Net和nnU-net,同时使用的参数数量也相对较少。主要的qualitative difference在于synthetic血管分割图像的分辨率更高,特别是在后 Circulation中。
    Abstract Magnetic resonance angiography (MRA) is an imaging modality for visualising blood vessels. It is useful for several diagnostic applications and for assessing the risk of adverse events such as haemorrhagic stroke (resulting from the rupture of aneurysms in blood vessels). However, MRAs are not acquired routinely, hence, an approach to synthesise blood vessel segmentations from more routinely acquired MR contrasts such as T1 and T2, would be useful. We present an encoder-decoder model for synthesising segmentations of the main cerebral arteries in the circle of Willis (CoW) from only T2 MRI. We propose a two-phase multi-objective learning approach, which captures both global and local features. It uses learned local attention maps generated by dilating the segmentation labels, which forces the network to only extract information from the T2 MRI relevant to synthesising the CoW. Our synthetic vessel segmentations generated from only T2 MRI achieved a mean Dice score of $0.79 \pm 0.03$ in testing, compared to state-of-the-art segmentation networks such as transformer U-Net ($0.71 \pm 0.04$) and nnU-net($0.68 \pm 0.05$), while using only a fraction of the parameters. The main qualitative difference between our synthetic vessel segmentations and the comparative models was in the sharper resolution of the CoW vessel segments, especially in the posterior circulation.
    摘要 磁共振血管成像(MRA)是一种可视化血管的成像方式,可用于多种诊断,并用于评估不良事件(例如由动脉瘤破裂引起的出血性卒中)的风险。然而,MRA 并非常规采集的序列,因此如果能从更常规采集的 MR 对比度(如 T1 和 T2)中合成血管分割,将非常有用。我们提出一种编码器-解码器模型,仅使用 T2 MRI 即可合成 Willis 环(CoW)主要脑动脉的分割。我们提出一种两阶段多目标学习方法,能同时捕捉全局与局部特征:它使用由膨胀分割标签生成的学习型局部注意力图,迫使网络只从 T2 MRI 中提取与合成 CoW 相关的信息。仅由 T2 MRI 生成的合成血管分割在测试中取得了 0.79 ± 0.03 的平均 Dice 分数,优于 transformer U-Net(0.71 ± 0.04)和 nnU-net(0.68 ± 0.05)等最先进的分割网络,而所用参数仅为它们的一小部分。与对比模型相比,我们的合成血管分割在定性上的主要差异在于 CoW 血管段的分辨率更清晰,尤其是在后循环。

Achromatic imaging systems with flat lenses enabled by deep learning

  • paper_url: http://arxiv.org/abs/2308.12776
  • repo_url: None
  • paper_authors: Roy Maman, Eitan Mualem, Noa Mazurski, Jacob Engelberg, Uriel Levy
  • for: 这篇论文旨在解决平面透镜(如衍射透镜和超构透镜)固有的强色差问题,使其在多色光或环境光照明下也能实现高质量成像。
  • methods: 该方法基于用所研制的平面透镜拍摄并构建一个新的彩色户外图像数据集,再利用该数据集训练深度学习模型来校正色差。
  • results: 借助这种方法,作者在整个可见光谱范围内实现了高质量成像,并在 PSNR 和 SSIM 等定量指标上取得了优异的结果。
    Abstract Motivated by their great potential to reduce the size, cost and weight, flat lenses, a category that includes diffractive lenses and metalenses, are rapidly emerging as key components with the potential to replace the traditional refractive optical elements in modern optical systems. Yet, the inherently strong chromatic aberration of these flat lenses is significantly impairing their performance in systems based on polychromatic illumination or passive ambient light illumination, stalling their widespread implementation. Hereby, we provide a promising solution and demonstrate high quality imaging based on flat lenses over the entire visible spectrum. Our approach is based on creating a novel dataset of color outdoor images taken with our flat lens and using this dataset to train a deep-learning model for chromatic aberrations correction. Based on this approach we show unprecedented imaging results not only in terms of qualitative measures but also in the quantitative terms of the PSNR and SSIM scores of the reconstructed images. The results pave the way for the implementation of flat lenses in advanced polychromatic imaging systems.
    摘要 平面透镜(包括衍射透镜和超构透镜)因其在减小体积、成本和重量方面的巨大潜力,正迅速成为有望取代现代光学系统中传统折射光学元件的关键器件。然而,这类平面透镜固有的强色差严重削弱了其在多色照明或被动环境光照明系统中的性能,阻碍了其广泛应用。本文给出了一个有前景的解决方案,并展示了基于平面透镜、覆盖整个可见光谱的高质量成像。我们的方法是:用平面透镜拍摄并构建一个新的彩色户外图像数据集,再利用该数据集训练一个用于色差校正的深度学习模型。基于这一方法,我们不仅在定性上,而且在重建图像的 PSNR 与 SSIM 等定量指标上都取得了前所未有的成像结果。这些结果为平面透镜在先进多色成像系统中的应用铺平了道路。

A Study of Age and Sex Bias in Multiple Instance Learning based Classification of Acute Myeloid Leukemia Subtypes

  • paper_url: http://arxiv.org/abs/2308.12675
  • repo_url: None
  • paper_authors: Ario Sadafi, Matthias Hehr, Nassir Navab, Carsten Marr
  • for: 这个研究旨在探讨急性白血病(AML)分型的精确分类是否受到年龄和性别的偏见影响,以便在临床决策和患者照顾中做出更加可靠和公平的结果。
  • methods: 本研究使用多例学习(MIL)建筑来探讨AML分型的可能存在的年龄和性别偏见影响。具体来说,我们在不同的性别协调和年龄层级上训练多个MIL模型,并评估这些模型在不同的测试集中的表现。
  • results: 我们发现AML分型的性别和年龄偏见对模型的表现有 statistically significant 的影响。具体来说,女性患者更容易受到性别偏见的影响,而certain age groups, such as patients with 72 to 86 years of age with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by an age bias present in the training data。确保训练数据的包容性是为获得可靠和公平的结果,最终对多元化患者人口具有帮助。
    Abstract Accurate classification of Acute Myeloid Leukemia (AML) subtypes is crucial for clinical decision-making and patient care. In this study, we investigate the potential presence of age and sex bias in AML subtype classification using Multiple Instance Learning (MIL) architectures. To that end, we train multiple MIL models using different levels of sex imbalance in the training set and excluding certain age groups. To assess the sex bias, we evaluate the performance of the models on male and female test sets. For age bias, models are tested against underrepresented age groups in the training data. We find a significant effect of sex and age bias on the performance of the model for AML subtype classification. Specifically, we observe that females are more likely to be affected by sex imbalance dataset and certain age groups, such as patients with 72 to 86 years of age with the RUNX1::RUNX1T1 genetic subtype, are significantly affected by an age bias present in the training data. Ensuring inclusivity in the training data is thus essential for generating reliable and equitable outcomes in AML genetic subtype classification, ultimately benefiting diverse patient populations.
    摘要 急性髓系白血病(AML)亚型的准确分类对临床决策和患者照护至关重要。本研究利用多示例学习(MIL)架构,考察 AML 亚型分类中可能存在的年龄与性别偏差。为此,我们在训练集中设置不同程度的性别不平衡、并剔除某些年龄组,训练多个 MIL 模型;随后分别在男性和女性测试集上评估模型以考察性别偏差,并在训练数据中代表性不足的年龄组上测试模型以考察年龄偏差。我们发现性别与年龄偏差对 AML 亚型分类模型的性能有显著影响:女性患者更容易受到训练集性别不平衡的影响,而某些年龄组——例如携带 RUNX1::RUNX1T1 基因亚型、年龄在 72 至 86 岁之间的患者——会显著受到训练数据中年龄偏差的影响。因此,确保训练数据的包容性对于得到可靠且公平的 AML 基因亚型分类结果至关重要,最终惠及多样化的患者群体。

SCP: Spherical-Coordinate-based Learned Point Cloud Compression

  • paper_url: http://arxiv.org/abs/2308.12535
  • repo_url: https://github.com/luoao-kddi/SCP
  • paper_authors: Ao Luo, Linxin Song, Keisuke Nonaka, Kyohei Unno, Heming Sun, Masayuki Goto, Jiro Katto
  • for: 本研究的目的是提出一种基于球面坐标的学习型点云压缩方法,利用旋转 LiDAR 点云中大量的圆形结构和方位角不变性特征,以提高压缩性能。
  • methods: 该方法将点云表示在球面坐标系中,并为 SCP 提出一种多层 Octree,以减小球面坐标 Octree 中远距离区域的重建误差;该方法与具体模型无关,可应用于多种学习型点云压缩技术。
  • results: 实验结果表明,SCP 超越了此前的最先进方法;在 point-to-point PSNR BD-Rate 指标上最多改进了 29.14%。
    Abstract In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, the spinning LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to leverage the aforementioned features fully. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
    摘要 近年来,学习点云压缩任务得到了更多的关注。一种重要的点云类型是旋转LiDAR点云,由旋转LiDAR在车辆上生成。这个过程会生成许多圆形和方位角度的不变性特征在点云中。然而,这两个特征在前一代方法中得到了相对较少的注意。在本文中,我们介绍了一种模型无关的方法called Spherical-Coordinate-based learned Point cloud compression (SCP),旨在充分利用上述特征。此外,我们还提议了一种多层Octree来 mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree。SCP具有优秀的通用性,使其适用于多种学习点云压缩技术。实验结果表明,SCP比前一代方法提高了29.14%的点到点PSNRBD率。

FFEINR: Flow Feature-Enhanced Implicit Neural Representation for Spatio-temporal Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.12508
  • repo_url: None
  • paper_authors: Chenyue Jiao, Chongke Bi, Lu Yang
  • for: 提高流场数据的空间和时间分辨率
  • methods: 基于卷积神经网络和征特增强的匿名表示
  • results: 与三线 interpolate方法相比,FFEINR得到了显著更好的结果
    Abstract Large-scale numerical simulations are capable of generating data up to terabytes or even petabytes. As a promising method of data reduction, super-resolution (SR) has been widely studied in the scientific visualization community. However, most of them are based on deep convolutional neural networks (CNNs) or generative adversarial networks (GANs) and the scale factor needs to be determined before constructing the network. As a result, a single training session only supports a fixed factor and has poor generalization ability. To address these problems, this paper proposes a Feature-Enhanced Implicit Neural Representation (FFEINR) for spatio-temporal super-resolution of flow field data. It can take full advantage of the implicit neural representation in terms of model structure and sampling resolution. The neural representation is based on a fully connected network with periodic activation functions, which enables us to obtain lightweight models. The learned continuous representation can decode the low-resolution flow field input data to arbitrary spatial and temporal resolutions, allowing for flexible upsampling. The training process of FFEINR is facilitated by introducing feature enhancements for the input layer, which complements the contextual information of the flow field.To demonstrate the effectiveness of the proposed method, a series of experiments are conducted on different datasets by setting different hyperparameters. The results show that FFEINR achieves significantly better results than the trilinear interpolation method.
    摘要 大规模数值模拟能够产生高达 TB 甚至 PB 量级的数据。作为一种有前景的数据约简手段,超分辨率(SR)在科学可视化领域得到了广泛研究。然而,现有方法大多基于深度卷积神经网络(CNN)或生成对抗网络(GAN),且需要在构建网络之前确定缩放因子,因此一次训练只能支持固定的缩放因子,泛化能力较差。针对这些问题,本文提出一种特征增强的隐式神经表示(FFEINR),用于流场数据的时空超分辨率。它能够在模型结构与采样分辨率两方面充分发挥隐式神经表示的优势:该神经表示基于带周期激活函数的全连接网络,从而得到轻量级模型;学习到的连续表示可以把低分辨率的流场输入解码到任意的空间与时间分辨率,实现灵活的上采样。FFEINR 的训练过程还通过在输入层引入特征增强来补充流场的上下文信息。为验证所提方法的有效性,我们在不同数据集上设置不同超参数进行了一系列实验,结果表明 FFEINR 显著优于三线性插值方法。
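FFEINR's implicit neural representation is a fully connected network with periodic activations that maps continuous space-time coordinates to flow values, so it can be queried at arbitrary resolution. The NumPy sketch below shows a forward pass of such a sine-activated (SIREN-style) MLP with random weights only; the feature-enhancement module and the training procedure are omitted, and the ω₀ = 30 factor follows the usual SIREN convention rather than the paper.

```python
import numpy as np

def sine_mlp_forward(coords, weights, omega0=30.0):
    """Forward pass of a small sine-activated (SIREN-style) MLP.

    coords:  (N, in_dim) continuous (x, y, t) query coordinates.
    weights: list of (W, b) pairs; sine activation on every hidden layer.
    """
    h = coords
    for W, b in weights[:-1]:
        h = np.sin(omega0 * (h @ W + b))
    W, b = weights[-1]
    return h @ W + b                      # linear output layer

rng = np.random.default_rng(0)
dims = [3, 64, 64, 2]                     # (x, y, t) -> 2-D velocity
weights = [(rng.uniform(-1, 1, (i, o)) / i, np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

# Because the input is continuous, the same network can be queried on a
# coarse or a fine grid, i.e. at arbitrary upsampling factors.
coarse = np.stack(np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8),
                              [0.5], indexing="ij"), axis=-1).reshape(-1, 3)
fine = np.stack(np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64),
                            [0.5], indexing="ij"), axis=-1).reshape(-1, 3)
print(sine_mlp_forward(coarse, weights).shape, sine_mlp_forward(fine, weights).shape)
```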

MOFA: A Model Simplification Roadmap for Image Restoration on Mobile Devices

  • paper_url: http://arxiv.org/abs/2308.12494
  • repo_url: None
  • paper_authors: Xiangyu Chen, Ruiwen Zhen, Shuai Li, Xiaotian Li, Guanghui Wang
  • for: 图像恢复(image restoration)技术是为了恢复高质量图像,并通过深度学习技术进行了显著的进步。这种技术在移动设备上进行了广泛应用,如手机摄影。由于移动设备的资源有限,如内存约束和运行时要求,因此在部署时模型的效率变得非常重要。然而,大多数之前的工作都是专注于分析单个模块的效率,并优化它们。这篇论文检查了不同层次的效率。
  • methods: 我们提出了一个可以用于降低图像恢复模型的部署时间和参数数量的路线图。该路线图首先增加了模型容量,通过在FLOPs不敏感层添加更多参数。然后,它应用了部分深度卷积并与解 Coupling 采样/下采样层结合,以加速模型速度。
  • results: 我们的方法在多个图像恢复数据集上进行了广泛的实验,并显示了减少运行时间13%,减少参数数量23%,同时提高PSNR和SSIM指标。源代码可以在 \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA} 上获取。
    Abstract Image restoration aims to restore high-quality images from degraded counterparts and has seen significant advancements through deep learning techniques. The technique has been widely applied to mobile devices for tasks such as mobile photography. Given the resource limitations on mobile devices, such as memory constraints and runtime requirements, the efficiency of models during deployment becomes paramount. Nevertheless, most previous works have primarily concentrated on analyzing the efficiency of single modules and improving them individually. This paper examines the efficiency across different layers. We propose a roadmap that can be applied to further accelerate image restoration models prior to deployment while simultaneously increasing PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). The roadmap first increases the model capacity by adding more parameters to partial convolutions on FLOPs non-sensitive layers. Then, it applies partial depthwise convolution coupled with decoupling upsampling/downsampling layers to accelerate the model speed. Extensive experiments demonstrate that our approach decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets. Source Code of our method is available at \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA}.
    摘要 Image 修复目标是将高质量图像从受损版本中恢复,深度学习技术在此领域已经取得了显著进步。这种技术已经广泛应用于移动设备上进行手机摄影等任务。由于移动设备的资源有限,如内存约束和运行时间要求,在部署过程中模型的效率变得非常重要。然而,前一些工作主要集中在分析单个模块的效率,并将其改进。这篇论文则研究模型各层之间的效率,并提出一种可以在部署前加速图像修复模型的路线图。该路线图首先增加了模型容量,通过在FLOPs不敏感层添加更多参数进行partial convolution。然后,它应用partial depthwise convolution和分离upsampling/downsampling层来加速模型速度。经验表明,我们的方法可以降低运行时间13%,并降低参数数量23%,同时提高PSNR和SSIM在多个图像修复数据集上。我们的代码可以在 \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA} 中找到。

InverseSR: 3D Brain MRI Super-Resolution Using a Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.12465
  • repo_url: https://github.com/biomedai-ucsc/inversesr
  • paper_authors: Jueqi Wang, Jacob Levman, Walter Hugo Lopez Pinaya, Petru-Daniel Tudosiu, M. Jorge Cardoso, Razvan Marinescu
  • for: 提高低分辨率MRI扫描的分辨率
  • methods: 利用Latent Diffusion Model(LDM)作为生成模型,并提出两种新策略:InverseSR(LDM)和InverseSR(Decoder)
  • results: 在 IXI 数据集上验证了该方法,并证明 LDM 提供的先验可以用于 MRI 重建。
    Abstract High-resolution (HR) MRI scans obtained from research-grade medical centers provide precise information about imaged tissues. However, routine clinical MRI scans are typically in low-resolution (LR) and vary greatly in contrast and spatial resolution due to the adjustments of the scanning parameters to the local needs of the medical center. End-to-end deep learning methods for MRI super-resolution (SR) have been proposed, but they require re-training each time there is a shift in the input distribution. To address this issue, we propose a novel approach that leverages a state-of-the-art 3D brain generative model, the latent diffusion model (LDM) trained on UK BioBank, to increase the resolution of clinical MRI scans. The LDM acts as a generative prior, which has the ability to capture the prior distribution of 3D T1-weighted brain MRI. Based on the architecture of the brain LDM, we find that different methods are suitable for different settings of MRI SR, and thus propose two novel strategies: 1) for SR with more sparsity, we invert through both the decoder of the LDM and also through a deterministic Denoising Diffusion Implicit Models (DDIM), an approach we will call InverseSR(LDM); 2) for SR with less sparsity, we invert only through the LDM decoder, an approach we will call InverseSR(Decoder). These two approaches search different latent spaces in the LDM model to find the optimal latent code to map the given LR MRI into HR. The training process of the generative model is independent of the MRI under-sampling process, ensuring the generalization of our method to many MRI SR problems with different input measurements. We validate our method on over 100 brain T1w MRIs from the IXI dataset. Our method can demonstrate that powerful priors given by LDM can be used for MRI reconstruction.
    摘要 来自研究级医疗中心的高分辨率(HR)MRI 扫描能够提供成像组织的精确信息;然而,常规临床 MRI 扫描通常是低分辨率(LR)的,且由于扫描参数需要根据各医疗中心的本地需求调整,其对比度和空间分辨率差异很大。目前已有端到端深度学习的 MRI 超分辨率(SR)方法被提出,但每当输入分布发生变化时都需要重新训练。针对这一问题,我们提出一种新方法:利用在 UK BioBank 上训练的最先进 3D 脑部生成模型——潜在扩散模型(LDM)——来提升临床 MRI 扫描的分辨率。LDM 作为生成式先验,能够刻画 3D T1 加权脑部 MRI 的先验分布。基于脑部 LDM 的结构,我们发现不同的 MRI SR 设置适合不同的方法,因此提出两种新策略:1)对于稀疏性更强的 SR,我们同时通过 LDM 的解码器和确定性的去噪扩散隐式模型(DDIM)进行反演,称为 InverseSR(LDM);2)对于稀疏性较弱的 SR,我们仅通过 LDM 解码器进行反演,称为 InverseSR(Decoder)。这两种方法在 LDM 模型的不同潜空间中搜索最优潜编码,以将给定的 LR MRI 映射为 HR。生成模型的训练过程与 MRI 欠采样过程无关,这保证了我们的方法能够推广到具有不同输入测量的多种 MRI SR 问题。我们在 IXI 数据集的 100 多例脑部 T1w MRI 上验证了该方法,结果表明 LDM 提供的强大先验可以用于 MRI 重建。

HNAS-reg: hierarchical neural architecture search for deformable medical image registration

  • paper_url: http://arxiv.org/abs/2308.12440
  • repo_url: None
  • paper_authors: Jiong Wu, Yong Fan
  • for: 这篇论文是为了找到适合对于弹性医疗影像注册的深度学习模型。
  • methods: 这篇论文使用了一个层次 NAS 框架(HNAS-Reg),包括了涉及 convolutional 操作的搜索和网络架构搜索,以找到最佳的网络架构。实际上,这个框架使用了一种叫做 partial channel strategy,以降低计算负载和内存限制,但不失去优化质量。
  • results: 实验结果显示,提案的方法可以建立一个具有改善医疗影像注册精度和减少模型大小的深度学习模型,比过去的 estado-of-the-art 医疗影像注册方法更好。
    Abstract Convolutional neural networks (CNNs) have been widely used to build deep learning models for medical image registration, but manually designed network architectures are not necessarily optimal. This paper presents a hierarchical NAS framework (HNAS-Reg), consisting of both convolutional operation search and network topology search, to identify the optimal network architecture for deformable medical image registration. To mitigate the computational overhead and memory constraints, a partial channel strategy is utilized without losing optimization quality. Experiments on three datasets, consisting of 636 T1-weighted magnetic resonance images (MRIs), have demonstrated that the proposal method can build a deep learning model with improved image registration accuracy and reduced model size, compared with state-of-the-art image registration approaches, including one representative traditional approach and two unsupervised learning-based approaches.
    摘要 卷积神经网络(CNN)已被广泛用于构建可变形医学图像配准的深度学习模型,但人工设计的网络架构未必是最优的。本文提出一种层级化神经架构搜索框架(HNAS-Reg),同时进行卷积操作搜索与网络拓扑搜索,以寻找可变形医学图像配准的最优网络架构。为降低计算开销与内存限制,本文采用一种部分通道(partial channel)策略,而不损失优化质量。在由 636 幅 T1 加权磁共振图像(MRI)组成的三个数据集上的实验表明,与包括一种代表性传统方法和两种无监督学习方法在内的最先进图像配准方法相比,所提方法能够构建配准精度更高、模型规模更小的深度学习模型。

Reframing the Brain Age Prediction Problem to a More Interpretable and Quantitative Approach

  • paper_url: http://arxiv.org/abs/2308.12416
  • repo_url: None
  • paper_authors: Neha Gianchandani, Mahsa Dibaji, Mariana Bento, Ethan MacDonald, Roberto Souza
  • for: 这项研究旨在使用深度学习模型从核磁共振成像(MR)图像中预测脑年龄,并提供更有 interpretable 的结果。
  • methods: 该研究使用了图像到图像回归模型,对每个脑细胞 voxel 进行预测,并与全局预测模型和其相对的精度地图进行比较。
  • results: 结果表明, voxel-wise 预测模型比全局预测模型更加有 interpretability,因为它们提供了脑年龄层次结构信息,并且具有量化的优势。
    Abstract Deep learning models have achieved state-of-the-art results in estimating brain age, which is an important brain health biomarker, from magnetic resonance (MR) images. However, most of these models only provide a global age prediction, and rely on techniques, such as saliency maps to interpret their results. These saliency maps highlight regions in the input image that were significant for the model's predictions, but they are hard to be interpreted, and saliency map values are not directly comparable across different samples. In this work, we reframe the age prediction problem from MR images to an image-to-image regression problem where we estimate the brain age for each brain voxel in MR images. We compare voxel-wise age prediction models against global age prediction models and their corresponding saliency maps. The results indicate that voxel-wise age prediction models are more interpretable, since they provide spatial information about the brain aging process, and they benefit from being quantitative.
    摘要 深度学习模型在从磁共振(MR)图像估计脑年龄(一项重要的脑健康生物标志物)方面已达到最先进的水平。然而,这些模型大多只给出一个全局的年龄预测,并依赖显著性图等技术来解释结果:显著性图突出显示输入图像中对模型预测重要的区域,但难以解释,且其数值在不同样本之间不可直接比较。在这项工作中,我们将基于 MR 图像的脑年龄预测问题重新表述为图像到图像的回归问题,为 MR 图像中的每个脑体素估计脑年龄。我们将体素级年龄预测模型与全局年龄预测模型及其对应的显著性图进行比较。结果表明,体素级年龄预测模型更具可解释性,因为它们提供了脑老化过程的空间信息,并且具有定量化的优势。

SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12231
  • repo_url: https://github.com/xq141839/sppnet
  • paper_authors: Qing Xu, Wenwei Kuang, Zeyu Zhang, Xueyao Bao, Haoran Chen, Wenting Duan
  • for: 这个论文主要针对的是核体像素分割问题,具体来说是为了提高核体像素分割的效率和准确性。
  • methods: 该论文提出了一种基于单点提示网络的核体像素分割方法,称为SPPNet。该方法使用轻量级视力变换器取代原始图像编码器,并在平行的卷积块中提取低级别 semantic信息以补做性能下降。此外,该方法还提出了基于 Gaussian kernel 的新的点抽象方法。
  • results: 根据 MoNuSeg-2018 数据集的测试结果,SPPNet 比现有的 U-Shape 架构表现出更好的性能,并且在训练过程中更快地收敛。相比之下,SPPNet 比 segment anything 模型更快,只需要一个点提取,而不需要多个点提取和训练。这些成果表明,SPPNet 是一种高效、可靠的核体像素分割方法。
    Abstract Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model exists two major issues for cell segmentation: (1) the image encoder of the segment anything model involves a large number of parameters. Retraining or even fine-tuning the model still requires expensive computational resources. (2) in point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The result demonstrated that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 parameters and computational cost. Particularly, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at https://github.com/xq141839/SPPNet.

eess.AS - 2023-08-24

Towards Automated Animal Density Estimation with Acoustic Spatial Capture-Recapture

Abstract:
Passive acoustic monitoring can be an effective way of monitoring wildlife populations that are acoustically active but difficult to survey visually. Digital recorders allow surveyors to gather large volumes of data at low cost, but identifying target species vocalisations in these data is non-trivial. Machine learning (ML) methods are often used to do the identification. They can process large volumes of data quickly, but they do not detect all vocalisations and they do generate some false positives (vocalisations that are not from the target species). Existing wildlife abundance survey methods have been designed specifically to deal with the first of these mistakes, but current methods of dealing with false positives are not well-developed. They do not take account of features of individual vocalisations, some of which are more likely to be false positives than others. We propose three methods for acoustic spatial capture-recapture inference that integrate individual-level measures of confidence from ML vocalisation identification into the likelihood and hence integrate ML uncertainty into inference. The methods include a mixture model in which species identity is a latent variable. We test the methods by simulation and find that in a scenario based on acoustic data from Hainan gibbons, in which ignoring false positives results in 17% positive bias, our methods give negligible bias and coverage probabilities that are close to the nominal 95% level.
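
A hedged sketch of how a per-detection ML confidence can enter a two-component likelihood (target-species call vs false positive), which is the spirit of the mixture model with species identity as a latent variable; the functional forms, weights, and names below are illustrative placeholders, not the paper's estimator:

```python
import numpy as np

def detection_likelihood(conf, f_true, f_false):
    """Mixture likelihood for one detected vocalisation: the ML confidence
    'conf' weights the target-species component against the false-positive
    component. f_true / f_false stand in for the spatial capture-recapture
    and false-positive likelihood terms, respectively."""
    return conf * f_true + (1.0 - conf) * f_false

confs   = np.array([0.95, 0.40, 0.70])      # per-detection classifier confidences
f_true  = np.array([0.020, 0.015, 0.030])   # hypothetical SCR likelihood terms
f_false = np.array([0.001, 0.010, 0.002])   # hypothetical false-positive terms
log_lik = np.log(detection_likelihood(confs, f_true, f_false)).sum()
print(log_lik)
```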


Sparks of Large Audio Models: A Survey and Outlook

  • paper_url: http://arxiv.org/abs/2308.12792
  • repo_url: None
  • paper_authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Heriberto Cuayáhuitl, Björn W. Schuller

Abstract:
This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources–from human voices to musical instruments and environmental sounds–poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, recently these Foundational Audio Models, like SeamlessM4T, have started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will consistently update the relevant repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.


WavMark: Watermarking for Audio Generation

Abstract:
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker’s voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
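
The headline metric above is the Bit Error Rate over the 32-bit payload; a minimal sketch of how BER is typically computed between embedded and decoded bit strings follows (the encode/decode steps themselves are placeholders, not WavMark's API):

```python
import numpy as np

def bit_error_rate(embedded_bits, decoded_bits):
    """Fraction of payload bits that differ after an attack / decode cycle."""
    embedded_bits = np.asarray(embedded_bits, dtype=np.uint8)
    decoded_bits = np.asarray(decoded_bits, dtype=np.uint8)
    return float(np.mean(embedded_bits != decoded_bits))

payload = np.random.default_rng(0).integers(0, 2, size=32)   # 32-bit watermark
decoded = payload.copy()
decoded[[3, 17]] ^= 1                                         # pretend two bits were flipped by an attack
print(f"BER = {bit_error_rate(payload, decoded):.4f}")        # 2/32 = 0.0625
```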


Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

Abstract:
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is generated in this study, comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Presenting as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that there are significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
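
A hedged sketch of the classification setup described above (temporal audio features, 10-fold cross-validation, Extreme Gradient Boosting); the feature matrix here is random and the hyperparameters are illustrative, not the tuned values from the study:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier

# Placeholder feature matrix: one row per 1-second window of temporal audio
# features, label 1 = real speech, 0 = AI-generated (real data not included here).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 26))
y = rng.integers(0, 2, size=400)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```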


Whombat: An open-source annotation tool for machine learning development in bioacoustics

  • paper_url: http://arxiv.org/abs/2308.12688
  • repo_url: None
  • paper_authors: Santiago Martinez Balvanera, Oisin Mac Aodha, Matthew J. Weldy, Holly Pringle, Ella Browning, Kate E. Jones

Abstract:

  1. Automated analysis of bioacoustic recordings using machine learning (ML) methods has the potential to greatly scale biodiversity monitoring efforts. The use of ML for high-stakes applications, such as conservation research, demands a data-centric approach with a focus on utilizing carefully annotated and curated evaluation and training data that is relevant and representative. Creating annotated datasets of sound recordings presents a number of challenges, such as managing large collections of recordings with associated metadata, developing flexible annotation tools that can accommodate the diverse range of vocalization profiles of different organisms, and addressing the scarcity of expert annotators.
  2. We present Whombat, a user-friendly, browser-based interface for managing audio recordings and annotation projects, with several visualization, exploration, and annotation tools. It enables users to quickly annotate, review, and share annotations, as well as visualize and evaluate a set of machine learning predictions on a dataset. The tool facilitates an iterative workflow where user annotations and machine learning predictions feed back to enhance model performance and annotation quality.
  3. We demonstrate the flexibility of Whombat by showcasing two distinct use cases: a project aimed at enhancing automated UK bat call identification at the Bat Conservation Trust (BCT), and a collaborative effort among USDA Forest Service and Oregon State University researchers exploring bioacoustic applications and extending automated avian classification models in the Pacific Northwest, USA.
  4. Whombat is a flexible tool that can effectively address the challenges of annotation for bioacoustic research. It can be used for individual and collaborative work, hosted on a shared server or accessed remotely, or run on a personal computer without the need for coding skills.

Naaloss: Rethinking the objective of speech enhancement

Abstract:
Reducing noise interference is crucial for automatic speech recognition (ASR) in a real-world scenario. However, most single-channel speech enhancement (SE) generates “processing artifacts” that negatively affect ASR performance. Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. NAaLoss considers the loss of estimation, de-artifact, and noise ignorance, enabling the learned SE to individually model speech, artifacts, and noise. We examine two SE models (simple/advanced) learned with NAaLoss under various input scenarios (clean/noisy) using two configurations of the ASR system (with/without noise robustness). Experiments reveal that NAaLoss significantly improves the ASR performance of most setups while preserving the quality of SE toward perception and intelligibility. Furthermore, we visualize artifacts through waveforms and spectrograms, and explain their impact on ASR.
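
The abstract names three loss terms (estimation, de-artifact, noise ignorance) but not their exact forms; below is a hedged sketch of a composite objective with that structure, where every term, weight, and tensor name is an assumption for illustration only:

```python
import torch
import torch.nn.functional as F

def naa_style_loss(est_speech, clean, noisy, est_noise, w=(1.0, 0.5, 0.5)):
    """Illustrative three-term objective in the spirit of NAaLoss:
    (1) estimation: enhanced output should match clean speech;
    (2) de-artifact: the residual (noisy - enhanced - estimated noise) should be
        small, discouraging processing artifacts;
    (3) noise ignorance: the estimated noise path is pushed toward the true noise.
    The terms and weights are assumptions, not the paper's definitions."""
    true_noise = noisy - clean
    l_est = F.l1_loss(est_speech, clean)
    l_art = F.l1_loss(noisy - est_speech - est_noise, torch.zeros_like(noisy))
    l_noi = F.l1_loss(est_noise, true_noise)
    return w[0] * l_est + w[1] * l_art + w[2] * l_noi

clean = torch.randn(2, 16000)
noisy = clean + 0.1 * torch.randn(2, 16000)
print(naa_style_loss(noisy * 0.9, clean, noisy, est_noise=0.1 * torch.randn(2, 16000)))
```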


Emotion-Aligned Contrastive Learning Between Images and Music

Abstract:
Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. Most approaches aim to match general music semantics to the input queries, while only a few focus on affective qualities. We address the task of retrieving emotionally-relevant music from image queries by proposing a framework for learning an affective alignment between images and music audio. Our approach focuses on learning an emotion-aligned joint embedding space between images and music. This joint embedding space is learned via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We directly evaluate the joint embeddings with cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. In addition, we investigate the generalizability of the learned music embeddings with automatic music tagging as a downstream task. Our experiments show that our approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications.
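
A compact sketch of an emotion-supervised cross-modal contrastive loss in the spirit of the adapted SupCon loss described above: image anchors are contrasted against music candidates, and positives are pairs sharing an emotion label. The paper's exact adaptation may differ; names and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def cross_modal_supcon(img_emb, mus_emb, labels, tau=0.1):
    """Image anchors against music candidates: positives are music clips
    with the same emotion label (a simplified cross-modal SupCon variant)."""
    img = F.normalize(img_emb, dim=1)
    mus = F.normalize(mus_emb, dim=1)
    logits = img @ mus.t() / tau                                 # (N_img, N_mus) similarities
    pos = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()   # same-emotion mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

imgs = torch.randn(8, 128)                          # hypothetical image embeddings
musi = torch.randn(8, 128)                          # hypothetical music embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])     # emotion classes of the paired batch
print(cross_modal_supcon(imgs, musi, labels))
```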


Exploiting Time-Frequency Conformers for Music Audio Enhancement

Abstract:
With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.


Hybrid noise shaping for audio coding using perfectly overlapped window

Abstract:
In recent years, audio coding technology has been standardized based on several frameworks that incorporate linear predictive coding (LPC). However, coding the transient signal using frequency-domain LP residual signals remains a challenge. To address this, temporal noise shaping (TNS) can be adapted, although it cannot be effectively operated since the estimated temporal envelope in the modified discrete cosine transform (MDCT) domain is accompanied by the time-domain aliasing (TDA) terms. In this study, we propose the modulated complex lapped transform-based coding framework integrated with transform coded excitation (TCX) and complex LPC-based TNS (CTNS). Our approach uses a 50% overlap window and switching scheme for the CTNS to improve the coding efficiency. Additionally, an adaptive calculation of the target bits for the sub-bands using the frequency envelope information based on the quantized LPC coefficients is proposed. To minimize the quantization mismatch between both modes, an integrated quantization for real and complex values and a TDA augmentation method that compensates for the artificially generated TDA components during switching operations are proposed. The proposed coding framework shows a superior performance in both objective metrics and subjective listening tests, thereby demonstrating its suitability for low bit-rate audio coding.


UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023

Abstract:
This report describes the UNISOUND submission for Track1 and Track2 of VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity score by a Consistency Measure Factor (CMF). CMF brings a huge performance boost in this challenge. Our final system is a fusion of six models and achieves the first place in Track 1 and second place in Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855 and the EER is 1.5880%.


MultiPA: a multi-task speech pronunciation assessment system for a closed and open response scenario

Abstract:
The design of automatic speech pronunciation assessment can be categorized into closed and open response scenarios, each with strengths and limitations. A system with the ability to function in both scenarios can cater to diverse learning needs and provide a more precise and holistic assessment of pronunciation skills. In this study, we propose a Multi-task Pronunciation Assessment model called MultiPA. MultiPA provides an alternative to Kaldi-based systems in that it has simpler format requirements and better compatibility with other neural network models. Compared with previous open response systems, MultiPA provides a wider range of evaluations, encompassing assessments at both the sentence and word-level. Our experimental results show that MultiPA achieves comparable performance when working in closed response scenarios and maintains more robust performance when directly used for open responses.


Attention-Based Acoustic Feature Fusion Network for Depression Detection

Abstract:
Depression, a common mental disorder, significantly influences individuals and imposes considerable societal impacts. The complexity and heterogeneity of the disorder necessitate prompt and effective detection, which nonetheless, poses a difficult challenge. This situation highlights an urgent requirement for improved detection methods. Exploiting auditory data through advanced machine learning paradigms presents promising research directions. Yet, existing techniques mainly rely on single-dimensional feature models, potentially neglecting the abundance of information hidden in various speech characteristics. To rectify this, we present the novel Attention-Based Acoustic Feature Fusion Network (ABAFnet) for depression detection. ABAFnet combines four different acoustic features into a comprehensive deep learning model, thereby effectively integrating and blending multi-tiered features. We present a novel weight adjustment module for late fusion that boosts performance by efficaciously synthesizing these features. The effectiveness of our approach is confirmed via extensive validation on two clinical speech databases, CNRAC and CS-NRAC, thereby outperforming previous methods in depression detection and subtype classification. Further in-depth analysis confirms the key role of each feature and highlights the importance of MFCC-related features in speech-based depression detection.


An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Abstract:
Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art. Foley artists create sounds with common objects (e.g., boxing gloves, broken glass) in time with video as it is playing to generate captivating audio tracks. In this work, we aim to develop a deep-learning based framework that does much the same - observes video in its natural sequence and generates realistic audio to accompany it. Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs (e.g., Wavenet conditioned on text). We explore several different model architectures to accomplish this task that process both previously-generated audio and video context. These include deep-fusion CNN, dilated Wavenet CNN with visual context, and transformer-based architectures. We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively, but failing to generate more nuanced waveforms.


AdVerb: Visually Guided Audio Dereverberation

  • paper_url: http://arxiv.org/abs/2308.12370
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha

Abstract:
We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.


LCANets++: Robust Audio Classification using Multi-layer Neural Networks with Lateral Competition

Abstract:
Audio classification aims at recognizing audio signals, including speech commands or sound events. However, current audio classifiers are susceptible to perturbations and adversarial attacks. In addition, real-world audio classification tasks often suffer from limited labeled data. To help bridge these gaps, previous work developed neuro-inspired convolutional neural networks (CNNs) with sparse coding via the Locally Competitive Algorithm (LCA) in the first layer (i.e., LCANets) for computer vision. LCANets learn in a combination of supervised and unsupervised learning, reducing dependency on labeled samples. Motivated by the fact that auditory cortex is also sparse, we extend LCANets to audio recognition tasks and introduce LCANets++, which are CNNs that perform sparse coding in multiple layers via LCA. We demonstrate that LCANets++ are more robust than standard CNNs and LCANets against perturbations, e.g., background noise, as well as black-box and white-box attacks, e.g., evasion and fast gradient sign (FGSM) attacks.

cs.SD - 2023-08-23

Analysis of XLS-R for Speech Quality Assessment

  • paper_url: http://arxiv.org/abs/2308.12077
  • repo_url: https://github.com/lcn-kul/xls-r-analysis-sqa
  • paper_authors: Bastiaan Tamm, Rik Vandenberghe, Hugo Van hamme
  • for: This paper performs an in-depth analysis of automated speech quality assessment to improve the prediction of perceived speech quality.
  • methods: The authors use pre-trained wav2vec-based XLS-R embeddings and analyze the features extracted from each layer and from each model size.
  • results: Both lower-level and higher-level features turn out to be strong, with each capturing different characteristics; the authors further study how sensitive these features are to different levels of corruption and whether fusing the two feature depths improves MOS prediction.
    Abstract In online conferencing applications, estimating the perceived quality of an audio signal is crucial to ensure high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments in the form of the mean opinion score (MOS) metric. However, such an approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance for the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R and also for each size of the model (300M, 1B, 2B parameters). Surprisingly, we find two optimal regions for feature extraction: one in the lower-level features and one in the high-level features. Next, we investigate the reason for the two distinct optima. We hypothesize that the lower-level features capture characteristics of noise and room acoustics, whereas the high-level features focus on speech content and intelligibility. To investigate this, we analyze the sensitivity of the MOS predictions with respect to different levels of corruption in each category. Afterwards, we try fusing the two optimal feature depths to determine if they contain complementary information for MOS prediction. Finally, we compare the performance of the proposed models and assess the generalizability of the models on unseen datasets.
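
The layer-wise analysis above can be reproduced with the public XLS-R checkpoints by requesting all hidden states from the Hugging Face transformers library; the sketch below (300M checkpoint, simple time-average pooling per layer) reflects the library's API rather than the paper's exact pipeline:

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-xls-r-300m"              # 300M variant; 1B/2B analogous
extractor = AutoFeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

wav = torch.randn(16000)                           # 1 s of (placeholder) 16 kHz audio
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the pre-transformer feature embedding, followed by one
# entry per transformer layer; pooling over time gives one vector per layer.
layer_embeddings = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)   # e.g. 25 entries of dim 1024
```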

Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2308.11980
  • repo_url: https://github.com/yuanbo2020/hgrl
  • paper_authors: Yuanbo Hou, Siyang Song, Cheng Luo, Andrew Mitchell, Qiaoqiao Ren, Weicheng Xie, Jian Kang, Wenwu Wang, Dick Botteldooren
  • for: This paper explores the relationship between objective audio events and subjective annoyance ratings in a soundscape.
  • methods: The paper proposes a novel hierarchical graph representation learning (HGRL) approach that links objective audio events with subjective annoyance ratings of the soundscape perceived by humans.
  • results: The proposed HGRL approach successfully integrates objective audio events with subjective annoyance ratings for audio event classification (AEC) and annoyance rating prediction (ARP) tasks, and coordinates the relations between coarse-grained and fine-grained audio event information with the subjective annoyance ratings.
    Abstract Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality that may impact humans' listening mood for the environment, e.g. annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show the proposed HGRL successfully integrates AE with AR for AEC and ARP tasks, while coordinating the relations between cAE and fAE and further aligning the two different grains of AE information with the AR.

CED: Consistent ensemble distillation for audio tagging

  • paper_url: http://arxiv.org/abs/2308.11957
  • repo_url: https://github.com/richermans/ced
  • paper_authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang
  • for: Improving the performance of audio tagging while reducing model size.
  • methods: Augmentation and knowledge distillation (KD), combined in a simple training framework called consistent teaching (CED).
  • results: A 10M-parameter model achieving 49.0 mean average precision (mAP) on the AudioSet (AS) benchmark; pretrained models and code are available on GitHub.
    Abstract Augmentation and knowledge distillation (KD) are well-established techniques employed in the realm of audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature, meaning that only the stored logits are used to optimize the student model, requiring just 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
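
A hedged sketch of the distillation step described above: the student sees exactly the same augmented clip as the teachers did (augmentation parameters and ensemble logits are loaded from disk), and is trained label-free against those stored logits. The loss choice below (KL divergence on temperature-softened logits) is a common KD formulation and an assumption here:

```python
import torch
import torch.nn.functional as F

def consistent_kd_loss(student_logits, stored_teacher_logits, T=1.0):
    """Label-free distillation against logits that were cached for the exact
    same augmented audio clip (the 'consistent teaching' idea)."""
    p_teacher = F.softmax(stored_teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Placeholder tensors standing in for one cached batch (527 AudioSet classes).
stored_logits = torch.randn(16, 527)        # read from disk alongside augmentation params
student_logits = torch.randn(16, 527, requires_grad=True)
loss = consistent_kd_loss(student_logits, stored_logits, T=1.0)
loss.backward()
print(loss.item())
```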

Example-Based Framework for Perceptually Guided Audio Texture Generation

  • paper_url: http://arxiv.org/abs/2308.11859
  • repo_url: None
  • paper_authors: Purnima Kamath, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara
  • for: Controlling the semantic attributes of generated audio textures.
  • methods: Using a few automatically synthesized examples to infer guidance vectors for user-defined semantic attributes in the latent space of a generative model.
  • results: The method finds perceptually relevant and deterministic guidance vectors for controllable generation of both discrete and continuous textures, and can be applied to other tasks such as selective semantic attribute transfer.
    Abstract Generative models for synthesizing audio textures explicitly encode controllability by conditioning the model with labelled data. While datasets for audio textures can be easily recorded in-the-wild, semantically labeling them is expensive, time-consuming, and prone to errors due to human annotator subjectivity. Thus, to control generation, there is a need to automatically infer user-defined perceptual factors of variation in the latent space of a generative model while modelling unlabeled textures. In this paper, we propose an example-based framework to determine vectors to guide texture generation based on user-defined semantic attributes. By synthesizing a few synthetic examples to indicate the presence or absence of a semantic attribute, we can infer the guidance vectors in the latent space of a generative model to control that attribute during generation. Our results show that our method is capable of finding perceptually relevant and deterministic guidance vectors for controllable generation for both discrete as well as continuous textures. Furthermore, we demonstrate the application of this method to other tasks such as selective semantic attribute transfer.
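
A hedged sketch of the example-based idea: encode a few synthetic examples with and without the attribute, and take (for instance) the difference of their mean latent codes as the guidance direction. The encoder is a stub here and the paper's actual inference procedure may be more involved:

```python
import numpy as np

def guidance_vector(latents_with_attr, latents_without_attr):
    """Direction in latent space that moves generations toward the attribute:
    mean(positive examples) - mean(negative examples), unit-normalised."""
    v = latents_with_attr.mean(axis=0) - latents_without_attr.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

rng = np.random.default_rng(0)
z_pos = rng.normal(0.5, 1.0, size=(5, 64))   # latents of examples showing the attribute
z_neg = rng.normal(0.0, 1.0, size=(5, 64))   # latents of examples lacking it
v = guidance_vector(z_pos, z_neg)

z = rng.normal(size=64)                       # some latent code to be edited
z_edited = z + 1.5 * v                        # push the generation toward the attribute
print(v[:4], z_edited[:4])
```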

  • paper_url: http://arxiv.org/abs/2308.11773
  • repo_url: None
  • paper_authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium
  • For: The paper aims to identify specific speech topics that may indicate depression severity, using natural language processing on smartphone-collected speech recordings.
  • Methods: The study uses the Whisper tool and the BERTopic model to analyze 3919 smartphone-collected speech recordings from 265 participants, and compares behavioral and linguistic characteristics across the identified topics to elucidate their associations with depression.
  • Results: Six specific speech topics (No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework) are associated with high depression severity, and topic shifts correlate with changes in depression severity over time. The BERTopic model also proved effective on a smaller, similar dataset.
    Abstract Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. So, natural language processing has been employed on social media to predict depression, but limitations remain-lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.
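
The topic-modelling step above uses the open-source BERTopic library on transcripts (produced here by Whisper); a minimal usage sketch on placeholder transcripts follows, with the model settings and texts purely illustrative:

```python
from bertopic import BERTopic

# Placeholder transcripts standing in for Whisper output of speech recordings.
transcripts = [
    "I could not sleep again last night and felt exhausted",
    "my therapist suggested a new breathing exercise",
    "I have so much coursework due this week",
    "got a haircut today, it felt nice to go outside",
] * 10  # BERTopic needs a reasonable number of documents to form topics

topic_model = BERTopic(min_topic_size=5, verbose=False)
topics, probs = topic_model.fit_transform(transcripts)

print(topic_model.get_topic_info().head())   # topic ids, sizes, representative words
```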

cs.CV - 2023-08-23

CIParsing: Unifying Causality Properties into Multiple Human Parsing

  • paper_url: http://arxiv.org/abs/2308.12218
  • repo_url: None
  • paper_authors: Xiaojia Chen, Xuanhan Wang, Lianli Gao, Beitao Chen, Jingkuan Song, HenTao Shen
  • for: Improving the generalization and robustness of multiple human parsing (MHP) models so that they better handle varied image styles and external interventions.
  • methods: Building on causal principles, the paper proposes a parsing paradigm called CIParsing, which assumes an input image is generated by a mix of causal factors (characteristics of body parts) and non-causal factors (external context), where only the causal factors drive the parsing process. The parser constructs latent representations of the causal factors and learns to enforce the causal properties on them.
  • results: Extensive experiments on two widely used benchmarks show that CIParsing improves the generalization ability and robustness of MHP models across different image styles and external interventions.
    Abstract Existing methods of multiple human parsing (MHP) apply statistical models to acquire underlying associations between images and labeled body parts. However, acquired associations often contain many spurious correlations that degrade model generalization, leading statistical models to be vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causality-inspired parsing paradigm termed CIParsing, which follows fundamental causal principles involving two causal properties for human parsing (i.e., the causal diversity and the causal invariance). Specifically, we assume that an input image is constructed by a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former ones cause the generation process of human parsing. Since causal/non-causal factors are unobservable, a human parser in the proposed CIParsing is required to construct latent representations of causal factors and learn to enforce the representations to satisfy the causal properties. In this way, the human parser is able to rely on causal factors w.r.t. relevant evidence rather than non-causal factors w.r.t. spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, the CIParsing is designed in a plug-and-play fashion and can be integrated into any existing MHP models. Extensive experiments conducted on two widely used benchmarks demonstrate the effectiveness and generalizability of our method.

SG-Former: Self-guided Transformer with Evolving Token Reallocation

  • paper_url: http://arxiv.org/abs/2308.12216
  • repo_url: None
  • paper_authors: Sucheng Ren, Xingyi Yang, Songhua Liu, Xinchao Wang
  • for: Reducing the computation cost and parameter count of vision transformers so they can handle large feature maps.
  • methods: A new model, the Self-guided Transformer (SG-Former), which uses a self-estimated significance map to achieve effective global self-attention with adaptive fine granularity.
  • results: State-of-the-art results on ImageNet-1K, COCO, and ADE20K, surpassing the Swin Transformer by +1.3% / +2.7 mAP / +3 mIoU, with lower computation cost and fewer parameters.
    Abstract Vision Transformer has demonstrated impressive success across various vision tasks. However, its heavy computation cost, which grows quadratically with respect to the token sequence length, largely limits its power in handling large feature maps. To alleviate the computation cost, previous works rely on either fine-grained self-attentions restricted to local small regions, or global self-attentions but to shorten the sequence length resulting in coarse granularity. In this paper, we propose a novel model, termed as Self-guided Transformer (SG-Former), towards effective global self-attention with adaptive fine granularity. At the heart of our approach is to utilize a significance map, which is estimated through hybrid-scale self-attention and evolves itself during training, to reallocate tokens based on the significance of each region. Intuitively, we assign more tokens to the salient regions for achieving fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields. The proposed SG-Former achieves performance superior to state of the art: our base size model achieves 84.7% Top-1 accuracy on ImageNet-1K, 51.2mAP bbAP on CoCo, 52.7mIoU on ADE20K surpassing the Swin Transformer by +1.3% / +2.7 mAP / +3 mIoU, with lower computation costs and fewer parameters. The code is available at https://github.com/OliverRensu/SG-Former

Towards Real-Time Analysis of Broadcast Badminton Videos

  • paper_url: http://arxiv.org/abs/2308.12199
  • repo_url: None
  • paper_authors: Nitin Nilesh, Tushar Sharma, Anurag Ghosh, C. V. Jawahar
  • for: Real-time analysis of player movements in broadcast badminton matches.
  • methods: The method uses only the visual feed of a live broadcast match, removes replays and redundant segments, and tracks both players to extract their movement trajectories.
  • results: The method computes, in real time, the on-court distance covered by each player, their average speed, and a heatmap of the court areas they cover.
    Abstract Analysis of player movements is a crucial subset of sports analysis. Existing player movement analysis methods use recorded videos after the match is over. In this work, we propose an end-to-end framework for player movement analysis for badminton matches on live broadcast match videos. We only use the visual inputs from the match and, unlike other approaches which use multi-modal sensor data, our approach uses only visual cues. We propose a method to calculate the on-court distance covered by both the players from the video feed of a live broadcast badminton match. To perform this analysis, we focus on the gameplay by removing replays and other redundant parts of the broadcast match. We then perform player tracking to identify and track the movements of both players in each frame. Finally, we calculate the distance covered by each player and the average speed with which they move on the court. We further show a heatmap of the areas covered by the player on the court which is useful for analyzing the gameplay of the player. Our proposed framework was successfully used to analyze live broadcast matches in real-time during the Premier Badminton League 2019 (PBL 2019), with commentators and broadcasters appreciating the utility.

Sign Language Translation with Iterative Prototype

  • paper_url: http://arxiv.org/abs/2308.12191
  • repo_url: None
  • paper_authors: Huijie Yao, Wengang Zhou, Hao Feng, Hezhen Hu, Hao Zhou, Houqiang Li
  • for: Proposing a simple yet effective framework for sign language translation (SLT), called IP-SLT.
  • methods: The framework adopts a recurrent structure that iteratively refines the semantic representation (prototype) of the input sign language video.
  • results: Experiments show that IP-SLT produces more fluent and appropriate translations and can be integrated into existing SLT systems with acceptable overhead.
    Abstract This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). Our IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video via an iterative refinement manner. Our idea mimics the behavior of human reading, where a sentence can be digested repeatedly, till reaching accurate understanding. Technically, IP-SLT consists of feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype based on the visual feature extracted by the feature extraction module. Then, the iterative refinement module leverages the cross-attention mechanism to polish the previous prototype by aggregating it with the original video feature. Through repeated refinement, the prototype finally converges to a more stable and accurate state, leading to a fluent and appropriate translation. In addition, to leverage the sequential dependence of prototypes, we further propose an iterative distillation loss to compress the knowledge of the final iteration into previous ones. As the autoregressive decoding process is executed only once in inference, our IP-SLT is ready to improve various SLT systems with acceptable overhead. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of the IP-SLT.

Tumor-Centered Patching for Enhanced Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12168
  • repo_url: None
  • paper_authors: Mutyyba Asghar, Ahmad Raza Shahid, Akhtar Jamil, Kiran Aftab, Syed Ather Enam
  • for: Improving segmentation accuracy in medical image analysis to strengthen computer-aided diagnosis systems.
  • methods: A tumor-centered patching approach that aligns patches with the tumor's anatomical context, addressing class imbalance and boundary deficiencies while improving feature-extraction accuracy and reducing computational load.
  • results: The approach mitigates class imbalance, with segmentation scores of 0.78, 0.76, and 0.71 for whole, core, and enhancing tumors, respectively, using a lightweight simple U-Net, showing its potential for improving computer-aided diagnosis systems.
    Abstract The realm of medical image diagnosis has advanced significantly with the integration of computer-aided diagnosis and surgical systems. However, challenges persist, particularly in achieving precise image segmentation. While deep learning techniques show potential, obstacles like limited resources, slow convergence, and class imbalance impede their effectiveness. Traditional patch-based methods, though common, struggle to capture intricate tumor boundaries and often lead to redundant samples, compromising computational efficiency and feature quality. To tackle these issues, this research introduces an innovative approach centered on the tumor itself for patch-based image analysis. This novel tumor-centered patching method aims to address the class imbalance and boundary deficiencies, enabling focused and accurate tumor segmentation. By aligning patches with the tumor's anatomical context, this technique enhances feature extraction accuracy and reduces computational load. Experimental results demonstrate improved class imbalance, with segmentation scores of 0.78, 0.76, and 0.71 for whole, core, and enhancing tumors, respectively using a lightweight simple U-Net. This approach shows potential for enhancing medical image segmentation and improving computer-aided diagnosis systems.
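
A hedged sketch of extracting a patch centred on the tumour rather than on a regular grid: the centre comes from the label mask's centroid, and the patch is cropped with padding around it. Sizes and helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def tumor_centered_patch(volume, tumor_mask, size=64):
    """Crop a (size x size) patch centred on the tumour centroid of a 2D slice,
    padding the image first so the crop never runs out of bounds."""
    ys, xs = np.nonzero(tumor_mask)
    cy, cx = int(ys.mean()), int(xs.mean())           # tumour centroid
    half = size // 2
    padded = np.pad(volume, half, mode="constant")
    cy, cx = cy + half, cx + half                     # shift centroid into padded coords
    return padded[cy - half:cy + half, cx - half:cx + half]

slice_img = np.random.rand(240, 240)                  # placeholder MR slice
mask = np.zeros((240, 240), dtype=np.uint8)
mask[100:130, 150:175] = 1                            # placeholder tumour label
patch = tumor_centered_patch(slice_img, mask, size=64)
print(patch.shape)                                    # (64, 64)
```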

NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos

  • paper_url: http://arxiv.org/abs/2308.12163
  • repo_url: None
  • paper_authors: Ziyu Yang, Sucheng Ren, Zongwei Wu, Nanxuan Zhao, Junle Wang, Jing Qin, Shengfeng He
  • for: Understanding how humans perceive non-photorealistic videos via eye fixations, in support of media production, artistic design, and game user experience.
  • methods: The study introduces NPF-200, a large-scale multi-modal dataset of non-photorealistic videos with eye fixations, compares several state-of-the-art methods on it, and proposes NPSNet, a frequency-aware multi-modal saliency detection model combining visual and audio features.
  • results: NPSNet achieves state-of-the-art performance, and the analysis reveals the strengths and weaknesses of multi-modal network design and multi-domain training, suggesting directions for future work.
    Abstract Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (i.e., saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and videos are of high-quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating the state-of-the-art performance of our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. Our dataset and code can be found at https://github.com/Yangziyu/NPF200.

Mesh Conflation of Oblique Photogrammetric Models using Virtual Cameras and Truncated Signed Distance Field

  • paper_url: http://arxiv.org/abs/2308.12139
  • repo_url: None
  • paper_authors: Shuang Song, Rongjun Qin
  • for: Conflating multiple oblique photogrammetric meshes into a single, seamless model for high-resolution site modeling.
  • methods: A virtual panoramic camera field is used to build Truncated Signed Distance Fields (TSDF), and the truncated bounds of the meshes are adaptively leveraged to fuse them into one accurate full-3D model.
  • results: Improved accuracy and efficiency compared with traditional conflation methods on drone-based 3D meshes.
    Abstract Conflating/stitching 2.5D raster digital surface models (DSM) into a large one has been a running practice in geoscience applications, however, conflating full-3D mesh models, such as those from oblique photogrammetry, is extremely challenging. In this letter, we propose a novel approach to address this challenge by conflating multiple full-3D oblique photogrammetric models into a single, and seamless mesh for high-resolution site modeling. Given two or more individually collected and created photogrammetric meshes, we first propose to create a virtual camera field (with a panoramic field of view) to incubate virtual spaces represented by Truncated Signed Distance Field (TSDF), an implicit volumetric field friendly for linear 3D fusion; then we adaptively leverage the truncated bound of meshes in TSDF to conflate them into a single and accurate full 3D site model. With drone-based 3D meshes, we show that our approach significantly improves upon traditional methods for model conflations, to drive new potentials to create excessively large and accurate full 3D mesh models in support of geoscience and environmental applications.
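
The conflation above rests on TSDF fusion: each virtual camera contributes a truncated signed distance and a weight per voxel, and contributions are averaged incrementally. Below is a generic per-voxel TSDF update for illustration, not the paper's full pipeline (which also handles mesh truncation bounds and virtual panoramic cameras):

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc=0.5, w_obs=1.0):
    """Incrementally fuse one observation of signed distances into a TSDF grid
    using the usual weighted running average."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)                  # truncate and normalise
    valid = sdf_obs > -trunc                                  # ignore voxels far behind the surface
    new_w = weight + w_obs * valid
    tsdf = np.where(valid, (tsdf * weight + d * w_obs) / np.maximum(new_w, 1e-8), tsdf)
    return tsdf, new_w

grid = np.zeros((32, 32, 32)); w = np.zeros_like(grid)
for _ in range(3):                                            # three synthetic "views"
    obs = np.random.default_rng().normal(0.2, 0.3, size=grid.shape)
    grid, w = tsdf_update(grid, w, obs)
print(grid.mean(), w.max())
```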

Select-and-Combine (SAC): A Novel Multi-Stereo Depth Fusion Algorithm for Point Cloud Generation via Efficient Local Markov Netlets

  • paper_url: http://arxiv.org/abs/2308.12138
  • repo_url: None
  • paper_authors: Mostafa Elhashash, Rongjun Qin
  • for: Proposing a new depth-fusion method that handles the noise and inconsistencies arising from multi-view stereo matching.
  • methods: Point-level fusion is modeled with local Markov Netlets, which select the best depth map per point and combine the selections into a single clean point cloud.
  • results: Compared with existing methods, the approach improves the F1 score (accounting for both accuracy and completeness) by 2.07% and produces point clouds that are 18% less redundant while remaining highly accurate.
    Abstract Many practical systems for image-based surface reconstruction employ a stereo/multi-stereo paradigm, due to its ability to scale for large scenes and its ease of implementation for out-of-core operations. In this process, multiple and abundant depth maps from stereo matching must be combined and fused into a single, consistent, and clean point cloud. However, the noises and outliers caused by stereo matching and the heterogeneous geometric errors of the poses present a challenge for existing fusion algorithms, since they mostly assume Gaussian errors and predict fused results based on data from local spatial neighborhoods, which may inherit uncertainties from multiple depths resulting in lowered accuracy. In this paper, we propose a novel depth fusion paradigm, that instead of numerically fusing points from multiple depth maps, selects the best depth map per point, and combines them into a single and clean point cloud. This paradigm, called select-and-combine (SAC), is achieved through modeling the point level fusion using local Markov Netlets, a micro-network over point across neighboring views for depth/view selection, followed by a Netlets collapse process for point combination. The Markov Netlets are optimized such that they can inherently leverage spatial consistencies among depth maps of neighboring views, thus they can address errors beyond Gaussian ones. Our experiment results show that our approach outperforms existing depth fusion approaches by increasing the F1 score that considers both accuracy and completeness by 2.07% compared to the best existing method. Finally, our approach generates clearer point clouds that are 18% less redundant than the inputs before fusion, while retaining higher accuracy.

Lite-HRNet Plus: Fast and Accurate Facial Landmark Detection

  • paper_url: http://arxiv.org/abs/2308.12133
  • repo_url: None
  • paper_authors: Sota Kato, Kazuhiro Hotta, Yuhki Hatakeyama, Yoshinori Konishi
  • for: Proposing a new architecture that addresses the computational complexity of existing facial landmark detection methods.
  • methods: A novel fusion block based on channel attention and a new output module that uses multi-resolution feature maps with less computational intensity.
  • results: Experiments on two facial landmark datasets show higher accuracy than conventional methods and state-of-the-art performance within a computational budget in the range of 10M FLOPs.
    Abstract Facial landmark detection is an essential technology for driver status tracking and has been in demand for real-time estimations. As a landmark coordinate prediction, heatmap-based methods are known to achieve a high accuracy, and Lite-HRNet can achieve a fast estimation. However, with Lite-HRNet, the problem of a heavy computational cost of the fusion block, which connects feature maps with different resolutions, has yet to be solved. In addition, the strong output module used in HRNetV2 is not applied to Lite-HRNet. Given these problems, we propose a novel architecture called Lite-HRNet Plus. Lite-HRNet Plus achieves two improvements: a novel fusion block based on a channel attention and a novel output module with less computational intensity using multi-resolution feature maps. Through experiments conducted on two facial landmark datasets, we confirmed that Lite-HRNet Plus further improved the accuracy in comparison with conventional methods, and achieved a state-of-the-art accuracy with a computational complexity with the range of 10M FLOPs.
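
Heatmap-based landmark detection, as used by Lite-HRNet-style models, predicts one heatmap per landmark and decodes coordinates from each map's peak; a minimal argmax decoder follows (sub-pixel refinement, which real systems typically add, is omitted, and the sizes are illustrative):

```python
import numpy as np

def decode_heatmaps(heatmaps, img_h, img_w):
    """Convert (K, H, W) landmark heatmaps to K (x, y) image coordinates by
    taking each map's peak and rescaling to the input image resolution."""
    K, H, W = heatmaps.shape
    coords = []
    for k in range(K):
        idx = np.argmax(heatmaps[k])
        y, x = divmod(idx, W)
        coords.append((x * img_w / W, y * img_h / H))
    return np.array(coords)

maps = np.random.rand(68, 64, 64)          # 68 facial landmarks, 64x64 heatmaps
print(decode_heatmaps(maps, img_h=256, img_w=256).shape)   # (68, 2)
```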

The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures

  • paper_url: http://arxiv.org/abs/2308.12116
  • repo_url: https://github.com/ChristophReich1996/TYC-Dataset
  • paper_authors: Christoph Reich, Tim Prangemeier, Heinz Koeppl
  • for: Providing a large-scale dataset for instance-level segmentation and tracking of cells in microstructures, to better understand cell semantics and motions.
  • methods: High-resolution brightfield microscopy imaging of trapped yeast cells in microstructured environments, with dense instance annotations for segmentation and tracking.
  • results: The release of 105 densely annotated high-resolution brightfield microscopy images with about 19k instance masks, plus 261 curated video clips comprising 1293 high-resolution images for unsupervised study of cell motion and morphology.
    Abstract Segmenting cells and tracking their motion over time is a common task in biomedical applications. However, predicting accurate instance-wise segmentation and cell motions from microscopy imagery remains a challenging task. Using microstructured environments for analyzing single cells in a constant flow of media adds additional complexity. While large-scale labeled microscopy datasets are available, we are not aware of any large-scale dataset, including both cells and microstructures. In this paper, we introduce the trapped yeast cell (TYC) dataset, a novel dataset for understanding instance-level semantics and motions of cells in microstructures. We release $105$ dense annotated high-resolution brightfield microscopy images, including about $19$k instance masks. We also release $261$ curated video clips composed of $1293$ high-resolution microscopy images to facilitate unsupervised understanding of cell motions and morphology. TYC offers ten times more instance annotations than the previously largest dataset, including cells and microstructures. Our effort also exceeds previous attempts in terms of microstructure variability, resolution, complexity, and capturing device (microscopy) variability. We facilitate a unified comparison on our novel dataset by introducing a standardized evaluation strategy. TYC and evaluation code are publicly available under CC BY 4.0 license.
    摘要 分 segmenting cells和跟踪其运动过时是生物医学应用中常见任务。然而,准确预测单个单元Instance-wise segmentation和细胞运动从微scopic imaging中remains a challenging task。使用微结构环境 для单元细胞分析在流动媒体中添加了额外复杂性。虽然大规模标注微scopic imaging数据集是可用的,但我们没有发现任何包括细胞和微结构的大规模数据集。在这篇论文中,我们介绍了被陷 yeast cell(TYC)数据集,一个新的数据集用于理解单元级别 semantics和细胞运动。我们发布了105个高分辨率炸einstein imaging microscopy图像,包括约19000个实例涂抹。我们还发布了261个CURATED video clip,包括1293个高分辨率微scopic imaging图像,以便无监督地理解细胞运动和形态。TYC提供了前一次最大的实例标注数量,包括细胞和微结构。我们的努力也超过了之前的尝试,териms of microstructure variability, resolution, complexity, and capturing device(微scopic imaging)variability。我们提出了一种标准化评估策略,以便对我们的新数据集进行一致的比较。TYC和评估代码在CC BY 4.0license下公开可用。

Less is More – Towards parsimonious multi-task models using structured sparsity

  • paper_url: http://arxiv.org/abs/2308.12114
  • repo_url: None
  • paper_authors: Richa Upadhyay, Ronald Phlypo, Rajkumar Saini, Marcus Liwicki
  • for: To incorporate structured group sparsity into a Multi-Task Learning (MTL) framework, developing simpler, more interpretable models that still address multiple tasks effectively.
  • methods: Channel-wise l1/l2 group sparsity is applied in the shared layers; this removes extraneous channels (groups) and also penalizes the weights, improving the learning of all tasks (a code sketch of the penalty follows this entry).
  • results: Compared with single-task and dense multi-task baselines, the group-sparse models maintain comparable or better performance while reducing memory footprint, computation, and prediction time; the study also examines how the degree of sparsification affects both performance and group sparsity.
    Abstract Group sparsity in Machine Learning (ML) encourages simpler, more interpretable models with fewer active parameter groups. This work aims to incorporate structured group sparsity into the shared parameters of a Multi-Task Learning (MTL) framework, to develop parsimonious models that can effectively address multiple tasks with fewer parameters while maintaining comparable or superior performance to a dense model. Sparsifying the model during training helps decrease the model's memory footprint, computation requirements, and prediction time during inference. We use channel-wise l1/l2 group sparsity in the shared layers of the Convolutional Neural Network (CNN). This approach not only facilitates the elimination of extraneous groups (channels) but also imposes a penalty on the weights, thereby enhancing the learning of all tasks. We compare the outcomes of single-task and multi-task experiments under group sparsity on two publicly available MTL datasets, NYU-v2 and CelebAMask-HQ. We also investigate how changing the sparsification degree impacts both the performance of the model and the sparsity of groups.
    摘要 (简化中文)机器学习(ML)中的组 sparse 激励 simpler, more interpretable 的模型,具有 fewer active parameter groups。这项工作想要将结构化的组 sparse integrate into 多任务学习(MTL)框架中,以开发更具有简洁性和可解释性的模型,能够更好地处理多个任务,而且具有更少的参数。在训练过程中,将模型简化可以降低模型的内存占用量、计算需求和预测时间。我们在 convolutional neural network(CNN) 中使用 channel-wise L1/L2 组 sparse,这种方法不仅可以消除无用的组(通道),还对权重进行罚款,从而提高所有任务的学习。我们在 NYU-v2 和 CelebAMask-HQ 两个公开available MTL 数据集上进行单任务和多任务实验,并 investigate 如何更改简化学习度强度对模型性能和组简洁度产生的影响。
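
As an illustration of the channel-wise l1/l2 penalty described above, the sketch below computes a group-lasso regularizer over the output channels of shared convolution layers and adds it to the task losses. The penalty weight and the choice of which layers count as shared are assumptions for illustration, not values from the paper.

```python
# Minimal sketch of a channel-wise l1/l2 (group-lasso) penalty on shared conv layers;
# the penalty weight `lam` and which layers are treated as "shared" are assumptions.
import torch
import torch.nn as nn

def group_sparsity_penalty(shared_layers, lam=1e-4):
    """Channel-wise l1/l2 penalty: l2 norm per output channel, summed (l1) over channels."""
    penalty = 0.0
    for layer in shared_layers:
        if isinstance(layer, nn.Conv2d):
            w = layer.weight                                 # (out_ch, in_ch, k, k)
            per_channel = w.flatten(1).norm(p=2, dim=1)      # l2 norm of each output-channel group
            penalty = penalty + per_channel.sum()            # l1 over groups -> drives whole channels to zero
    return lam * penalty

# Toy usage: add the penalty to the summed task losses during MTL training.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
print(group_sparsity_penalty(backbone.modules()))
```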

Advancements in Point Cloud Data Augmentation for Deep Learning: A Survey

  • paper_url: http://arxiv.org/abs/2308.12113
  • repo_url: None
  • paper_authors: Qinfeng Zhu, Lei Fan, Ningxin Weng
  • for: This paper focuses on point cloud data augmentation methods for tasks such as detection, segmentation, and classification in computer vision.
  • methods: The paper surveys and discusses various point cloud data augmentation methods, categorizing them into a taxonomy framework, and evaluates their potentials and limitations.
  • results: The paper provides a comprehensive understanding of the current status of point cloud data augmentation and suggests possible future research directions, promoting the wider application and development of point cloud processing techniques.
    Abstract Point cloud has a wide range of applications in areas such as autonomous driving, mapping, navigation, scene reconstruction, and medical imaging. Due to its great potentials in these applications, point cloud processing has gained great attention in the field of computer vision. Among various point cloud processing techniques, deep learning (DL) has become one of the mainstream and effective methods for tasks such as detection, segmentation and classification. To reduce overfitting during training DL models and improve model performance especially when the amount and/or diversity of training data are limited, augmentation is often crucial. Although various point cloud data augmentation methods have been widely used in different point cloud processing tasks, there are currently no published systematic surveys or reviews of these methods. Therefore, this article surveys and discusses these methods and categorizes them into a taxonomy framework. Through the comprehensive evaluation and comparison of the augmentation methods, this article identifies their potentials and limitations and suggests possible future research directions. This work helps researchers gain a holistic understanding of the current status of point cloud data augmentation and promotes its wider application and development.
    摘要 点云处理具有广泛的应用领域,如自动驾驶、地图建模、导航、场景重建和医疗影像等。由于其在这些应用领域的潜力,点云处理技术在计算机视觉领域得到了广泛的关注。深度学习(DL)已成为点云处理中一种主流和有效的方法,用于任务如检测、分割和分类。在训练DL模型时,以避免过拟合,数据扩展是非常重要。现在,点云数据扩展方法已经广泛应用在不同的点云处理任务中,但是没有发表过系统性的报告或评论。因此,本文对这些方法进行了抽查和讨论,并将其分类到一个分类框架中。通过对各种扩展方法的全面评估和比较,本文可以承认和限制这些方法,并提出可能的未来研究方向。这些研究结果可以帮助研究人员更好地了解当前点云数据扩展的状况,并推动其更广泛的应用和发展。
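
For readers unfamiliar with the basic operations such a survey categorizes, the snippet below applies three common point cloud augmentations (random up-axis rotation, per-point jitter, global scaling). The parameter ranges are illustrative assumptions and are not taken from the survey.

```python
# A few basic point cloud augmentations of the kind surveyed above; the parameter
# ranges are illustrative assumptions, not values from the survey.
import numpy as np

def augment_point_cloud(points: np.ndarray, rng=np.random.default_rng()) -> np.ndarray:
    """points: (N, 3) array of xyz coordinates."""
    # 1) Random rotation about the z (up) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points @ rot.T
    # 2) Per-point Gaussian jitter, clipped to avoid outliers.
    out = out + np.clip(0.01 * rng.standard_normal(out.shape), -0.05, 0.05)
    # 3) Random global scaling.
    out = out * rng.uniform(0.8, 1.2)
    return out

cloud = np.random.rand(1024, 3).astype(np.float32)
print(augment_point_cloud(cloud).shape)  # (1024, 3)
```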

Generalized Continual Category Discovery

  • paper_url: http://arxiv.org/abs/2308.12112
  • repo_url: None
  • paper_authors: Daniel Marczak, Grzegorz Rypeść, Sebastian Cygert, Tomasz Trzciński, Bartłomiej Twardowski
  • for: To address the limitation of Continual Learning (CL) settings, in which an agent learns new labeled tasks without forgetting previous knowledge, a setup that does not match real-life scenarios where large amounts of unlabeled data from both novel and known classes are available.
  • methods: A new framework, Generalized Continual Category Discovery (GCCD), combines Generalized Category Discovery (GCD) with CL: tasks may contain unlabeled samples of both novel and known classes, which must be discovered with continual unsupervised learning.
  • results: Experiments show that existing CL methods fail to accumulate knowledge from tasks containing unlabeled novel-class samples, whereas the proposed method, which combines supervised and unsupervised signals with centroid adaptation, accumulates knowledge while mitigating forgetting and outperforms strong CL methods adapted to GCD.
    Abstract Most of Continual Learning (CL) methods push the limit of supervised learning settings, where an agent is expected to learn new labeled tasks and not forget previous knowledge. However, these settings are not well aligned with real-life scenarios, where a learning agent has access to a vast amount of unlabeled data encompassing both novel (entirely unlabeled) classes and examples from known classes. Drawing inspiration from Generalized Category Discovery (GCD), we introduce a novel framework that relaxes this assumption. Precisely, in any task, we allow for the existence of novel and known classes, and one must use continual version of unsupervised learning methods to discover them. We call this setting Generalized Continual Category Discovery (GCCD). It unifies CL and GCD, bridging the gap between synthetic benchmarks and real-life scenarios. With a series of experiments, we present that existing methods fail to accumulate knowledge from subsequent tasks in which unlabeled samples of novel classes are present. In light of these limitations, we propose a method that incorporates both supervised and unsupervised signals and mitigates the forgetting through the use of centroid adaptation. Our method surpasses strong CL methods adopted for GCD techniques and presents a superior representation learning performance.
    摘要 大多数持续学习(CL)方法在supervised learning设置下进行推广,其中一个agent需要学习新的标注任务而不忘记之前的知识。然而,这些设置并不适合实际生活中的情景,where a learning agent有大量未标注数据,包括新的类和已知类的示例。 drawing inspiration from Generalized Category Discovery(GCD),我们介绍了一个新的框架,允许在任务中存在新的和已知的类,并且使用 continual version of unsupervised learning方法来发现它们。我们称这种设置为Generalized Continual Category Discovery(GCCD)。它将CL和GCD融合, bridge the gap between synthetic benchmarks和实际生活中的情景。通过一系列实验,我们发现了现有方法在后续任务中处理未标注的新类样本时存在问题,即忘记之前的知识。为了解决这些限制,我们提出了一种方法,该方法将supervised和unsupervised信号相互衔接,并通过中心点修改来减轻忘记。我们的方法超越了强大的CL方法,并在表示学习性能方面表现出色。

Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2308.12111
  • repo_url: None
  • paper_authors: Chao Tian, Zikun Zhou, Yuqing Huang, Gaojun Li, Zhenyu He
  • for: To propose a new unregistered RGB-T pedestrian detection method that handles the practical problem of misaligned RGB-thermal image pairs.
  • methods: A cross-modality proposal-guided feature mining (CPFM) mechanism extracts pedestrian features from the RGB and thermal images and fuses them into precise detection results.
  • results: Experiments show that the method effectively handles misaligned RGB-T image pairs and improves the accuracy and robustness of pedestrian detection.
    Abstract RGB-Thermal (RGB-T) pedestrian detection aims to locate the pedestrians in RGB-T image pairs to exploit the complementation between the two modalities for improving detection robustness in extreme conditions. Most existing algorithms assume that the RGB-T image pairs are well registered, while in the real world they are not aligned ideally due to parallax or different field-of-view of the cameras. The pedestrians in misaligned image pairs may locate at different positions in two images, which results in two challenges: 1) how to achieve inter-modality complementation using spatially misaligned RGB-T pedestrian patches, and 2) how to recognize the unpaired pedestrians at the boundary. To deal with these issues, we propose a new paradigm for unregistered RGB-T pedestrian detection, which predicts two separate pedestrian locations in the RGB and thermal images, respectively. Specifically, we propose a cross-modality proposal-guided feature mining (CPFM) mechanism to extract the two precise fusion features for representing the pedestrian in the two modalities, even if the RGB-T image pair is unaligned. It enables us to effectively exploit the complementation between the two modalities. With the CPFM mechanism, we build a two-stream dense detector; it predicts the two pedestrian locations in the two modalities based on the corresponding fusion feature mined by the CPFM mechanism. Besides, we design a data augmentation method, named Homography, to simulate the discrepancy in scales and views between images. We also investigate two non-maximum suppression (NMS) methods for post-processing. Favorable experimental results demonstrate the effectiveness and robustness of our method in dealing with unregistered pedestrians with different shifts.
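
The sketch below illustrates one way to realize the Homography augmentation mentioned in the abstract: a random perspective warp applied to the thermal image to simulate the scale and view discrepancy of unregistered RGB-T pairs. The jitter magnitude and the OpenCV-based implementation are assumptions, not the authors' code.

```python
# Sketch of a random homography warp to simulate RGB-T misalignment (in the spirit of
# the "Homography" augmentation above); jitter magnitude and OpenCV usage are assumptions.
import cv2
import numpy as np

def random_homography_warp(img: np.ndarray, max_shift: float = 0.05, rng=np.random.default_rng()):
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    # Jitter each corner by up to max_shift of the image size.
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.float32([w, h])
    dst = (src + jitter).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, H, (w, h), flags=cv2.INTER_LINEAR)
    return warped, H  # keep H to transform the thermal boxes consistently

thermal = np.zeros((512, 640), dtype=np.uint8)
warped, H = random_homography_warp(thermal)
print(warped.shape, H.shape)  # (512, 640) (3, 3)
```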

DISGAN: Wavelet-informed Discriminator Guides GAN to MRI Super-resolution with Noise Cleaning

  • paper_url: http://arxiv.org/abs/2308.12084
  • repo_url: None
  • paper_authors: Qi Wang, Lucas Mahler, Julius Steiglechner, Florian Birk, Klaus Scheffler, Gabriele Lohmann
  • for: This paper addresses the fundamental challenges of MRI super-resolution (SR) and denoising.
  • methods: A single deep learning model addresses both tasks simultaneously, without requiring explicitly paired noisy and clean training images; a GAN guided by a frequency-informed discriminator uses the 3D Discrete Wavelet Transform as a frequency constraint (a sketch of the DWT+conv unit follows this entry).
  • results: The model achieves high-quality SR with simultaneous denoising, again without paired noisy/clean training data; its performance is evaluated on several MRI datasets, including the Human Connectome Project (HCP) and MRI from subjects with brain tumours and epilepsy.
    Abstract MRI super-resolution (SR) and denoising tasks are fundamental challenges in the field of deep learning, which have traditionally been treated as distinct tasks with separate paired training data. In this paper, we propose an innovative method that addresses both tasks simultaneously using a single deep learning model, eliminating the need for explicitly paired noisy and clean images during training. Our proposed model is primarily trained for SR, but also exhibits remarkable noise-cleaning capabilities in the super-resolved images. Instead of conventional approaches that introduce frequency-related operations into the generative process, our novel approach involves the use of a GAN model guided by a frequency-informed discriminator. To achieve this, we harness the power of the 3D Discrete Wavelet Transform (DWT) operation as a frequency constraint within the GAN framework for the SR task on magnetic resonance imaging (MRI) data. Specifically, our contributions include: 1) a 3D generator based on residual-in-residual connected blocks; 2) the integration of the 3D DWT with $1\times 1$ convolution into a DWT+conv unit within a 3D Unet for the discriminator; 3) the use of the trained model for high-quality image SR, accompanied by an intrinsic denoising process. We dub the model "Denoising Induced Super-resolution GAN (DISGAN)" due to its dual effects of SR image generation and simultaneous denoising. Departing from the traditional approach of training SR and denoising tasks as separate models, our proposed DISGAN is trained only on the SR task, but also achieves exceptional performance in denoising. The model is trained on 3D MRI data from dozens of subjects from the Human Connectome Project (HCP) and further evaluated on previously unseen MRI data from subjects with brain tumours and epilepsy to assess its denoising and SR performance.
    摘要 MRI超分解(SR)和噪声除除(denoising)是深度学习领域的基本挑战,传统上被视为独立的两个任务,需要分别培训独立的深度学习模型。在这篇论文中,我们提出了一种创新的方法, simultaneous addressing both tasks using a single deep learning model, eliminating the need for explicitly paired noisy and clean images during training. Our proposed model is primarily trained for SR, but also exhibits remarkable noise-cleaning capabilities in the super-resolved images. Instead of conventional approaches that introduce frequency-related operations into the generative process, our novel approach involves the use of a GAN模型 guided by a frequency-informed discriminator. To achieve this, we harness the power of the 3D Discrete Wavelet Transform (DWT) operation as a frequency constraint within the GAN framework for the SR task on magnetic resonance imaging (MRI) data. Specifically, our contributions include: 1) a 3D generator based on residual-in-residual connected blocks; 2) the integration of the 3D DWT with $1\times 1$ convolution into a DWT+conv unit within a 3D Unet for the discriminator; 3) the use of the trained model for high-quality image SR, accompanied by an intrinsic denoising process. We dub the model "Denoising Induced Super-resolution GAN (DISGAN)" due to its dual effects of SR image generation and simultaneous denoising. Departing from the traditional approach of training SR and denoising tasks as separate models, our proposed DISGAN is trained only on the SR task, but also achieves exceptional performance in denoising. The model is trained on 3D MRI data from dozens of subjects from the Human Connectome Project (HCP) and further evaluated on previously unseen MRI data from subjects with brain tumours and epilepsy to assess its denoising and SR performance.
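
To make the DWT+conv unit concrete, here is a hedged sketch of a single-level 3D Haar DWT followed by a 1×1×1 convolution, i.e., stacking the eight wavelet sub-bands on the channel axis and mixing them with a pointwise convolution. The Haar basis, channel sizes, and absence of normalization are assumptions; in the paper this kind of unit sits inside a 3D U-Net discriminator.

```python
# Minimal sketch of a single-level 3D Haar DWT followed by a 1x1x1 convolution
# ("DWT+conv" unit); channel sizes and the Haar basis are assumptions.
import torch
import torch.nn as nn

def haar_split(x, dim):
    a = x.index_select(dim, torch.arange(0, x.size(dim), 2))
    b = x.index_select(dim, torch.arange(1, x.size(dim), 2))
    return (a + b) / 2 ** 0.5, (a - b) / 2 ** 0.5   # low-pass, high-pass

def haar_dwt3d(x):
    """x: (B, C, D, H, W) with even D, H, W -> (B, 8*C, D/2, H/2, W/2)."""
    bands = [x]
    for dim in (2, 3, 4):                      # split along depth, height, width in turn
        bands = [b for band in bands for b in haar_split(band, dim)]
    return torch.cat(bands, dim=1)             # 8 sub-bands stacked on the channel axis

class DWTConvUnit(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.mix = nn.Conv3d(8 * in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.mix(haar_dwt3d(x))

x = torch.randn(1, 4, 16, 16, 16)
print(DWTConvUnit(4, 32)(x).shape)  # torch.Size([1, 32, 8, 8, 8])
```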

Understanding Dark Scenes by Contrasting Multi-Modal Observations

  • paper_url: http://arxiv.org/abs/2308.12320
  • repo_url: https://github.com/palmdong/smmcl
  • paper_authors: Xiaoyu Dong, Naoto Yokoya
  • for: To improve the accuracy of dark scene understanding from multi-modal image data.
  • methods: A supervised multi-modal contrastive learning approach performs cross-modal and intra-modal contrast under the supervision of class correlations, increasing the semantic discriminability of the learned multi-modal feature space.
  • results: Experiments on a variety of tasks covering diverse lighting conditions and image modalities show that the approach effectively enhances multi-modal dark scene understanding and achieves state-of-the-art performance compared with previous methods.
    Abstract Understanding dark scenes based on multi-modal image data is challenging, as both the visible and auxiliary modalities provide limited semantic information for the task. Previous methods focus on fusing the two modalities but neglect the correlations among semantic classes when minimizing losses to align pixels with labels, resulting in inaccurate class predictions. To address these issues, we introduce a supervised multi-modal contrastive learning approach to increase the semantic discriminability of the learned multi-modal feature spaces by jointly performing cross-modal and intra-modal contrast under the supervision of the class correlations. The cross-modal contrast encourages same-class embeddings from across the two modalities to be closer and pushes different-class ones apart. The intra-modal contrast forces same-class or different-class embeddings within each modality to be together or apart. We validate our approach on a variety of tasks that cover diverse light conditions and image modalities. Experiments show that our approach can effectively enhance dark scene understanding based on multi-modal images with limited semantics by shaping semantic-discriminative feature spaces. Comparisons with previous methods demonstrate our state-of-the-art performance. Code and pretrained models are available at https://github.com/palmdong/SMMCL.
    摘要 《理解黑色场景基于多Modal图像数据是具有挑战性的,因为可见和辅助modalities都提供有限的semantic信息 для任务。先前的方法主要关注两Modalities的融合,但忽视了semantic类别之间的相关性when minimizing losses to align pixels with labels, resulting in inaccurate class predictions。为解决这些问题,我们提出了一种监督多Modal异构学习方法,以增强学习的多Modal特征空间semantic抑制能力。我们同时在交叉Modal和内部Modal上进行对比,以便在监督class关系下同时实现跨Modal和内部Modal的匹配。交叉Modal对比使得同一个类别的embeddings从不同modalities中更加相近,而不同类别的embeddings则更加分开。内部Modal对比使得同一个类别或不同类别的embeddings在每个modalities中都是 вместе或分开。我们在多种任务上验证了我们的方法,包括不同的照明条件和图像modalities。实验结果表明,我们的方法可以有效地提高基于多Modal图像的黑色场景理解,并且可以Shape semantic-discriminative feature spaces。与先前的方法进行比较,我们的性能达到了国际水平。代码和预训练模型可以在https://github.com/palmdong/SMMCL上获取。》
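
A minimal sketch of a supervised multi-modal contrastive objective in the spirit described above: same-class embeddings from either modality are positives, different-class ones are negatives (a SupCon-style loss). The temperature and the granularity of the embeddings are assumptions rather than the authors' exact formulation.

```python
# Hedged sketch of a supervised cross-modal contrastive loss: same-class embeddings
# from either modality are positives, different-class ones negatives; the temperature
# is an assumed hyper-parameter.
import torch
import torch.nn.functional as F

def supervised_multimodal_contrast(feat_a, feat_b, labels, tau=0.1):
    """feat_a, feat_b: (N, D) embeddings from two modalities; labels: (N,)."""
    z = F.normalize(torch.cat([feat_a, feat_b], dim=0), dim=1)        # (2N, D)
    y = torch.cat([labels, labels], dim=0)                            # (2N,)
    sim = z @ z.t() / tau                                             # scaled cosine similarities
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask                # same-class pairs
    # log-softmax over all other samples, averaged over the positives of each anchor
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_per_anchor = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_per_anchor
    return loss.mean()

a, b = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 3, (8,))
print(supervised_multimodal_contrast(a, b, labels))
```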

SILT: Shadow-aware Iterative Label Tuning for Learning to Detect Shadows from Noisy Labels

  • paper_url: http://arxiv.org/abs/2308.12064
  • repo_url: https://github.com/cralence/silt
  • paper_authors: Han Yang, Tianyu Wang, Xiaowei Hu, Chi-Wing Fu
  • for: To improve shadow detection models by addressing the missing or mislabeled shadows in existing shadow detection datasets.
  • methods: A shadow-aware iterative label tuning framework named SILT is proposed, which explicitly considers noise in shadow labels and trains the deep model in a self-training manner.
  • results: After relabeling the SBU test set and running various experiments, the results show that even a simple U-Net trained with SILT outperforms all state-of-the-art methods by a large margin; trained on SBU / UCF / ISTD, the network reduces the Balanced Error Rate by 25.2% / 36.9% / 21.3% over the best state-of-the-art method.
    Abstract Existing shadow detection datasets often contain missing or mislabeled shadows, which can hinder the performance of deep learning models trained directly on such data. To address this issue, we propose SILT, the Shadow-aware Iterative Label Tuning framework, which explicitly considers noise in shadow labels and trains the deep model in a self-training manner. Specifically, we incorporate strong data augmentations with shadow counterfeiting to help the network better recognize non-shadow regions and alleviate overfitting. We also devise a simple yet effective label tuning strategy with global-local fusion and shadow-aware filtering to encourage the network to make significant refinements on the noisy labels. We evaluate the performance of SILT by relabeling the test set of the SBU dataset and conducting various experiments. Our results show that even a simple U-Net trained with SILT can outperform all state-of-the-art methods by a large margin. When trained on SBU / UCF / ISTD, our network can successfully reduce the Balanced Error Rate by 25.2% / 36.9% / 21.3% over the best state-of-the-art method.
    摘要 现有的阴影检测 dataset oftentimes 包含遗传或错abeled 的阴影,这可能会妨碍 deep learning 模型直接在这些数据上训练。为解决这个问题,我们提出了 SILT,即 Shadow-aware Iterative Label Tuning 框架,这个框架会明确地考虑阴影标签中的噪声,并在自适应方式下训练 deep model。具体来说,我们将强大的数据增强器与阴影伪造相结合,帮助网络更好地识别非阴影区域,并减少适应。我们还创造了一个简单 yet effective 的标签修正策略,具有全球-本地融合和阴影-aware 范围筛选,以鼓励网络对阴影标签中的噪声进行重要修正。我们通过对 SBU 测试集进行重新标签并进行多种实验评估 SILT 的性能。我们的结果显示,即使使用 SILT 训练的简单 U-Net 可以在所有状态对抗方法中表现出色,并且在 SBU / UCF / ISTD 上训练时可以成功降低平衡错误率 BY 25.2% / 36.9% / 21.3%。

HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing

  • paper_url: http://arxiv.org/abs/2308.12061
  • repo_url: None
  • paper_authors: Jonathan Xu, Amna Elmustafa, Liya Weldegebriel, Emnet Negash, Richard Lee, Chenlin Meng, Stefano Ermon, David Lobell
  • for: To provide more accurate and timely cropland assessments for smallholder farms in the Ethiopian regions of Tigray and Amhara.
  • methods: Smallholder farming activity is detected from harvest piles; using expert knowledge and satellite imagery, the authors collected 7k hand-labeled images and 2k ground-collected labels.
  • results: The best models reach about 80% classification performance on hand-labeled data and 90% / 98% accuracy on ground-truth data for Tigray and Amhara respectively; compared with a widely used existing coverage map, an additional 56,621 hectares of cropland are detected in Tigray.
    Abstract Small farms contribute to a large share of the productive land in developing countries. In regions such as sub-Saharan Africa, where 80% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity. However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping. Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world. We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totaling 7k hand-labeled images and 2k ground collected labels. We also benchmark a set of baselines including SOTA models in remote sensing with our best models having around 80% classification performance on hand labelled data and 90%, 98% accuracy on ground truth data for Tigray, Amhara respectively. We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray. We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food insecure region.
    摘要 小型农场对发展中国家的生产地域占有重要的比重。如在非洲萨赫拉区,80%的农场面积在2公顷以下(小型农场),评估可持续发展的重要一环是映射小holder农场。然而,传统方法的映射农场面积受到小型农场的多样性和细节的限制。在这里,我们介绍了一种新的方法,基于农作物收割堆的检测。我们提供了在埃塞俄比亚地区特拉YES和阿姆拉地区2020-2023年度的HarvestNet数据集,包括7000个专家知识和卫星图像的手动标注,以及2000个地面采集标注。我们还对比了一些标准的准确性模型,我们的最佳模型在手动标注数据上达到80%的分类性能,在地面真实数据上达到90%、98%的准确性。我们还进行了与一个广泛使用的现有的覆盖地图进行视觉比较,并显示了我们的模型可以检测到特拉YES地区的56621公顷更多的农地。我们结论是,通过远程感知收割堆可以为食物不足地区提供更时准确的农地评估。

Manipulating Embeddings of Stable Diffusion Prompts

  • paper_url: http://arxiv.org/abs/2308.12059
  • repo_url: https://github.com/webis-de/arxiv23-prompt-embedding-manipulation
  • paper_authors: Niklas Deckers, Julia Peters, Martin Potthast
  • for: To propose and analyze methods that modify the prompt embedding directly, enabling finer-grained, targeted control over generated images that reflects user intent.
  • methods: The generative text-to-image model is treated as a continuous function, and gradients are passed between the image space and the prompt embedding space to achieve fine-grained control.
  • results: Experiments demonstrate the feasibility of the approach.
    Abstract Generative text-to-image models such as Stable Diffusion allow users to generate images based on a textual description, the prompt. Changing the prompt is still the primary means for the user to change a generated image as desired. However, changing the image by reformulating the prompt remains a difficult process of trial and error, which has led to the emergence of prompt engineering as a new field of research. We propose and analyze methods to change the embedding of a prompt directly instead of the prompt text. It allows for more fine-grained and targeted control that takes into account user intentions. Our approach treats the generative text-to-image model as a continuous function and passes gradients between the image space and the prompt embedding space. By addressing different user interaction problems, we can apply this idea in three scenarios: (1) Optimization of a metric defined in image space that could measure, for example, image style. (2) Assistance of users in creative tasks by enabling them to navigate the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that the user has seen in a particular seed but finds difficult to describe in the prompt. Our experiments demonstrate the feasibility of the described methods.
    摘要 <>将文本描述转换为生成图像的模型,如稳定扩散,允许用户根据文本描述生成图像。但是,通过修改描述文本来更改生成的图像仍然是一个困难的过程,它导致了提前工程的出现。我们提议和分析改变描述文本 embedding 的方法,以实现更细化和有target的控制,考虑用户的INTENT。我们的方法将生成文本到图像模型看作是连续函数,将 gradients 传递 между图像空间和描述文本 embedding 空间。通过解决不同的用户交互问题,我们可以应用这个想法在三个场景中:(1)优化图像空间中定义的一个指标,例如图像风格。(2)帮助用户完成创意任务,让他们可以在选择的方向上导航图像空间。(3)将描述文本 embedding 包含用户看到的信息,但是很难用描述在描述文本中。我们的实验表明这些方法的可行性。
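
The sketch below illustrates the core idea of passing gradients between image space and prompt-embedding space, treating the generator as a differentiable function. Both generate and style_score are hypothetical stand-ins so the example runs end to end; they are not an actual Stable Diffusion API.

```python
# Hedged sketch of optimizing a prompt embedding against an image-space metric,
# treating the text-to-image generator as a differentiable function; `generate` and
# `style_score` are hypothetical stand-ins, not a real Stable Diffusion interface.
import torch

def optimize_prompt_embedding(embedding, generate, style_score, steps=50, lr=1e-2):
    """embedding: (T, D) prompt embedding; generate: embedding -> image tensor;
    style_score: image -> scalar to maximize (e.g. a differentiable style metric)."""
    emb = embedding.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = generate(emb)              # gradients flow from image space ...
        loss = -style_score(image)         # ... back into the prompt-embedding space
        loss.backward()
        opt.step()
    return emb.detach()

# Toy stand-ins so the sketch runs end to end:
toy_generate = lambda e: torch.tanh(e.mean() + torch.zeros(3, 64, 64))
toy_score = lambda img: img.mean()
new_emb = optimize_prompt_embedding(torch.randn(77, 768), toy_generate, toy_score)
print(new_emb.shape)  # torch.Size([77, 768])
```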

DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration

  • paper_url: http://arxiv.org/abs/2308.12058
  • repo_url: https://github.com/weeknan/dr-tune
  • paper_authors: Nan Zhou, Jiaxin Chen, Di Huang
  • for: To better exploit the knowledge in pretrained visual models for downstream tasks, proposing a new fine-tuning framework.
  • methods: The proposed Distribution Regularization with semantic calibration (DR-Tune) framework applies distribution regularization to the downstream task head, preventing over-fitting while allowing sufficient training of the downstream encoder; a semantic calibration (SC) module bridges the gap between the pretrained and downstream feature distributions.
  • results: Extensive experiments on multiple image classification datasets show that DR-Tune improves performance under different pretraining strategies.
    Abstract The visual models pretrained on large-scale benchmarks encode general knowledge and prove effective in building more powerful representations for downstream tasks. Most existing approaches follow the fine-tuning paradigm, either by initializing or regularizing the downstream model based on the pretrained one. The former fails to retain the knowledge in the successive fine-tuning phase, thereby prone to be over-fitting, and the latter imposes strong constraints to the weights or feature maps of the downstream model without considering semantic drift, often incurring insufficient optimization. To deal with these issues, we propose a novel fine-tuning framework, namely distribution regularization with semantic calibration (DR-Tune). It employs distribution regularization by enforcing the downstream task head to decrease its classification error on the pretrained feature distribution, which prevents it from over-fitting while enabling sufficient training of downstream encoders. Furthermore, to alleviate the interference by semantic drift, we develop the semantic calibration (SC) module to align the global shape and class centers of the pretrained and downstream feature distributions. Extensive experiments on widely used image classification datasets show that DR-Tune consistently improves the performance when combing with various backbones under different pretraining strategies. Code is available at: https://github.com/weeknan/DR-Tune.
    摘要 “视觉模型在大规模标准 datasets 预训练后encode general knowledge,并且在下游任务建立更强大的表示。现有的方法大多采用 fine-tuning 方法,包括在预训练模型的基础上初始化或正则化下游模型。前者容易过拟合,后者对下游模型的权重或特征图进行强制约束,而不考虑 semantics 的变化,通常导致优化不足。为解决这些问题,我们提出了一种新的 fine-tuning 框架,即 distribution regularization with semantic calibration (DR-Tune)。它通过在下游任务头中减少预训练特征分布上的类错误,防止过拟合而允许下游编码器充分训练。此外,为了缓解 semantics 的变化所带来的干扰,我们开发了 semantic calibration (SC) 模块,用于对预训练和下游特征分布的全球形态和类中心进行对齐。我们在各种图像分类数据集进行了广泛的实验,结果表明 DR-Tune 在不同的预训练策略下均能提高表现。代码可以在 GitHub 上获取:https://github.com/weeknan/DR-Tune。”

Head-Tail Cooperative Learning Network for Unbiased Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.12048
  • repo_url: https://github.com/wanglei0618/htcl
  • paper_authors: Lei Wang, Zejian Yuan, Yao Lu, Badong Chen
  • for: solves the challenge of head-biased prediction in scene graph generation by proposing a model-agnostic Head-Tail Collaborative Learning (HTCL) network.
  • methods: includes head-prefer and tail-prefer feature representation branches that collaborate to achieve accurate recognition of both head and tail predicates, and a self-supervised learning approach to enhance the prediction ability of the tail-prefer feature representation branch.
  • results: achieves higher mean Recall with a minimal sacrifice in Recall and achieves a new state-of-the-art overall performance on various SGG models on VG150, Open Images V6 and GQA200 datasets.
    Abstract Scene Graph Generation (SGG) as a critical task in image understanding, facing the challenge of head-biased prediction caused by the long-tail distribution of predicates. However, current unbiased SGG methods can easily prioritize improving the prediction of tail predicates while ignoring the substantial sacrifice in the prediction of head predicates, leading to a shift from head bias to tail bias. To address this issue, we propose a model-agnostic Head-Tail Collaborative Learning (HTCL) network that includes head-prefer and tail-prefer feature representation branches that collaborate to achieve accurate recognition of both head and tail predicates. We also propose a self-supervised learning approach to enhance the prediction ability of the tail-prefer feature representation branch by constraining tail-prefer predicate features. Specifically, self-supervised learning converges head predicate features to their class centers while dispersing tail predicate features as much as possible through contrast learning and head center loss. We demonstrate the effectiveness of our HTCL by applying it to various SGG models on VG150, Open Images V6 and GQA200 datasets. The results show that our method achieves higher mean Recall with a minimal sacrifice in Recall and achieves a new state-of-the-art overall performance. Our code is available at https://github.com/wanglei0618/HTCL.
    摘要 Scene Graph Generation(SGG)是图像理解中的关键任务,面临长 хвоста分布的预测问题。然而,当前的无偏SGG方法可能会忽略改进头预测的代价,导致偏头偏尾的转换。为解决这个问题,我们提出了无关模型的Head-Tail Collaborative Learning(HTCL)网络,包括头预测和尾预测的特征表示分支。我们还提出了一种自然学习方法来增强尾预测特征表示分支的预测能力。具体来说,自然学习使得头预测特征分布到其类中心,而尾预测特征分布到最大程度可能的位置,通过对比学习和头中心损失来实现。我们在VG150、Open Images V6和GQA200 datasets上应用了我们的HTCL方法,结果显示,我们的方法可以实现更高的含涵率,同时减少预测错误的代价,达到新的领先性表现。我们的代码可以在https://github.com/wanglei0618/HTCL上获取。

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

  • paper_url: http://arxiv.org/abs/2308.12035
  • repo_url: None
  • paper_authors: Shuhei Kurita, Naoki Katsura, Eri Onami
  • for: To develop agents that can ground textual descriptions on scene objects around them, so that glass devices or autonomous robots can act on intuitive text instructions.
  • methods: Building on the massive egocentric video dataset Ego4D, the authors construct RefEgo, a video-based referring expression comprehension dataset with more than 12k video clips and 41 hours of annotation.
  • results: Experiments show that combining state-of-the-art 2D referring expression comprehension models with an object tracking algorithm enables tracking of the referred object across a video, even when it leaves the frame or multiple similar objects appear.
    Abstract Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and behave following intuitive text instructions. Such capability is of necessity for glass-devices or autonomous robots to localize referred objects in the real-world. In the conventional referring expression comprehension tasks of images, however, datasets are mostly constructed based on the web-crawled data and don't reflect diverse real-world structures on the task of grounding textual expressions in diverse objects in the real world. Recently, a massive-scale egocentric video dataset of Ego4D was proposed. Ego4D covers around the world diverse real-world scenes including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, manufacturing, etc. Based on egocentric videos of Ego4D, we constructed a broad coverage of the video-based referring expression comprehension dataset: RefEgo. Our dataset includes more than 12k video clips and 41 hours for video-based referring expression comprehension annotation. In experiments, we combine the state-of-the-art 2D referring expression comprehension models with the object tracking algorithm, achieving the video-wise referred object tracking even in difficult conditions: the referred object becomes out-of-frame in the middle of the video or multiple similar objects are presented in the video.
    摘要 文本表达grounding在场景物体上从第一人称视角是开发意识到围场景和按照直觉文本指令行为的真正挑战。这种能力是让玻璃设备或自动化机器人本地化参照对象的必要条件。在传统的图像引用表达理解任务上, however, dataset是基于网络爬虫数据构建的,而不具备多样化的实际世界结构。近期,一个大规模的 Egocentric video 数据集 Ego4D 被提出。Ego4D 覆盖了全球多样化的实际场景,包括室内和室外的 cooking、shopping、散步、谈话、制造等场景。基于 Ego4D 的 egocentric 视频,我们构建了包括 более чем 12k 视频剪辑和 41 小时的视频基于引用表达理解注解。在实验中,我们将 state-of-the-art 2D 引用表达理解模型与物体跟踪算法结合,实现视频基础referenced对象跟踪,包括对象在视频中心位置出现或多个类似对象出现在视频中的情况。

Distribution-Aware Calibration for Object Detection with Noisy Bounding Boxes

  • paper_url: http://arxiv.org/abs/2308.12017
  • repo_url: None
  • paper_authors: Donghao Zhou, Jialin Li, Jinpeng Li, Jiancheng Huang, Qiang Nie, Yong Liu, Bin-Bin Gao, Qiong Wang, Pheng-Ann Heng, Guangyong Chen
  • for: To improve object detector training and avoid the drop in detection performance caused by noisy bounding box annotations.
  • methods: Based on modeling the spatial distribution of proposals to extract potential object locations, three distribution-aware techniques are developed: distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est), improving classification, localization, and interpretability.
  • results: Extensive experiments on large-scale noisy versions of Pascal VOC and MS-COCO show that DISCO achieves state-of-the-art detection performance, especially at high noise levels.
    Abstract Large-scale well-annotated datasets are of great importance for training an effective object detector. However, obtaining accurate bounding box annotations is laborious and demanding. Unfortunately, the resultant noisy bounding boxes could cause corrupt supervision signals and thus diminish detection performance. Motivated by the observation that the real ground-truth is usually situated in the aggregation region of the proposals assigned to a noisy ground-truth, we propose DIStribution-aware CalibratiOn (DISCO) to model the spatial distribution of proposals for calibrating supervision signals. In DISCO, spatial distribution modeling is performed to statistically extract the potential locations of objects. Based on the modeled distribution, three distribution-aware techniques, i.e., distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est), are developed to improve classification, localization, and interpretability, respectively. Extensive experiments on large-scale noisy image datasets (i.e., Pascal VOC and MS-COCO) demonstrate that DISCO can achieve state-of-the-art detection performance, especially at high noise levels.
    摘要 Motivated by the observation that real ground truth is usually located in the aggregation region of proposals assigned to noisy ground truth, we propose Distribution-aware Calibration (DISCO) to model the spatial distribution of proposals for calibrating supervision signals. DISCO uses spatial distribution modeling to statistically extract potential object locations. Based on the modeled distribution, we develop three distribution-aware techniques: distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est) to improve classification, localization, and interpretability, respectively.Extensive experiments on large-scale noisy image datasets (Pascal VOC and MS-COCO) show that DISCO achieves state-of-the-art detection performance, especially at high noise levels.

StofNet: Super-resolution Time of Flight Network

  • paper_url: http://arxiv.org/abs/2308.12009
  • repo_url: https://github.com/hahnec/stofnet
  • paper_authors: Christopher Hahne, Michel Hayoz, Raphael Sznitman
  • for: To address the difficulty of Time of Flight (ToF) sensing under complex ambient conditions by using modern super-resolution techniques to learn varying surroundings, improving the reliability and accuracy of ToF detection.
  • methods: The proposed StofNet architecture combines super-resolution with an efficient residual contraction block to balance fine signal details against large-scale contextual information.
  • results: A comparison against six state-of-the-art methods shows that StofNet is superior in precision, reliability, and model complexity; the SToF-Chirp dataset, captured with an airborne ultrasound transducer, and the code are released.
    Abstract Time of Flight (ToF) is a prevalent depth sensing technology in the fields of robotics, medical imaging, and non-destructive testing. Yet, ToF sensing faces challenges from complex ambient conditions making an inverse modelling from the sparse temporal information intractable. This paper highlights the potential of modern super-resolution techniques to learn varying surroundings for a reliable and accurate ToF detection. Unlike existing models, we tailor an architecture for sub-sample precise semi-global signal localization by combining super-resolution with an efficient residual contraction block to balance between fine signal details and large scale contextual information. We consolidate research on ToF by conducting a benchmark comparison against six state-of-the-art methods for which we employ two publicly available datasets. This includes the release of our SToF-Chirp dataset captured by an airborne ultrasound transducer. Results showcase the superior performance of our proposed StofNet in terms of precision, reliability and model complexity. Our code is available at https://github.com/hahnec/stofnet.
    摘要 时间飞行(ToF)是现代深度感知技术的广泛应用领域,包括 робо扮、医疗成像和非锋渠测试。然而,ToF感知受到了复杂的 ambient 环境的挑战,从而使得对稀疏时间信息的逆模型变得不可能。本文强调了现代超分解技术的潜在作用,以提高ToF探测的可靠性和准确性。与现有模型不同,我们专门设计了一种束缚精度和大规模信息的混合块,以平衡细信息和大规模信息的权重。我们对ToF进行了 benchmark 比较,使用了六种现有的状态之 искусственный智能方法,并使用了两个公共可用的数据集。这包括我们发布的 SToF-Chirp 数据集,由空中ultrasound 传感器记录。结果表明我们提posed StofNet 的性能比其他六种方法更高, both in terms of precision and reliability.我们的代码可以在 https://github.com/hahnec/stofnet 上获取。

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

  • paper_url: http://arxiv.org/abs/2308.12006
  • repo_url: None
  • paper_authors: Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang
  • for: This paper addresses RGB-D action and gesture recognition.
  • methods: A novel Multi-stage Factorized Spatio-Temporal (MFST) architecture is proposed, consisting of a 3D Central Difference Convolution (CDC) stem module and multiple factorized spatio-temporal stages (a code sketch of the CDC operation follows this entry).
  • results: The model performs strongly on RGB-D action and gesture recognition tasks, outperforming previous methods.
    Abstract RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi- Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.
    摘要 MFST模型包括3D中心差异卷积核心(CDC-Stem)模块和多个因子化空间时间阶段。CDC-Stem模块使得细节的时间感知更加细化,而多个层次的因子化空间时间阶段通过多 scales卷积和Transformer(MSC-Trans)混合块和Weight-shared Multi-Scale Transformer(WMS-Trans)块来构建维度独立的高级semantic primitives。具体来说,CDC-Stem模块首先捕捉最低级的空间时间特征,然后将其传递给以下因子化空间时间阶段,以 capture层次的空间和时间特征。这些创新的设计结合使得RGB-D动作和姿势识别达到了state-of-the-art的表现。
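
A hedged sketch of a 3D central difference convolution of the kind used in the CDC-Stem: the vanilla 3D convolution response minus a theta-weighted central-difference term, where the latter reduces to a 1×1×1 convolution with the kernel weights summed over their spatio-temporal extent. The theta value and layer sizes are assumptions.

```python
# Hedged sketch of a 3D central difference convolution (CDC); theta and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # Central-difference term: equivalent to convolving x with the kernel weights
        # summed over their spatio-temporal extent (a 1x1x1 convolution).
        center_w = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        diff = F.conv3d(x, center_w, bias=None, stride=1, padding=0)
        return out - self.theta * diff

x = torch.randn(1, 3, 8, 32, 32)   # (batch, channels, frames, height, width)
print(CDC3d(3, 16)(x).shape)       # torch.Size([1, 16, 8, 32, 32])
```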

Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2308.12001
  • repo_url: None
  • paper_authors: Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, Weisi Lin
  • for: This paper focuses on improving image quality assessment (IQA) tasks, which are challenging due to diverse image contents and limited data availability.
  • methods: The authors use a combination of pre-trained convolutional neural networks (CNNs) and a local distortion extractor/injector to extract and inject local distortion features into a large-scale pre-trained vision transformer (ViT) model.
  • results: The proposed method achieves state-of-the-art performance on popular IQA datasets, indicating that IQA can benefit from stronger high-level features drawn from large-scale pre-trained models.
    Abstract Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge, owing to the intricate distortion conditions, diverse image contents, and limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which greatly benefit from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law in high-level tasks is also applicable to IQA task which is closely related to low-level clues. In this paper, we demonstrate that with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better in IQA tasks. Specifically, for the lack of local distortion structure and inductive bias of vision transformer (ViT), alongside the large-scale pretrained ViT, we use another pretrained convolution neural network (CNN), which is well known for capturing the local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into ViT. By only training the extractor and injector, our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
    摘要 图像质量评估(IQA)是计算机视觉领域的基本任务,但它仍然是一个未解决的挑战,主要是因为图像的复杂预期、多样化内容和数据的有限性。在最近,社区所看到的是许多大规模预训练基础模型的出现,这些模型受益于数据和参数的增加。然而,是否存在一个涉及到IQA任务的扩展法则仍然是一个开放的问题。在这篇论文中,我们表明了一种将本地扭曲特征注入到大规模预训练的基础模型中,以提高IQA任务的表现。具体来说,由于视Transformer(ViT)缺乏本地扭曲结构和引导因子,我们采用另一种预训练的卷积神经网络(CNN),以提取多尺度图像特征。此外,我们提出了一种本地扭曲EXTractor,以从预训练CNN中提取本地扭曲特征。最后,我们提出了一种本地扭曲注入器,以注入本地扭曲特征到ViT中。只需训练EXTractor和注入器,我们的方法可以从强大的基础模型中继承丰富的知识,并在流行的IQA数据集上达到状态率表现,表明IQA不仅是一个低级问题,还可以受益于更强的高级特征,从大规模预训练模型中继承。

Progressive Feature Mining and External Knowledge-Assisted Text-Pedestrian Image Retrieval

  • paper_url: http://arxiv.org/abs/2308.11994
  • repo_url: None
  • paper_authors: Huafeng Li, Shedan Yang, Yafei Zhang, Dapeng Tao, Zhengtao Yu
  • for: To retrieve the matching pedestrian image from a text description of pedestrian appearance.
  • methods: A progressive feature mining and external knowledge-assisted feature purification method is proposed to avoid losing discriminative information and to improve the expressiveness of features.
  • results: Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, which even surpasses large-scale model-based methods on large-scale datasets.
    Abstract Text-Pedestrian Image Retrieval aims to use the text describing pedestrian appearance to retrieve the corresponding pedestrian image. This task involves not only modality discrepancy, but also the challenge of the textual diversity of pedestrians with the same identity. At present, although existing research progress has been made in text-pedestrian image retrieval, these methods do not comprehensively consider the above-mentioned problems. Considering these, this paper proposes a progressive feature mining and external knowledge-assisted feature purification method. Specifically, we use a progressive mining mode to enable the model to mine discriminative features from neglected information, thereby avoiding the loss of discriminative information and improving the expression ability of features. In addition, to further reduce the negative impact of modal discrepancy and text diversity on cross-modal matching, we propose to use other sample knowledge of the same modality, i.e., external knowledge to enhance identity-consistent features and weaken identity-inconsistent features. This process purifies features and alleviates the interference caused by textual diversity and negative sample correlation features of the same modal. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, and the retrieval performance even surpasses that of the large-scale model-based method on large-scale datasets.
    摘要 文本行人图像检索目标是使用描述行人外观的文本来检索对应的行人图像。这个任务面临着多样性扩散和文本多样性问题。现有研究已取得一定进展,但这些方法并不完全考虑上述问题。为此,本文提出了一种进程式特征挖掘和外知助动特征纯化方法。具体来说,我们使用进程式挖掘模式,让模型从抛弃信息中挖掘出特征,以避免损失特征表达能力和提高特征表达能力。此外,为了进一步减少模式差异和文本多样性对于跨模态匹配的负面影响,我们提出了使用同一模式其他样本的知识,即外知来增强一致性特征和弱化不一致性特征。这个过程纯化特征,减少了文本多样性和负样本相互干扰的影响。我们在三个挑战性 dataset 进行了广泛的实验,结果表明我们的方法效果和前期研究超越,甚至在大规模模型基础方法上的大规模dataset上表现出色。

RankMixup: Ranking-Based Mixup Training for Network Calibration

  • paper_url: http://arxiv.org/abs/2308.11990
  • repo_url: None
  • paper_authors: Jongyoun Noh, Hyekang Park, Junghyup Lee, Bumsub Ham
  • for: To accurately estimate the confidence of deep neural networks (network calibration), which is particularly important when deploying them in real-world systems.
  • methods: Mixup is leveraged during training, and a new framework, RankMixup, is proposed to address the problem that label mixtures in mixup may not accurately represent the augmented samples (a code sketch follows this entry).
  • results: Experiments show that RankMixup trains better-calibrated networks, yielding more reliable confidence estimates.
    Abstract Network calibration aims to accurately estimate the level of confidences, which is particularly important for employing deep neural networks in real-world systems. Recent approaches leverage mixup to calibrate the network's predictions during training. However, they do not consider the problem that mixtures of labels in mixup may not accurately represent the actual distribution of augmented samples. In this paper, we present RankMixup, a novel mixup-based framework alleviating the problem of the mixture of labels for network calibration. To this end, we propose to use an ordinal ranking relationship between raw and mixup-augmented samples as an alternative supervisory signal to the label mixtures for network calibration. We hypothesize that the network should estimate a higher level of confidence for the raw samples than the augmented ones (Fig.1). To implement this idea, we introduce a mixup-based ranking loss (MRL) that encourages lower confidences for augmented samples compared to raw ones, maintaining the ranking relationship. We also propose to leverage the ranking relationship among multiple mixup-augmented samples to further improve the calibration capability. Augmented samples with larger mixing coefficients are expected to have higher confidences and vice versa (Fig.1). That is, the order of confidences should be aligned with that of mixing coefficients. To this end, we introduce a novel loss, M-NDCG, in order to reduce the number of misaligned pairs of the coefficients and confidences. Extensive experimental results on standard benchmarks for network calibration demonstrate the effectiveness of RankMixup.
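
A hedged sketch of a mixup-based ranking loss in the spirit of MRL: the maximum softmax confidence of a mixup-augmented sample is pushed below that of the corresponding raw sample by a margin, and the loss is added to the usual cross-entropy. The margin, the Beta mixing distribution, and the single-pair formulation (rather than the multi-coefficient M-NDCG ranking) are assumptions.

```python
# Hedged sketch of a mixup-based ranking loss (MRL-style): augmented samples should be
# less confident than raw ones; margin and mixing setup are assumptions.
import torch
import torch.nn.functional as F

def mixup_ranking_loss(logits_raw, logits_mix, margin=0.1):
    conf_raw = F.softmax(logits_raw, dim=1).amax(dim=1)
    conf_mix = F.softmax(logits_mix, dim=1).amax(dim=1)
    # Penalize whenever the augmented sample is not at least `margin` less confident.
    return F.relu(conf_mix - conf_raw + margin).mean()

# Toy usage: mix inputs with a Beta-sampled coefficient, then combine with cross-entropy.
def training_step(model, x, y, lam_dist=torch.distributions.Beta(1.0, 1.0)):
    lam = lam_dist.sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    logits_raw, logits_mix = model(x), model(x_mix)
    ce = F.cross_entropy(logits_raw, y)
    return ce + mixup_ranking_loss(logits_raw, logits_mix)

model = torch.nn.Linear(16, 5)
print(training_step(model, torch.randn(8, 16), torch.randint(0, 5, (8,))))
```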

Multi-Modal Multi-Task (3MT) Road Segmentation

  • paper_url: http://arxiv.org/abs/2308.11983
  • repo_url: https://github.com/erkanmilli/3mt-roadseg
  • paper_authors: Erkan Milli, Özgür Erkent, Asım Egemen Yılmaz
  • for: To provide a cost-effective and highly accurate road segmentation method using multi-modal sensor data: RGB and LiDAR depth images together with an IMU/GNSS inertial navigation system.
  • methods: Raw sensor inputs are used instead of representations with high pre-processing cost (such as surface normals or dense depth predictions), as is typical in many SOTA works; a low-cost fusion model within a multi-task learning architecture minimizes both pre-processing and model computation costs.
  • results: Experiments on the KITTI dataset yield a fast, high-performance solution, and experiments on Cityscapes show that the method works with different sensor modalities; segmentation results at full and half resolution are competitive with existing methods.
    Abstract Multi-modal systems have the capacity of producing more reliable results than systems with a single modality in road detection due to perceiving different aspects of the scene. We focus on using raw sensor inputs instead of, as it is typically done in many SOTA works, leveraging architectures that require high pre-processing costs such as surface normals or dense depth predictions. By using raw sensor inputs, we aim to utilize a low-cost model thatminimizes both the pre-processing andmodel computation costs. This study presents a cost-effective and highly accurate solution for road segmentation by integrating data from multiple sensorswithin a multi-task learning architecture.Afusion architecture is proposed in which RGB and LiDAR depth images constitute the inputs of the network. Another contribution of this study is to use IMU/GNSS (inertial measurement unit/global navigation satellite system) inertial navigation system whose data is collected synchronously and calibrated with a LiDAR-camera to compute aggregated dense LiDAR depth images. It has been demonstrated by experiments on the KITTI dataset that the proposed method offers fast and high-performance solutions. We have also shown the performance of our method on Cityscapes where raw LiDAR data is not available. The segmentation results obtained for both full and half resolution images are competitive with existing methods. Therefore, we conclude that our method is not dependent only on raw LiDAR data; rather, it can be used with different sensor modalities. The inference times obtained in all experiments are very promising for real-time experiments.
    摘要 多模式系统可以生成更可靠的结果,因为它们可以感知不同方面的场景。我们集中在使用原始感知输入而不是,如多数State-of-the-Art工作一样,利用需要高预处理成本的 arquitectures,例如表面法向量或密集深度预测。通过使用原始感知输入,我们希望实现低成本模型,以降低预处理和计算成本。这个研究提出了一种可靠和高精度的解决方案,通过将多种感知器 Integration into a multi-task learning architecture。我们提议的架构包括RGB和LiDAR深度图像作为网络的输入。此外,我们还使用IMU/GNSS(普通测量单元/全球导航卫星系统)的抗 gravitational 数据,同步采集并与LiDAR-Camera进行同步准确 calibration,以计算聚合的密集LiDAR深度图像。经过实验表明,我们的方法可以在KITTI数据集上提供快速和高性能的解决方案。此外,我们还对Cityscapes数据集进行了实验,并证明我们的方法可以使用不同的感知模式。所得到的分割结果与现有方法相当,因此我们可以 concluced that our method is not dependent on raw LiDAR data; rather, it can be used with different sensor modalities。实验结果表明,在所有实验中的推理时间很有前途,适用于实时实验。

Rotation-Invariant Completion Network

  • paper_url: http://arxiv.org/abs/2308.11979
  • repo_url: https://github.com/agiachris/rotational3DCNN
  • paper_authors: Yu Chen, Pengcheng Shi
  • for: To make point cloud completion stable and reliable for real-world point clouds that appear in diverse poses.
  • methods: A Rotation-Invariant Completion Network (RICNet) built from a Dual Pipeline Completion Network (DPCNet) and an enhancing module, which extracts rotation-invariant features to keep feature extraction consistent under rotation and translation.
  • results: Applying random transformations to the point clouds of the MVP dataset, experiments show that RICNet delivers better completion performance under diverse poses than existing methods.
    Abstract Real-world point clouds usually suffer from incompleteness and display different poses. While current point cloud completion methods excel in reproducing complete point clouds with consistent poses as seen in the training set, their performance tends to be unsatisfactory when handling point clouds with diverse poses. We propose a network named Rotation-Invariant Completion Network (RICNet), which consists of two parts: a Dual Pipeline Completion Network (DPCNet) and an enhancing module. Firstly, DPCNet generates a coarse complete point cloud. The feature extraction module of DPCNet can extract consistent features, no matter if the input point cloud has undergone rotation or translation. Subsequently, the enhancing module refines the fine-grained details of the final generated point cloud. RICNet achieves better rotation invariance in feature extraction and incorporates structural relationships in man-made objects. To assess the performance of RICNet and existing methods on point clouds with various poses, we applied random transformations to the point clouds in the MVP dataset and conducted experiments on them. Our experiments demonstrate that RICNet exhibits superior completion performance compared to existing methods.
    摘要 Translation:real-world point clouds 通常会受到不完整性和不同姿态的影响。当前的点云完成方法能够很好地重建完整的点云,但是它们在处理不同姿态的点云时表现不佳。我们提议一种名为Rotation-Invariant Completion Network(RICNet)的网络,它包括两部分:一个双管道完成网络(DPCNet)和一个优化模块。首先,DPCNet生成一个粗略的完整点云。DPCNet的特征提取模块可以无论输入点云是否经历了旋转或平移,都可以提取一致的特征。然后,优化模块进行细化细节的更新。RICNet实现了更好的旋转不变性在特征提取中,并具有结构关系在人工物体中。为了评估RICNet和现有方法在不同姿态的点云上的表现,我们对MVP数据集中的点云应用了随机变换,并在其上进行了实验。我们的实验结果表明,RICNet在不同姿态的点云上的完成性比现有方法更高。

Anisotropic Hybrid Networks for liver tumor segmentation with uncertainty quantification

  • paper_url: http://arxiv.org/abs/2308.11969
  • repo_url: None
  • paper_authors: Benjamin Lambert, Pauline Roca, Florence Forbes, Senan Doyle, Michel Dojat
  • for: Liver and tumor segmentation for treatment strategy guidance.
  • methods: Two different pipelines based on anisotropic models were used for segmentation: a baseline multi-class model and two distinct binary models.
  • results: Both pipelines exhibited different strengths and weaknesses, and an uncertainty quantification strategy was proposed to identify potential false positive tumor lesions.
    Abstract The burden of liver tumors is important, ranking as the fourth leading cause of cancer mortality. In case of hepatocellular carcinoma (HCC), the delineation of liver and tumor on contrast-enhanced magnetic resonance imaging (CE-MRI) is performed to guide the treatment strategy. As this task is time-consuming, needs high expertise and could be subject to inter-observer variability there is a strong need for automatic tools. However, challenges arise from the lack of available training data, as well as the high variability in terms of image resolution and MRI sequence. In this work we propose to compare two different pipelines based on anisotropic models to obtain the segmentation of the liver and tumors. The first pipeline corresponds to a baseline multi-class model that performs the simultaneous segmentation of the liver and tumor classes. In the second approach, we train two distinct binary models, one segmenting the liver only and the other the tumors. Our results show that both pipelines exhibit different strengths and weaknesses. Moreover we propose an uncertainty quantification strategy allowing the identification of potential false positive tumor lesions. Both solutions were submitted to the MICCAI 2023 Atlas challenge regarding liver and tumor segmentation.
    摘要 liver tumors 是一种重要的负担, ranks as the fourth leading cause of cancer mortality。在hepatocellular carcinoma(HCC)的情况下,通过对吸引磁共振成像(CE-MRI)图像进行定义肝脏和肿瘤的分割,以便guide the treatment strategy。然而,这项工作需要很高的专业技能,时间费用很高,并且存在 между观察员的变化,因此有强需求 для自动工具。然而,由于数据不足以及图像分辨率和MRI序列的高变化,这些挑战是非常大的。在这项工作中,我们提出了两个不同的管道,基于不规则模型来获得肝脏和肿瘤的分割。第一个管道是一个基线多类模型,同时分割肝脏和肿瘤类型。在第二个方法中,我们训练了两个不同的二进制模型,一个用于分割肝脏,另一个用于分割肿瘤。我们的结果表明,这两个管道具有不同的优劣点。此外,我们还提出了一种不确定性评估策略,以便标识潜在的假阳性肿瘤涂抹。这两个解决方案都被提交到了MICCAI 2023 Atlas challenge关于肝脏和肿瘤分割。
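
The abstract does not detail the uncertainty quantification strategy, so the snippet below only shows a generic, commonly used alternative for flagging potential false-positive lesions: score each connected tumor component by its mean voxel-wise predictive entropy and flag high-entropy components. The thresholds and the entropy criterion are assumptions, not the authors' method.

```python
# Generic illustration (not the authors' exact strategy) of flagging potentially
# false-positive tumor lesions from per-voxel predictive probabilities.
import numpy as np
from scipy import ndimage

def flag_uncertain_lesions(tumor_prob, prob_thr=0.5, entropy_thr=0.5):
    """tumor_prob: (D, H, W) voxel-wise tumor probabilities in [0, 1]."""
    eps = 1e-7
    entropy = -(tumor_prob * np.log(tumor_prob + eps)
                + (1 - tumor_prob) * np.log(1 - tumor_prob + eps))   # binary entropy, max ~0.69
    mask = tumor_prob > prob_thr
    labels, n = ndimage.label(mask)
    flags = {}
    for lesion_id in range(1, n + 1):
        lesion = labels == lesion_id
        flags[lesion_id] = float(entropy[lesion].mean()) > entropy_thr
    return labels, flags   # flags[i] == True marks lesion i as a potential false positive

prob = np.random.rand(16, 64, 64).astype(np.float32)
labels, flags = flag_uncertain_lesions(prob)
print(len(flags))
```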

Gaze Estimation on Spresense

  • paper_url: http://arxiv.org/abs/2308.12313
  • repo_url: None
  • paper_authors: Thomas Ruegg, Pietro Bonazzi, Andrea Ronco
  • for: To implement a gaze estimation system for fields such as human-computer interaction, virtual reality, and medicine, and to evaluate its latency, MACs/cycle, and power consumption.
  • methods: The system runs on the Sony Spresense microcontroller board with the TinyTrackerS model, which is 169 kB in size, has 85.8k parameters, and runs at 3 FPS.
  • results: The system exhibits low latency and low MACs/cycle and performs gaze estimation in real time.
    Abstract Gaze estimation is a valuable technology with numerous applications in fields such as human-computer interaction, virtual reality, and medicine. This report presents the implementation of a gaze estimation system using the Sony Spresense microcontroller board and explores its performance in latency, MAC/cycle, and power consumption. The report also provides insights into the system's architecture, including the gaze estimation model used. Additionally, a demonstration of the system is presented, showcasing its functionality and performance. Our lightweight model TinyTrackerS is a mere 169Kb in size, using 85.8k parameters and runs on the Spresense platform at 3 FPS.
    摘要 gaze estimation 是一种有价值的技术,它在人工智能、虚拟现实和医疗领域有很多应用。这份报告介绍了使用索尼 Spresense 微控器板实现的 gaze estimation 系统,并评估了它的响应时间、MAC/周期和电力消耗。报告还提供了系统的架构设计,包括使用的 gaze estimation 模型。此外,报告还提供了系统的示例,展示了它的功能和性能。我们的轻量级模型 TinyTrackerS 仅有 169 KB 大小,使用 85.8k 参数,在 Spresense 平台上运行于 3 FPS。

Efficient Transfer Learning in Diffusion Models via Adversarial Noise

  • paper_url: http://arxiv.org/abs/2308.11948
  • repo_url: None
  • paper_authors: Xiyu Wang, Baijiong Lin, Daochang Liu, Chang Xu
  • for: To address the limited-data problem in image generation tasks.
  • methods: A DPM-based transfer learning method, TAN, with two strategies: similarity-guided training, which boosts transfer with a classifier, and adversarial noise selection, which adaptively chooses targeted noise based on the input image.
  • results: Extensive experiments on few-shot image generation tasks show that the method is efficient and outperforms existing GAN-based and DDPM-based methods in image quality and diversity.
    Abstract Diffusion Probabilistic Models (DPMs) have demonstrated substantial promise in image generation tasks but heavily rely on the availability of large amounts of training data. Previous works, like GANs, have tackled the limited data problem by transferring pre-trained models learned with sufficient data. However, those methods are hard to be utilized in DPMs since the distinct differences between DPM-based and GAN-based methods, showing in the unique iterative denoising process integral and the need for many timesteps with no-targeted noise in DPMs. In this paper, we propose a novel DPMs-based transfer learning method, TAN, to address the limited data problem. It includes two strategies: similarity-guided training, which boosts transfer with a classifier, and adversarial noise selection which adaptive chooses targeted noise based on the input image. Extensive experiments in the context of few-shot image generation tasks demonstrate that our method is not only efficient but also excels in terms of image quality and diversity when compared to existing GAN-based and DDPM-based methods.
    摘要 各种扩散概率模型(DPM)在图像生成任务中表现出了重要的承袭潜力,但它们受到充足的训练数据的限制。先前的工作,如GANs,通过将预训练的模型转移到具有足够数据的环境中来解决这个问题。然而,这些方法在DPM中很难实现,因为DPM和GAN之间存在重要的差异,即DPM中的迭代净化过程的独特特性和需要许多步骤和无目标噪声。在这篇论文中,我们提出了一种基于DPM的转移学习方法,称为TAN,以解决有限数据问题。该方法包括两个策略:相似性引导的训练和对输入图像进行适应性选择噪声。我们在少量图像生成任务中进行了广泛的实验,并证明了我们的方法不仅高效,还能够在图像质量和多样性方面超越现有的GAN基于和DPM基于的方法。

Boosting Diffusion Models with an Adaptive Momentum Sampler

  • paper_url: http://arxiv.org/abs/2308.11941
  • repo_url: None
  • paper_authors: Xiyu Wang, Anh-Dung Dinh, Daochang Liu, Chang Xu
  • for: This paper aims to improve the sampling process in Diffusion Probabilistic Models (DPMs) to generate high-quality images.
  • methods: The proposed method is a novel reverse sampler for DPMs, inspired by the Adam optimizer, which uses momentum mechanisms and adaptive updating to smooth the reverse sampling process and ensure stable generation.
  • results: The proposed reverse sampler achieves remarkable improvements over different baselines, yielding enhanced quality outputs. Here's the full summary in Simplified Chinese:
  • for: 本文目的是提高Diffusion Probabilistic Models (DPMs)中的抽样过程,以生成高质量的图像。
  • methods: 提议的方法是一种基于Adam优化器的reverse抽样器,利用势量机制和自适应更新来缓和反抽样过程,确保稳定的生成。
  • results: 提议的reverse抽样器在多个 benchmark 上得到了显著的改善,生成的输出质量得到了提升。
    Abstract Diffusion probabilistic models (DPMs) have been shown to generate high-quality images without the need for delicate adversarial training. However, the current sampling process in DPMs is prone to violent shaking. In this paper, we present a novel reverse sampler for DPMs inspired by the widely-used Adam optimizer. Our proposed sampler can be readily applied to a pre-trained diffusion model, utilizing momentum mechanisms and adaptive updating to smooth the reverse sampling process and ensure stable generation, resulting in outputs of enhanced quality. By implicitly reusing update directions from early steps, our proposed sampler achieves a better balance between high-level semantics and low-level details. Additionally, this sampler is flexible and can be easily integrated into pre-trained DPMs regardless of the sampler used during training. Our experimental results on multiple benchmarks demonstrate that our proposed reverse sampler yields remarkable improvements over different baselines. We will make the source code available.
    摘要 扩散概率模型(DPM)已被证明能够在无需精细对抗训练的情况下生成高质量图像。然而,DPM当前的采样过程容易出现剧烈震荡。在这篇论文中,我们受到广泛使用的 Adam 优化器的启发,提出了一种新的 DPM 反向采样器。该采样器可以直接应用于预训练的扩散模型,利用动量机制和自适应更新来平滑反向采样过程并保证生成的稳定性,从而得到质量更高的输出。通过隐式地复用早期步骤的更新方向,我们的采样器在高层语义和低层细节之间取得了更好的平衡。此外,这种采样器十分灵活,无论训练时使用何种采样器,都可以轻松地集成到预训练的 DPM 中。在多个基准上的实验结果表明,我们提出的反向采样器相对不同的基线取得了显著提升。我们将公开源代码。
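
To make the Adam-inspired idea concrete, the sketch below wraps any pretrained reverse step (DDPM/DDIM-style) with a bias-corrected momentum average of the per-step update direction. The interface and the single-moment simplification are assumptions for illustration, not the paper's exact sampler; an Adam-like second moment could additionally rescale the step.

```python
import torch

@torch.no_grad()
def momentum_reverse_sample(model, x_T, timesteps, step_fn, beta=0.9):
    """Reverse sampling where each raw update d_t = step(x_t) - x_t is replaced by
    a bias-corrected running average of past updates (momentum smoothing sketch).

    step_fn(model, x, t) should return the next state proposed by any pretrained
    sampler (e.g. a DDPM or DDIM step); timesteps runs from T down to 1.
    """
    x, m = x_T, torch.zeros_like(x_T)
    for i, t in enumerate(timesteps, start=1):
        d = step_fn(model, x, t) - x      # raw update direction from the base sampler
        m = beta * m + (1 - beta) * d     # implicitly reuse directions from early steps
        x = x + m / (1 - beta ** i)       # bias-corrected smoothed update
    return x
```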

Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.11932
  • repo_url: None
  • paper_authors: Dehuan Zhang, Jingchun Zhou, Weishi Zhang, ChunLe Guo, Chongyi Li
  • for: 本文提出了一种synergistic multiscale detail refinement via intrinsic supervision(SMDR-IS)方法,用于恢复水下场景图像。
  • methods: 该方法包括一个low-degradation stage和多个高层次降低stage,以及一个适应性选择性内在监督特征模块(ASISF)。ASISF使用内在监督来精准地控制和引导特征传输在多个降低stage中。
  • results: 与 state-of-the-art 方法相比,SMDR-IS 表现出色。
    Abstract Visual restoration of underwater scenes is crucial for visual tasks, and avoiding interference from underwater media has become a prominent concern. In this work, we present a synergistic multiscale detail refinement via intrinsic supervision (SMDR-IS) to recover underwater scene details. The low-degradation stage provides multiscale detail for original stage, which achieves synergistic multiscale detail refinement through feature propagation via the adaptive selective intrinsic supervised feature module (ASISF), which achieves synergistic multiscale detail refinement. ASISF is developed using intrinsic supervision to precisely control and guide feature transmission in the multi-degradation stages. ASISF improves the multiscale detail refinement while reducing interference from irrelevant scene information from the low-degradation stage. Additionally, within the multi-degradation encoder-decoder of SMDR-IS, we introduce a bifocal intrinsic-context attention module (BICA). This module is designed to effectively leverage multi-scale scene information found in images, using intrinsic supervision principles as its foundation. BICA facilitates the guidance of higher-resolution spaces by leveraging lower-resolution spaces, considering the significant dependency of underwater image restoration on spatial contextual relationships. During the training process, the network gains advantages from the integration of a multi-degradation loss function. This function serves as a constraint, enabling the network to effectively exploit information across various scales. When compared with state-of-the-art methods, SMDR-IS demonstrates its outstanding performance. Code will be made publicly available.
    摘要 水下场景的视觉恢复对各类视觉任务至关重要,避免水下介质带来的干扰已成为一个突出问题。在这项工作中,我们提出了一种基于内在监督的协同多尺度细节精炼方法(SMDR-IS),用于恢复水下场景细节。低退化阶段为原始阶段提供多尺度细节,并通过自适应选择性内在监督特征模块(ASISF)的特征传播实现协同多尺度细节精炼。ASISF基于内在监督,精准地控制和引导多退化阶段中的特征传输,在提升多尺度细节精炼的同时,减少来自低退化阶段的无关场景信息的干扰。此外,在SMDR-IS的多退化编码器-解码器中,我们引入了一种双焦点内在-上下文注意力模块(BICA)。该模块以内在监督原则为基础,借助低分辨率空间来引导高分辨率空间,以充分利用图像中的多尺度场景信息,因为水下图像恢复在很大程度上依赖空间上下文关系。在训练过程中,网络还受益于多退化损失函数,该函数作为约束使网络能够有效利用多个尺度的信息。与最先进方法相比,SMDR-IS表现出色。代码将公开。

OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes

  • paper_url: http://arxiv.org/abs/2308.11928
  • repo_url: https://github.com/mooncake199809/ufvl-net
  • paper_authors: Tao Xie, Kun Dai, Siyi Lu, Ke Wang, Zhiqiang Jiang, Jinghan Gao, Dedong Liu, Jie Xu, Lijun Zhao, Ruifeng Li
  • for: 本文的目的是以多任务学习的方式预测跨场景的相机位姿。
  • methods: 本文提出了 OFVL-MS 框架,一种能够高效存储且精确进行视觉定位的统一框架,通过带可学习得分的层级自适应共享策略和梯度归一化来解决多场景联合学习中的梯度冲突问题。
  • results: 在多个 benchmark 和新发布的室内 dataset LIVL 上,OFVL-MS 系列模型以更少的参数显著超越了现有的最先进方法,并且只需很少的参数即可泛化到新场景并获得更优的定位性能。
    Abstract In this work, we seek to predict camera poses across scenes with a multi-task learning manner, where we view the localization of each scene as a new task. We propose OFVL-MS, a unified framework that dispenses with the traditional practice of training a model for each individual scene and relieves gradient conflict induced by optimizing multiple scenes collectively, enabling efficient storage yet precise visual localization for all scenes. Technically, in the forward pass of OFVL-MS, we design a layer-adaptive sharing policy with a learnable score for each layer to automatically determine whether the layer is shared or not. Such sharing policy empowers us to acquire task-shared parameters for a reduction of storage cost and task-specific parameters for learning scene-related features to alleviate gradient conflict. In the backward pass of OFVL-MS, we introduce a gradient normalization algorithm that homogenizes the gradient magnitude of the task-shared parameters so that all tasks converge at the same pace. Furthermore, a sparse penalty loss is applied on the learnable scores to facilitate parameter sharing for all tasks without performance degradation. We conduct comprehensive experiments on multiple benchmarks and our new released indoor dataset LIVL, showing that OFVL-MS families significantly outperform the state-of-the-arts with fewer parameters. We also verify that OFVL-MS can generalize to a new scene with much few parameters while gaining superior localization performance.
    摘要 在这项工作中,我们尝试以多任务学习的方式预测跨场景的相机位姿,将每个场景的定位视为一个新任务。我们提出了 OFVL-MS 框架,它摒弃了传统的为每个场景单独训练模型的做法,并缓解了多场景联合优化所引起的梯度冲突,从而在高效存储的同时为所有场景实现精确的视觉定位。技术上,在 OFVL-MS 的前向传播中,我们设计了带可学习得分的层级自适应共享策略,自动决定每一层是否共享。这种共享策略使我们既能获得任务共享参数以降低存储成本,又能获得任务特定参数来学习场景相关特征,以缓解梯度冲突。在 OFVL-MS 的反向传播中,我们引入了梯度归一化算法,使任务共享参数的梯度幅值趋于一致,从而让所有任务以相同的速度收敛。此外,我们还对可学习得分施加稀疏惩罚损失,在不降低性能的前提下促进所有任务的参数共享。我们在多个 benchmark 和我们新发布的室内数据集 LIVL 上进行了广泛的实验,结果显示 OFVL-MS 系列模型以更少的参数显著超越了最先进方法。我们还验证了 OFVL-MS 只需很少的参数即可泛化到新场景,同时获得更优的定位性能。
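
A minimal sketch of the layer-adaptive sharing idea: each layer keeps a task-shared branch and task-specific branches, mixed by a learnable per-task score that a sparsity penalty pushes toward sharing. The soft sigmoid mixing and the linear layers are illustrative assumptions; the paper's actual sharing policy may differ.

```python
import torch
import torch.nn as nn

class LayerAdaptiveShared(nn.Module):
    """One layer whose output mixes a task-shared branch and a task-specific
    branch through a learnable per-task score (illustrative sketch)."""
    def __init__(self, dim, num_tasks):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.specific = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_tasks))
        self.score = nn.Parameter(torch.zeros(num_tasks))   # learnable sharing scores

    def forward(self, x, task_id):
        g = torch.sigmoid(self.score[task_id])              # g -> 1 means "use shared weights"
        return g * self.shared(x) + (1 - g) * self.specific[task_id](x)

    def sparsity_penalty(self):
        # Pushes scores toward sharing, penalizing task-specific capacity.
        return (1 - torch.sigmoid(self.score)).abs().sum()
```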

Recovering a Molecule’s 3D Dynamics from Liquid-phase Electron Microscopy Movies

  • paper_url: http://arxiv.org/abs/2308.11927
  • repo_url: None
  • paper_authors: Enze Ye, Yuhang Wang, Hong Zhang, Yiqin Gao, Huan Wang, He Sun
  • for: 这研究旨在利用流体电子镜像技术(liquid-phase EM)观察生物分子的动态变化。
  • methods: 这研究使用了一种新的Temporal Electron MicroscoPy Object Reconstruction算法(TEMPOR),其 combining implicit neural representation(INR)和动态variational auto-encoder(DVAE)来回归时间序列中的分子结构。
  • results: 研究表明,TEMPOR算法可以从流体电子镜像电影中回归不同的动态变化,并且是首次直接从流体电子镜像电影中回归3D结构。这提供了一种有前途的新方法,用于生物分子结构生物学中研究分子的3D动态。
    Abstract The dynamics of biomolecules are crucial for our understanding of their functioning in living systems. However, current 3D imaging techniques, such as cryogenic electron microscopy (cryo-EM), require freezing the sample, which limits the observation of their conformational changes in real time. The innovative liquid-phase electron microscopy (liquid-phase EM) technique allows molecules to be placed in the native liquid environment, providing a unique opportunity to observe their dynamics. In this paper, we propose TEMPOR, a Temporal Electron MicroscoPy Object Reconstruction algorithm for liquid-phase EM that leverages an implicit neural representation (INR) and a dynamical variational auto-encoder (DVAE) to recover time series of molecular structures. We demonstrate its advantages in recovering different motion dynamics from two simulated datasets, 7bcq and Cas9. To our knowledge, our work is the first attempt to directly recover 3D structures of a temporally-varying particle from liquid-phase EM movies. It provides a promising new approach for studying molecules' 3D dynamics in structural biology.
    摘要 生物分子的动力学是我们理解它们在生物系统中功能的关键。然而,现有的3D成像技术,如低温电子显微镜(cryo-EM),需要将样本冻结,这限制了对分子构象变化的实时观察。新的液相电子显微镜(液相EM)技术使分子能够置于原生液体环境中,提供了观察其动态的独特机会。在这篇论文中,我们提出了TEMPOR,一种基于隐式神经表示(INR)和动态变分自编码器(DVAE)的液相EM时序对象重建算法,用于恢复分子结构的时间序列。我们在两个模拟数据集(7bcq和Cas9)上展示了它在恢复不同运动动态方面的优势。据我们所知,我们的工作是首次尝试直接从液相EM影像中恢复随时间变化粒子的3D结构,为结构生物学中研究分子的3D动力学提供了一种有前景的新方法。

MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild

  • paper_url: http://arxiv.org/abs/2308.12817
  • repo_url: None
  • paper_authors: Yu-Xiang Zeng, Jun-Wei Hsieh, Xin Li, Ming-Ching Chang
  • for: 检测自然场景中的小型文字,尤其是在光照不理想和位置不规则的情况下,实现较高的准确率和效率。
  • methods: 混合 CNN 和 Transformer 架构,包括 Feature Shuffle Network (FSNet) 和 Central Transformer Block (CTBlock),可以实现高精度的小文字检测。
  • results: 在多个场景文字检测 dataset 上,MixNet 已经取得了 state-of-the-art 的结果,并且在光照不理想和位置不规则的情况下仍保持较高的准确率和效率。
    Abstract Detecting small scene text instances in the wild is particularly challenging, where the influence of irregular positions and nonideal lighting often leads to detection errors. We present MixNet, a hybrid architecture that combines the strengths of CNNs and Transformers, capable of accurately detecting small text from challenging natural scenes, regardless of the orientations, styles, and lighting conditions. MixNet incorporates two key modules: (1) the Feature Shuffle Network (FSNet) to serve as the backbone and (2) the Central Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene text. We first introduce a novel feature shuffling strategy in FSNet to facilitate the exchange of features across multiple scales, generating high-resolution features superior to popular ResNet and HRNet. The FSNet backbone has achieved significant improvements over many existing text detection methods, including PAN, DB, and FAST. Then we design a complementary CTBlock to leverage center line based features similar to the medial axis of text regions and show that it can outperform contour-based approaches in challenging cases when small scene texts appear closely. Extensive experimental results show that MixNet, which mixes FSNet with CTBlock, achieves state-of-the-art results on multiple scene text detection datasets.
    摘要 通过推出 MixNet 混合体系,我们可以准确地检测自然场景中小文本实例,无论文本方向、风格或照明条件如何。 MixNet 包含两个关键模块:(1)特点混淆网络(FSNet)作为基础,以及(2)中心转换块(CTBlock)来利用场景文本的1D manifold约束。我们首先介绍了一种新的特点混淆策略,以便在多个缩放级别之间互换特点,生成高分辨率的特点,超过了流行的 ResNet 和 HRNet。FSNet 后置网络已经超越了许多现有的文本检测方法,包括 PAN、DB 和 FAST。然后,我们设计了一种补充的 CTBlock,以利用文本区域的中心线基本特征,并示出它可以在挑战性较高的情况下,当小场景文本相互靠近时,超过边框基本方法。广泛的实验结果表明,将 MixNet 混合体系与多个场景文本检测数据集进行比较,可以获得最佳结果。

AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection

  • paper_url: http://arxiv.org/abs/2308.11918
  • repo_url: None
  • paper_authors: Jingchun Zhou, Zongxin He, Kin-Man Lam, Yudong Wang, Weishi Zhang, ChunLe Guo, Chongyi Li
  • for: 本研究提出了一种新的干扰噪谱扩展Vortex Convolutional Network(AMSP-UOD),用于水下物体检测。AMSP-UOD特点是对水下环境中物体检测精度的影响进行了优化。
  • methods: 我们提出了AMSP Vortex Convolution(AMSP-VConv)来破坏噪声分布,提高特征提取能力,减少参数,提高网络的可靠性。此外,我们设计了Feature Association Decoupling Cross Stage Partial(FAD-CSP)模块,增强了长和短距离特征之间的强相关性,提高网络在复杂水下环境中的性能。
  • results: 我们在URPC和RUOD数据集上进行了广泛的实验,结果显示,我们的方法在精度和抗噪声能力方面优于现有的最先进方法。AMSP-UOD提供了一种具有实际应用潜力的创新解决方案。代码将公开发布。
    Abstract In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long and short-range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on non-maximum suppression with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Code will be made publicly available.
    摘要 在这篇论文中,我们提出了一种新的振荡干扰随机激活网络(AMSP-UOD),用于水下物体检测。AMSP-UOD特点是解决水下环境中物体检测精度下降的非理想捕集因素的影响。为了减少噪声对物体检测性能的影响,我们提议AMSP激活 Vortex Convolution(AMSP-VConv),以扰乱噪声分布,提高特征提取能力,降低参数,提高网络的可靠性。我们设计了Feature Association Decoupling Cross Stage Partial(FAD-CSP)模块,强化长和短距离特征之间的关联,提高网络在复杂水下环境中的性能。此外,我们提出了一种复杂的后处理方法,基于非最大值抑制器和方向相似度阈值,以优化检测在紧凑场景中,如水蕴和鱼群,提高物体检测精度。广泛的实验表明,我们的方法在URPC和RUOD数据集上比现有状态的方法更高的准确率和噪声抗性。AMSP-UOD提出了一种创新的解决方案,具有实际应用前景。代码将公开发布。
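
The post-processing described (non-maximum suppression with aspect-ratio similarity thresholds) can be pictured as standard NMS that only suppresses a box when it both overlaps a kept box and has a similar aspect ratio, so densely packed but differently shaped objects survive. The thresholds and the min/max ratio similarity measure below are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np

def _iou(a, b):
    """IoU between one box a=[x1,y1,x2,y2] and an array of boxes b of shape (N,4)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def _aspect_ratio(b):
    return (b[..., 2] - b[..., 0]) / (b[..., 3] - b[..., 1] + 1e-9)

def nms_aspect_ratio(boxes, scores, iou_thr=0.5, ar_thr=0.8):
    """Suppress a box only if it overlaps a kept box AND has a similar aspect ratio."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores, dtype=float))[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ov = _iou(boxes[i], boxes[rest])
        ar_i, ar_r = _aspect_ratio(boxes[i]), _aspect_ratio(boxes[rest])
        ar_sim = np.minimum(ar_i, ar_r) / (np.maximum(ar_i, ar_r) + 1e-9)
        order = rest[~((ov > iou_thr) & (ar_sim > ar_thr))]
    return keep
```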

Semantic-Aware Implicit Template Learning via Part Deformation Consistency

  • paper_url: http://arxiv.org/abs/2308.11916
  • repo_url: None
  • paper_authors: Sihyeon Kim, Minseok Joo, Jaewon Lee, Juyeon Ko, Juhan Cha, Hyunwoo J. Kim
  • for: 本研究旨在提高无监督形态匹配中的模板学习,以便在不同物体形状下实现semantically plausible的变换。
  • methods: 本文提出了一种semantic-aware implicit template学习框架,通过自动学习的Semantic feature extractor来提供semantic prior,并通过本地conditioning和新的semantic-aware减杂码来实现semantically plausible的变换。
  • results: 对baseline方法进行了广泛的实验,并证明了提议的方法在不同任务中(包括键点传输、部分标签传输和 Texture传输)具有更高的性能。此外,我们还提供了质量分析,以验证semantic-aware减杂码的效果。代码可以在https://github.com/mlvlab/PDC上下载。
    Abstract Learning implicit templates as neural fields has recently shown impressive performance in unsupervised shape correspondence. Despite the success, we observe current approaches, which solely rely on geometric information, often learn suboptimal deformation across generic object shapes, which have high structural variability. In this paper, we highlight the importance of part deformation consistency and propose a semantic-aware implicit template learning framework to enable semantically plausible deformation. By leveraging semantic prior from a self-supervised feature extractor, we suggest local conditioning with novel semantic-aware deformation code and deformation consistency regularizations regarding part deformation, global deformation, and global scaling. Our extensive experiments demonstrate the superiority of the proposed method over baselines in various tasks: keypoint transfer, part label transfer, and texture transfer. More interestingly, our framework shows a larger performance gain under more challenging settings. We also provide qualitative analyses to validate the effectiveness of semantic-aware deformation. The code is available at https://github.com/mlvlab/PDC.
    摘要 学习隐式模板作为神经场景,最近显示了无监督形态匹配的卓越表现。尽管成功,我们发现当前的方法,即solely rely on geometric information,经常学习低效的形态变换 across 通用物体形态,这些形态具有高度结构变动。在这篇论文中,我们强调部分形态一致性的重要性,并提出了semantic-aware implicit template learning框架,以启用semantically plausible的形态变换。通过利用自动学习的semantic prior,我们建议了本地条件化和新的semantic-aware deformation code,以及deformation consistency regularization regarding part deformation、global deformation和global scaling。我们的广泛实验表明我们的提案方法比基eline在各种任务中表现出色:键点传输、部件标签传输和 Texture传输。更有趣的是,我们的框架在更加复杂的设置下表现出更大的性能提升。我们还提供了质量分析,以验证semantic-aware deformation的效果。代码可以在https://github.com/mlvlab/PDC上获取。

ACLS: Adaptive and Conditional Label Smoothing for Network Calibration

  • paper_url: http://arxiv.org/abs/2308.11911
  • repo_url: None
  • paper_authors: Hyekang Park, Jongyoun Noh, Youngmin Oh, Donghyeon Baek, Bumsub Ham
  • for: 本研究旨在解决深度神经网络的校准(calibration)问题,即调整其失准的置信度。
  • methods: 该研究使用了现有的正则化基于方法,并进行了深入分析,以便更好地理解这些方法对神经网络调整的影响。
  • results: 研究人员通过实验证明,新引入的损失函数ACLS可以兼顾现有正则化方法的优点,并避免其缺点。这种损失函数在图像分类和 semantic segmentation 中具有广泛的应用前景。
    Abstract We address the problem of network calibration adjusting miscalibrated confidences of deep neural networks. Many approaches to network calibration adopt a regularization-based method that exploits a regularization term to smooth the miscalibrated confidences. Although these approaches have shown the effectiveness on calibrating the networks, there is still a lack of understanding on the underlying principles of regularization in terms of network calibration. We present in this paper an in-depth analysis of existing regularization-based methods, providing a better understanding on how they affect to network calibration. Specifically, we have observed that 1) the regularization-based methods can be interpreted as variants of label smoothing, and 2) they do not always behave desirably. Based on the analysis, we introduce a novel loss function, dubbed ACLS, that unifies the merits of existing regularization methods, while avoiding the limitations. We show extensive experimental results for image classification and semantic segmentation on standard benchmarks, including CIFAR10, Tiny-ImageNet, ImageNet, and PASCAL VOC, demonstrating the effectiveness of our loss function.
    摘要 我们考虑了深度神经网络的准确率调整问题,有许多使用常规化方法来调整神经网络的方法。虽然这些方法有效地调整神经网络,但是还没有很好地理解这些常规化方法在神经网络调整中的下面原理。在这篇论文中,我们提供了对现有常规化方法的深入分析,从而更好地理解它们如何影响神经网络调整。 Specifically, we have observed that 1) 常规化方法可以被视为变种的标签平滑,和 2) 它们不总是愿望的。基于分析,我们提出了一种新的损失函数,名为 ACLS,它结合了现有常规化方法的优点,而避免了其限制。我们在标准的benchmark上,包括CIFAR10、Tiny-ImageNet、ImageNet和PASCAL VOC,进行了广泛的实验,并证明了我们的损失函数的效果。
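
To make the connection to label smoothing concrete, the sketch below shows a calibration-style loss in which the smoothing strength adapts per sample (here, growing with the model's confidence). The adaptivity rule is an illustrative assumption and not the ACLS formulation itself.

```python
import torch
import torch.nn.functional as F

def adaptive_label_smoothing_loss(logits, targets, base_eps=0.1):
    """Cross-entropy against smoothed targets, where the smoothing strength grows
    with the model's confidence on each sample (illustrative adaptivity rule)."""
    with torch.no_grad():
        conf = logits.softmax(dim=1).max(dim=1).values   # per-sample confidence
        eps = base_eps * conf                            # smooth more when over-confident
    log_probs = F.log_softmax(logits, dim=1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # standard CE term
    smooth = -log_probs.mean(dim=1)                               # uniform-target term
    # Label smoothing decomposition: (1 - eps) * CE + eps * uniform term.
    return ((1 - eps) * nll + eps * smooth).mean()
```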

Edge-aware Hard Clustering Graph Pooling for Brain Imaging Data

  • paper_url: http://arxiv.org/abs/2308.11909
  • repo_url: None
  • paper_authors: Cheng Zhu, Jiayi Zhu, Lijuan Zhang, Xi Wu, Shuqi Yang, Ping Liang, Honghan Chen, Ying Tan
  • for: This paper aims to develop a deep learning method for probing different types of abnormal functional brain networks from a data-driven perspective.
  • methods: The proposed method, called Edge-aware hard clustering graph pooling (EHCPool), uses a clustering graph pooling method that supports multidimensional edge features and assesses node feature significance based on edge features. It also uses a novel Iteration n-top strategy to adaptively learn sparse hard clustering assignments for graphs, and an innovative N-E Aggregation strategy to aggregate node and edge feature information in each independent subgraph.
  • results: The proposed model was evaluated on multi-site brain imaging public datasets and yielded state-of-the-art performance.
    Abstract Graph Convolutional Networks (GCNs) can capture non-Euclidean spatial dependence between different brain regions, and the graph pooling operator in GCNs is key to enhancing the representation learning capability and acquiring abnormal brain maps. However, the majority of existing research designs graph pooling operators only from the perspective of nodes while disregarding the original edge features, in a way that not only confines graph pooling application scenarios, but also diminishes its ability to capture critical substructures. In this study, a clustering graph pooling method that first supports multidimensional edge features, called Edge-aware hard clustering graph pooling (EHCPool), is developed. EHCPool proposes the first 'Edge-to-node' score evaluation criterion based on edge features to assess node feature significance. To more effectively capture the critical subgraphs, a novel Iteration n-top strategy is further designed to adaptively learn sparse hard clustering assignments for graphs. Subsequently, an innovative N-E Aggregation strategy is presented to aggregate node and edge feature information in each independent subgraph. The proposed model was evaluated on multi-site brain imaging public datasets and yielded state-of-the-art performance. We believe this method is the first deep learning tool with the potential to probe different types of abnormal functional brain networks from data-driven perspective.
    摘要 图卷积网络(GCNs)可以捕捉不同脑区之间的非欧几何空间相互关系,而图 pooling 运算是 GCNs 中提升表示学习能力和获得异常脑图的关键。然而,现有大多数研究只从节点的角度设计图 pooling 操作符,而忽视原始边特征,这不仅限制了图 pooling 的应用场景,而且减少了其捕捉关键子结构的能力。在这项研究中,我们开发了一种聚类图 pooling 方法,称为 Edge-aware hard clustering graph pooling(EHCPool)。EHCPool 首先支持多维边特征,并提出了基于边特征的 'Edge-to-node' 分数评价标准来评估节点特征重要性。为更好地捕捉关键子图,我们还设计了一种新的 Iteration n-top 策略,以自适应地学习图的稀疏硬聚类分配。随后,我们提出了一种 N-E Aggregation 策略,用于在每个独立子图中聚合节点和边特征信息。我们的模型在多中心脑成像公共数据集上进行了评估,并取得了最先进的性能。我们认为这是第一个有潜力从数据驱动角度探索不同类型异常功能脑网络的深度学习工具。

Rethinking Data Perturbation and Model Stabilization for Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.11903
  • repo_url: https://github.com/zhenzhao/dpms
  • paper_authors: Zhen Zhao, Ye Liu, Meng Zhao, Di Yin, Yixuan Yuan, Luping Zhou
  • for: 这篇论文目的是提高 semi-supervised medical image segmentation (SSMIS) 的性能。
  • methods: 本文提出了一个简单 yet effective 的方法,名为 DPMS,以生成大量适当的预测不同,以提高 SSMIS 的性能。DPMS 使用了教师-学生架构,并运用了标准的监督损失和无监督一致损失。
  • results: DPMS 可以实现新的顶尖性能在公共 2D ACDC 和 3D LA 数据集上,在不同的半supervised 设定下。例如,DPMS 在 ACDC 上比前一代 SOTA 提高了22.62%。
    Abstract Studies on semi-supervised medical image segmentation (SSMIS) have seen fast progress recently. Due to the limited labelled data, SSMIS methods mainly focus on effectively leveraging unlabeled data to enhance the segmentation performance. However, despite their promising performance, current state-of-the-art methods often prioritize integrating complex techniques and loss terms rather than addressing the core challenges of semi-supervised scenarios directly. We argue that the key to SSMIS lies in generating substantial and appropriate prediction disagreement on unlabeled data. To this end, we emphasize the crutiality of data perturbation and model stabilization in semi-supervised segmentation, and propose a simple yet effective approach to boost SSMIS performance significantly, dubbed DPMS. Specifically, we first revisit SSMIS from three distinct perspectives: the data, the model, and the loss, and conduct a comprehensive study of corresponding strategies to examine their effectiveness. Based on these examinations, we then propose DPMS, which adopts a plain teacher-student framework with a standard supervised loss and unsupervised consistency loss. To produce appropriate prediction disagreements, DPMS perturbs the unlabeled data via strong augmentations to enlarge prediction disagreements considerably. On the other hand, using EMA teacher when strong augmentation is applied does not necessarily improve performance. DPMS further utilizes a forwarding-twice and momentum updating strategies for normalization statistics to stabilize the training on unlabeled data effectively. Despite its simplicity, DPMS can obtain new state-of-the-art performance on the public 2D ACDC and 3D LA datasets across various semi-supervised settings, e.g. obtaining a remarkable 22.62% improvement against previous SOTA on ACDC with 5% labels.
    摘要 研究 semi-supervised medical image segmentation (SSMIS) 在近期已经进展很快。由于有限的标签数据,SSMIS 方法主要是利用无标注数据来提高 segmentation 性能。然而,当前状态的艺术方法frequently强调混合复杂的技术和损失函数,而不是直接面临 semi-supervised 场景的核心挑战。我们认为,SSMIS 的关键在于生成足够和适当的预测差异在无标注数据上。为此,我们强调了数据抖动和模型稳定在 semi-supervised 分割中的重要性,并提出了一种简单 yet effective 的方法,称为 DPMS。 Specifically, we first revisit SSMIS from three distinct perspectives: the data, the model, and the loss, and conduct a comprehensive study of corresponding strategies to examine their effectiveness. Based on these examinations, we then propose DPMS, which adopts a plain teacher-student framework with a standard supervised loss and unsupervised consistency loss. To produce appropriate prediction disagreements, DPMS perturbs the unlabeled data via strong augmentations to enlarge prediction disagreements considerably. On the other hand, using EMA teacher when strong augmentation is applied does not necessarily improve performance. DPMS further utilizes a forwarding-twice and momentum updating strategies for normalization statistics to stabilize the training on unlabeled data effectively. Despite its simplicity, DPMS can obtain new state-of-the-art performance on the public 2D ACDC and 3D LA datasets across various semi-supervised settings, e.g. obtaining a remarkable 22.62% improvement against previous SOTA on ACDC with 5% labels.
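
A minimal sketch of the plain teacher-student recipe the paper describes: supervised loss on labeled data, pseudo-labels from an EMA teacher on a weakly perturbed view, a consistency loss on a strongly perturbed view, and an EMA update of the teacher. The specific augmentations and the cross-entropy consistency term are assumptions for illustration, not the exact DPMS losses.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, opt, x_lab, y_lab, x_unlab, weak_aug, strong_aug, ema_m=0.99):
    """One semi-supervised step: supervised CE + unsupervised consistency (sketch)."""
    # Supervised branch.
    sup_loss = F.cross_entropy(student(x_lab), y_lab)

    # Unsupervised branch: EMA teacher pseudo-labels on a weak view,
    # the student is trained to match them on a strongly perturbed view.
    with torch.no_grad():
        pseudo = teacher(weak_aug(x_unlab)).argmax(dim=1)
    cons_loss = F.cross_entropy(student(strong_aug(x_unlab)), pseudo)

    loss = sup_loss + cons_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA teacher update (model stabilization).
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_m).add_(ps, alpha=1 - ema_m)
    return loss.item()
```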

Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.11901
  • repo_url: None
  • paper_authors: Geon Lee, Sanghoon Lee, Dohyung Kim, Younghoon Shin, Yongsang Yoon, Bumsub Ham
  • for: 这个论文是为了解决行人重识别领域中的域适应问题,即将在源域上训练的模型转移到目标域上,而不需要目标域的标注数据。
  • methods: 这个论文提出了一种基于摄像头标签的curriculum学习框架,通过逐渐使用不同摄像头的数据子集来进行学习,从而将源域上的知识逐步传播到目标域上。
  • results: 实验结果表明,该方法可以在real-to-real和synthetic-to-real场景中实现高效的行人重识别,并且可以减少摄像头偏见问题。
    Abstract We present a novel unsupervised domain adaption method for person re-identification (reID) that generalizes a model trained on a labeled source domain to an unlabeled target domain. We introduce a camera-driven curriculum learning (CaCL) framework that leverages camera labels of person images to transfer knowledge from source to target domains progressively. To this end, we divide target domain dataset into multiple subsets based on the camera labels, and initially train our model with a single subset (i.e., images captured by a single camera). We then gradually exploit more subsets for training, according to a curriculum sequence obtained with a camera-driven scheduling rule. The scheduler considers maximum mean discrepancies (MMD) between each subset and the source domain dataset, such that the subset closer to the source domain is exploited earlier within the curriculum. For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way. We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs. To address the camera bias problem, we also introduce a camera-diversity (CD) loss encouraging person images of the same pseudo label, but captured across various cameras, to involve more for discriminative feature learning, providing person representations robust to inter-camera variations. Experimental results on standard benchmarks, including real-to-real and synthetic-to-real scenarios, demonstrate the effectiveness of our framework.
    摘要 我们提出了一种新的无监督域适应方法,用于行人重识别(reID),可以将在有标注源域上训练的模型泛化到无标注目标域。我们引入了一个摄像头驱动的课程学习(CaCL)框架,利用行人图像的摄像头标签,逐步地将知识从源域传递到目标域。为此,我们根据摄像头标签将目标域数据集分成多个子集,先用单个子集(即由单一摄像头拍摄的图像)训练模型,再按照摄像头驱动的调度规则给出的课程序列,逐渐引入更多子集参与训练。该调度器依据每个子集与源域数据集之间的最大平均差异(MMD)进行排序,越接近源域的子集越早被使用。在每个课程阶段,我们为目标域的行人图像生成伪标签,以有监督的方式训练reID模型。我们观察到伪标签对摄像头存在强烈偏向:来自同一摄像头的行人图像往往获得相同的伪标签,即使它们属于不同的ID。为了解决摄像头偏向问题,我们还引入了摄像头多样性(CD)损失,鼓励具有相同伪标签但来自不同摄像头的行人图像更多地参与判别性特征学习,从而获得对摄像头间变化鲁棒的行人表示。在标准benchmark(包括real-to-real和synthetic-to-real场景)上的实验结果证明了我们框架的有效性。
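
The camera-driven scheduling rule orders camera subsets by their distance to the source domain. A minimal sketch using a (biased) RBF-kernel MMD estimate over extracted features is shown below; the kernel bandwidth and the biased estimator are assumptions for illustration.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between feature sets x (n,d) and y (m,d) with an RBF kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def curriculum_order(source_feats, camera_subsets, sigma=1.0):
    """Return camera ids sorted by their MMD to the source features (closest first).

    camera_subsets: dict mapping camera id -> (n_i, d) feature tensor.
    """
    gaps = {cam: rbf_mmd2(f, source_feats, sigma).item() for cam, f in camera_subsets.items()}
    return sorted(gaps, key=gaps.get)
```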

HashReID: Dynamic Network with Binary Codes for Efficient Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.11900
  • repo_url: None
  • paper_authors: Kshitij Nikhal, Yujunrong Ma, Shuvra S. Bhattacharyya, Benjamin S. Riggan
  • for: 这个研究旨在提高行人重识别(ReID)系统的实际可用性,特别是在能源受限的设备上。
  • methods: 我们提出了一个入口适应网络,具有多个终端封顶,可以在搜寻是简单或是噪声时中止计算,从而大幅降低计算量。我们还引入了一个基于时间的分类器,并运用了一个新的训练策略。此外,我们还采用了一个二进制对称码生成方法,从而大幅改善搜寻过程。
  • results: 在Market1501数据集上,超过70%具有紧凑哈希码的样本可以提前退出,使网络的计算成本降低80%,并比其他基于哈希的方法提升60%。这些结果表明我们的方法带来了显著改进,同时保持与传统ReID方法相当的精度。
    Abstract Biometric applications, such as person re-identification (ReID), are often deployed on energy constrained devices. While recent ReID methods prioritize high retrieval performance, they often come with large computational costs and high search time, rendering them less practical in real-world settings. In this work, we propose an input-adaptive network with multiple exit blocks, that can terminate computation early if the retrieval is straightforward or noisy, saving a lot of computation. To assess the complexity of the input, we introduce a temporal-based classifier driven by a new training strategy. Furthermore, we adopt a binary hash code generation approach instead of relying on continuous-valued features, which significantly improves the search process by a factor of 20. To ensure similarity preservation, we utilize a new ranking regularizer that bridges the gap between continuous and binary features. Extensive analysis of our proposed method is conducted on three datasets: Market1501, MSMT17 (Multi-Scene Multi-Time), and the BGC1 (BRIAR Government Collection). Using our approach, more than 70% of the samples with compact hash codes exit early on the Market1501 dataset, saving 80% of the networks computational cost and improving over other hash-based methods by 60%. These results demonstrate a significant improvement over dynamic networks and showcase comparable accuracy performance to conventional ReID methods. Code will be made available.
    摘要 行人重识别(ReID)等生物特征识别应用经常部署在能源受限的设备上。虽然近期的ReID方法优先追求高检索性能,但往往伴随着较高的计算成本和搜索时间,使其在实际场景中不够实用。在这项工作中,我们提出了一个带有多个退出分支的输入自适应网络,当检索任务简单或输入噪声较大时可以提前终止计算,从而节省大量计算。为了评估输入的复杂度,我们引入了一个由新训练策略驱动的基于时间的分类器。此外,我们还采用了二进制哈希码生成方法而不依赖连续值特征,这使搜索过程提速约20倍。为了保持相似性,我们利用了一个新的排序正则化项来弥合连续特征与二进制特征之间的差异。我们在 Market1501、MSMT17(多场景多时间)和 BGC1(BRIAR政府收集)三个数据集上进行了广泛的分析。使用我们的方法,Market1501上超过70%具有紧凑哈希码的样本可以提前退出,使网络计算成本降低80%,并比其他基于哈希的方法提升60%。这些结果表明我们的方法带来了显著改进,同时保持与传统ReID方法相当的准确率。我们将公开代码。
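
Why binary codes speed up the gallery search: real-valued embeddings are binarized, packed into bytes, and ranked by Hamming distance computed with XOR plus a popcount table. The sketch below uses random 512-d features purely for illustration; dimensions and thresholding at zero are assumptions.

```python
import numpy as np

def to_hash(features):
    """Binarize real-valued features (N, D) into packed uint8 codes (N, D/8)."""
    bits = (features > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_rank(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query (smaller = more similar)."""
    xor = np.bitwise_xor(gallery_codes, query_code)   # (N, D/8)
    dist = _POPCOUNT[xor].sum(axis=1)                 # popcount per gallery item
    return np.argsort(dist)

# Tiny usage example with random 512-d features (illustrative only).
gallery = to_hash(np.random.randn(1000, 512))
query = to_hash(np.random.randn(1, 512))[0]
print(hamming_rank(query, gallery)[:5])
```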

Age Prediction From Face Images Via Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.11896
  • repo_url: None
  • paper_authors: Yeongnam Chae, Poulami Raha, Mijung Kim, Bjorn Stenger
  • for: accurately estimating age from face images
  • methods: contrastive learning to extract age-related features, combining cosine similarity and triplet margin losses to suppress identity-related features
  • results: achieved state-of-the-art performance on two public datasets, FG-NET and MORPH-II
    Abstract This paper presents a novel approach for accurately estimating age from face images, which overcomes the challenge of collecting a large dataset of individuals with the same identity at different ages. Instead, we leverage readily available face datasets of different people at different ages and aim to extract age-related features using contrastive learning. Our method emphasizes these relevant features while suppressing identity-related features using a combination of cosine similarity and triplet margin losses. We demonstrate the effectiveness of our proposed approach by achieving state-of-the-art performance on two public datasets, FG-NET and MORPH-II.
    摘要 这篇论文提出了一种从人脸图像中准确估计年龄的新方法,克服了难以收集同一身份在不同年龄下的大规模数据的挑战。我们转而利用现成的、由不同人在不同年龄拍摄的人脸数据集,并使用对比学习提取年龄相关特征。我们的方法通过结合余弦相似度损失与三元组间隔(triplet margin)损失,强调年龄相关特征,同时抑制身份相关特征。我们在FG-NET和MORPH-II两个公共数据集上进行了实验,达到了最先进的性能。
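
A minimal sketch of combining the two losses named above on embedding triplets: a cosine-similarity term pulling age-matched pairs together and a triplet margin term pushing mismatched samples away. The weighting, margin, and triplet mining strategy are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def age_contrastive_loss(anchor, positive, negative, w_cos=1.0, w_tri=1.0, margin=0.3):
    """Combine a cosine-similarity pull between age-matched pairs with a triplet
    margin push against mismatched samples (illustrative sketch).

    anchor/positive/negative: (B, D) embedding batches.
    """
    cos_term = (1 - F.cosine_similarity(anchor, positive, dim=1)).mean()
    tri_term = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return w_cos * cos_term + w_tri * tri_term
```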

Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack

  • paper_url: http://arxiv.org/abs/2308.11894
  • repo_url: None
  • paper_authors: Ningfei Wang, Yunpeng Luo, Takami Sato, Kaidi Xu, Qi Alfred Chen
  • for: 本研究旨在探讨physical adversarial object evasion攻击在自动驾驶(AD)中的安全性问题,特别是对于AD系统的全面性和上下文。
  • methods: 我们使用现有的AD系统和Physical Adversarial Attack(PAA)技术,并对现有的攻击方法进行了改进和扩展,以实现更高效的攻击效果。
  • results: 我们的研究结果显示,现有的攻击方法无法实现系统级别的攻击效果(如违反交通规则)在实际AD上。我们还发现了两个设计限制:1)物理模型与像素抽象不匹配,2)缺乏车辆植物模型和AD系统模型考虑。我们提出了SysAdv,一种基于系统的攻击设计,并证明了它可以显著提高攻击效果,即违反率提高约70%。
    Abstract In autonomous driving (AD), accurate perception is indispensable to achieving safe and secure driving. Due to its safety-criticality, the security of AD perception has been widely studied. Among different attacks on AD perception, the physical adversarial object evasion attacks are especially severe. However, we find that all existing literature only evaluates their attack effect at the targeted AI component level but not at the system level, i.e., with the entire system semantics and context such as the full AD pipeline. Thereby, this raises a critical research question: can these existing researches effectively achieve system-level attack effects (e.g., traffic rule violations) in the real-world AD context? In this work, we conduct the first measurement study on whether and how effectively the existing designs can lead to system-level effects, especially for the STOP sign-evasion attacks due to their popularity and severity. Our evaluation results show that all the representative prior works cannot achieve any system-level effects. We observe two design limitations in the prior works: 1) physical model-inconsistent object size distribution in pixel sampling and 2) lack of vehicle plant model and AD system model consideration. Then, we propose SysAdv, a novel system-driven attack design in the AD context and our evaluation results show that the system-level effects can be significantly improved, i.e., the violation rate increases by around 70%.
    摘要 自动驾驶(AD)中精准感知是安全驾驶的关键。由于其安全性的重要性,AD感知的安全性已经得到了广泛的研究。 amongst different AD感知攻击,物理对抗对象逃脱攻击最为严重。然而,我们发现所有的文献都只评估了这些攻击的目标AI组件级别的影响,而不是整个系统的 semantics和context,例如整个AD管道。这引出了一个关键的研究问题:现有的研究是否可以在实际的AD上实现系统级别的效果?在这种工作中,我们进行了首次的测量研究,以确定现有的设计是否可以在AD上实现系统级别的效果,特别是STOP标志逃脱攻击的情况。我们的评估结果表明,所有代表性的先前工作都无法实现任何系统级别的效果。我们发现了两个设计 limitation:1)物理模型不一致的对象大小分布在像素抽样中,2)缺乏车辆植物模型和AD系统模型考虑。然后,我们提出了SysAdv,一种基于系统的攻击设计在AD上。我们的评估结果表明,可以显著提高系统级别的效果,即违反率提高约70%。

A Unified Framework for 3D Point Cloud Visual Grounding

  • paper_url: http://arxiv.org/abs/2308.11887
  • repo_url: https://github.com/leon1207/3dreftr
  • paper_authors: Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong Jin, Donghao Luo, Chengjie Wang, Yan Wang, Liujuan Cao
  • for: 本研究旨在提出一个统一的3D参考框架,协助3DScene理解和3D Referring Expression Comprehension (3DREC)。
  • methods: 本研究使用3D Referring Transformer (3DRefTR) 框架,具有双重功能:一是利用3DREC模型生成高分辨率的visual tokens,二是通过对superpoint进行组合,以提高3DRES的性能。
  • results: 实验结果显示,3DRefTR在ScanRefer dataset上比前一代3DRES方法提高12.43%的mIoU,并比前一代3DREC方法提高0.6%的Acc@0.25IoU
    Abstract 3D point cloud visual grounding plays a critical role in 3D scene comprehension, encompassing 3D referring expression comprehension (3DREC) and segmentation (3DRES). We argue that 3DREC and 3DRES should be unified in one framework, which is also a natural progression in the community. To explain, 3DREC can help 3DRES locate the referent, while 3DRES can also facilitate 3DREC via more finegrained language-visual alignment. To achieve this, this paper takes the initiative step to integrate 3DREC and 3DRES into a unified framework, termed 3D Referring Transformer (3DRefTR). Its key idea is to build upon a mature 3DREC model and leverage ready query embeddings and visual tokens from the 3DREC model to construct a dedicated mask branch. Specially, we propose Superpoint Mask Branch, which serves a dual purpose: i) By leveraging the heterogeneous CPU-GPU parallelism, while the GPU is occupied generating visual tokens, the CPU concurrently produces superpoints, equivalently accomplishing the upsampling computation; ii) By harnessing on the inherent association between the superpoints and point cloud, it eliminates the heavy computational overhead on the high-resolution visual features for upsampling. This elegant design enables 3DRefTR to achieve both well-performing 3DRES and 3DREC capacities with only a 6% additional latency compared to the original 3DREC model. Empirical evaluations affirm the superiority of 3DRefTR. Specifically, on the ScanRefer dataset, 3DRefTR surpasses the state-of-the-art 3DRES method by 12.43% in mIoU and improves upon the SOTA 3DREC method by 0.6% Acc@0.25IoU.
    摘要 三维点云视觉定位在三维场景理解中起着关键作用,涵盖三维指代表达理解(3DREC)与指代分割(3DRES)。我们认为3DREC和3DRES应当统一在一个框架中,这也是社区发展的自然方向。具体而言,3DREC可以帮助3DRES定位被指代的目标,而3DRES也可以通过更细粒度的语言-视觉对齐促进3DREC。为此,本文率先将3DREC和3DRES整合到一个统一框架中,称为3D Referring Transformer(3DRefTR)。其核心思想是在成熟的3DREC模型基础上,利用该模型现成的查询嵌入和视觉token来构建专门的掩码分支。特别地,我们提出了Superpoint Mask Branch,它具有双重作用:i)借助CPU-GPU异构并行,当GPU生成视觉token时,CPU同时生成superpoint,等效地完成上采样计算;ii)利用superpoint与点云之间的内在关联,消除在高分辨率视觉特征上进行上采样的沉重计算开销。这一精巧设计使3DRefTR在仅比原始3DREC模型增加6%延迟的情况下,同时获得出色的3DRES和3DREC能力。实验评估证实了3DRefTR的优越性:在ScanRefer数据集上,3DRefTR在mIoU上超过最先进的3DRES方法12.43%,并在Acc@0.25IoU上超过最先进的3DREC方法0.6%。

SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets

  • paper_url: http://arxiv.org/abs/2308.11880
  • repo_url: https://github.com/csimo005/summit
  • paper_authors: Cody Simons, Dripta S. Raychaudhuri, Sk Miraj Ahmed, Suya You, Konstantinos Karydis, Amit K. Roy-Chowdhury
  • for: 这篇论文的目的是提出一个可以在多种情况下进行Scene Understanding的方法,并且可以适应不同的数据分布而不需要实际的数据标注。
  • methods: 这篇论文使用了一个 switching 框架,它可以自动选择两种跨模式的 pseudo-label 融合方法(agreement filtering 和 entropy weighting),以适应目标领域中的数据分布。
  • results: 这篇论文的实验结果显示,这个方法可以在七个问题中 achieved results comparable to, 和在一些情况下甚至超过了可以接触到源数据的方法。实验结果显示,这个方法可以提高 mIoU 的表现,最高提高幅度达12%。
    Abstract Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion -- agreement filtering and entropy weighting -- based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT.
    摘要 在自主导航等许多应用中,利用多模态数据进行场景理解是必不可少的。为了应对各种情况,现有模型必须能够在不依赖繁重数据标注的前提下适应数据分布的变化。目前的方法假设在适应过程中可以获得源数据,并且源数据由成对的多模态数据构成;这两个假设在许多应用中都可能不成立。出于隐私、安全或经济方面的考虑,源数据可能无法获得;而假设存在成对的多模态训练数据也会带来高昂的数据收集成本,并且无法利用广泛可得、免费发布的预训练单模态模型。在这项工作中,我们放宽了这两个假设,研究如何在无法访问原始源数据的情况下,将一组分别在单模态数据上独立训练的模型适应到由无标注多模态数据构成的目标域。我们提出的方法通过一个切换框架来解决这一问题,该框架根据估计的域差距,在两种互补的跨模态伪标签融合方法(agreement filtering 和 entropy weighting)之间自动选择。我们在语义分割问题上进行了实验,在七个具有挑战性的适应场景中验证了方法的有效性,其结果与可以访问源数据的方法相当,在某些场景下甚至更优。与竞争基线相比,我们的方法在mIoU上最高提升了12%。
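
The two pseudo-label fusion rules named in the abstract can be sketched for a pair of uni-modal segmentation heads as follows: agreement filtering keeps only pixels where both modalities predict the same class, while entropy weighting averages the two distributions with inverse-entropy weights. The simple threshold used for switching here stands in for the paper's domain-gap-based rule and is an assumption.

```python
import torch

IGNORE = 255  # label used to mask out pixels that receive no pseudo-label

def agreement_filter(p_a, p_b):
    """Keep a pixel's pseudo-label only where both modalities agree (p_*: B,C,H,W softmax)."""
    la, lb = p_a.argmax(1), p_b.argmax(1)
    pseudo = la.clone()
    pseudo[la != lb] = IGNORE
    return pseudo

def entropy_weighted_fusion(p_a, p_b, eps=1e-8):
    """Fuse the two distributions, trusting the lower-entropy (more confident) modality more."""
    def ent(p):
        return -(p * (p + eps).log()).sum(1, keepdim=True)   # (B,1,H,W)
    wa, wb = 1.0 / (ent(p_a) + eps), 1.0 / (ent(p_b) + eps)
    fused = (wa * p_a + wb * p_b) / (wa + wb)
    return fused.argmax(1)

def fuse_pseudo_labels(p_a, p_b, domain_gap, gap_threshold=0.5):
    """Toy switching rule: strict agreement filtering when the estimated gap is large."""
    if domain_gap > gap_threshold:
        return agreement_filter(p_a, p_b)
    return entropy_weighted_fusion(p_a, p_b)
```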

Motion-to-Matching: A Mixed Paradigm for 3D Single Object Tracking

  • paper_url: http://arxiv.org/abs/2308.11875
  • repo_url: https://github.com/leozhiheng/mtm-tracker
  • paper_authors: Zhiheng Li, Yu Lin, Yubo Cui, Shuo Li, Zheng Fang
  • for: 本文主要针对3D单个目标跟踪问题,提出了一种基于LiDAR点云的混合方法,即MTM-Tracker。
  • methods: 该方法包括两个阶段:首先利用连续的历史边界框作为运动先验来建模目标的运动,然后通过特征交互模块从连续点云中提取运动感知特征,并与之进行匹配,以精化目标运动并回归其他目标状态。
  • results: 广泛的实验表明,MTM-Tracker 在大规模数据集(KITTI和NuScenes)上达到了竞争性表现(70.9% 和 51.70%)。
    Abstract 3D single object tracking with LiDAR points is an important task in the computer vision field. Previous methods usually adopt the matching-based or motion-centric paradigms to estimate the current target status. However, the former is sensitive to the similar distractors and the sparseness of point cloud due to relying on appearance matching, while the latter usually focuses on short-term motion clues (eg. two frames) and ignores the long-term motion pattern of target. To address these issues, we propose a mixed paradigm with two stages, named MTM-Tracker, which combines motion modeling with feature matching into a single network. Specifically, in the first stage, we exploit the continuous historical boxes as motion prior and propose an encoder-decoder structure to locate target coarsely. Then, in the second stage, we introduce a feature interaction module to extract motion-aware features from consecutive point clouds and match them to refine target movement as well as regress other target states. Extensive experiments validate that our paradigm achieves competitive performance on large-scale datasets (70.9% in KITTI and 51.70% in NuScenes). The code will be open soon at https://github.com/LeoZhiheng/MTM-Tracker.git.
    摘要 基于LiDAR点云的3D单目标跟踪是计算机视觉领域的重要任务。先前的方法通常采用基于匹配或以运动为中心的范式来估计目标状态。然而,前者依赖外观匹配,容易受到相似干扰物和点云稀疏性的影响;后者通常只关注短期运动线索(如两帧之间),忽略了目标的长期运动模式。为了解决这些问题,我们提出了一种两阶段的混合范式方法,名为 MTM-Tracker,它将运动建模与特征匹配结合到同一个网络中。具体来说,在第一阶段,我们利用连续的历史边界框作为运动先验,并提出Encoder-Decoder结构来粗略定位目标;在第二阶段,我们引入特征交互模块,从连续点云中提取运动感知特征并进行匹配,以精化目标运动并回归其他目标状态。大量实验证明我们的方法在大规模数据集上取得了具有竞争力的表现(KITTI 70.9%,NuScenes 51.70%)。代码即将在 https://github.com/LeoZhiheng/MTM-Tracker.git 开源。

Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch

  • paper_url: http://arxiv.org/abs/2308.11874
  • repo_url: None
  • paper_authors: Pan Du, Suyun Zhao, Zisen Sheng, Cuiping Li, Hong Chen
  • for: 这个研究目的是提出一种robust semi-supervised learning(SSL)框架,以便在分布不对称的情况下提高SSL的性能。
  • methods: 本研究使用了Weight-Aware Distillation(WAD)框架,通过设置了适当的权重,将有用的知识传递到目标类别器中,以提高SSL的性能。
  • results: 实验结果显示,WAD比五种现有的SSL方法和一个基准方法在CIFAR10和CIFAR100类别Dataset上表现更好,并且在人工跨dataset上也获得了良好的结果。
    Abstract Semi-Supervised Learning (SSL) under class distribution mismatch aims to tackle a challenging problem wherein unlabeled data contain lots of unknown categories unseen in the labeled ones. In such mismatch scenarios, traditional SSL suffers severe performance damage due to the harmful invasion of the instances with unknown categories into the target classifier. In this study, by strict mathematical reasoning, we reveal that the SSL error under class distribution mismatch is composed of pseudo-labeling error and invasion error, both of which jointly bound the SSL population risk. To alleviate the SSL error, we propose a robust SSL framework called Weight-Aware Distillation (WAD) that, by weights, selectively transfers knowledge beneficial to the target task from unsupervised contrastive representation to the target classifier. Specifically, WAD captures adaptive weights and high-quality pseudo labels to target instances by exploring point mutual information (PMI) in representation space to maximize the role of unlabeled data and filter unknown categories. Theoretically, we prove that WAD has a tight upper bound of population risk under class distribution mismatch. Experimentally, extensive results demonstrate that WAD outperforms five state-of-the-art SSL approaches and one standard baseline on two benchmark datasets, CIFAR10 and CIFAR100, and an artificial cross-dataset. The code is available at https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master.
    摘要 半监督学习(SSL)在类分布不匹配场景下面临着一个具有挑战性的问题,那就是无标签数据中含有很多未知类别,这些类别不在标签数据中出现过。在这种场景下,传统的SSL表现糟糕,这是因为未知类别的实例侵入了目标分类器,从而导致SSL错误的增加。在这项研究中,通过严格的数学推理,我们发现SSL错误在类分布不匹配情况下是由pseudo-labeling错误和入侵错误两者相互绑定的。为了减轻SSL错误,我们提出了一种可靠SSL框架calledWeight-Aware Distillation(WAD)。WAD通过权重来选择ively传输无标签数据中有助于目标任务的知识到目标分类器。具体来说,WAD捕捉适应性权重和高质量pseudo标签,以便target实例通过在表示空间中探索点对应信息(PMI)来最大化无标签数据的作用,并过滤未知类别。理论上,我们证明WAD在类分布不匹配情况下具有紧binding的人口风险。实验证明,WAD在CIFAR10和CIFAR100两个benchmarkdataset和一个人工交叉dataset上比五种现状顶峰SSL方法和一个标准基eline表现出色,并且可以减轻SSL错误。代码可以在https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master中下载。

CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation

  • paper_url: http://arxiv.org/abs/2308.11857
  • repo_url: None
  • paper_authors: Zihao Wang, Yiming Huang, Ziyu Zhou
  • for: 这个论文的目的是提出一种新的图像生成方法,旨在将图像转化为点云集,并使用简单的聚类算法来生成图像。
  • methods: 该方法使用的是Context Clustering(CoC)聚类算法,并结合多层感知网络(MLP)来生成图像。另外,该方法还包括一个名为“点增加器”的模块,用于生成额外的点数据,以便聚类。
  • results: 实验表明,该方法无需卷积或注意力机制即可达到出色的性能。此外,该方法的可解释性也使其可以在实验中进行可视化分析。这些结果证明了该方法的可行性,并促使未来对Context Clustering在更多图像生成任务中的进一步研究。
    Abstract Image generation tasks are traditionally undertaken using Convolutional Neural Networks (CNN) or Transformer architectures for feature aggregating and dispatching. Despite the frequent application of convolution and attention structures, these structures are not fundamentally required to solve the problem of instability and the lack of interpretability in image generation. In this paper, we propose a unique image generation process premised on the perspective of converting images into a set of point clouds. In other words, we interpret an image as a set of points. As such, our methodology leverages simple clustering methods named Context Clustering (CoC) to generate images from unordered point sets, which defies the convention of using convolution or attention mechanisms. Hence, we exclusively depend on this clustering technique, combined with the multi-layer perceptron (MLP) in a generative model. Furthermore, we implement the integration of a module termed the 'Point Increaser' for the model. This module is just an MLP tasked with generating additional points for clustering, which are subsequently integrated within the paradigm of the Generative Adversarial Network (GAN). We introduce this model with the novel structure as the Context Clustering Generative Adversarial Network (CoC-GAN), which offers a distinctive viewpoint in the domain of feature aggregating and dispatching. Empirical evaluations affirm that our CoC-GAN, devoid of convolution and attention mechanisms, exhibits outstanding performance. Its interpretability, endowed by the CoC module, also allows for visualization in our experiments. The promising results underscore the feasibility of our method and thus warrant future investigations of applying Context Clustering to more novel and interpretable image generation.
    摘要 图像生成任务通常使用卷积神经网络(CNN)或 Transformer 架构来进行特征聚合和派发。尽管卷积和注意力结构被频繁应用,但它们并不是解决图像生成中不稳定和缺乏可解释性问题的根本要求。在这篇论文中,我们提出了一种独特的图像生成过程,其思想是将图像转换为一组点云,即把图像视为一组点。因此,我们的方法利用名为 Context Clustering (CoC) 的简单聚类方法从无序点集中生成图像,而不使用卷积或注意力机制,在生成模型中仅依赖这种聚类技术与多层感知机(MLP)的结合。此外,我们还为模型引入了一个称为 'Point Increaser' 的模块,它本身就是一个 MLP,用于生成更多供聚类使用的点,并被整合到生成对抗网络(GAN)的框架中。我们将这一具有新颖结构的模型称为 Context Clustering Generative Adversarial Network (CoC-GAN),它在特征聚合和派发领域提供了一种新的视角。实验结果表明,不使用卷积和注意力机制的 CoC-GAN 表现出色;同时,得益于 CoC 模块赋予的可解释性,我们还能在实验中对其进行可视化。这些令人鼓舞的结果证明了该方法的可行性,也值得在未来进一步探索将 Context Clustering 应用于更多新颖且可解释的图像生成任务。

Compressed Models Decompress Race Biases: What Quantized Models Forget for Fair Face Recognition

  • paper_url: http://arxiv.org/abs/2308.11840
  • repo_url: None
  • paper_authors: Pedro C. Neto, Eduarda Caldeira, Jaime S. Cardoso, Ana F. Sequeira
  • for: 本研究旨在调查对人脸识别深度学习模型进行量化(quantization)时,对整体性能以及种族偏见的影响。
  • methods: 本研究使用了State-of-the-Art量化方法,并在真实数据和synthetic数据上进行了测试。
  • results: 研究发现,使用synthetic数据可以减少大多数测试场景中的偏见,并且对不同的种族背景进行了分析。
    Abstract With the ever-growing complexity of deep learning models for face recognition, it becomes hard to deploy these systems in real life. Researchers have two options: 1) use smaller models; 2) compress their current models. Since the usage of smaller models might lead to concerning biases, compression gains relevance. However, compressing might be also responsible for an increase in the bias of the final model. We investigate the overall performance, the performance on each ethnicity subgroup and the racial bias of a State-of-the-Art quantization approach when used with synthetic and real data. This analysis provides a few more details on potential benefits of performing quantization with synthetic data, for instance, the reduction of biases on the majority of test scenarios. We tested five distinct architectures and three different training datasets. The models were evaluated on a fourth dataset which was collected to infer and compare the performance of face recognition models on different ethnicity.
    摘要 随着深度学习面Recognition模型的复杂度不断增加,实际部署变得越来越Difficult.研究人员有两个选择:1)使用更小的模型;2)压缩当前模型。然而,使用更小的模型可能会导致问题的偏见,因此压缩变得更加重要。然而,压缩也可能会导致最终模型的偏见增加。我们 investigate了State-of-the-Art量化方法的总性能、每个种族 subgroup 的性能和最终模型的种族偏见。这种分析提供了一些更多的细节,例如使用synthetic数据进行量化可以减少大多数测试场景中的偏见。我们测试了五种不同的架构和三个不同的训练数据集。模型被评估在一个 fourth 数据集上,该数据集用于对不同种族的面Recognition模型的性能进行比较。

PatchBackdoor: Backdoor Attack against Deep Neural Networks without Model Modification

  • paper_url: http://arxiv.org/abs/2308.11822
  • repo_url: https://github.com/xaiveryuan/patchbackdoor
  • paper_authors: Yizhen Yuan, Rui Kong, Shenghao Xie, Yuanchun Li, Yunxin Liu
  • for: 这种论文旨在攻击深度学习系统,尤其是在安全关键场景下。
  • methods: 该论文提出了一种新的后门攻击方法,即通过在摄像头前面放置一个特制的贴图(称为后门贴图),使得模型在攻击者控制的条件下产生错误预测。
  • results: 实验结果显示,该攻击方法可以在常见的深度学习模型(VGG、MobileNet、ResNet)上实现攻击成功率为93%~99%。此外,作者还在实际应用中实现了该攻击方法,并证明了其仍然具有威胁性。
    Abstract Backdoor attack is a major threat to deep learning systems in safety-critical scenarios, which aims to trigger misbehavior of neural network models under attacker-controlled conditions. However, most backdoor attacks have to modify the neural network models through training with poisoned data and/or direct model editing, which leads to a common but false belief that backdoor attack can be easily avoided by properly protecting the model. In this paper, we show that backdoor attacks can be achieved without any model modification. Instead of injecting backdoor logic into the training data or the model, we propose to place a carefully-designed patch (namely backdoor patch) in front of the camera, which is fed into the model together with the input images. The patch can be trained to behave normally at most of the time, while producing wrong prediction when the input image contains an attacker-controlled trigger object. Our main techniques include an effective training method to generate the backdoor patch and a digital-physical transformation modeling method to enhance the feasibility of the patch in real deployments. Extensive experiments show that PatchBackdoor can be applied to common deep learning models (VGG, MobileNet, ResNet) with an attack success rate of 93% to 99% on classification tasks. Moreover, we implement PatchBackdoor in real-world scenarios and show that the attack is still threatening.
    摘要 后门攻击是安全关键场景下深度学习系统面临的一大威胁,其目的是在攻击者控制的条件下诱发神经网络模型的异常行为。然而,大多数后门攻击需要通过使用投毒数据训练和/或直接编辑模型来修改神经网络,这导致了一种常见但错误的观念,即只要妥善保护模型就可以避免后门攻击。在这篇论文中,我们展示了无需任何模型修改也可以实现后门攻击。我们不向训练数据或模型中注入后门逻辑,而是提议在摄像头前方放置一个精心设计的贴片(称为后门贴片),它与输入图像一起被送入模型。该贴片经过训练,在大多数情况下表现正常,而当输入图像中出现攻击者控制的触发物体时则使模型产生错误预测。我们的主要技术包括生成后门贴片的有效训练方法,以及提高贴片在真实部署中可行性的数字-物理变换建模方法。大量实验显示,PatchBackdoor 可以应用于常见的深度学习模型(VGG、MobileNet、ResNet),在分类任务上的攻击成功率为93%到99%。此外,我们还在真实场景中实现了 PatchBackdoor,证明该攻击依然具有威胁性。

CLIP Multi-modal Hashing: A new baseline CLIPMH

  • paper_url: http://arxiv.org/abs/2308.11797
  • repo_url: None
  • paper_authors: Jian Zhu, Mingkai Sheng, Mingda Ke, Zhangmin Huang, Jingfei Chang
  • for: 提高多媒体检索精度
  • methods: 使用CLIP模型提取文本和图像特征,并将其拼接生成哈希码
  • results: 与状态对比,CLIPMH可以显著提高多媒体检索精度(最大提高率8.38%),CLIP也比文本和视觉基础网络更有优势。
    Abstract The multi-modal hashing method is widely used in multimedia retrieval. It can fuse multi-source data to generate binary hash code. However, the current multi-modal methods have the problem of low retrieval accuracy. The reason is that the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data. To solve this problem, we propose a new baseline CLIP Multi-modal Hashing (CLIPMH) method. It uses CLIP model to extract text and image features, and then fuse to generate hash code. CLIP improves the expressiveness of each modal feature. In this way, it can greatly improve the retrieval performance of multi-modal hashing methods. In comparison to state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly enhance performance (Maximum increase of 8.38%). CLIP also has great advantages over the text and visual backbone networks commonly used before.
    摘要 多模态哈希方法广泛应用于多媒体检索,它可以融合多源数据生成二进制哈希码。然而,现有的多模态方法存在检索精度较低的问题,原因在于各个骨干网络的特征表达能力有限,且未在大规模无监督多模态数据上进行联合预训练。为解决这个问题,我们提出了一个新的基线方法——CLIP多模态哈希(CLIPMH)。它使用CLIP模型提取文本和图像特征,然后将二者融合生成哈希码。CLIP提升了每种模态特征的表达能力,从而可以大幅提高多模态哈希方法的检索性能。与最先进的无监督和有监督多模态哈希方法相比,实验表明所提出的CLIPMH可以显著提升性能(最大提升8.38%)。CLIP相比此前常用的文本和视觉骨干网络也具有明显优势。
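
A minimal sketch of the overall recipe: frozen text and image encoders produce features, which are concatenated and mapped by a small head to tanh outputs during training and to signed bits at retrieval time. The encoder interface, feature dimensions, and head architecture are assumptions (any CLIP-like model exposing image/text features would do), not the paper's exact pipeline.

```python
import torch
import torch.nn as nn

class MultiModalHashHead(nn.Module):
    """Fuse image and text features and map them to K-bit hash codes (sketch)."""
    def __init__(self, img_dim=512, txt_dim=512, code_bits=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_bits), nn.Tanh(),   # tanh keeps training differentiable
        )

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=1))

    @torch.no_grad()
    def hash_codes(self, img_feat, txt_feat):
        return torch.sign(self(img_feat, txt_feat))  # binarize only at retrieval time

# Usage with stand-in tensors in place of frozen CLIP-like encoder outputs.
head = MultiModalHashHead()
img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)
print(head.hash_codes(img_feat, txt_feat).shape)     # torch.Size([4, 64])
```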

Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

  • paper_url: http://arxiv.org/abs/2308.11796
  • repo_url: https://github.com/smsd75/timetuning
  • paper_authors: Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
  • for: 本研究的目的是提出一种 incorporating temporal consistency in dense self-supervised learning 的方法,以提高视频和图像表示质量。
  • methods: 该方法从图像预训练模型出发,使用一种新的自监督时间对齐聚类损失在无标注视频上对其进行微调。这有效地将视频中的高层信息传递到图像表示中。
  • results: 对于无监督 semantic segmentation 任务,该方法可以提高视频表示质量8-10%,并与图像表示质量相同。这种方法可以推动更多的自我超vised scaling,因为视频的可用性很高。代码可以在这里找到:https://github.com/SMSD75/Timetuning。
    Abstract Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos-but also images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found here : https://github.com/SMSD75/Timetuning
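The digest does not spell out the exact form of the temporal-alignment clustering loss; the following is a rough, hypothetical sketch of one plausible variant (soft cluster assignments of frame t over shared prototypes serve as targets for frame t+1), intended only to illustrate the general idea rather than the paper's formulation:

```python
import torch
import torch.nn.functional as F

def temporal_alignment_clustering_loss(feat_t, feat_t1, prototypes, temperature=0.1):
    """Hypothetical sketch: encourage frame t+1 to predict the (soft) cluster
    assignment of frame t over a shared set of prototypes."""
    feat_t = F.normalize(feat_t, dim=-1)
    feat_t1 = F.normalize(feat_t1, dim=-1)
    protos = F.normalize(prototypes, dim=-1)

    with torch.no_grad():  # targets from frame t are treated as fixed
        target = F.softmax(feat_t @ protos.t() / temperature, dim=-1)
    logits = feat_t1 @ protos.t() / temperature
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Toy usage: 16 dense feature vectors per frame, 32 prototypes, feature dim 128.
feat_t, feat_t1 = torch.randn(16, 128), torch.randn(16, 128)
prototypes = torch.randn(32, 128, requires_grad=True)
print(float(temporal_alignment_clustering_loss(feat_t, feat_t1, prototypes)))
```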

Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts

  • paper_url: http://arxiv.org/abs/2308.11793
  • repo_url: https://github.com/VITA-Group/GNT-MOVE
  • paper_authors: Wenyan Cong, Hanxue Liang, Peihao Wang, Zhiwen Fan, Tianlong Chen, Mukund Varma, Yi Wang, Zhangyang Wang
  • for: proposes a generalizable NeRF model based on Mixture-of-Experts (MoE) to improve cross-scene generalization of NeRF models.
  • methods: starts from a feed-forward, transformer-based "neuralized" NeRF architecture (GNT) and plugs in MoE layers, adding a shared permanent expert and a geometry-aware consistency loss to enforce cross-scene consistency and spatial smoothness (an MoE-layer sketch follows the abstract).
  • results: experiments show that GNT-MOVE achieves state-of-the-art view synthesis on unseen scenes in both zero-shot and few-shot settings, indicating remarkable cross-scene generalization.
    Abstract Cross-scene generalizable NeRF models, which can directly synthesize novel views of unseen scenes, have become a new spotlight of the NeRF field. Several existing attempts rely on increasingly end-to-end "neuralized" architectures, i.e., replacing scene representation and/or rendering modules with performant neural networks such as transformers, and turning novel view synthesis into a feed-forward inference pipeline. While those feedforward "neuralized" architectures still do not fit diverse scenes well out of the box, we propose to bridge them with the powerful Mixture-of-Experts (MoE) idea from large language models (LLMs), which has demonstrated superior generalization ability by balancing between larger overall model capacity and flexible per-instance specialization. Starting from a recent generalizable NeRF architecture called GNT, we first demonstrate that MoE can be neatly plugged in to enhance the model. We further customize a shared permanent expert and a geometry-aware consistency loss to enforce cross-scene consistency and spatial smoothness respectively, which are essential for generalizable view synthesis. Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes, indicating remarkably better cross-scene generalization in both zero-shot and few-shot settings. Our codes are available at https://github.com/VITA-Group/GNT-MOVE.
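A compact sketch of the core idea of an MoE feed-forward layer in which one "permanent" expert is always active while the remaining experts are selected per token by a router; the layer sizes and the top-1 routing are illustrative assumptions, not the exact GNT-MOVE design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithPermanentExpert(nn.Module):
    """Top-1 routed MoE feed-forward layer plus a shared expert that every token passes through."""
    def __init__(self, dim=256, hidden=512, n_experts=4):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.experts = nn.ModuleList([make_ffn() for _ in range(n_experts)])
        self.permanent_expert = make_ffn()      # shared across all tokens and scenes
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                       # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = gate.max(dim=-1)       # top-1 routing
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                routed[mask] = top_w[mask, None] * expert(x[mask])
        return routed + self.permanent_expert(x)

x = torch.randn(10, 256)
print(MoEWithPermanentExpert()(x).shape)  # torch.Size([10, 256])
```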

An extensible point-based method for data chart value detection

  • paper_url: http://arxiv.org/abs/2308.11788
  • repo_url: https://github.com/bnlnlp/ppn_model
  • paper_authors: Carlos Soto, Shinjae Yoo
  • for: presents an extensible method for extracting the values of data charts in scientific articles, particularly complex bar charts, by detecting their semantic points.
  • methods: a point proposal network (analogous to region proposal networks in object detection) directly predicts the positions of points of interest in a chart and is readily extensible to multiple chart types and elements (a heatmap-based sketch follows the abstract).
  • results: detects salient points on complex bar charts with 0.8705 F1 (@1.5-cell max deviation) and reaches 0.9810 F1 on synthetically generated charts; training exclusively on synthetic data with novel augmentations still yields 0.6621 F1 on real charts with widely varying appearance, and the unchanged method reaches 0.8343 F1 on synthetic pie charts.
    Abstract We present an extensible method for identifying semantic points to reverse engineer (i.e. extract the values of) data charts, particularly those in scientific articles. Our method uses a point proposal network (akin to region proposal networks for object detection) to directly predict the position of points of interest in a chart, and it is readily extensible to multiple chart types and chart elements. We focus on complex bar charts in the scientific literature, on which our model is able to detect salient points with an accuracy of 0.8705 F1 (@1.5-cell max deviation); it achieves 0.9810 F1 on synthetically-generated charts similar to those used in prior works. We also explore training exclusively on synthetic data with novel augmentations, reaching surprisingly competent performance in this way (0.6621 F1) on real charts with widely varying appearance, and we further demonstrate our unchanged method applied directly to synthetic pie charts (0.8343 F1). Datasets, trained models, and evaluation code are available at https://github.com/BNLNLP/PPN_model.
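A toy sketch of the point-proposal idea: a small convolutional network predicts a per-pixel keypoint heatmap, and local maxima above a threshold become the proposed points. The architecture and the peak extraction below are simplified assumptions, not the published PPN model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPointProposalNet(nn.Module):
    """Predicts a 1-channel keypoint heatmap over the chart image."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, img):
        return torch.sigmoid(self.body(img))  # (B, 1, H, W) keypoint probabilities

def extract_points(heatmap, thresh=0.5):
    """Keep local maxima (3x3 neighborhood) above a probability threshold."""
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    peaks = (heatmap == pooled) & (heatmap > thresh)
    return peaks.nonzero()  # (N, 4): batch, channel, y, x

img = torch.randn(1, 3, 128, 128)
heatmap = TinyPointProposalNet()(img)
print(extract_points(heatmap).shape)
```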

Coarse-to-Fine Multi-Scene Pose Regression with Transformers

  • paper_url: http://arxiv.org/abs/2308.11783
  • repo_url: https://github.com/yolish/c2f-ms-transformer
  • paper_authors: Yoli Shavit, Ron Ferens, Yosi Keller
  • for: proposes a transformer-based multi-scene absolute camera pose regression method that localizes a camera in multiple scenes simultaneously.
  • methods: encoders aggregate activation maps with self-attention while embedding multiple scenes in parallel; decoders transform latent features and scene encodings into pose predictions, and a mixed classification-regression architecture improves localization accuracy (a stripped-down sketch follows the abstract).
  • results: evaluated on common indoor and outdoor benchmarks, the method exceeds both multi-scene and state-of-the-art single-scene absolute pose regressors in localization accuracy.
    Abstract Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmark indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.
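A stripped-down sketch of multi-scene pose regression with a transformer decoder: learned scene queries attend to image features, a classification head selects the scene, and a regression head outputs position plus orientation. The dimensions, the single decoder, and the 7-DoF head are illustrative assumptions, not the paper's exact coarse-to-fine design:

```python
import torch
import torch.nn as nn

class MultiScenePoseRegressor(nn.Module):
    def __init__(self, dim=256, n_scenes=4):
        super().__init__()
        self.scene_queries = nn.Parameter(torch.randn(n_scenes, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.scene_cls = nn.Linear(dim, 1)      # one logit per scene query
        self.pose_head = nn.Linear(dim, 7)      # 3D position + 4D quaternion

    def forward(self, img_tokens):              # img_tokens: (B, N, dim) from a CNN/ViT encoder
        B = img_tokens.shape[0]
        q = self.scene_queries.unsqueeze(0).expand(B, -1, -1)
        latent = self.decoder(q, img_tokens)                      # (B, n_scenes, dim)
        scene_logits = self.scene_cls(latent).squeeze(-1)         # classify the scene
        poses = self.pose_head(latent)                            # one pose per scene query
        best = scene_logits.argmax(dim=-1)
        return scene_logits, poses[torch.arange(B), best]         # pose of the selected scene

tokens = torch.randn(2, 196, 256)
logits, pose = MultiScenePoseRegressor()(tokens)
print(logits.shape, pose.shape)  # torch.Size([2, 4]) torch.Size([2, 7])
```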

Understanding Hessian Alignment for Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.11778
  • repo_url: https://github.com/huawei-noah/federated-learning
  • paper_authors: Sobhan Hemati, Guojun Zhang, Amir Estiri, Xi Chen
  • for: addresses out-of-distribution (OOD) generalization of deep learning models in real-world settings such as healthcare and autonomous driving.
  • methods: among existing techniques, gradient-based regularizers perform best, yet the role of the Hessian and gradient in domain generalization has been poorly understood; the paper analyzes the classifier head's Hessian and gradient using recent OOD transferability theory.
  • results: the analysis shows that aligning gradients and Hessians across domains improves OOD generalization, and the paper proposes two simple yet effective alignment methods, based on the Hessian-gradient product (HGP) and Hutchinson's method, that avoid computing Hessians explicitly and perform well across diverse OOD scenarios (the underlying double-backward trick is sketched after the abstract).
    Abstract Out-of-distribution (OOD) generalization is a critical ability for deep learning models in many real-world scenarios including healthcare and autonomous vehicles. Recently, different techniques have been proposed to improve OOD generalization. Among these methods, gradient-based regularizers have shown promising performance compared with other competitors. Despite this success, our understanding of the role of Hessian and gradient alignment in domain generalization is still limited. To address this shortcoming, we analyze the role of the classifier's head Hessian matrix and gradient in domain generalization using recent OOD theory of transferability. Theoretically, we show that spectral norm between the classifier's head Hessian matrices across domains is an upper bound of the transfer measure, a notion of distance between target and source domains. Furthermore, we analyze all the attributes that get aligned when we encourage similarity between Hessians and gradients. Our analysis explains the success of many regularizers like CORAL, IRM, V-REx, Fish, IGA, and Fishr as they regularize part of the classifier's head Hessian and/or gradient. Finally, we propose two simple yet effective methods to match the classifier's head Hessians and gradients in an efficient way, based on the Hessian Gradient Product (HGP) and Hutchinson's method (Hutchinson), and without directly calculating Hessians. We validate the OOD generalization ability of proposed methods in different scenarios, including transferability, severe correlation shift, label shift and diversity shift. Our results show that Hessian alignment methods achieve promising performance on various OOD benchmarks. The code is available at \url{https://github.com/huawei-noah/Federated-Learning/tree/main/HessianAlignment}.
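The computational point of HGP-style regularizers is that a Hessian-gradient product can be obtained with a second backward pass, never forming the Hessian. The sketch below shows that standard double-backward trick for a classifier head; the toy model, loss, and the choice of direction v = g are assumptions for illustration, not the paper's exact regularizer:

```python
import torch
import torch.nn as nn

# Toy classifier head and loss.
head = nn.Linear(16, 3)
x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))
loss = nn.functional.cross_entropy(head(x), y)

params = list(head.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for a second backward
g = torch.cat([g_.reshape(-1) for g_ in grads])                # flattened gradient

v = g.detach()                                                 # product direction: the gradient itself
hgp = torch.autograd.grad(g @ v, params)                       # Hessian-gradient product, no explicit Hessian
hgp = torch.cat([h.reshape(-1) for h in hgp])
print(g.shape, hgp.shape)
```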

SAMSNeRF: Segment Anything Model (SAM) Guides Dynamic Surgical Scene Reconstruction by Neural Radiance Field (NeRF)

  • paper_url: http://arxiv.org/abs/2308.11774
  • repo_url: None
  • paper_authors: Ange Lou, Yamin Li, Xing Yao, Yike Zhang, Jack Noble
  • for: high-fidelity dynamic surgical scene reconstruction to support intraoperative navigation and surgical automation.
  • methods: combines the Segment Anything Model (SAM) with a Neural Radiance Field (NeRF); SAM generates accurate segmentation masks of surgical tools, which guide the refinement of the dynamic scene reconstruction by NeRF (a mask-weighted loss sketch follows the abstract).
  • results: experiments on public endoscopy videos show the method reconstructs high-fidelity dynamic surgical scenes and accurately reflects the spatial information of surgical tools.
    Abstract The accurate reconstruction of surgical scenes from surgical videos is critical for various applications, including intraoperative navigation and image-guided robotic surgery automation. However, previous approaches, mainly relying on depth estimation, have limited effectiveness in reconstructing surgical scenes with moving surgical tools. To address this limitation and provide accurate 3D position prediction for surgical tools in all frames, we propose a novel approach called SAMSNeRF that combines Segment Anything Model (SAM) and Neural Radiance Field (NeRF) techniques. Our approach generates accurate segmentation masks of surgical tools using SAM, which guides the refinement of the dynamic surgical scene reconstruction by NeRF. Our experimental results on public endoscopy surgical videos demonstrate that our approach successfully reconstructs high-fidelity dynamic surgical scenes and accurately reflects the spatial information of surgical tools. Our proposed approach can significantly enhance surgical navigation and automation by providing surgeons with accurate 3D position information of surgical tools during surgery.The source code will be released soon.
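One simple way a segmentation mask can guide a radiance-field reconstruction is by re-weighting the per-ray rendering loss, for example up-weighting pixels that SAM labels as tool. The weighting scheme below is an assumption for illustration only, not the paper's guidance mechanism:

```python
import torch

def mask_guided_rendering_loss(pred_rgb, gt_rgb, tool_mask, tool_weight=5.0):
    """Per-ray L2 rendering loss, up-weighted on pixels inside the SAM tool mask.
    pred_rgb, gt_rgb: (N, 3) colors of sampled rays; tool_mask: (N,) in {0, 1}."""
    per_ray = ((pred_rgb - gt_rgb) ** 2).mean(dim=-1)
    weights = 1.0 + (tool_weight - 1.0) * tool_mask.float()
    return (weights * per_ray).mean()

pred = torch.rand(1024, 3)
gt = torch.rand(1024, 3)
mask = (torch.rand(1024) > 0.8).float()
print(float(mask_guided_rendering_loss(pred, gt, mask)))
```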

Weakly Supervised Face and Whole Body Recognition in Turbulent Environments

  • paper_url: http://arxiv.org/abs/2308.11757
  • repo_url: None
  • paper_authors: Kshitij Nikhal, Benjamin S. Riggan
  • for: improving the robustness of long-range face and whole-body recognition systems under atmospheric turbulence and varying standoff distances.
  • methods: a new weakly supervised framework uses a parameter-efficient self-attention module to generate domain-agnostic representations that align turbulent and pristine images in a common subspace; a new tilt map estimator predicts the geometric distortions observed in turbulent images and is used to re-rank gallery matches.
  • results: on the LRFID and BGC1 datasets, the method improves rank-1 accuracy (by up to 13.86%) across varying turbulence levels and standoff distances.
    Abstract Face and person recognition have recently achieved remarkable success under challenging scenarios, such as off-pose and cross-spectrum matching. However, long-range recognition systems are often hindered by atmospheric turbulence, leading to spatially and temporally varying distortions in the image. Current solutions rely on generative models to reconstruct a turbulent-free image, but often preserve photo-realism instead of discriminative features that are essential for recognition. This can be attributed to the lack of large-scale datasets of turbulent and pristine paired images, necessary for optimal reconstruction. To address this issue, we propose a new weakly supervised framework that employs a parameter-efficient self-attention module to generate domain agnostic representations, aligning turbulent and pristine images into a common subspace. Additionally, we introduce a new tilt map estimator that predicts geometric distortions observed in turbulent images. This estimate is used to re-rank gallery matches, resulting in up to 13.86\% improvement in rank-1 accuracy. Our method does not require synthesizing turbulent-free images or ground-truth paired images, and requires significantly fewer annotated samples, enabling more practical and rapid utility of increasingly large datasets. We analyze our framework using two datasets -- Long-Range Face Identification Dataset (LRFID) and BRIAR Government Collection 1 (BGC1) -- achieving enhanced discriminability under varying turbulence and standoff distance.

Efficient Controllable Multi-Task Architectures

  • paper_url: http://arxiv.org/abs/2308.11744
  • repo_url: https://github.com/AriGho/An-Efficient-Algorithm-for-Increasing-Modularity-in-IoT-Based-Automation-Systems
  • paper_authors: Abhishek Aich, Samuel Schulter, Amit K. Roy-Chowdhury, Manmohan Chandraker, Yumin Suh
  • for: training a single multi-task model whose compute budget and relative task importance can be adjusted by users after deployment, without retraining or storing separate models for each scenario.
  • methods: a shared encoder and task-specific decoders whose channel widths are all slimmable; task importance is controlled by varying the capacity of the task-specific decoders, while total computational cost is controlled by jointly adjusting the encoder capacity. Training uses a novel Configuration-Invariant Knowledge Distillation loss, and a simple search algorithm translates user constraints into runtime width configurations (a slimmable-layer sketch follows the abstract).
  • results: improves overall accuracy and delivers high-quality slimmed sub-architectures without retraining; on three multi-task benchmarks (PASCALContext, NYUDv2, and CIFAR100-MTL) it improves controllability by ~33.5% on NYUD-v2 over prior methods at much lower compute cost.
    Abstract We aim to train a multi-task model such that users can adjust the desired compute budget and relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models for various scenarios. To this end, we propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable. Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost by jointly adjusting the encoder capacity. This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures based on user's constraints. Our training strategy involves a novel 'Configuration-Invariant Knowledge Distillation' loss that enforces backbone representations to be invariant under different runtime width configurations to enhance accuracy. Further, we present a simple but effective search algorithm that translates user constraints to runtime width configurations of both the shared encoder and task decoders, for sampling the sub-architectures. The key rule for the search algorithm is to provide a larger computational budget to the higher preferred task decoder, while searching a shared encoder configuration that enhances the overall MTL performance. Various experiments on three multi-task benchmarks (PASCALContext, NYUDv2, and CIFAR100-MTL) with diverse backbone architectures demonstrate the advantage of our approach. For example, our method shows a higher controllability by ~33.5% in the NYUD-v2 dataset over prior methods, while incurring much less compute cost.
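A minimal sketch of a slimmable linear layer whose active output width can be set at runtime, which is the kind of building block behind adjustable encoder and decoder capacities; the switchable-width mechanism shown is a generic one, not necessarily the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Linear):
    """Linear layer whose active output width is a runtime-selectable fraction of the full width."""
    def __init__(self, in_features, out_features):
        super().__init__(in_features, out_features)
        self.width_mult = 1.0  # runtime knob in (0, 1]

    def forward(self, x):
        out = max(1, int(self.out_features * self.width_mult))
        return F.linear(x, self.weight[:out], self.bias[:out])

layer = SlimmableLinear(64, 128)
x = torch.randn(4, 64)
layer.width_mult = 1.0
print(layer(x).shape)   # torch.Size([4, 128])  full budget
layer.width_mult = 0.25
print(layer(x).shape)   # torch.Size([4, 32])   slimmed width for a low-priority task
```

In a full model, the following layer would also slice its input weights to match the reduced width, so the whole encoder/decoder can be executed at any of the sampled width configurations.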

Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape

  • paper_url: http://arxiv.org/abs/2308.11737
  • repo_url: None
  • paper_authors: Jiacong Xu, Yi Zhang, Jiawei Peng, Wufei Ma, Artur Jesslen, Pengliang Ji, Qixin Hu, Jiehua Zhang, Qihao Liu, Jiahao Wang, Wei Ji, Chen Wang, Xiaoding Yuan, Prakhar Kaushik, Guofeng Zhang, Jie Liu, Yushan Xie, Yawen Cui, Alan Yuille, Adam Kortylewski
  • for: This paper aims to provide a comprehensive dataset for mammal animal 3D pose and shape estimation, which can potentially benefit many downstream applications such as wildlife conservation.
  • methods: The paper proposes a dataset called Animal3D, which consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and the pose and shape parameters of the SMAL model.
  • results: The paper benchmarks representative shape and pose estimation models on the Animal3D dataset and demonstrates that synthetic pre-training is a viable strategy to boost the model performance. However, predicting the 3D shape and pose of animals across species remains a very challenging task.
    Abstract Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dataset for mammal animal 3D pose and shape estimation. Animal3D consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and importantly the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure highest quality results. Based on the Animal3D dataset, we benchmark representative shape and pose estimation models at: (1) supervised learning from only the Animal3D data, (2) synthetic to real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models. Our experimental results demonstrate that predicting the 3D shape and pose of animals across species remains a very challenging task, despite significant advances in human pose estimation. Our results further demonstrate that synthetic pre-training is a viable strategy to boost the model performance. Overall, Animal3D opens new directions for facilitating future research in animal 3D pose and shape estimation, and is publicly available.

(Un)fair Exposure in Deep Face Rankings at a Distance

  • paper_url: http://arxiv.org/abs/2308.11732
  • repo_url: None
  • paper_authors: Andrea Atzori, Gianni Fenu, Mirko Marras
  • for: investigates the demographic biases that arise when law enforcement ranks suspects from facial images with deep face models, a setting in which forensic face rankings remain underexplored.
  • methods: a novel experimental framework encompassing six state-of-the-art face encoders and two public datasets, designed to measure the extent to which demographic groups suffer exposure biases in forensic face rankings.
  • results: extensive re-identification and identification experiments on both datasets show that exposure biases in this domain are far from countered, demanding ad-hoc policies and corrective measures.
    Abstract Law enforcement regularly faces the challenge of ranking suspects from their facial images. Deep face models aid this process but frequently introduce biases that disproportionately affect certain demographic segments. While bias investigation is common in domains like job candidate ranking, the field of forensic face rankings remains underexplored. In this paper, we propose a novel experimental framework, encompassing six state-of-the-art face encoders and two public data sets, designed to scrutinize the extent to which demographic groups suffer from biases in exposure in the context of forensic face rankings. Through comprehensive experiments that cover both re-identification and identification tasks, we show that exposure biases within this domain are far from being countered, demanding attention towards establishing ad-hoc policies and corrective measures. The source code is available at https://github.com/atzoriandrea/ijcb2023-unfair-face-rankings

GRIP: Generating Interaction Poses Using Latent Consistency and Spatial Cues

  • paper_url: http://arxiv.org/abs/2308.11617
  • repo_url: None
  • paper_authors: Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, Michael J. Black
  • for: modeling realistic hand-object interaction, including the subtle motion of individual fingers, which is central to computer graphics, computer vision, and mixed reality.
  • methods: GRIP, a learning-based method that takes the 3D motion of the body and the object as input and synthesizes realistic motion for both hands before, during, and after interaction. An ANet network first denoises the arm motion; two novel spatio-temporal interaction cues then drive a two-stage inference pipeline that enforces latent temporal consistency (LTC) and refines hand poses to avoid hand-object penetration.
  • results: GRIP upgrades noisy body and object motion sequences to include hand-object interaction; quantitative experiments and perceptual studies show it outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.
    Abstract Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to enforce motion temporal consistency in the latent space (LTC), and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.

Delving into Motion-Aware Matching for Monocular 3D Object Tracking

  • paper_url: http://arxiv.org/abs/2308.11607
  • repo_url: https://github.com/kuanchihhuang/moma-m3t
  • paper_authors: Kuan-Chih Huang, Ming-Hsuan Yang, Yi-Hsuan Tsai
  • for: improving 3D multi-object tracking based on low-cost monocular camera sensors.
  • methods: exploits object motion cues across time frames: motion features describe an object's possible movement relative to all tracklets, a motion transformer models historical tracklets spatio-temporally, and a motion-aware matching module associates historical tracklets with current observations (a simplified matching sketch follows the abstract).
  • results: extensive experiments on the nuScenes and KITTI datasets show competitive performance against state-of-the-art methods; the tracker is flexible and can be plugged into existing image-based 3D object detectors without retraining. Code and models: https://github.com/kuanchihhuang/MoMA-M3T.
    Abstract Recent advances of monocular 3D object detection facilitate the 3D multi-object tracking task based on low-cost camera sensors. In this paper, we find that the motion cue of objects along different time frames is critical in 3D multi-object tracking, which is less explored in existing monocular-based approaches. In this paper, we propose a motion-aware framework for monocular 3D MOT. To this end, we propose MoMA-M3T, a framework that mainly consists of three motion-aware components. First, we represent the possible movement of an object related to all object tracklets in the feature space as its motion features. Then, we further model the historical object tracklet along the time frame in a spatial-temporal perspective via a motion transformer. Finally, we propose a motion-aware matching module to associate historical object tracklets and current observations as final tracking results. We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T achieves competitive performance against state-of-the-art methods. Moreover, the proposed tracker is flexible and can be easily plugged into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T.
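A simplified sketch of the final association step common to tracking-by-detection pipelines like this one: build a cost matrix between motion-propagated track centers and current detections, then solve the assignment with the Hungarian algorithm. The Euclidean cost and the gating threshold are illustrative assumptions, not the paper's learned motion-aware matching module:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_pred_centers, det_centers, max_dist=3.0):
    """track_pred_centers: (T, 3) motion-propagated 3D centers; det_centers: (D, 3).
    Returns (track_idx, det_idx) matches whose cost stays within the gating distance."""
    cost = np.linalg.norm(track_pred_centers[:, None, :] - det_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

tracks = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
dets = np.array([[0.5, 0.1, 0.0], [10.2, -0.1, 0.0], [50.0, 0.0, 0.0]])
print(associate(tracks, dets))  # [(0, 0), (1, 1)]; the far detection starts a new track
```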

GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2308.11605
  • repo_url: None
  • paper_authors: Mainak Singha, Ankit Jha, Biplab Banerjee
  • for: improving visual recognition by combining CLIP with self-supervised learning (SSL), whose naive combination suffers from loss-weighting difficulties and inconsistency among augmented views.
  • methods: a prompt-learning approach that keeps augmented views of an input image similar in a shared image-text embedding space, using learnable image and text projectors on top of CLIP; besides CLIP's contrastive loss, it introduces a visual contrastive loss and a novel prompt consistency loss.
  • results: GOPro outperforms state-of-the-art prompting techniques by a significant margin on three challenging domain generalization tasks across multiple benchmarks, combining the strengths of CLIP and SSL.
    Abstract Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically learn such prompts, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, considering the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms the state-of-the-art prompting techniques on three challenging domain generalization tasks across multiple benchmarks by a significant margin. Our code is available at https://github.com/mainaksingha01/GOPro.

G3Reg: Pyramid Graph-based Global Registration using Gaussian Ellipsoid Model

  • paper_url: http://arxiv.org/abs/2308.11573
  • repo_url: https://github.com/hkust-aerial-robotics/lidar-registration-benchmark
  • paper_authors: Zhijian Qiao, Zehuan Yu, Binqian Jiang, Huan Yin, Shaojie Shen
  • for: proposes G3Reg, a new framework for fast and robust global registration of LiDAR point clouds.
  • methods: extracts basic geometric primitives, including planes, clusters, and lines (PCL), from the raw point cloud as low-level semantic segments, formulates each segment as a unified Gaussian Ellipsoid Model (GEM), and registers scans with a distrust-and-verify scheme built on a Pyramid Compatibility Graph for Global Registration (PAGOR) (a GEM-fitting sketch follows the abstract).
  • results: on three public datasets and a self-collected multi-session dataset, G3Reg shows superior robustness and real-time performance compared to state-of-the-art methods; the individual GEM and PAGOR components can also be integrated into other algorithmic frameworks.
    Abstract This study introduces a novel framework, G3Reg, for fast and robust global registration of LiDAR point clouds. In contrast to conventional complex keypoints and descriptors, we extract fundamental geometric primitives including planes, clusters, and lines (PCL) from the raw point cloud to obtain low-level semantic segments. Each segment is formulated as a unified Gaussian Ellipsoid Model (GEM) by employing a probability ellipsoid to ensure the ground truth centers are encompassed with a certain degree of probability. Utilizing these GEMs, we then present a distrust-and-verify scheme based on a Pyramid Compatibility Graph for Global Registration (PAGOR). Specifically, we establish an upper bound, which can be traversed based on the confidence level for compatibility testing to construct the pyramid graph. Gradually, we solve multiple maximum cliques (MAC) for each level of the graph, generating numerous transformation candidates. In the verification phase, we adopt a precise and efficient metric for point cloud alignment quality, founded on geometric primitives, to identify the optimal candidate. The performance of the algorithm is extensively validated on three publicly available datasets and a self-collected multi-session dataset, without changing any parameter settings in the experimental evaluation. The results exhibit superior robustness and real-time performance of the G3Reg framework compared to state-of-the-art methods. Furthermore, we demonstrate the potential for integrating individual GEM and PAGOR components into other algorithmic frameworks to enhance their efficacy. To advance further research and promote community understanding, we have publicly shared the source code.
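At its simplest, a Gaussian Ellipsoid Model summarizes a segment by the mean and covariance of its points; the probability ellipsoid is a scaled level set of that covariance. A minimal fitting sketch, where the confidence scaling used to cover the true center is an assumed placeholder:

```python
import numpy as np

def fit_gem(points, scale=3.0):
    """points: (N, 3) points of one segment (plane, cluster, or line).
    Returns the ellipsoid center (mean) and a scaled covariance whose level set
    serves as a probabilistic bound on the segment center."""
    center = points.mean(axis=0)
    cov = np.cov(points.T)          # (3, 3) covariance of the segment
    return center, scale * cov

segment = np.random.randn(200, 3) * np.array([2.0, 0.5, 0.1])  # plane-like cluster
center, ellipsoid = fit_gem(segment)
print(center.shape, ellipsoid.shape)  # (3,) (3, 3)
```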

SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation

  • paper_url: http://arxiv.org/abs/2308.11568
  • repo_url: None
  • paper_authors: Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Dong Hwan Kim
  • for: improving model performance on computer vision tasks by balancing the high- and low-frequency components of visual features.
  • methods: uses spectral analysis of token mixers, casts the frequency-balancing problem as a mask filtering problem in the frequency domain, and introduces SPAM, a spectral pooling aggregation modulation token mixer used to build the SPANet MetaFormer model (a frequency-splitting sketch follows the abstract).
  • results: experiments show that balanced representations of both high- and low-frequency components improve model performance on multiple computer vision tasks.
    Abstract Recent studies show that self-attentions behave like low-pass filters (as opposed to convolutions) and enhancing their high-pass filtering capability improves model performance. Contrary to this idea, we investigate existing convolution-based models with spectral analysis and observe that improving the low-pass filtering in convolution operations also leads to performance improvement. To account for this observation, we hypothesize that utilizing optimal token mixers that capture balanced representations of both high- and low-frequency components can enhance the performance of models. We verify this by decomposing visual features into the frequency domain and combining them in a balanced manner. To handle this, we replace the balancing problem with a mask filtering problem in the frequency domain. Then, we introduce a novel token-mixer named SPAM and leverage it to derive a MetaFormer model termed as SPANet. Experimental results show that the proposed method provides a way to achieve this balance, and the balanced representations of both high- and low-frequency components can improve the performance of models on multiple computer vision tasks. Our code is available at $\href{https://doranlyong.github.io/projects/spanet/}{\text{https://doranlyong.github.io/projects/spanet/}$.
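A rough sketch of splitting a feature map into low- and high-frequency components with an FFT mask and recombining them with adjustable weights. The circular low-pass mask and the scalar balance weights are illustrative assumptions, not the SPAM module itself:

```python
import torch

def frequency_balance(feat, cutoff=0.25, w_low=1.0, w_high=1.0):
    """feat: (B, C, H, W). Split into low/high frequencies and re-weight before recombining."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij")
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(feat.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = feat - low
    return w_low * low + w_high * high

x = torch.randn(2, 8, 32, 32)
print(frequency_balance(x, w_low=0.8, w_high=1.2).shape)  # torch.Size([2, 8, 32, 32])
```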

EndoNet: model for automatic calculation of H-score on histological slides

  • paper_url: http://arxiv.org/abs/2308.11562
  • repo_url: None
  • paper_authors: Egor Ushakov, Anton Naumov, Vladislav Fomberg, Polina Vishnyakova, Aleksandra Asaturova, Alina Badlaeva, Anna Tregubova, Evgeny Karpulevich, Gennady Sukhikh, Timur Fatkhudinov
  • for: automatic calculation of the H-score, a semi-quantitative measure of the presence and distribution of proteins in tissue samples.
  • methods: the H-score combines the intensity of staining with the percentage of stained nuclei; the computer-aided model (EndoNet) uses a keypoint-detection network to predict nuclei centers and an H-score module that computes the score from mean pixel values at the predicted keypoints (the standard H-score formula is sketched after the abstract).
  • results: 0.77 mAP on a test dataset; the model can be adjusted to a specific specialist or laboratory to reproduce their manner of calculating H-scores.
    Abstract H-score is a semi-quantitative method used to assess the presence and distribution of proteins in tissue samples by combining the intensity of staining and percentage of stained nuclei. It is widely used but time-consuming and can be limited in accuracy and precision. Computer-aided methods may help overcome these limitations and improve the efficiency of pathologists' workflows. In this work, we developed a model EndoNet for automatic calculation of H-score on histological slides. Our proposed method uses neural networks and consists of two main parts. The first is a detection model which predicts keypoints of centers of nuclei. The second is a H-score module which calculates the value of the H-score using mean pixel values of predicted keypoints. Our model was trained and validated on 1780 annotated tiles with a shape of 100x100 $\mu m$ and performed 0.77 mAP on a test dataset. Moreover, the model can be adjusted to a specific specialist or whole laboratory to reproduce the manner of calculating the H-score. Thus, EndoNet is effective and robust in the analysis of histology slides, which can improve and significantly accelerate the work of pathologists.
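For reference, the conventional H-score combines the percentage of nuclei at each staining intensity level (0 to 3) into a single value between 0 and 300. A direct implementation of that standard formula, independent of EndoNet's specific detection pipeline:

```python
def h_score(intensity_labels):
    """intensity_labels: iterable of per-nucleus staining intensities in {0, 1, 2, 3}.
    H-score = 1*(% weak) + 2*(% moderate) + 3*(% strong), giving a value in [0, 300]."""
    labels = list(intensity_labels)
    n = len(labels)
    if n == 0:
        raise ValueError("no nuclei detected")
    pct = {k: 100.0 * labels.count(k) / n for k in (1, 2, 3)}
    return 1 * pct[1] + 2 * pct[2] + 3 * pct[3]

# 50% negative, 20% weak, 20% moderate, 10% strong nuclei -> H-score of 90.
print(h_score([0] * 50 + [1] * 20 + [2] * 20 + [3] * 10))  # 90.0
```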

Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

  • paper_url: http://arxiv.org/abs/2308.11561
  • repo_url: https://github.com/yifeisu/avdn-challenge
  • paper_authors: Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
  • for: proposes a Target-Grounded Graph-Aware Transformer (TG-GAT) framework to improve a drone agent's cross-modal grounding in the Aerial Navigation from Dialog History (ANDH) task.
  • methods: a graph-aware transformer captures spatio-temporal dependencies for navigation state tracking and robust action planning; an auxiliary visual grounding task boosts the agent's awareness of referred landmarks; and a hybrid augmentation strategy based on large language models mitigates data scarcity.
  • results: won the AVDN Challenge at ICCV 2023, with absolute improvements of 2.2% (SPL) and 3.0% (SR) over the baseline. Code: https://github.com/yifeisu/avdn-challenge.
    Abstract This report details the method of the winning entry of the AVDN Challenge in ICCV 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge 2023, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/avdn-challenge.

Open Set Synthetic Image Source Attribution

  • paper_url: http://arxiv.org/abs/2308.11557
  • repo_url: None
  • paper_authors: Shengbang Fang, Tai D. Nguyen, Matthew C. Stamm
  • for: developing a metric learning-based open-set source attribution method that can identify when synthetic images originate from new, unseen generators.
  • methods: learns transferrable embeddings that discriminate between generators even when they are not seen during training; an image is assigned to a candidate generator and then accepted or rejected based on its distance in embedding space from known generators' learned reference points, with pretraining on camera identification improving transferability (an accept/reject sketch follows the abstract).
  • results: experiments show the approach attributes the source of synthetic images effectively in open-set scenarios.
    Abstract AI-generated images have become increasingly realistic and have garnered significant public attention. While synthetic images are intriguing due to their realism, they also pose an important misinformation threat. To address this new threat, researchers have developed multiple algorithms to detect synthetic images and identify their source generators. However, most existing source attribution techniques are designed to operate in a closed-set scenario, i.e. they can only be used to discriminate between known image generators. By contrast, new image-generation techniques are rapidly emerging. To contend with this, there is a great need for open-set source attribution techniques that can identify when synthetic images have originated from new, unseen generators. To address this problem, we propose a new metric learning-based approach. Our technique works by learning transferrable embeddings capable of discriminating between generators, even when they are not seen during training. An image is first assigned to a candidate generator, then is accepted or rejected based on its distance in the embedding space from known generators' learned reference points. Importantly, we identify that initializing our source attribution embedding network by pretraining it on image camera identification can improve our embeddings' transferability. Through a series of experiments, we demonstrate our approach's ability to attribute the source of synthetic images in open-set scenarios.
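A tiny sketch of the accept/reject rule described above: assign a query embedding to its nearest known generator's reference point, then reject it as coming from an unknown generator if the distance exceeds a threshold. The embeddings, reference points, and threshold value below are placeholders:

```python
import numpy as np

def attribute_source(query_emb, reference_points, threshold=1.0):
    """reference_points: dict {generator_name: (D,) learned reference embedding}."""
    names = list(reference_points)
    dists = np.array([np.linalg.norm(query_emb - reference_points[n]) for n in names])
    best = int(dists.argmin())
    if dists[best] > threshold:
        return "unknown_generator", float(dists[best])
    return names[best], float(dists[best])

refs = {"gan_a": np.zeros(128), "diffusion_b": np.ones(128)}
print(attribute_source(np.zeros(128) + 0.05, refs))   # close to gan_a -> accepted
print(attribute_source(np.full(128, 10.0), refs))     # far from all -> unknown_generator
```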

Multi-event Video-Text Retrieval

  • paper_url: http://arxiv.org/abs/2308.11551
  • repo_url: https://github.com/gengyuanmax/mevtr
  • paper_authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
  • for: addresses video-text retrieval in the practical setting where a video contains multiple events while texts such as user queries or webpage metadata are specific and typically describe a single event.
  • methods: a simple model, Me-Retriever, that combines key-event video representations with a new MeVTR loss designed for the Multi-event Video-Text Retrieval task.
  • results: experiments show this straightforward framework outperforms other models on both Video-to-Text and Text-to-Video retrieval, establishing a robust baseline for MeVTR. Code: https://github.com/gengyuanmax/MeVTR.
    Abstract Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.

cs.AI - 2023-08-23

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

  • paper_url: http://arxiv.org/abs/2308.12213
  • repo_url: https://github.com/xmed-lab/clipn
  • paper_authors: Hualiang Wang, Yi Li, Huifeng Yao, Xiaomeng Li
  • for: developing a novel method for zero-shot out-of-distribution (OOD) detection using CLIP, equipping it to distinguish in-distribution (ID) from OOD samples with positive-semantic and negation-semantic prompts.
  • methods: the proposed CLIP saying no (CLIPN) adds a novel learnable "no" prompt and a "no" text encoder to capture negation semantics within images; two loss functions (an image-text binary-opposite loss and a text semantic-opposite loss) teach CLIPN to associate images with "no" prompts so it can identify unknown samples, and two threshold-free inference algorithms are proposed for OOD detection.
  • results: based on ViT-B-16, CLIPN outperforms 7 well-used algorithms by at least 2.34% (AUROC) and 11.64% (FPR95) for zero-shot OOD detection on ImageNet-1K. The code is available on GitHub.
    Abstract Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which only require class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers the logic of saying no within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples using positive-semantic prompts and negation-semantic prompts. Specifically, we design a novel learnable no prompt and a no text encoder to capture negation semantics within images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with no prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection by utilizing negation semantics from no prompts and the text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN, based on ViT-B-16, outperforms 7 well-used algorithms by at least 2.34% and 11.64% in terms of AUROC and FPR95 for zero-shot OOD detection on ImageNet-1K. Our CLIPN can serve as a solid foundation for effectively leveraging CLIP in downstream OOD tasks. The code is available on https://github.com/xmed-lab/CLIPN.

Learning to Learn Financial Networks for Optimising Momentum Strategies

  • paper_url: http://arxiv.org/abs/2308.12212
  • repo_url: None
  • paper_authors: Xingyue Pu, Stefan Zohren, Stephen Roberts, Xiaowen Dong
  • for: provides a new type of risk premium, network momentum, which exploits the interconnections among assets in a financial network to predict future returns.
  • methods: L2GMOM, an end-to-end machine learning framework that simultaneously learns the financial network and optimizes trading signals for network momentum strategies; the model is a neural network with a highly interpretable forward-propagation architecture derived from algorithm unrolling and can be trained with diverse portfolio-performance losses such as the negative Sharpe ratio.
  • results: backtesting on 64 continuous futures contracts over a 20-year period shows significant improvements in portfolio profitability and risk control, with a Sharpe ratio of 1.74.
    Abstract Network momentum provides a novel type of risk premium, which exploits the interconnections among assets in a financial network to predict future returns. However, the current process of constructing financial networks relies heavily on expensive databases and financial expertise, limiting accessibility for small-sized and academic institutions. Furthermore, the traditional approach treats network construction and portfolio optimisation as separate tasks, potentially hindering optimal portfolio performance. To address these challenges, we propose L2GMOM, an end-to-end machine learning framework that simultaneously learns financial networks and optimises trading signals for network momentum strategies. The model of L2GMOM is a neural network with a highly interpretable forward propagation architecture, which is derived from algorithm unrolling. The L2GMOM is flexible and can be trained with diverse loss functions for portfolio performance, e.g. the negative Sharpe ratio. Backtesting on 64 continuous future contracts demonstrates a significant improvement in portfolio profitability and risk control, with a Sharpe ratio of 1.74 across a 20-year period.

Robustness Analysis of Continuous-Depth Models with Lagrangian Techniques

  • paper_url: http://arxiv.org/abs/2308.12192
  • repo_url: None
  • paper_authors: Sophie A. Neubauer, Radu Grosu
  • for: presents, in a unified fashion, deterministic and statistical Lagrangian verification techniques that formally quantify the behavioral robustness of any time-continuous process formulated as a continuous-depth model.
  • methods: reviews the LRT-NG, SLR, and GoTube algorithms for constructing a tight reachtube, i.e., an over-approximation of the set of states reachable within a given time horizon, and compares how variational equations, the mean value theorem, and Lipschitz constants yield deterministic and statistical guarantees on the reachtube bounds (a Lipschitz-bloating sketch follows the abstract).
  • results: experiments demonstrate the superior performance of the Lagrangian techniques compared with LRT, Flow*, and CAPD, and illustrate their use in the robustness analysis of various continuous-depth models.
    Abstract This paper presents, in a unified fashion, deterministic as well as statistical Lagrangian-verification techniques. They formally quantify the behavioral robustness of any time-continuous process, formulated as a continuous-depth model. To this end, we review LRT-NG, SLR, and GoTube, algorithms for constructing a tight reachtube, that is, an over-approximation of the set of states reachable within a given time-horizon, and provide guarantees for the reachtube bounds. We compare the usage of the variational equations, associated to the system equations, the mean value theorem, and the Lipschitz constants, in achieving deterministic and statistical guarantees. In LRT-NG, the Lipschitz constant is used as a bloating factor of the initial perturbation, to compute the radius of an ellipsoid in an optimal metric, which over-approximates the set of reachable states. In SLR and GoTube, we get statistical guarantees, by using the Lipschitz constants to compute local balls around samples. These are needed to calculate the probability of having found an upper bound, of the true maximum perturbation at every timestep. Our experiments demonstrate the superior performance of Lagrangian techniques, when compared to LRT, Flow*, and CAPD, and illustrate their use in the robustness analysis of various continuous-depth models.
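The deterministic side of such guarantees rests on the classical Lipschitz bound: an initial perturbation of radius delta_0 can grow at most like delta(t) <= delta_0 * exp(L * t) when L bounds the growth rate of the flow. A direct sketch of using that worst-case bound to bloat a reachtube around a numerically integrated center trajectory; the toy dynamics and the constant L are assumptions, and the actual LRT-NG/GoTube machinery is considerably tighter:

```python
import numpy as np

def lipschitz_reachtube(center_traj, delta0, lipschitz_const, dt):
    """center_traj: (T, d) integrated center states. Returns per-step ball radii that
    over-approximate all states reachable from the initial ball of radius delta0."""
    times = np.arange(len(center_traj)) * dt
    return delta0 * np.exp(lipschitz_const * times)

# Toy: damped oscillator integrated with Euler steps; L is an assumed bound on the Jacobian norm.
dt, steps = 0.01, 300
x, traj = np.array([1.0, 0.0]), []
for _ in range(steps):
    traj.append(x.copy())
    x = x + dt * np.array([x[1], -x[0] - 0.1 * x[1]])
radii = lipschitz_reachtube(np.array(traj), delta0=0.05, lipschitz_const=1.1, dt=dt)
print(radii[0], radii[-1])  # initial vs. worst-case bloated radius at the horizon
```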

Unsupervised anomalies detection in IIoT edge devices networks using federated learning

  • paper_url: http://arxiv.org/abs/2308.12175
  • repo_url: None
  • paper_authors: Niyomukiza Thamar, Hossam Samy Elsaid Sharara
  • for: solving the privacy problem for IoT/IIoT devices that hold sensitive data on behalf of their owners.
  • methods: federated learning (FL) as a distributed machine learning approach, specifically the FedAvg algorithm, in which participating devices train locally and a coordinating server averages the returned models (an averaging sketch follows the abstract).
  • results: anomaly-detection performance is almost the same as the centralized machine learning approach, with the added benefit of addressing privacy concerns; the paper also notes unfairness in FedAvg when struggling devices miss training rounds and proposes a Fair FedAvg variant for future work.
    Abstract In a connection of many IoT devices that each collect data, normally training a machine learning model would involve transmitting the data to a central server which requires strict privacy rules. However, some owners are reluctant of availing their data out of the company due to data security concerns. Federated learning(FL) as a distributed machine learning approach performs training of a machine learning model on the device that gathered the data itself. In this scenario, data is not share over the network for training purpose. Fedavg as one of FL algorithms permits a model to be copied to participating devices during a training session. The devices could be chosen at random, and a device can be aborted. The resulting models are sent to the coordinating server and then average models from the devices that finished training. The process is repeated until a desired model accuracy is achieved. By doing this, FL approach solves the privacy problem for IoT/ IIoT devices that held sensitive data for the owners. In this paper, we leverage the benefits of FL and implemented Fedavg algorithm on a recent dataset that represent the modern IoT/ IIoT device networks. The results were almost the same as the centralized machine learning approach. We also evaluated some shortcomings of Fedavg such as unfairness that happens during the training when struggling devices do not participate for every stage of training. This inefficient training of local or global model could lead in a high number of false alarms in intrusion detection systems for IoT/IIoT gadgets developed using Fedavg. Hence, after evaluating the FedAv deep auto encoder with centralized deep auto encoder ML, we further proposed and designed a Fair Fedavg algorithm that will be evaluated in the future work.
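FedAvg itself is simple to state: each round, selected clients train locally and the server replaces the global weights with a data-size-weighted average of the returned weights. A minimal sketch of the aggregation step with PyTorch state_dicts; the client selection, local training, and data counts are toy placeholders:

```python
import copy
import torch.nn as nn

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client model state_dicts (weights proportional to local data size)."""
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key] * (n / total) for s, n in zip(client_states, client_sizes))
    return avg

global_model = nn.Linear(10, 2)
# Pretend three IIoT edge devices each trained locally and sent their weights back.
client_states = [copy.deepcopy(global_model.state_dict()) for _ in range(3)]
client_sizes = [1200, 300, 500]
global_model.load_state_dict(fedavg_aggregate(client_states, client_sizes))
print("aggregation round complete")
```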

Evaluation of Faithfulness Using the Longest Supported Subsequence

  • paper_url: http://arxiv.org/abs/2308.12157
  • repo_url: None
  • paper_authors: Anirudh Mittal, Timo Schick, Mikel Artetxe, Jane Dwivedi-Yu
  • for: evaluating the trustworthiness of machine-generated text, specifically in tasks such as summarization and question-answering
  • methods: introducing a novel approach called the Longest Supported Subsequence (LSS) to compute the faithfulness of machine-generated text, and finetuning a model to generate LSS using a new human-annotated dataset
  • results: demonstrating that the proposed metric correlates better with human ratings than prevailing state-of-the-art metrics, with an 18% enhancement in faithfulness on the dataset, and consistently outperforming other metrics on a summarization dataset across six different models, as well as comparing several popular Large Language Models (LLMs) for faithfulness using this metric.
    Abstract As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question-answering. Ensuring their responses are contextually grounded and faithful is challenging due to the linguistic diversity and the myriad of possible answers. In this paper, we introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context, which we refer to as the Longest Supported Subsequence (LSS). Using a new human-annotated dataset, we finetune a model to generate LSS. We introduce a new method of evaluation and demonstrate that these metrics correlate better with human ratings when LSS is employed, as opposed to when it is not. Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset. Our metric consistently outperforms other metrics on a summarization dataset across six different models. Finally, we compare several popular Large Language Models (LLMs) for faithfulness using this metric. We release the human-annotated dataset built for predicting LSS and our fine-tuned model for evaluating faithfulness.
    摘要 “随着越来越进步的语言模型出现,它们的可靠性成为一个关键的问题,特别是在摘要和问答中。确保它们的回答是基于上下文的,并不是单纯地根据语言模型的假设,是一个具有挑战性的任务。在这篇论文中,我们提出了一种新的方法来评估机器生成的文本的可靠性,通过计算文本中最长的不连续子串,我们称之为“最长支持子串”(LSS)。我们使用了一个新的人类验证数据集,调整了一个模型以生成LSS,并导入了一个新的评估方法。我们示示了这些指标与人类评分更加相似,而且在摘要数据集上,我们的提案的指标与现有的指标相比,有18%的提升。我们的指标在六个不同的模型上的表现都与其他指标相比较高。最后,我们使用这个指标评估了一些流行的大型语言模型的可靠性。我们发布了我们建立的人类验证数据集和调整后的模型,以便用于评估可靠性。”
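
The Longest Supported Subsequence is the longest, possibly noncontinuous, subsequence of the claim that the context supports. The paper finetunes a model to produce it, but the underlying idea can be illustrated at the lexical level with a longest-common-subsequence dynamic program; the normalisation by claim length in the sketch below is our assumption, not the paper's metric.

```python
def longest_supported_subsequence(claim_tokens, context_tokens):
    """Length of the longest (possibly noncontinuous) claim subsequence found in the context.

    Standard LCS dynamic program; a lexical stand-in for the learned LSS model.
    """
    m, n = len(claim_tokens), len(context_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if claim_tokens[i - 1] == context_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

claim = "the team won the cup in 2020".split()
context = "in 2020 the team unexpectedly won the national cup".split()
lss_len = longest_supported_subsequence(claim, context)
faithfulness = lss_len / len(claim)   # assumed normalisation, not from the paper
print(lss_len, round(faithfulness, 2))
```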

Multimodal Latent Emotion Recognition from Micro-expression and Physiological Signals

  • paper_url: http://arxiv.org/abs/2308.12156
  • repo_url: None
  • paper_authors: Liangfei Zhang, Yifei Qian, Ognjen Arandjelovic, Anthony Zhu
  • for: Improving the accuracy of latent (hidden) emotion recognition
  • methods: Combines micro-expressions (ME) and physiological signals (PS) using a 1D separable and mixable depthwise inception network, a standardised normal distribution weighted feature fusion method, and depth/physiology guided attention modules
  • results: Outperforms the benchmark method, with both the weighted fusion method and the guided attention modules contributing to the improvement
    Abstract This paper discusses the benefits of incorporating multimodal data for improving latent emotion recognition accuracy, focusing on micro-expression (ME) and physiological signals (PS). The proposed approach presents a novel multimodal learning framework that combines ME and PS, including a 1D separable and mixable depthwise inception network, a standardised normal distribution weighted feature fusion method, and depth/physiology guided attention modules for multimodal learning. Experimental results show that the proposed approach outperforms the benchmark method, with the weighted fusion method and guided attention modules both contributing to enhanced performance.
    摘要 这篇论文介绍了通过多模式数据的汇入来提高潜在情绪识别精度,特点在微表情(ME)和生理信号(PS)之间。提议的方法框架组合了ME和PS,包括一个可分离的深度wise嵌入网络,一种标准化正态分布权重特征合并方法,以及深度/生理学引导注意模块用于多模式学习。实验结果显示,提议的方法在比较方法上表现出色,权重合并方法和引导注意模块都对精度提高做出了贡献。
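
The abstract does not spell out the exact form of the standardised normal distribution weighted fusion, so the sketch below is only one plausible reading: standardise each modality's features and mix them with learnable softmax weights. The feature shapes and module names are hypothetical.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Standardise each modality's features and mix them with learnable weights.

    Illustrative only; the paper's fusion rule may differ in detail.
    """
    def __init__(self, num_modalities=2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, features):          # list of (batch, dim) tensors
        weights = torch.softmax(self.logits, dim=0)
        normed = [(f - f.mean(dim=0)) / (f.std(dim=0) + 1e-6) for f in features]
        return sum(w * f for w, f in zip(weights, normed))

fusion = WeightedFusion()
me_feat = torch.randn(8, 128)   # micro-expression features (hypothetical)
ps_feat = torch.randn(8, 128)   # physiological-signal features (hypothetical)
fused = fusion([me_feat, ps_feat])
print(fused.shape)              # torch.Size([8, 128])
```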

A Probabilistic Fluctuation based Membership Inference Attack for Generative Models

  • paper_url: http://arxiv.org/abs/2308.12143
  • repo_url: None
  • paper_authors: Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
  • for: This work studies membership inference attacks (MIA) against generative models and proposes a probabilistic-fluctuation-based membership inference attack (PFAMI).
  • methods: PFAMI builds on the memorization effect in generative models and infers membership by analyzing the probabilistic fluctuations around generated records.
  • results: Extensive experiments across multiple generative models and datasets show that PFAMI improves the attack success rate (ASR) by about 27.9% compared with the best baseline.
    Abstract Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on the classic classification models have been well-studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on the overfitting in target models. However, overfitting can be avoided by employing various regularization techniques, whereas existing MIAs demonstrate poor performance in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models leads to an increasing trend in the probability distribution of generating records around the member record. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers memberships by detecting these trends via analyzing the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate PFAMI can improve the attack success rate (ASR) by about 27.9% when compared with the best baseline.
    摘要 机制成员攻击(MIA)可以决定一个记录是否在机器学习模型的训练集中,通过询问模型。过往的研究主要集中在传统的分类模型上,而现在的研究则开始对生成模型进行应用。我们的研究显示,现有的生成模型MIA主要依赖目标模型的过滤。然而,过滤可以使用多种正规化技术来避免,而现有的MIA实际上却表现不佳。不同的过滤,记忆是深度学习模型所需的一种基本现象,它会使模型在实际应用中表现更好。记忆在生成模型中导致生成记录的概率分布增加,因此我们提出了一个概率波动评估机制成员攻击(PFAMI),这是一种黑盒子MIA,可以通过分析givens record的概率波动来决定成员。我们进行了多种生成模型和数据集的广泛实验,结果显示,PFAMI可以提高攻击成功率(ASR)约27.9%,相比最佳基eline。
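
The intuition behind PFAMI is that a memorised (member) record sits on a local peak of the generative model's probability, so comparing the record's approximate log-likelihood with that of slightly perturbed copies reveals membership. The sketch below is schematic: the log-likelihood and perturbation interfaces are hypothetical, and the paper's actual scoring rule is more elaborate.

```python
import numpy as np

def fluctuation_score(log_prob, perturb, record, num_neighbors=16, rng=None):
    """Estimate how strongly `record` sits on a local probability peak.

    log_prob(x): black-box (approximate) log-likelihood of the generative model.
    perturb(x, rng): returns a slightly perturbed copy of x.
    A large positive score (record more likely than its neighbours) is taken
    as evidence of memorisation, i.e. membership.
    """
    rng = rng or np.random.default_rng(0)
    center = log_prob(record)
    neighbors = [log_prob(perturb(record, rng)) for _ in range(num_neighbors)]
    return center - float(np.mean(neighbors))

# Toy example: a fixed Gaussian "model" and additive-noise perturbations.
log_prob = lambda x: -0.5 * float(np.sum(x ** 2))
perturb = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
candidate = np.zeros(4)                 # lies exactly on the density peak
print(fluctuation_score(log_prob, perturb, candidate) > 0)   # True
```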

Semantic Change Detection for the Romanian Language

  • paper_url: http://arxiv.org/abs/2308.12131
  • repo_url: https://github.com/ds4ai-upb/semanticchange-ro
  • paper_authors: Ciprian-Octavian Truică, Victor Tudose, Elena-Simona Apostol
  • for: This study analyzes automatic semantic change methods that track changes in word meaning over time, applied to real-world English and Romanian corpora.
  • methods: Uses a static and a contextual word embedding model (Word2Vec and ELMo), first evaluating both on an English dataset, then experimenting on a Romanian dataset and highlighting different aspects of semantic change in this low-resource language, such as meaning acquisition and loss.
  • results: Experiments show that, depending on the corpus, the choice of model and of the distance used to score semantic change are the most important factors.
    Abstract Automatic semantic change methods try to identify the changes that appear over time in the meaning of words by analyzing their usage in diachronic corpora. In this paper, we analyze different strategies to create static and contextual word embedding models, i.e., Word2Vec and ELMo, on real-world English and Romanian datasets. To test our pipeline and determine the performance of our models, we first evaluate both word embedding models on an English dataset (SEMEVAL-CCOHA). Afterward, we focus our experiments on a Romanian dataset, and we underline different aspects of semantic changes in this low-resource language, such as meaning acquisition and loss. The experimental results show that, depending on the corpus, the most important factors to consider are the choice of model and the distance to calculate a score for detecting semantic change.
    摘要 自动 semantic change 方法试图通过分析在时间上的使用情况来识别词语的意义变化。在这篇论文中,我们分析了不同的策略来创建静态和 контекст word embedding 模型,即 Word2Vec 和 ELMo,在实际的英语和罗马尼亚数据集上。为了测试我们的管道和确定模型的表现,我们首先评估了这两种 word embedding 模型在英语数据集(SEMEVAL-CCOHA)上。接着,我们将注意力集中在罗马尼亚数据集上,并强调不同的 semantics 变化方面,如 meaning acquisition 和 loss。实验结果表明,具体取决于 corpus,最重要的因素是选择模型和计算分数的距离。
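
With static embeddings, a standard recipe consistent with the paper's emphasis on model and distance choice is: train one Word2Vec model per time period, align the two spaces with orthogonal Procrustes over the shared vocabulary, and score a word's change by the cosine distance between its aligned vectors. A condensed sketch with gensim, where toy corpora stand in for the diachronic corpora:

```python
import numpy as np
from gensim.models import Word2Vec

# Two toy diachronic corpora; in practice these are the period-specific corpora.
corpus_old = [["the", "mouse", "ran", "across", "the", "floor"]] * 50
corpus_new = [["click", "the", "mouse", "to", "open", "the", "menu"]] * 50

m_old = Word2Vec(corpus_old, vector_size=50, min_count=1, seed=1)
m_new = Word2Vec(corpus_new, vector_size=50, min_count=1, seed=1)

# Align the two spaces with orthogonal Procrustes over the shared vocabulary.
shared = sorted(set(m_old.wv.index_to_key) & set(m_new.wv.index_to_key))
A = np.stack([m_old.wv[w] for w in shared])
B = np.stack([m_new.wv[w] for w in shared])
u, _, vt = np.linalg.svd(B.T @ A)
R = u @ vt                                    # rotation mapping new -> old space

def change_score(word):
    """Cosine distance between a word's aligned vectors in the two periods."""
    a, b = m_old.wv[word], m_new.wv[word] @ R
    return 1 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(change_score("mouse"), 3))        # larger distance = more change
```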

Masking Strategies for Background Bias Removal in Computer Vision Models

  • paper_url: http://arxiv.org/abs/2308.12127
  • repo_url: https://github.com/ananthu-aniraj/masking_strategies_bias_removal
  • paper_authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
  • for: This work investigates background-induced bias in fine-grained image classification and how masking strategies can mitigate it.
  • methods: Evaluates standard Convolutional Neural Network (CNN) and Vision Transformer (ViT) backbones under two masking strategies designed to remove background information.
  • results: Both masking strategies improve robustness to out-of-distribution backgrounds, with a ViT variant using GAP-pooled patch-token classification combined with early masking achieving the best results.
    Abstract Models for fine-grained image classification tasks, where the difference between some classes can be extremely subtle and the number of samples per class tends to be low, are particularly prone to picking up background-related biases and demand robust methods to handle potential examples with out-of-distribution (OOD) backgrounds. To gain deeper insights into this critical problem, our research investigates the impact of background-induced bias on fine-grained image classification, evaluating standard backbone models such as Convolutional Neural Network (CNN) and Vision Transformers (ViT). We explore two masking strategies to mitigate background-induced bias: Early masking, which removes background information at the (input) image level, and late masking, which selectively masks high-level spatial features corresponding to the background. Extensive experiments assess the behavior of CNN and ViT models under different masking strategies, with a focus on their generalization to OOD backgrounds. The obtained findings demonstrate that both proposed strategies enhance OOD performance compared to the baseline models, with early masking consistently exhibiting the best OOD performance. Notably, a ViT variant employing GAP-Pooled Patch token-based classification combined with early masking achieves the highest OOD robustness.
    摘要 模型 для细化图像分类任务,其中一些类别之间的差别可能很小,而每个类别的样本数也很少,容易受到背景相关的偏见。为了更深入地理解这个重要问题,我们的研究探讨了背景引起的偏见对细化图像分类的影响,并评估了标准的背景模型,如卷积神经网络(CNN)和视Transformers(ViT)。我们研究了两种遮盾策略来减轻背景引起的偏见:早期遮盾,即在输入图像水平上移除背景信息,以及晚期遮盾,即在高级空间特征水平上选择性地遮盾背景相关的特征。我们进行了广泛的实验,评估不同遮盾策略对CNN和ViT模型的影响,尤其是对于不同的背景。结果显示,我们所提出的两种遮盾策略都能提高对于不同背景的性能,而早期遮盾一直保持最好的OOD性能。另外,一种基于GAP-Pooled Patch token的ViT变体,结合早期遮盾,达到了最高的OOD Robustness。
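
Early masking operates at the input level: background pixels are removed before the image reaches the CNN or ViT backbone. Assuming a foreground segmentation mask is available (how the mask is obtained is outside this sketch), the operation is an element-wise product:

```python
import torch

def early_mask(images, fg_masks):
    """Zero out background pixels before feeding the backbone.

    images:   (batch, 3, H, W) float tensor
    fg_masks: (batch, 1, H, W) tensor with 1 for foreground, 0 for background
    """
    return images * fg_masks

images = torch.rand(4, 3, 224, 224)
fg_masks = (torch.rand(4, 1, 224, 224) > 0.5).float()   # placeholder masks
masked = early_mask(images, fg_masks)
# masked can now be passed to any CNN or ViT backbone unchanged;
# late masking would instead zero background positions in the feature map.
print(masked.shape)
```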

Quantifying degeneracy in singular models via the learning coefficient

  • paper_url: http://arxiv.org/abs/2308.12108
  • repo_url: https://github.com/edmundlth/scalable_learning_coefficient_with_sgld
  • paper_authors: Edmund Lau, Daniel Murfet, Susan Wei
  • for: This paper is written to explore the concept of degeneracy in deep neural networks (DNN) and to develop a method for quantifying the degree of degeneracy using a quantity called the “learning coefficient”.
  • methods: The paper uses singular learning theory and stochastic gradient Langevin dynamics to develop a computationally scalable approximation of the localized learning coefficient.
  • results: The paper demonstrates the accuracy of the proposed approach in low-dimensional models with known theoretical values, and shows that the local learning coefficient can correctly recover the ordering of degeneracy between various parameter regions of interest. Additionally, the paper demonstrates the ability of the local learning coefficient to reveal the inductive bias of stochastic optimizers for more or less degenerate critical points using an experiment on the MNIST dataset.
    Abstract Deep neural networks (DNN) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the \emph{learning coefficient} introduced in singular learning theory quantifies precisely the degree of degeneracy in deep neural networks. Importantly, we will demonstrate that degeneracy in DNN cannot be accounted for by simply counting the number of "flat" directions. We propose a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics. To validate our approach, we demonstrate its accuracy in low-dimensional models with known theoretical values. Importantly, the local learning coefficient can correctly recover the ordering of degeneracy between various parameter regions of interest. An experiment on MNIST shows the local learning coefficient can reveal the inductive bias of stochastic optimizers for more or less degenerate critical points.
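
The localized learning coefficient can be estimated from SGLD samples drawn near a trained parameter w*: roughly, the estimate is proportional to the gap between the average tempered-posterior loss and the loss at w* (a WBIC-style estimator). The sketch below follows that general recipe on a toy singular loss; the constants, localisation term, and hyperparameters are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def local_learning_coefficient(loss, grad, w_star, n, steps=2000, lr=1e-5,
                               gamma=1.0, seed=0):
    """WBIC-style estimate of the local learning coefficient at w_star.

    loss(w): empirical loss L_n(w); grad(w): its gradient.
    SGLD samples are kept near w_star by a quadratic localisation term.
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)                    # inverse temperature
    w = w_star.copy()
    running = []
    for _ in range(steps):
        drift = n * beta * grad(w) + gamma * (w - w_star)
        w = w - 0.5 * lr * drift + np.sqrt(lr) * rng.standard_normal(w.shape)
        running.append(loss(w))
    return n * beta * (np.mean(running) - loss(w_star))

# Toy singular model: L_n(w) = w1^2 * w2^2 (its learning coefficient is 1/2).
loss = lambda w: (w[0] ** 2) * (w[1] ** 2)
grad = lambda w: np.array([2 * w[0] * w[1] ** 2, 2 * w[0] ** 2 * w[1]])
print(local_learning_coefficient(loss, grad, np.zeros(2), n=1000))
```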

Out of the Cage: How Stochastic Parrots Win in Cyber Security Environments

  • paper_url: http://arxiv.org/abs/2308.12086
  • repo_url: None
  • paper_authors: Maria Rigaki, Ondřej Lukáš, Carlos A. Catania, Sebastian Garcia
  • for: This paper focuses on using pre-trained language models (LLMs) as agents in cybersecurity network environments for sequential decision-making processes.
  • methods: The authors propose using pre-trained LLMs as attacking agents in two reinforcement learning environments and compare their performance to state-of-the-art agents and human testers.
  • results: The LLM agents demonstrate similar or better performance than state-of-the-art agents in most scenarios and configurations, and the best LLM agents perform similarly to human testers without any additional training. This suggests that LLMs have the potential to efficiently address complex decision-making tasks within cybersecurity.
    Abstract Large Language Models (LLMs) have gained widespread popularity across diverse domains involving text generation, summarization, and various natural language processing tasks. Despite their inherent limitations, LLM-based designs have shown promising capabilities in planning and navigating open-world scenarios. This paper introduces a novel application of pre-trained LLMs as agents within cybersecurity network environments, focusing on their utility for sequential decision-making processes. We present an approach wherein pre-trained LLMs are leveraged as attacking agents in two reinforcement learning environments. Our proposed agents demonstrate similar or better performance against state-of-the-art agents trained for thousands of episodes in most scenarios and configurations. In addition, the best LLM agents perform similarly to human testers of the environment without any additional training process. This design highlights the potential of LLMs to efficiently address complex decision-making tasks within cybersecurity. Furthermore, we introduce a new network security environment named NetSecGame. The environment is designed to eventually support complex multi-agent scenarios within the network security domain. The proposed environment mimics real network attacks and is designed to be highly modular and adaptable for various scenarios.
    摘要 大语言模型(LLM)在多种自然语言处理任务中得到了广泛的推广,包括文本生成、摘要和各种自然语言处理任务。尽管它们有自然的限制,但LLM基本设计在开放世界enario中的规划和导航方面表现了扎实的能力。本文介绍了一种使用预训练LLM作为网络安全环境中的代理人,关注它们在顺序决策过程中的使用。我们提出了一种方法,其中预训练LLM被用作攻击者在两个循环学习环境中。我们的提议代理人在大多数情况下和现有EPisode数千个话的代理人之间表现相似或更好。此外,我们的最佳LLM代理人在没有任何额外训练过程的情况下与人类测试者的性能相似。这种设计高亮了LLM在网络安全中的潜在能力。此外,我们介绍了一个新的网络安全环境名为NetSecGame。该环境旨在最终支持复杂多代理人场景在网络安全领域。我们的设计模仿了实际网络攻击,并设计为高度可组合和可调整的多种场景。
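
Using a pre-trained LLM as the attacking agent amounts to a prompting loop: serialise the observation and the valid actions into a prompt, parse the model's reply into an action, and step the environment. The sketch below assumes a gym-style environment interface and a hypothetical query_llm function; NetSecGame's real observation and action spaces are richer.

```python
def format_prompt(observation, valid_actions, history):
    """Serialise the current network-security state for the LLM."""
    return (
        "You are a network attacker. Current state:\n"
        f"{observation}\n"
        f"Previous actions: {history}\n"
        f"Choose exactly one action from: {valid_actions}\n"
    )

def run_episode(env, query_llm, max_steps=30):
    """Let a pre-trained LLM act as the attacking agent, with no fine-tuning."""
    observation = env.reset()
    history, total_reward = [], 0.0
    for _ in range(max_steps):
        actions = env.valid_actions(observation)
        reply = query_llm(format_prompt(observation, actions, history))
        # Fall back to the first valid action if the reply cannot be parsed.
        action = reply.strip() if reply.strip() in actions else actions[0]
        observation, reward, done = env.step(action)
        history.append(action)
        total_reward += reward
        if done:
            break
    return total_reward
```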

Stabilizing RNN Gradients through Pre-training

  • paper_url: http://arxiv.org/abs/2308.12075
  • repo_url: None
  • paper_authors: Luca Herranz-Celotti, Jean Rouat
  • for: This paper aims to improve the stability of deep neural networks during training, particularly for complex networks that are difficult to analyze analytically.
  • methods: The authors propose a new approach called the Local Stability Condition (LSC) to stabilize deep neural networks. They extend known stability theories to encompass a broader family of deep recurrent networks and propose a new initialization scheme that gives a weight of a half to the time and depth contributions to the gradient.
  • results: The authors confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. Their approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
    Abstract Numerous theories of learning suggest to prevent the gradient variance from exponential growth with depth or time, to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, that consists on giving a weight of a half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
    摘要 多种学习理论建议防止梯度变异的束缚增长,以稳定和改进训练。通常,这些分析是在具有数学 tractability 的批量化神经网络或单层循环神经网络上进行的。然而,这项研究表明,在神经网络太复杂以至于无法进行分析初始化时,可以预训练网络到地方稳定性。此外,我们扩展了已知稳定性理论,以覆盖更广泛的深度循环神经网络家族,不需要对数据和参数分布做出过多的假设。我们称之为地方稳定条件(LSC)。我们的调查表明,经典的格洛罗特、和合理初始化方案满足 LSC 当应用于批量化神经网络。然而,对深度循环神经网络进行分析,我们发现了一种新的加法式爆炸源,来自于计算梯度路径在深度和时间方向的矩阵中的计数。我们提出一种新的方法来缓解这个问题,即在计算梯度时,将时间和深度的贡献权重设为 0.5,而不是经典的 1.0。我们的实验结果表明,在 feed-forward 和循环神经网络中预训练满足 LSC 后,可以获得改进的最终性能。这项研究对深度学习领域的稳定性做出了贡献,并提供了一种可以稳定任何复杂性的神经网络的方法。我们的方法可以作为训练之前的额外步骤,或者作为在大量增强数据集上进行分析初始化的替代方案。
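
One reading of the proposed fix is that the depth (input) and time (recurrent) contributions to each pre-activation are weighted by one half instead of one, so that the extra gradient paths through the depth-time grid do not compound. A minimal recurrent cell illustrating this idea; layer details of the paper's networks are omitted.

```python
import torch
import torch.nn as nn

class HalfWeightedRNNCell(nn.Module):
    """Simple tanh RNN cell with depth and time contributions weighted by 1/2."""
    def __init__(self, input_size, hidden_size, contribution_weight=0.5):
        super().__init__()
        self.w_in = nn.Linear(input_size, hidden_size)    # "depth" path
        self.w_rec = nn.Linear(hidden_size, hidden_size)  # "time" path
        self.c = contribution_weight                       # 0.5 instead of 1.0

    def forward(self, x_t, h_prev):
        return torch.tanh(self.c * self.w_in(x_t) + self.c * self.w_rec(h_prev))

cell = HalfWeightedRNNCell(16, 32)
h = torch.zeros(4, 32)
for t in range(10):                       # unroll over a toy sequence
    h = cell(torch.randn(4, 16), h)
print(h.shape)                            # torch.Size([4, 32])
```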

Identifying Reaction-Aware Driving Styles of Stochastic Model Predictive Controlled Vehicles by Inverse Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12069
  • repo_url: None
  • paper_authors: Ni Dang, Tao Shi, Zengjie Zhang, Wanxin Jin, Marion Leibold, Martin Buss
  • for: This work provides a maximum entropy inverse reinforcement learning (ME-IRL) based method for identifying the driving styles of autonomous vehicles (AVs), so that collision risk can be evaluated and more reasonable driving decisions made in multi-AV traffic systems.
  • methods: Describes the driving style as a cost function over a series of weighted features identified with ME-IRL, and designs additional novel features that capture how an AV reacts to its nearby AVs.
  • results: Validated with MATLAB simulation and an off-the-shelf experiment, the proposed method accurately identifies AV driving styles.
    Abstract The driving style of an Autonomous Vehicle (AV) refers to how it behaves and interacts with other AVs. In a multi-vehicle autonomous driving system, an AV capable of identifying the driving styles of its nearby AVs can reliably evaluate the risk of collisions and make more reasonable driving decisions. However, there has not been a consistent definition of driving styles for an AV in the literature, although it is considered that the driving style is encoded in the AV's trajectories and can be identified using Maximum Entropy Inverse Reinforcement Learning (ME-IRL) methods as a cost function. Nevertheless, an important indicator of the driving style, i.e., how an AV reacts to its nearby AVs, is not fully incorporated in the feature design of previous ME-IRL methods. In this paper, we describe the driving style as a cost function of a series of weighted features. We design additional novel features to capture the AV's reaction-aware characteristics. Then, we identify the driving styles from the demonstration trajectories generated by the Stochastic Model Predictive Control (SMPC) using a modified ME-IRL method with our newly proposed features. The proposed method is validated using MATLAB simulation and an off-the-shelf experiment.
    摘要 自动驾驶车(AV)的驾驶方式指的是它如何行驶和与其他AV交互。在多辆自动驾驶车系统中,一个能够识别附近AV的驾驶方式的AV可以更加可靠地评估碰撞风险并做出更加合理的驾驶决策。然而,在文献中没有一个共识的自动驾驶车驾驶方式定义。尽管认为驾驶方式是在AV的轨迹中嵌入的,可以使用最大 entropy inverse reinforcement learning(ME-IRL)方法来识别它。然而,驾驶方式中一个重要指标,即AV如何 реаги于附近AV,并没有被完全包含在先前的ME-IRL方法中。在这篇论文中,我们定义了自动驾驶车的驾驶方式为一系列加权特征的成本函数。我们还设计了一些新的反应感知特征,以 capture AV的响应特性。然后,我们使用修改后的ME-IRL方法和我们新提出的特征来识别驾驶方式。我们的方法在MATLAB simulations和一个商业实验中得到了验证。
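
In maximum entropy IRL the driving style becomes a weight vector over (reaction-aware) features, fitted by matching feature expectations between the demonstrations and the policy induced by the current weights. A compact sketch of the weight update; the feature definitions and the expected-feature computation under the current policy are placeholders.

```python
import numpy as np

def me_irl_weights(demo_features, expected_features_fn, dim, lr=0.05, iters=200):
    """Fit linear cost/reward weights by matching demonstration feature expectations.

    demo_features:        (num_demos, dim) features of demonstrated trajectories
    expected_features_fn: given weights, returns (dim,) expected features under
                          the induced (soft-)optimal policy; placeholder here.
    """
    w = np.zeros(dim)
    demo_mean = demo_features.mean(axis=0)
    for _ in range(iters):
        gradient = demo_mean - expected_features_fn(w)   # max-ent IRL gradient
        w += lr * gradient
    return w

# Toy stand-in: pretend the policy's expected features move linearly with w.
expected = lambda w: 0.5 * w
demos = np.array([[1.0, 0.2, 0.0], [0.8, 0.3, 0.1]])   # e.g. speed, gap, reaction
print(me_irl_weights(demos, expected, dim=3))
```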

RemovalNet: DNN Fingerprint Removal Attacks

  • paper_url: http://arxiv.org/abs/2308.12319
  • repo_url: https://github.com/grasses/removalnet
  • paper_authors: Hongwei Yao, Zheng Li, Kunzhe Huang, Jian Lou, Zhan Qin, Kui Ren
  • for: This paper studies DNN model extraction and the protection of model intellectual property through ownership verification (DNN fingerprinting).
  • methods: The authors propose RemovalNet, a min-max bilevel optimization-based DNN fingerprint removal attack designed to evade model ownership verification: the lower-level optimization removes fingerprint-specific knowledge from the attacker's model, while the upper-level optimization distills the victim model's general semantic knowledge to preserve the surrogate model's performance.
  • results: Extensive experiments against four advanced defense methods demonstrate the fidelity, effectiveness, and efficiency of RemovalNet; compared with baseline attacks, it uses roughly 85% fewer computational resources, and the created surrogate model retains high accuracy after the fingerprint removal process.
    Abstract With the performance of deep neural networks (DNNs) remarkably improving, DNNs have been widely used in many areas. Consequently, the DNN model has become a valuable asset, and its intellectual property is safeguarded by ownership verification techniques (e.g., DNN fingerprinting). However, the feasibility of the DNN fingerprint removal attack and its potential influence remains an open problem. In this paper, we perform the first comprehensive investigation of DNN fingerprint removal attacks. Generally, the knowledge contained in a DNN model can be categorized into general semantic and fingerprint-specific knowledge. To this end, we propose a min-max bilevel optimization-based DNN fingerprint removal attack named RemovalNet, to evade model ownership verification. The lower-level optimization is designed to remove fingerprint-specific knowledge. While in the upper-level optimization, we distill the victim model's general semantic knowledge to maintain the surrogate model's performance. We conduct extensive experiments to evaluate the fidelity, effectiveness, and efficiency of the RemovalNet against four advanced defense methods on six metrics. The empirical results demonstrate that (1) the RemovalNet is effective. After our DNN fingerprint removal attack, the model distance between the target and surrogate models is x100 times higher than that of the baseline attacks, (2) the RemovalNet is efficient. It uses only 0.2% (400 samples) of the substitute dataset and 1,000 iterations to conduct our attack. Besides, compared with advanced model stealing attacks, the RemovalNet saves nearly 85% of computational resources at most, (3) the RemovalNet achieves high fidelity that the created surrogate model maintains high accuracy after the DNN fingerprint removal process. Our code is available at: https://github.com/grasses/RemovalNet.
    摘要 WITH 深度神经网络(DNN)性能显著提高,DNN已广泛应用于多个领域。因此,DNN模型成为了重要的财产,其知识产权得到了保护。然而,DNN指纹移除攻击的可能性和影响仍然是一个开放的问题。在这篇论文中,我们进行了首次全面的DNN指纹移除攻击调查。通常,DNN模型中的知识可以分为总Semantic和指纹特定知识。为此,我们提出了一种基于最小最大二级优化的DNN指纹移除攻击方法,名为RemovalNet,以避免模型所有权验证。lower-level优化设计移除指纹特定知识。而在upper-level优化中,我们通过液态热塑化将受害者模型的总Semantic知识萃取出来,以保持代理模型的性能。我们对四种高级防御方法进行了广泛的实验,并评估了RemovalNet的准确性、有效性和效率。实验结果显示了以下三点:1. RemovalNet是有效的。在我们的DNN指纹移除攻击后,模型之间的距离增加了100倍,比基eline攻击更高。2. RemovalNet是高效的。它只需使用400个样本和1000次迭代来进行攻击,而基eline攻击需要2000个样本和5000次迭代。此外,与高级模型盗取攻击相比,RemovalNet可以释放大约85%的计算资源。3. RemovalNet实现了高准确性,创建的代理模型在指纹移除过程后仍然保持高度准确。我们的代码可以在https://github.com/grasses/RemovalNet上下载。

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

  • paper_url: http://arxiv.org/abs/2308.12067
  • repo_url: None
  • paper_authors: Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun
  • for: This paper studies how the instruction-following ability of multimodal large language models can be strengthened with very little fine-tuning data.
  • methods: Builds on the two-stage training recipe (pre-training on image-text pairs, then fine-tuning on supervised vision-language instruction data), proposes metrics to assess the quality of multimodal instruction data, and uses them to automatically select high-quality vision-language samples.
  • results: Fine-tuned on only 200 examples (about 6% of the instruction-following data used for MiniGPT-4), InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations such as visual question answering and GPT-4 preference.
    Abstract Multimodal large language models acquire their instruction-following capabilities through a two-stage training process: pre-training on image-text pairs and fine-tuning on supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. We first propose several metrics to access the quality of multimodal instruction data. Based on these metrics, we present a simple and effective data selector to automatically identify and filter low-quality vision-language data. By employing this method, InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations (e.g., visual question answering, GPT-4 preference). Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient to enable multimodal large language models to generate better output.
    摘要 多模态大语言模型通过两stage训练过程获得指令遵循能力:先于插入图像文本对的预训练,然后在指导视语言数据上进行精度调整。现有研究表明,大语言模型可以通过有限量高质量指令遵循数据来达到满意的结果。在本文中,我们介绍InstructionGPT-4,它是基于只有200个例子,相当于MiniGPT-4的整合数据中的6%的指令遵循数据进行精度调整。我们首先提出了评估多模态指令数据质量的多种指标,然后基于这些指标,我们提出了一种简单有效的数据选择器,可以自动将低质量的视语言数据滤除。通过使用这种方法,InstructionGPT-4在多种评估中(如视觉问答、GPT-4偏好)都超过了原始MiniGPT-4。总之,我们的发现表明,虽然只有少量但高质量的指令循数据,可以使多模态大语言模型生成更好的输出。

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

  • paper_url: http://arxiv.org/abs/2308.12066
  • repo_url: None
  • paper_authors: Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang, Minsoo Rhu
  • for: Serving transformer-based large language models (LLMs) with high performance despite their compute and memory demands.
  • methods: Uses the Mixture-of-Experts (MoE) architecture to accommodate the compute and memory requirements of large-scale LLMs without proportionally scaling compute.
  • results: Proposes the Pre-gated MoE system, which effectively addresses the compute and memory challenges of conventional MoE architectures while maintaining high performance and reducing GPU memory consumption.
    Abstract Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.
    摘要 Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs a novel pre-gating function that alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE can improve performance, reduce GPU memory consumption, and maintain the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.In simplified Chinese, the text would be:大型语言模型(LLM) based on transformers recent years achieved significant progress, success driven by scaling up model size. However, the high computational and memory requirements of LLMs present unprecedented challenges. To address these challenges, Mixture-of-Experts(MoE) architecture was introduced, which can scale its model size without proportionally increasing its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts limit its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead.我们的 Pre-gated MoE 系统使用我们的算法-系统合理设计,有效地解决了传统 MoE 架构中的计算和内存挑战。Pre-gated MoE 使用我们的新的预 Gate 函数,解决了 sparse expert 动态 activation 的问题,使我们的提议的系统可以Addressing the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE can improve performance, reduce GPU memory consumption, and maintain the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.
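
The pre-gating idea is to compute the expert selection for block l+1 from the activations of block l, so the few experts that will actually fire can be prefetched to the GPU while block l is still computing. The sketch below illustrates this scheduling idea with hypothetical block and prefetch callables; it is not the paper's system implementation.

```python
import torch
import torch.nn as nn

class PreGate(nn.Module):
    """Predict which experts the *next* MoE block will need, one block ahead."""
    def __init__(self, hidden, num_experts, k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.k = k

    def forward(self, h):                    # h: (tokens, hidden)
        scores = self.router(h).mean(dim=0)  # pooled routing scores
        return torch.topk(scores, self.k).indices

def forward_with_prefetch(h, blocks, pre_gates, prefetch_experts):
    """Start copying block l+1's experts to the GPU while block l computes."""
    for l, block in enumerate(blocks):
        if l + 1 < len(blocks):
            prefetch_experts(l + 1, pre_gates[l](h))  # asynchronous host-to-device copy
        h = block(h)                                  # compute the current block
    return h
```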

Ensembling Uncertainty Measures to Improve Safety of Black-Box Classifiers

  • paper_url: http://arxiv.org/abs/2308.12065
  • repo_url: None
  • paper_authors: Tommaso Zoppi, Andrea Ceccarelli, Andrea Bondavalli
  • for: This work proposes a safety wrapper (SPROUT) that detects and blocks misclassifications of machine learning (ML) classifiers.
  • methods: Computes multiple uncertainty measures on the inputs and outputs of a black-box classifier and blocks the propagation of the output when a misclassification is suspected.
  • results: Experiments show that SPROUT detects a large fraction of misclassifications, and all of them in specific cases; it applies to binary and multi-class classification on both image and tabular datasets.
    Abstract Machine Learning (ML) algorithms that perform classification may predict the wrong class, experiencing misclassifications. It is well-known that misclassifications may have cascading effects on the encompassing system, possibly resulting in critical failures. This paper proposes SPROUT, a Safety wraPper thROugh ensembles of UncertainTy measures, which suspects misclassifications by computing uncertainty measures on the inputs and outputs of a black-box classifier. If a misclassification is detected, SPROUT blocks the propagation of the output of the classifier to the encompassing system. The resulting impact on safety is that SPROUT transforms erratic outputs (misclassifications) into data omission failures, which can be easily managed at the system level. SPROUT has a broad range of applications as it fits binary and multi-class classification, comprising image and tabular datasets. We experimentally show that SPROUT always identifies a huge fraction of the misclassifications of supervised classifiers, and it is able to detect all misclassifications in specific cases. SPROUT implementation contains pre-trained wrappers, it is publicly available and ready to be deployed with minimal effort.
    摘要 机器学习(ML)算法可能会预测错误的类别,导致错误分类。这是已知的一点,错误分类可能会带来整体系统的崩溃。这篇文章提议了“护皮”(SPROUT),它是通过多个不确定度测量来怀疑错误分类的一种安全包装。如果检测到错误分类,SPROUT会阻止分类器的输出传递到包含系统。这会使安全性受到改善,因为SPROUT将异常输入(错误分类)转化为数据漏洞失败,这可以轻松地在系统层面进行管理。SPROUT适用于二分类和多分类,包括图像和表格数据集。我们实验表明,SPROUT总能够检测大量超级vised分类器中的错误分类,并且在某些情况下可以检测所有错误分类。SPROUT的实现包括预训练包装,它公共可用,ready to deploy 需要最小的努力。
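
The wrapper's loop is: compute uncertainty measures on the classifier's inputs and outputs, feed them to a detector, and omit the prediction when a misclassification is suspected, turning erratic outputs into data-omission failures. A condensed sketch with two common output-side measures (softmax entropy and top-1/top-2 margin) and a toy rule-based detector; the real SPROUT ensembles more measures and a trained detector.

```python
import numpy as np

def uncertainty_measures(probs):
    """Two simple output-side uncertainty measures for one softmax vector."""
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    top2 = np.sort(probs)[-2:]
    margin = top2[1] - top2[0]                 # small margin = uncertain
    return np.array([entropy, margin])

def safety_wrapper(probs, detector, threshold=0.5):
    """Return the predicted class, or None (data omission) if suspicious."""
    score = detector(uncertainty_measures(probs))   # misclassification suspicion
    return None if score > threshold else int(np.argmax(probs))

# Toy detector: suspicious when entropy is high and the margin is low.
detector = lambda m: 1.0 if (m[0] > 1.0 and m[1] < 0.2) else 0.0
print(safety_wrapper([0.05, 0.9, 0.05], detector))   # confident -> class 1
print(safety_wrapper([0.34, 0.33, 0.33], detector))  # uncertain -> None
```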

FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering

  • paper_url: http://arxiv.org/abs/2308.12060
  • repo_url: https://github.com/leezythu/flexkbqa
  • paper_authors: Zhenyu Li, Sunqi Fan, Yu Gu, Xiuxing Li, Zhichao Duan, Bowen Dong, Ning Liu, Jianyong Wang
  • for: Improving KBQA performance in real-world settings where high-quality annotated data is scarce.
  • methods: Uses automatically sampled programs such as SPARQL queries together with large language models (LLMs): the sampled programs reduce manual annotation effort, and the LLMs convert the programs into natural language questions; an execution-guided self-training step further leverages unlabeled user questions.
  • results: Extensive experiments on GrailQA, WebQSP, and KQA Pro show that under few-shot and even zero-shot settings FlexKBQA surpasses all baselines and approaches supervised models, reaching 93% of the performance of fully supervised models.
    Abstract Knowledge base question answering (KBQA) is a critical yet challenging task due to the vast number of entities within knowledge bases and the diversity of natural language questions posed by users. Unfortunately, the performance of most KBQA models tends to decline significantly in real-world scenarios where high-quality annotated data is insufficient. To mitigate the burden associated with manual annotation, we introduce FlexKBQA by utilizing Large Language Models (LLMs) as program translators for addressing the challenges inherent in the few-shot KBQA task. Specifically, FlexKBQA leverages automated algorithms to sample diverse programs, such as SPARQL queries, from the knowledge base, which are subsequently converted into natural language questions via LLMs. This synthetic dataset facilitates training a specialized lightweight model for the KB. Additionally, to reduce the barriers of distribution shift between synthetic data and real user questions, FlexKBQA introduces an executionguided self-training method to iterative leverage unlabeled user questions. Furthermore, we explore harnessing the inherent reasoning capability of LLMs to enhance the entire framework. Consequently, FlexKBQA delivers substantial flexibility, encompassing data annotation, deployment, and being domain agnostic. Through extensive experiments on GrailQA, WebQSP, and KQA Pro, we observe that under the few-shot even the more challenging zero-shot scenarios, FlexKBQA achieves impressive results with a few annotations, surpassing all previous baselines and even approaching the performance of supervised models, achieving a remarkable 93% performance relative to the fully-supervised models. We posit that FlexKBQA represents a significant advancement towards exploring better integration of large and lightweight models. The code is open-sourced.
    摘要 知识库问答(KBQA)是一项关键性的 yet 挑战性的任务,由于知识库中的维度多样性和用户提交的自然语言问题的多样性。尽管大多数 KBQA 模型在实际场景中表现不佳,这主要归结于缺乏高质量标注数据的问题。为了解决这个问题,我们引入 FlexKBQA,利用大型自然语言模型(LLMs)作为知识库程序翻译器,以解决几何shot KBQA 任务中的挑战。Specifically, FlexKBQA 使用自动生成算法来采样知识库中的多样程序,例如 SPARQL 查询,并将其转化为自然语言问题。这些人工生成的数据可以用来训练特殊的轻量级模型。此外,为了减少实际问题和人工标注数据之间的分布差异,FlexKBQA 引入执行引导自动训练方法,以便逐步利用无标注的用户问题进行自动训练。此外,我们还考虑了利用 LLMs 的内在逻辑能力来增强整个框架。通过广泛的实验在 GrailQA、WebQSP 和 KQA Pro 等平台上,我们发现在几何shot 和零shot enario下,FlexKBQA 可以很好地表现,与完全监督模型相当,达到了93% 的性能相对于完全监督模型。我们认为 FlexKBQA 代表了大量和轻量级模型更好的 интеграción的一个重要进展。代码开源。

Layer-wise Feedback Propagation

  • paper_url: http://arxiv.org/abs/2308.12053
  • repo_url: None
  • paper_authors: Leander Weber, Jim Berend, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
  • for: This paper proposes Layer-wise Feedback Propagation (LFP), an explanation-based training approach that assigns rewards to individual connections according to their contribution to solving a given task, as an alternative to traditional gradient descent.
  • methods: LFP distributes the feedback signal through the network with Layer-wise Relevance Propagation (LRP), requiring no gradient computation and thereby avoiding some limitations of gradient-based training; it achieves performance comparable to gradient descent across models and datasets.
  • results: The paper establishes the convergence of LFP theoretically and empirically and demonstrates its effectiveness on various models and datasets, including potential applications such as training models with no meaningful derivatives (e.g., step-function activated Spiking Neural Networks, SNNs) and transfer learning to efficiently reuse existing knowledge.
    Abstract In this paper, we present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors that utilizes explainability, specifically Layer-wise Relevance Propagation(LRP), to assign rewards to individual connections based on their respective contributions to solving a given task. This differs from traditional gradient descent, which updates parameters towards anestimated loss minimum. LFP distributes a reward signal throughout the model without the need for gradient computations. It then strengthens structures that receive positive feedback while reducingthe influence of structures that receive negative feedback. We establish the convergence of LFP theoretically and empirically, and demonstrate its effectiveness in achieving comparable performance to gradient descent on various models and datasets. Notably, LFP overcomes certain limitations associated with gradient-based methods, such as reliance on meaningful derivatives. We further investigate how the different LRP-rules can be extended to LFP, what their effects are on training, as well as potential applications, such as training models with no meaningful derivatives, e.g., step-function activated Spiking Neural Networks (SNNs), or for transfer learning, to efficiently utilize existing knowledge.
    摘要 在这篇论文中,我们提出层 wise Feedback Propagation(LFP),一种基于解释的训练方法,使用层 wise Relevance Propagation(LRP)来为解决特定任务中的每个连接分配奖励。这与传统的梯度下降不同,梯度下降更新参数向估计损失最小值。LFP在模型中分配奖励信号,不需要梯度计算。它然后强化收到正面反馈的结构,而减少收到负面反馈的影响。我们 theoretically 和 empirically 证明 LFP 的 converges,并在不同模型和数据集上证明其效果。值得注意的是,LFP 可以超越一些相关的梯度基本方法的限制,如依赖于意义 derivatives。我们还 investigate 如何 extend LRP-rules 到 LFP,它们在训练中的效果,以及潜在应用,如训练无意义 derivatives 的模型,例如步函数激活的神经网络(SNNs),或者用于传输学习,以高效地利用现有的知识。

Aligning Language Models with Offline Reinforcement Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2308.12050
  • repo_url: None
  • paper_authors: Jian Hu, Li Tao, June Yang, Chandler Zhou
  • for: This paper aims to align language models with human preferences using offline reinforcement learning from human feedback (RLHF) frameworks, without relying on online reinforcement learning techniques like Proximal Policy Optimization (PPO) that can be unstable and challenging to tune.
  • methods: The authors propose using maximum likelihood estimation (MLE) with filtering, reward-weighted regression (RWR), and Decision Transformer (DT) to align language models to human preferences. They employ a loss function similar to supervised fine-tuning to ensure stable model training, and compare their methods with PPO and other offline RLHF methods.
  • results: The experimental results show that the DT alignment outperforms other offline RLHF methods and is better than PPO, while requiring only about 12.3% of the computing resources and a simpler machine learning system.
    Abstract Learning from human preferences is crucial for language models (LMs) to effectively cater to human needs and societal values. Previous research has made notable progress by leveraging human feedback to follow instructions. However, these approaches rely primarily on online reinforcement learning (RL) techniques like Proximal Policy Optimization (PPO), which have been proven unstable and challenging to tune for language models. Moreover, PPO requires complex distributed system implementation, hindering the efficiency of large-scale distributed training. In this study, we propose an offline reinforcement learning from human feedback (RLHF) framework to align LMs using pre-generated samples without interacting with RL environments. Specifically, we explore maximum likelihood estimation (MLE) with filtering, reward-weighted regression (RWR), and Decision Transformer (DT) to align language models to human preferences. By employing a loss function similar to supervised fine-tuning, our methods ensure more stable model training than PPO with a simple machine learning system~(MLSys) and much fewer (around 12.3\%) computing resources. Experimental results demonstrate the DT alignment outperforms other Offline RLHF methods and is better than PPO.
    摘要 学习人类偏好是语言模型(LM)效果服务的关键。过去的研究已经做出了可观的进步,通过使用人类反馈来跟进 instruction。然而,这些方法主要依赖于在线强化学习(RL)技术,如 proximal policy optimization(PPO),这些技术有unstable和难于调整的问题。另外,PPO需要复杂的分布式系统实现,这会阻碍大规模分布式训练的效率。在这种情况下,我们提出了一个偏好RLHF框架,用于不需要与RL环境交互的情况下,使语言模型与人类偏好相匹配。具体来说,我们explore maximum likelihood estimation(MLE)with filtering、reward-weighted regression(RWR)和Decision Transformer(DT)来对语言模型进行偏好调整。我们的方法使用一个类似于超vised fine-tuning的损失函数,以确保更稳定的模型训练,并且只需要相对较少的计算资源(约12.3%)。实验结果表明,DT调整超过其他Offline RLHF方法,并且比PPO更好。
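
Of the three offline objectives, reward-weighted regression is the most compact: the ordinary supervised cross-entropy loss of each response is scaled by an exponentiated reward, so training remains as stable as fine-tuning. A minimal sketch of the per-batch loss; the softmax normalisation of the weights and the temperature beta follow one common RWR variant and are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rwr_loss(logits, labels, rewards, beta=1.0, ignore_index=-100):
    """Reward-weighted regression: supervised loss weighted by exp(reward / beta).

    logits:  (batch, seq, vocab) model outputs
    labels:  (batch, seq) target token ids, ignore_index on prompt positions
    rewards: (batch,) scalar reward per response (e.g. from a reward model)
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=ignore_index, reduction="none"
    )                                               # (batch, seq)
    mask = (labels != ignore_index).float()
    per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    weights = torch.softmax(rewards / beta, dim=0)  # normalised exp-reward weights
    return (weights * per_sample).sum()

logits = torch.randn(2, 5, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 5))
loss = rwr_loss(logits, labels, rewards=torch.tensor([1.0, -0.5]))
loss.backward()
```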

Towards Privacy-Supporting Fall Detection via Deep Unsupervised RGB2Depth Adaptation

  • paper_url: http://arxiv.org/abs/2308.12049
  • repo_url: https://github.com/1015206533/privacy_supporting_fall_detection
  • paper_authors: Hejun Xiao, Kunyu Peng, Xiangsheng Huang, Alina Roitberg1, Hao Li, Zhaohui Wang, Rainer Stiefelhagen
  • for: Fall detection to improve the effectiveness of health monitoring and enable faster interventions
  • methods: Uses depth sensors together with RGB video data, adapting an RGB-trained model to the depth domain through unsupervised domain adaptation
  • results: Enables fall detection from privacy-preserving depth data at test time, without detailed appearance (RGB) data, and achieves state-of-the-art results
    Abstract Fall detection is a vital task in health monitoring, as it allows the system to trigger an alert and therefore enabling faster interventions when a person experiences a fall. Although most previous approaches rely on standard RGB video data, such detailed appearance-aware monitoring poses significant privacy concerns. Depth sensors, on the other hand, are better at preserving privacy as they merely capture the distance of objects from the sensor or camera, omitting color and texture information. In this paper, we introduce a privacy-supporting solution that makes the RGB-trained model applicable in depth domain and utilizes depth data at test time for fall detection. To achieve cross-modal fall detection, we present an unsupervised RGB to Depth (RGB2Depth) cross-modal domain adaptation approach that leverages labelled RGB data and unlabelled depth data during training. Our proposed pipeline incorporates an intermediate domain module for feature bridging, modality adversarial loss for modality discrimination, classification loss for pseudo-labeled depth data and labeled source data, triplet loss that considers both source and target domains, and a novel adaptive loss weight adjustment method for improved coordination among various losses. Our approach achieves state-of-the-art results in the unsupervised RGB2Depth domain adaptation task for fall detection. Code is available at https://github.com/1015206533/privacy_supporting_fall_detection.
    摘要 “fall detection是健康监控中的重要任务,可以让系统发送警示,从而更快地对人员坠落时进行应对。然而,大多数先前的方法仅使用标准的RGB影像数据,这种细节意识敏感的监控具有重要的隐私问题。深度感知器,则可以更好地保持隐私,因为它们仅capture物体对感知器或相机的距离,排除颜色和 texture信息。在本文中,我们介绍了一个关于隐私支持的解决方案,让RGB模型在深度领域中可用并在试用时使用深度数据进行坠落探测。”“实现跨模式的坠落探测,我们提出了一个不需要 labels的RGB to Depth(RGB2Depth)跨模式领域适应方法。我们的提案包括一个中继领域模组,用于Feature Bridging,模组挑战数据的类型和大小,以及一个对于模组的挑战数据的多对多挑战数据。我们还使用了一个对于source和target领域的多对多挑战数据,以及一个新的适应式损失调整方法,以改善不同损失函数之间的协调。”“我们的方法在RGB2Depth领域适应任务中得到了state-of-the-art的结果。我们的代码可以在https://github.com/1015206533/privacy_supporting_fall_detection中找到。”

CgT-GAN: CLIP-guided Text GAN for Image Captioning

  • paper_url: http://arxiv.org/abs/2308.12045
  • repo_url: https://github.com/lihr747/cgtgan
  • paper_authors: Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He
  • for: The paper is written for improving image captioning without human-annotated image-caption pairs, using a text-only training paradigm and incorporating images into the training process.
  • methods: The paper proposes a CLIP-guided text GAN (CgT-GAN) that uses adversarial training and a CLIP-based reward to provide semantic guidance, and introduces a novel semantic guidance reward called CLIP-agg that aligns the generated caption with a weighted text embedding.
  • results: The paper shows that CgT-GAN outperforms state-of-the-art methods significantly across all metrics on three subtasks (ZS-IC, In-UIC, and Cross-UIC).
    Abstract The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" real visual modality. Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded based on the caption naturalness to human language calculated from the GAN's discriminator and the semantic guidance reward computed by the CLIP-based reward module. In addition to the cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
    摘要 大规模的视觉语言预训练模型CLIP(Contrastive Language-Image Pre-training)在没有人类标注的场景下提高了图像描述。最新的CLIP基于的图像描述方法采用文本只训练 paradigm,即在共享 embedding 空间中重建文本。然而,这些方法受到训练/推断差距或巨大的存储要求的限制。因为在实际世界中可以轻松地获得图像,我们提出了CLIP引导的文本GAN(CgT-GAN),它将图像 inclusion 到训练过程中,使模型可以"看到"实际的视觉Modal。特别是,我们使用对抗训练来教育CgT-GAN模仿外部文本聚合体和CLIP基于的奖励来提供语义指导。描述生成器被同时激励基于描述自然度计算从GAN的探测器和CLIP基于的奖励模块计算的语义指导奖励。此外,我们还引入了一种新的语义指导奖励called CLIP-agg,它将生成的描述与权重文本embedding进行协调,通过对整个聚合体进行注意力聚集来实现。实验结果在三个SUB Task(ZS-IC、In-UIC和Cross-UIC)中显示,CgT-GAN具有与状态艺术方法相比明显的优势,在所有指标上出现显著提升。代码可以在https://github.com/Lihr747/CgtGAN 上找到。

A multiobjective continuation method to compute the regularization path of deep neural networks

  • paper_url: http://arxiv.org/abs/2308.12044
  • repo_url: https://github.com/aamakor/continuation-method
  • paper_authors: Augustina C. Amakor, Konstantin Sonntag, Sebastian Peitz
  • for: This paper proposes an efficient method for approximating the trade-off between sparsity and the loss function of deep neural networks (DNNs).
  • methods: Uses a multiobjective continuation algorithm to approximate the entire Pareto front between the empirical loss and the $\ell^1$ norm.
  • results: Numerical examples with both deterministic and stochastic gradients demonstrate the efficiency and generality of the algorithm, and knowledge of the regularization path is shown to yield well-generalizing network parametrizations.
    Abstract Sparsity is a highly desired feature in deep neural networks (DNNs) since it ensures numerical efficiency, improves the interpretability of models (due to the smaller number of relevant features), and robustness. In machine learning approaches based on linear models, it is well known that there exists a connecting path between the sparsest solution in terms of the $\ell^1$ norm (i.e., zero weights) and the non-regularized solution, which is called the regularization path. Very recently, there was a first attempt to extend the concept of regularization paths to DNNs by means of treating the empirical loss and sparsity ($\ell^1$ norm) as two conflicting criteria and solving the resulting multiobjective optimization problem. However, due to the non-smoothness of the $\ell^1$ norm and the high number of parameters, this approach is not very efficient from a computational perspective. To overcome this limitation, we present an algorithm that allows for the approximation of the entire Pareto front for the above-mentioned objectives in a very efficient manner. We present numerical examples using both deterministic and stochastic gradients. We furthermore demonstrate that knowledge of the regularization path allows for a well-generalizing network parametrization.
    摘要 深度神经网络(DNN)中的稀畴性是一个非常强地需求的特性,因为它确保了数学效率、提高模型解释性(由于更少的相关特征),并且提高了模型的稳定性。在线性机器学习方法基于的模型中,已经知道存在一个连接到最稀 Solution 的梯度路径,这个梯度路径被称为规regularization path。很近期,有一个首次尝试将这个概念扩展到 DNN 中,通过对 empirical loss 和稀畴性($\ell^1$ 范数)作为两个矛盾的目标,解决 resulting 多目标优化问题。然而,由于 $\ell^1$ 范数的非滑坡性和参数的高数量,这种方法并不很有效从计算机科学的角度。为了解决这个限制,我们提出了一个算法,可以高效地 aproximate 整个 Pareto front 上的目标。我们通过 deterministic 和 Stochastic 梯度来进行数值示例。此外,我们还证明了知道规regularization path 可以提供一个良好的网络参数化。
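
One simple way to approximate such a regularization path is a continuation loop: sweep the trade-off between the empirical loss and the L1 penalty from "sparsity only" towards "loss only", warm-starting each subproblem from the previous solution so that neighbouring points on the front are cheap to compute. The sketch below does this for a linear model with proximal gradient steps; the paper's scheme for DNNs is more refined than this weighted-sum sweep.

```python
import numpy as np

def regularization_path(X, y, num_points=20, steps=300, lr=0.05):
    """Trace (sparsity, loss) trade-offs by continuation with warm starts."""
    w = np.zeros(X.shape[1])                      # start at the sparsest solution
    path = []
    for t in np.linspace(0.0, 1.0, num_points):   # t=0: pure L1, t=1: pure loss
        for _ in range(steps):                    # proximal gradient on t*loss + (1-t)*L1
            grad = X.T @ (X @ w - y) / len(y)
            w = w - lr * t * grad
            lam = lr * (1.0 - t)                  # soft-threshold for the L1 part
            w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        path.append((t, float(np.sum(np.abs(w))), loss))
    return path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ np.array([2.0, -1.0] + [0.0] * 8) + 0.1 * rng.standard_normal(100)
for t, l1, loss in regularization_path(X, y)[::5]:
    print(f"t={t:.2f}  ||w||_1={l1:.2f}  loss={loss:.3f}")
```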

IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning

  • paper_url: http://arxiv.org/abs/2308.12043
  • repo_url: https://github.com/feiyuzhang98/increlora
  • paper_authors: Feiyu Zhang, Liangzhi Li, Junhao Chen, Zhouqiang Jiang, Bowen Wang, Yiming Qian
  • for: This paper targets parameter-efficient fine-tuning of large pre-trained language models (PLMs), reducing training and storage costs, especially when there are many downstream tasks.
  • methods: Proposes IncreLoRA, an incremental parameter allocation method that adaptively adds trainable parameters to each module during training according to importance scores, so that the rank upper bound of each parameter matrix is not limited by the initial number of training parameters.
  • results: Extensive experiments on GLUE show higher parameter efficiency than the baselines, especially in low-resource settings, where the method significantly outperforms them.
    Abstract With the increasing size of pre-trained language models (PLMs), fine-tuning all the parameters in the model is not efficient, especially when there are a large number of downstream tasks, which incur significant training and storage costs. Many parameter-efficient fine-tuning (PEFT) approaches have been proposed, among which, Low-Rank Adaptation (LoRA) is a representative approach that injects trainable rank decomposition matrices into every target module. Yet LoRA ignores the importance of parameters in different modules. To address this problem, many works have been proposed to prune the parameters of LoRA. However, under limited training conditions, the upper bound of the rank of the pruned parameter matrix is still affected by the preset values. We, therefore, propose IncreLoRA, an incremental parameter allocation method that adaptively adds trainable parameters during training based on the importance scores of each module. This approach is different from the pruning method as it is not limited by the initial number of training parameters, and each parameter matrix has a higher rank upper bound for the same training overhead. We conduct extensive experiments on GLUE to demonstrate the effectiveness of IncreLoRA. The results show that our method owns higher parameter efficiency, especially when under the low-resource settings where our method significantly outperforms the baselines. Our code is publicly available.
    摘要 随着预训语言模型(PLM)的大小的增加,精细调整所有模型参数不是efficient,特别是当有大量下游任务时,会导致显著的训练和存储成本。许多参数精细调整(PEFT)approach已经提出,其中LoRA是一个代表性的方法,它在每个目标模块中注入可学习的排序矩阵。然而,LoRA忽略了参数在不同模块中的重要性。为解决这个问题,许多工作已经提出了对LoRA的剪枝。然而,在限制的训练条件下,剪枝后的参数矩阵的rankUpperBound仍然受到先前设置的值的影响。因此,我们提出了IncreLoRA,一种逐步分配参数的方法,它在训练过程中基于每个模块的重要性分数进行逐步添加可学习参数。这种方法与剪枝方法不同,它不受限于初始训练参数的数量,每个参数矩阵的rankUpperBound都高于同样的训练负担。我们在GLUE上进行了广泛的实验,结果表明我们的方法具有更高的参数效率,特别是在低资源设置下,我们的方法显著超过了基eline。我们的代码公开 disponibles。

PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

  • paper_url: http://arxiv.org/abs/2308.12033
  • repo_url: https://github.com/zcrwind/prefer
  • paper_authors: Chenrui Zhang, Lin Liu, Jinpeng Wang, Chuyuan Wang, Xiao Sun, Hongyu Wang, Mingchen Cai
  • for: Improving the performance of Large Language Models (LLMs) and strengthening their capabilities.
  • methods: Proposes a simple, universal, and automatic method named PREFER that improves LLM performance through a feedback mechanism and iterative refinement of prompt ensembles.
  • results: Extensive experiments show that PREFER achieves state-of-the-art performance on multiple types of tasks, surpassing existing methods by a significant margin.
    Abstract As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Pompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.
    摘要 为了更好地利用大语言模型(LLM)的能力,提问最近在多种复杂任务中表现出了无 precedent 的能力。为了进一步提高性能,提问ensemble 已经吸引了很多关注,以解决 LLM 的幻觉和不稳定性。然而,现有的方法通常采用两个阶段 paradigm,需要大量的手动努力来预先准备提问集,并且无法 direktly 优化不同的弱学习者。在这篇论文中,我们提出了一种简单、通用和自动的方法 named PREFER (提问组合学习 via 反馈反思改进),以解决所提到的限制。具体来说,我们知道弱学习者在扩大时会关注困难的示例,PREFER 建立了反馈机制,以反思现有弱学习者的不足。基于这,LLM 需要自动生成新的提问,进行迭代改进。此外,为了增强提问效果评估的稳定性,我们提出了一种新的提问袋裹法,其包括前向和后向思考,比较有利于提问评估和权重计算在扩大中。我们的 EXPERIMENT 表明,我们的 PREFER 可以在多种任务中达到 estado 的表现,与当前最佳方法相比,差距非常大。我们的代码已经公开发布。

CACTUS: a Comprehensive Abstraction and Classification Tool for Uncovering Structures

  • paper_url: http://arxiv.org/abs/2308.12031
  • repo_url: None
  • paper_authors: Luca Gherardini, Varun Ravi Varma, Karol Capala, Roger Woods, Jose Sousa
  • for: This work aims to improve the explainability of secure analytics in support of current artificial intelligence development.
  • methods: Uses CACTUS, an explainable artificial intelligence tool for improved secure analytics; it supports categorical attributes while preserving their original meaning, optimises memory usage, and speeds up computation through parallelisation.
  • results: Applied to the Wisconsin diagnostic breast cancer and Thyroid0387 datasets, the tool performs well and shows the user the frequency of the attributes in each class, ranked by discriminative power.
    Abstract The availability of large data sets is providing an impetus for driving current artificial intelligent developments. There are, however, challenges for developing solutions with small data sets due to practical and cost-effective deployment and the opacity of deep learning models. The Comprehensive Abstraction and Classification Tool for Uncovering Structures called CACTUS is presented for improved secure analytics by effectively employing explainable artificial intelligence. It provides additional support for categorical attributes, preserving their original meaning, optimising memory usage, and speeding up the computation through parallelisation. It shows to the user the frequency of the attributes in each class and ranks them by their discriminative power. Its performance is assessed by application to the Wisconsin diagnostic breast cancer and Thyroid0387 data sets.
    摘要 大量数据的可用性正为现代人工智能发展提供了推动力。然而,对小数据集的解决方案存在实用和成本效益的挑战,尤其是深度学习模型的透明性问题。本文提出了一种名为“CACTUS”的全面抽象分类工具,用于提高安全分析。它能够有效地使用可解释人工智能,并且支持 categorical 特征,保持原始含义,优化内存使用情况,并通过并行计算加速计算。它可以在用户看到每个类别 attribute 的频率和排名它们的抑制力。它的性能被评估通过应用于美国威斯康星诊断乳腺癌和 thyroid0387 数据集。

Prompt-Based Length Controlled Generation with Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.12030
  • repo_url: None
  • paper_authors: Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu
  • for: Improving the usefulness and efficiency of GPT-style models so they can better satisfy the needs of different real-world scenarios.
  • methods: Uses reinforcement learning with a reward signal given by either a trainable or a rule-based reward model to control the generated length of GPT-style models.
  • results: Significantly improves the accuracy of prompt-based length control for summarization on the popular CNNDM and NYT datasets.
    Abstract Recently, large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising improvement and performance. Length controlled generation of LLMs emerges as an important topic, which also enables users to fully leverage the capability of LLMs in more real-world scenarios like generating a proper answer or essay of a desired length. In addition, the autoregressive generation in LLMs is extremely time-consuming, while the ability of controlling this generated length can arbitrarily reduce the inference cost by limiting the length, and thus satisfy different needs. Therefore, we aim to propose a prompt-based length control method to achieve this length controlled generation, which can also be widely applied in GPT-style LLMs. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward model, which further affects the generation of LLMs via rewarding a pre-defined target length. Experiments show that our method significantly improves the accuracy of prompt-based length control for summarization task on popular datasets like CNNDM and NYT. We believe this length-controllable ability can provide more potentials towards the era of LLMs.
    摘要 To address this issue, we propose a prompt-based length control method using reinforcement learning with a trainable or rule-based reward model. Our method aims to achieve length-controlled generation in GPT-style LLMs, and experiments show that it significantly improves the accuracy of prompt-based length control for summarization tasks on popular datasets like CNNDM and NYT. We believe that this length-controllable ability has great potential in the era of LLMs.

A Scale-Invariant Task Balancing Approach for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2308.12029
  • repo_url: None
  • paper_authors: Baijiong Lin, Weisen Jiang, Feiyang Ye, Yu Zhang, Pengguang Chen, Ying-Cong Chen, Shu Liu
  • for: Addressing the task-balancing problem in multi-task learning (MTL), so that multiple related tasks can be learned simultaneously with good performance.
  • methods: Proposes a Scale-Invariant Multi-Task Learning (SI-MTL) method that applies a logarithm transformation to all task losses for scale invariance at the loss level, together with a gradient balancing method, SI-G, that normalizes all task gradients to the same magnitude as the maximum gradient norm.
  • results: Extensive experiments on several benchmark datasets demonstrate the effectiveness of SI-G and the state-of-the-art performance of SI-MTL.
    Abstract Multi-task learning (MTL), a learning paradigm to learn multiple related tasks simultaneously, has achieved great success in various fields. However, task-balancing remains a significant challenge in MTL, with the disparity in loss/gradient scales often leading to performance compromises. In this paper, we propose a Scale-Invariant Multi-Task Learning (SI-MTL) method to alleviate the task-balancing problem from both loss and gradient perspectives. Specifically, SI-MTL contains a logarithm transformation which is performed on all task losses to ensure scale-invariant at the loss level, and a gradient balancing method, SI-G, which normalizes all task gradients to the same magnitude as the maximum gradient norm. Extensive experiments conducted on several benchmark datasets consistently demonstrate the effectiveness of SI-G and the state-of-the-art performance of SI-MTL.
    摘要 多任务学习(MTL),一种同时学习多个相关任务的学习方法,在各个领域取得了很大成功。然而,任务均衡仍然是MTL中的主要挑战,因为任务损失/梯度的尺度差异常常导致性能下降。在这篇论文中,我们提出了一种减小任务均衡问题的扩展MTL方法(SI-MTL)。具体来说,SI-MTL包括一种对所有任务损失进行对数变换,以保证损失水平上的减小,以及一种梯度均衡方法SI-G,该方法将所有任务梯度 норmalizes到最大梯度 норма的同一个范围内。我们在多个标准数据集上进行了广泛的实验,并经常证明了SI-G的有效性和SI-MTL的状态之最性。
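
Both components are easy to state: replace each task loss L_k by log L_k, which makes the objective invariant to per-task rescaling, and rescale each task gradient to the largest per-task gradient norm before summing (SI-G). A minimal sketch of one optimisation step; how the balanced gradients are combined with a particular MTL weighting scheme is left out.

```python
import torch

def si_mtl_step(model, task_losses, optimizer, eps=1e-8):
    """One SI-MTL update: log-transformed losses plus gradient norm balancing (SI-G)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for loss in task_losses:
        g = torch.autograd.grad(torch.log(loss + eps), params, retain_graph=True)
        grads.append(g)
    # SI-G: rescale every task gradient to the largest task gradient norm.
    norms = [torch.sqrt(sum((gi ** 2).sum() for gi in g)) for g in grads]
    max_norm = max(norms)
    optimizer.zero_grad()
    for p_idx, p in enumerate(params):
        p.grad = sum((max_norm / (n + eps)) * g[p_idx] for g, n in zip(grads, norms))
    optimizer.step()

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 4)
losses = [((model(x)[:, 0] - 1) ** 2).mean(), ((model(x)[:, 1] + 1) ** 2).mean()]
si_mtl_step(model, losses, opt)
```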

LKPNR: LLM and KG for Personalized News Recommendation Framework

  • paper_url: http://arxiv.org/abs/2308.12028
  • repo_url: https://github.com/xuan-zw/lkpnr
  • paper_authors: Chen hao, Xie Runfeng, Cui Xiangyang, Yan Zhou, Wang Xin, Xuan Zhanwei, Zhang Kai
  • for: Improving the accuracy of news recommendation systems, addressing the difficulty traditional methods have in understanding complex news texts as well as the long-tail problem.
  • methods: combining Large Language Models (LLM) and Knowledge Graphs (KG) into semantic representations of traditional methods, using LLMs’ powerful text understanding ability to generate news representations containing rich semantic information, and combining information about news entities and mining high-order structural information through multiple hops in KG.
  • results: compared with various traditional models, the framework significantly improves the recommendation effect, and the successful integration of LLM and KG in the framework has established a feasible path for achieving more accurate personalized recommendations in the news field.
    Abstract Accurately recommending candidate news articles to users is a basic challenge faced by personalized news recommendation systems. Traditional methods are usually difficult to grasp the complex semantic information in news texts, resulting in unsatisfactory recommendation results. Besides, these traditional methods are more friendly to active users with rich historical behaviors. However, they can not effectively solve the "long tail problem" of inactive users. To address these issues, this research presents a novel general framework that combines Large Language Models (LLM) and Knowledge Graphs (KG) into semantic representations of traditional methods. In order to improve semantic understanding in complex news texts, we use LLMs' powerful text understanding ability to generate news representations containing rich semantic information. In addition, our method combines the information about news entities and mines high-order structural information through multiple hops in KG, thus alleviating the challenge of long tail distribution. Experimental results demonstrate that compared with various traditional models, the framework significantly improves the recommendation effect. The successful integration of LLM and KG in our framework has established a feasible path for achieving more accurate personalized recommendations in the news field. Our code is available at https://github.com/Xuan-ZW/LKPNR.
    摘要 准确地向用户推荐候选新闻是个性化新闻推荐系统面临的基本挑战。传统方法通常难以把握新闻文本中复杂的语义信息,导致推荐效果不理想;此外,这些方法更适合具有丰富历史行为的活跃用户,无法有效解决不活跃用户的“长尾问题”。为解决这些问题,本研究提出了一种将大语言模型(LLM)和知识图谱(KG)融入传统方法语义表示的通用框架。为了提升对复杂新闻文本的语义理解,我们利用LLM强大的文本理解能力生成包含丰富语义信息的新闻表示;同时,我们的方法结合新闻实体信息,通过在KG中进行多跳挖掘高阶结构信息,从而缓解长尾分布带来的挑战。实验结果表明,与多种传统模型相比,该框架显著提升了推荐效果。LLM与KG在框架中的成功融合,为新闻领域实现更精准的个性化推荐建立了一条可行路径。我们的代码可在 https://github.com/Xuan-ZW/LKPNR 获取。
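
As a rough illustration of the fusion idea (not the authors' architecture), the sketch below concatenates an LLM-derived text embedding with an aggregate of multi-hop KG entity embeddings; every name here (`kg`, `entity_emb`, the 4- and 8-dimensional toy vectors) is a hypothetical stand-in.

```python
import numpy as np

def multi_hop_entity_feature(kg, entity_emb, entities, hops=2):
    """Average the embeddings of all entities reachable within `hops` steps in the KG."""
    frontier, seen = set(entities), set(entities)
    for _ in range(hops):
        frontier = {nb for e in frontier for nb in kg.get(e, [])} - seen
        seen |= frontier
    vecs = [entity_emb[e] for e in seen if e in entity_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(next(iter(entity_emb.values())).shape)

def news_representation(llm_text_vec, kg, entity_emb, entities):
    """Concatenate the LLM text embedding with the aggregated KG entity feature."""
    return np.concatenate([llm_text_vec, multi_hop_entity_feature(kg, entity_emb, entities)])

# Toy usage with made-up entities and a 4-dim entity embedding space.
kg = {"team_a": ["league_x"], "league_x": ["sport_y"]}
entity_emb = {e: np.random.rand(4) for e in ["team_a", "league_x", "sport_y"]}
vec = news_representation(np.random.rand(8), kg, entity_emb, ["team_a"])
```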

From Instructions to Intrinsic Human Values – A Survey of Alignment Goals for Big Models

  • paper_url: http://arxiv.org/abs/2308.12014
  • repo_url: None
  • paper_authors: Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, Xing Xie
  • for: 本研究旨在探讨现有工作中的各种Alignment Goals,以帮助确定最重要的目标。
  • methods: 本研究从两个角度 investigate了现有工作:一是对Alignment Goals的定义,二是对Alignment evaluation的研究。
  • results: 研究发现了三级别的Alignment Goals,并发现了目标转化从基本能力到价值观,这表明了可以利用内在人类价值作为Enhanced LLMs的Alignment goal。
    Abstract Big models, exemplified by Large Language Models (LLMs), are models typically pre-trained on massive data and comprised of enormous parameters, which not only obtain significantly improved performance across diverse tasks but also present emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human lives poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans to make them better follow user instructions and satisfy human preferences. Nevertheless, `what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal. Particularly, we investigate related works from two perspectives: the definition of alignment goals and alignment evaluation. Our analysis encompasses three distinct levels of alignment goals and reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on such results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.
    摘要 大型模型,如大语言模型(LLMs),通常是在海量数据上预训练、拥有巨量参数的模型,不仅在多种任务上取得显著提升的性能,还表现出小模型所不具备的涌现能力。然而,大型模型与人类日常生活的日益交织也可能带来潜在风险,甚至造成严重的社会危害。因此,已有许多工作致力于使LLMs与人类对齐,使其更好地遵从用户指令并满足人类偏好。然而,“与什么对齐”这一问题尚未得到充分讨论,不恰当的对齐目标甚至可能适得其反。在这篇论文中,我们对现有工作中不同的对齐目标进行了全面综述,并追踪其演化路径,以帮助确定最本质的目标。特别地,我们从两个视角考察相关工作:对齐目标的定义和对齐评估。我们的分析涵盖了三个层次的对齐目标,并揭示了对齐目标从基本能力向价值取向的转变,表明内在人类价值有潜力成为增强LLMs的对齐目标。基于这些结果,我们进一步讨论了实现这种内在价值对齐的挑战,并提供了可用于未来大模型对齐研究的资源汇总。

Quantum-Noise-driven Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.12013
  • repo_url: None
  • paper_authors: Marco Parigi, Stefano Martina, Filippo Caruso
  • for: 这个论文旨在提出并讨论扩散模型的量子推广,即量子噪声驱动的生成扩散模型,用于生成复杂的数据分布。
  • methods: 该论文使用机器学习技术实现生成模型,并利用量子过程中相干性、纠缠和噪声之间的非平凡相互作用来克服经典扩散模型在推断阶段的计算负担。
  • results: 该论文预计可以开拓新的量子感知或量子基于的生成扩散算法,用于解决经典任务,如数据生成/预测,并具有广泛的实际应用,如气候预测、神经科学、交通流量分析和财务预测。
    Abstract Generative models realized with machine learning techniques are powerful tools to infer complex and unknown data distributions from a finite number of training samples in order to produce new synthetic data. Diffusion models are an emerging framework that have recently overcome the performance of the generative adversarial networks in creating synthetic text and high-quality images. Here, we propose and discuss the quantum generalization of diffusion models, i.e., three quantum-noise-driven generative diffusion models that could be experimentally tested on real quantum systems. The idea is to harness unique quantum features, in particular the non-trivial interplay among coherence, entanglement and noise that the currently available noisy quantum processors do unavoidably suffer from, in order to overcome the main computational burdens of classical diffusion models during inference. Hence, we suggest to exploit quantum noise not as an issue to be detected and solved but instead as a very remarkably beneficial key ingredient to generate much more complex probability distributions that would be difficult or even impossible to express classically, and from which a quantum processor might sample more efficiently than a classical one. Therefore, our results are expected to pave the way for new quantum-inspired or quantum-based generative diffusion algorithms addressing more powerfully classical tasks as data generation/prediction with widespread real-world applications ranging from climate forecasting to neuroscience, from traffic flow analysis to financial forecasting.
    摘要 通过机器学习技术实现的生成模型是一种强大的工具,可以从有限的训练样本中推断出复杂而未知的数据分布,进而生成新的合成数据。扩散模型是一种新兴框架,最近已在合成文本和高质量图像生成方面超越了生成对抗网络。在这里,我们提出并讨论了扩散模型的量子推广,即三种量子噪声驱动的生成扩散模型,可以在真实的量子系统上进行实验验证。我们的想法是利用量子特有的性质,特别是当前含噪量子处理器不可避免的相干性、纠缠与噪声之间的非平凡相互作用,以克服经典扩散模型在推断阶段的主要计算负担。因此,我们建议不要把量子噪声视为需要检测和消除的问题,而是将其作为非常有益的关键组分,用于生成经典方法难以甚至无法表达的更复杂的概率分布,而量子处理器有望比经典处理器更高效地对这些分布进行采样。我们的结果预计将为新的量子启发或基于量子的生成扩散算法铺平道路,使其更有力地解决数据生成/预测等经典任务,并广泛应用于气候预测、神经科学、交通流量分析和金融预测等实际领域。

Trustworthy Representation Learning Across Domains

  • paper_url: http://arxiv.org/abs/2308.12315
  • repo_url: None
  • paper_authors: Ronghang Zhu, Dongliang Guo, Daiqing Qi, Zhixuan Chu, Xiang Yu, Sheng Li
  • for: 这个论文的目的是提出一个可靠的表示学习框架,以适应实际应用场景中的跨domain问题。
  • methods: 该论文围绕四个概念,即鲁棒性(Robustness)、隐私(Privacy)、公平性(Fairness)和可解释性(Explainability),提供了一个全面的文献综述。
  • results: 该论文提出了一个基于这四个概念的可信跨领域表示学习框架,并对现有方法进行了归纳和分析。
    Abstract As AI systems have obtained significant performance to be deployed widely in our daily live and human society, people both enjoy the benefits brought by these technologies and suffer many social issues induced by these systems. To make AI systems good enough and trustworthy, plenty of researches have been done to build guidelines for trustworthy AI systems. Machine learning is one of the most important parts for AI systems and representation learning is the fundamental technology in machine learning. How to make the representation learning trustworthy in real-world application, e.g., cross domain scenarios, is very valuable and necessary for both machine learning and AI system fields. Inspired by the concepts in trustworthy AI, we proposed the first trustworthy representation learning across domains framework which includes four concepts, i.e, robustness, privacy, fairness, and explainability, to give a comprehensive literature review on this research direction. Specifically, we first introduce the details of the proposed trustworthy framework for representation learning across domains. Second, we provide basic notions and comprehensively summarize existing methods for the trustworthy framework from four concepts. Finally, we conclude this survey with insights and discussions on future research directions.

Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations

  • paper_url: http://arxiv.org/abs/2308.11995
  • repo_url: https://github.com/alexa/Topical-Chat
  • paper_authors: Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tur
  • for: 这个论文的目的是提供一个基于知识的人机对话集,帮助开发更加深入、有趣的人机对话AI。
  • methods: 论文使用了知识基础的人机对话集,并在这个集合中采用了无显式角色的对话方式。
  • results: 论文通过对这个知识基础的人机对话集进行自动和人工评价,提出了一些state-of-the-art的对话模型。
    Abstract Building socialbots that can have deep, engaging open-domain conversations with humans is one of the grand challenges of artificial intelligence (AI). To this end, bots need to be able to leverage world knowledge spanning several domains effectively when conversing with humans who have their own world knowledge. Existing knowledge-grounded conversation datasets are primarily stylized with explicit roles for conversation partners. These datasets also do not explore depth or breadth of topical coverage with transitions in conversations. We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don't have explicitly defined roles, to help further research in open-domain conversational AI. We also train several state-of-the-art encoder-decoder conversational models on Topical-Chat and perform automated and human evaluation for benchmarking.
    摘要 建立社交机器人,能够与人类进行深入有趣的开放领域对话,是人工智能(AI)的极大挑战之一。为此,机器人需要能够有效地利用多个领域的世界知识进行对话。现有的知识基础对话数据集主要是通过显式角色定义对话伙伴进行预设。这些数据集还不探讨对话的深度或广度,也没有探讨对话的转变。我们介绍Topical-Chat,一个基于知识的人类对话数据集,其下面知识覆盖8个广泛的主题,对话伙伴没有显式定义角色,以便进一步推动开放领域对话AI的研究。我们还在Topical-Chat上训练了多种当今最佳encoder-decoder对话模型,并进行自动和人类评估,以便作为参考。

Critical Evaluation of Artificial Intelligence as Digital Twin of Pathologist for Prostate Cancer Pathology

  • paper_url: http://arxiv.org/abs/2308.11992
  • repo_url: None
  • paper_authors: Okyaz Eminaga, Mahmoud Abbas, Christian Kunder, Yuri Tolkach, Ryan Han, James D. Brooks, Rosalie Nolley, Axel Semjonow, Martin Boegemann, Robert West, Jin Long, Richard Fan, Olaf Bettendorf
  • for: 这项研究旨在评测一种基于人工智能的病理医生数字孪生技术(vPatho),用于前列腺癌的检测与分级。
  • methods: 研究使用了 2,603 张苏木精-伊红(H&E)染色的前列腺组织学图像,并分析了影响肿瘤分级一致性的多种因素。
  • results: 研究发现,vPatho 在前列腺癌检测和肿瘤体积估计方面可与人类病理医生相当;但在肿瘤分级方面,vPatho 与人类病理医生之间存在一定的不一致。此外,研究还识别出若干可能导致分级不一致的因素,如肿瘤向前列腺边界的纵向延伸以及含癌切片的比例。
    Abstract Prostate cancer pathology plays a crucial role in clinical management but is time-consuming. Artificial intelligence (AI) shows promise in detecting prostate cancer and grading patterns. We tested an AI-based digital twin of a pathologist, vPatho, on 2,603 histology images of prostate tissue stained with hematoxylin and eosin. We analyzed various factors influencing tumor-grade disagreement between vPatho and six human pathologists. Our results demonstrated that vPatho achieved comparable performance in prostate cancer detection and tumor volume estimation, as reported in the literature. Concordance levels between vPatho and human pathologists were examined. Notably, moderate to substantial agreement was observed in identifying complementary histological features such as ductal, cribriform, nerve, blood vessels, and lymph cell infiltrations. However, concordance in tumor grading showed a decline when applied to prostatectomy specimens (kappa = 0.44) compared to biopsy cores (kappa = 0.70). Adjusting the decision threshold for the secondary Gleason pattern from 5% to 10% improved the concordance level between pathologists and vPatho for tumor grading on prostatectomy specimens (kappa from 0.44 to 0.64). Potential causes of grade discordance included the vertical extent of tumors toward the prostate boundary and the proportions of slides with prostate cancer. Gleason pattern 4 was particularly associated with discordance. Notably, grade discordance with vPatho was not specific to any of the six pathologists involved in routine clinical grading. In conclusion, our study highlights the potential utility of AI in developing a digital twin of a pathologist. This approach can help uncover limitations in AI adoption and the current grading system for prostate cancer pathology.
    摘要 前列腺癌病理在临床管理中发挥关键作用,但非常耗时。人工智能(AI)在检测前列腺癌及其分级模式方面展现出潜力。我们在2,603张苏木精-伊红染色的前列腺组织切片图像上测试了基于AI的病理医生数字孪生vPatho,并分析了影响vPatho与六位人类病理医生之间肿瘤分级分歧的各种因素。结果表明,vPatho在前列腺癌检测和肿瘤体积估计方面达到了与文献报道相当的性能。我们考察了vPatho与人类病理医生之间的一致性:在识别导管、筛状结构、神经、血管及淋巴细胞浸润等互补组织学特征方面,观察到中等至较高的一致性;然而在肿瘤分级方面,一致性在前列腺切除标本上(kappa = 0.44)低于穿刺活检标本(kappa = 0.70)。将次要Gleason模式的判定阈值从5%调整到10%后,前列腺切除标本上病理医生与vPatho的分级一致性得到提升(kappa从0.44提高到0.64)。导致分级分歧的潜在原因包括肿瘤向前列腺边界的纵向延伸以及含癌切片的比例,其中Gleason模式4与分歧尤为相关。值得注意的是,与vPatho的分级分歧并不特定于参与常规临床分级的六位病理医生中的任何一位。总之,本研究凸显了AI在构建病理医生数字孪生方面的潜在价值;这一方法有助于揭示AI应用及现行前列腺癌病理分级体系的局限性。
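
Since the evaluation above centers on chance-corrected agreement and a secondary-pattern threshold, here is a small illustrative Python sketch: `cohen_kappa_score` is the standard scikit-learn routine behind the kappa statistics quoted above, while `secondary_pattern` is a simplified, hypothetical reading of the 5%/10% threshold rule (not the paper's exact grading criterion), and the grade lists are toy data.

```python
from sklearn.metrics import cohen_kappa_score

def secondary_pattern(percentages, threshold=0.05):
    """Pick a secondary Gleason pattern: the highest-grade pattern whose area share
    exceeds `threshold`, excluding the primary (most abundant) pattern.
    `percentages` maps pattern (3, 4, 5) -> fraction of tumor area. Illustrative rule only."""
    primary = max(percentages, key=percentages.get)
    candidates = [p for p, share in percentages.items() if p != primary and share >= threshold]
    return max(candidates) if candidates else primary

# Agreement between model-assigned and pathologist-assigned grade groups on the same cases (toy data).
model_grades = [1, 2, 2, 3, 5, 4, 2, 1]
pathologist_grades = [1, 2, 3, 3, 5, 4, 2, 2]
print(f"kappa = {cohen_kappa_score(model_grades, pathologist_grades):.2f}")
print(secondary_pattern({3: 0.60, 4: 0.08, 5: 0.0}, threshold=0.05))  # -> 4
print(secondary_pattern({3: 0.60, 4: 0.08, 5: 0.0}, threshold=0.10))  # -> 3
```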

Relational Concept Based Models

  • paper_url: http://arxiv.org/abs/2308.11991
  • repo_url: https://github.com/Aghoreshwar/Awesome-Customer-Analytics
  • paper_authors: Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra
  • for: 这个论文的目的是解决关系领域中的深度学习模型可读性问题,这些模型不是专门设计来解决关系问题,而且关系模型不如概念基础模型(CBMs)那样可读性。
  • methods: 作者提议了一种名为关系概念基础模型(Relational CBMs)的家族关系深度学习方法,这些方法可以在关系领域中提供可读性的任务预测。
  • results: 作者的实验表明,关系CBMs可以与现有的关系黑盒模型相比,在图像分类和知识图谱链接预测等问题上达到同等的泛化性能,同时支持生成量化的基于概念的解释,能够有效应对测试时干预,并在分布外场景、有限训练数据和稀缺概念监督等苛刻条件下保持稳定。
    Abstract The design of interpretable deep learning models working in relational domains poses an open challenge: interpretable deep learning methods, such as Concept-Based Models (CBMs), are not designed to solve relational problems, while relational models are not as interpretable as CBMs. To address this problem, we propose Relational Concept-Based Models, a family of relational deep learning methods providing interpretable task predictions. Our experiments, ranging from image classification to link prediction in knowledge graphs, show that relational CBMs (i) match generalization performance of existing relational black-boxes (as opposed to non-relational CBMs), (ii) support the generation of quantified concept-based explanations, (iii) effectively respond to test-time interventions, and (iv) withstand demanding settings including out-of-distribution scenarios, limited training data regimes, and scarce concept supervisions.
    摘要 在关系领域设计可解释的深度学习模型仍是一个开放性挑战:可解释深度学习方法,如基于概念的模型(CBMs),并非为解决关系问题而设计,而关系模型又不如CBMs可解释。为解决这一问题,我们提出了关系概念模型(Relational Concept-Based Models),这是一类能够给出可解释任务预测的关系深度学习方法。我们的实验(从图像分类到知识图谱链接预测)表明,关系CBMs(i)能够达到与现有关系黑盒模型相当的泛化性能(这是非关系CBMs做不到的),(ii)支持生成量化的基于概念的解释,(iii)能够有效响应测试时干预,(iv)并可承受分布外场景、有限训练数据和稀缺概念监督等苛刻设置。
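
For readers unfamiliar with the concept-bottleneck idea the relational variant builds on, below is a minimal, non-relational PyTorch sketch (assumed layer sizes, toy labels, not the paper's model): inputs are mapped to interpretable concept activations, and the task prediction is computed from those concepts alone.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Plain concept bottleneck: input -> concept scores -> task label.
    The relational CBM in the paper additionally reasons over relations between entities."""
    def __init__(self, in_dim, n_concepts, n_classes):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_concepts))
        self.task_net = nn.Linear(n_concepts, n_classes)  # predicts from concepts only

    def forward(self, x):
        concepts = torch.sigmoid(self.concept_net(x))      # human-interpretable activations
        return concepts, self.task_net(concepts)

model = ConceptBottleneckModel(in_dim=16, n_concepts=5, n_classes=3)
x = torch.randn(4, 16)
concepts, logits = model(x)
# Joint objective shape: supervise concepts (toy random labels here) and the downstream task.
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 2, 0])) \
     + nn.functional.binary_cross_entropy(concepts, torch.randint(0, 2, (4, 5)).float())
```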

Will More Expressive Graph Neural Networks do Better on Generative Tasks?

  • paper_url: http://arxiv.org/abs/2308.11978
  • repo_url: None
  • paper_authors: Xiandong Zou, Xiangyu Zhao, Pietro Liò, Yiren Zhao
  • for: 本研究的目的是探讨 Graph Neural Network (GNN) 在分子图生成任务中的表达能力,并将 GNN 应用于两种不同的生成框架(GCPN 和 GraphAF)中。
  • methods: 本研究在两种生成框架(GCPN 和 GraphAF)中替换底层 GNN,对六种表达能力不同的 GNN 架构在 ZINC-250k 数据集上的六个分子生成目标上进行了比较。
  • results: 研究发现,使用表达能力更强的 GNN 可以提高 GCPN 和 GraphAF 在分子图生成任务中的表现,但 GNN 的表达能力并不是好的基于 GNN 的生成模型的必要条件。此外,配备更强 GNN 的 GCPN 和 GraphAF 能够在所提出的分子生成目标(DRD2、Median1、Median2)上超越 17 种非 GNN 的图生成方法(如变分自编码器和贝叶斯优化模型),取得最先进的结果。
    Abstract Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs in two different generative frameworks (GCPN and GraphAF), on six different molecular generative objectives on the ZINC-250k dataset. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN and GraphAF on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.
    摘要 图生成是一项重大挑战,因为它需要仅根据给定的标签预测包含多个节点和边的完整图。该任务对许多现实应用(如新药和分子的从头设计)也具有根本性的重要意义。近年来,图生成领域涌现出许多成功的方法,但它们存在两大不足:(1)这些方法所采用的底层图神经网络(GNN)架构往往没有得到充分探索;(2)这些方法通常只在有限的若干指标上进行评估。为填补这一空白,我们在分子图生成任务中研究GNN的表达能力,将图生成模型的底层GNN替换为表达能力更强的GNN。具体而言,我们在ZINC-250k数据集上,针对六个分子生成目标,分析了六种GNN在两种生成框架(GCPN和GraphAF)中的表现。大量实验表明,更先进的GNN确实可以提升GCPN和GraphAF在分子生成任务上的表现,但GNN的表达能力并不是好的基于GNN的生成模型的必要条件。此外,我们还展示了配备先进GNN的GCPN和GraphAF能够在所提出的分子生成目标(DRD2、Median1、Median2,这些都是从头分子设计的重要指标)上,超越17种非GNN的图生成方法(如变分自编码器和贝叶斯优化模型),取得最先进的结果。
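
As a concrete example of what "a more expressive GNN" looks like in code, here is a minimal PyTorch sketch of a single GIN layer in its dense-adjacency form; the dimensions and toy graph are made up, and this is not tied to the GCPN/GraphAF implementations studied in the paper.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One Graph Isomorphism Network layer: h_v' = MLP((1 + eps) * h_v + sum of neighbor h_u)."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency matrix.
        neighbor_sum = adj @ h
        return self.mlp((1 + self.eps) * h + neighbor_sum)

# Toy molecular graph: 4 atoms in a chain, 8-dim node features.
h = torch.randn(4, 8)
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
h_next = GINLayer(8)(h, adj)
```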

Approximating Score-based Explanation Techniques Using Conformal Regression

  • paper_url: http://arxiv.org/abs/2308.11975
  • repo_url: None
  • paper_authors: Amr Alkhatib, Henrik Boström, Sofiane Ennadir, Ulf Johansson
  • for: 本论文旨在解释黑盒模型背后的逻辑。
  • methods: 这些 papers 使用了 computationally costly 的 explanation techniques, such as SHAP, 并提出了一种使用 computationally less costly regression models 来近似 score-based explanation techniques 的方法。
  • results: 这些 papers 提出了一些 non-conformity measures 来考虑 approximating explanations 的困难度,并在大规模的 empirical investigation 中证明了其效果。 Results 表明,提出的方法可以significantly improve execution time compared to fast version of SHAP, TreeSHAP, 并且可以生成紧凑的 interval。
    Abstract Score-based explainable machine-learning techniques are often used to understand the logic behind black-box models. However, such explanation techniques are often computationally expensive, which limits their application in time-critical contexts. Therefore, we propose and investigate the use of computationally less costly regression models for approximating the output of score-based explanation techniques, such as SHAP. Moreover, validity guarantees for the approximated values are provided by the employed inductive conformal prediction framework. We propose several non-conformity measures designed to take the difficulty of approximating the explanations into account while keeping the computational cost low. We present results from a large-scale empirical investigation, in which the approximate explanations generated by our proposed models are evaluated with respect to efficiency (interval size). The results indicate that the proposed method can significantly improve execution time compared to the fast version of SHAP, TreeSHAP. The results also suggest that the proposed method can produce tight intervals, while providing validity guarantees. Moreover, the proposed approach allows for comparing explanations of different approximation methods and selecting a method based on how informative (tight) are the predicted intervals.
    摘要 基于分数的可解释机器学习技术常被用来理解黑盒模型背后的逻辑。然而,这类解释技术通常计算开销较大,限制了它们在时间敏感场景中的应用。因此,我们提出并研究使用计算开销更低的回归模型来近似基于分数的解释技术(如SHAP)的输出,并借助归纳式一致性预测(inductive conformal prediction)框架为近似值提供有效性保证。我们还设计了若干非一致性度量,在保持低计算成本的同时考虑近似解释的难度。我们在大规模实证研究中依据区间大小(效率)评估了所生成的近似解释。结果表明,所提方法相比SHAP的快速版本TreeSHAP能显著缩短执行时间,并且在提供有效性保证的同时生成紧凑的区间。此外,该方法还允许比较不同近似方法的解释,并根据预测区间的紧凑程度(信息量)来选择方法。
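
A minimal sketch of the split (inductive) conformal recipe described above, assuming the "true" explanation scores have already been computed offline: the random-forest approximator, the plain absolute-residual nonconformity score, and the toy data are illustrative stand-ins for the paper's more refined nonconformity measures.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal_intervals(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal regression around a cheap approximator of explanation scores.
    y_* would hold precomputed SHAP values for one feature of each instance."""
    approx = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Nonconformity scores on the calibration set: absolute residuals.
    residuals = np.abs(y_cal - approx.predict(X_cal))
    n = len(residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(residuals, q_level)

    preds = approx.predict(X_test)
    return preds - q, preds + q  # intervals valid with probability >= 1 - alpha

# Toy data standing in for (instance features, SHAP value of feature 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 0.8 * X[:, 0] + 0.1 * rng.normal(size=600)
lo, hi = split_conformal_intervals(X[:400], y[:400], X[400:500], y[400:500], X[500:])
```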

Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.11974
  • repo_url: None
  • paper_authors: Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, Taehyeong Kim
  • for: 本研究旨在提出一种基于NeRF的模型,用于文本驱动地地方化编辑3D对象,以实现在文本提示中指定的本地修改。
  • methods: 该模型包含两个NeRF网络:预训练NeRF和可编辑NeRF,以及新的混合操作。使用CLIP模型进行视觉-语言对齐,引导Blending-NeRF在文本提示下添加新物体、修改纹理以及移除原对象的部分内容。
  • results: 我们的广泛实验表明,Blending-NeRF模型能够自然地和地方化地编辑3D对象,从多种文本提示中生成修改后的结果。
    Abstract Text-driven localized editing of 3D objects is particularly difficult as locally mixing the original 3D object with the intended new object and style effects without distorting the object's form is not a straightforward process. To address this issue, we propose a novel NeRF-based model, Blending-NeRF, which consists of two NeRF networks: pretrained NeRF and editable NeRF. Additionally, we introduce new blending operations that allow Blending-NeRF to properly edit target regions which are localized by text. By using a pretrained vision-language aligned model, CLIP, we guide Blending-NeRF to add new objects with varying colors and densities, modify textures, and remove parts of the original object. Our extensive experiments demonstrate that Blending-NeRF produces naturally and locally edited 3D objects from various text prompts.

Value of Assistance for Mobile Agents

  • paper_url: http://arxiv.org/abs/2308.11961
  • repo_url: https://github.com/clair-lab-technion/voa
  • paper_authors: Adi Amuzig, David Dovrat, Sarah Keren
  • for: 这篇论文是为了解决移动机器人agent的地理位置uncertainty问题,通过增加协助行为来减少uncertainty。
  • methods: 该论文提出了一种基于Gaussian process的Value of Assistance(VOA)计算方法,用于评估协助行为的效果。
  • results: 研究人员通过实验和实际应用 validate了VOA计算方法,并证明了VOA可以准确预测机器人的成本减少效果。
    Abstract Mobile robotic agents often suffer from localization uncertainty which grows with time and with the agents' movement. This can hinder their ability to accomplish their task. In some settings, it may be possible to perform assistive actions that reduce uncertainty about a robot's location. For example, in a collaborative multi-robot system, a wheeled robot can request assistance from a drone that can fly to its estimated location and reveal its exact location on the map or accompany it to its intended location. Since assistance may be costly and limited, and may be requested by different members of a team, there is a need for principled ways to support the decision of which assistance to provide to an agent and when, as well as to decide which agent to help within a team. For this purpose, we propose Value of Assistance (VOA) to represent the expected cost reduction that assistance will yield at a given point of execution. We offer ways to compute VOA based on estimations of the robot's future uncertainty, modeled as a Gaussian process. We specify conditions under which our VOA measures are valid and empirically demonstrate the ability of our measures to predict the agent's average cost reduction when receiving assistance in both simulated and real-world robotic settings.

Physics informed Neural Networks applied to the description of wave-particle resonance in kinetic simulations of fusion plasmas

  • paper_url: http://arxiv.org/abs/2308.12312
  • repo_url: None
  • paper_authors: Jai Kumar, David Zarzoso, Virginie Grandgirard, Jan Ebert, Stefan Kesselheim
  • for: 这篇论文使用弗拉索夫-泊松(Vlasov-Poisson)系统的简化版本(1D1V)作为物理信息神经网络(PINN)在波-粒子共振问题上适用性的试验平台,考察了朗道阻尼和尾部隆起(bump-on-tail)不稳定性两个算例。
  • methods: 论文首先将PINN作为Vlasov-Poisson系统解的压缩方法进行测试,并与标准神经网络进行比较;其次,论文展示了用PINN直接求解Vlasov-Poisson系统,并特别强调其中的积分部分,由此引出一种基于自动微分求解偏微分方程、基于自动积分求解积分方程的PINN变体,称为可积分PINN(I-PINN)。
  • results: 结果表明,PINN既可以作为解的压缩方法,也可以用于求解Vlasov-Poisson系统;I-PINN进一步解决了积分部分的处理问题。
    Abstract The Vlasov-Poisson system is employed in its reduced form version (1D1V) as a test bed for the applicability of Physics Informed Neural Network (PINN) to the wave-particle resonance. Two examples are explored: the Landau damping and the bump-on-tail instability. PINN is first tested as a compression method for the solution of the Vlasov-Poisson system and compared to the standard neural networks. Second, the application of PINN to solving the Vlasov-Poisson system is also presented with the special emphasis on the integral part, which motivates the implementation of a PINN variant, called Integrable PINN (I-PINN), based on the automatic-differentiation to solve the partial differential equation and on the automatic-integration to solve the integral equation.
    摘要 本文采用简化形式(1D1V)的Vlasov-Poisson系统作为测试平台,检验物理信息神经网络(PINN)在波-粒子共振问题上的适用性,探讨了两个算例:朗道阻尼和尾部隆起不稳定性。首先,将PINN作为Vlasov-Poisson系统解的压缩方法进行测试,并与标准神经网络进行比较;其次,给出了用PINN求解Vlasov-Poisson系统的应用,并特别强调其中的积分部分,由此实现了一种基于自动微分求解偏微分方程、基于自动积分求解积分方程的PINN变体,称为可积分PINN(I-PINN)。
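
To make the PINN idea concrete, here is a minimal PyTorch sketch of a physics-informed residual loss for the collisionless free-streaming part of the 1D1V system (the electric-field term and Poisson's equation are omitted); the network size and collocation points are arbitrary choices, not the paper's setup.

```python
import torch
import torch.nn as nn

# Small fully connected network f_theta(t, x, v) approximating the distribution function.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def pinn_residual(t, x, v):
    """Residual of the free-streaming term df/dt + v * df/dx = 0.
    The full Vlasov-Poisson system adds the E * df/dv term and Poisson's integral constraint."""
    inp = torch.stack([t, x, v], dim=-1)
    f = net(inp).squeeze(-1)
    df_dt, df_dx = torch.autograd.grad(f, (t, x), grad_outputs=torch.ones_like(f),
                                       create_graph=True)
    return df_dt + v * df_dx

# Collocation points; requires_grad so autograd can differentiate w.r.t. them.
t = torch.rand(256, requires_grad=True)
x = torch.rand(256, requires_grad=True)
v = 2 * torch.rand(256) - 1
loss = (pinn_residual(t, x, v) ** 2).mean()  # add data / initial-condition terms in practice
loss.backward()
```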

Maintaining Plasticity via Regenerative Regularization

  • paper_url: http://arxiv.org/abs/2308.11958
  • repo_url: None
  • paper_authors: Saurabh Kumar, Henrik Marklund, Benjamin Van Roy
  • for: 解决神经网络在处理非平稳数据流时可塑性(plasticity)下降的问题。
  • methods: 提出了L2 Init方法,即在损失函数中加入朝向初始参数的L2正则项,以保持可塑性。
  • results: 在代表不同类型非平稳性的简单问题上,L2 Init能持续缓解可塑性损失,同时减小参数幅值并保持较高的有效特征秩。
    Abstract In continual learning, plasticity refers to the ability of an agent to quickly adapt to new information. Neural networks are known to lose plasticity when processing non-stationary data streams. In this paper, we propose L2 Init, a very simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters. This is very similar to standard L2 regularization (L2), the only difference being that L2 regularizes toward the origin. L2 Init is simple to implement and requires selecting only a single hyper-parameter. The motivation for this method is the same as that of methods that reset neurons or parameter values. Intuitively, when recent losses are insensitive to particular parameters, these parameters drift toward their initial values. This prepares parameters to adapt quickly to new tasks. On simple problems representative of different types of nonstationarity in continual learning, we demonstrate that L2 Init consistently mitigates plasticity loss. We additionally find that our regularization term reduces parameter magnitudes and maintains a high effective feature rank.
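
The regularizer described above is essentially a one-liner; the following PyTorch sketch (hypothetical `strength` hyperparameter, toy data) penalizes drift away from a snapshot of the initial parameters rather than away from zero.

```python
import torch

def l2_init_penalty(model, init_params, strength=1e-3):
    """Regenerative regularization sketch: penalize drift away from the *initial*
    parameter values instead of the origin (ordinary L2 / weight decay)."""
    penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
    return strength * penalty

model = torch.nn.Linear(10, 2)
init_params = [p.detach().clone() for p in model.parameters()]  # snapshot at initialization

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y) + l2_init_penalty(model, init_params)
loss.backward()
```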

When MiniBatch SGD Meets SplitFed Learning:Convergence Analysis and Performance Evaluation

  • paper_url: http://arxiv.org/abs/2308.11953
  • repo_url: None
  • paper_authors: Chao Huang, Geng Tian, Ming Tang
  • for: 这个论文的目的是提出一种名为MiniBatch-SFL的新的分布式学习方法,以解决在分布式学习中发生的“客户端漂移”问题。
  • methods: 这个方法利用了MiniBatch SGD和分布式学习的概念,在客户端和服务器之间分成了两部分的模型,让客户端只需要训练部分模型,以减少 computation workload。
  • results: 这个方法可以提高分布式学习的精度,尤其是在非同一的数据时。在实验中,MiniBatch-SFL比传统的分布式学习和Federated learning方法提高了精度,具体来说,可以提高24.1%和17.1%。
    Abstract Federated learning (FL) enables collaborative model training across distributed clients (e.g., edge devices) without sharing raw data. Yet, FL can be computationally expensive as the clients need to train the entire model multiple times. SplitFed learning (SFL) is a recent distributed approach that alleviates computation workload at the client device by splitting the model at a cut layer into two parts, where clients only need to train part of the model. However, SFL still suffers from the \textit{client drift} problem when clients' data are highly non-IID. To address this issue, we propose MiniBatch-SFL. This algorithm incorporates MiniBatch SGD into SFL, where the clients train the client-side model in an FL fashion while the server trains the server-side model similar to MiniBatch SGD. We analyze the convergence of MiniBatch-SFL and show that the bound of the expected loss can be obtained by analyzing the expected server-side and client-side model updates, respectively. The server-side updates do not depend on the non-IID degree of the clients' datasets and can potentially mitigate client drift. However, the client-side model relies on the non-IID degree and can be optimized by properly choosing the cut layer. Perhaps counter-intuitive, our empirical result shows that a latter position of the cut layer leads to a smaller average gradient divergence and a better algorithm performance. Moreover, numerical results show that MiniBatch-SFL achieves higher accuracy than conventional SFL and FL. The accuracy improvement can be up to 24.1\% and 17.1\% with highly non-IID data, respectively.
    摘要 联邦学习(FL)可以在分布式客户端(例如边缘设备)上协同训练模型,而无需共享原始数据。然而,FL的计算开销可能很大,因为客户端需要多次训练整个模型。SplitFed学习(SFL)是一种最近提出的分布式方法,它在某个切分层将模型分为两部分,客户端只需训练模型的一部分,从而减轻客户端设备上的计算负担。然而,当客户端数据高度非独立同分布(non-IID)时,SFL仍然会遭受“客户端漂移”问题。为解决这一问题,我们提出了MiniBatch-SFL。该算法将MiniBatch SGD融入SFL:客户端以FL的方式训练客户端侧模型,服务器则以类似MiniBatch SGD的方式训练服务器侧模型。我们分析了MiniBatch-SFL的收敛性,证明期望损失的界可以分别通过分析服务器侧和客户端侧模型更新的期望值得到。服务器侧的更新不依赖于客户端数据的non-IID程度,因而有望缓解客户端漂移;而客户端侧模型则依赖于non-IID程度,可以通过恰当选择切分层来优化。或许与直觉相反,我们的实证结果表明,将切分层放在较靠后的位置能够减小平均梯度散度并提升算法性能。此外,数值结果表明,MiniBatch-SFL能取得比传统SFL和FL更高的准确率,在高度non-IID数据上分别最多提升24.1%和17.1%。
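
For intuition about what "splitting the model at a cut layer" means operationally, here is a single-client PyTorch sketch of one split-learning step (network sizes and learning rates are arbitrary); in MiniBatch-SFL proper, the server would aggregate the smashed activations from several clients into one mini-batch before its update.

```python
import torch
import torch.nn as nn

# Model split at a cut layer: the client holds the early layers, the server the rest.
client_model = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
server_model = nn.Sequential(nn.Linear(32, 10))
client_opt = torch.optim.SGD(client_model.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_model.parameters(), lr=0.1)

def splitfed_step(x, y):
    # Client-side forward pass up to the cut layer; activations are "sent" to the server.
    activations = client_model(x)
    smashed = activations.detach().requires_grad_(True)

    # Server-side forward/backward on the received activations.
    loss = nn.functional.cross_entropy(server_model(smashed), y)
    server_opt.zero_grad()
    loss.backward()
    server_opt.step()

    # The gradient w.r.t. the cut-layer activations is "returned" to the client.
    client_opt.zero_grad()
    activations.backward(smashed.grad)
    client_opt.step()
    return loss.item()

print(splitfed_step(torch.randn(16, 20), torch.randint(0, 10, (16,))))
```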

Pose Modulated Avatars from Video

  • paper_url: http://arxiv.org/abs/2308.11951
  • repo_url: None
  • paper_authors: Chunjin Song, Bastian Wandt, Helge Rhodin
  • for: 用于重建动态人体运动和形态,并模型人体的衣物和皮肤塑形。
  • methods: 使用由底层骨架驱动的神经辐射场(NeRF),并开发了一个双分支神经网络,以自适应且显式的方式在频域中建模人体各部位之间的相关性。
  • results: 对比州方法,该方法能够更好地保留细节和总体化能力。
    Abstract It is now possible to reconstruct dynamic human motion and shape from a sparse set of cameras using Neural Radiance Fields (NeRF) driven by an underlying skeleton. However, a challenge remains to model the deformation of cloth and skin in relation to skeleton pose. Unlike existing avatar models that are learned implicitly or rely on a proxy surface, our approach is motivated by the observation that different poses necessitate unique frequency assignments. Neglecting this distinction yields noisy artifacts in smooth areas or blurs fine-grained texture and shape details in sharp regions. We develop a two-branch neural network that is adaptive and explicit in the frequency domain. The first branch is a graph neural network that models correlations among body parts locally, taking skeleton pose as input. The second branch combines these correlation features to a set of global frequencies and then modulates the feature encoding. Our experiments demonstrate that our network outperforms state-of-the-art methods in terms of preserving details and generalization capabilities.
    摘要 现在可以使用神经辐射场(NeRF)和下面的骨架来重建动态人体运动和形状。然而,模拟人体皮肤和衣服的塑形仍然是一个挑战。现有的人物模型通常是通过隐藏的方式学习或者通过代理表面来实现。我们的方法受到不同姿势需要唯一频谱分配的观察所启发。忽略这种分配会导致缺陷的纹理和形状细节。我们开发了一个两极分支神经网络,其中第一极是一个图像神经网络,地方地模型体部之间的相关性,带入骨架姿势作为输入。第二极将这些相关特征与一组全局频率相结合,然后修饰特征编码。我们的实验表明,我们的网络在保持细节和泛化能力方面超越了现有的方法。

High-quality Image Dehazing with Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11949
  • repo_url: None
  • paper_authors: Hu Yu, Jie Huang, Kaiwen Zheng, Man Zhou, Feng Zhao
  • for: 图像去雾,即在浓雾场景下还原雾化图像中丢失的原始信息。
  • methods: 本文提出了一种基于DDPM且物理感知的去雾框架DehazeDDPM:第一阶段使用大气散射模型(ASM)对去雾任务进行物理建模,第二阶段利用DDPM的生成能力补偿雾导致的信息损失。
  • results: 对比实验表明,DehazeDDPM在合成和真实雾化数据集上均达到了领先的表现。
    Abstract Image dehazing is quite challenging in dense-haze scenarios, where quite less original information remains in the hazy image. Though previous methods have made marvelous progress, they still suffer from information loss in content and color in dense-haze scenarios. The recently emerged Denoising Diffusion Probabilistic Model (DDPM) exhibits strong generation ability, showing potential for solving this problem. However, DDPM fails to consider the physics property of dehazing task, limiting its information completion capacity. In this work, we propose DehazeDDPM: A DDPM-based and physics-aware image dehazing framework that applies to complex hazy scenarios. Specifically, DehazeDDPM works in two stages. The former stage physically models the dehazing task with the Atmospheric Scattering Model (ASM), pulling the distribution closer to the clear data and endowing DehazeDDPM with fog-aware ability. The latter stage exploits the strong generation ability of DDPM to compensate for the haze-induced huge information loss, by working in conjunction with the physical modelling. Extensive experiments demonstrate that our method attains state-of-the-art performance on both synthetic and real-world hazy datasets.
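
The Atmospheric Scattering Model referenced in the first stage is the standard I = J·t + A·(1 − t); the NumPy sketch below only illustrates that physical model and its inversion on toy data, not the DehazeDDPM network itself.

```python
import numpy as np

def synthesize_haze(J, t, A):
    """Atmospheric Scattering Model: I = J * t + A * (1 - t), transmission t in (0, 1]."""
    return J * t + A * (1.0 - t)

def dehaze_with_asm(I, t, A, t_min=0.1):
    """Invert the ASM given (estimated) transmission and airlight: J = (I - A) / t + A."""
    return (I - A) / np.maximum(t, t_min) + A

# Toy example: a flat transmission map and grey airlight on a random "image".
J = np.random.rand(4, 4, 3)
I = synthesize_haze(J, t=0.4, A=0.8)
J_rec = dehaze_with_asm(I, t=0.4, A=0.8)
assert np.allclose(J, J_rec)
```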

LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11945
  • repo_url: None
  • paper_authors: Siqi Yang, Zejun Yang, Zhisheng Wang
  • for: 这个研究旨在解决长期三维真实舞蹈生成中的静止问题,以提高舞蹈生成的可调和自然性。
  • methods: 我们采用了一个条件扩散模型LongDanceDiff,其输入由音乐、过去动作与加噪后的未来动作拼接而成;并引入互信息最小化目标,以提升生成舞蹈动作的多样性并缓解冻结问题。
  • results: 我们的方法与现有的方法相比,实现了重大的改善,包括增加了舞蹈生成的可调和自然性。我们计划将我们的代码和模型发布给社区。
    Abstract Dancing with music is always an essential human art form to express emotion. Due to the high temporal-spacial complexity, long-term 3D realist dance generation synchronized with music is challenging. Existing methods suffer from the freezing problem when generating long-term dances due to error accumulation and training-inference discrepancy. To address this, we design a conditional diffusion model, LongDanceDiff, for this sequence-to-sequence long-term dance generation, addressing the challenges of temporal coherency and spatial constraint. LongDanceDiff contains a transformer-based diffusion model, where the input is a concatenation of music, past motions, and noised future motions. This partial noising strategy leverages the full-attention mechanism and learns the dependencies among music and past motions. To enhance the diversity of generated dance motions and mitigate the freezing problem, we introduce a mutual information minimization objective that regularizes the dependency between past and future motions. We also address common visual quality issues in dance generation, such as foot sliding and unsmooth motion, by incorporating spatial constraints through a Global-Trajectory Modulation (GTM) layer and motion perceptual losses, thereby improving the smoothness and naturalness of motion generation. Extensive experiments demonstrate a significant improvement in our approach over the existing state-of-the-art methods. We plan to release our codes and models soon.
    摘要 人类常用舞蹈作为表达情感的重要艺术形式。由于高度时空复杂性,长期3D真实舞蹈生成同音乐同步是一项挑战。现有方法受到预测-实际差异和错误积累的问题。为解决这问题,我们设计了一种 conditional diffusion 模型,长 dance diff(LongDanceDiff),用于这种序列到序列长期舞蹈生成任务,解决时间准确性和空间约束的挑战。LongDanceDiff 包括一个基于 transformer 的扩散模型,输入是音乐、过去动作和噪音未来动作的 concatenation。这种 partial noising 策略利用了全程注意机制,学习音乐和过去动作之间的依赖关系。为提高生成舞蹈动作的多样性和减少冻结问题,我们引入了一个 mutual information minimization 目标,规范过去和未来动作之间的依赖关系。我们还通过 incorporating 全球轨迹修饰(GTM)层和运动观察损失,提高生成动作的平滑性和自然性。广泛的实验表明我们的方法在现有状态的方法上显著提高了性能。我们计划 soon 发布我们的代码和模型。

RamseyRL: A Framework for Intelligent Ramsey Number Counterexample Searching

  • paper_url: http://arxiv.org/abs/2308.11943
  • repo_url: None
  • paper_authors: Steve Vott, Adam M. Lehavi
  • for: 本 paper 探讨了使用最佳先搜索算法和强化学习(RL)技术来找到特定 Ramsey 数字的反例。
  • methods: 本 paper 使用了图vectorization和深度神经网络(DNN)基于的优化和搜索算法,以评估图是否为反例。
  • results: 本论文提出了一个搜索框架,支持结合其他启发式方法进行 Ramsey 反例探索。
    Abstract The Ramsey number is the minimum number of nodes, $n = R(s, t)$, such that all undirected simple graphs of order $n$, contain a clique of order $s$, or an independent set of order $t$. This paper explores the application of a best first search algorithm and reinforcement learning (RL) techniques to find counterexamples to specific Ramsey numbers. We incrementally improve over prior search methods such as random search by introducing a graph vectorization and deep neural network (DNN)-based heuristic, which gauge the likelihood of a graph being a counterexample. The paper also proposes algorithmic optimizations to confine a polynomial search runtime. This paper does not aim to present new counterexamples but rather introduces and evaluates a framework supporting Ramsey counterexample exploration using other heuristics. Code and methods are made available through a PyPI package and GitHub repository.
    摘要 拉姆齐数是最小的节点数 $n = R(s, t)$,使得所有阶为 $n$ 的无向简单图都必然包含一个阶为 $s$ 的团(clique)或一个阶为 $t$ 的独立集。这篇论文探索使用最佳优先搜索算法和强化学习(RL)技术来寻找特定拉姆齐数的反例。我们通过引入图向量化和基于深度神经网络(DNN)的启发式(用于评估一个图成为反例的可能性),逐步改进随机搜索等已有搜索方法。论文还提出了算法优化,将搜索运行时间限制在多项式范围内。这篇论文并非旨在给出新的反例,而是介绍并评估一个支持结合其他启发式进行拉姆齐反例探索的框架。代码和方法已通过 PyPI 包和 GitHub 存储库提供。
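
The core predicate the search needs, "is this graph a counterexample witnessing R(s, t) > n?", is easy to state in code; the brute-force Python sketch below checks it directly and is only practical for small graphs (the framework described above would instead score candidate graphs with a learned heuristic).

```python
from itertools import combinations

def is_ramsey_counterexample(n, edges, s, t):
    """A graph on n nodes witnesses R(s, t) > n iff it contains no clique of size s
    and no independent set of size t. Brute force; exponential in n."""
    adj = [[False] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = True

    def all_pairs(nodes, connected):
        return all(adj[u][v] == connected for u, v in combinations(nodes, 2))

    has_clique = any(all_pairs(c, True) for c in combinations(range(n), s))
    has_indep = any(all_pairs(c, False) for c in combinations(range(n), t))
    return not has_clique and not has_indep

# The 5-cycle has no triangle and no independent set of size 3, so R(3, 3) > 5.
print(is_ramsey_counterexample(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)], 3, 3))
```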

Retail Demand Forecasting: A Comparative Study for Multivariate Time Series

  • paper_url: http://arxiv.org/abs/2308.11939
  • repo_url: None
  • paper_authors: Md Sabbirul Haque, Md Shahedul Amin, Jonayet Miah
  • for: 预测零售需求的精度是零售业的金融性和供应链效率的关键因素。在全球市场变得越来越连接起来,企业们正在寻找更高级别的预测模型,以获得竞争优势。
  • methods: 本研究用宏观经济变量(如消费者物价指数 CPI、消费者信心指数 ICS 和失业率)对顾客需求时间序列数据进行扩充,并采用多种回归和机器学习模型来准确预测零售需求。
  • results: 基于扩充后的综合数据集,本研究开发并比较了多种回归和机器学习模型在零售需求预测上的表现。
    Abstract Accurate demand forecasting in the retail industry is a critical determinant of financial performance and supply chain efficiency. As global markets become increasingly interconnected, businesses are turning towards advanced prediction models to gain a competitive edge. However, existing literature mostly focuses on historical sales data and ignores the vital influence of macroeconomic conditions on consumer spending behavior. In this study, we bridge this gap by enriching time series data of customer demand with macroeconomic variables, such as the Consumer Price Index (CPI), Index of Consumer Sentiment (ICS), and unemployment rates. Leveraging this comprehensive dataset, we develop and compare various regression and machine learning models to predict retail demand accurately.
    摘要 精准的预测是商业领域中的一个关键因素,对于财务性能和供应链效率都是决定性的。随着全球市场变得越来越联系,企业们正在转向更进步的预测模型,以获得竞争优势。然而,现有的文献主要集中在历史销售数据上,忽略了消费者支出行为中的重要影响因素。在这项研究中,我们将顾客需求时间序列数据丰富化,加入宏观经济变量,例如消费者物价指数 (CPI)、消费者信心指数 (ICS) 和失业率。利用这个完整的数据集,我们开发和比较不同的回归和机器学习模型,以精准预测零售需求。
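
As a minimal illustration of "enriching the demand series with macro variables and fitting a regressor" (not the paper's dataset or model suite), the sketch below builds a toy frame with made-up column names and fits a plain linear regression with a chronological train/test split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy frame standing in for the enriched series: lagged demand plus macro indicators.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "demand_lag1": rng.uniform(80, 120, 200),
    "cpi": rng.uniform(250, 300, 200),
    "consumer_sentiment": rng.uniform(50, 100, 200),
    "unemployment": rng.uniform(3, 8, 200),
})
df["demand"] = 0.7 * df["demand_lag1"] - 0.3 * df["unemployment"] + rng.normal(0, 2, 200)

# Chronological split (no shuffling) so the evaluation respects the time ordering.
train, test = df.iloc[:160], df.iloc[160:]
features = ["demand_lag1", "cpi", "consumer_sentiment", "unemployment"]
model = LinearRegression().fit(train[features], train["demand"])
print("MAE:", mean_absolute_error(test["demand"], model.predict(test[features])))
```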

Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification

  • paper_url: http://arxiv.org/abs/2308.11937
  • repo_url: https://github.com/event-ahu/efv_event_classification
  • paper_authors: Chengguo Yuan, Yu Jin, Zongzhen Wu, Fanting Wei, Yangzirui Wang, Lan Chen, Xiao Wang
  • for: 本文提出了一种新的双流框架,用于事件表示、提取和融合,以解决现有方法的缺点,包括单一模式表达和网络结构设计。
  • methods: 本文使用了Transformer和结构化图 neural network(GNN)架构,同时学习事件图像和事件立方体信息。在这个框架中,用瓶颈Transformer来实现双流信息融合。
  • results: 经过广泛的实验表明,我们的提议的框架可以在两个常用的事件基本分类数据集上达到最新的性能水平。代码可以在:\url{https://github.com/Event-AHU/EFV_event_classification} 中找到。
    Abstract Recognizing target objects using an event-based camera draws more and more attention in recent years. Existing works usually represent the event streams into point-cloud, voxel, image, etc, and learn the feature representations using various deep neural networks. Their final results may be limited by the following factors: monotonous modal expressions and the design of the network structure. To address the aforementioned challenges, this paper proposes a novel dual-stream framework for event representation, extraction, and fusion. This framework simultaneously models two common representations: event images and event voxels. By utilizing Transformer and Structured Graph Neural Network (GNN) architectures, spatial information and three-dimensional stereo information can be learned separately. Additionally, a bottleneck Transformer is introduced to facilitate the fusion of the dual-stream information. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets. The source code of this work is available at: \url{https://github.com/Event-AHU/EFV_event_classification}

Diverse Policies Converge in Reward-free Markov Decision Processe

  • paper_url: http://arxiv.org/abs/2308.11924
  • repo_url: https://github.com/openrl-lab/diversepolicies
  • paper_authors: Fanqi Lin, Shiyu Huang, Weiwei Tu
  • for: 本文旨在提供一个统一的多种策略学习框架,并调查多种策略学习算法的训练是如何 converges 和效率如何。
  • methods: 本文提出了一种可证明高效的多种策略学习算法,并通过数学实验证明了其效果。
  • results: 经过数学实验,本文发现了多种策略学习算法的训练可以高效地 converge 到优化策略,并且可以提高策略的多样性和鲁棒性。
    Abstract Reinforcement learning has achieved great success in many decision-making tasks, and traditional reinforcement learning algorithms are mainly designed for obtaining a single optimal solution. However, recent works show the importance of developing diverse policies, which makes it an emerging research topic. Despite the variety of diversity reinforcement learning algorithms that have emerged, none of them theoretically answer the question of how the algorithm converges and how efficient the algorithm is. In this paper, we provide a unified diversity reinforcement learning framework and investigate the convergence of training diverse policies. Under such a framework, we also propose a provably efficient diversity reinforcement learning algorithm. Finally, we verify the effectiveness of our method through numerical experiments.
    摘要 “强化学习在很多决策任务中取得了很大成功,但传统的强化学习算法主要是为了获得单一的优化解决方案。然而,latest works表明了多种策略的重要性,使得这成为一个emerging研究话题。虽然多种多样性强化学习算法已经出现,但没有任何一个能回答强化学习算法如何 converges和效率如何。在这篇论文中,我们提出了一个统一的多样性强化学习框架,并investigate了训练多种策略的聚合。根据这种框架,我们还提出了可证明有效的多样性强化学习算法。最后,我们通过数值实验验证了我们的方法的有效性。”

Concept Bottleneck with Visual Concept Filtering for Explainable Medical Image Classification

  • paper_url: http://arxiv.org/abs/2308.11920
  • repo_url: None
  • paper_authors: Injae Kim, Jongha Kim, Joonmyung Choi, Hyunwoo J. Kim
  • for: 提高医疗应用中模型可靠性的一个关键因素是可读性。概念瓶颈模型(CBM)可以使用人类理解的概念作为中间目标进行可读性图像分类。
  • methods: 在使用大型自然语言模型(LLM)生成概念的现有方法中,不考虑概念是否具有视觉特征,这是计算有意义的概念分数的重要因素。因此,我们提议使用视觉活动分数来衡量概念是否含有视觉cue,可以使用无标注图像数据来计算。
  • results: 我们的实验结果表明,采用我们提议的视觉活动分数来筛选概念可以consistently提高性能,相比基线。此外,qualitative analyses还证明了视觉相关概念被选择。
    Abstract Interpretability is a crucial factor in building reliable models for various medical applications. Concept Bottleneck Models (CBMs) enable interpretable image classification by utilizing human-understandable concepts as intermediate targets. Unlike conventional methods that require extensive human labor to construct the concept set, recent works leveraging Large Language Models (LLMs) for generating concepts made automatic concept generation possible. However, those methods do not consider whether a concept is visually relevant or not, which is an important factor in computing meaningful concept scores. Therefore, we propose a visual activation score that measures whether the concept contains visual cues or not, which can be easily computed with unlabeled image data. Computed visual activation scores are then used to filter out the less visible concepts, thus resulting in a final concept set with visually meaningful concepts. Our experimental results show that adopting the proposed visual activation score for concept filtering consistently boosts performance compared to the baseline. Moreover, qualitative analyses also validate that visually relevant concepts are successfully selected with the visual activation score.
    摘要 “可读性”是医疗应用中建立可靠模型的重要因素。概念瓶颈模型(CBM)可以实现可读性检查,通过使用人类可理解的概念作为中间目标。与传统方法不同的是,这些方法不需要大量的人工劳动来建立概念集。最近的工作则是利用大型自然语言模型(LLM)生成概念,并使用这些概念来生成可读性检查。但是,这些方法并不考虑概念是否具有视觉相关性,这是 Computing meaningful concept scores 中的重要因素。因此,我们提出了视觉活动 scores,它可以衡量概念是否包含视觉提示,并且可以轻松地使用无标注图像资料来计算。我们的实验结果显示,运用我们提出的视觉活动 scores 进行概念筛选可以与基准相比,实现更高的性能。此外,实验分析也显示,这些视觉相关的概念被成功选择。

LFS-GAN: Lifelong Few-Shot Image Generation

  • paper_url: http://arxiv.org/abs/2308.11917
  • repo_url: https://github.com/jjuon/lfs-gan
  • paper_authors: Juwon Seo, Ji-Su Kang, Gyeong-Moon Park
  • for: The paper addresses the challenging task of lifelong few-shot image generation, where a generative model learns a sequence of tasks using only a few samples per task and must avoid both catastrophic forgetting and overfitting.
  • methods: The proposed framework, Lifelong Few-Shot GAN (LFS-GAN), learns each task with an efficient task-specific modulator called Learnable Factorized Tensor (LeFT), and uses a novel mode seeking loss to improve diversity in low-data circumstances.
  • results: LFS-GAN can generate high-quality and diverse images in various domains without forgetting or mode collapse, achieving state-of-the-art results in the lifelong few-shot image generation task and even outperforming existing few-shot GANs in the few-shot image generation task.
    Abstract We address a challenging lifelong few-shot image generation task for the first time. In this situation, a generative model learns a sequence of tasks using only a few samples per task. Consequently, the learned model encounters both catastrophic forgetting and overfitting problems at a time. Existing studies on lifelong GANs have proposed modulation-based methods to prevent catastrophic forgetting. However, they require considerable additional parameters and cannot generate high-fidelity and diverse images from limited data. On the other hand, the existing few-shot GANs suffer from severe catastrophic forgetting when learning multiple tasks. To alleviate these issues, we propose a framework called Lifelong Few-Shot GAN (LFS-GAN) that can generate high-quality and diverse images in lifelong few-shot image generation task. Our proposed framework learns each task using an efficient task-specific modulator - Learnable Factorized Tensor (LeFT). LeFT is rank-constrained and has a rich representation ability due to its unique reconstruction technique. Furthermore, we propose a novel mode seeking loss to improve the diversity of our model in low-data circumstances. Extensive experiments demonstrate that the proposed LFS-GAN can generate high-fidelity and diverse images without any forgetting and mode collapse in various domains, achieving state-of-the-art in lifelong few-shot image generation task. Surprisingly, we find that our LFS-GAN even outperforms the existing few-shot GANs in the few-shot image generation task. The code is available at Github.

Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs

  • paper_url: http://arxiv.org/abs/2308.11914
  • repo_url: None
  • paper_authors: Ziyi Tang, Ruilin Wang, Weixing Chen, Keze Wang, Yang Liu, Tianshui Chen, Liang Lin
  • for: 提高知识基于reasoning的 faithfulness和 causality
  • methods: 多智能体协作 reasoning-and-consensus 框架
  • results: 在多种知识reasoning任务(如科学问答和常识reasoning)中,我们的框架比所有比较方法都高得多
    Abstract Despite advancements in LLMs, knowledge-based reasoning remains a longstanding issue due to the fragility of knowledge recall and inference. Existing methods primarily encourage LLMs to autonomously plan and solve problems or to extensively sample reasoning chains without addressing the conceptual and inferential fallacies. Attempting to alleviate inferential fallacies and drawing inspiration from multi-agent collaboration, we present a framework to increase faithfulness and causality for knowledge-based reasoning. Specifically, we propose to employ multiple intelligent agents (i.e., reasoner and causal evaluator) to work collaboratively in a reasoning-and-consensus paradigm for elevated reasoning faithfulness. The reasoners focus on providing solutions with human-like causality to solve open-domain problems. On the other hand, the causal evaluator agent scrutinizes if the answer in a solution is causally deducible from the question and vice versa, with a counterfactual answer replacing the original. According to the extensive and comprehensive evaluations on a variety of knowledge reasoning tasks (e.g., science question answering and commonsense reasoning), our framework outperforms all compared state-of-the-art approaches by large margins.
    摘要 尽管LLM技术取得了进步,但由于知识回忆和推理的脆弱性,基于知识的推理仍然是一个长期存在的问题。现有方法主要是让LLM自主规划和解决问题,或者大量采样推理链,而未能解决概念和推理上的谬误。为缓解推理谬误,并借鉴多智能体协作的思想,我们提出了一个提升基于知识的推理的忠实性与因果性的框架。具体来说,我们提议让多个智能体(即推理者和因果评估器)在一种推理-共识范式中协作,以提升推理的忠实性。推理者专注于以类人的因果性提供解决开放领域问题的方案;因果评估器则检查方案中的答案是否能从问题中因果推出(反之亦然),并用反事实答案替换原答案进行检验。在多种知识推理任务(如科学问答和常识推理)上的广泛而全面的评估表明,我们的框架大幅超越了所有对比的最先进方法。

Utilizing Admissible Bounds for Heuristic Learning

  • paper_url: http://arxiv.org/abs/2308.11905
  • repo_url: None
  • paper_authors: Carlos Núñez-Molina, Masataro Asai
  • for: 本研究的目的是提高前向搜索算法中的modern机器学习技术应用,并为这种应用提供更好的理论基础。
  • methods: 本研究使用的方法包括使用Truncated Gaussian distribution作为参数,以及在训练过程中考虑扩展的admissible heuristics。
  • results: 研究发现,使用admissible heuristics作为参数,Truncated Gaussian distribution可以更好地适应实际问题,并且在训练过程中更快 converges。
    Abstract While learning a heuristic function for forward search algorithms with modern machine learning techniques has been gaining interest in recent years, there has been little theoretical understanding of \emph{what} they should learn, \emph{how} to train them, and \emph{why} we do so. This lack of understanding leads to various literature performing an ad-hoc selection of datasets (suboptimal vs optimal costs or admissible vs inadmissible heuristics) and optimization metrics (e.g., squared vs absolute errors). Moreover, due to the lack of admissibility of the resulting trained heuristics, little focus has been put on the role of admissibility \emph{during} learning. This paper articulates the role of admissible heuristics in supervised heuristic learning using them as parameters of Truncated Gaussian distributions, which tightens the hypothesis space compared to ordinary Gaussian distributions. We argue that this mathematical model faithfully follows the principle of maximum entropy and empirically show that, as a result, it yields more accurate heuristics and converges faster during training.
    摘要 Recently, there has been growing interest in using modern machine learning techniques to learn heuristic functions for forward search algorithms. However, there has been little theoretical understanding of what these functions should learn, how to train them, and why we do so. This lack of understanding has led various works to select datasets and optimization metrics on an ad-hoc basis, and little attention has been paid to the role of admissibility during learning. This paper focuses on the role of admissible heuristics in supervised heuristic learning, using them as parameters of Truncated Gaussian distributions. This approach tightens the hypothesis space compared to ordinary Gaussian distributions and faithfully follows the principle of maximum entropy. Empirical results show that this approach yields more accurate heuristics and converges faster during training.
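A minimal numerical sketch of the loss this entry describes: the learned model predicts a Gaussian (mu, sigma) for the optimal cost, and the Gaussian is truncated from below at an admissible heuristic value, so no probability mass is placed below the bound. The variable names and toy numbers are assumptions for illustration, not the paper's exact training setup.

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_gaussian_nll(h_star, mu, sigma, lower):
    """Negative log-likelihood of the optimal cost h* under a Gaussian
    truncated from below at an admissible heuristic value `lower`."""
    a = (lower - mu) / sigma          # standardized lower truncation point
    b = np.inf                        # no upper truncation
    return -truncnorm.logpdf(h_star, a, b, loc=mu, scale=sigma)

# Toy example: a model predicts mu=12, sigma=3 for a state whose admissible
# heuristic (e.g. an abstraction heuristic) guarantees the true cost >= 10.
print(truncated_gaussian_nll(h_star=14.0, mu=12.0, sigma=3.0, lower=10.0))
```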

Exploring the Optimization Objective of One-Class Classification for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11898
  • repo_url: None
  • paper_authors: Han Gao, Huiyuan Luo, Fei Shen, Zhengtao Zhang
  • for: This paper focuses on the optimization objective space within one-class classification (OCC) methods and its impact on performance.
  • methods: The paper proposes a novel, data-agnostic deep one-class classification method that utilizes a single 1x1 convolutional layer as a trainable projector and any space with a suitable norm as the optimization objective.
  • results: The proposed method achieves state-of-the-art performance in both one-class classification and industrial vision anomaly detection and segmentation tasks, validating the effectiveness of the proposed approach.
    Abstract One-class classification (OCC) is a longstanding method for anomaly detection. With the powerful representation capability of the pre-trained backbone, OCC methods have witnessed significant performance improvements. Typically, most of these OCC methods employ transfer learning to enhance the discriminative nature of the pre-trained backbone's features, thus achieving remarkable efficacy. While most current approaches emphasize feature transfer strategies, we argue that the optimization objective space within OCC methods could also be an underlying critical factor influencing performance. In this work, we conducted a thorough investigation into the optimization objective of OCC. Through rigorous theoretical analysis and derivation, we unveil a key insights: any space with the suitable norm can serve as an equivalent substitute for the hypersphere center, without relying on the distribution assumption of training samples. Further, we provide guidelines for determining the feasible domain of norms for the OCC optimization objective. This novel insight sparks a simple and data-agnostic deep one-class classification method. Our method is straightforward, with a single 1x1 convolutional layer as a trainable projector and any space with suitable norm as the optimization objective. Extensive experiments validate the reliability and efficacy of our findings and the corresponding methodology, resulting in state-of-the-art performance in both one-class classification and industrial vision anomaly detection and segmentation tasks.
    摘要 一类分类(OCC)是一种长期使用的异常检测方法。随着预训练骨干网络的特征表示能力的提高,OCC方法已经经历了显著的性能提升。通常,大多数这些OCC方法使用迁移学习来强化预训练骨干网络的特征,从而实现了很好的效果。而我们认为,OCC方法的优化目标空间也是影响性能的关键因素。在这项工作中,我们对OCC的优化目标进行了全面的调查。通过严格的理论分析和推导,我们揭示出一个关键发现:任何具有适当范数的空间都可以作为超球中心的等价替代,不需要基于训练样本的分布假设。此外,我们还提供了确定OCC优化目标可行范数范围的指南。这一新发现引出了一种简单、数据无关的深度一类分类方法。我们的方法包括一个单一的1x1卷积层作为可训练的投影器,以及任何具有适当范数的空间作为优化目标。我们的实验验证了这一发现及相应方法的可靠性和效果,在一类分类和工业视觉异常检测与分割任务中实现了最先进的性能。
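A minimal sketch of the kind of objective this entry describes: a single trainable 1x1 convolution projects frozen backbone features, and training simply minimizes a norm of the projection on normal samples; the same norm then serves as the anomaly score. Channel sizes and the L2 choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class OneClassHead(nn.Module):
    """Sketch of a data-agnostic one-class head: a 1x1 conv projects frozen
    backbone features, and normal samples are pushed toward the origin of the
    chosen normed space during training."""
    def __init__(self, in_ch=512, out_ch=128):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, feats):                 # feats: (B, C, H, W)
        z = self.proj(feats)
        # Anomaly score: per-pixel L2 norm, averaged over the feature map.
        return z.norm(p=2, dim=1).mean(dim=(1, 2))

head = OneClassHead()
feats = torch.randn(8, 512, 14, 14)           # stand-in for backbone features
loss = head(feats).mean()                     # minimize the norm on normal data
loss.backward()
```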

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

  • paper_url: http://arxiv.org/abs/2308.11891
  • repo_url: None
  • paper_authors: Hengyuan Zhang, Peng Chang, Zongcheng Ji
  • for: 本研究旨在探讨大语言模型如何用于表格问答 tasks,以提高表格结构和内容的理解。
  • methods: 我们提出了一种特有的模块,用于将表格 serialized 到可以与大语言模型集成的格式。此外,我们还实施了一种纠正机制,以检查和修正模型的可能错误。
  • results: 我们的提议方法在总体指标中落后 SOTA 约 11.7%,但在特定数据集上测试时,超过 SOTA 约 1.2%。这些结果表明我们的方法可以增强大语言模型对表格结构和内容的理解。
    Abstract In the realm of natural language processing, the understanding of tabular data has perpetually stood as a focal point of scholarly inquiry. The emergence of expansive language models, exemplified by the likes of ChatGPT, has ushered in a wave of endeavors wherein researchers aim to harness these models for tasks related to table-based question answering. Central to our investigative pursuits is the elucidation of methodologies that amplify the aptitude of such large language models in discerning both the structural intricacies and inherent content of tables, ultimately facilitating their capacity to provide informed responses to pertinent queries. To this end, we have architected a distinctive module dedicated to the serialization of tables for seamless integration with expansive language models. Additionally, we've instituted a corrective mechanism within the model to rectify potential inaccuracies. Experimental results indicate that, although our proposed method trails the SOTA by approximately 11.7% in overall metrics, it surpasses the SOTA by about 1.2% in tests on specific datasets. This research marks the first application of large language models to table-based question answering tasks, enhancing the model's comprehension of both table structures and content.
    摘要 在自然语言处理领域中,表格数据的理解一直是学术研究的焦点。如今,以ChatGPT为代表的大型语言模型的出现,使得研究人员尝试利用这些模型来解决基于表格的问答任务。我们研究的核心在于探索能够增强大型语言模型对表格结构和内容理解的方法,使其能够对相关问题给出有依据的回答。为此,我们设计了一个专门用于将表格序列化、以便与大型语言模型集成的模块,并在模型中实施了纠正机制以修正可能的错误。实验结果表明,虽然我们提出的方法在总体指标上落后最先进方法(SOTA)约11.7%,但在特定数据集上的测试中超过了SOTA约1.2%。这项研究是大型语言模型在基于表格的问答任务上的首次应用,提升了模型对表格结构和内容的理解。
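To make the idea of a table-serialization module concrete, here is a generic sketch of flattening a table into natural-language statements that can be placed in an LLM prompt. The serialization format, example data, and prompt wording are illustrative assumptions, not the paper's specific module.

```python
def serialize_table(headers, rows, caption=None):
    """Toy table-to-text serializer: flattens a table into row-wise
    'column is value' statements suitable for inclusion in an LLM prompt."""
    lines = []
    if caption:
        lines.append(f"Table: {caption}")
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(f"{h} is {v}" for h, v in zip(headers, row))
        lines.append(f"Row {i}: {cells}.")
    return "\n".join(lines)

prompt_context = serialize_table(
    headers=["City", "Population"],
    rows=[["Lyon", "522,000"], ["Nice", "342,000"]],
    caption="French city populations",
)
question = "Which city has the larger population?"
prompt = f"{prompt_context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```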

Integrating the Wikidata Taxonomy into YAGO

  • paper_url: http://arxiv.org/abs/2308.11884
  • repo_url: https://github.com/yago-naga/yago-4.5
  • paper_authors: Fabian Suchanek, Mehwish Alam, Thomas Bonald, Pierre-Henri Paris, Jules Soria
  • for: The paper aims to merge the entire Wikidata taxonomy into the YAGO KB as much as possible, while maintaining logical consistency.
  • methods: The authors combine Wikidata with the ontology from Schema.org to reduce and clean up the taxonomy, and use automated reasoners to ensure logical consistency.
  • results: The authors create YAGO 4.5, which adds a rich layer of informative classes to YAGO while keeping the KB logically consistent.
    Abstract Wikidata is one of the largest public general-purpose Knowledge Bases (KBs). Yet, due to its collaborative nature, its schema and taxonomy have become convoluted. For the YAGO 4 KB, we combined Wikidata with the ontology from Schema.org, which reduced and cleaned up the taxonomy and constraints and made it possible to run automated reasoners on the data. However, it also cut away large parts of the Wikidata taxonomy. In this paper, we present our effort to merge the entire Wikidata taxonomy into the YAGO KB as much as possible. We pay particular attention to logical constraints and a careful distinction of classes and instances. Our work creates YAGO 4.5, which adds a rich layer of informative classes to YAGO, while at the same time keeping the KB logically consistent.
    摘要 Wikidata是最大的公共通用知识库(KB)之一。然而由于其协作性,其模式和分类体系已经变得混乱。为了构建YAGO 4知识库,我们将Wikidata与Schema.org的本体结合起来,从而精简并清理了分类体系和约束,使得可以在数据上运行自动推理器。但是,这也剪去了Wikidata分类体系的大量内容。在这篇论文中,我们报告了将Wikidata分类体系尽可能完整地合并到YAGO知识库中的工作。我们特别注重逻辑约束,并仔细区分类与实例。我们的工作产出了YAGO 4.5,它为YAGO添加了一层丰富而有用的类,同时保持知识库的逻辑一致。

Cabrita: closing the gap for foreign languages

  • paper_url: http://arxiv.org/abs/2308.11878
  • repo_url: None
  • paper_authors: Celio Larcher, Marcos Piau, Paulo Finardi, Pedro Gengo, Piero Esposito, Vinicius Caridá
  • for: 提高特定语言或领域上表现,以及有效地进行分词。
  • methods: 使用从零开始训练模型,并开发了一种名为Cabrita的方法论。
  • results: 在评估少量学习任务中,与传统连续预训练方法和7B英语预训练模型的结果相似,并且减少了分词的数量。
    Abstract The strategy of training the model from scratch in a specific language or domain serves two essential purposes: i) enhancing performance in the particular linguistic or domain context, and ii) ensuring effective tokenization. The main limitation inherent to this approach lies in the associated cost, which can reach six to seven-digit dollar values, depending on the model size and the number of parameters involved. The main solution to overcome the cost challenge is to rely on available pre-trained models, which, despite recent advancements such as the LLaMA and LLaMA-2 models, still demonstrate inefficiency for certain specific domain problems or prove ineffective in scenarios involving conversational memory resources, given the large number of tokens required to represent text. To overcome this issue, we present a methodology named Cabrita, which, as our research demonstrates, successfully addresses the performance and efficient tokenization problem, all at an affordable cost. We believe that this methodology can be applied to any transformer-like architecture model. To validate the study, we conducted continuous pre-training exclusively using Portuguese text on a 3-billion-parameter model known as OpenLLaMA, resulting in a model named openCabrita 3B. The openCabrita 3B also features a new tokenizer that results in a significant reduction in the number of tokens required to represent the text. In our assessment, for few-shot learning tasks, we achieved similar results with this 3B model compared to a traditional continuous pre-training approach as well as to 7B models English pre-trained models.
    摘要 从零开始在特定语言或领域上训练模型服务于两个重要目的:一是提高在特定语言或领域上的表现,二是确保有效的分词。但这种方法存在一个主要的限制,即成本,可以达到六到七位数的美元金额,具体取决于模型大小和参数数量。为了缓解这个问题,可以依靠可用的预训练模型,但尽管有LLaMA和LLaMA-2等最新进展,这些模型在某些特定领域问题上仍然效率不足,或者在涉及对话记忆资源的场景中效果不佳,因为需要大量的词元(token)来表示文本。为了解决这个问题,我们提出了一种名为Cabrita的方法论,我们的研究表明,该方法能够以可负担的成本成功地解决表现和有效分词的问题。我们认为该方法可以应用于任何 transformer-like 架构模型。为了验证这种方法,我们专门使用葡萄牙语文本,在一个30亿参数的模型OpenLLaMA上进行了持续预训练,从而获得了一个名为openCabrita 3B的模型。openCabrita 3B还使用了一种新的分词器,显著减少了表示文本所需的词元数量。在我们的评估中,对于少样本(few-shot)学习任务,这个3B模型取得了与传统持续预训练方法以及7B英语预训练模型相近的结果。

Integrated Image and Location Analysis for Wound Classification: A Deep Learning Approach

  • paper_url: http://arxiv.org/abs/2308.11877
  • repo_url: None
  • paper_authors: Yash Patel, Tirth Shah, Mrinal Kanti Dhar, Taiyu Zhang, Jeffrey Niezgoda, Sandeep Gopalakrishnan, Zeyun Yu
  • for: 提高伤口分类精度,以便更好地诊断和治疗伤口。
  • methods: 基于深度卷积神经网络的多Modal网络,使用伤口图像和其相应的体部位置进行更加精确的分类。
  • results: 比传统方法高,在不含位置信息的ROI(感兴趣区域)分类上达到74.79%到100%的精度,在含位置信息的ROI分类上达到73.98%到100%的精度,在全图分类上达到78.10%到100%的精度。
    Abstract The global burden of acute and chronic wounds presents a compelling case for enhancing wound classification methods, a vital step in diagnosing and determining optimal treatments. Recognizing this need, we introduce an innovative multi-modal network based on a deep convolutional neural network for categorizing wounds into four categories: diabetic, pressure, surgical, and venous ulcers. Our multi-modal network uses wound images and their corresponding body locations for more precise classification. A unique aspect of our methodology is incorporating a body map system that facilitates accurate wound location tagging, improving upon traditional wound image classification techniques. A distinctive feature of our approach is the integration of models such as VGG16, ResNet152, and EfficientNet within a novel architecture. This architecture includes elements like spatial and channel-wise Squeeze-and-Excitation modules, Axial Attention, and an Adaptive Gated Multi-Layer Perceptron, providing a robust foundation for classification. Our multi-modal network was trained and evaluated on two distinct datasets comprising relevant images and corresponding location information. Notably, our proposed network outperformed traditional methods, reaching an accuracy range of 74.79% to 100% for Region of Interest (ROI) without location classifications, 73.98% to 100% for ROI with location classifications, and 78.10% to 100% for whole image classifications. This marks a significant enhancement over previously reported performance metrics in the literature. Our results indicate the potential of our multi-modal network as an effective decision-support tool for wound image classification, paving the way for its application in various clinical contexts.
    摘要 全球急慢性伤口带来的负担,提出了加强伤口分类方法的需求,这是诊断和确定最佳治疗方案的重要一步。为此,我们介绍了一种基于深度卷积神经网络的创新多模态网络,用于将伤口分为四类:糖尿病性、压力性、手术性和静脉性溃疡。我们的多模态网络使用伤口图像及其对应的身体位置信息进行更加精确的分类。我们方法的一个独特之处是引入身体地图系统,实现了更加准确的伤口位置标记,改进了传统的伤口图像分类技术。我们的方法还在一种新的架构中整合了VGG16、ResNet152和EfficientNet等模型。这种架构包括空间和通道方向的Squeeze-and-Excitation模块、轴向注意力和自适应门控多层感知机,为分类提供了坚实的基础。我们的多模态网络在两个不同的数据集上进行训练和评估,这些数据集包含相关的图像及其对应的位置信息。我们的方法在不含位置信息的ROI(感兴趣区域)分类上达到了74.79%到100%的精度,在含位置信息的ROI分类上达到了73.98%到100%的精度,在整图分类上达到了78.10%到100%的精度,相比文献中已报道的性能指标有显著提升。我们的结果表明,该多模态网络可以作为伤口图像分类的有效决策支持工具,为其在各种临床场景中的应用铺平道路。
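A minimal sketch of a two-branch design in the spirit of this entry: an off-the-shelf CNN encodes the wound image, an embedding encodes a discrete body-map location, and the concatenated features feed a four-class head. The backbone choice (ResNet-18), embedding size, and number of body locations are assumptions; the paper's architecture additionally uses Squeeze-and-Excitation, Axial Attention, and a gated MLP, which are omitted here.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageLocationClassifier(nn.Module):
    """Sketch of image + body-location fusion for wound classification."""
    def __init__(self, n_locations=30, n_classes=4):
        super().__init__()
        self.cnn = models.resnet18(weights=None)     # stand-in backbone
        feat_dim = self.cnn.fc.in_features
        self.cnn.fc = nn.Identity()                  # keep pooled features
        self.loc_emb = nn.Embedding(n_locations, 32) # body-map location id
        self.head = nn.Linear(feat_dim + 32, n_classes)

    def forward(self, image, location_id):
        f_img = self.cnn(image)                      # (B, feat_dim)
        f_loc = self.loc_emb(location_id)            # (B, 32)
        return self.head(torch.cat([f_img, f_loc], dim=1))

model = ImageLocationClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.tensor([3, 17]))
print(logits.shape)  # torch.Size([2, 4])
```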

Finding the Perfect Fit: Applying Regression Models to ClimateBench v1.0

  • paper_url: http://arxiv.org/abs/2308.11854
  • repo_url: None
  • paper_authors: Anmol Chaure, Ashok Kumar Behera, Sudip Bhattacharya
  • for: 本研究使用数据驱动机器学习模型作为气候模拟器,以便政策制定者能够做出有知识基础的决策。
  • methods: 本研究使用机器学习模型作为计算昂贵的GCM模拟器的代理,从而降低时间和碳脚印。特别是,使用核函数特性,回归模型可以捕捉复杂关系,提高预测能力。
  • results: 在使用 ClimateBench 数据集进行评估时,我们发现,在三种非线性回归模型中,Gaussian Process Regressor 表现最佳,在标准评估指标上对气候场的模拟表现出色。然而,Gaussian Process Regression 在空间和时间复杂度上代价较高。相比之下,Support Vector 和 Kernel Ridge 模型也能取得有竞争力的结果,但存在一定的权衡。此外,我们正在积极研究 composite kernels 和变分推断等技术,以进一步提高回归模型的性能,更好地建模包括降水在内的复杂非线性模式。
    Abstract Climate projections using data driven machine learning models acting as emulators, is one of the prevailing areas of research to enable policy makers make informed decisions. Use of machine learning emulators as surrogates for computationally heavy GCM simulators reduces time and carbon footprints. In this direction, ClimateBench [1] is a recently curated benchmarking dataset for evaluating the performance of machine learning emulators designed for climate data. Recent studies have reported that despite being considered fundamental, regression models offer several advantages pertaining to climate emulations. In particular, by leveraging the kernel trick, regression models can capture complex relationships and improve their predictive capabilities. This study focuses on evaluating non-linear regression models using the aforementioned dataset. Specifically, we compare the emulation capabilities of three non-linear regression models. Among them, Gaussian Process Regressor demonstrates the best-in-class performance against standard evaluation metrics used for climate field emulation studies. However, Gaussian Process Regression suffers from being computational resource hungry in terms of space and time complexity. Alternatively, Support Vector and Kernel Ridge models also deliver competitive results and but there are certain trade-offs to be addressed. Additionally, we are actively investigating the performance of composite kernels and techniques such as variational inference to further enhance the performance of the regression models and effectively model complex non-linear patterns, including phenomena like precipitation.
    摘要 政策制定者可以通过使用数据驱动机器学习模型作为模拟器,来做出有依据的决策。通过使用机器学习模型作为计算成本高的GCM模拟器的代理,可以降低时间和碳足迹。在这个方向下,ClimateBench [1] 是最近整理的气候模拟基准数据集,用于评估机器学习模拟器的性能。据研究,尽管被视为基本方法,回归模型在气候模拟方面具有许多优势。具体来说,通过核技巧(kernel trick),回归模型可以捕捉复杂的关系,提高预测能力。本研究对非线性回归模型进行了评估与比较。Specifically, we compare the emulation capabilities of three non-linear regression models. Among them, Gaussian Process Regressor demonstrates the best-in-class performance against standard evaluation metrics used for climate field emulation studies. However, Gaussian Process Regression suffers from being computationally resource-intensive in terms of space and time complexity. Alternatively, Support Vector and Kernel Ridge models also deliver competitive results, but there are certain trade-offs to be addressed. Additionally, we are actively investigating the performance of composite kernels and techniques such as variational inference to further enhance the performance of the regression models and effectively model complex non-linear patterns, including phenomena like precipitation.
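A small, self-contained sketch of the kind of comparison this entry describes, using scikit-learn's Gaussian process, SVR, and kernel ridge regressors. The synthetic data stands in for ClimateBench forcings and climate-field targets, and the hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                       # stand-in forcing inputs
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "GPR": GaussianProcessRegressor(normalize_y=True),
    "SVR": SVR(kernel="rbf", C=10.0),
    "KRR": KernelRidge(kernel="rbf", alpha=0.1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.3f}")
```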

A deep reinforcement learning approach for real-time demand-responsive railway rescheduling to mitigate station overcrowding using mobile data

  • paper_url: http://arxiv.org/abs/2308.11849
  • repo_url: None
  • paper_authors: Enze Liu, Zhiyuan Lin, Judith Y. T. Wang, Hong Chen
  • for: This paper aims to provide a demand-responsive approach for real-time railway rescheduling during severe emergency events such as natural disasters, with a focus on a heavy-demand station upstream of the disrupted area.
  • methods: The paper proposes using mobile data (MD) to infer real-world passenger mobility and avoid overcrowding at the target station, and a deep reinforcement learning (DRL) framework to determine the optimal rescheduled timetable, route stops, and rolling stock allocation.
  • results: The paper addresses challenges such as station overcrowding, rolling stock shortage, open-ended disruption duration, and delays due to detours, and aims to improve the efficiency and safety of real-time railway rescheduling during emergency events.
    Abstract Real-time railway rescheduling is a timely and flexible technique to automatically alter the operation schedule in response to time-varying conditions. Current research lacks data-driven approaches that capture real-time passenger mobility during railway disruptions, relying mostly on OD-based data and model-based methods for estimating demands of trains. Meanwhile, the schedule-updating principles for a long-term disruption overlook the uneven distribution of demand over time. To fill this gap, this paper proposes a demand-responsive approach by inferring real-world passenger mobility from mobile data (MD) to facilitate real-time rescheduling. Unlike network-level approaches, this paper focuses on a heavy-demand station upstream of the disrupted area. The objective is to reschedule all trains on multiple routes passing through this target station, which have been affected by a severe emergency event such as a natural disaster. Particular attention should be given to avoiding the accumulation of overcrowded passengers at this station, to prevent additional accidents arising from overcrowding. This research addresses the challenges associated with this scenario, including the dynamics of arriving and leaving of passengers, station overcrowding, rolling stock shortage, open-ended disruption duration, integrated rescheduling on multiple routes, and delays due to detours. A deep reinforcement learning (DRL) framework is proposed to determine the optimal rescheduled timetable, route stops, and rolling stock allocation, while considering real-time demand satisfaction, station overcrowding, train capacity utilization, and headway safety.
    摘要 实时铁路重调度是一种及时而灵活的技术,可以根据随时间变化的条件自动调整运营计划。现有研究缺乏能够捕捉铁路中断期间实时旅客流动的数据驱动方法,而主要依赖基于 Origin-Destination(OD)数据和模型的方法来估算列车需求。同时,针对长期中断的时刻表更新原则忽略了旅客需求在时间上的不均衡分布。为了填补这一空白,本文提出了一种需求响应的方法,通过移动数据(MD)推断真实世界的旅客流动,以支持实时重调度。与网络级方法不同,本文关注位于受中断影响区域上游的一个高需求车站。目标是重新调度经过该目标车站、受到严重紧急事件(如自然灾害)影响的多条线路上的所有列车。其中应特别注意避免乘客在该站过度聚集,以防止因过度拥挤而引发的额外事故。本研究解决了该场景中的多项挑战,包括旅客到离站的动态变化、车站过度拥挤、车底短缺、中断持续时间不确定、多线路的集成重调度以及绕行导致的延误。本文提出了一种深度强化学习(DRL)框架,用于确定最优的重调度时刻表、线路停站和车底分配,同时考虑实时需求满足、车站拥挤、列车容量利用和车头间隔安全。

${\rm E}(3)$-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.11842
  • repo_url: https://github.com/dchen48/e3ac
  • paper_authors: Dingyang Chen, Qi Zhang
  • for: 这个论文旨在利用生物世界中的对称Pattern进行多智能体强化学习(MARL)问题的研究,以提高其在多种应用中的性能。
  • methods: 该论文使用了Euclidean symmetries作为多智能体强化学习问题的一种限制,并采用了具有对称约束的神经网络架构。
  • results: 该研究发现,通过对称约束的适应,神经网络架构在多种合作MARL benchmark中表现出色,并且具有很好的泛化能力,如零shot学习和转移学习。
    Abstract Identification and analysis of symmetrical patterns in the natural world have led to significant discoveries across various scientific fields, such as the formulation of gravitational laws in physics and advancements in the study of chemical structures. In this paper, we focus on exploiting Euclidean symmetries inherent in certain cooperative multi-agent reinforcement learning (MARL) problems and prevalent in many applications. We begin by formally characterizing a subclass of Markov games with a general notion of symmetries that admits the existence of symmetric optimal values and policies. Motivated by these properties, we design neural network architectures with symmetric constraints embedded as an inductive bias for multi-agent actor-critic methods. This inductive bias results in superior performance in various cooperative MARL benchmarks and impressive generalization capabilities such as zero-shot learning and transfer learning in unseen scenarios with repeated symmetric patterns. The code is available at: https://github.com/dchen48/E3AC.
    摘要 识别和分析自然界中的对称现象导致了各科学领域的重要发现,如物理学中引力定律的提出和化学结构研究的进步。在这篇论文中,我们关注利用多智能体强化学习(MARL)问题中固有的欧几里得对称性,这些对称性在许多应用中很普遍。我们首先正式定义了一类具有一般对称性的马尔可夫博弈,这使得存在对称的最优值与策略。受这些性质启发,我们在多智能体actor-critic方法中嵌入对称约束作为归纳偏置,这种约束使模型在各种合作MARL基准中表现出色,并且在具有重复对称模式的未见场景中展现出零样本学习和迁移学习等很好的泛化能力。代码可以在以下链接中找到:https://github.com/dchen48/E3AC。

A Benchmark Study on Calibration

  • paper_url: http://arxiv.org/abs/2308.11838
  • repo_url: https://github.com/Aryia-Behroziuan/history1
  • paper_authors: Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Chang Xu
  • for: 本研究旨在探讨神经网络模型的校准(calibration)特性,包括校准与准确率、鲁棒性等因素之间的关系,以及哪些设计有助于提高模型的校准。
  • methods: 本研究使用Neural Architecture Search(NAS)搜索空间,并创建了一个模型校准数据集(Model Calibration Dataset),以系统地探讨神经网络模型的校准特性。
  • results: 研究发现,模型校准可以在不同任务之间泛化,并且可以将鲁棒性用作一种校准度量。另外,研究还分析了常用校准指标的可靠性,以及post-hoc校准方法是否对所有模型产生相同的影响。此外,研究还考察了校准与准确率之间的关系、分箱(bin)大小对校准测量的影响,最后指出了一些有利于校准的架构设计。
    Abstract Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through data preprocessing, the use of specific loss functions, and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different tasks? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS.
    摘要 深度神经网络在各类机器学习任务中日益普及,但是随着模型复杂度的增加,即使预测准确率得到了提高,它们仍经常面临校准问题。许多研究尝试通过数据预处理、特定的损失函数和训练框架来改进校准性能,然而,对校准性质本身的研究相对受到忽视。我们的研究利用Neural Architecture Search(NAS)搜索空间,提供了一个详尽的模型架构空间,以便对校准性质进行全面的探索。我们专门创建了一个模型校准数据集。这个数据集在广泛使用的NATS-Bench搜索空间中的117,702个不同神经网络上,评估了90种基于分箱的校准指标和12种附加的校准指标。我们的分析旨在利用该数据集回答该领域中一些长期存在的问题:(i)模型校准能否泛化到不同任务?(ii)能否使用鲁棒性作为校准度量?(iii)校准指标的可靠性如何?(iv)post-hoc校准方法对所有模型是否具有相同的影响?(v)校准与准确率之间如何相互作用?(vi)分箱大小如何影响校准测量?(vii)哪些架构设计对校准有利?我们的研究还填补了现有空白,首次在NAS中探索校准问题。据我们所知,这是首个大规模研究校准性质、并在NAS中研究校准问题的工作。
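For readers unfamiliar with bin-based calibration measurement, here is a standard Expected Calibration Error (ECE) computation, representative of the kind of bin-based metric such a benchmark would evaluate. The bin count and toy inputs are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-based ECE: partition predictions by confidence and average the
    |accuracy - confidence| gap per bin, weighted by the bin's share of data."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: three confident predictions, two of them correct.
print(expected_calibration_error([0.9, 0.8, 0.95], [1, 0, 1]))
```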

Characterizing normal perinatal development of the human brain structural connectivity

  • paper_url: http://arxiv.org/abs/2308.11836
  • repo_url: None
  • paper_authors: Yihan Wu, Lana Vasung, Camilo Calixto, Ali Gholipour, Davood Karimi
  • for: 这个研究的目的是为了研究围产期脑发育中Structural connectome的发展趋势。
  • methods: 这个研究使用了基于时空平均的计算框架,以确定围产期Structural connectivity的 normative baselines。
  • results: 研究发现,在33-44 postmenstrual weeks期间,Structural connectivity呈现明显的发展趋势,包括全局和局部效率的增加、特征路径长度的减少,以及脑叶内部和脑半球之间连接的广泛增强。此外,研究还发现了在不同连接加权方法下保持一致的不对称模式。
    Abstract Early brain development is characterized by the formation of a highly organized structural connectome. The interconnected nature of this connectome underlies the brain's cognitive abilities and influences its response to diseases and environmental factors. Hence, quantitative assessment of structural connectivity in the perinatal stage is useful for studying normal and abnormal neurodevelopment. However, estimation of the connectome from diffusion MRI data involves complex computations. For the perinatal period, these computations are further challenged by the rapid brain development and imaging difficulties. Combined with high inter-subject variability, these factors make it difficult to chart the normal development of the structural connectome. As a result, there is a lack of reliable normative baselines of structural connectivity metrics at this critical stage in brain development. In this study, we developed a computational framework, based on spatio-temporal averaging, for determining such baselines. We used this framework to analyze the structural connectivity between 33 and 44 postmenstrual weeks using data from 166 subjects. Our results unveiled clear and strong trends in the development of structural connectivity in perinatal stage. Connection weighting based on fractional anisotropy and neurite density produced the most consistent results. We observed increases in global and local efficiency, a decrease in characteristic path length, and widespread strengthening of the connections within and across brain lobes and hemispheres. We also observed asymmetry patterns that were consistent between different connection weighting approaches. The new computational method and results are useful for assessing normal and abnormal development of the structural connectome early in life.
    摘要 早期大脑发展的特征是形成高度组织化的结构连接组。这个连接组的互联特性是大脑认知能力的基础,也影响了它对疾病和环境因素的反应。因此,在围产期对结构连接性进行量化评估是研究正常和异常神经发展的有用工具。然而,从Diffusion MRI数据中估计连接组涉及复杂的计算。在围产期,这些计算还受到大脑迅速发育和成像困难的挑战。加上个体间变化性高,这些因素使得描绘结构连接组的正常发展十分困难。因此,在这个大脑发育的关键阶段,我们缺乏可靠的结构连接度量的正常基线。在这项研究中,我们开发了一种基于时空平均的计算框架来确定这些基线。我们使用该框架,基于166名被试的数据,分析了33到44周(postmenstrual weeks)期间的结构连接性。我们的结果揭示了围产期结构连接性发展的明确而强烈的趋势。基于分数各向异性(fractional anisotropy)和神经突密度的连接加权产生了最一致的结果。我们观察到全局和局部效率增加、特征路径长度减少,以及脑叶内部和脑半球之间连接的普遍增强。我们还观察到在不同连接加权方法下保持一致的不对称模式。这些新的计算方法和结果有助于评估生命早期结构连接组的正常和异常发展。
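A small sketch of the graph metrics named in the results (global/local efficiency, characteristic path length), computed with NetworkX on a toy weighted connectome. The random graph and weights are stand-ins for real FA- or neurite-density-weighted connectomes, so only the metric definitions, not the numbers, carry over.

```python
import numpy as np
import networkx as nx

# Toy weighted connectome: nodes are brain regions, edge weights stand in for
# connection strengths; a connected small-world graph keeps the example valid.
rng = np.random.default_rng(0)
G = nx.connected_watts_strogatz_graph(20, 4, 0.3, seed=0)
for u, v, d in G.edges(data=True):
    d["weight"] = rng.uniform(0.1, 1.0)
    d["distance"] = 1.0 / d["weight"]   # stronger connection = shorter path

global_eff = nx.global_efficiency(G)    # topology-based efficiency measures
local_eff = nx.local_efficiency(G)
cpl = nx.average_shortest_path_length(G, weight="distance")
print(f"global efficiency={global_eff:.3f}, "
      f"local efficiency={local_eff:.3f}, "
      f"characteristic path length={cpl:.3f}")
```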

Algorithm-assisted discovery of an intrinsic order among mathematical constants

  • paper_url: http://arxiv.org/abs/2308.11829
  • repo_url: None
  • paper_authors: Rotem Elimelech, Ofir David, Carlos De la Cruz Mengual, Rotem Kalisch, Wolfgang Berndt, Michael Shalyt, Mark Silberstein, Yaron Hadad, Ido Kaminer
  • for: 这个论文的目的是探索数学领域中的新概念和关系,利用计算机算法和人类直觉的结合来发现新的数学常量。
  • methods: 这篇论文使用了大规模并行计算机算法,探索了巨量的参数空间,并发现了一种新的数学结构——保守矩阵场。
  • results: 这篇论文发现了大量基础数学常数的新连分数公式,揭示了不同数学常数之间的意外关系(包括黎曼ζ函数在多个整数处的取值),并可用于得到新的无理性证明,例如推广了Apéry关于ζ(3)无理性的证明。这些结果展示了计算机辅助的实验数学研究策略的力量,为解决长期悬而未决的数学问题开辟了新的可能性。
    Abstract In recent decades, a growing number of discoveries in fields of mathematics have been assisted by computer algorithms, primarily for exploring large parameter spaces that humans would take too long to investigate. As computers and algorithms become more powerful, an intriguing possibility arises - the interplay between human intuition and computer algorithms can lead to discoveries of novel mathematical concepts that would otherwise remain elusive. To realize this perspective, we have developed a massively parallel computer algorithm that discovers an unprecedented number of continued fraction formulas for fundamental mathematical constants. The sheer number of formulas discovered by the algorithm unveils a novel mathematical structure that we call the conservative matrix field. Such matrix fields (1) unify thousands of existing formulas, (2) generate infinitely many new formulas, and most importantly, (3) lead to unexpected relations between different mathematical constants, including multiple integer values of the Riemann zeta function. Conservative matrix fields also enable new mathematical proofs of irrationality. In particular, we can use them to generalize the celebrated proof by Ap\'ery for the irrationality of $\zeta(3)$. Utilizing thousands of personal computers worldwide, our computer-supported research strategy demonstrates the power of experimental mathematics, highlighting the prospects of large-scale computational approaches to tackle longstanding open problems and discover unexpected connections across diverse fields of science.
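As a concrete illustration of the objects being searched over, the snippet below evaluates a generalized continued fraction exactly with rationals and checks a classical formula related to pi. The evaluation scheme is a generic sketch, not the paper's massively parallel search algorithm.

```python
from fractions import Fraction
import math

def continued_fraction(a, b, depth):
    """Evaluate a(0) + b(1)/(a(1) + b(2)/(a(2) + ...)) down to `depth`,
    using exact rational arithmetic."""
    value = Fraction(a(depth))
    for n in range(depth, 0, -1):
        value = a(n - 1) + Fraction(b(n)) / value
    return value

# Classical example in this family of formulas:
#   4/pi = 1 + 1/(3 + 4/(5 + 9/(7 + 16/(9 + ...))))
# i.e. a(n) = 2n + 1 and b(n) = n^2.
approx = continued_fraction(lambda n: 2 * n + 1, lambda n: n * n, depth=80)
print(float(approx), 4 / math.pi)   # both are approximately 1.2732395
```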

Exploring the Effectiveness of GPT Models in Test-Taking: A Case Study of the Driver’s License Knowledge Test

  • paper_url: http://arxiv.org/abs/2308.11827
  • repo_url: None
  • paper_authors: Saba Rahimi, Tucker Balch, Manuela Veloso
  • for: 本研究旨在使用Context不在GPT模型训练数据中的信息来解决GPT模型无法回答最新发展或非公共文档中的问题。
  • methods: 该方法包括对Context信息进行处理、将Context和问题embedding,通过Context embedding的集成构建提问、使用GPT模型回答问题。
  • results: 在控制测试enario中,使用加州driver’s Handbook作为信息源,GPT-3模型在50个样本驾驶知识测试题上达到了96%的通过率,而无Context情况下的通过率为82%。然而,Model仍然无法正确回答一些问题,反映出还有改进空间。研究还研究了提问长度和Context格式对模型性能的影响。
    Abstract Large language models such as Open AI's Generative Pre-trained Transformer (GPT) models are proficient at answering questions, but their knowledge is confined to the information present in their training data. This limitation renders them ineffective when confronted with questions about recent developments or non-public documents. Our research proposes a method that enables GPT models to answer questions by employing context from an information source not previously included in their training data. The methodology includes preprocessing of contextual information, the embedding of contexts and queries, constructing prompt through the integration of context embeddings, and generating answers using GPT models. We applied this method in a controlled test scenario using the California Driver's Handbook as the information source. The GPT-3 model achieved a 96% passing score on a set of 50 sample driving knowledge test questions. In contrast, without context, the model's passing score fell to 82%. However, the model still fails to answer some questions correctly even with providing library of context, highlighting room for improvement. The research also examined the impact of prompt length and context format, on the model's performance. Overall, the study provides insights into the limitations and potential improvements for GPT models in question-answering tasks.
    摘要 大型语言模型,如Open AI的生成预训练变换器(GPT)模型,在回答问题方面表现出色,但它们的知识仅限于训练数据中的信息。这个限制使得它们无法回答关于最新发展或非公开文档的问题。我们的研究提出了一种方法,使GPT模型能够利用不包含在其训练数据中的信息源来回答问题。该方法包括上下文信息的预处理、上下文与查询的嵌入、通过整合上下文嵌入来构建提示,以及使用GPT模型生成答案。我们在受控的测试场景中应用了这种方法,使用加利福尼亚驾驶手册作为信息源。GPT-3模型在50个示例驾驶知识测试问题中取得96%的通过率,而在没有上下文的情况下,模型的通过率下降到82%。然而,即使提供了上下文库,模型仍然无法正确回答一些问题,这凸显了进一步改进的空间。研究还检查了提示长度和上下文格式对模型性能的影响。总的来说,这项研究提供了对GPT模型在问答任务中的局限性和可能改进方向的洞见。
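A generic sketch of the pipeline this entry describes: embed the handbook chunks and the question, keep the most similar chunks, and prepend them to the prompt. The toy hashing "embedding" and handbook snippets are placeholders for a real embedding model and the actual handbook text, so this is an illustration of the idea rather than the paper's implementation.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(question, chunks, embed, top_k=3):
    """Rank context chunks by embedding similarity to the question and place
    the best ones in the prompt ahead of the question."""
    chunk_vecs = [embed(c) for c in chunks]
    q_vec = embed(question)
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine_sim(q_vec, cv[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:top_k])
    return (f"Answer the driving-test question using only the context.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def toy_embed(text, dim=64):
    """Hashing-trick bag-of-words vector, standing in for a real embedder."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v

chunks = ["You must stop at a red traffic signal.",
          "The speed limit in residential areas is 25 mph unless posted."]
print(build_prompt("What is the residential speed limit?", chunks, toy_embed, top_k=1))
```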

Expressive probabilistic sampling in recurrent neural networks

  • paper_url: http://arxiv.org/abs/2308.11809
  • repo_url: None
  • paper_authors: Shirui Chen, Linxin Preston Jiang, Rajesh P. N. Rao, Eric Shea-Brown
  • for: 这篇论文的目的是研究 sampling-based Bayesian 脑模型中的神经活动,即神经活动被视为 probabilistic 计算中的样本。
  • methods: 这篇论文使用了泛函分析和随机微分方程来探讨 recurrent 神经网络如何从复杂分布中采样。
  • results: 论文表明,带有独立输出单元的 recurrent 神经网络可以从任意分布中采样,并提出了一种基于去噪得分匹配(denoising score matching)的有效训练方法。实验表明,该模型可以从多种复杂数据分布中采样。
    Abstract In sampling-based Bayesian models of brain function, neural activities are assumed to be samples from probability distributions that the brain uses for probabilistic computation. However, a comprehensive understanding of how mechanistic models of neural dynamics can sample from arbitrary distributions is still lacking. We use tools from functional analysis and stochastic differential equations to explore the minimum architectural requirements for $\textit{recurrent}$ neural circuits to sample from complex distributions. We first consider the traditional sampling model consisting of a network of neurons whose outputs directly represent the samples (sampler-only network). We argue that synaptic current and firing-rate dynamics in the traditional model have limited capacity to sample from a complex probability distribution. We show that the firing rate dynamics of a recurrent neural circuit with a separate set of output units can sample from an arbitrary probability distribution. We call such circuits reservoir-sampler networks (RSNs). We propose an efficient training procedure based on denoising score matching that finds recurrent and output weights such that the RSN implements Langevin sampling. We empirically demonstrate our model's ability to sample from several complex data distributions using the proposed neural dynamics and discuss its applicability to developing the next generation of sampling-based brain models.
    摘要 在基于抽样的贝叶斯脑功能模型中,神经活动被假设为来自大脑用于概率计算的概率分布的样本。然而,对于机制性的神经动力学模型如何从任意分布中抽样,目前仍缺乏全面的理解。我们使用泛函分析和随机微分方程来探索循环神经回路从复杂分布中抽样所需的最小结构要求。我们首先考虑传统抽样模型,即一个由神经元输出直接表示样本的网络(仅抽样器网络)。我们论证,传统模型中的突触电流和发放率动力学从复杂概率分布中抽样的能力有限。我们进而证明,一个具有独立输出单元的循环神经回路的发放率动力学可以从任意概率分布中抽样,我们称这种回路为储备抽样网络(RSN)。我们提出了一种基于去噪得分匹配的高效训练方法,用于找到循环权重和输出权重,使得 RSN 实现朗之万(Langevin)采样。我们通过实验证明该模型能够从多种复杂数据分布中抽样,并讨论了其在开发下一代基于抽样的脑模型方面的适用性。
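To make the sampling target concrete, the sketch below runs unadjusted Langevin dynamics on a 1D Gaussian mixture with an analytic score function. In the paper's setting the score-like drift would instead be produced by a recurrent network trained with denoising score matching, so everything here (step size, target, chain length) is a simplified stand-in.

```python
import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=20000, rng=None):
    """Unadjusted Langevin dynamics:
    x_{t+1} = x_t + step * score(x_t) + sqrt(2*step) * noise."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
        samples.append(x.copy())
    return np.array(samples)

def mixture_score(x, mu=(-2.0, 2.0)):
    """Score (gradient of log density) of an equal-weight Gaussian mixture."""
    w = np.array([np.exp(-0.5 * (x - m) ** 2) for m in mu])
    w = w / w.sum(axis=0)                      # posterior responsibilities
    return sum(wi * (m - x) for wi, m in zip(w, mu))

samples = langevin_sample(mixture_score, x0=np.zeros(1))
burn = samples[5000:]
print(f"fraction of samples in the right-hand mode: {(burn > 0).mean():.2f}")
```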

Ceci n’est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings

  • paper_url: http://arxiv.org/abs/2308.11804
  • repo_url: None
  • paper_authors: Eugene Bagdasaryan, Vitaly Shmatikov
  • for: 这种研究用于检测多modal embeddings中的攻击点,以及这些攻击点如何影响下游任务。
  • methods: 研究人员使用了多modal embeddings,并通过对这些 embeddings 进行攻击来证明它们的易攻击性。
  • results: 研究发现,使用这种攻击方法可以让恶意者将任意输入与其他模式的输入相关联,从而影响多modal embeddings 的性能。
    Abstract Multi-modal encoders map images, sounds, texts, videos, etc. into a single embedding space, aligning representations across modalities (e.g., associate an image of a dog with a barking sound). We show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an input in any modality, an adversary can perturb it so as to make its embedding close to that of an arbitrary, adversary-chosen input in another modality. Illusions thus enable the adversary to align any image with any text, any text with any sound, etc. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks. Using ImageBind embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, and zero-shot classification.
    摘要 多模态编码器将图像、声音、文本、视频等转换到单一的嵌入空间中,使表示之间匹配(例如,将一张狗图像与一个叫声相对应)。我们表明,多模态嵌入可能敏感于我们称为“ adversarial 幻觉”的攻击。给定任意模式的输入,敌方可以对其进行扰动,使其嵌入接近另一种选择的敌方输入的嵌入。幻觉如此能让敌方将任意图像与任意文本、任意声音等相对应。这些幻觉利用嵌入空间的近似性,因此不受下游任务的限制。使用 ImageBind 嵌入,我们示例了如何使用不知道下游任务的情况,通过生成 adversarially 对齐的输入,使图像生成、文本生成和零学习分类发生幻觉。
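A minimal sketch of the attack idea described above: perturb an input within a small L-infinity ball so that its embedding moves toward an attacker-chosen target embedding from another modality. The toy encoder, perturbation budget, and optimizer settings are assumptions standing in for a real multi-modal image tower such as ImageBind's.

```python
import torch
import torch.nn.functional as F

def adversarial_align(x, target_emb, encoder, eps=8 / 255, steps=100, lr=1e-2):
    """Optimize a bounded perturbation so encoder(x + delta) has high cosine
    similarity to a chosen target embedding."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = encoder((x + delta).clamp(0, 1))
        loss = 1 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)            # keep perturbation imperceptible
    return (x + delta).detach().clamp(0, 1)

# Toy encoder standing in for a multi-modal image tower.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
x = torch.rand(1, 3, 32, 32)
target = torch.randn(1, 128)                   # e.g. the embedding of some text
x_adv = adversarial_align(x, target, encoder)
print(F.cosine_similarity(encoder(x_adv), target, dim=-1))
```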

WS-SfMLearner: Self-supervised Monocular Depth and Ego-motion Estimation on Surgical Videos with Unknown Camera Parameters

  • paper_url: http://arxiv.org/abs/2308.11776
  • repo_url: None
  • paper_authors: Ange Lou, Jack Noble
  • for: 这个论文的目的是建立一个自监督的深度与相机自运动(ego-motion)估计系统,以便在外科视频中估计深度图和摄像头参数。
  • methods: 该论文使用了一种基于 cost-volume 的监督方式来为系统提供辅助监督,以便预测摄像头内参数。
  • results: 实验结果表明,所提方法可以改善摄像头参数、相机自运动和深度估计的准确性。
    Abstract Depth estimation in surgical video plays a crucial role in many image-guided surgery procedures. However, it is difficult and time consuming to create depth map ground truth datasets in surgical videos due in part to inconsistent brightness and noise in the surgical scene. Therefore, building an accurate and robust self-supervised depth and camera ego-motion estimation system is gaining more attention from the computer vision community. Although several self-supervision methods alleviate the need for ground truth depth maps and poses, they still need known camera intrinsic parameters, which are often missing or not recorded. Moreover, the camera intrinsic prediction methods in existing works depend heavily on the quality of datasets. In this work, we aimed to build a self-supervised depth and ego-motion estimation system which can predict not only accurate depth maps and camera pose, but also camera intrinsic parameters. We proposed a cost-volume-based supervision manner to give the system auxiliary supervision for camera parameters prediction. The experimental results showed that the proposed method improved the accuracy of estimated camera parameters, ego-motion, and depth estimation.
    摘要 深度估计在手术视频中的许多图像引导手术流程中发挥重要作用。但由于手术场景中亮度不一致且存在噪声,在手术视频中创建深度图真值数据集既困难又耗时,因此建立精准且鲁棒的自监督深度与相机自运动估计系统在计算机视觉社区中受到越来越多的关注。虽然一些自监督方法可以减少对深度图和位姿真值的需求,但它们仍然需要已知的摄像头内参数,而这些参数通常缺失或未被记录。此外,现有工作中的摄像头内参数预测方法严重依赖于数据集的质量。在这项工作中,我们希望建立一个不仅能够预测准确的深度图和摄像头位姿,还能预测摄像头内参数的自监督系统。我们提出了基于Cost Volume的监督方式,为系统的摄像头参数预测提供辅助监督。实验结果表明,我们的方法可以提高摄像头参数、相机自运动和深度估计的准确性。

3ET: Efficient Event-based Eye Tracking using a Change-Based ConvLSTM Network

  • paper_url: http://arxiv.org/abs/2308.11771
  • repo_url: None
  • paper_authors: Qinyu Chen, Zuowen Wang, Shih-Chii Liu, Chang Gao
  • for: 该论文旨在开发一种基于 Change-Based Convolutional Long Short-Term Memory (CB-ConvLSTM) 的模型,用于 event-based eye tracking,这是下一代可穿戴医疗技术(如 AR/VR 头戴设备)的关键。
  • methods: 该论文使用了 retina-inspired event cameras,它们相比传统的基于帧的摄像头具有低延迟响应和稀疏的输出事件流。CB-ConvLSTM 架构高效地提取用于 pupil tracking 的时空特征,优于传统的 CNN 结构。
  • results: 该论文使用 delta-encoded recurrent path 提高了 activation sparsity,从而在不损失准确性的情况下将算术运算量减少约 4.7 $\times$。这使其成为资源受限设备上实时眼动跟踪的理想选择。项目代码和数据集公开于 \url{https://github.com/qinche106/cb-convlstm-eyetracking}.
    Abstract This paper presents a sparse Change-Based Convolutional Long Short-Term Memory (CB-ConvLSTM) model for event-based eye tracking, key for next-generation wearable healthcare technology such as AR/VR headsets. We leverage the benefits of retina-inspired event cameras, namely their low-latency response and sparse output event stream, over traditional frame-based cameras. Our CB-ConvLSTM architecture efficiently extracts spatio-temporal features for pupil tracking from the event stream, outperforming conventional CNN structures. Utilizing a delta-encoded recurrent path enhancing activation sparsity, CB-ConvLSTM reduces arithmetic operations by approximately 4.7$\times$ without losing accuracy when tested on a \texttt{v2e}-generated event dataset of labeled pupils. This increase in efficiency makes it ideal for real-time eye tracking in resource-constrained devices. The project code and dataset are openly available at \url{https://github.com/qinche106/cb-convlstm-eyetracking}.
    摘要 这篇论文介绍了一种稀疏的基于变化的卷积长短期记忆(CB-ConvLSTM)模型,用于基于事件的眼动跟踪,这是下一代可穿戴医疗技术(如AR/VR头戴式设备)的关键。相比传统的基于帧的摄像头,我们利用了受视网膜启发的事件摄像头的优点,即低延迟响应和稀疏的输出事件流。我们的CB-ConvLSTM架构有效地从事件流中提取用于瞳孔跟踪的时空特征,优于传统的CNN结构。通过使用delta编码的循环路径增强激活稀疏性,CB-ConvLSTM在不损失准确性的情况下将算术运算量减少约4.7倍,使其适用于资源受限设备上的实时眼动跟踪。项目代码和数据集公开可用,请参考\url{https://github.com/qinche106/cb-convlstm-eyetracking}.
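A tiny sketch of the delta-encoding idea behind the claimed sparsity savings: only changes between consecutive frames are propagated, so downstream multiply-accumulates on zero entries can be skipped. The frame sizes and threshold are illustrative assumptions; this is not the paper's CB-ConvLSTM cell.

```python
import torch

def delta_encode(frames, threshold=0.0):
    """Keep only the changes between consecutive frames of an input stream,
    producing sparse deltas that a change-based recurrent layer can exploit."""
    deltas = torch.zeros_like(frames)
    deltas[0] = frames[0]
    diff = frames[1:] - frames[:-1]
    deltas[1:] = torch.where(diff.abs() > threshold, diff, torch.zeros_like(diff))
    return deltas

frames = (torch.rand(10, 1, 64, 64) > 0.97).float()   # sparse event frames
deltas = delta_encode(frames)
density = (deltas != 0).float().mean().item()
print(f"non-zero fraction after delta encoding: {density:.3f}")
```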

Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11764
  • repo_url: https://github.com/engsalem/halo
  • paper_authors: Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu
  • for: 该论文的目的是量化和缓解较弱的开源大语言模型(LLMs)中的幻觉现象。
  • methods: 该论文提出了一种名为 HaloCheck 的轻量级、黑盒、无需外部知识的框架,用于量化 LLMs 中幻觉的严重程度。此外,论文还探讨了知识注入和教师-学生方法来缓解低参数 LLMs 中的幻觉现象。
  • results: 实验表明,这些技术可以有效缓解 LLMs 在具有挑战性的领域中的幻觉现象。
    Abstract Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP). Although convenient for research and practical applications, open-source LLMs with fewer parameters often suffer from severe hallucinations compared to their larger counterparts. This paper focuses on measuring and reducing hallucinations in BLOOM 7B, a representative of such weaker open-source LLMs that are publicly available for research and commercial applications. We introduce HaloCheck, a lightweight BlackBox knowledge-free framework designed to quantify the severity of hallucinations in LLMs. Additionally, we explore techniques like knowledge injection and teacher-student approaches to alleviate hallucinations in low-parameter LLMs. Our experiments effectively demonstrate the reduction of hallucinations in challenging domains for these LLMs.
    摘要 大型语言模型(LLM)已经革命性地改变了自然语言处理(NLP)领域。虽然开源 LLM 便于研究和实际应用,但参数较少的开源 LLM 相比更大的模型往往存在严重的幻觉问题。这篇论文关注测量和减少 LLM 中的幻觉,特别是开源 LLM 中的 BLOOM 7B。我们介绍 HaloCheck,一种轻量级、黑盒、无需外部知识的框架,用于评估 LLM 中幻觉的严重程度。此外,我们还探讨了如何通过知识注入和教师-学生方法来缓解低参数 LLM 中的幻觉。我们的实验有效地展示了在这些 LLM 的挑战性领域中减少幻觉的效果。

VBMO: Voting-Based Multi-Objective Path Planning

  • paper_url: http://arxiv.org/abs/2308.11755
  • repo_url: None
  • paper_authors: Raj Korpan
  • for: 本研究开发了一种VBMO算法,用于生成优化单个目标计划,并对每个目标进行评估,使用投票机制来选择最佳计划。
  • methods: VBMO算法不使用手动调整的权重,不是基于进化算法,而是根据一个计划在一个目标方面的优化程度来评估其在其他目标方面的表现。VBMO使用三种投票机制:范围、波达和组合批准。
  • results: 对于多种和复杂的环境,VBMO算法能够高效生成满足多个目标的计划。
    Abstract This paper presents VBMO, the Voting-Based Multi-Objective path planning algorithm, that generates optimal single-objective plans, evaluates each of them with respect to the other objectives, and selects one with a voting mechanism. VBMO does not use hand-tuned weights, consider the multiple objectives at every step of search, or use an evolutionary algorithm. Instead, it considers how a plan that is optimal in one objective may perform well with respect to others. VBMO incorporates three voting mechanisms: range, Borda, and combined approval. Extensive evaluation in diverse and complex environments demonstrates the algorithm's ability to efficiently produce plans that satisfy multiple objectives.
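A minimal sketch of one of the voting mechanisms named in the bullet points (Borda), applied to plans that have each been optimized for a single objective. The plan costs are made-up numbers; range or combined-approval voting would replace only the scoring rule, not the overall structure.

```python
def borda_vote(scores):
    """Borda voting over candidate plans. scores[i][j] is the cost of plan i
    under objective j (lower is better). Each objective ranks the plans and
    awards Borda points; the plan with the most points wins."""
    n_plans = len(scores)
    n_obj = len(scores[0])
    points = [0] * n_plans
    for j in range(n_obj):
        order = sorted(range(n_plans), key=lambda i: scores[i][j])
        for rank, plan in enumerate(order):
            points[plan] += n_plans - 1 - rank   # best rank earns most points
    winner = max(range(n_plans), key=lambda i: points[i])
    return winner, points

# Three plans, each optimal for one of three objectives (length, risk, energy).
scores = [[10.0, 5.0, 7.0],
          [12.0, 3.0, 6.0],
          [11.0, 4.0, 4.0]]
winner, pts = borda_vote(scores)
print(winner, pts)   # plan 2 wins with the highest Borda score
```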

Multi-Instance Adversarial Attack on GNN-Based Malicious Domain Detection

  • paper_url: http://arxiv.org/abs/2308.11754
  • repo_url: None
  • paper_authors: Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, NhatHai Phan, Yao Ma
  • for: 这种研究的目的是检测互联网域名是否与网络攻击相关。
  • methods: 该方法使用图神经网络(GNN)来推断互联网域名的危险程度,并使用DNS日志来构建域名图(DMG)。
  • results: 该研究发现,现有的单实例(单节点)逃逸技术不足以发起同时操纵多个节点的多实例逃逸攻击,并提出了一种多实例对抗攻击方法(MintA),仅需对目标模型的黑盒访问即可实施攻击。该方法在真实数据上实现了超过 80% 的攻击成功率。
    Abstract Malicious domain detection (MDD) is an open security challenge that aims to detect if an Internet domain is associated with cyber-attacks. Among many approaches to this problem, graph neural networks (GNNs) are deemed highly effective. GNN-based MDD uses DNS logs to represent Internet domains as nodes in a maliciousness graph (DMG) and trains a GNN to infer their maliciousness by leveraging identified malicious domains. Since this method relies on accessible DNS logs to construct DMGs, it exposes a vulnerability for adversaries to manipulate their domain nodes' features and connections within DMGs. Existing research mainly concentrates on threat models that manipulate individual attacker nodes. However, adversaries commonly generate multiple domains to achieve their goals economically and avoid detection. Their objective is to evade discovery across as many domains as feasible. In this work, we call the attack that manipulates several nodes in the DMG concurrently a multi-instance evasion attack. We present theoretical and empirical evidence that the existing single-instance evasion techniques for are inadequate to launch multi-instance evasion attacks against GNN-based MDDs. Therefore, we introduce MintA, an inference-time multi-instance adversarial attack on GNN-based MDDs. MintA enhances node and neighborhood evasiveness through optimized perturbations and operates successfully with only black-box access to the target model, eliminating the need for knowledge about the model's specifics or non-adversary nodes. We formulate an optimization challenge for MintA, achieving an approximate solution. Evaluating MintA on a leading GNN-based MDD technique with real-world data showcases an attack success rate exceeding 80%. These findings act as a warning for security experts, underscoring GNN-based MDDs' susceptibility to practical attacks that can undermine their effectiveness and benefits.
    摘要 恶意域名检测(MDD)是一个开放的安全挑战,旨在检测互联网域名是否与网络攻击相关。在众多解决该问题的方法中,图神经网络(GNN)被视为非常有效。基于GNN的 MDD 使用 DNS 日志将互联网域名表示为恶意域名图(DMG)中的节点,并利用已知的恶意域名训练 GNN 来推断各域名的恶意程度。由于这种方法依赖可获取的 DNS 日志来构建 DMG,因此存在攻击者可以操纵其域名节点在 DMG 中的特征和连接的漏洞。现有的研究主要集中在操纵单个攻击者节点的威胁模型上。然而,攻击者通常会生成多个域名来经济地实现其目标并避免检测。我们称这种同时操纵 DMG 中多个节点的攻击为多实例逃逸攻击。我们提供了理论和实验证据,证明现有的单实例逃逸技术不足以对基于GNN的 MDD 发起多实例逃逸攻击。因此,我们提出 MintA,一种在推理阶段针对基于GNN的 MDD 的多实例对抗攻击。MintA 通过优化的扰动提升节点及其邻域的隐蔽性,并且仅需对目标模型的黑盒访问即可成功实施,无需了解模型细节或非攻击者节点。我们为 MintA 构建了一个优化问题,并给出了近似解。在真实数据上对一种领先的基于GNN的 MDD 技术进行评估,MintA 的攻击成功率超过 80%。这些发现为安全专家敲响了警钟,凸显了基于GNN的 MDD 易受实际攻击影响,其有效性和优势可能因此被削弱。

Patient Clustering via Integrated Profiling of Clinical and Digital Data

  • paper_url: http://arxiv.org/abs/2308.11748
  • repo_url: None
  • paper_authors: Dongjin Choi, Andy Xiang, Ozgur Ozturk, Deep Shrestha, Barry Drake, Hamid Haidarian, Faizan Javed, Haesun Park
  • for: 这个研究是为了开发一个基于Profile的患者划分模型,用于医疗数据分析。
  • methods: 这个模型使用基于受约束低秩近似(constrained low-rank approximation)的方法,利用患者的临床数据和数字互动数据(包括浏览和搜索)构建患者画像(profile)。该方法生成非负嵌入向量,作为患者的低维表示。
  • results: 对于使用真实世界患者数据集进行评估,这个方法在划分准确性和推荐精度方面表现出优于其他基线方法。
    Abstract We introduce a novel profile-based patient clustering model designed for clinical data in healthcare. By utilizing a method grounded on constrained low-rank approximation, our model takes advantage of patients' clinical data and digital interaction data, including browsing and search, to construct patient profiles. As a result of the method, nonnegative embedding vectors are generated, serving as a low-dimensional representation of the patients. Our model was assessed using real-world patient data from a healthcare web portal, with a comprehensive evaluation approach which considered clustering and recommendation capabilities. In comparison to other baselines, our approach demonstrated superior performance in terms of clustering coherence and recommendation accuracy.
    摘要 我们介绍了一种面向医疗临床数据的、基于画像的新型患者聚类模型。我们的模型采用受约束低秩近似的方法,利用患者的临床数据和数字互动数据(包括浏览和搜索)构建患者画像。由此生成的非负嵌入向量可作为患者的低维表示。我们使用来自某医疗网站门户的真实患者数据对模型进行了评估,评估方法全面考虑了聚类和推荐能力。与其他基线相比,我们的方法在聚类一致性和推荐精度方面表现更优。
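As a rough stand-in for the constrained low-rank approximation described above, the sketch below factorizes a combined clinical-plus-digital feature matrix with nonnegative matrix factorization and reads cluster assignments off the nonnegative embedding. The synthetic data, component count, and the use of plain NMF are assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy patient-by-feature matrix combining clinical codes and digital
# interaction counts (columns are illustrative; real profiles would come from
# EHR fields plus browse/search logs).
rng = np.random.default_rng(0)
clinical = rng.poisson(1.0, size=(100, 20))
digital = rng.poisson(2.0, size=(100, 30))
X = np.hstack([clinical, digital]).astype(float)

# Nonnegative low-rank factorization: rows of W are low-dimensional,
# nonnegative patient embeddings; take the argmax component as the cluster.
model = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)
clusters = W.argmax(axis=1)
print(W.shape, np.bincount(clusters))
```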

Lifted Inference beyond First-Order Logic

  • paper_url: http://arxiv.org/abs/2308.11738
  • repo_url: None
  • paper_authors: Sagar Malhotra, Davide Bizzaro, Luciano Serafini
  • for: 该论文旨在探讨Weighted First Order Model Counting (WFOMC) 的可处理性,以及可用于 probabilistic inference 的可提升(domain-liftable)逻辑片段。
  • methods: 该论文使用了一种称为 “counting by splitting” 的新方法来解决 WFOMC 的难题。这种方法可以应用于多种不同的结构约束,如 directed acyclic graphs、connected graphs、trees 等。
  • results: 结果表明,利用 “counting by splitting” 方法,带有上述结构约束的 $\mathrm{C^2}$ 句子仍然是 domain-liftable 的,从而可以进行 probabilistic inference。此外,该论文还推广了离散数学文献中关于 directed acyclic graphs、phylogenetic networks 等的许多已有结果。
    Abstract Weighted First Order Model Counting (WFOMC) is fundamental to probabilistic inference in statistical relational learning models. As WFOMC is known to be intractable in general ($\#$P-complete), logical fragments that admit polynomial time WFOMC are of significant interest. Such fragments are called domain liftable. Recent works have shown that the two-variable fragment of first order logic extended with counting quantifiers ($\mathrm{C^2}$) is domain-liftable. However, many properties of real-world data, like acyclicity in citation networks and connectivity in social networks, cannot be modeled in $\mathrm{C^2}$, or first order logic in general. In this work, we expand the domain liftability of $\mathrm{C^2}$ with multiple such properties. We show that any $\mathrm{C^2}$ sentence remains domain liftable when one of its relations is restricted to represent a directed acyclic graph, a connected graph, a tree (resp. a directed tree) or a forest (resp. a directed forest). All our results rely on a novel and general methodology of "counting by splitting". Besides their application to probabilistic inference, our results provide a general framework for counting combinatorial structures. We expand a vast array of previous results in discrete mathematics literature on directed acyclic graphs, phylogenetic networks, etc.
    摘要 Weighted First Order Model Counting (WFOMC) 是统计关系学习模型中概率推断的基础。由于 WFOMC 在一般情况下是不可处理的(#P-完全),能够在多项式时间内完成 WFOMC 的逻辑片段具有重要意义,这些片段被称为可域提升的(domain liftable)。近期工作表明,带计数量词的一阶逻辑两变量片段(C^2)是可域提升的。然而,许多真实数据的特性,如引文网络中的无环性和社交网络中的连通性,无法用 C^2 乃至一般的一阶逻辑来建模。在本工作中,我们用多个此类特性扩展了 C^2 的可域提升性。我们证明,当 C^2 句子中的某个关系被限制为表示有向无环图、连通图、树(或有向树)或森林(或有向森林)时,该句子仍然是可域提升的。我们的结果基于一种新的通用方法,称为“计数分解”(counting by splitting)。除了在概率推断中的应用之外,我们的结果还为计数组合结构提供了一个通用框架,并扩展了离散数学文献中关于有向无环图、系统发育网络等的大量已有结果。

Knowledge Graph Prompting for Multi-Document Question Answering

  • paper_url: http://arxiv.org/abs/2308.11730
  • repo_url: None
  • paper_authors: Yu Wang, Nedim Lipka, Ryan A. Rossi, Alexa Siu, Ruiyi Zhang, Tyler Derr
  • for: 这种方法可以帮助大语言模型在多文档问答任务中提高表现,特别是在需要深刻理解不同文档之间的逻辑关系时。
  • methods: 该方法包括知识图构建模块和LM引导的图遍历模块,用于在多文档之间导航语义/词汇相似性和结构关系。
  • results: 广泛的实验表明,这种方法可以提高大语言模型在多文档问答任务中的表现,并且可以减少检索延迟。
    Abstract The 'pre-train, prompt, predict' paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in the scenario of multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of different documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or intra-document structural relations. For graph traversal, we design an LM-guided graph traverser that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the LM-guided traverser acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design for LLMs. Our code is at https://github.com/YuWVandy/KG-LLM-MDQA.
    摘要 大型语言模型(LLM)的“预训练、提示、预测”范式在开放域问答中已经取得了显著成功。然而,现有研究很少探讨这种范式在多文档问答(MD-QA)任务中的应用,而该任务需要深入理解不同文档内容与结构之间的逻辑关联。为了填补这一重要空白,我们提出了知识图提示(KGP)方法,用于为 LLM 构建进行 MD-QA 所需的恰当上下文,该方法包括图构建模块和图遍历模块。在图构建方面,我们在多篇文档之上创建知识图,其中节点表示段落或文档结构(如页面/表格),边表示段落间的语义/词汇相似性或文档内部的结构关系。在图遍历方面,我们设计了LM引导的图遍历器,可以在节点之间穿行并收集支持性段落,以辅助 LLM 完成 MD-QA。构建的图充当全局标尺,规范段落之间的转移空间并降低检索延迟;同时,LM引导的遍历器充当局部导航器,收集相关上下文以逐步逼近问题并保证检索质量。大量实验证明了 KGP 在 MD-QA 上的有效性,表明了利用图来增强 LLM 提示设计的潜力。我们的代码位于 https://github.com/YuWVandy/KG-LLM-MDQA。

Efficient Benchmarking (of Language Models)

  • paper_url: http://arxiv.org/abs/2308.11696
  • repo_url: https://github.com/sumankrsh/Sentiment-Analysis.ipynb
  • paper_authors: Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen
  • for: 本研究旨在提高语言模型(LM)评估 benchmark的效率,不会 compromising 可靠性。
  • methods: 该研究以 HELM benchmark 作为案例,考察不同 benchmark 设计选择对计算成本与可靠性权衡的影响,并提出一种新的度量 Decision Impact on Reliability(DIoR)来评估设计决策对可靠性的影响。
  • results: 研究发现,仅移除一个低排名的模型就可能改变 HELM 上的当前领先者;而稍有不同的 HELM 场景选择则会使排名发生很大变化。基于这些发现,研究提出了一系列具体建议,可在几乎不损失 benchmark 可靠性的情况下大幅降低计算成本,通常可减少 100 倍或更多。
    Abstract The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, reaching thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has received little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability tradeoff. We propose to evaluate the reliability of such decisions by using a new measure, Decision Impact on Reliability (DIoR) for short. We find, for example, that the current leader on HELM may change by merely removing a low-ranked model from the benchmark, and observe that a handful of examples suffice to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios varies the ranking widely. Based on our findings, we outline a set of concrete recommendations for more efficient benchmark design and utilization practices, leading to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.
    摘要 语言模型(LM)日益多样的能力催生了新一代综合评估多种能力的基准。然而,这类基准伴随着巨大的计算成本,每个模型可能需要数千GPU小时,而文献中对评估效率的讨论却很少。在这篇文章中,我们提出了“高效基准评测”问题,即在不损害可靠性的前提下智能地降低LM评估的计算成本。我们以HELM基准为案例,研究不同的基准设计选择对计算成本与可靠性权衡的影响,并提出了一个新的度量指标,即决策对可靠性的影响(Decision Impact on Reliability, DIoR),来评估这些设计决策。我们发现,例如,仅仅从基准中移除一个低排名模型就可能改变当前的领先者,而只需少量示例即可获得正确的基准排名;相反,稍有不同的HELM场景选择会使排名差异很大。基于这些发现,我们提出了一系列更高效的基准设计与使用建议,通常可将计算量减少100倍或更多,而基准可靠性几乎没有损失。
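The DIoR idea can be illustrated with a toy rank-stability check: re-rank models after applying a candidate design decision (here, keeping only a random subset of examples) and measure how much the leaderboard moves. This is only a sketch of the kind of quantity DIoR formalizes, not the paper's exact definition; the data are random stand-ins.

```python
# Toy rank-stability check in the spirit of DIoR: how much does a benchmark design
# decision (keeping only n examples) perturb the model ranking? Values near 1.0 mean
# the decision barely matters. The exact DIoR formula is defined in the paper.
import numpy as np

def ranks(values):
    order = np.argsort(-values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(len(values))
    return r

def rank_stability(scores, n_examples_kept, trials=200, seed=0):
    """scores: (n_models, n_examples) per-example accuracies."""
    rng = np.random.default_rng(seed)
    full = ranks(scores.mean(axis=1))
    corrs = []
    for _ in range(trials):
        cols = rng.choice(scores.shape[1], size=n_examples_kept, replace=False)
        sub = ranks(scores[:, cols].mean(axis=1))
        corrs.append(np.corrcoef(full, sub)[0, 1])   # Spearman correlation of the two rankings
    return float(np.mean(corrs))

toy_scores = np.random.default_rng(1).random((10, 2000))   # 10 models, 2000 examples
print(rank_stability(toy_scores, n_examples_kept=100))
```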

Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models

  • paper_url: http://arxiv.org/abs/2308.11601
  • repo_url: None
  • paper_authors: Surya Narayanan Hari, Matt Thomson
  • for: 这个研究是为了提出一个适应环境的路由系统,Tryage,以便自动选择适当的语言模型库中的专家模型,以满足用户的多元化工作流程和数据领域的需求,同时解决了计算、安全性和新鲜度等考虑。
  • methods: Tryage使用了一个语言模型路由器来预测下游模型的表现,然后使用一个目标函数集成表现预测、用户目标和限制(例如模型大小、模型新鲜度、安全性、 verbosity 和可读性)来做路由决策。
  • results: Tryage在多元的数据集中,包括代码、文本、医疗资料和专利,超过 Gorilla 和 GPT3.5 Turbo 在动态模型选择中,可以实现50.9% 的准确率,比 GPT3.5 Turbo 的23.6% 和 Gorilla 的10.8% 高得多。
    Abstract The introduction of the transformer architecture and the self-attention mechanism has led to an explosive production of language models trained on specific downstream tasks and data domains. With over 200, 000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains while addressing computational, security, and recency concerns. There is an urgent need for machine learning frameworks that can eliminate the burden of model selection and customization and unleash the incredible power of the vast emerging model library for end users. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library based on analysis of individual input prompts. Inspired by the thalamic router in the brain, Tryage employs a perceptive router to predict down-stream model performance on prompts and, then, makes a routing decision using an objective function that integrates performance predictions with user goals and constraints that are incorporated through flags (e.g., model size, model recency). Tryage allows users to explore a Pareto front and automatically trade-off between task accuracy and secondary goals including minimization of model size, recency, security, verbosity, and readability. Across heterogeneous data sets that include code, text, clinical data, and patents, the Tryage framework surpasses Gorilla and GPT3.5 turbo in dynamic model selection identifying the optimal model with an accuracy of 50.9% , compared to 23.6% by GPT 3.5 Turbo and 10.8% by Gorilla. Conceptually, Tryage demonstrates how routing models can be applied to program and control the behavior of multi-model LLM systems to maximize efficient use of the expanding and evolving language model ecosystem.
    摘要
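As a sketch of the routing decision described above, the snippet below combines a (hypothetical) per-prompt performance predictor with user-weighted penalties for secondary goals such as model size and recency; the expert list, the toy predictor, and the weights are illustrative assumptions rather than Tryage's actual components.

```python
# Hedged sketch of Tryage-style routing: score each expert on the prompt with a learned
# predictor (stubbed here), then pick the expert maximizing predicted accuracy minus
# user-weighted penalties. All names, weights, and the toy predictor are assumptions.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    size_gb: float
    days_since_release: int

def route(prompt, experts, predict_accuracy, weights):
    def objective(expert):
        return (predict_accuracy(prompt, expert)
                - weights.get("size", 0.0) * expert.size_gb
                - weights.get("recency", 0.0) * expert.days_since_release)
    return max(experts, key=objective)

experts = [Expert("code-expert", 7.0, 30), Expert("general-small", 1.3, 400)]
predict = lambda p, e: 0.9 if ("def " in p and "code" in e.name) else 0.6  # stand-in router
chosen = route("def fib(n): ...", experts, predict, {"size": 0.01, "recency": 0.0001})
print(chosen.name)   # -> code-expert under these toy weights
```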

Practical Insights on Incremental Learning of New Human Physical Activity on the Edge

  • paper_url: http://arxiv.org/abs/2308.11691
  • repo_url: None
  • paper_authors: George Arvanitakis, Jingwei Zuo, Mthandazo Ndhlovu, Hakim Hacid
  • for: This paper explores the challenges of Edge-based learning, particularly in the context of limited data storage, computing power, and the number of learning classes.
  • methods: The paper uses the MAGNETO system to conduct experiments and demonstrate the challenges of Edge ML, using data collected from mobile sensors to learn human activities.
  • results: The paper highlights the challenges of Edge ML and offers valuable perspectives on how to address them.
    Abstract Edge Machine Learning (Edge ML), which shifts computational intelligence from cloud-based systems to edge devices, is attracting significant interest due to its evident benefits including reduced latency, enhanced data privacy, and decreased connectivity reliance. While these advantages are compelling, they introduce unique challenges absent in traditional cloud-based approaches. In this paper, we delve into the intricacies of Edge-based learning, examining the interdependencies among: (i) constrained data storage on Edge devices, (ii) limited computational power for training, and (iii) the number of learning classes. Through experiments conducted using our MAGNETO system, that focused on learning human activities via data collected from mobile sensors, we highlight these challenges and offer valuable perspectives on Edge ML.
    摘要 边缘机器学习(Edge ML)将计算智能从云端系统转移到边缘设备,由于其在降低延迟、增强数据隐私和减少对连接依赖方面的明显优势而受到广泛关注。虽然这些优势十分吸引人,但它们也带来了传统云端方法中不存在的独特挑战。本文深入探讨边缘学习的细节,考察以下因素之间的相互依赖:(i)边缘设备受限的数据存储、(ii)有限的训练计算能力和(iii)学习类别的数量。通过基于我们的MAGNETO系统、利用移动传感器采集的数据学习人类活动的实验,我们揭示了这些挑战,并提供了关于边缘ML的有价值见解。

Handling the inconsistency of systems of $\min\rightarrow$ fuzzy relational equations

  • paper_url: http://arxiv.org/abs/2308.12385
  • repo_url: None
  • paper_authors: Ismaïl Baaj
  • for: 这篇论文研究了$\min-\rightarrow$模糊关系方程组的不一致性问题。
  • methods: 该论文采用解析方法,给出了与$\min-\rightarrow$模糊关系方程组相关的 Chebyshev 距离 $\nabla = \inf_{d \in \mathcal{D}} \Vert \beta - d \Vert$ 的计算公式。
  • results: 该论文证明,无论使用哪种剩余蕴涵算子(Gödel、Goguen 或 Lukasiewicz),$\nabla$都是某个向量不等式的解的下界。此外,在$\min-\rightarrow_{G}$系统中,$\nabla$可能只是下确界(infimum),而在$\min-\rightarrow_{GG}$和$\min-\rightarrow_{L}$系统中,$\nabla$总是最小值。
    Abstract In this article, we study the inconsistency of systems of $\min-\rightarrow$ fuzzy relational equations. We give analytical formulas for computing the Chebyshev distances $\nabla = \inf_{d \in \mathcal{D}} \Vert \beta - d \Vert$ associated to systems of $\min-\rightarrow$ fuzzy relational equations of the form $\Gamma \Box_{\rightarrow}^{\min} x = \beta$, where $\rightarrow$ is a residual implicator among the G\"odel implication $\rightarrow_G$, the Goguen implication $\rightarrow_{GG}$ or Lukasiewicz's implication $\rightarrow_L$ and $\mathcal{D}$ is the set of second members of consistent systems defined with the same matrix $\Gamma$. The main preliminary result that allows us to obtain these formulas is that the Chebyshev distance $\nabla$ is the lower bound of the solutions of a vector inequality, whatever the residual implicator used. Finally, we show that, in the case of the $\min-\rightarrow_{G}$ system, the Chebyshev distance $\nabla$ may be an infimum, while it is always a minimum for $\min-\rightarrow_{GG}$ and $\min-\rightarrow_{L}$ systems.
    摘要 在这篇文章中,我们研究了$\min-\rightarrow$模糊关系方程组的不一致性。我们给出了计算 Chebyshev 距离 $\nabla = \inf_{d \in \mathcal{D}} \Vert \beta - d \Vert$ 的解析公式,它与形如 $\Gamma \Box_{\rightarrow}^{\min} x = \beta$ 的$\min-\rightarrow$模糊关系方程组相关,其中$\rightarrow$是 Gödel 蕴涵$\rightarrow_G$、Goguen 蕴涵$\rightarrow_{GG}$或 Lukasiewicz 蕴涵$\rightarrow_L$中的一种剩余蕴涵算子,而$\mathcal{D}$是由同一矩阵$\Gamma$定义的一致方程组的右端项集合。得到这些公式的主要前提结果是:无论采用哪种剩余蕴涵算子,Chebyshev 距离$\nabla$都是某个向量不等式的解的下界。最后,我们证明在$\min-\rightarrow_{G}$系统中,$\nabla$可能只是下确界,而在$\min-\rightarrow_{GG}$和$\min-\rightarrow_{L}$系统中,$\nabla$总是最小值。
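For reference, the objects mentioned above can be written out explicitly: the system, the Chebyshev distance, and the three residual implicators are restated below. The implicator formulas are the standard textbook definitions and are added here only for convenience.

```latex
% System of min-implication fuzzy relational equations and the associated Chebyshev distance.
\[
  \Gamma \,\Box_{\rightarrow}^{\min}\, x = \beta,
  \qquad
  \nabla = \inf_{d \in \mathcal{D}} \Vert \beta - d \Vert,
\]
% where D is the set of right-hand sides of consistent systems built on the same matrix Gamma.
% Standard residual implicators (Gödel, Goguen, Lukasiewicz):
\[
  a \rightarrow_G b =
    \begin{cases} 1 & \text{if } a \le b \\ b & \text{otherwise,} \end{cases}
  \qquad
  a \rightarrow_{GG} b =
    \begin{cases} 1 & \text{if } a \le b \\ b/a & \text{otherwise,} \end{cases}
  \qquad
  a \rightarrow_L b = \min(1,\, 1 - a + b).
\]
```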

cs.CL - 2023-08-23

The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation Detection

  • paper_url: http://arxiv.org/abs/2308.12215
  • repo_url: https://github.com/ramybaly/News-Media-Reliability
  • paper_authors: Madelyne Xiao, Jonathan Mayer
  • for: 这个论文旨在探讨机器学习在信任和安全问题上的应用,以虚假信息检测作为案例研究。
  • methods: 作者系统梳理了关于自动检测虚假信息的文献,并对该领域270篇最具影响力的论文进行了分析。
  • results: 研究发现现有文献中存在显著缺陷,包括数据和代码可用性差、设计失误、可重复性不足以及泛化能力弱。这些缺陷使得现有的模型在实际应用中效果存疑。
    Abstract We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We systematize literature on automated detection of misinformation across a corpus of 270 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. We find significant shortcomings in the literature that call into question claimed performance and practicality. Detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. Data and code availability is poor. Models do not generalize well to out-of-domain data. Based on these results, we offer recommendations for evaluating machine learning applications to trust and safety problems. Our aim is for future work to avoid the pitfalls that we identify.
    摘要 我们以虚假信息检测为案例,考察了机器学习在信任与安全问题上学术研究与实际应用之间的脱节。我们系统梳理了该领域270篇高被引论文中的自动虚假信息检测方法,并进一步分析了其中的数据与代码可用性、设计失误、可重复性和泛化能力等问题。我们发现文献中存在显著缺陷,使其宣称的性能和实用性受到质疑:检测任务往往与在线服务实际面临的挑战有实质差异;数据集和模型评估常常不能代表真实场景,且评估经常不独立于模型训练;数据和代码可用性差;模型对域外数据的泛化能力也很弱。基于这些结果,我们提出了评估机器学习在信任与安全问题上应用的建议,希望未来工作能避免我们指出的这些陷阱。

Curriculum Learning with Adam: The Devil Is in the Wrong Details

  • paper_url: http://arxiv.org/abs/2308.12202
  • repo_url: None
  • paper_authors: Lucas Weber, Jaap Jumelet, Paul Michel, Elia Bruni, Dieuwke Hupkes
  • for: 这篇论文主要研究了机器学习模型在不同学习阶段上的学习效果,以及如何使机器学习模型更加高效地学习。
  • methods: 作者们复现并扩展了许多现有的课程学习(curriculum learning, CL)方法,包括手工设计的CL方法和自动生成的CL方法,以评估它们在自然语言处理(NLP)领域的效果。
  • results: 作者们发现,当课程学习方法与流行的Adam优化算法结合使用时,它们经常只是学会去适应为该算法选择不当的优化参数。作者们通过多个实验案例来证明这一点,并发现无论使用哪种CL方法,都无法超越仅使用Adam优化器和合适超参数的学习效果。
    Abstract Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of recent curriculum methods, we find that their results are surprisingly brittle when applied to NLP. A deep dive into the (in)effectiveness of the curricula in some scenarios shows us why: when curricula are employed in combination with the popular Adam optimisation algorithm, they oftentimes learn to adapt to suboptimally chosen optimisation parameters for this algorithm. We present a number of different case studies with different common hand-crafted and automated CL approaches to illustrate this phenomenon, and we find that none of them outperforms optimisation with only Adam with well-chosen hyperparameters. As such, our results contribute to understanding why CL methods work, but at the same time urge caution when claiming positive results.
    摘要
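The pitfall the paper identifies can be probed with a controlled toy experiment: compare a curriculum ordering against the original ordering while sweeping the Adam hyperparameters for both, so the curriculum cannot silently compensate for a badly chosen optimizer setting. The model, data, and norm-based difficulty proxy below are toy assumptions for illustration only.

```python
# Toy controlled comparison: train with a curriculum ordering and with the original ordering,
# sweeping an Adam hyperparameter (eps) for both, so the curriculum cannot merely mask a bad
# optimizer choice. Model, data, and the difficulty proxy are illustrative stand-ins.
import torch

def train(examples, lr=1e-3, eps=1e-8, epochs=3):
    x = torch.stack([e[0] for e in examples])
    y = torch.stack([e[1] for e in examples])
    model = torch.nn.Linear(x.shape[1], 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr, eps=eps)
    for _ in range(epochs):
        for xi, yi in zip(x, y):                 # per-example updates keep the ordering visible
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(xi), yi).backward()
            opt.step()
    return torch.nn.functional.mse_loss(model(x), y).item()

data = [(torch.randn(8), torch.randn(1)) for _ in range(200)]
curriculum = sorted(data, key=lambda e: e[0].norm())         # easy-to-hard proxy ordering
for eps in (1e-8, 1e-3):                                     # tune Adam for BOTH orderings
    print(f"eps={eps}", "curriculum:", round(train(curriculum, eps=eps), 3),
          "original:", round(train(data, eps=eps), 3))
```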

Instruction Position Matters in Sequence Generation with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12097
  • repo_url: https://github.com/adaxry/post-instruction
  • paper_authors: Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
  • for: 提高大型自然语言模型(LLM)的条件序列生成能力,包括翻译和摘要等任务。
  • methods: 通过将任务指令(instruction)移到输入句子之后,来增强 LLM 的指令遵循能力。
  • results: 对多种模型规模(1B / 7B / 13B)和不同的序列生成任务(翻译和摘要)进行了实验,在不增加任何数据或标注成本的情况下稳定优于传统设置,并且在零样本情况下显著提高了条件序列生成的性能,例如在 WMT 零样本翻译任务上最高提升 9.7 BLEU。
    Abstract Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization, through instruction fine-tuning. The fine-tuning data is generally sequentially concatenated from a specific task instruction, an input sentence, and the corresponding response. Considering the locality modeled by the self-attention mechanism of LLMs, these models face the risk of instruction forgetting when generating responses for long input sentences. To mitigate this issue, we propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences. Theoretical analysis suggests that our straightforward method can alter the model's learning focus, thereby emphasizing the training of instruction-following capabilities. Concurrently, experimental results demonstrate that our approach consistently outperforms traditional settings across various model scales (1B / 7B / 13B) and different sequence generation tasks (translation and summarization), without any additional data or annotation costs. Notably, our method significantly improves the zero-shot performance on conditional sequence generation, e.g., up to 9.7 BLEU points on WMT zero-shot translation tasks.
    摘要
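The change studied above is purely a matter of how fine-tuning examples are formatted. A minimal sketch of the two layouts (instruction before vs. after the input) is shown below; the exact templates used in the paper may differ, so the field order and wording here are assumptions.

```python
# Minimal sketch of the formatting change: the task instruction is placed after the input
# ("post-instruction") instead of before it. Wording is illustrative, not the paper's template.
def pre_instruction(instruction, source):
    return f"{instruction}\n{source}\n"

def post_instruction(instruction, source):
    return f"{source}\n{instruction}\n"

src = "Der Bericht wurde gestern veröffentlicht."
ins = "Translate the following sentence from German to English."
print(pre_instruction(ins, src))
print(post_instruction(ins, src))
```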

Hybrid Retrieval and Multi-stage Text Ranking Solution at TREC 2022 Deep Learning Track

  • paper_url: http://arxiv.org/abs/2308.12039
  • repo_url: None
  • paper_authors: Guangwei Xu, Yangzhao Zhang, Longhui Zhang, Dingkun Long, Pengjun Xie, Ruijie Guo
  • for: 本文是提交到TREC 2022 Deep Learning Track的系统描述。
  • methods: 本文采用混合文本检索和多阶段文本排序方法。检索阶段结合了传统稀疏检索和神经稠密检索两种结构;排序阶段除了基于大型预训练语言模型的全交互式排序模型之外,还提出了轻量级子排序模块以进一步提高文本排序性能。
  • results: 评估结果表明我们提出的方法有效。我们的模型在测试集上分别取得了段落排序第1名和文档排序第4名。
    Abstract Large-scale text retrieval technology has been widely used in various practical business scenarios. This paper presents our systems for the TREC 2022 Deep Learning Track. We explain the hybrid text retrieval and multi-stage text ranking method adopted in our solution. The retrieval stage combined the two structures of traditional sparse retrieval and neural dense retrieval. In the ranking stage, in addition to the full interaction-based ranking model built on large pre-trained language model, we also proposes a lightweight sub-ranking module to further enhance the final text ranking performance. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 1st and 4th rank on the test set of passage ranking and document ranking respectively.
    摘要 大规模文本检索技术在各种实际业务场景中广泛应用。本文介绍我们在TREC 2022深度学习轨道上的系统。我们解释了我们采用的混合文本检索和多stage文本排名方法。检索阶段组合了传统稀疏检索和神经 dense检索两种结构。排名阶段除了基于大型预训练语言模型构建的全面互动型排名模型外,我们还提出了轻量级副排名模块,以进一步提高文本排名性能。评估结果表明我们提出的方法效果。我们的模型在测试集上取得了文章排名和文档排名的1st和4th名。
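A typical hybrid retrieval stage of this kind fuses sparse and dense scores on a common scale before the ranking stage. The sketch below uses BM25 via rank_bm25 for the sparse side and assumes dense query-document similarities are supplied by a separate neural encoder; the min-max normalization and fusion weight are common choices, not necessarily the authors' exact setup.

```python
# Hedged sketch of sparse+dense score fusion for first-stage retrieval. Dense scores are
# assumed to come from a neural encoder computed elsewhere; alpha balances the two signals.
import numpy as np
from rank_bm25 import BM25Okapi

def minmax(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def hybrid_retrieve(query, docs, dense_scores, alpha=0.5, k=10):
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse_scores = bm25.get_scores(query.lower().split())
    fused = alpha * minmax(sparse_scores) + (1 - alpha) * minmax(dense_scores)
    top = np.argsort(-fused)[:k]
    return [(docs[i], float(fused[i])) for i in top]

docs = ["deep learning for text ranking", "classical BM25 retrieval", "cooking pasta at home"]
dense = np.array([0.82, 0.55, 0.05])     # stand-in neural similarities for the same query
print(hybrid_retrieve("neural text ranking", docs, dense, k=2))
```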

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

  • paper_url: http://arxiv.org/abs/2308.12038
  • repo_url: https://github.com/openbmb/viscpm
  • paper_authors: Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun
  • for: 本研究旨在提出一种有效的训练方法,以便在低资源语言中训练大型多modal模型。
  • methods: 本研究基于一个强大的多语言大型语言模型,将仅在英语图像-文本数据上预训练的多模态模型以零样本方式迁移到其他语言,并在中文上取得了最先进的(开源)性能。
  • results: 研究表明,仅基于英语图像-文本数据预训练的模型可以零样本泛化到其他语言的图像到文本和文本到图像生成任务,表现优秀,甚至超过在母语图像-文本数据上训练的模型,并在中文场景中达到了开源最佳性能。
    Abstract Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in low-resource languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a zero-shot manner for both image-to-text and text-to-image generation, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.
    摘要 近期多模态学习在图像到文本和文本到图像生成方面取得了显著进展,但这些成功通常局限于英语,其他语言远远落后。由于非英语多模态数据资源稀缺(即缺乏大规模、高质量的图像-文本数据),在其他语言中构建有竞争力的模型非常困难。在这项工作中,我们提出了 MPM,一种在低资源语言中训练大型多模态模型的有效训练范式。MPM表明,多语言语言模型可以作为中转,实现跨语言的零样本多模态学习。具体来说,基于一个强大的多语言大语言模型,仅在英语图像-文本数据上预训练的多模态模型可以零样本泛化到其他语言的图像到文本和文本到图像生成任务,甚至超过在母语图像-文本数据上训练的模型。以中文作为MPM的实践,我们构建了图像到文本和文本到图像生成的大型多模态模型 VisCPM,其在中文上达到了开源模型中的最先进性能。为便于未来研究,我们将代码和模型权重开源,请参考 https://github.com/OpenBMB/VisCPM.git。

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2308.12032
  • repo_url: None
  • paper_authors: Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao
  • for: 提高 Large Language Model 的优化效率和资源利用率
  • methods: 自动从开源数据集中选择 “cherry” 样本,并提出 Instruction-Following Difficulty(IFD)指标,用于衡量模型的预期回复与其自主生成能力之间的差异
  • results: 在 Alpaca 和 WizardLM 等著名数据集上的实验验证显示,只使用 10% 的传统数据输入,我们的策略即可达到更好的结果
    Abstract In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements.
    摘要
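One plausible reading of the Instruction-Following Difficulty (IFD) score, consistent with the summary above, is a ratio between the model's loss on the response with and without the instruction as context: if the instruction barely lowers the loss, the sample is hard to follow and thus a useful cherry candidate. The sketch below uses GPT-2 as a stand-in scorer and should not be taken as the paper's exact definition.

```python
# Hedged IFD-style score: ratio of the response loss conditioned on the instruction to the
# unconditioned response loss. GPT-2 is a stand-in; the paper's exact formulation may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_loss(context, response):
    """Average token loss on `response` given `context` as prefix (context tokens masked)."""
    ctx = tok(context, return_tensors="pt").input_ids if context else torch.empty(1, 0, dtype=torch.long)
    rsp = tok(response, return_tensors="pt").input_ids
    ids = torch.cat([ctx, rsp], dim=1)
    labels = ids.clone()
    labels[:, : ctx.shape[1]] = -100          # score only the response tokens
    with torch.no_grad():
        return lm(input_ids=ids, labels=labels).loss.item()

def ifd(instruction, response):
    return response_loss(instruction, response) / response_loss("", response)

print(ifd("Translate to French: Good morning.", "Bonjour, tout le monde."))
```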

Knowledge-injected Prompt Learning for Chinese Biomedical Entity Normalization

  • paper_url: http://arxiv.org/abs/2308.12025
  • repo_url: None
  • paper_authors: Songhua Yang, Chenghao Zhang, Hongfei Xu, Yuxiang Jia
  • for: 这个论文的目的是提高生物医学数据的一致性,通过将原始的医学实体规范化为标准实体,以便更好地支撑下游医学应用。
  • methods: 该论文提出了一种新的知识注入提示学习(PL-Knowledge)方法,具体来说是一个五阶段流程:候选实体匹配、知识提取、知识编码、知识注入和预测输出。该方法通过有效地编码医学实体中包含的知识项并将其整合进定制的知识注入模板,提升模型捕捉医学实体之间潜在关系的能力,从而更好地匹配标准实体。
  • results: 该论文在一个 benchmark 数据集上对 few-shot 和全量数据两种场景进行了广泛评估,并与现有基线比较。结果表明,该方法在 few-shot 场景中平均提高了12.96%的准确率,在全量数据场景中平均提高了0.94%,证明了其在 BEN 任务中的优越性。
    Abstract The Biomedical Entity Normalization (BEN) task aims to align raw, unstructured medical entities to standard entities, thus promoting data coherence and facilitating better downstream medical applications. Recently, prompt learning methods have shown promising results in this task. However, existing research falls short in tackling the more complex Chinese BEN task, especially in the few-shot scenario with limited medical data, and the vast potential of the external medical knowledge base has yet to be fully harnessed. To address these challenges, we propose a novel Knowledge-injected Prompt Learning (PL-Knowledge) method. Specifically, our approach consists of five stages: candidate entity matching, knowledge extraction, knowledge encoding, knowledge injection, and prediction output. By effectively encoding the knowledge items contained in medical entities and incorporating them into our tailor-made knowledge-injected templates, the additional knowledge enhances the model's ability to capture latent relationships between medical entities, thus achieving a better match with the standard entities. We extensively evaluate our model on a benchmark dataset in both few-shot and full-scale scenarios. Our method outperforms existing baselines, with an average accuracy boost of 12.96\% in few-shot and 0.94\% in full-data cases, showcasing its excellence in the BEN task.
    摘要 生物医学实体规范化(BEN)任务的目标是将原始、未结构化的医学实体与标准实体进行对应,从而提高数据一致性并促进下游医学应用。目前,提示学习方法在这个任务中已经显示出了可喜的结果。然而,现有的研究仍然缺乏对更复杂的中文BEN任务的探讨,特别是在医学数据有限的少样本情况下,而外部医学知识库的巨大潜力也尚未得到充分利用。为了解决这些挑战,我们提出了一种新的知识注入提示学习(PL-Knowledge)方法。具体来说,我们的方法包括以下五个阶段:候选实体匹配、知识提取、知识编码、知识注入和预测输出。通过有效地编码医学实体中包含的知识项并将其注入到我们自定义的知识注入模板中,额外的知识可以使模型更好地捕捉医学实体之间的潜在关系,从而实现与标准实体的更好匹配。我们在一个标准 benchmark 数据集上对少样本和全量数据两种情况进行了广泛的评估。我们的方法优于现有基线,在少样本情况下平均提高了12.96%,在全量数据情况下平均提高了0.94%,这显示了其在BEN任务中的优秀表现。

Reranking Passages with Coarse-to-Fine Neural Retriever using List-Context Information

  • paper_url: http://arxiv.org/abs/2308.12022
  • repo_url: None
  • paper_authors: Hongyin Zhu
  • for: 提高大规模文档中答案选取的精度
  • methods: 利用列Context注意力机制增强文段表示,并将列Context模型分解成两个子过程,以提高效率
  • results: 实验表明提出的方法有效地提高了答案选取的精度
    Abstract Passage reranking is a crucial task in many applications, particularly when dealing with large-scale documents. Traditional neural architectures are limited in retrieving the best passage for a question because they usually match the question to each passage separately, seldom considering contextual information in other passages that can provide comparison and reference information. This paper presents a list-context attention mechanism to augment the passage representation by incorporating the list-context information from other candidates. The proposed coarse-to-fine (C2F) neural retriever addresses the out-of-memory limitation of the passage attention mechanism by dividing the list-context modeling process into two sub-processes, allowing for efficient encoding of context information from a large number of candidate answers. This method can be generally used to encode context information from any number of candidate answers in one pass. Different from most multi-stage information retrieval architectures, this model integrates the coarse and fine rankers into the joint optimization process, allowing for feedback between the two layers to update the model simultaneously. Experiments demonstrate the effectiveness of the proposed approach.
    摘要

Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature

  • paper_url: http://arxiv.org/abs/2308.12008
  • repo_url: None
  • paper_authors: Frederick Riemenschneider, Anette Frank
  • for: 本研究旨在开发一种适用于古希腊和拉丁文学研究的多语言BERT模型,以便自动发现古希腊和拉丁文本之间的文本相似性。
  • methods: 本研究使用了一种多语言RoBERTa模型,并通过自动将英文文本翻译成古希腊文本来生成新的训练数据。
  • results: 研究表明,SPhilBERTa模型在跨语言语义理解和找到古希腊和拉丁文本中相同的句子方面表现出色,并可以自动检测古希腊和拉丁文本之间的文本相似性。
    Abstract Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English texts into Ancient Greek. Further, we present a case study, demonstrating SPhilBERTa's capability to facilitate automated detection of intertextual parallels. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
    摘要 文本间的互文指涉在古典语文学中占据着重要地位,拉丁作家常常指涉古希腊文本。迄今为止,这类互文指涉的自动识别仅限于单语方法,只在拉丁语或希腊语文本内部寻找平行段落。在本研究中,我们引入SPhilBERTa,一种面向古典语文学的三语 Sentence-RoBERTa 模型,擅长跨语言语义理解,并能在古希腊语、拉丁语和英语之间识别内容相同的句子。我们通过自动将英文文本翻译成古希腊语来生成新的训练数据。此外,我们还提供了一个案例研究,证明SPhilBERTa能够辅助自动检测互文平行。我们的模型和资源可以在https://github.com/Heidelberg-NLP/ancient-language-models中找到。
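The detection step itself amounts to nearest-neighbour search in a shared sentence-embedding space. The sketch below illustrates that procedure with a generic multilingual encoder from sentence-transformers standing in for SPhilBERTa (whose own weights are distributed via the linked repository); the example sentences are illustrative.

```python
# Hedged sketch of cross-lingual parallel detection: embed sentences from two corpora in a
# shared space and rank candidate pairs by cosine similarity. The checkpoint below is a
# generic multilingual stand-in for SPhilBERTa.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

latin = ["Graecia capta ferum victorem cepit et artes intulit agresti Latio."]
candidates = [
    "Captive Greece took captive her savage conqueror and brought the arts to rustic Latium.",
    "The consul addressed the senate about the grain supply.",
]
scores = util.cos_sim(model.encode(latin), model.encode(candidates))[0]
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```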

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

  • paper_url: http://arxiv.org/abs/2308.11971
  • repo_url: None
  • paper_authors: Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
  • for: 本研究旨在开发一种可扩展的视觉语言模型,以便从多Modal的数据中学习。
  • methods: 本研究使用了一种名为EVE的高效视觉语言基础模型,该模型使用一个共享的Transformer网络,将视觉和语言编码在一起。具体来说,EVE使用模态感知的稀疏混合专家(MoE)模块,通过选择性切换到不同的专家来捕捉各模态特有的信息。
  • results: 本研究表明,EVE的预训练更加高效,并且在多种视觉语言下游任务中表现出色,包括视觉问答、视觉推理和图文检索等。
    Abstract Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
    摘要 构建可扩展的视觉语言模型以从多样的多模态数据中学习,仍然是一个开放的挑战。在这篇论文中,我们介绍了一个高效的视觉语言基础模型EVE,它是一个统一的多模态Transformer,仅通过一个统一的预训练任务进行预训练。具体来说,EVE在一个共享的Transformer网络中同时编码视觉和语言,该网络集成了模态感知的稀疏混合专家(MoE)模块,通过选择性切换到不同的专家来捕捉各模态特有的信息。为了统一视觉和语言的预训练任务,EVE在图像-文本对上进行掩码信号建模,即在可见信号的条件下重建被掩码的信号(图像像素和文本词元)。这种简单而有效的预训练目标使训练速度比使用图文对比(Image-Text Contrastive)和图文匹配(Image-Text Matching)损失预训练的模型快3.5倍。得益于统一架构与统一预训练任务的结合,EVE易于扩展,能够以更少的资源和更快的训练速度获得更好的下游性能。尽管其简单,EVE在视觉问答、视觉推理和图文检索等多种视觉语言下游任务上达到了最先进的性能。
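The modality-aware sparse MoE idea can be sketched as a feed-forward block that dispatches each token to an expert chosen by its modality. The routing rule, expert count, and layer sizes below are illustrative assumptions, not EVE's actual configuration.

```python
# Hedged sketch of a modality-aware sparse MoE block: vision and text tokens share one
# Transformer, but each token is dispatched to a modality-specific feed-forward expert.
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)),
            "text": nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)),
        })

    def forward(self, x, modality_ids):
        """x: (batch, seq, d_model); modality_ids: (batch, seq), 0 = vision token, 1 = text token."""
        out = torch.zeros_like(x)
        for name, mod_id in (("vision", 0), ("text", 1)):
            mask = modality_ids == mod_id
            if mask.any():
                out[mask] = self.experts[name](x[mask])
        return out

moe = ModalityAwareMoE()
tokens = torch.randn(2, 6, 256)
modality_ids = torch.tensor([[0, 0, 0, 1, 1, 1], [0, 1, 1, 1, 1, 1]])
print(moe(tokens, modality_ids).shape)    # torch.Size([2, 6, 256])
```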

Audio Generation with Multiple Conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11940
  • repo_url: None
  • paper_authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
  • for: 提高现有预训练文本到音频模型的可控性,使其能够更好地控制生成音频的时间顺序、音高和能量。
  • methods: 提出一种新的模型,将额外的内容条件(时间戳)和风格条件(音高轮廓和能量轮廓)作为文本条件的补充,以提高音频生成的可控性。使用可训练的控制条件编码器和Fusion-Net将额外条件编码并融合到文本模型中,同时保持预训练模型的权重冻结。
  • results: 实验结果表明,我们的模型成功实现了细致的控制,以达到可控的音频生成。
    Abstract Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
    摘要 基于文本的音频生成模型存在局限性,因为文本无法涵盖音频中的全部信息,仅依赖文本会导致可控性受限。为解决这个问题,我们提出了一种新的模型,通过添加内容条件(时间戳)和风格条件(音高轮廓和能量轮廓)作为文本之外的补充条件,来增强现有预训练文本到音频模型的可控性。这种方法可以对生成音频的时间顺序、音高和能量实现精细控制。为保持生成的多样性,我们使用由大语言模型增强的可训练控制条件编码器和可训练的Fusion-Net来编码并融合额外条件,同时保持预训练文本到音频模型的权重冻结。由于缺乏合适的数据集和评估指标,我们将现有数据集整合成一个包含音频及其对应条件的新数据集,并使用一系列评估指标来评估可控性。实验结果表明,我们的模型成功实现了精细控制,完成了可控的音频生成。音频样本和我们的数据集公开于 https://conditionaudiogen.github.io/conditionaudiogen/ 。

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

  • paper_url: http://arxiv.org/abs/2308.11923
  • repo_url: None
  • paper_authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino
  • for: 本研究旨在描述一对相似但略有差异的音频片段之间的语义差异,而不仅是分别描述各自的内容。
  • methods: 本研究提出了音频差异描述(ADC)任务,使用以交叉注意力为核心的 transformer 编码器对比两个音频片段以抽取差异,并使用相似性-差异解耦在隐空间中强调差异的提取。
  • results: 实验表明,所提出的方法可以有效地解决 ADC 任务,并使 transformer 编码器中的注意力权重更加集中于差异提取。
    Abstract We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space. To evaluate the proposed methods, we built an AudioDiffCaps dataset consisting of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. The experiment with the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and improve the attention weights to extract the difference by visualizing them in the transformer encoder.
    摘要 我们提出了音频差异描述(ADC)作为音频描述的新扩展任务,用于描述一对相似但略有差异的音频片段之间的语义差异。ADC解决了传统音频描述在相似音频片段上经常生成相似描述、无法刻画内容差异的问题。我们还提出了一种以交叉注意力为核心的transformer编码器,通过对比一对音频片段来抽取差异,并提出了相似性-差异解耦,以在隐空间中强调差异。为评估所提方法,我们构建了AudioDiffCaps数据集,其中包含相似但略有差异的音频片段对,以及人工标注的差异描述。在AudioDiffCaps数据集上的实验结果表明,我们的方法能够有效解决ADC任务,并且通过可视化transformer编码器中的注意力权重可以看到其更好地聚焦于差异提取。

Diagnosing Infeasible Optimization Problems Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12923
  • repo_url: None
  • paper_authors: Hao Chen, Gonzalo E. Constante-Flores, Can Li
  • for: 这篇论文旨在帮助解释不可行的优化模型,通过自然语言对话系统让使用者理解为何没有任何决策能满足所有约束。
  • methods: 这篇论文将GPT-4与优化求解器结合,识别优化模型中的不可行性来源,并给出使模型变为可行的建议。
  • results: 实验表明,使用OptiChat可以帮助专家和非专家用户更好地理解优化模型,快速定位造成不可行的约束来源。
    Abstract Decision-making problems can be represented as mathematical optimization models, finding wide applications in fields such as economics, engineering and manufacturing, transportation, and health care. Optimization models are mathematical abstractions of the problem of making the best decision while satisfying a set of requirements or constraints. One of the primary barriers to deploying these models in practice is the challenge of helping practitioners understand and interpret such models, particularly when they are infeasible, meaning no decision satisfies all the constraints. Existing methods for diagnosing infeasible optimization models often rely on expert systems, necessitating significant background knowledge in optimization. In this paper, we introduce OptiChat, a first-of-its-kind natural language-based system equipped with a chatbot GUI for engaging in interactive conversations about infeasible optimization models. OptiChat can provide natural language descriptions of the optimization model itself, identify potential sources of infeasibility, and offer suggestions to make the model feasible. The implementation of OptiChat is built on GPT-4, which interfaces with an optimization solver to identify the minimal subset of constraints that render the entire optimization problem infeasible, also known as the Irreducible Infeasible Subset (IIS). We utilize few-shot learning, expert chain-of-thought, key-retrieve, and sentiment prompts to enhance OptiChat's reliability. Our experiments demonstrate that OptiChat assists both expert and non-expert users in improving their understanding of the optimization models, enabling them to quickly identify the sources of infeasibility.
    摘要 决策问题可以表示为数学优化模型,广泛应用于经济、工程与制造、交通和医疗等领域。优化模型是在满足一组要求或约束的前提下做出最佳决策这一问题的数学抽象。在实践中部署这些模型的主要障碍之一,是帮助从业者理解和解读这些模型,特别是当模型不可行(即没有任何决策能满足所有约束)时。现有的不可行优化模型诊断方法通常依赖专家系统,需要较深厚的优化背景知识。在这篇论文中,我们介绍了OptiChat,一个首创的基于自然语言、带有聊天界面的系统,用于就不可行的优化模型进行交互式对话。OptiChat可以用自然语言描述优化模型本身、识别可能导致不可行的来源,并提供使模型变为可行的建议。OptiChat基于GPT-4实现,它与优化求解器交互,以确定使整个优化问题不可行的最小约束子集,即不可约不可行子集(Irreducible Infeasible Subset, IIS)。我们使用少样本学习、专家思维链、关键信息检索和情感提示来提高OptiChat的可靠性。实验表明,OptiChat能够帮助专家和非专家用户更好地理解优化模型,快速定位不可行性的来源。
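On the solver side, the Irreducible Infeasible Subset mentioned above can be obtained directly from an off-the-shelf solver. The snippet below uses Gurobi's computeIIS on a deliberately infeasible toy model and collects the offending constraint names, which is the kind of structured evidence an LLM such as GPT-4 could then explain in natural language; Gurobi is only an example backend here, not necessarily OptiChat's.

```python
# Hedged solver-side sketch: build an infeasible toy model, compute its Irreducible
# Infeasible Subset (IIS), and collect the constraints responsible so they can be passed
# to an LLM prompt for explanation. Gurobi is used as an example solver only.
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("toy")
m.Params.OutputFlag = 0
m.Params.DualReductions = 0          # report INFEASIBLE rather than INF_OR_UNBD
x = m.addVar(name="x")
y = m.addVar(name="y")
m.addConstr(x + y >= 10, name="demand_met")
m.addConstr(x + y <= 4, name="capacity")
m.setObjective(x + y, GRB.MINIMIZE)
m.optimize()

if m.status == GRB.INFEASIBLE:
    m.computeIIS()
    offending = [c.ConstrName for c in m.getConstrs() if c.IISConstr]
    print("Irreducible infeasible subset:", offending)   # ['demand_met', 'capacity']
```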

Towards an On-device Agent for Text Rewriting

  • paper_url: http://arxiv.org/abs/2308.11807
  • repo_url: None
  • paper_authors: Yun Zhu, Yinxiao Liu, Felix Stahlberg, Shankar Kumar, Yu-hui Chen, Liangchen Luo, Lei Shu, Renjie Liu, Jindong Chen, Lei Meng
  • for: 这篇论文是为了开发一个轻量级的语言模型(LLM),用于在设备上进行文本重写。
  • methods: 作者提出了一种新的指令调优方法,无需人工标注即可生成高质量的训练数据。此外,他们还提出了一种启发式强化学习框架,可以在不需要偏好数据的情况下大幅提高性能。
  • results: 实验表明,作者的设备端模型在文本重写任务上超过了现有最先进的LLM,同时模型规模显著更小。此外,他们还提出了一种有效的级联方法,将设备端重写代理与服务器端模型结合,进一步提升性能。
    Abstract Large Language Models (LLMs) have demonstrated impressive capabilities for text rewriting. Nonetheless, the large sizes of these models make them impractical for on-device inference, which would otherwise allow for enhanced privacy and economical inference. Creating a smaller yet potent language model for text rewriting presents a formidable challenge because it requires balancing the need for a small size with the need to retain the emergent capabilities of the LLM, that requires costly data collection. To address the above challenge, we introduce a new instruction tuning approach for building a mobile-centric text rewriting model. Our strategies enable the generation of high quality training data without any human labeling. In addition, we propose a heuristic reinforcement learning framework which substantially enhances performance without requiring preference data. To further bridge the performance gap with the larger server-side model, we propose an effective approach that combines the mobile rewrite agent with the server model using a cascade. To tailor the text rewriting tasks to mobile scenarios, we introduce MessageRewriteEval, a benchmark that focuses on text rewriting for messages through natural language instructions. Through empirical experiments, we demonstrate that our on-device model surpasses the current state-of-the-art LLMs in text rewriting while maintaining a significantly reduced model size. Notably, we show that our proposed cascading approach improves model performance.
    摘要 大型语言模型(LLM)在文本重写方面已经展示了出色的能力。然而,这些模型的规模过大,使其难以在设备端进行推理,而设备端推理本可带来更好的隐私保护和更经济的推理成本。构建一个更小但仍然强大的文本重写语言模型是一个艰巨的挑战,因为需要在小规模与保留LLM涌现能力之间取得平衡,而后者通常需要昂贵的数据收集。为了解决上述挑战,我们提出了一种新的指令调优方法,用于构建面向移动端的文本重写模型。我们的策略可以在不依赖人工标注的情况下生成高质量的训练数据。此外,我们提出了一种启发式强化学习框架,可以在不需要偏好数据的情况下大幅提升性能。为了进一步缩小与更大的服务器端模型之间的性能差距,我们提出了一种有效的级联方法,将移动端重写代理与服务器模型结合使用。为了使文本重写任务贴合移动场景,我们引入了MessageRewriteEval,一个专注于通过自然语言指令重写消息文本的基准。实验表明,我们的设备端模型在文本重写上超越了当前最先进的LLM,同时模型规模显著更小。此外,我们还证明了所提出的级联方法能够进一步提高模型性能。
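The cascading idea can be reduced to a simple serving rule: answer with the on-device rewriter when its own confidence is high, and escalate to the server model otherwise. The confidence proxy, threshold, and toy models below are assumptions used purely to illustrate the control flow, not the paper's actual cascade.

```python
# Hedged sketch of an on-device/server cascade for text rewriting. The on-device model is
# assumed to return (rewrite, confidence); the threshold and toy models are illustrative.
def cascade_rewrite(text, on_device_model, server_model, threshold=-1.0):
    draft, avg_logprob = on_device_model(text)
    if avg_logprob >= threshold:
        return draft                 # cheap, private on-device result is good enough
    return server_model(text)        # escalate only the hard inputs

on_device = lambda t: (t.replace("cant", "can't"), -0.4)   # stand-in rewriter + confidence
server = lambda t: t.capitalize()                          # stand-in server model
print(cascade_rewrite("i cant attend the meeting today", on_device, server))
```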

Few-shot Anomaly Detection in Text with Deviation Learning

  • paper_url: http://arxiv.org/abs/2308.11780
  • repo_url: None
  • paper_authors: Anindya Sundar Das, Aravind Ajay, Sriparna Saha, Monowar Bhuyan
  • for: 本文旨在提出一种基于深度少样本学习的方法,利用有限的异常示例直接学习异常分数,并在整个过程中使用偏差学习(deviation learning)来刻画异常行为。
  • methods: 本文使用的方法包括深度少样本学习、偏差学习、多头自注意力层以及多实例学习。
  • results: 实验表明,本文提出的方法可以在多个标准 benchmark 数据集上达到新的最先进性能水平。
    Abstract Most current methods for detecting anomalies in text concentrate on constructing models solely relying on unlabeled data. These models operate on the presumption that no labeled anomalous examples are available, which prevents them from utilizing prior knowledge of anomalies that are typically present in small numbers in many real-world applications. Furthermore, these models prioritize learning feature embeddings rather than optimizing anomaly scores directly, which could lead to suboptimal anomaly scoring and inefficient use of data during the learning process. In this paper, we introduce FATE, a deep few-shot learning-based framework that leverages limited anomaly examples and learns anomaly scores explicitly in an end-to-end method using deviation learning. In this approach, the anomaly scores of normal examples are adjusted to closely resemble reference scores obtained from a prior distribution. Conversely, anomaly samples are forced to have anomalous scores that considerably deviate from the reference score in the upper tail of the prior. Additionally, our model is optimized to learn the distinct behavior of anomalies by utilizing a multi-head self-attention layer and multiple instance learning approaches. Comprehensive experiments on several benchmark datasets demonstrate that our proposed approach attains a new level of state-of-the-art performance.
    摘要 目前大多数文本异常检测方法只依赖未标注数据来构建模型。这些模型假设没有任何已标注的异常例子可用,因而无法利用许多实际应用中通常以少量形式存在的异常先验知识。此外,这些模型专注于学习特征嵌入而不是直接优化异常分数,这可能导致异常评分次优,并使数据在学习过程中未被高效利用。在这篇论文中,我们介绍了FATE,一个基于深度少样本学习的框架,它利用有限的异常例子,以端到端的偏差学习方式直接学习异常分数。在这种方法中,正常示例的异常分数被调整,使其接近从先验分布中获得的参考分数;相反,异常示例的异常分数则被迫在先验分布的上尾部与参考分数产生大幅偏差。此外,我们的模型还利用多头自注意力层和多实例学习方法来学习异常的独特行为。我们在多个benchmark数据集上进行了充分的实验,结果显示我们提出的方法达到了新的最先进性能。
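The deviation-learning objective sketched above can be written as a small loss function: normal texts are pulled toward reference scores sampled from a prior, while labelled anomalies are pushed beyond a margin in the upper tail. The Gaussian prior and margin value below are common choices and assumptions, not necessarily FATE's exact hyperparameters.

```python
# Hedged sketch of a deviation-learning loss: normal examples are pulled toward reference
# scores drawn from a Gaussian prior, anomalies are pushed beyond a margin in the upper tail.
import torch

def deviation_loss(scores, labels, margin=5.0, n_ref=5000):
    """scores: (batch,) predicted anomaly scores; labels: (batch,), 1 = anomaly, 0 = normal."""
    ref = torch.randn(n_ref)                           # reference scores from the prior
    dev = (scores - ref.mean()) / (ref.std() + 1e-8)   # standardized deviation from the prior
    normal_term = (1 - labels) * dev.abs()             # keep normals near the reference scores
    anomaly_term = labels * torch.clamp(margin - dev, min=0.0)   # push anomalies above the margin
    return (normal_term + anomaly_term).mean()

scores = torch.tensor([0.1, 0.3, 4.2, 6.0])
labels = torch.tensor([0.0, 0.0, 1.0, 1.0])
print(deviation_loss(scores, labels))
```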

  • paper_url: http://arxiv.org/abs/2308.11773
  • repo_url: None
  • paper_authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium
  • for: 这个研究旨在大规模验证言语中的语言使用与抑郁之间的关系。
  • methods: 这个研究使用了自然语言处理技术,特别是BERTopic模型,对3919个手机采集的语音记录进行分析,并从中提取了29个话题。
  • results: 研究发现,患有抑郁的人更容易提到“没有期望”、“睡眠”、“心理治疗”、“头发”、“学习”和“课程”等话题,这些话题可能是抑郁的指标。此外,研究还发现了语言使用和行为特征之间的相关性,以及语言使用的变化和抑郁程度之间的相关性。
    Abstract Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. So, natural language processing has been employed on social media to predict depression, but limitations remain-lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.
    摘要 研究表明语言使用与抑郁有相关性,但大规模验证还需要进行。传统方法如临床研究过于昂贵。因此,人工智能技术在社交媒体上进行语言预测,但存在限制:无效验证标签、偏向用户样本和无Context。我们的研究在265名参与者的3919则语音记录中发现了29个话题,使用Whisper工具和BERTopic模型。6个话题的中值PHQ-8大于或等于10被视为抑郁风险话题:无期望、睡眠、心理治疗、剪发、学习和课程。为了详细描述话题的出现和与抑郁相关性,我们比较了语音和行为特征。我们还 investigate了话题变化和抑郁严重度的时间变化的相关性,表明重要监测语言使用的长期变化。此外,我们在相似的小数据集上测试了BERTopic模型,获得了一些一致的结果。总之,我们的发现表明特定的语音话题可能指示抑郁严重度。我们提供的数据驱动的工作流程为数字健康研究提供了实用的方法。
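The data-driven workflow above (transcribe, then topic-model the transcripts) can be reproduced with off-the-shelf tools; a hedged sketch follows. The directory of recordings and model sizes are placeholders, and linking the discovered topics to PHQ-8 scores, as in the study, would be a separate analysis step.

```python
# Hedged sketch of the transcription + topic-modelling pipeline: Whisper for ASR, BERTopic
# for topics. Paths and model sizes are placeholders; enough recordings are assumed to exist.
import glob
import whisper
from bertopic import BERTopic

asr = whisper.load_model("base")
recordings = sorted(glob.glob("recordings/*.wav"))          # hypothetical data directory
transcripts = [asr.transcribe(path)["text"] for path in recordings]

topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(transcripts)
print(topic_model.get_topic_info())                         # one row per discovered topic
```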

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

  • paper_url: http://arxiv.org/abs/2308.11606
  • repo_url: https://github.com/google/storybench
  • paper_authors: Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender
  • for: 这个论文的目的是提出一个新的多任务Benchmark,用于评估未来的文本到视频模型。
  • methods: 这个论文使用了三个视频生成任务,包括行动执行、续写故事和故事生成。它还使用了人类标注来评估模型的性能。
  • results: 研究人员利用这些任务和人类标注,评估了小而强的文本到视频基线,并展示了在由现有视频字幕自动生成的故事式数据上训练的好处。此外,他们还制定了视频故事人工评估的指南,并指出视频生成需要更好的自动评估指标。
    Abstract Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need of better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area.
    摘要 生成视频故事从文本提示是一个复杂的任务。除了具有高质量的视觉外,视频还需要在文本提示的时间序列中准确遵循,并在帧中保持一致。为了填补这个空白,我们收集了大量人类标注数据,并引入了StoryBench:一个新的、挑战性的多任务 bench mark,用于可靠地评估未来的文本到视频模型。我们的benchmark包括三个视频生成任务:行动执行、故事续写和故事生成。我们评估了一些小 yet 强大的文本到视频基线,并显示了使用 Algorithmically 生成的故事数据的好处。最后,我们确立了人类评估视频故事的指南,并重申了自动度量的改进。StoryBench 的目标是鼓励未来的研究努力在这一新领域。

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

  • paper_url: http://arxiv.org/abs/2308.11596
  • repo_url: https://github.com/facebookresearch/seamless_communication
  • paper_authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang
  • for: 该研究的目的是创建一个能够在任意两种语言之间进行语音互译的工具,即“巴别鱼”(Babel Fish)。
  • methods: 该研究使用100万小时的公开语音数据,通过 w2v-BERT 2.0 学习自监督语音表示;随后构建了自动对齐的多模态语音翻译语料库,并将其经过过滤后与人工标注和伪标注数据合并。
  • results: 该研究实现了一个统一的多语言模型,支持多达100种语言的语音到语音翻译、语音到文本翻译、文本到语音翻译、文本到文本翻译和自动语音识别。相比之前的最佳系统,该模型在FLEURS直接语音到文本翻译上提升了20%的BLEU,在语音到文本翻译的译入英语方向上提升1.3 BLEU,在语音到语音翻译上提升2.6 ASR-BLEU。此外,该模型在背景噪声和说话人变化下也表现得更为稳健。
    Abstract What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
    摘要

Using ChatGPT as a CAT tool in Easy Language translation

  • paper_url: http://arxiv.org/abs/2308.11563
  • repo_url: https://github.com/katjakaterina/chatgpt4easylang
  • paper_authors: Silvana Deilen, Sergio Hernández Garrido, Ekaterina Lapshinova-Koltunski, Christiane Maaß
  • for: investigate the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language
  • methods: use ChatGPT to translate selected texts from websites of German public authorities using two strategies, i.e. linguistic and holistic
  • results: the generated texts are easier than the standard texts, but still do not fully meet the established Easy Language standards, and the content is not always rendered correctly.
    Abstract This study sets out to investigate the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language, a simplified, controlled language variety that is adapted to the needs of people with reading impairments. We use ChatGPT to translate selected texts from websites of German public authorities using two strategies, i.e. linguistic and holistic. We analyse the quality of the generated texts based on different criteria, such as correctness, readability, and syntactic complexity. The results indicated that the generated texts are easier than the standard texts, but that they still do not fully meet the established Easy Language standards. Additionally, the content is not always rendered correctly.
    摘要 这项研究旨在探讨使用ChatGPT将面向公民的行政文本翻译成德语简易语言(German Easy Language)的可行性,这是一种为满足阅读障碍人士需求而设计的简化受控语言变体。我们使用ChatGPT、采用语言学策略和整体策略两种方式,翻译了选自德国公共机构网站的文本,并从正确性、可读性和句法复杂度等多个标准分析了生成文本的质量。结果表明,生成的文本比标准文本更易读,但仍未完全达到既定的简易语言标准,而且内容并不总能被正确地传达。

BELB: a Biomedical Entity Linking Benchmark

  • paper_url: http://arxiv.org/abs/2308.11537
  • repo_url: https://github.com/sg-wbi/belb-exp
  • paper_authors: Samuele Garda, Leon Weber-Genzel, Robert Martin, Ulf Leser
  • for: 本研究旨在提供一个 Biomedical Entity Linking(BEL) benchmark,以测试不同系统在多个 corpora 上的性能。
  • methods: 本研究使用了不同的方法,包括rule-based系统和基于预训练语言模型的 neural方法。
  • results: 研究结果显示,基于预训练语言模型的神经方法在不同实体类型上表现不一致,突显出需要进一步研究与实体类型无关的模型。
    Abstract Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage knowledge base UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models.
    摘要 生物医学实体链接(BEL)是将实体提及链接到知识库的任务,在生物医学文献信息抽取管道中扮演着重要角色。我们回顾了该领域的最新研究,发现由于现有的生物医学文本挖掘基准中不包含BEL任务,不同研究采用了不同的实验设置,使得基于已发表数字的比较存在问题。另外,神经系统主要在链接到广覆盖知识库UMLS的实例上进行测试,而在更专业的实体类型(例如基因或变异)上的表现研究不足。为解决这一问题,我们开发了BELB,一个生物医学实体链接基准,以统一格式提供链接到7个知识库的11个语料库,涵盖6种实体类型:基因、疾病、化学物质、物种、细胞系和变异。BELB大幅减少了在多个语料库上测试BEL系统的预处理开销,为可重复实验提供了标准化的测试平台。基于BELB,我们对6个基于规则的实体专用系统和3种利用预训练语言模型的最新神经方法进行了广泛评估。结果表明,神经方法在不同实体类型上的表现并不一致,说明需要进一步研究与实体类型无关的模型。