cs.CV - 2023-08-29

Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond

  • paper_url: http://arxiv.org/abs/2308.14753
  • repo_url: https://github.com/vsd-benchmark/vsd
  • paper_authors: Oren Barkan, Tal Reiss, Jonathan Weill, Ori Katz, Roy Hirsch, Itzik Malkiel, Noam Koenigstein
  • for: The paper introduces a large-scale fashion visual similarity benchmark dataset, together with an efficient labeling procedure, for evaluating visual similarity discovery methods (see the evaluation sketch after this entry).
  • methods: Labels are obtained from expert-annotated image pairs, and the authors propose a novel labeling procedure that can be applied to any dataset.
  • results: The paper delivers a large-scale fashion visual similarity dataset of more than 110K expert-annotated image pairs, along with an analysis of the labeling procedure's limitations and inductive biases and metrics to mitigate them.
    Abstract Visual similarities discovery (VSD) is an important task with broad e-commerce applications. Given an image of a certain object, the goal of VSD is to retrieve images of different objects with high perceptual visual similarity. Although being a highly addressed problem, the evaluation of proposed methods for VSD is often based on a proxy of an identification-retrieval task, evaluating the ability of a model to retrieve different images of the same object. We posit that evaluating VSD methods based on identification tasks is limited, and faithful evaluation must rely on expert annotations. In this paper, we introduce the first large-scale fashion visual similarity benchmark dataset, consisting of more than 110K expert-annotated image pairs. Besides this major contribution, we share insight from the challenges we faced while curating this dataset. Based on these insights, we propose a novel and efficient labeling procedure that can be applied to any dataset. Our analysis examines its limitations and inductive biases, and based on these findings, we propose metrics to mitigate those limitations. Though our primary focus lies on visual similarity, the methodologies we present have broader applications for discovering and evaluating perceptual similarity across various domains.
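    Code sketch: the snippet below is a minimal illustration of how expert-annotated similar pairs could be used to score a retrieval model with precision@k. The pair format, the metric choice, and all variable names are assumptions made for illustration; the benchmark's actual protocol and data format are defined in the linked repository.

```python
import numpy as np

def precision_at_k(query_emb, gallery_emb, positive_pairs, k=5):
    """Score a similarity model against expert-annotated positive pairs.

    positive_pairs: dict mapping a query index to the set of gallery indices
    that experts labeled as perceptually similar (hypothetical format).
    """
    # Cosine similarity between every query and every gallery item.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T

    scores = []
    for qi, positives in positive_pairs.items():
        top_k = np.argsort(-sims[qi])[:k]  # indices of the k most similar gallery items
        scores.append(len(set(top_k.tolist()) & positives) / k)
    return float(np.mean(scores))

# Toy usage with random embeddings and two annotated queries.
rng = np.random.default_rng(0)
queries, gallery = rng.normal(size=(4, 128)), rng.normal(size=(50, 128))
print(precision_at_k(queries, gallery, {0: {3, 17}, 1: {5}}, k=5))
```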

MagicEdit: High-Fidelity and Temporally Coherent Video Editing

  • paper_url: http://arxiv.org/abs/2308.14749
  • repo_url: None
  • paper_authors: Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, Jiashi Feng
  • for: The paper addresses the text-guided video editing task.
  • methods: It explicitly disentangles the learning of content, structure, and motion signals during training to achieve effective video-to-video translation.
  • results: The paper shows that this approach yields high-fidelity, temporally coherent video translation and supports a range of downstream editing tasks, including video stylization, local editing, video-MagicMix, and video outpainting.
    Abstract In this report, we present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task. We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure and motion signals during training. This is in contrast to most existing methods which attempt to jointly model both the appearance and temporal representation within a single framework, which we argue, would lead to degradation in per-frame quality. Despite its simplicity, we show that MagicEdit supports various downstream video editing tasks, including video stylization, local editing, video-MagicMix and video outpainting.

MagicAvatar: Multimodal Avatar Generation and Animation

  • paper_url: http://arxiv.org/abs/2308.14748
  • repo_url: https://github.com/magic-research/magic-avatar
  • paper_authors: Jianfeng Zhang, Hanshu Yan, Zhongcong Xu, Jiashi Feng, Jun Hao Liew
  • for: The paper proposes MagicAvatar, a framework for multimodal video generation and animation of human avatars.
  • methods: The framework has two stages: the first translates multimodal inputs into motion/control signals (e.g., human pose, depth, DensePose), and the second generates avatar-centric video guided by these motion signals (see the pipeline sketch after this entry).
  • results: MagicAvatar can animate a person from just a few images of the target identity, and supports text-guided and video-guided avatar generation as well as multimodal avatar animation.
    Abstract This report presents MagicAvatar, a framework for multimodal video generation and animation of human avatars. Unlike most existing methods that generate avatar-centric videos directly from multimodal inputs (e.g., text prompts), MagicAvatar explicitly disentangles avatar video generation into two stages: (1) multimodal-to-motion and (2) motion-to-video generation. The first stage translates the multimodal inputs into motion/ control signals (e.g., human pose, depth, DensePose); while the second stage generates avatar-centric video guided by these motion signals. Additionally, MagicAvatar supports avatar animation by simply providing a few images of the target person. This capability enables the animation of the provided human identity according to the specific motion derived from the first stage. We demonstrate the flexibility of MagicAvatar through various applications, including text-guided and video-guided avatar generation, as well as multimodal avatar animation.
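    Code sketch: a minimal, hypothetical outline of the two-stage data flow described above (multimodal-to-motion, then motion-to-video). The class name, signal fields, and stub logic are assumptions for illustration only and do not reflect the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class MotionSignals:
    """Per-frame control signals produced by stage 1 (field names are illustrative)."""
    pose: List[object]       # e.g. per-frame joint coordinates
    depth: List[object]      # per-frame depth maps
    densepose: List[object]  # per-frame DensePose-style maps

def multimodal_to_motion(prompt: str, guide_video: Sequence = ()) -> MotionSignals:
    """Stage 1: translate multimodal inputs (text, optionally a guiding video)
    into motion/control signals. Stub -- a real system runs a generative model here."""
    n_frames = len(guide_video) or 16
    return MotionSignals(pose=[None] * n_frames,
                         depth=[None] * n_frames,
                         densepose=[None] * n_frames)

def motion_to_video(motion: MotionSignals, identity_images: Sequence[str]) -> List[str]:
    """Stage 2: render an avatar-centric video conditioned on the motion signals
    and a few images that fix the subject's identity. Stub for illustration."""
    return [f"frame_{i}" for i in range(len(motion.pose))]

frames = motion_to_video(multimodal_to_motion("a person dancing"), ["id_0.png", "id_1.png"])
```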

CoVR: Learning Composed Video Retrieval from Web Video Captions

  • paper_url: http://arxiv.org/abs/2308.14746
  • repo_url: https://github.com/lucas-ventura/CoVR
  • paper_authors: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
  • for: The paper proposes a scalable, automatic methodology for creating Composed Image Retrieval (CoIR) datasets, replacing the expensive and non-scalable manual curation of CoIR triplets, and extends the task to composed video retrieval (CoVR).
  • methods: The method mines pairs of videos with similar captions from a large database and uses a large language model to generate the corresponding modification text (see the mining sketch after this entry).
  • results: Applied to WebVid2M, the method automatically generates 1.6 million triplets, yielding the WebVid-CoVR dataset together with a manually annotated evaluation set. Experiments show that a CoVR model trained on this dataset transfers effectively to CoIR, achieving state-of-the-art zero-shot performance on the CIRR and FashionIQ benchmarks.
    Abstract Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
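    Code sketch: a toy version of the mining idea described above, pairing captions that differ in exactly one word and asking a language model to phrase the modification. The one-word-difference heuristic, data format, and the `describe_modification` stub (which stands in for the LLM call) are assumptions, not the authors' pipeline.

```python
from itertools import combinations

def mine_caption_pairs(samples):
    """samples: list of (video_id, caption). Return pairs whose captions differ
    by exactly one word -- a toy stand-in for the mining step."""
    pairs = []
    for (id_a, cap_a), (id_b, cap_b) in combinations(samples, 2):
        tok_a, tok_b = cap_a.lower().split(), cap_b.lower().split()
        if len(tok_a) != len(tok_b):
            continue
        diffs = [(x, y) for x, y in zip(tok_a, tok_b) if x != y]
        if len(diffs) == 1:
            pairs.append((id_a, id_b, diffs[0]))
    return pairs

def describe_modification(word_from, word_to):
    """Placeholder for the LLM that writes the modification text."""
    return f"change {word_from} to {word_to}"

samples = [
    ("vid1", "a dog runs on the beach"),
    ("vid2", "a cat runs on the beach"),
    ("vid3", "a dog runs on the grass"),
]
for src, tgt, (w_from, w_to) in mine_caption_pairs(samples):
    print(src, tgt, describe_modification(w_from, w_to))
```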

Total Selfie: Generating Full-Body Selfies

  • paper_url: http://arxiv.org/abs/2308.14740
  • repo_url: None
  • paper_authors: Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
  • for: Generating full-body selfies from a pre-captured video of the user's body, a target pose photo, and a selfie + background pair for each location.
  • methods: A diffusion-based approach combines this information into high-quality, well-composed photos with the desired pose and background.
  • results: The method produces high-quality, natural-looking full-body selfies with the user's desired pose and background.
    Abstract We present a method to generate full-body selfies -- photos that you take of yourself, but capturing your whole body as if someone else took the photo of you from a few feet away. Our approach takes as input a pre-captured video of your body, a target pose photo, and a selfie + background pair for each location. We introduce a novel diffusion-based approach to combine all of this information into high quality, well-composed photos of you with the desired pose and background.

R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras

  • paper_url: http://arxiv.org/abs/2308.14713
  • repo_url: None
  • paper_authors: Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, Fisher Yu
  • for: Multi-camera systems offer a simple, low-cost alternative to today's complex multi-modal setups, but dense 3D reconstruction and ego-motion estimation from cameras alone remain highly challenging.
  • methods: The authors propose R3D3, a multi-camera system that alternates between geometric estimation, which exploits spatio-temporal information from multiple cameras, and monocular depth refinement, combining multi-camera feature correlation, dense bundle adjustment, and a learnable depth refinement network (see the fusion sketch after this entry).
  • results: The design yields dense, consistent 3D reconstruction of challenging dynamic outdoor environments and achieves state-of-the-art dense depth prediction on the DDAD and NuScenes benchmarks.
    Abstract Dense 3D reconstruction and ego-motion estimation are key challenges in autonomous driving and robotics. Compared to the complex, multi-modal systems deployed today, multi-camera systems provide a simpler, low-cost alternative. However, camera-based 3D reconstruction of complex dynamic scenes has proven extremely difficult, as existing solutions often produce incomplete or incoherent results. We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation. Our approach iterates between geometric estimation that exploits spatial-temporal information from multiple cameras, and monocular depth refinement. We integrate multi-camera feature correlation and dense bundle adjustment operators that yield robust geometric depth and pose estimates. To improve reconstruction where geometric depth is unreliable, e.g. for moving objects or low-textured regions, we introduce learnable scene priors via a depth refinement network. We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments. Consequently, we achieve state-of-the-art dense depth prediction on the DDAD and NuScenes benchmarks.
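    Code sketch: a minimal illustration of the design choice of trusting geometric depth where it is reliable and falling back to a learned prior elsewhere (e.g., moving objects or low-texture regions). The confidence-threshold fusion rule below is an assumption for illustration, not the paper's actual refinement network.

```python
import numpy as np

def fuse_depth(geometric_depth, geometric_conf, refined_depth, conf_threshold=0.5):
    """Keep the multi-view geometric depth where its confidence is high and use
    the network-refined depth elsewhere. Toy stand-in for the refinement stage."""
    use_geometric = geometric_conf >= conf_threshold
    return np.where(use_geometric, geometric_depth, refined_depth)

# Toy usage on a 4x4 depth map whose lower half has unreliable geometry.
geo = np.full((4, 4), 10.0)
conf = np.vstack([np.ones((2, 4)), np.zeros((2, 4))])
refined = np.full((4, 4), 12.0)
print(fuse_depth(geo, conf, refined))
```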

360-Degree Panorama Generation from Few Unregistered NFoV Images

  • paper_url: http://arxiv.org/abs/2308.14686
  • repo_url: https://github.com/shanemankiw/panodiff
  • paper_authors: Jionghao Wang, Ziyu Chen, Jun Ling, Rong Xie, Li Song
  • for: Generating complete 360-degree panoramas from one or more unregistered narrow field-of-view (NFoV) images captured from arbitrary angles.
  • methods: A two-stage angle prediction module handles varying numbers of NFoV inputs, and a latent diffusion-based panorama generation model takes the incomplete panorama and text prompts as control signals, using several geometric augmentation schemes to preserve geometric properties (see the projection sketch after this entry).
  • results: PanoDiff achieves state-of-the-art panoramic generation quality and high controllability, making it suitable for applications such as content editing.
    Abstract 360$^\circ$ panoramas are extensively utilized as environmental light sources in computer graphics. However, capturing a 360$^\circ$ $\times$ 180$^\circ$ panorama poses challenges due to the necessity of specialized and costly equipment, and additional human resources. Prior studies develop various learning-based generative methods to synthesize panoramas from a single Narrow Field-of-View (NFoV) image, but they are limited in alterable input patterns, generation quality, and controllability. To address these issues, we propose a novel pipeline called PanoDiff, which efficiently generates complete 360$^\circ$ panoramas using one or more unregistered NFoV images captured from arbitrary angles. Our approach has two primary components to overcome the limitations. Firstly, a two-stage angle prediction module to handle various numbers of NFoV inputs. Secondly, a novel latent diffusion-based panorama generation model uses incomplete panorama and text prompts as control signals and utilizes several geometric augmentation schemes to ensure geometric properties in generated panoramas. Experiments show that PanoDiff achieves state-of-the-art panoramic generation quality and high controllability, making it suitable for applications such as content editing.
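    Code sketch: a plausible concrete step behind the "incomplete panorama" control signal is warping an NFoV image onto an equirectangular canvas once its viewing angle is known. The snippet below is a simplified, yaw-only projection written for illustration; the function name, resolution defaults, and nearest-neighbour sampling are assumptions, not the authors' implementation.

```python
import numpy as np

def paste_nfov_into_panorama(nfov, fov_deg, yaw_deg, pano_h=512, pano_w=1024):
    """Warp a pinhole NFoV image onto an equirectangular canvas at a given yaw.
    Only yaw (heading) is handled; pitch/roll are omitted for brevity. Returns the
    partial panorama and a validity mask."""
    h, w = nfov.shape[:2]
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # Ray direction for every panorama pixel (y up, z forward).
    u, v = np.meshgrid(np.arange(pano_w), np.arange(pano_h))
    lon = (u + 0.5) / pano_w * 2 * np.pi - np.pi - np.radians(yaw_deg)
    lat = np.pi / 2 - (v + 0.5) / pano_h * np.pi
    dx, dy, dz = np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)

    # Perspective projection into the NFoV image plane (camera looks along +z).
    in_front = dz > 1e-6
    safe_dz = np.where(in_front, dz, 1.0)
    x = np.where(in_front, f * dx / safe_dz + w / 2, -1.0)
    y = np.where(in_front, f * -dy / safe_dz + h / 2, -1.0)
    valid = in_front & (x >= 0) & (x < w - 1) & (y >= 0) & (y < h - 1)

    pano = np.zeros((pano_h, pano_w, 3), dtype=nfov.dtype)
    mask = np.zeros((pano_h, pano_w), dtype=bool)
    yi, xi = y[valid].astype(int), x[valid].astype(int)  # nearest-neighbour sampling
    pano[valid] = nfov[yi, xi]
    mask[valid] = True
    return pano, mask

# Toy usage: paste a random 90-degree image facing 30 degrees east of forward.
pano, mask = paste_nfov_into_panorama(np.random.rand(256, 256, 3), fov_deg=90, yaw_deg=30)
```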

Video-Based Hand Pose Estimation for Remote Assessment of Bradykinesia in Parkinson’s Disease

  • paper_url: http://arxiv.org/abs/2308.14679
  • repo_url: None
  • paper_authors: Gabriela T. Acevedo Trebbau, Andrea Bandini, Diego L. Guarin
  • for: The study investigates whether pose estimation algorithms can support remote assessment and monitoring of Parkinson's Disease (PD) from videos recorded over video streaming services.
  • methods: Seven off-the-shelf hand pose estimation models were used to estimate thumb and index finger movement in videos of the finger-tapping test, recorded both locally with high-quality cameras and during live Zoom meetings (see the feature sketch after this entry).
  • results: Three of the seven models performed well on locally recorded videos, while accuracy dropped significantly for videos recorded over the streaming service; a negative correlation was observed between movement speed and model accuracy for streaming recordings, and most bradykinesia-related movement features showed only poor to moderate reliability for streaming recordings.
    Abstract There is a growing interest in using pose estimation algorithms for video-based assessment of Bradykinesia in Parkinson's Disease (PD) to facilitate remote disease assessment and monitoring. However, the accuracy of pose estimation algorithms in videos from video streaming services during Telehealth appointments has not been studied. In this study, we used seven off-the-shelf hand pose estimation models to estimate the movement of the thumb and index fingers in videos of the finger-tapping (FT) test recorded from Healthy Controls (HC) and participants with PD and under two different conditions: streaming (videos recorded during a live Zoom meeting) and on-device (videos recorded locally with high-quality cameras). The accuracy and reliability of the models were estimated by comparing the models' output with manual results. Three of the seven models demonstrated good accuracy for on-device recordings, and the accuracy decreased significantly for streaming recordings. We observed a negative correlation between movement speed and the model's accuracy for the streaming recordings. Additionally, we evaluated the reliability of ten movement features related to bradykinesia extracted from video recordings of PD patients performing the FT test. While most of the features demonstrated excellent reliability for on-device recordings, most of the features demonstrated poor to moderate reliability for streaming recordings. Our findings highlight the limitations of pose estimation algorithms when applied to video recordings obtained during Telehealth visits, and demonstrate that on-device recordings can be used for automatic video-assessment of bradykinesia in PD.
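    Code sketch: an illustrative example of the kind of movement features that can be read off fingertip keypoints from the finger-tapping test (aperture, opening/closing speed, tap rate). The specific feature definitions and names below are assumptions for illustration; the paper's ten bradykinesia features are defined in the original work.

```python
import numpy as np

def finger_tapping_features(thumb_xy, index_xy, fps=30.0):
    """Toy bradykinesia-style features from fingertip trajectories.

    thumb_xy, index_xy: arrays of shape (T, 2) holding per-frame (x, y) fingertip
    positions, e.g. from an off-the-shelf hand pose estimator.
    """
    aperture = np.linalg.norm(thumb_xy - index_xy, axis=1)  # thumb-index distance per frame
    speed = np.abs(np.diff(aperture)) * fps                 # opening/closing speed

    # Count taps as local minima of the aperture (fingers closing together).
    is_minimum = (aperture[1:-1] < aperture[:-2]) & (aperture[1:-1] < aperture[2:])
    duration_s = len(aperture) / fps

    return {
        "amplitude_range": float(aperture.max() - aperture.min()),
        "mean_speed": float(speed.mean()),
        "tap_rate_hz": float(is_minimum.sum() / duration_s),
    }

# Toy usage: a synthetic 5-second recording with ~2 taps per second at 30 fps.
t = np.arange(0, 5, 1 / 30.0)
gap = 0.5 * (1 + np.sin(2 * np.pi * 2 * t))
thumb = np.zeros((len(t), 2))
index = np.stack([gap, np.zeros_like(t)], axis=1)
print(finger_tapping_features(thumb, index))
```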