Action recognition from video is highly important for assistive care robots, as it enables them to understand and respond appropriately to the needs and activities of the people they assist. Recent deep learning models for action recognition are moving toward more data-efficient, interpretable, and computationally optimized frameworks, combining transformer architectures, spatio-temporal attention, multimodal fusion, and self-supervised learning, to mention a few. Meanwhile, the recent emergence of large-scale pre-trained vision-language models (VLMs) has demonstrated remarkable performance and transferability across a wide range of visual recognition tasks, thanks to their generalizable visual and textual representations. This is confirmed by our recent studies [1], [2], in which our model learns and refines visual, textual, and numerical representations of patient gait videos on top of a large-scale pre-trained VLM for several classification tasks.
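To illustrate the kind of multimodal fusion described above, the sketch below combines per-frame visual features, a text embedding, and numerical gait parameters into a single classifier. It is a minimal illustration, not the authors' implementation: the feature dimensions, the number of classes, and the simple projection-and-concatenation fusion are assumptions, and the stand-in tensors take the place of features that would come from a pre-trained VLM and from measured gait parameters.

```python
# Minimal sketch of multimodal fusion for gait-video classification.
# Assumes frame and text features are precomputed by a (frozen) pre-trained
# VLM; all dimensions and the fusion scheme are illustrative choices.
import torch
import torch.nn as nn


class MultimodalGaitClassifier(nn.Module):
    def __init__(self, visual_dim=512, text_dim=512, num_params=8,
                 embed_dim=256, num_classes=3):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.numeric_proj = nn.Sequential(
            nn.Linear(num_params, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Classification head over the concatenated representation.
        self.classifier = nn.Sequential(
            nn.LayerNorm(3 * embed_dim),
            nn.Linear(3 * embed_dim, num_classes),
        )

    def forward(self, frame_feats, text_feats, gait_params):
        # frame_feats: (B, T, visual_dim) per-frame features from the VLM
        # text_feats:  (B, text_dim) encoded textual description/prompt
        # gait_params: (B, num_params) numerical gait parameters
        v = self.visual_proj(frame_feats).mean(dim=1)  # temporal average pooling
        t = self.text_proj(text_feats)
        n = self.numeric_proj(gait_params)
        fused = torch.cat([v, t, n], dim=-1)
        return self.classifier(fused)


# Toy usage with random tensors standing in for real VLM features.
model = MultimodalGaitClassifier()
logits = model(torch.randn(4, 16, 512), torch.randn(4, 512), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 3])
```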
More details on the subject can be found in the references below.
[1] Wang D., Yuan K., Muller C., Blanc F., Padoy N., Seo H., "Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model", Lecture Notes in Computer Science (Proc. Medical Image Computing and Computer-Assisted Intervention), vol. 15005, pp. 251–261, Springer, 2024.
[2] Wang D., Yuan K., Seo H., “GaVA-CLIP: Refining Multimodal Representations with Clinical Knowledge and Numerical Parameters for Gait Video Analysis in Neurodegenerative Diseases”, under revision, 2024.
(c) GdR IASIS - CNRS - 2024.