
Announcement

7 November 2024

Action Recognition by Knowledge Augmentation in Vision Language Model


Category: Internship


Action recognition from video is highly important for assistive care robots, as it enables them to understand and respond appropriately to the needs and activities of the people they assist. Recent deep learning models for action recognition are moving toward more data-efficient, interpretable, and computationally optimized frameworks, combining transformer architectures, spatio-temporal attention, multimodal fusion, and self-supervised learning, to name a few. Meanwhile, the recent emergence of large-scale pre-trained vision-language models (VLMs) has demonstrated remarkable performance and transferability across many types of visual recognition tasks, thanks to their generalizable visual and textual representations. This has been confirmed by our recent studies [1][2], in which our model learns and improves visual, textual, and numerical representations of patient gait videos, built on a large-scale pre-trained VLM, for several classification tasks.
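To illustrate how an aligned visual-textual latent space transfers to recognition tasks, here is a minimal sketch (not part of the announced project; the model checkpoint, prompt template, and mean-pooling over frames are illustrative assumptions) of zero-shot action recognition with a generic pre-trained CLIP model via Hugging Face transformers:

    # Minimal sketch: zero-shot action recognition with a generic VLM.
    # Video features are approximated by mean-pooling per-frame image
    # embeddings; class names serve as textual prompts.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    def classify_video(frames, class_names):
        """frames: list of PIL images sampled from the clip."""
        prompts = [f"a video of a person {c}" for c in class_names]  # assumed template
        inputs = processor(text=prompts, images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        video_feat = img.mean(dim=0, keepdim=True)   # naive temporal pooling
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        return (video_feat @ txt.T).softmax(dim=-1)  # class probabilities

Because the visual and textual embeddings already live in a shared space, no task-specific training is needed for this baseline; the announced work goes further by adapting the textual side through prompt learning.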

Motivated by these recent successes, we will extend our previously developed model and its multimodal representation to a new classification task: action recognition from video. As in our previous method, we will adopt a prompt learning strategy, keeping the pre-trained VLM frozen to preserve its general representation and leverage its pre-aligned multimodal latent space, while learning the prompt's context through learnable vectors initialized with domain-specific knowledge, as sketched below.
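For concreteness, here is a minimal, hypothetical sketch of this prompt learning strategy in the style of CoOp: the VLM stays frozen, and only a short sequence of context vectors, optionally initialized from domain-specific token embeddings, is optimized. The dimensions, the stand-in frozen encoder, and the dummy tensors are assumptions for illustration, not the actual model.

    # Minimal sketch: CoOp-style prompt learning with a frozen encoder.
    import torch
    import torch.nn as nn

    class PromptLearner(nn.Module):
        def __init__(self, embed_dim=512, n_ctx=8, init_ctx=None):
            super().__init__()
            # Learnable context; optionally initialized from embeddings of a
            # domain-specific phrase, as described in the announcement.
            if init_ctx is not None:
                ctx = init_ctx.clone()               # (n_ctx, embed_dim)
            else:
                ctx = 0.02 * torch.randn(n_ctx, embed_dim)
            self.ctx = nn.Parameter(ctx)

        def forward(self, class_embeds):
            # class_embeds: (n_classes, n_tokens, embed_dim), frozen class-token embeddings
            n_cls = class_embeds.size(0)
            ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
            return torch.cat([ctx, class_embeds], dim=1)  # learnable prefix + class tokens

    # Hypothetical frozen stand-in for the pre-trained VLM's text encoder:
    text_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=2)
    for p in text_encoder.parameters():
        p.requires_grad = False                      # keep the VLM frozen

    prompt_learner = PromptLearner()
    optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=2e-3)

    # One training step: cosine similarity between (frozen) video features
    # and prompt-conditioned text features gives class logits.
    video_feats = torch.randn(4, 512)                # dummy frozen visual features
    class_embeds = torch.randn(10, 4, 512)           # dummy frozen class-token embeddings
    labels = torch.randint(0, 10, (4,))

    optimizer.zero_grad()
    prompts = prompt_learner(class_embeds)           # (10, 12, 512)
    text_feats = text_encoder(prompts).mean(dim=1)   # pool over tokens
    text_feats = nn.functional.normalize(text_feats, dim=-1)
    video_feats = nn.functional.normalize(video_feats, dim=-1)
    logits = 100.0 * video_feats @ text_feats.T
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                                  # gradients flow only to the context
    optimizer.step()

Since gradients only reach the small context parameter, this kind of adaptation is data-efficient and leaves the VLM's general-purpose representation intact.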

More details on the subject can be found at the link below.


[1] Wang D., Yuan K., Muller C., Blanc F., Padoy N., Seo H., “Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model”, Lecture Notes in Computer Science (Proc. Medical Image Computing and Computer-Assisted Intervention), vol. 15005, pp. 251–261, Springer, 2024.

[2] Wang D., Yuan K., Seo H., “GaVA-CLIP: Refining Multimodal Representations with Clinical Knowledge and Numerical Parameters for Gait Video Analysis in Neurodegenerative Diseases”, under revision, 2024.

https://mlms.icube.unistra.fr/img_auth_namespace.php/c/c5/Stage-ActionRecognition.pdf

