Action Day "Face, Gesture, Action and Behaviour"

Please note that, to guarantee every registered participant access to the meeting rooms, registration for meetings is free but mandatory.

Registration for this meeting is closed.

Registration

55 members of GdR ISIS and 39 non-members are registered for this meeting.
Room capacity: 150 people.

Announcement

Summary

The aim of this day is to present work on extracting human motion (face, hands, body, gestures) from video sequences and on analysing it at a higher level (tasks, behaviour), in particular for interactive or monitoring applications (video surveillance, health monitoring, ...). The work presented may be methodological (including learning-based methods) and/or application-oriented.

The day will include two invited talks:

Mohamed Daoudi, IMT Lille Douai/ CRIStAL (UMR 9189)
Geometric and Deep learning approaches for Dynamic Facial Expression Analysis and Generation
Efstratios Gavves, QUVA Lab, University of Amsterdam
The Machine Learning of Time: Past and Future

It will continue with contributed talks, for which we are issuing a call for contributions on all the themes of the Action:

Emotion recognition (face, audio, gesture, ...)
Recognition of interactive gestures
Action and activity recognition
Articulated pose estimation (whole body, hands, etc.)
Study of the synchrony of non-verbal signals (head movements, speaking turns, gestures, posture, ...)
Modelling and detection of engagement in interactions
Modelling and generation of behaviours (avatars)
Applications...

Call for contributions

Those wishing to present their work at this day are invited to send their proposal (title and abstract, 1 page maximum) by e-mail to the organizers before 17 December 2020.

Invited speakers

Mohamed Daoudi, IMT Lille Douai/ CRIStAL (UMR 9189)

Title: Geometric and Deep learning approaches for Dynamic Facial Expression Analysis and Generation

Abstract: This talk describes the possibilities opened up by combining tools from Riemannian geometry with deep learning techniques for facial expression recognition and generation. We propose a new approach to facial expression recognition using descriptors called Deep covariance. The solution is based on the idea of encoding local and global convolutional neural network features, extracted from still images, in local and global covariance matrices. We will then analyse the dynamics of facial expressions in the space of symmetric positive definite matrices. We also propose a new approach for generating videos of the six facial expressions from a neutral face image. We will exploit the geometry of the face by modelling the motion of facial landmarks as curves encoded by points on a hypersphere. By proposing a conditional version of the Wasserstein generative adversarial network (GAN) for generating motion on the hypersphere, we learn the distribution of facial expression dynamics for the different classes, from which we synthesize new facial expression motions.
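As an illustration of the covariance-descriptor idea mentioned in this abstract, the following is a minimal sketch only, not the speaker's implementation. It assumes a CNN feature map of shape (channels, height, width), here replaced by random numbers, and it assumes the log-Euclidean distance as one common way to compare symmetric positive definite (SPD) matrices; the abstract does not say which metric is used. The function names and the `eps` regularizer are hypothetical.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-5):
    """Encode a (C, H, W) feature map as a (C, C) covariance matrix.

    Each spatial location is treated as one C-dimensional sample.
    A small multiple of the identity (eps, an assumption) keeps the
    result strictly positive definite.
    """
    c = features.shape[0]
    x = features.reshape(c, -1)               # C x N samples
    x = x - x.mean(axis=1, keepdims=True)     # centre each channel
    cov = x @ x.T / (x.shape[1] - 1)
    return cov + eps * np.eye(c)

def spd_log(m):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def log_euclidean_distance(a, b):
    """Frobenius norm between matrix logarithms (assumed metric)."""
    return np.linalg.norm(spd_log(a) - spd_log(b), ord="fro")

# Random arrays stand in for CNN features of two face images.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 7, 7))
feat_b = rng.standard_normal((8, 7, 7))
cov_a = covariance_descriptor(feat_a)
cov_b = covariance_descriptor(feat_b)
dist = log_euclidean_distance(cov_a, cov_b)
```

A sequence of such matrices, one per video frame, would trace a trajectory in the SPD space, which is the kind of object the abstract proposes to analyse.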

Efstratios Gavves, QUVA Lab, University of Amsterdam

Title: The Machine Learning of Time: Past and Future

Abstract: Visual artificial intelligence automatically interprets what happens in visual data like videos. Today's research grapples with queries like: 'Is this person playing basketball?'; 'Find the location of the brain stroke'; or 'Track the glacier fractures in satellite footage'. All these queries are about visual observations that have already taken place. Today's algorithms focus on explaining past visual observations. Naturally, not all queries are about the past: 'Will this person draw something in or out of their pocket?'; 'Where will the tumour be in 5 seconds given breathing patterns and moving organs?'; or 'How will the glacier fracture given the current motion and melting patterns?'. For these queries and all others, the next generation of visual algorithms must anticipate what happens next given past visual observations. Visual artificial intelligence must also be able to prevent before the fact, rather than explain only after it. In this talk, I will present my vision on what these algorithms should look like, and investigate possible synergies with other fields of science, like biomedical research, astronomy and others. Furthermore, I will present some recent works and applications in this direction within my lab and spinoff.

Organizers

Catherine Achard, ISIR Sorbonne Université
Olivier Alata, Lab. Hubert Curien, Univ. Jean Monnet Saint-Etienne
Christophe Ducottet, Lab. Hubert Curien, Univ. Jean Monnet Saint-Etienne

Programme

Invited talks

14:00 - 14:40 Mohamed Daoudi, IMT Lille Douai/ CRIStAL

Geometric and Deep learning approaches for Dynamic Facial Expression Analysis and Generation

14:40 - 15:20 Efstratios Gavves, QUVA Lab, University of Amsterdam

The Machine Learning of Time: Past and Future

Contributed talks

15:30 - 15:55 Guillaume Vaudaux-Ruth, Adrien Chan Hon Tong, Catherine Achard

Self-assessment learning for action localization in videos

15:55 - 16:20 Joseph Gesnouin, Steve Pechberti, Guillaume Bresson, Bogdan Stanciulescu, Fabien Moutarde

Rethinking Robust Embedding for Skeleton Human Action Recognition

16:20 - 16:45 Tristan Cladière, Hubert Konik, Olivier Alata

Facial Expressions Spotting Using Simulated Event Images and a Convolutional Neural Network

16:45 - 17:10 Tuan Hung VU, Jacques BOONAERT, Sebastien AMBELLOUIS, Abdelmalik TALEB-AHMED

Anomaly detection in surveillance video by multi channel generative framework and supervised learning

Abstracts of contributions

Geometric and Deep learning approaches for Dynamic Facial Expression Analysis and Generation

Mohamed Daoudi

IMT Lille Douai/ CRIStAL (UMR 9189)

The Machine Learning of Time: Past and Future

Efstratios Gavves

QUVA Lab, University of Amsterdam

Self-assessment learning for action localization in videos

Guillaume Vaudaux-Ruth¹,², Adrien Chan Hon Tong², Catherine Achard¹

¹ ONERA

² Sorbonne Université

Abstract: In this talk, we propose a method for localizing actions in videos based on the principle of self-assessment. To this end, the proposed localization model learns simultaneously to regress temporal intervals and to assess itself. This first gives access to a confidence in the regressed value, but above all improves the regression results. We then show that this joint learning is particularly relevant for action localization because it not only helps to find features relevant to each task, but also naturally takes into account some of the specificities of action localization, such as the importance of predicting only one detection per action instance.

Rethinking Robust Embedding for Skeleton Human Action Recognition

Joseph Gesnouin¹,², Steve Pechberti¹, Guillaume Bresson¹, Bogdan Stanciulescu², Fabien Moutarde²

¹ Institut VEDECOM, 78000 Versailles, France

² MINES ParisTech, Université PSL, Centre de Robotique (CAOR), 75006 Paris, France

Abstract: Nowadays, most skeleton action recognition approaches tend to focus on sequential modelling while ignoring, to a certain extent, the question of data representation.

We present here an approach for skeleton action recognition with no sequential modelling at all that focuses on the question of data representation. To demonstrate that the question of data representation is almost as important as sequential modelling for such a task, we use the simplest form of an autoencoder (a feedforward, non-recurrent neural network similar to single layer perceptrons that participate in multilayer perceptrons) to reconstruct the actions. We add to the reconstruction cost function of the autoencoder a statistical supervised regularization with a Linear Discriminant Analysis. This allows us to condition the projection of the instances in the latent space upon their class. We then obtain, in addition to a reduced-size representation of the action, a first draft of the separability of the classes in the latent space. We then extract the encoder part of the trained autoencoder and evaluate its classification ability.

We tested our approach on two public databases: the SHREC database (3D Hand Gesture Recognition) and the JHMDB database (2D Body Action). On both databases, results match the state of the art for skeleton action recognition tasks while being the fastest approach proposed. We therefore show that a trivial model focusing on the representation of its data with statistical regularization can compete with more complex approaches such as state neural networks or convolutional neural networks for skeleton action recognition.
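To make the idea of a supervised, LDA-style regularizer on top of a reconstruction loss more concrete, here is a rough sketch under explicit assumptions. The abstract does not give the exact form of the regularization, so this example assumes a Fisher-criterion-like ratio of within-class to between-class scatter computed on the latent codes; the function names and the weight `lam` are hypothetical and are not taken from the authors' work.

```python
import numpy as np

def scatter_ratio(z, y):
    """Within-class scatter divided by between-class scatter.

    z: (N, D) latent codes, y: (N,) integer class labels.
    Smaller values mean tighter, better separated classes.
    This particular ratio is an assumption made for illustration.
    """
    mean_all = z.mean(axis=0)
    within = 0.0
    between = 0.0
    for label in np.unique(y):
        zc = z[y == label]
        mean_c = zc.mean(axis=0)
        within += np.sum((zc - mean_c) ** 2)
        between += len(zc) * np.sum((mean_c - mean_all) ** 2)
    return within / (between + 1e-12)

def regularized_loss(x, x_hat, z, y, lam=0.1):
    """Mean squared reconstruction error plus the weighted scatter ratio."""
    reconstruction = np.mean((x - x_hat) ** 2)
    return reconstruction + lam * scatter_ratio(z, y)

# Toy latent codes: two tight clusters far apart.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels_matching = np.array([0, 0, 1, 1])   # labels follow the clusters
labels_mixed = np.array([0, 1, 0, 1])      # labels cut across clusters
ratio_matching = scatter_ratio(z, labels_matching)
ratio_mixed = scatter_ratio(z, labels_mixed)
```

In this toy case the penalty is small when the labels follow the clusters and large when they cut across them, which is the behaviour one would want from a term that conditions the latent space on the class.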

Facial Expressions Spotting Using Simulated Event Images and a Convolutional Neural Network

Tristan Cladière, Hubert Konik, Olivier Alata

Lab. Hubert Curien, UMR CNRS 5516, IOGS, Université Jean Monnet, Saint-Etienne, France

Abstract: Facial expressions are one of the most important external indicators of the emotion and psychological status of a person. Among them are micro-expressions (ME), which can be recognized by their duration (less than 500 ms) and their generally low intensity. These specific expressions are considered representative of the real emotion of a person, making them interesting in many applications such as national security, medical care, studies on political psychology and studies on educational psychology.

Since the 2000s, research on automatic spotting and recognition of ME (MESR) has developed. However, this research focused mainly on the recognition task, and the spotting part was left aside until quite recently. Furthermore, ME are particularly challenging to detect due to their nature. These facts can explain why current methods are still not accurate enough. For example, during the Third Facial Micro-Expression Grand Challenge (MEGC2020), the best F1-scores on the SAMM and CAS(ME)² databases (two databases with long videos containing spontaneous facial expressions) were respectively 0.3299 and 0.1403.

Our approach to detecting expressions is to use simulated event images and a fine-tuned convolutional neural network (CNN). Simulated event images are images we created from existing facial expression databases to mimic the recording of an event-based camera. Each image is then labelled as containing movement or not, according to the Action Unit (AU) annotations attached to the databases. An AU is a movement of an individual facial muscle, and the combination of many AUs is traditionally used to describe emotions, in what is called the Facial Action Coding System (FACS). Finally, we fine-tuned the ResNet-18 CNN to classify our event images.

Concerning the overall detection, our method will be compared to the other ones involved in the MEGC2020. Moreover, using a multi-label formalism, we could both spot movements and give indications about their location on the face (eyes, eyebrows, mouth, or nose).
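The abstract does not detail how the event images are simulated. As a hedged illustration of the general principle of an event-based camera, the sketch below marks pixels whose log-intensity changes by more than a threshold between two consecutive frames. The function name, the threshold value and the choice of log-intensity differences are all assumptions made for this example, not the authors' procedure.

```python
import numpy as np

def simulate_event_image(prev_frame, next_frame, threshold=0.15):
    """Return a map of +1 (brighter), -1 (darker) or 0 (no event).

    Frames are grayscale arrays with values in [0, 1]. An event is
    emitted where the log-intensity change exceeds the threshold.
    """
    eps = 1e-3  # avoids log(0) on black pixels
    log_prev = np.log(prev_frame.astype(np.float64) + eps)
    log_next = np.log(next_frame.astype(np.float64) + eps)
    diff = log_next - log_prev
    events = np.zeros(diff.shape, dtype=np.int8)
    events[diff > threshold] = 1
    events[diff < -threshold] = -1
    return events

# Two toy 4x4 frames: one pixel brightens, one pixel darkens.
prev_frame = np.full((4, 4), 0.5)
next_frame = prev_frame.copy()
next_frame[1, 1] = 0.9
next_frame[2, 2] = 0.1
events = simulate_event_image(prev_frame, next_frame)
```

Images of this kind, computed from an existing video database, could then be given to a classifier that decides whether a frame contains facial movement.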

Anomaly detection in surveillance video by multi channel generative framework and supervised learning

Tuan Hung VU¹, Jacques BOONAERT¹, Sebastien AMBELLOUIS², Abdelmalik TALEB-AHMED³

¹ IMT Lille Douai

² COSYS, Université Gustave Eiffel

³ Université Polytechnique Hauts-de-France

Abstract: Anomaly detection in surveillance videos is the identification of rare events which produce different features from normal events. In this work, we present a survey of the progress of anomaly detection techniques and introduce our proposed framework to tackle this very challenging objective. Most recent state-of-the-art anomaly detection methods are based on apparent motion and appearance reconstruction networks and use error estimation between generated and real information as detection features. These approaches achieve promising results by using only normal samples for training. Inspired by this solution, our approach is based on the more recent state-of-the-art techniques and casts anomalous events as unexpected events in future frames. Our contributions are two-fold. On the one hand, we propose a flexible multi-channel framework to generate multi-type frame-level features. On the other hand, we study how it is possible to improve the detection performance by supervised learning. The multi-channel framework is based on four Conditional GANs (CGANs) taking various types of appearance and motion information as input and producing prediction information as output. These CGANs provide a better feature space to represent the distinction between normal and abnormal events. Then, the difference between the generated and ground-truth information is encoded by Peak Signal-to-Noise Ratio (PSNR). We propose to classify those features in a classical supervised scenario by building a small training set with some abnormal samples of the original test set of the dataset. A binary Support Vector Machine (SVM) is applied for frame-level anomaly detection. Finally, we use Mask R-CNN as a detector to perform object-centric anomaly localisation. Our solution is extensively evaluated on the Avenue, Ped1, Ped2 and ShanghaiTech datasets.
Our experimental results demonstrate that PSNR features combined with a supervised SVM are better than error maps computed by previous methods. We achieve state-of-the-art performance for frame-level AUC on Avenue, Ped1 and ShanghaiTech. In particular, on the most challenging dataset, ShanghaiTech, a model trained with supervision outperforms the state-of-the-art unsupervised strategy by up to 9%.
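For readers unfamiliar with PSNR as a frame-level feature, the standard definition is PSNR = 10 log10(MAX² / MSE), where MSE is the mean squared error between the predicted and the real frame. The sketch below applies this definition to toy arrays. The function name and the assumption that pixel values lie in [0, 1] are choices made for this example; how the authors combine the PSNR values of their four channels is not specified in the abstract and is not shown here.

```python
import numpy as np

def psnr(predicted, ground_truth, max_value=1.0):
    """Peak Signal-to-Noise Ratio in decibels between two frames.

    A low PSNR means the generated frame is far from the real one,
    which in a prediction-based setting suggests an unexpected event.
    """
    diff = predicted.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy frames: a close prediction and a poor prediction of the same frame.
ground_truth = np.zeros((8, 8))
good_prediction = ground_truth + 0.01
poor_prediction = ground_truth + 0.1
psnr_good = psnr(good_prediction, ground_truth)
psnr_poor = psnr(poor_prediction, ground_truth)
```

One PSNR value per channel and per frame would give a small feature vector that a binary classifier such as an SVM could take as input.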
