Self-supervised learning and unsupervised representation learning

Please note that, to guarantee all registered participants access to the meeting rooms, registration for meetings is free but mandatory.

Registration for this meeting is closed.

Registration

57 members of the GdR ISIS and 30 non-members are registered for this meeting.
Room capacity: 100 people.

Announcement

We are organizing a meeting of the cross-cutting theme T on the topic "Self-supervised learning and unsupervised representation learning".

Given the current Covid pandemic, the meeting will be held by videoconference. However, for technical reasons related to the number of simultaneous connections, registration for the meeting is free but mandatory. Login details will be sent by email to registered participants the day before or on the morning of the meeting.

The goal of representation learning is to automatically learn, from diverse data, representations that are often hierarchical and based on levels of abstraction. Unlike classical models, these representations are learned from data rather than designed by hand from domain knowledge. In this context, learning from large amounts of labeled data has long been the norm.

More recently, a new challenge has emerged: learning, in a fully unsupervised manner, single, rich representations that can serve multiple and diverse tasks (visual recognition, prediction, etc.). As in human learning, which is largely unsupervised, the aim is to discover the regularities that govern our physical world in order to learn abstractions useful for reasoning. These representations are usually exploited by transfer to a target task.

The meeting will take place on

9 June, from 9h to 11h15 (by videoconference).

It will include two invited talks by

Cordelia Schmid, Inria / Google (https://thoth.inrialpes.fr/~schmid)
Spyros Gidaris, Valéo (https://dblp.org/pers/g/Gidaris:Spyros.html)

Call for contributions:

We are also issuing a call for contributions on the following themes:

Self-supervised learning
Unsupervised representation learning

Those wishing to present their work at this meeting are invited to send their proposal (title and abstract of at most half a page) by e-mail to the organizers before 27 May 2020.

Organizers:

Nicolas Thome, Cnam, CEDRIC : nicolas.thome@cnam.fr
Christian Wolf, INSA-Lyon, LIRIS : christian.wolf@insa-lyon.fr

Program

Introduction to the day (Nicolas Thome, Christian Wolf)

9h10

Keynote by Cordelia Schmid, Inria/Google, https://thoth.inrialpes.fr/~schmid

Video Bert and its extension

9h40

Keynote by Spyros Gidaris, Valéo, https://dblp.org/pers/g/Gidaris:Spyros.html

Self-supervised image representation learning

10h10

Fabio Pizzati, Pietro Cerri and Raoul de Charette

Model-based disentanglement of lens occlusions

10h25

Florent Chiaroni, Giuseppe Valenzise and Frédéric Dufaux

Self-supervised learning for autonomous vehicle perception: A conciliation between analytical and learning methods

10h40

Guillaume Devineau, Fabien Moutarde

Time-sampled Triplet Autoencoders for Poses

10h55

Hubert Banville, Omar Chehab, Aapo Hyvarinen, Denis Engemann, Alexandre Gramfort

Self-supervised Representation Learning from Electroencephalography signals

11h10

End of the day

Abstracts of the contributions

Cordelia Schmid

Video Bert and its extension

Abstract:

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
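Editor's illustration (not taken from the talk): the abstract mentions visual tokens obtained by vector quantization of video data. A minimal sketch of that idea, assuming clip features are already available as vectors and that a codebook of centroids has been learned beforehand, could look as follows. The function name `quantize`, the 2-D features and the 3-entry codebook are all hypothetical.

```python
import math


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        distances = [math.dist(feature, centroid) for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical 2-D clip features and a 3-entry codebook (illustrative values).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
clip_features = [(0.1, -0.2), (0.9, 1.2), (4.0, 6.0), (0.4, 0.3)]

# Each clip becomes a discrete "visual token" (a codebook index).
visual_tokens = quantize(clip_features, codebook)
print(visual_tokens)
```

The resulting indices play the same role as word identifiers in a text sequence, which is what would allow a BERT-style model to consume them.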

Spyros Gidaris

Self-supervised image representation learning

Abstract:

Over the last few years, deep learning-based methods have achieved impressive results on image understanding problems, such as image classification, object detection, or semantic segmentation. However, real-world computer vision applications often require models that are able to learn without extensive human supervision. In contrast, in order to be successful, classic supervised deep learning methods need access to large volumes of manually labeled training data. As a result, one of the next big challenges in computer vision is to develop learning approaches that are capable of addressing this important shortcoming of existing deep learning methods.

One promising approach in this direction is so-called self-supervised representation learning. The goal of self-supervised representation learning is to learn convolutional neural network (convnet) based representations without human supervision. To that end, it advocates training the convnet with an annotation-free pretext task defined using only the information available within an image, e.g., predicting the relative location of two image patches. In this talk, I will present the main self-supervised tasks and explain how pre-training a neural network with them leads to learning useful visual representations for downstream image understanding tasks, such as image classification and object detection. I will also cover recent self-supervised methods based on contrastive learning that have managed to significantly narrow or eliminate the gap with supervised representation learning.
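Editor's illustration (not taken from the talk): the relative-location pretext task mentioned above can be sketched as a data sampler that needs no annotation, since the label is simply the position of a neighbouring patch relative to a centre patch. The sketch assumes an image stored as a list of rows; the names `sample_patch_pair`, `crop` and `NEIGHBOUR_OFFSETS` are hypothetical, and a real pipeline would feed the two patches to a convnet trained to predict the label.

```python
import random

# (row, col) offsets of the 8 neighbours around a centre patch.
# The index into this list is the label to predict.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1), (0, 1),
                     (1, -1), (1, 0), (1, 1)]


def crop(image, top, left, size):
    """Return the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patch_pair(image, patch_size, rng):
    """Sample (centre patch, neighbour patch, label) from a single image."""
    height, width = len(image), len(image[0])
    # The centre patch must leave room for a full neighbour on every side.
    top = rng.randrange(patch_size, height - 2 * patch_size + 1)
    left = rng.randrange(patch_size, width - 2 * patch_size + 1)
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    centre = crop(image, top, left, patch_size)
    neighbour = crop(image, top + d_row * patch_size,
                     left + d_col * patch_size, patch_size)
    return centre, neighbour, label


# Toy 16x16 "image" whose pixel value encodes its own position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
centre, neighbour, label = sample_patch_pair(image, 4, random.Random(0))
```

The supervision signal comes entirely from the image layout, which is what makes the task annotation-free.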

Fabio Pizzati, Pietro Cerri and Raoul de Charette

Model-based disentanglement of lens occlusions

Abstract:

Naive image-to-image (I2I) translation learns a source-to-target mapping without any representation of the underlying domain manifold. As a result, while I2I can accurately translate images from one domain to another, it usually fails at representing intermediate domains. To circumvent this problem, we propose to guide the unsupervised training with a simple physical model.

We apply our proposal to I2I translation in the context of lens occlusions and show that it enables disentanglement of lens occlusions from the scene. In detail, we propose a model-based disentanglement training, which learns to disentangle the scene from the lens occlusion through the injection of a physics-based rendering layer and can regress the occlusion model parameters from the target dataset, in a completely unsupervised setting. The experiments demonstrate that our method is able to handle varying types of occlusions (raindrops, dirt, watermarks, etc.) and generate highly realistic translations, qualitatively and quantitatively outperforming the state-of-the-art on multiple datasets.

Florent Chiaroni, Giuseppe Valenzise and Frédéric Dufaux

Self-supervised learning for autonomous vehicle perception: A conciliation between analytical and learning methods

Abstract:

Nowadays, supervised deep learning techniques yield state-of-the-art prediction performance for a wide variety of computer vision tasks. However, such supervised techniques generally require a large amount of manually labeled training data. In the context of autonomous vehicle perception, this requirement is critical, as the distribution of sensor data can continuously change and include several unexpected variations.

It turns out that a category of learning techniques, referred to as self-supervised learning (SSL), consists of replacing the manual labeling effort with an automatic labeling process. Thanks to their ability to learn at application time and in varying environments, state-of-the-art SSL techniques provide a valid alternative to supervised learning for a variety of different tasks, including long-range traversable area segmentation, moving obstacle instance segmentation, long-term moving obstacle tracking, or depth map prediction.

In this presentation, we give an overview and a general formalization of the concept of self-supervised learning (SSL) for autonomous vehicle perception. This formalization provides helpful guidelines for developing novel frameworks based on generic SSL principles. Moreover, it makes it possible to point out significant challenges in the design of future SSL systems. We illustrate our formalization through an example in the context of moving obstacle analysis. Specifically, we propose to classify detected moving obstacles depending on their motion patterns. This is performed by applying a deep clustering algorithm to temporal patch sequences, and then considering the obtained clusters as labeled sets to train a real-time image classifier. This approach outperforms state-of-the-art unsupervised image classification methods on the BDD100K video dataset.
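Editor's illustration (not the authors' implementation): the two-stage idea described above, clustering first and then treating cluster indices as labels, can be sketched with plain k-means standing in for the deep clustering algorithm. The feature vectors, the initial centroids and the function name `kmeans` are hypothetical, and the classifier training step is only indicated by the pseudo-labeled set it would consume.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda k: math.dist(point, centroids[k]))
            for point in points]


def kmeans(points, initial_centroids, n_iters=10):
    """Plain k-means; returns (centroids, cluster index of each point)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(n_iters):
        assignments = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assign(points, centroids)


# Hypothetical motion-pattern features, one vector per detected obstacle.
features = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2),
            (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
_, pseudo_labels = kmeans(features, [features[0], features[3]])

# Cluster indices act as automatically generated labels: this set is what
# a supervised image classifier would then be trained on.
pseudo_labeled_set = list(zip(features, pseudo_labels))
```

No manual annotation enters the pipeline: the only supervision the classifier sees is the cluster membership.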

Finally, we conclude the talk by discussing future scientific and application perspectives of SSL in the context of autonomous driving, such as the analysis of potentially moving obstacles and the creation of benchmark SSL datasets.

Guillaume Devineau, Fabien Moutarde

Time-sampled Triplet Autoencoders for Poses

Abstract:

Studies on human visual perception of biological motion (Johansson 1973) have shown that humans can recognize human actions using only the motion of the body's (skeletal) joint positions.

Such skeletal ("pose") representations are lightweight and very sparse compared to image and video representations.

Gesture, action and activity recognition can thus be considered as sequence classification tasks, where the input sequence to classify represents, at each time step, a human skeletal pose.

Current state-of-the-art approaches for action recognition are deep learning models that possess a CNN-based or RNN-based architecture that gives them the ability to recognize spatial, temporal or spatio-temporal patterns.

In another domain, natural language processing (NLP), where the input consists of sequences of token/word vectors, RNN-based architectures have long been considered the state-of-the-art approaches.

However, in recent years, the use of self-supervised text embeddings (e.g. word2vec, GloVe or fastText) and attention-based models (e.g. BERT, GPT-2 or other Transformer-based models) has led to substantial gains in model performance for numerous NLP tasks.

The assumed goal of self-supervised text embeddings is to organize representations into a latent space (via "deep metric learning" or via handcrafted metrics), based on similarities between vectors.

Among general deep metric learning techniques, a key and widely used one is the triplet loss.
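Editor's illustration (not the authors' model): a generic triplet loss pulls an anchor embedding towards a "positive" example and pushes it away from a "negative" one until the two distances differ by at least a margin. The sketch below uses the Euclidean distance; squared distances are also common. The function name, the margin value and the toy 2-D embeddings are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Easy triplet: the negative is already far away, so the loss vanishes.
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))

# Hard triplet: the negative is as close as the positive, so the loss
# equals the margin.
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
```

For pose sequences, one plausible choice (an assumption here, suggested only by the talk title) would be to pick positives close in time to the anchor and negatives far from it.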

In this talk, we propose a Triplet Autoencoder model for Human Poses. We show that the model is able to denoise poses in a supervised manner (as poses can sometimes be very noisy) and to organize poses in a latent space, in a self-supervised and temporally coherent fashion. The model can also be used to generate human pose motions from scratch or with pose-related conditions. Finally, we discuss the advantages and drawbacks of the proposed approach.

Hubert Banville, Omar Chehab, Aapo Hyvarinen, Denis Engemann, Alexandre Gramfort

Self-supervised Representation Learning from Electroencephalography signals

Abstract:

Supervised deep learning paradigms are often limited by the amount of labeled data that is available. This is particularly problematic when working with clinically-relevant data, such as electroencephalography (EEG), where labelling can be costly in terms of specialized expertise and human processing time.

In this context, we investigate different self-supervised learning (SSL) approaches to learn representations from unlabeled EEG signals. We develop tasks inspired by previous computer vision and audio processing research (Relative Positioning, Temporal Shuffling) and adapt a generic SSL task (Contrastive Predictive Coding) to two clinically relevant problems: sleep staging and pathology detection. Rigorous baselining against purely supervised and hand-engineered models on two large public datasets reveals that SSL outperforms alternative approaches in low-to-medium labeled data regimes, and is competitive with full supervision on fully-labeled data.

In an effort to understand the performance of SSL, we further inspect the learnt features. First, we show that the embedding learnt on clinical data encodes meaningful physiological structure, such as sleep stages and pathological EEG. Second, we simulate EEG-like data and show that SSL can recover nonlinearly mixed independent sources, bridging the gap with recent work on nonlinear Independent Component Analysis (ICA).
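Editor's illustration (not the authors' code): a relative-positioning pretext task on a time series can be sketched as a pair sampler, where two windows receive label 1 if they are close in time and label 0 if they are far apart. The thresholds `tau_pos` and `tau_neg`, the function name and the numeric values are assumptions; a real setup would pass both windows through a shared encoder and train a classifier on the labels.

```python
import random


def sample_relative_positioning_pairs(n_windows, tau_pos, tau_neg,
                                      n_pairs, rng):
    """Return (i, j, label) triples over window indices 0..n_windows-1.

    label 1: the two windows are at most tau_pos steps apart.
    label 0: the two windows are more than tau_neg steps apart.
    Assumes tau_neg >= tau_pos and that both kinds of pairs exist.
    """
    pairs = []
    while len(pairs) < n_pairs:
        i = rng.randrange(n_windows)
        j = rng.randrange(n_windows)
        gap = abs(i - j)
        if i != j and gap <= tau_pos:
            pairs.append((i, j, 1))
        elif gap > tau_neg:
            pairs.append((i, j, 0))
        # Pairs with tau_pos < gap <= tau_neg are ambiguous and skipped.
    return pairs


# 100 consecutive windows from a hypothetical recording.
pairs = sample_relative_positioning_pairs(
    n_windows=100, tau_pos=2, tau_neg=10, n_pairs=50, rng=random.Random(0))
```

Uniform sampling as done here yields far more negative than positive pairs; balancing the two classes would be a natural refinement.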

Self-Supervised Learning from EEG signals remains largely unexplored: these novel results suggest that SSL may pave the way to a wider use of deep learning models on EEG data.
