As a reminder, to guarantee that all registered participants can access
the meeting rooms, **registration for meetings is
free but mandatory**.

**Registration for this meeting is closed.**

**74 members** of GdR ISIS and **114 non-members** are registered for this meeting.

Room capacity: 250 people.

We are organizing the third joint meeting between GDR ISIS and GDR Robotique on the theme "Apprentissage et Robotique" (Learning and Robotics). Given the current Covid pandemic, the meeting will be held virtually, on Zoom. Login details will be sent directly to registered participants during the morning before the meeting.

The aim of the meeting is to offer an opportunity for exchange, in the form of talks, between the different communities (robotics, statistical learning, signal and image processing, ...) working on learning for the various aspects of robotics (perception, control, navigation, action/perception loops, etc.).

The day will include invited talks and contributed presentations, for which we are issuing a call for contributions on the following topics:

- Learning for navigation in complex spaces and social navigation
- Learning for SLAM and semantic SLAM
- RL and Deep RL for robotics and mobile agents
- Learning and visual servoing
- Sensor fusion for robotics
- Transfer from simulation to the real world
- Geometry, semantics and learning
- Vision, Robotics and Language (Embodied Question Answering and language-based robot control)
- Links between learning and control / control theory
- ...

Abstracts of proposals (about half a page) should be sent to christian.wolf@insa-lyon.fr, david.filliat@ensta-paristech.fr and cedric.demonceaux@u-bourgogne.fr before 20 June 2021.

Organizers:

- Cédric Demonceaux (Université de Bourgogne, VIBOT)
- David Filliat (ENSTA ParisTech, U2IS, Inria Flowers)
- Christian Wolf (INSA-Lyon, LIRIS)

The meeting will take place on 29 June 2021, 14:00-17:00.

It will include an invited talk by

- Ingmar Posner, Oxford University
- https://ori.ox.ac.uk/people/ingmar-posner/

**14:00** Invited keynote: Ingmar Posner

**"Learning to Perceive and to Act - Disentangling Tales from (Structured) Latent Space"**

Efficient Online Transfer Learning for 3D Object Classification in Autonomous Driving

3D Semantic Scene Completion -- a Survey and new Lightweight architecture

Bidirectional interaction between visual and motor generative models using predictive coding and active inference

CNN Networks for State Estimation: application to the estimation of projectile trajectory with an Imperfect Invariant Extended Kalman Filter

Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation

Learning and robotics: from optimization to diversification

Giuseppe Paolo, Alexandre Coninx, Stephane Doncieux, Alban Laflaquière

Unsupervised continual learning for object recognition

Kaouther Messaoud

**Ingmar Posner**

Unsupervised learning is experiencing a renaissance. Driven by an abundance of unlabelled data and the advent of deep generative models, machines are now able to synthesise complex images, videos and sounds. In robotics, one of the most promising features of these models - the ability to learn structured latent spaces - is gradually gaining traction. The ability of a deep generative model to disentangle semantic information into individual latent-space dimensions seems naturally suited to state-space estimation. Combining this information with generative *world models*, models which are able to predict the likely sequence of future states given an initial observation, is widely recognised to be a promising research direction with applications in perception, planning and control. Yet, to date, designing generative models capable of decomposing and synthesising scenes based on higher-level concepts such as objects remains elusive in all but simple cases. In this talk I will motivate and describe our recent work using deep generative models for unsupervised *object-centric* scene inference and generation. I will demonstrate that explicitly modelling correlations in the latent space is a key ingredient required to synthesise physically plausible scenes. Furthermore, I will make the case that exploiting correlations encoded in latent space, and learnt through experience, leads to a powerful and intuitive way to disentangle and manipulate task-relevant factors of variation. I will show that this not only casts a novel light on affordance learning, but also that the same framework is capable of generating a walking gait in a real-world quadruped robot.

**Rui Yang, Yassine Ruichek and Zhi Yan**

Paper: https://arxiv.org/abs/2104.10037

Code: https://github.com/epan-utbm/efficient_online_learning

Video: https://youtu.be/wl5ehOFV5Ac

Autonomous driving has developed rapidly over the last few decades, with machine perception as one of its important issues. Although object detection based on conventional cameras has achieved remarkable results in 2D/3D, non-visual sensors such as 3D LiDAR still have unmatched advantages in the accuracy of object position detection. However, properly interpreting the point clouds generated by LiDAR remains a challenge. This paper presents a multi-modal online learning system for 3D LiDAR-based object classification in urban environments, covering cars, cyclists and pedestrians. The proposed system aims to effectively transfer the mature detection capabilities based on visual sensors to the learning of a new model based on non-visual sensors through a multi-target tracker (i.e. using one sensor to train another). In particular, it integrates the Online Random Forests (ORF) method (Saffari et al., 2009), which inherently supports fast and multi-class learning. Through experiments, we show that our system is capable of learning a high-performance model for LiDAR-based 3D object classification on-the-fly, which is especially suitable for in-situ robotics deployment while responding to the widespread challenge of insufficient detector generalization capabilities.

(3DV oral + journal)

LMSCNet - https://arxiv.org/abs/2008.10559, 3DV 2020 oral

SSC Survey - https://arxiv.org/abs/2103.07466, journal submission, 2021

We present an in-depth survey on the semantic scene completion literature and a new lightweight architecture. The task consists of jointly completing and predicting the semantic labels of sparse 3D scenes. As opposed to the literature, our LMSCNet proposal uses a 2D UNet backbone with comprehensive multiscale skip connections to enhance feature flow, along with 3D segmentation heads. On the SemanticKITTI benchmark, our method performs on par on semantic completion and better on occupancy completion than all other published methods at submission time -- while being significantly lighter and faster. As such it provides a great performance/speed trade-off for mobile-robotics applications. The ablation studies demonstrate that our method is robust to lower-density inputs, and that it enables very high-speed semantic completion at the coarsest level.

Winning entry of the CVPR 2021 Multi-On Challenge

In the context of Visual Navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals.

This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities and discover object affordances. In classical Reinforcement Learning (RL) setups, this capacity is learned from reward alone. We introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents that build either an explicit or an implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input.

A difficult problem in robotics is the learning of motor skills without direct supervision in the motor space (a.k.a. kinesthetic teaching). A possible approach to solve it is to take inspiration from the active inference framework, suggesting that motor commands can be determined from desired sensory observations, by an inference process minimizing variational free-energy.

After introducing the free-energy principle and the concept of active inference, we will present a model taking inspiration from these principles in order to learn a repertoire of motor trajectories for handwriting. This model is composed of two generative models that are interconnected by feedback mechanisms. The first model is an RNN for visual prediction. The second model is an RNN for motor prediction, coupled with a forward model predicting visual outcomes of motor commands. In both generative models, learning and inference can be performed by minimizing variational free-energy.

We show how the minimization of variational free-energy induces a bidirectional influence between the visual and motor generative models, and study the advantages of this interaction for motor control: robustness to perturbations and adaptation to transformed visual predictions.

Annabi et al., "Bidirectional interaction between visual and motor generative models using predictive coding and active inference", 2021, Under review for the Neural Networks Special Issue on Artificial Intelligence and Brain Science.

https://arxiv.org/abs/2104.09163v1

In the military field, accurate knowledge of the position, velocity and orientation of a projectile at each moment is essential for its guidance. For this, different navigation methods can be considered depending on the type of sensors embedded in the projectile. Currently, only methods combining an Inertial Measurement Unit (IMU) and a Global Navigation Satellite System (GNSS) are used for projectile navigation.

The aim of this work is to present a new low-cost navigation solution to estimate the projectile trajectory using only noisy and biased accelerometer, gyrometer and magnetometer readings.

The Extended Kalman Filter (EKF) is a popular solution to integrate IMU measurements. Nevertheless, EKF convergence is not guaranteed. A new approach can be considered: the Invariant Extended Kalman Filter (IEKF). It is a nonlinear convergent observer defined on a matrix Lie group. Thus, under a condition verified by the system dynamics (the "group affine" property), the invariant-error evolution becomes independent of the estimated trajectory. Moreover, according to the log-linear property, the nonlinear error dynamics can be determined exactly by a linear differential equation. Owing to these convergence properties, an algorithm based on IEKF theory is developed to estimate the projectile position, velocity and orientation and the sensor biases. It is an Imperfect Right-Invariant Extended Kalman Filter (R-IEKF) because there is no Lie group able to represent both projectile states and sensor biases. The projectile trajectory is embedded in a Lie group as in the IEKF and is evaluated by a right-invariant error, while the sensor biases are represented by vectors and evaluated by a linear error.

In addition, the Imperfect R-IEKF is sensitive to the tuning of the measurement noise covariance matrix. A convolutional neural network (CNN) is trained to estimate this matrix from magnetometer readings. This provides a time-varying covariance matrix adapted to each phase of the projectile flight.

The joint use of the Imperfect R-IEKF and a CNN helps to significantly reduce the projectile trajectory estimation errors. Indeed, an Imperfect R-IEKF with a constant measurement noise covariance matrix yields estimation errors of about 20 meters. The same algorithm with a matrix dynamically tuned by a CNN exhibits estimation errors of less than five meters. By comparison, the position estimation errors of dead reckoning are about 70 m.
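
To illustrate why a Kalman-type filter is sensitive to the measurement noise covariance, here is a minimal sketch of a scalar, linear Kalman measurement update. It is not the Imperfect R-IEKF described above: the function name `kalman_update` and all numeric values are assumptions chosen for illustration, and the noise variance `r` is set by hand where the abstract uses a CNN.

```python
def kalman_update(x_prior, p_prior, z, r):
    """One scalar Kalman measurement update.

    x_prior, p_prior: prior state estimate and its variance
    z: measurement of the state (identity observation model, an assumption)
    r: measurement noise variance (hand-picked here, illustrative only)
    """
    gain = p_prior / (p_prior + r)            # Kalman gain
    x_post = x_prior + gain * (z - x_prior)   # corrected estimate
    p_post = (1.0 - gain) * p_prior           # corrected variance
    return x_post, p_post, gain


# Same prior and same measurement, two different noise variances.
trusting = kalman_update(x_prior=0.0, p_prior=1.0, z=10.0, r=1.0)
cautious = kalman_update(x_prior=0.0, p_prior=1.0, z=10.0, r=9.0)
```

With a small `r` the update moves the estimate strongly toward the measurement, while a large `r` keeps it close to the prior, which is why a variance adapted to each flight phase can change the estimation error.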

Reinforcement learning promises to define robot behavior automatically, without having to program it explicitly. This is a major challenge for bringing robots out of factories and enabling them to carry out tasks that are not predefined, in uncontrolled conditions. Despite its successes in many disciplines, machine learning in general, and reinforcement learning in particular, struggles in robotics. Robotics combines properties that make learning particularly difficult: continuous, high-dimensional spaces to explore, sparse and deceptive rewards, the high cost of each trial and, when learning in simulation, the difficulty of transfer to the real world.

In robotics, reinforcement learning is essentially seen as an optimization process that generates a single solution, optimal for a task defined by a reward function. This view can be advantageously replaced by a very different kind of learning, in which the result is no longer a single solution but a set of solutions as large and diverse as possible. This paradigm shift improves the exploration capabilities of learning while opening up new possibilities, notably for transfer between simulation and reality. Building on the European project DREAM, we will present the methods developed in this direction, which rely in particular on evolutionary algorithms such as novelty search and quality-diversity algorithms, and we will present the perspectives they open up in an open-ended learning context.

Embodied agents can learn how to act in a given setting by performing actions and observing their consequences. Such actions are evaluated through a reward function assessing how good the performed actions are with respect to the task the agent needs to solve. Given the reliance of these systems on such reward functions, it is fundamental that the function is well designed and that the reward signal is given often enough. If this is not the case, and the reward is only provided after multiple actions have been performed or if a certain condition is met, the setting is defined as a sparse-reward setting. Such situations can prove very problematic to tackle, because the agent has difficulty discovering which actions can trigger the reward signal. A way to address this problem is to explore efficiently in order to discover all possible rewards present in the search space.

To address this problem, we introduce the SparsE Reward Exploration via Novelty and Emitters (SERENE) algorithm, capable of efficiently exploring a search space and optimizing any discovered reward. SERENE separates the exploration of the search space and the exploitation of the reward into two alternating processes. The first process performs exploration through Novelty Search, a divergent search algorithm. The second one exploits discovered reward areas through emitters, i.e. local instances of population-based optimization algorithms. The two processes are alternated through a meta-scheduler that splits the total computation budget into smaller chunks and assigns them to either one of the processes. This ensures the discovery and efficient exploitation of any possible disjoint reward areas. SERENE returns both a collection of diverse solutions covering the search space and a collection of high-performing solutions for each distinct reward area. We evaluate the algorithm on four different sparse-reward environments and compare the results against multiple baselines. We show that SERENE compares favorably to such baselines both in terms of exploration of the space and exploitation of all the rewarding areas in the environment. Finally, we will also propose possible extensions of the algorithm and future directions of research arising from it.
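
The exploration process above relies on Novelty Search, which typically scores a candidate by its mean distance to its k nearest neighbours in a behavior space. The sketch below shows that score only; it is not the SERENE implementation. The function name `novelty_score`, the choice of Euclidean distance and the value of `k` are assumptions made for illustration.

```python
import math


def novelty_score(behavior, reference_set, k=3):
    """Mean Euclidean distance from `behavior` to its k nearest neighbours.

    behavior: behavior descriptor of one candidate (sequence of floats)
    reference_set: descriptors to compare against, e.g. an archive of past
        behaviors (assumed not to contain the candidate itself)
    k: number of neighbours (illustrative default)
    """
    if not reference_set:
        raise ValueError("reference_set must contain at least one descriptor")
    distances = sorted(math.dist(behavior, other) for other in reference_set)
    nearest = distances[:k]  # fewer than k entries if the set is small
    return sum(nearest) / len(nearest)


# A candidate far from the archive is more novel than one inside it.
archive = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
inside = novelty_score((0.2, 0.2), archive, k=2)
outside = novelty_score((10.0, 10.0), archive, k=2)
```

Selecting candidates with the highest score pushes the search toward unvisited regions of the behavior space, independently of any reward.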

Predicting the trajectories of surrounding agents is an essential ability for autonomous vehicles navigating through complex traffic scenes. The future trajectories of agents can be inferred using two important cues: the locations and past motion of agents, and the static scene structure. Due to the high variability in scene structure and agent configurations, prior work has employed the attention mechanism, applied separately to the scene and agent configuration to learn the most salient parts of both cues. However, the two cues are tightly linked. The agent configuration can inform what part of the scene is most relevant to prediction. The static scene in turn can help determine the relative influence of agents on each other's motion. Moreover, the distribution of future trajectories is multimodal, with modes corresponding to the agent's intent. The agent's intent also informs what part of the scene and agent configuration is relevant to prediction. We thus propose a novel approach applying multi-head attention by considering a joint representation of the static scene and surrounding agents. We use each attention head to generate a distinct future trajectory to address multimodality of future trajectories. Our model achieves state-of-the-art results on the nuScenes prediction benchmark and generates diverse future trajectories compliant with scene structure and agent configuration.
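
As a rough illustration of how separate attention heads can focus on different parts of a joint scene and agent representation, here is a simplified sketch of scaled dot-product attention with one query per head. It is not the authors' model: real multi-head attention uses learned projections per head, which are replaced here by hand-written queries, and all names and values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, keys, values):
    """Scaled dot-product attention for one head.

    Returns the context vector and the attention weights over the inputs.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def multi_head_contexts(head_queries, keys, values):
    """One context vector per head; each could feed a decoder for one mode."""
    return [attention_head(query, keys, values)[0] for query in head_queries]


# Two joint scene/agent features and two heads with different queries
# (all values are illustrative).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
contexts = multi_head_contexts([[5.0, 0.0], [0.0, 5.0]], keys, values)
```

Because each head weights the inputs differently, the resulting context vectors differ, which is the property the abstract exploits to produce one trajectory per head.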

In real-world applications, an agent that interacts with objects in an unknown environment continuously perceives those objects through its sensors. A crucial property is the ability to adapt to changes in the environment and to continuously build a (visual) representation of objects of new classes while exploiting acquired knowledge, without forgetting what was learned in the past. Hence we aim to study the problem of continual learning in the context of unsupervised and continual object recognition. Among state-of-the-art research in continual learning, few works focus on unsupervised continual learning. In particular, Curl [1] proposed an algorithm for continual and unsupervised representation learning for object recognition, which best fits our objective and learning scenario. The core of the model is a Variational Autoencoder that optimizes the ELBO as its objective function and models the different categories with a Gaussian Mixture Model. The model expands when new categories are detected by adding latent encoder heads. It treats poor-quality examples as new category candidates; however, it cannot determine the number of categories automatically, and it therefore tends to over-segment the clusters. Our model extends Curl [1] by improving the new category detection process. We assume that there is temporal continuity in the data stream and that objects are learned sequentially rather than being completely shuffled and mixed. Alongside comparing the loss function to a threshold to detect poor-quality examples, we apply the Page-Hinkley test, which is commonly used to detect abrupt changes or concept drift in a data stream, corresponding in our case to the presence of an object that is different from the current instance and is new to the agent. Note that in realistic scenarios an object could appear multiple times. Thus, in combination with the Page-Hinkley test, we use the Hotelling t-test for each existing category to either accept the arriving category as a learned object or reject it as an unknown object. The improved new category detection process results in an internal self-supervision signal that guides learning. We evaluate our model on the MNIST and Fashion MNIST datasets. Compared to Curl, our model avoids the over-segmentation of categories and thus better fits the category distribution. As a result, during evaluation, the regrouping of sub-clusters with respect to ground-truth labels is reduced, which facilitates clustering.
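
The Page-Hinkley test mentioned above can be sketched in a few lines. The version below detects an upward shift in the mean of a stream, such as a jump in a per-sample loss. It is a generic textbook formulation, not the authors' code; the class name, the helper `first_alarm` and the values of `delta` and `threshold` are assumptions.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.05, threshold=5.0):
        self.delta = delta          # tolerated deviation (assumed value)
        self.threshold = threshold  # alarm threshold (assumed value)
        self.count = 0
        self.mean = 0.0             # running mean of the stream
        self.cumulative = 0.0       # cumulative deviation from the mean
        self.minimum = 0.0          # smallest cumulative deviation so far

    def update(self, value):
        """Consume one value and return True if a change is signalled."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.cumulative - self.minimum) > self.threshold


def first_alarm(stream, **kwargs):
    """Index of the first value that triggers the test, or None."""
    detector = PageHinkley(**kwargs)
    for index, value in enumerate(stream):
        if detector.update(value):
            return index
    return None


# A loss that jumps from 1.0 to 3.0 halfway through the stream.
alarm_index = first_alarm([1.0] * 50 + [3.0] * 50)
```

In the setting described above, such an alarm would mark the arrival of a candidate new object, which a second test then accepts as known or rejects as unknown.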

**Date:** 2021-06-29

**Location:** Virtual / Zoom

**Scientific themes:**

B - Image and Vision

T - Learning for signal and image analysis

(c) GdR 720 ISIS - CNRS - 2011-2022.