Contract type : Fixed-term contract
Level of qualifications required : Graduate degree or equivalent
Other valued qualifications : Engineer or Master's diploma in computer science, machine learning, and/or other relevant domains
Function : PhD Position
Level of experience : Recently graduated
About the research centre or Inria department
The Inria Université Côte d’Azur center comprises 36 research teams and 7 support departments. The center's staff (about 500 people, including 320 Inria employees) is made up of scientists of different nationalities (250 foreigners of 50 nationalities), engineers, technicians, and administrative staff. One third of the staff are civil servants; the others are contractual agents. The majority of the center’s research teams are located in Sophia Antipolis and Nice in the Alpes-Maritimes. Four teams are based in Montpellier, and two teams are hosted abroad, in Bologna (Italy) and Athens (Greece). The center is a founding member of Université Côte d'Azur and a partner of the I-site MUSE supported by the University of Montpellier.
ANR CREATTIVE3D (https://project.inria.fr/creattive3d/) is a project funded by the French National Research Agency (ANR) that aims to establish a framework for creating attention-driven 3D environments for training low-vision navigation tasks. For patients with age-related macular degeneration (AMD; DMLA in French), which causes a loss of central vision that obscures part of the visual field, navigating complex and potentially dangerous environments autonomously and safely is one of the most demanding and dangerous tasks. Immersive environments and virtual reality (VR) technologies hold strong potential for training low-vision navigation tasks [Rap18]. Yet the use of VR in these contexts is impeded by the difficulty of creating adapted 3D content that simulates a large range of real-life situations, and of studying its effectiveness for training and rehabilitation. This limits incentives for accompanying staff to familiarize themselves with VR and adopt it broadly. The project starts with the study and modeling of user behavior during navigation tasks in context, and then uses this understanding to build assisted creativity tools for designing adapted training scenarios.
This thesis focuses on the intermediate goal: modeling and predicting user behavior when navigating contextual environments. The greatest challenges in modeling human behavior are to represent its uncertainty, given the wide differences in physiology, experience, and perception between individual users, and to make the predictions interpretable, so as to allow human-in-the-loop intervention and the use of predictions for decision making. When deploying predictive systems in real-world applications, accounting for both uncertainty and interpretability requirements in machine learning models is therefore highly important for personalizing user experiences and, more generally, for the safety of users.
Key to this are (1) understanding the impact of the scene context on user behavior, and (2) understanding the interrelation between components of user behavior (e.g., attention, motion, and emotion). This thesis is situated at the intersection of these two aspects, using deep neural networks (DNN) to model user behavior in 6 Degrees-of-Freedom (6DoF) navigation tasks in annotated immersive VR environments. The thesis project will address three challenges:
1. Investigating representation learning methods for graphs and images to represent 3D scene context for learning models
2. Modeling uncertainty in human behavior by predicting multiple future behaviors (gaze, motion, and combined multimodal components)
3. Using contextual information to address interpretability of user behavior (gaze, motion, and combined multimodal components)
The datasets involved in this thesis will come from both existing and new sources. We focus on datasets that involve human motion data and provide contextual information, such as the GTA-IM dataset of synthetic human motion [Cao20], the KIT Whole-Body Human Motion Database [Man15], and the PROX dataset of real-world interactions with objects [Has19].
A rich dataset of users performing road-crossing tasks in annotated 3D scenes in VR, with captured gaze, motion, and physiology data, is planned for the early stages of the thesis. The PhD student will join the research team and participate in the data collection process during the first year, to become familiar with the different types of data that will be explored throughout the thesis. Note that establishing this dataset is not a main goal, and the thesis project will not hinge on its availability. The remainder of this description applies to both the existing and the new datasets.
In the first half of the thesis, the student will investigate multimodal deep neural networks for user motion and gaze prediction, that is, networks whose inputs relate both to previous user patterns and to the content of the 3D scene. A first step of thorough data analysis will investigate the interdependency between context and behavior, and the potential delays between visual stimuli and user behavior responses, which could be further amplified by the simulated low-vision conditions in our own dataset. Model design will start by benchmarking and refining Recurrent Neural Networks (RNN) [Pal18] and Transformers (i.e., models with attention mechanisms) [Mao20], which have previously been used for motion prediction tasks but not yet explored for contextual and multimodal data. We draw inspiration from Romero et al. [RoS21], who analyze the performance of recent deep learning-based approaches for predicting head motion and gaze in VR and detail the design of an RNN-based prediction architecture within a Seq2Seq framework that establishes state-of-the-art performance. Different representations of the scene content will be investigated: unstructured or partly structured, in the form of point clouds with coarse semantic information, and structured, in the form of scene graphs representing the different objects, their categories, affordances, and relationships (obtained from annotated 3D contexts with, e.g., the domain-specific language presented in [Wu22]). Representation learning methods such as Variational Autoencoders or graph embedding techniques for scene graphs [Ham17] will be adopted to obtain a World Model, a concise representation of a 3D scene popularly adopted for robotics control tasks [HaS18].
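To make the structured representation more concrete, the following is a minimal sketch of a scene graph and one round of neighborhood averaging, a simple building block of graph embedding methods. It is an illustration only, not the project's method: the class `SceneGraph`, the function `mean_aggregate`, the object names, the relation labels, and the three-dimensional feature vectors are all hypothetical, and no weights are learned here.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: nodes carry feature vectors, edges carry a relation label."""
    features: dict = field(default_factory=dict)  # object name -> list of floats
    edges: list = field(default_factory=list)     # (source, relation, target) triples

    def add_object(self, name, feature_vector):
        self.features[name] = list(feature_vector)

    def relate(self, source, relation, target):
        self.edges.append((source, relation, target))

    def neighbors(self, name):
        """Objects linked to `name` by any relation, in either direction."""
        linked = []
        for source, _, target in self.edges:
            if source == name:
                linked.append(target)
            elif target == name:
                linked.append(source)
        return linked


def mean_aggregate(graph):
    """One round of neighborhood averaging: each node's new vector is its own
    features concatenated with the mean of its neighbors' features."""
    embeddings = {}
    for name, own in graph.features.items():
        neighbor_vectors = [graph.features[n] for n in graph.neighbors(name)]
        if neighbor_vectors:
            mean = [sum(column) / len(neighbor_vectors)
                    for column in zip(*neighbor_vectors)]
        else:
            mean = [0.0] * len(own)  # isolated node: no context to add
        embeddings[name] = own + mean
    return embeddings


scene = SceneGraph()
# Hypothetical hand-made features: [is_walkable, is_moving, is_signal]
scene.add_object("crosswalk", [1.0, 0.0, 0.0])
scene.add_object("car", [0.0, 1.0, 0.0])
scene.add_object("traffic_light", [0.0, 0.0, 1.0])
scene.relate("car", "approaches", "crosswalk")
scene.relate("traffic_light", "controls", "crosswalk")

embeddings = mean_aggregate(scene)
print(embeddings["crosswalk"])
```

In this toy scene, the vector for the crosswalk ends up mixing in information about the car and the traffic light connected to it, which is the kind of context a behavior prediction model could consume. A learned approach would replace the plain average with trainable transformations and would typically take the relation labels into account.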
The second half of the thesis will be devoted to motion uncertainty and model interpretability. On the one hand, uncertainty in motion prediction has previously been addressed using variational autoencoders, whose latent space is sampled to obtain different destinations, for multiple head motion prediction [Gui22] and 3D pose estimation [Cao20], and using Mixture Density Networks (MDN), which model potential trajectories of 2D pose as Gaussian mixture models [Cho18]. This thesis would be the first to explore uncertainty in relation to contextual information and within immersive 6DoF environments. On the other hand, interpretability will be key for the 3D designer to understand what to change in the scene to influence user attention. Two major families of approaches will be investigated for this purpose: prototype learning and attribution-based methods. Prototype learning consists of comparing the encoded input to exemplar cases and presenting the ones closest to the input as justification for the decision [Min19]. Attribution-based methods, in turn, investigate the effect an input feature has on the output, as in the work of Chattopadhyay et al. [Cha19], who provide algorithmic proofs for both feedforward networks and RNNs and demonstrate causal attributions on an LSTM trained to predict airplane flight trajectories.
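As a rough illustration of how a mixture density output can express several plausible futures, the sketch below evaluates the negative log-likelihood of an observation under a one-dimensional Gaussian mixture and samples multiple candidate futures from it. It is a simplified stand-in under stated assumptions: the mixture parameters are hard-coded rather than produced by a network, the quantity predicted (a lateral displacement in meters) is hypothetical, and the function names `mixture_nll` and `sample_futures` are chosen for this example only.

```python
import math
import random


def mixture_nll(target, weights, means, sigmas):
    """Negative log-likelihood of a scalar target under a 1D Gaussian mixture."""
    density = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        density += w * norm * math.exp(-0.5 * ((target - mu) / sigma) ** 2)
    return -math.log(density)


def sample_futures(weights, means, sigmas, count, rng):
    """Draw `count` candidate futures: pick a component, then sample from it."""
    samples = []
    for _ in range(count):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], sigmas[k]))
    return samples


# Hypothetical mixture for the next lateral displacement (meters): the user
# either keeps walking straight (mode near 0.0) or steps aside (mode near 1.0).
weights = [0.7, 0.3]
means = [0.0, 1.0]
sigmas = [0.1, 0.2]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
futures = sample_futures(weights, means, sigmas, count=5, rng=rng)
print(futures)
print(mixture_nll(0.05, weights, means, sigmas))
```

In an actual MDN, a neural network would output the weights, means, and standard deviations for each input, and `mixture_nll` (in a numerically stabilized form) would serve as the training loss. Keeping two separate modes, rather than averaging them, is what lets the model represent distinct behaviors instead of an implausible in-between trajectory.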
[AMJ18] Alvarez-Melis, D., & Jaakkola, T. S. (2018). Towards robust interpretability with self-explaining neural networks. arXiv preprint arXiv:1806.07538.
[Cao20] Cao, Z., et al. (2020, August). Long-term human motion prediction with scene context. In European Conference on Computer Vision (pp. 387-404). Springer, Cham.
[Cha19] Chattopadhyay, A., et al. (2019). Neural network attributions: A causal perspective. In International Conference on Machine Learning (pp. 981-990). PMLR.
[Cho18] Choi, S., et al. (2018). Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6915-6922).
[Gui22] Guimard, Q., Sassatelli, L., Marchetti, F., Becattini, F., Seidenari, L., & Del Bimbo, A. (2022, June). Deep variational learning for multiple trajectory prediction of 360° head movements. In ACM International Conference on Multimedia Systems (MMSys), Athlone, Ireland.
[Ham17] Hamilton, W. L., et al. (2017). Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.
[HaS18] Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
[Has19] Hassan, M., et al. (2019). Resolving 3D human pose ambiguities with 3D scene constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2282-2292).
[Man15] Mandery, C., et al. (2015, July). The KIT whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR) (pp. 329-336). IEEE.
[Mao20] Mao, et al. (2020, August). History repeats itself: Human motion prediction via motion attention. In European Conference on Computer Vision (pp. 474-489). Springer, Cham.
[Min19] Ming, Y., et al. (2019). Interpretable and steerable sequence learning via prototypes. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 903-913).
[Pal18] Palmero, C., et al. (2018). Recurrent CNN for 3D gaze estimation using appearance and shape cues. arXiv preprint arXiv:1805.03064.
[Rap18] Raphanel, et al. (2018). Current practice in low vision rehabilitation of age-related macular degeneration and usefulness of virtual reality as a rehabilitation tool. Journal of Aging Science.
[RoS21] Romero, M., Sassatelli, L., et al. (2021). TRACK: A new method from a re-examination of deep architectures for head motion prediction in 360-degree videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[Wu22] Wu, H.-Y., Robert, F., Fafet, T., Graulier, B., Passin-Cauneau, B., Sassatelli, L., & Winckler, M. (2022). Creating embodied experiences in virtual reality. Proceedings of ACM on Human-Computer Interactions.
Study and analyze the related work
Propose solutions and insights to the research questions
Implement said solutions with suitable tools and languages (Python)
Participate in regular research discussions
Publish and present the outcomes and results of work to various audiences
Strong basis in machine learning and deep neural networks supported by coursework, previous project outcomes, and code repositories
Excellent Python programming skills and experience with ML libraries (TensorFlow, scikit-learn, PyTorch)
Previous experience in processing temporal data streams
Strong willingness to read, analyze, and understand the state of the art, and implement it where needed
A good level of written and spoken English
Knowledge of 3D virtual environments
Other: Code management (Git)
Experience with 3D content development (Unity, OpenGL)
Knowledge of virtualization (Docker, Singularity) and of working on remote servers (SSH)
Partial reimbursement of public transport costs
Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
Possibility of teleworking and flexible organization of working hours
Professional equipment available (videoconferencing, loan of computer equipment, etc.)
Social, cultural and sports events and activities
Access to vocational training
Social security coverage
Gross salary per month: €1,982 (years 1 and 2) and €2,085 (year 3)