Multi-modal 3D Vision

Investigates the uncertainties and ambiguities in 3D vision problems such as pose estimation and reconstruction.

A majority of tasks in computer vision can be interpreted as scene understanding problems conditioned on either 2D image or 3D scan modalities. Usually, these scenes are digitizations of our man-made environments, composed of objects. Hence, a fundamental piece of this perception problem is pose estimation, i.e. figuring out how these objects are positioned and oriented in 3D space. A rigid transformation is a six degrees of freedom (6-DoF) entity describing the pose of either an acquisition device (e.g. a Lidar or a camera) or an object enclosed within the captured data. Solving for the former is known as camera relocalization, while the latter is 3D object pose estimation. Both are now key technologies enabling a multitude of applications such as augmented reality, autonomous driving, human-computer interaction and robot guidance, thanks to their extensive integration in simultaneous localization and mapping (SLAM) [24, 32, 83], structure from motion (SfM), metrology, visual localization and 3D object detection.

A myriad of papers have worked on finding the unique solution to the pose estimation problem: a single pose per view/scan. This trend, however, now faces a fundamental challenge. A recent school of thought has begun to point out that for our highly complex and ambiguous real environments, obtaining a single solution, i.e. the correct pose, is simply not sufficient. For example, images of a scene with repeating structures can look nearly identical even though the location and orientation of the capture device differ drastically. Likewise, objects with rotational symmetries lead to very similar point clouds when scanned from different viewpoints, say with a laser scanner. These observations have led to a paradigm shift that has opened a multitude of research directions. Instead of estimating a single solution, methods now propose to predict a range of solutions: multiple pose hypotheses, predictions with associated uncertainties, or even full probability distributions over poses.
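To see why symmetric geometry forecloses a unique answer, consider the following minimal NumPy sketch (illustrative only; `sample_cylinder`, `chamfer` and the specific numbers are our own choices, not taken from the papers below). It simulates scanning a cylinder, applies the same large rotation once about the symmetry axis and once about a non-symmetry axis, and compares how much the observed point cloud actually changes:

```python
import numpy as np

def rot(theta, axis):
    """Rotation matrix by angle theta (rad) about a unit axis (Rodrigues)."""
    a = np.asarray(axis, dtype=float)
    a /= np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def sample_cylinder(n=1000, radius=1.0, height=2.0, seed=0):
    """Points on the lateral surface of a cylinder, symmetric about z."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    z = rng.uniform(-height / 2.0, height / 2.0, n)
    return np.stack([radius * np.cos(phi), radius * np.sin(phi), z], axis=1)

def chamfer(a, b):
    """Symmetric Chamfer distance between two point clouds (brute force)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

cloud = sample_cylinder()
theta = np.deg2rad(120.0)

# Rotating about the symmetry axis is a large pose change, yet the observed
# geometry barely moves (the residual is just sampling noise).
print(chamfer(cloud, cloud @ rot(theta, [0, 0, 1]).T))  # small

# Rotating about a non-symmetry axis changes the observation drastically,
# so along this direction the pose is well determined.
print(chamfer(cloud, cloud @ rot(theta, [1, 0, 0]).T))  # much larger
```

Because every rotation about the symmetry axis explains the observation equally well, a single-pose estimator has no principled way to choose among them; representing the whole one-parameter family of solutions, e.g. as multiple hypotheses or a probability distribution over poses, is what the works below pursue.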

2020

  1. 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference
    Mai Bui, Tolga Birdal, Haowen Deng, and 4 more authors
    In Eur. Conf. Computer Vision (ECCV), 2020

2022

  1. Deep Bingham Networks: Dealing with Uncertainty and Ambiguity in Pose Estimation
    Haowen Deng, Mai Bui, Nassir Navab, and 3 more authors
    Int. Journal of Computer Vision (IJCV), 2022

2024

  1. Attending to Topological Spaces: The Cellular Transformer
    Rubén Ballester, Pablo Hernández-García, Mathilde Papillon, and 6 more authors
    arXiv preprint arXiv:2405.14094, 2024

  2. Position: Topological Deep Learning is the New Frontier for Relational Learning
    Theodore Papamarkou, Tolga Birdal, Michael M Bronstein, and 8 more authors
    In Int. Conf. Machine Learning (ICML), 2024