Vision Banana Surpasses Specialised AI Vision Systems
Google DeepMind has developed a single artificial intelligence model capable of performing the functions of multiple specialised computer vision systems, and it exceeds their performance on several benchmark tests. The model, named Vision Banana, is described in a technical report published on arXiv and was presented at the International Conference on Learning Representations.
The system combines capabilities that have traditionally required separate models. Researchers adapted an existing image generation system, Nano Banana Pro, and refined it through instruction tuning with natural language prompts. This produced a unified model that, with a single set of parameters, performs object segmentation, depth estimation from a single image, surface orientation prediction, and image generation.
The method entails providing an input image together with a natural language instruction describing the required task. The model generates an output image representing the result, which can then be converted into structured data. In this way, different forms of visual analysis are handled within one system.
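To make the interface concrete, here is a minimal sketch of that workflow in Python. The function `run_vision_banana`, the prompt wording, and the decoding step are illustrative assumptions, not the published API; the report describes only the general pattern of image plus instruction in, image out.

```python
from PIL import Image
import numpy as np

def run_vision_banana(image: Image.Image, instruction: str) -> Image.Image:
    """Hypothetical stand-in for the model: image + instruction in, image out.
    Here it simply echoes the input; a real call would invoke the model's
    inference endpoint."""
    return image.copy()

# One unified interface covers many tasks; only the instruction changes.
photo = Image.new("RGB", (640, 480), color=(80, 140, 200))  # placeholder scene

seg_map   = run_vision_banana(photo, "Segment the image into semantic regions.")
depth_map = run_vision_banana(photo, "Estimate the depth of each pixel.")

# A generated output image is then decoded into structured data, e.g.
# a grayscale depth rendering becomes a float array of relative distances.
depth = np.asarray(depth_map.convert("L"), dtype=np.float32) / 255.0
print(depth.shape, depth.min(), depth.max())
```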
Vision Banana performs several forms of segmentation. In semantic segmentation, it identifies and labels regions within an image, such as separating people, the ocean, and the sky in a beach scene. In instance segmentation, it distinguishes between individual objects of the same category, for example identifying separate pieces of garlic within a single image. In referring expression segmentation, it isolates specific objects described in natural language, differentiating between elements based on textual input alone.
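One plausible way to turn a generated segmentation image into structured data is to map a fixed colour palette back to class labels. The palette and labels below are invented for illustration; the report does not specify the encoding.

```python
import numpy as np

# Invented palette: the actual colour-to-class encoding is an assumption here.
PALETTE = {
    (220,  20,  60): "person",
    (  0,  90, 180): "ocean",
    (135, 206, 235): "sky",
}

def masks_from_palette(seg_rgb: np.ndarray) -> dict[str, np.ndarray]:
    """Convert an (H, W, 3) colour-coded segmentation image into
    one boolean mask per class."""
    masks = {}
    for colour, label in PALETTE.items():
        masks[label] = np.all(seg_rgb == np.array(colour, dtype=seg_rgb.dtype),
                              axis=-1)
    return masks

# Example with a tiny synthetic segmentation image.
seg = np.zeros((2, 2, 3), dtype=np.uint8)
seg[0, :] = (135, 206, 235)   # sky
seg[1, :] = (0, 90, 180)      # ocean
for label, mask in masks_from_palette(seg).items():
    print(label, int(mask.sum()), "pixels")
```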
The model also carries out three-dimensional interpretation tasks. In monocular depth estimation, it predicts the relative distance of objects from a single photograph and produces a depth map. In surface normal estimation, it determines the direction in which surfaces are facing within an image. These functions are used in applications including robotics, augmented reality, autonomous systems, and spatial computing.
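The 3-D outputs can likewise be read back from images. A common convention, assumed here rather than taken from the report, encodes a unit normal vector n in [-1, 1]^3 as an RGB pixel via p = (n + 1) / 2; inverting that mapping recovers per-pixel surface orientations, and a grayscale rendering can be read as relative depth.

```python
import numpy as np

def decode_normals(normal_rgb: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 normal-map image back to unit vectors,
    assuming the common p = (n + 1) / 2 encoding."""
    n = normal_rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return n

def decode_relative_depth(depth_gray: np.ndarray) -> np.ndarray:
    """Map an (H, W) uint8 grayscale rendering to relative depth in [0, 1]."""
    return depth_gray.astype(np.float32) / 255.0

# A pixel of (128, 128, 255) decodes to a normal pointing roughly at the camera.
pixel = np.array([[[128, 128, 255]]], dtype=np.uint8)
print(decode_normals(pixel))   # approximately [0, 0, 1]
```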
Performance was measured under zero-shot conditions, meaning the model was tested on datasets it had not been trained on for those tasks. On segmentation benchmarks, Vision Banana achieved higher scores than multiple established systems. On the Cityscapes dataset, it recorded a mean Intersection over Union score of 0.699, compared with 0.652 for SAM 3 and 0.520 for X-Decoder. On the ReasonSeg benchmark, which evaluates segmentation requiring reasoning, the model paired with Gemini 2.5 Pro reached a score of 0.793.
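For reference, mean Intersection over Union, the metric behind these segmentation numbers, averages per-class IoU: the overlap between prediction and ground truth divided by their union. A minimal sketch of the computation:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union over classes present in the ground truth.
    `pred` and `gt` are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if g.any():  # skip classes absent from the ground truth
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, num_classes=2))  # 0.5 * (1/2 + 2/3) = 0.583...
```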
In depth estimation tasks, Vision Banana also achieved the highest performance. Averaged across six benchmarks, it scored 0.882, compared with 0.823 for UniK3D and 0.715 for Depth Pro. In surface normal estimation, it produced the lowest error rates among the systems tested. The model functions without camera intrinsics, which are commonly required for depth estimation.
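Higher-is-better depth scores of this kind are typically threshold accuracies such as delta_1, the fraction of pixels whose predicted and true depths agree within a factor of 1.25. Whether that is the exact metric behind the 0.882 figure is an assumption here, not a claim from the report.

```python
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels where max(pred/gt, gt/pred) < 1.25, the common
    delta_1 depth metric (assumed here, not confirmed as the report's)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < 1.25).mean())

gt   = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 2.6, 4.0])
print(delta1_accuracy(pred, gt))  # 2/3: the middle pixel is off by 1.3x
```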
The research involved 25 contributors at Google DeepMind. The project was led by Valentin Gabeur, Shangbang Long, and Songyou Peng, with additional involvement from Kaiming He, Saining Xie, and Thomas Funkhouser.
The report presents generative pretraining as a foundation for visual understanding: a model trained for image generation is adapted, via natural language instructions, to perform a range of visual analysis tasks.
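A hedged sketch of what instruction-tuning data for such a model might look like is given below. The field names, file names, and prompts are illustrative assumptions about the training format, not details taken from the report.

```python
from dataclasses import dataclass

@dataclass
class InstructionExample:
    """One instruction-tuning triple: the model learns to map
    (input image, instruction) -> target image. Names are illustrative."""
    input_image: str    # path to the source photograph
    instruction: str    # natural language description of the task
    target_image: str   # path to the rendered ground-truth output

dataset = [
    InstructionExample("beach.jpg", "Segment the image into semantic regions.",
                       "beach_semantic.png"),
    InstructionExample("kitchen.jpg", "Estimate the depth of each pixel.",
                       "kitchen_depth.png"),
    InstructionExample("garlic.jpg", "Outline each piece of garlic separately.",
                       "garlic_instances.png"),
]
for ex in dataset:
    print(ex.instruction, "->", ex.target_image)
```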