AI EDUCATION: What Is a Vision Language Model?


Each week we find a new topic for our readers to learn about in our AI Education column. 

How do we say what we see? 

You’re probably reading this on the screen of a computer or handheld device. Think about the shape of that screen—which is almost certainly a rectangle of some sort. How does our brain take the image of the rectangle received from our eyes, then retrieve and report the right word, “rectangle,” to our vocal cords and mouth?

Science has already answered that question. Neurologists have mapped out how the brain takes visual data from our eyes into the occipital lobe, moves it to the temporal lobe for processing, and then uses other parts of the brain to retrieve the language it associates with the information from the image.

That sounds strangely like the kind of workflow that a computer system could mimic, and, indeed, there’s a lot of artificial intelligence- and machine learning-related technology being built to intervene in, assist, or replicate and scale that process. We’re already starting to use AI to convert thoughts into text and images—which might make for a good future “science non-fiction” edition of AI Education. 

This week’s topic is a little bit less wild than AI that reads your thoughts, but it’s still pretty cool. It’s a type of AI that sees your images and video, and it’s called a vision language model, or VLM. 

What Is a VLM? 

A vision language model is a compound AI system combining computer vision with natural language processing. In technical terms, a VLM is most often a blend of three components: a vision encoder, a projector, and a language model—typically a large language model (LLM). Many of us have already encountered VLMs in everyday use in the form of Google Gemini 2.0 Flash, Flamingo, LLaVA, and DeepSeek-VL2. OpenAI introduced a VLM with GPT-4V, then incorporated it fully into GPT-4o and has made it part of its standard offering ever since. Google has followed suit with Gemini, Anthropic with Claude, and so have other public-facing LLMs. In other words, these things are everywhere, and if you haven’t used one, you probably will at some point. 

The vision encoder is the eyes of the VLM. Using machine learning, this part of the VLM is trained on a massive volume of image-text pairs. The projector is a set of neural network layers sitting between the vision encoder and the ears and voice of the VLM: the large language model (LLM). These interstitial layers translate and move data between the vision encoder and the LLM. The LLM accepts instructions from users as natural language prompts and can deliver text output in the format they request, creating a chatbot-like interface for the VLM.  

This is another piece of technology that sounds deceptively simple. There’s some serious technological sausage-making involved in training a VLM so that the vision encoder, the projector layers and the LLM all comprehend the same data the same way. Both the LLM and the vision encoder transform data into tokens, so the VLM is able to treat data derived from images almost as if it were already words. 
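To make that three-part architecture concrete, here is a toy Python sketch of the data flow. The function names, dimensions and arithmetic are all invented for illustration and stand in for billions of learned parameters; no real VLM works with numbers this simple:

```python
def vision_encoder(patches, dim=4):
    # Toy "encoder": map each image patch (a list of pixel values)
    # to a fixed-size embedding. A real encoder is a trained network.
    return [[sum(p) / len(p)] * dim for p in patches]

def projector(vision_embeddings, llm_dim=6):
    # Toy projector: reshape each vision embedding to the LLM's
    # embedding width so it can sit in the same sequence as text tokens.
    return [(e + [0.0] * llm_dim)[:llm_dim] for e in vision_embeddings]

def embed_text(tokens, llm_dim=6):
    # Toy text embedding: one vector per token id.
    return [[float(t)] * llm_dim for t in tokens]

# An image split into two patches, plus a three-token text prompt.
patches = [[0.1, 0.3], [0.5, 0.7]]
text_tokens = [7, 8, 9]

# Image data and text data end up in one shared token sequence
# (2 image tokens + 3 text tokens) that the LLM processes together.
image_tokens = projector(vision_encoder(patches))
sequence = image_tokens + embed_text(text_tokens)
print(len(sequence))  # 5
```

In a real VLM the projector is itself a small learned neural network and the "tokens" are high-dimensional learned embeddings; the point of the sketch is only that image-derived data and text end up in one sequence the language model can treat uniformly.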

Basically, the vision encoder-LLM combination allows large language model technology to “see” images, video and text rather than merely read text. Instead of only inferring the relationships between words, letters and phrases, VLMs can find and understand the relationships between text and visual data, and can therefore learn simultaneously from images and text.  

Why Are VLMs Useful? 

A vision language model can accept inputs in the form of text, images, video, audio and digital documents in their native formats—like PDFs. However, one of the big benefits of a vision language model is that it can be instructed in plain text. While VLMs can generate outputs as visual or text content, to this point they have been used mainly to generate text—though VLMs are increasingly being used to generate images from natural language prompts, too. 

A vision language model can be used to analyze images, summarize videos, parse documents and help power multi-modal chatbots. VLMs originally evolved from video captioning technology—they now power the technology that automatically captions images and subtitles videos uploaded to the web. VLMs are similarly used in image and video summarization. A VLM also enables text and voice visual question answering, making VLMs powerful accessibility tools for vision- and mobility-impaired users. A VLM can identify colors, shapes, textures and patterns within images or groups of images. A VLM’s summarization and information-extracting capabilities are also useful in a general assistant—for example, a VLM could watch an episode of a painting show and come up with step-by-step instructions for creating a specific piece of art, or produce a recipe from analyzing a cooking video. A VLM could also read handwritten documents and convert them to plain text.  

Vision-language models are capable of performing multiple functions. While more conventional, pre-existing computer vision (CV) models based on convolutional neural networks were bound to specific tasks, VLMs are multi-taskers. CV models are trained to do one thing very well, like recognizing and reporting text within images, or classifying images based on their content. A VLM can do any or all of these kinds of tasks. VLMs are an important evolution, potentially saving the time and resources required to retrain multiple CV models. 

Then there’s scale. VLMs are also capable of seeing and summarizing massive amounts of content—albeit a little bit at a time. VLMs can be used to help physical AI systems, like robots and autonomous vehicles, analyze and interpret the visual data they take in. Still, if anyone is going to sit through all the world’s security and traffic camera footage to help find the crooks and crazy drivers causing crashes among us, it’s going to be an AI model, and it might very well be a VLM. On a larger and longer scale, VLMs might be used to scan satellite data over months or years to help track climate and weather patterns, or be attuned to telescope data to search for objects—and creatures—among the stars.