How AI is Generating Images from Captions


AI continues to evolve and improve at a rapid pace. Advanced models are now learning to generate images from simple text-based captions. OpenAI’s GPT-3 model recently captured the world’s attention with its ability to create poems, short stories, and songs when given little direction.

OpenAI’s GPT-3 model was so advanced that many people could not differentiate computer-generated text from actual human writing. Even so, critics argued the technology amounted more to a clever parlor trick than to genuine understanding.


Nevertheless, technology researchers have begun building on GPT-3’s approach to develop other advanced capabilities. GPT-3 can analyze large amounts of text, but can similar models handle text and images together?


Text and Image Modeling


The Allen Institute for Artificial Intelligence, also known as AI2, has created a new text-and-image model that generates images when given a single caption. The visual language model creates oddly designed pictures that resemble mashed-up images. Although the images do not appear hyperrealistic like many deepfakes, the technology hints at powerful future capabilities.


GPT-3 and other similar technologies fall under a category of models known as “transformers.” These models power predictive applications such as autocomplete and, at sufficient scale, can generate complete grammatical sentences.


Google’s BERT and other similar forms of AI are trained with a technique known as “masking”: words in a sentence are hidden, and the model learns to fill in the gaps.


Examples may include:

  • The man drove the _____ to work today.
  • They bought a _____ of bread to make sandwiches.


As the model practices these exercises, it discovers patterns in how words and sentences fit together. The result is stronger text prediction and language understanding. Although the concept was first applied to text, researchers have since extended it to visual-language models.
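The fill-in-the-blank exercises above can be imitated in miniature. The sketch below is a toy illustration only, not BERT’s actual neural approach: the corpus and the simple context-counting heuristic are invented for this example, standing in for the web-scale training data and learned bidirectional attention a real model uses.

```python
from collections import Counter

# Tiny invented corpus standing in for the web-scale text BERT trains on.
corpus = [
    "the man drove the car to work today",
    "the woman drove the car to the office",
    "they bought a loaf of bread to make sandwiches",
    "she bought a loaf of bread at the bakery",
]

def predict_masked(sentence, mask="_____"):
    """Fill the masked slot by counting which corpus words appear
    between the same left and right neighbors -- a crude stand-in
    for a transformer's learned masked-word prediction."""
    words = sentence.lower().split()
    i = words.index(mask)
    left, right = words[i - 1], words[i + 1]
    candidates = Counter()
    for line in corpus:
        toks = line.split()
        for j in range(1, len(toks) - 1):
            if toks[j - 1] == left and toks[j + 1] == right:
                candidates[toks[j]] += 1
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_masked("The man drove the _____ to work today."))  # car
print(predict_masked("They bought a _____ of bread to make sandwiches."))  # loaf
```

A real masked language model generalizes far beyond exact context matches, but the training signal is the same: hide a word, reward the model for recovering it.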


How Image Masking Works


Visual-language models look at the surrounding words and image regions to fill in missing gaps. Using pattern recognition across both modalities, the AI learns the relationships between words, images, and context.


This AI technology can closely relate text descriptions to visual references, similar to how children learn at a young age. The visual-language models can reference images and identify specific objects, people, and activities after several repetitions.


AI2 researchers wanted to determine whether such models had developed a conceptual understanding of the world. Humans can picture an object from its name even when the object is not physically present. The researchers wondered if AI models could do the same, essentially generating images by reading text-based captions. Although the technology worked to an extent, the results left much to be desired.


The AI models synthesized pixelated patterns that did not closely resemble real images. The problem is that transforming text into images is far more difficult than converting images into text. Captions provide little of the detail an image contains, leaving many gaps for the AI to fill.


AI models would have to possess common sense about the world to fill in those details appropriately. A request to draw “a giraffe on the road” says nothing about colors, shapes, or surrounding context. Without that implicit information, the technology will not function properly.


Researchers at AI2 decided to tweak the technology by predicting masked pixels in photos based on corresponding captions. The final images don’t appear entirely realistic but contain many high-level visual elements resembling actual images. The ability of visual-language models to create near-realistic representations is a step forward for future AI technologies.
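The idea of predicting masked pixels from a caption can be shown with a deliberately tiny sketch. Everything below is hypothetical: the four-pixel “images,” the captions, and the word-overlap weighting are invented for illustration, and they stand in for a trained transformer that predicts masked image regions conditioned on text.

```python
# Toy dataset: flat four-pixel "images" paired with captions.
dataset = [
    ("a giraffe on the road", [0.8, 0.8, 0.8, 0.8]),
    ("a giraffe in the grass", [0.6, 0.6, 0.6, 0.6]),
    ("a car on the road",      [0.3, 0.3, 0.3, 0.3]),
]

def predict_masked_pixels(caption, image, masked):
    """Fill masked pixel positions with a caption-weighted average of
    the training images -- a crude stand-in for a model that predicts
    masked image regions conditioned on text."""
    query = set(caption.lower().split())
    filled = list(image)
    for i in masked:
        num = den = 0.0
        for cap, img in dataset:
            w = len(query & set(cap.split()))  # shared caption words
            num += w * img[i]
            den += w
        filled[i] = num / den
    return filled

# Pixels 1 and 2 are masked; the caption steers their reconstruction.
result = predict_masked_pixels("a giraffe on the road",
                               [0.8, 0.0, 0.0, 0.8], masked=[1, 2])
print(result)  # masked pixels filled with ~0.583
```

The real training loop works on image patches and learned embeddings rather than word overlap, but the shape of the task is the same: hide part of the picture and ask the model to reconstruct it from the text.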


Other technologies, such as robotics, may benefit from these advances. The team plans further experiments to improve both the quality of the generated images and the models’ linguistic vocabulary.


Creating AI-Ready Data


AI is continuing to advance with an array of new capabilities. As the technology improves and expands, organizations should position themselves to take full advantage of it.


Bitvore uses unstructured datasets to create AI-ready data. Our advanced AI techniques and machine learning models eliminate the massive manual effort required to research companies, industries, and markets from unstructured text available on the internet. Bitvore helps improve decision-making processes by providing immediate, quantifiable results.


For additional information on how Bitvore can improve your business operations, download our white paper: Tractable Understanding of the World Using AI + NLP.

