The artificial intelligence research group OpenAI has created a new version of DALL-E, its text-to-image generation program. DALL-E 2 features a higher-resolution, lower-latency version of the original system, which generates images depicting descriptions written by users. It also includes new capabilities, such as editing an existing image. As with previous OpenAI work, the tool is not being released directly to the public. But researchers can sign up online to preview the system, and OpenAI hopes to make it available for use in third-party apps later.
The original DALL-E, a portmanteau of the artist “Salvador Dalí” and the robot “WALL-E”, debuted in January 2021. It was a limited but fascinating test of AI’s ability to visually represent concepts, from mundane depictions of a mannequin in a flannel shirt to “a giraffe made of turtle” or an illustration of a radish walking a dog. At the time, OpenAI said it would continue to build on the system while investigating potential hazards such as bias in image generation or the production of misinformation. It is now trying to address these issues through technical safeguards and a new content policy, while also reducing its computing load and pushing the basic capabilities of the model forward.
One of the new DALL-E 2 features, inpainting, applies DALL-E’s text-to-image capabilities at a more granular level. Users can start with an existing image, select an area and tell the model to edit it. You can block out a painting on a living room wall and replace it with another picture, for example, or add a vase of flowers on a coffee table. The model can fill in (or remove) objects while taking into account details such as the direction of the shadows in a room. Another feature, variations, is something like an image search tool for images that do not exist. Users can upload a starting image and then create a series of variations similar to it. They can also blend two images, generating pictures that have elements of both. The generated images are 1,024 x 1,024 pixels, a leap over the 256 x 256 pixels provided by the original model.
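To make the inpainting idea concrete, here is a minimal Python sketch of the masking step only: pixels inside the user's selection are replaced, and everything else is copied from the original. The names `inpaint`, `fake_generator` and `pixel_count` are hypothetical illustrations, not part of any OpenAI tool, and the stand-in generator simply paints white, whereas the real model would produce content conditioned on the text prompt and the surrounding image. The last line works out the resolution jump described above.

```python
from typing import Callable, List

Image = List[List[int]]  # grayscale pixel values, stored row by row
Mask = List[List[bool]]  # True marks pixels the user selected for editing


def inpaint(image: Image, mask: Mask, generate: Callable[[int, int], int]) -> Image:
    """Return a new image: masked pixels come from `generate`, the rest are copied."""
    return [
        [generate(r, c) if mask[r][c] else image[r][c] for c in range(len(row))]
        for r, row in enumerate(image)
    ]


def pixel_count(side: int) -> int:
    """Number of pixels in a square image with the given side length."""
    return side * side


def fake_generator(row: int, col: int) -> int:
    # Hypothetical stand-in for the model: paints every selected pixel white.
    return 255


original = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
selection = [
    [False, False, False],
    [False, True, True],
    [False, False, False],
]
edited = inpaint(original, selection, fake_generator)

# 1,024 x 1,024 output versus 256 x 256: how many times more pixels?
ratio = pixel_count(1024) // pixel_count(256)
```

In this toy case only the two selected pixels change, and the 1,024-pixel output holds 16 times as many pixels as a 256-pixel one.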
DALL-E 2 builds on CLIP, a computer vision system that OpenAI also announced last year. “DALL-E 1 just took our GPT-3 approach from language and used it to produce an image: we compressed images into a series of words, and we just learned to predict what’s coming next,” says OpenAI researcher Prafulla Dhariwal, referring to the GPT model used by many text AI apps. But the word matching did not necessarily capture the qualities that people found most important, and the predictive process limited the realism of the images. CLIP was designed to look at images and summarize their content as a human would, and OpenAI iterated on this process to create “unCLIP”, an inverted version that starts with the description and works its way toward an image. DALL-E 2 generates the image using a process called diffusion, which Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with more and more detail.
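The “bag of dots” description can be illustrated with a toy Python loop. This is a deliberately simplified sketch, not OpenAI's method: a real diffusion model uses a trained neural network to predict how to remove noise at each step, while the hypothetical `toy_denoise` function below cheats by being handed the finished picture (`target`) and nudging random values toward it a little more at every step.

```python
import random
from typing import List


def toy_denoise(target: List[float], steps: int, seed: int = 0) -> List[List[float]]:
    """Start from random noise and refine it step by step toward `target`.

    Returns every intermediate state so the gradual refinement can be inspected.
    """
    rng = random.Random(seed)
    # Start from pure noise: the "bag of dots".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    history = [x]
    for step in range(1, steps + 1):
        # Assumption: a real model would predict this correction from training;
        # here the known target stands in for that prediction.
        blend = 1.0 / (steps - step + 1)  # reaches 1.0 on the final step
        x = [xi + blend * (ti - xi) for xi, ti in zip(x, target)]
        history.append(x)
    return history


def mean_abs_error(values: List[float], target: List[float]) -> float:
    """Average distance between the current state and the target."""
    return sum(abs(v - t) for v, t in zip(values, target)) / len(target)


target_pattern = [0.0, 0.5, 1.0, 0.5, 0.0]
states = toy_denoise(target_pattern, steps=10)
errors = [mean_abs_error(s, target_pattern) for s in states]
```

Each entry in `errors` is no larger than the one before it, mirroring the idea of a pattern emerging from noise in progressively finer detail.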
Interestingly, a draft paper on unCLIP says that it is partly resistant to a very funny weakness of CLIP: the fact that people can fool the model’s identification abilities by labeling one object (like a Granny Smith apple) with a word that indicates something else (like an iPod). The variations tool, the authors say, “still generates images of apples with high probability,” even when using a mislabeled image that CLIP cannot identify as a Granny Smith. Conversely, “the model never produces images of iPods, despite the very high relative predicted probability of this caption.”
DALL-E’s full model was never released publicly, but over the past year other developers have honed their own tools that imitate some of its features. One of the most popular mainstream applications is Wombo’s Dream mobile app, which generates images of whatever users describe in a variety of art styles. OpenAI is not releasing any new models today, but developers could use its technical findings to update their own work.
OpenAI has implemented some built-in safeguards. The model was trained on data from which some objectionable material had been removed, which ideally limits its ability to produce objectionable content. There is a watermark indicating the AI-generated nature of the work, although it could theoretically be cropped out. As a preemptive anti-abuse feature, the model also cannot generate any recognizable faces based on a name; even asking for something like the Mona Lisa would apparently return a variant on the actual face from the painting.
DALL-E 2 can be tested by vetted partners, with some caveats. Users are prohibited from uploading or generating images that are “not G-rated” and that “may cause harm”, including anything involving hate symbols, nudity, obscene gestures or “major conspiracies or events related to major ongoing geopolitical events.” They must also disclose the role of AI in generating the images, and they cannot serve generated images to other people through an app or website, so you will not initially see a DALL-E-powered version of something like Dream. But OpenAI hopes to add it to the group’s API toolkit later so it can power third-party apps. “Our hope is to continue to make a step-by-step process here so that we can continue to evaluate, based on the feedback we receive, how we can release this technology safely,” said Dhariwal.
Additional reporting by James Vincent.