The realm of fakery continues to be refined by artificial intelligence; fake text was mastered a few years ago with startup OpenAI’s GPT-3 natural language processing program.
Now images, which had already seen substantial manipulation through programs such as Nvidia’s StyleGAN, introduced by Tero Karras and his Nvidia colleagues in 2019, got a boost this year with OpenAI’s announcement of a new image generation program, DALL•E 2, which builds on the first DALL•E, released in January 2021. It can take a sentence you type and convert it into an image, with many ways to shape the output image.
This week, OpenAI removed the waitlist; anyone can now go to the site to try DALL•E 2 as long as they are willing to create an account on the OpenAI website with an email address and phone number.
The strength of DALL•E 2, like its predecessor, is creating images from text that a person types into a field on the web page. Type in the phrase “astronaut riding a horse in a photorealistic style” and an image will appear something like this: a realistic rendering of a figure in an astronaut suit, in profile, astride a horse walking against what appears to be an image of the cosmos.
The work is described in a research paper by OpenAI scientists Aditya Ramesh and colleagues, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” posted on preprint server arXiv.
DALL•E 2 is what is called a contrastive encoder-decoder. It is built by compressing images and their captions into a sort of abstract, combined representation, and then decompressing them. This training regime develops the program’s ability to associate text and image.
The main point of Ramesh and his colleagues is that the way the compression and decompression happen allows more than just translating between text and image: it lets sentences shape aspects of an image, as when adding the term “photorealistic” produces something with a certain smooth realism.
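The contrastive idea can be illustrated with a toy sketch (this is a simplified illustration under assumed names, not OpenAI’s actual code): captions and images are each compressed into vectors, and training rewards matched caption/image pairs for being close while pushing mismatched pairs apart.

```python
import numpy as np

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Toy CLIP-style contrastive loss (hypothetical sketch).
    Row i of each matrix is a matched caption/image pair; vectors are
    L2-normalized, and the loss is low when each caption's most similar
    image is its own (and vice versa)."""
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    logits = t @ v.T / temperature      # similarity of every caption to every image
    labels = np.arange(len(logits))     # the correct match sits on the diagonal

    def xent(l):
        # softmax cross-entropy against the diagonal labels
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.mean(np.log(p[labels, labels]))

    # average the caption-to-image and image-to-caption directions
    return (xent(logits) + xent(logits.T)) / 2

# Tiny demo with random stand-in "embeddings"
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(contrastive_loss(emb, emb))                 # matched pairs: low loss
print(contrastive_loss(emb, emb[::-1].copy()))    # mismatched pairs: higher loss
```

The real system operates on learned embeddings from deep networks, of course; the point here is only the training signal that ties text to image.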
Although the images are still a bit rough, you can see that DALL•E 2 has the potential to replace many commercial illustrations and even stock photographs. By typing in a phrase and a style, such as “photo”, you can produce a variety of images that may be suitable for illustrating articles.
You can see for yourself by trying it. Most of the things that immediately come to mind are fun combos. For example, “A blue whale and a kitten making friends on a beach, digital art” produces the endearing greeting card-style output below.
Four versions are offered at a time, and you can download each of them in PNG format.
But it’s also possible to get a number of more mundane images that fit a stock photography context. Typing the phrase “A ZDNet contributing writer seeing the future of technology in his own articles by a mountainside hovering in space” produces a kind of sci-fi image that is close to what might accompany an article.
We can add the phrase “realistic image” and get something a little smoother.
Using the phrase “Photo of a very anxious computer user looking at his computer screen and seeing a Windows patch alert” produced a delightful series of images of generally fearful computer users.
The sentence can be expanded with additional words to get more specific results, such as “Photo of a very anxious computer user at their office looking at their computer screen and seeing a Windows patch alert.”
Once you start focusing on stock photography, you’ll find that you can come up with a lot of scenarios to turn into an image. For example, “Photo of a person with glasses making a point to several people at a conference table in a meeting room” gives a pretty good selection of what at first glance looks like real office scenes.
Again, more specific scenes and changed attributes can be obtained with a few words, such as “Photo of a person with glasses standing next to a blackboard in a conference room explaining something to his colleagues.”
As you can see, elements such as facial features are usually degraded in the DALL•E 2 output.
By applying the names of artists or artistic media or styles, one can move the same image from the realm of stock photography to the realm of illustration, as in the phrase “Francis Bacon painting of a group of people in a conference room and a person with glasses standing next to a blackboard explaining something.”
Once you create an account, OpenAI gives you 50 free “credits,” where each phrase entered counts as one request. Once you’ve used all 50, you can either wait a month for the next 15 free credits or buy more. Credits are sold in packs of 115 for $15, or about 13 cents per credit.
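The per-prompt price works out as simple division; here is the arithmetic as a small snippet (the function name is ours, not OpenAI’s):

```python
# Pricing arithmetic from the article: a pack of 115 credits costs $15,
# and each prompt consumes one credit.
def cost_per_prompt(pack_price=15.00, pack_credits=115):
    """Dollar cost of a single prompt when buying a credit pack."""
    return pack_price / pack_credits

print(round(cost_per_prompt(), 2))  # → 0.13, i.e. about 13 cents per prompt
```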
It is possible to trip the program up. Some requests may be too much of a mix of the real and the imaginary to be rendered convincingly. For example, a request for “blue furry rats taking over Times Square” produces a decent first attempt, but the furry element gives the image a sloppy, uneven quality that doesn’t quite work.
Other requests can trip things up by the choice of a single word.
The request “a bag of money sitting on a lawn chair on a porch overlooking the sunset” generated completely bizarre and unrelated images, such as a close-up of toenails, and an ambiguous image that appeared to be flowers stuck inside a rug.
Replacing the word “sitting” with “placed” allowed DALL•E 2 to produce a satisfactory result in one out of three images.
The program may not be able to find an appropriate combination of elements for what appears to be an active verb, sit, when combined with an inanimate object, a bag.
In general, the program seems to struggle with certain aspects of location, such as “standing in front of an easel”.
Sentences that are not descriptions but questions or interjections seem to throw the system into a kind of random mode. For example, “Does DALL•E 2 know its own name?” produces multiple flower images. That might be a poetic response, but it reads more like a dismissal of the prompt.
There are guardrails put in place by OpenAI, spelled out in the published content policy, and they automatically zap any attempt at verboten subject matter. For example, the request “Microsoft co-founder Bill Gates smoking a cigar in a rundown apartment with broken furniture” will not be fulfilled. Instead, an error message is displayed stating that the request violates the policy and directing you to the policy page. This is probably a violation of the “Do not create images of public figures” rule.
The same request, with Gates replaced by the rather less well-known Tiernan Ray, a ZDNet contributor, generated a selection of funny images of people who are not Tiernan Ray.
Additionally, copyrighted text appears to be protected against mass infringement. The phrase “a group of people hanging out in front of McDonald’s” produces a suitable scene, but each proposed result slightly alters the “McDonald’s” lettering so that it isn’t actually that word.
Where do things go next? Work on the basic text-to-image approach is progressing on many fronts. One direction adds more linguistic sophistication to the program. For example, Chitwan Saharia and the Google Brain team published their work in May on “Imagen,” a program they say has an “unprecedented degree of photorealism.” The trick was to use much larger language models to encode the text for the network.
And there’s work going on to expand the complexity of the kinds of things a program can do. For example, Google scientists Wenhu Chen and his colleagues this month created a program that extends Saharia and his team’s Imagen, called “Re-Imagen,” which combines the basic idea of text and image compression with a third element: search results.
By adding what they call “retrieval,” the program is trained not only to find a “semantic” combination of word and image but also to consult retrieved search results for combinations that refine the output. They claim the results are far superior to Imagen and DALL•E 2 in handling rare and obscure phrases such as “Picarones are served with wine,” referring to the Peruvian sweet potato dessert.
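The retrieval step can be caricatured in a few lines (a deliberately crude sketch with made-up data, not Re-Imagen’s actual mechanism, which retrieves neighbors in a learned embedding space): before generating, look up stored caption/image pairs related to the prompt and hand them to the generator as extra context.

```python
# Toy illustration of retrieval-augmented generation: rank a small
# stored "database" of captions by word overlap with the prompt, so a
# rare term like "picarones" pulls in relevant reference material.
def retrieve(prompt, database, k=2):
    """Return the k stored captions sharing the most words with the prompt."""
    words = set(prompt.lower().split())
    scored = sorted(database,
                    key=lambda cap: -len(words & set(cap.lower().split())))
    return scored[:k]

database = [
    "picarones served on a plate",
    "a glass of red wine on a table",
    "a mountain landscape at dusk",
]
print(retrieve("Picarones are served with wine", database))
# → ['picarones served on a plate', 'a glass of red wine on a table']
```

A real system would score retrieved items by embedding similarity rather than word overlap, but the flow is the same: retrieve, then condition generation on what was found.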