OpenAI debuts Sora to create movie-quality video from text
Generative AI pioneer OpenAI has introduced a new text-to-video model that can produce videos of up to 60 seconds in length from text instructions.
The new offering, Sora (the Japanese word for ‘sky’), can be instructed in natural language to create a video.
OpenAI, known for its DALL-E still-image generator and ChatGPT chatbot, is not the first to introduce this kind of capability, but Sora’s results appear to be of vastly superior quality to those seen previously, and of greater length.
OpenAI has not divulged the source material used to train Sora, although it told the New York Times that training included publicly available videos and videos licensed from copyright holders. OpenAI has previously been sued on multiple occasions for using copyrighted material to train its models, including by the New York Times itself.
In a blog post, the company said that Sora could “generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt”.
OpenAI has made the product available to ‘red teamers’, who are tasked with assessing areas where it could cause harm or pose risks.
The company said it was also making Sora available to a “number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals”.
According to OpenAI, Sora can “generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world”.
In addition to correctly interpreting instructions, Sora can generate multiple shots within a single video while maintaining continuity of characters and visual style.
OpenAI added a few caveats. It said that the current model could “struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect”, citing the example of a character taking a bite out of a cookie that then reappeared without a bite mark.
It said the model could also “confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time”.