Text, image, audio and video... How powerful is Microsoft's cross-modal model CoDi?


Researchers from Microsoft Azure and the University of North Carolina at Chapel Hill have published a paper, "Any-to-Any Generation via Composable Diffusion," introducing a new multimodal generation model: CoDi (Composable Diffusion).

CoDi is capable of generating any combination of output modalities from any combination of input modalities, such as language, image, video, or audio. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel, and its inputs are not limited to a subset of modalities such as text or images. CoDi can condition on any combination of inputs and generate any set of modalities, even combinations that are not present in the training data.

CoDi introduces an unprecedented level of content generation by simultaneously processing and generating multimodal content such as text, images, audio, and video. Using diffusion models and composable techniques, CoDi can generate high-quality, diverse outputs from single or multiple inputs, transforming content creation, accessibility, and personalized learning.

CoDi is highly customizable and flexible, enabling robust joint modality generation quality that outperforms or rivals state-of-the-art single modality synthesis.

Recently, CoDi has been made officially available on the Microsoft Azure platform, where it can be used free of charge for 12 months.

How powerful is CoDi

CoDi emerged as part of Microsoft's ambitious i-Code project, a research initiative dedicated to advancing multimodal AI capabilities. CoDi's ability to seamlessly integrate information from various sources and generate consistent output is expected to revolutionize multiple areas of human-computer interaction.

One of the areas where CoDi could bring about change is assistive technology, enabling people with disabilities to interact with computers more effectively. By seamlessly generating content across text, images, video, and audio, CoDi can provide users with a more immersive and accessible computing experience.

Additionally, CoDi has the potential to reinvent personalized learning tools by providing a comprehensive, interactive learning environment. Students can engage with multimodal content that seamlessly integrates information from a variety of sources, enhancing their understanding of and engagement with a topic.

CoDi could also transform content generation. The model is able to generate high-quality output across multiple modalities, which can simplify the content creation process and reduce the burden on creators. Whether generating engaging social media posts, crafting interactive multimedia presentations, or building immersive storytelling experiences, CoDi's capabilities have the potential to reshape the content generation landscape.

To address the limitations of traditional unimodal AI models, CoDi provides a solution to the tedious and slow process of combining modality-specific generative models.

This novel model employs a unique composable generation strategy that bridges alignment during diffusion and facilitates simultaneous generation of interwoven modalities, such as time-aligned video and audio.

CoDi's training process is also distinctive. It projects input modalities such as image, video, audio, and language into a common semantic space, which allows flexible handling of multimodal inputs; cross-attention modules and an environment encoder then let the model generate arbitrary combinations of output modalities simultaneously.

CoDi's model architecture: CoDi uses a multi-stage training scheme that needs to train on only a linear number of tasks, yet can infer on all combinations of input and output modalities.
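To make the architecture description above concrete, here is a minimal, hypothetical sketch in PyTorch (not the official CoDi code): modality-specific encoders project inputs into a shared semantic space, and each output modality's denoising step cross-attends to the conditioning features plus the other output's latent. All module names, dimensions, and shapes below are illustrative assumptions.

```python
# Minimal sketch, NOT the official CoDi implementation: hypothetical modules
# illustrating (1) projection of each modality into a shared semantic space and
# (2) joint denoising in which one modality's latent cross-attends to the
# conditioning features and the other modality's latent ("environment").
import torch
import torch.nn as nn

D = 256  # assumed dimension of the shared semantic space


class ModalityEncoder(nn.Module):
    """Projects one modality's features into the shared semantic space."""

    def __init__(self, in_dim: int, out_dim: int = D):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):          # x: (batch, tokens, in_dim)
        return self.proj(x)        # -> (batch, tokens, D)


class JointDenoiseStep(nn.Module):
    """One denoising block for one modality, conditioned on the other
    modalities' features via cross-attention."""

    def __init__(self, dim: int = D, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, env):     # z: this modality's latent; env: what it conditions on
        z = z + self.self_attn(z, z, z)[0]
        z = z + self.cross_attn(z, env, env)[0]   # bridge to the other modalities
        return z + self.mlp(z)


# Toy usage: text and image inputs are mapped into the shared space, then a
# video latent and an audio latent each attend to the conditioning plus the
# other output's latent, coupling the two generations.
text_enc, image_enc = ModalityEncoder(in_dim=512), ModalityEncoder(in_dim=768)
cond = torch.cat(
    [text_enc(torch.randn(1, 16, 512)), image_enc(torch.randn(1, 49, 768))], dim=1
)

video_step, audio_step = JointDenoiseStep(), JointDenoiseStep()
z_video, z_audio = torch.randn(1, 32, D), torch.randn(1, 20, D)
z_video = video_step(z_video, torch.cat([cond, z_audio], dim=1))
z_audio = audio_step(z_audio, torch.cat([cond, z_video], dim=1))
print(z_video.shape, z_audio.shape)  # torch.Size([1, 32, 256]) torch.Size([1, 20, 256])
```

Because every modality is encoded into the same semantic space, input combinations that never appeared together during training can still be handled at inference time, which is what the "linear number of tasks" training scheme relies on.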

Single or multiple inputs --> multiple outputs

CoDi can take single or multiple prompts (video, image, text, or audio) and generate multiple aligned outputs, such as a video with accompanying sound; a sketch of this usage pattern follows the examples below.

For example:

1. Text + Image + Audio --> Audio + Video

"A teddy bear on a skateboard, 4k, high resolution" + a picture of Times Square in New York + a rainy audio --> After CoDi generation, a piece of "A teddy bear skateboards in Times Square in the rain, Accompanied by the simultaneous sound of rain and street noise."

How is it generated?

CoDi can jointly generate any combination of video, image, audio, and text via composable diffusion. For example, it can take an audio track and generate a text caption; take an image plus audio and generate new audio; take image + audio + text and fuse their information to generate a new image with a matching caption; and finally take image + audio + text and generate a video with synchronized audio.

2. Text + Audio + Image --> Text + Image

3. Audio + Image --> Text + Image

4. Text + Image --> Text + Image

5. Text --> Video + Audio

6. Text --> Text + Audio + Image
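As noted before the examples, here is a hedged sketch of what this any-combination-in, any-combination-out call pattern might look like behind a simple wrapper, using example 1 above as the input. `CoDiPipeline`, its `generate` method, and the file names are hypothetical rather than the official API; the body only validates modality names and returns placeholders where a real model would run joint diffusion.

```python
# Hypothetical wrapper sketch; the class, method, and file names are assumptions.
from dataclasses import dataclass


@dataclass
class CoDiPipeline:
    supported: tuple = ("text", "image", "audio", "video")

    def generate(self, inputs: dict, outputs: list) -> dict:
        # Check that every requested modality is one the model handles.
        for modality in list(inputs) + list(outputs):
            if modality not in self.supported:
                raise ValueError(f"unsupported modality: {modality}")
        # A real model would jointly denoise one latent per requested output,
        # conditioned on all inputs; here we return labelled placeholders.
        return {m: f"<generated {m} conditioned on {sorted(inputs)}>" for m in outputs}


pipe = CoDiPipeline()

# Example 1 above: Text + Image + Audio --> Audio + Video
result = pipe.generate(
    inputs={
        "text": "A teddy bear on a skateboard, 4k, high resolution",
        "image": "times_square.png",   # hypothetical file names
        "audio": "rain.wav",
    },
    outputs=["video", "audio"],
)
print(result)
```

The same call shape covers the multiple-inputs-to-single-output and single-input-to-single-output combinations listed in the next two sections; only the `inputs` and `outputs` arguments change.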

Multiple inputs --> single output

1. Text + Audio --> Image

2. Text + Image --> Image

3. Text + Audio --> Video

4. Text + Image --> Video

5. Others, such as Video + Audio --> Text, Image + Audio --> Audio, and Text + Image --> Audio

Single input --> single output

1. Text --> Image

2. Audio --> Image

3. Image --> Video

4. Image --> Audio

5. Audio --> Text

6. Image --> Text
