Text, image, audio and video... How powerful is Microsoft's cross-modal model CoDi?
Researchers from Microsoft Azure and the University of North Carolina at Chapel Hill published a paper, "Any-to-Any Generation via Composable Diffusion," introducing a new multimodal generation model: CoDi (Composable Diffusion).
CoDi can generate any combination of output modalities from any combination of input modalities, including language, image, video, and audio. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel, and its input is not limited to a subset of modalities such as text or images. CoDi can freely condition on any combination of inputs and generate any set of modalities, even combinations not present in the training data.
CoDi introduces an unprecedented level of content generation by simultaneously processing and generating multimodal content such as text, images, audio, and video. Using diffusion models and composable techniques, CoDi can generate high-quality, diverse outputs from single or multiple inputs, transforming content creation, accessibility, and personalized learning.
CoDi is highly customizable and flexible, achieving joint-modality generation quality that rivals or outperforms state-of-the-art single-modality synthesis.
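Conceptually, "conditioning on any combination of inputs" relies on each modality having its own encoder whose outputs live in a shared embedding space, so several conditions can be merged into one conditioning signal (the paper combines the aligned embeddings, e.g. by weighted interpolation). Below is a minimal PyTorch sketch of that idea; the encoder class, dimensions, and helper names are illustrative assumptions, not CoDi's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Placeholder per-modality encoder that projects raw features into a
    shared semantic space (CoDi aligns its modality encoders so their
    embeddings are comparable; this linear layer just stands in for that)."""
    def __init__(self, input_dim, shared_dim=512):
        super().__init__()
        self.proj = nn.Linear(input_dim, shared_dim)

    def forward(self, x):
        # Normalize so embeddings from different modalities lie on the
        # same scale and can be mixed.
        return F.normalize(self.proj(x), dim=-1)

def merge_conditions(embeddings, weights=None):
    """Combine any number of aligned condition embeddings into a single
    conditioning vector by weighted interpolation (hypothetical helper)."""
    if weights is None:
        weights = [1.0 / len(embeddings)] * len(embeddings)
    stacked = torch.stack(embeddings, dim=0)          # (num_conditions, B, D)
    w = torch.tensor(weights).view(-1, 1, 1)
    return (w * stacked).sum(dim=0)                   # (B, D)

# Example: condition a generator on text + image + audio at once.
text_enc = SharedSpaceEncoder(768)
image_enc = SharedSpaceEncoder(1024)
audio_enc = SharedSpaceEncoder(128)

text_feat = torch.randn(1, 768)     # stand-ins for real encoder features
image_feat = torch.randn(1, 1024)
audio_feat = torch.randn(1, 128)

condition = merge_conditions([text_enc(text_feat),
                              image_enc(image_feat),
                              audio_enc(audio_feat)])
print(condition.shape)  # torch.Size([1, 512])
```

Because every condition lands in the same space, adding or dropping an input modality only changes the list passed to the merge step, not the downstream generator.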
Recently, CoDi was officially made available on the Microsoft Azure platform, where it can be used free of charge for 12 months.
How powerful is CoDi?
CoDi emerged as part of Microsoft's ambitious i-Code project, a research initiative dedicated to advancing multimodal AI capabilities. CoDi's ability to seamlessly integrate information from various sources and generate consistent output is expected to revolutionize multiple areas of human-computer interaction.
One of the areas where CoDi could bring about change is assistive technology, enabling people with disabilities to interact with computers more effectively. By seamlessly generating content across text, images, video, and audio, CoDi can provide users with a more immersive and accessible computing experience.
Additionally, CoDi has the potential to reinvent personalized learning tools by providing a comprehensive, interactive learning environment. Students can engage with multimodal content that seamlessly integrates information from a variety of sources, deepening their understanding of and engagement with a topic.
CoDi could also revolutionize content generation. The model can produce high-quality output across multiple modalities, simplifying the content creation process and reducing the burden on creators. Whether generating engaging social media posts, crafting interactive multimedia presentations, or building compelling storytelling experiences, CoDi has the potential to reshape the content generation landscape.
CoDi addresses a key limitation of traditional unimodal AI models: the tedious and slow process of combining separate modality-specific generative models.
The model employs a composable generation strategy that aligns modalities in a shared space during diffusion, facilitating the simultaneous generation of interwoven modalities such as time-aligned video and audio.
CoDi's training process is also distinctive: input modalities such as images, video, audio, and language are projected into a common semantic space, allowing flexible handling of multimodal inputs. A cross-attention module and an environment encoder then enable the model to generate arbitrary combinations of output modalities simultaneously.
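A rough picture of the joint-generation mechanism described above: each output modality is produced by its own latent diffusion stream, and during denoising each stream projects its latents through an environment encoder so the other streams can attend to that projection via cross-attention, keeping the outputs aligned (e.g. video frames with their soundtrack). The block below is an illustrative sketch with placeholder names and dimensions, not the actual CoDi architecture.

```python
import torch
import torch.nn as nn

class JointGenerationBlock(nn.Module):
    """Illustrative block: one denoising stream (e.g. video latents)
    cross-attends to the environment representation of another stream
    (e.g. audio latents), so the two modalities are generated in sync."""
    def __init__(self, dim=320, env_dim=512, heads=8):
        super().__init__()
        # Environment encoder: projects the other stream's latents into
        # a shared space that this stream can attend to.
        self.env_encoder = nn.Linear(dim, env_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                                kdim=env_dim, vdim=env_dim,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, own_latents, other_latents):
        env = self.env_encoder(other_latents)            # (B, T_other, env_dim)
        attended, _ = self.cross_attn(query=own_latents, key=env, value=env)
        return self.norm(own_latents + attended)         # residual update

# Example: a video stream and an audio stream exchanging information
# during one denoising step (shapes are arbitrary placeholders).
video_block = JointGenerationBlock()
audio_block = JointGenerationBlock()

video_latents = torch.randn(1, 16, 320)   # 16 video latent tokens
audio_latents = torch.randn(1, 64, 320)   # 64 audio latent tokens

video_latents = video_block(video_latents, audio_latents)
audio_latents = audio_block(audio_latents, video_latents)
print(video_latents.shape, audio_latents.shape)
```

The key design point is that cross-attention only exchanges a compact environment representation between streams, so adding another output modality means adding another stream rather than retraining a monolithic model.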
Single or multiple inputs --> multiple outputs
CoDi can take single or multiple prompts (video, image, text, or audio) and generate multiple aligned outputs, such as a video with accompanying sound. A sketch of what such an any-to-any interface might look like follows the examples below.
For example:
1. Text + Image + Audio --> Audio + Video
The text prompt "A teddy bear on a skateboard, 4k, high resolution" + a photo of Times Square in New York + an audio clip of rain --> CoDi generates a video of a teddy bear skateboarding through Times Square in the rain, accompanied by the synchronized sound of rain and street noise.
2. Text + Audio + Image --> Text + Image
Multiple inputs --> single output
1. Text + Audio --> Image
Single input --> single output
1. Text --> Image
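Taken together, these combinations amount to an any-to-any interface in which the caller simply names which modalities go in and which should come out. The wrapper below is purely hypothetical (the class and method names are invented for illustration; CoDi's released code in the i-Code project exposes its own scripts), but it shows how such an interface could be shaped.

```python
from typing import Any, Dict, List

# Hypothetical wrapper around a pretrained any-to-any model; names and
# arguments are illustrative, not CoDi's real API.
class AnyToAnyModel:
    SUPPORTED = {"text", "image", "audio", "video"}

    def generate(self, inputs: Dict[str, Any],
                 outputs: List[str]) -> Dict[str, Any]:
        # Validate the requested modality combination.
        unknown = (set(inputs) | set(outputs)) - self.SUPPORTED
        if unknown:
            raise ValueError(f"Unsupported modalities: {unknown}")
        # A real model would encode each input, merge the conditions,
        # and run one diffusion stream per requested output modality.
        return {modality: f"<generated {modality}>" for modality in outputs}

model = AnyToAnyModel()

# 1. Text + image + audio --> audio + video (the teddy-bear example above)
result = model.generate(
    inputs={"text": "A teddy bear on a skateboard, 4k, high resolution",
            "image": "times_square.jpg",
            "audio": "rain.wav"},
    outputs=["video", "audio"],
)

# 2. Text + audio --> image (multiple inputs, single output)
result = model.generate(inputs={"text": "a rainy street", "audio": "rain.wav"},
                        outputs=["image"])

# 3. Text --> image (single input, single output)
result = model.generate(inputs={"text": "a rainy street"}, outputs=["image"])
print(result)
```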