Z-Image: Powerful and highly efficient image generation model with 6B parameters

https://news.ycombinator.com/rss Hits: 27
Summary

⚡️ Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Welcome to the official repository for the Z-Image (造相) project! Z-Image is a powerful and highly efficient image generation model with 6B parameters. Currently there are three variants:

🚀 Z-Image-Turbo: A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️sub-second inference latency⚡️ on enterprise-grade H800 GPUs and fits comfortably within 16 GB VRAM consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

🧱 Z-Image-Base: The non-distilled foundation model. By releasing this checkpoint, we aim to unlock the full potential for community-driven fine-tuning and custom development.

✍️ Z-Image-Edit: A variant fine-tuned from Z-Image specifically for image editing tasks. It supports creative image-to-image generation with impressive instruction-following capabilities, allowing for precise edits based on natural language prompts.

📥 Model Zoo

Model            Hugging Face      ModelScope
Z-Image-Turbo
Z-Image-Base     To be released    To be released
Z-Image-Edit     To be released    To be released

🖼️ Showcase

📸 Photorealistic Quality: Z-Image-Turbo delivers strong photorealistic image generation while maintaining excellent aesthetic quality.

📖 Accurate Bilingual Text Rendering: Z-Image-Turbo excels at accurately rendering complex Chinese and English text.

💡 Prompt Enhancing & Reasoning: Prompt Enhancer empowers the model with reasoning capabilities, enabling it to transcend surface-level descriptions and tap into underlying world knowledge.

🧠 Creative Image Editing: Z-Image-Edit shows a strong understanding of bilingual editing instructions, enabling imaginative and flexible image transformations.

🏗️ Model Architecture

We adopt a Scalable Single-Stream DiT (S3-DiT) architecture. In this setup, text, visual semantic tokens, and image V...
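
For readers unfamiliar with the "8 NFEs" figure quoted for Z-Image-Turbo, here is a minimal sketch of how function evaluations are typically counted in an iterative sampler. It is not Z-Image code: `CountingModel`, `toy_velocity`, and `euler_sample` are hypothetical names, the "network" is a toy function, and the fixed-step Euler loop is only an assumption about what a sampler might look like. The point is simply that each sampling step costs one call to the model, so 8 steps means 8 NFEs.

```python
from typing import Callable, List

Vector = List[float]


class CountingModel:
    """Wraps a model-like function and counts how many times it is called."""

    def __init__(self, fn: Callable[[Vector, float], Vector]) -> None:
        self.fn = fn
        self.nfe = 0  # number of function evaluations so far

    def __call__(self, x: Vector, t: float) -> Vector:
        self.nfe += 1
        return self.fn(x, t)


def toy_velocity(x: Vector, t: float) -> Vector:
    # Hypothetical stand-in for a neural network: pulls every value toward 0.
    return [-v for v in x]


def euler_sample(model: Callable[[Vector, float], Vector],
                 x: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = model(x, t) from t=0 to t=1 with fixed-step Euler."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = model(x, t)  # exactly one model call per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


model = CountingModel(toy_velocity)
sample = euler_sample(model, [1.0, -2.0], num_steps=8)
print(model.nfe)  # 8
```

Because the cost of a real model call dominates sampling time, halving the number of steps roughly halves latency, which is why distilled models are usually described by their NFE budget.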

First seen: 2025-12-06 17:20

Last seen: 2025-12-07 19:23