This is a Flax/JAX implementation of VQGAN, which learns a codebook of context-rich visual parts by combining convolutional methods with transformers. It was introduced in "Taming Transformers for High-Resolution Image Synthesis" (CVPR paper).
The model allows the encoding of images as a fixed-length sequence of tokens taken from the codebook.
This version of the model uses a reduction factor f=16 and a vocabulary of 16,384 tokens.
As an example of how the reduction factor works, images of size 256x256 are encoded to sequences of 256 tokens: 256/16 * 256/16. Images of 512x512 would result in sequences of 1024 tokens.
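The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reduction-factor calculation, not part of the model's code: the helper name `num_tokens` is hypothetical, and the divisibility check is an assumption made for the sketch rather than a documented requirement of the model.

```python
def num_tokens(height: int, width: int, f: int = 16) -> int:
    """Return the token sequence length for an image of the given size.

    Hypothetical helper: each spatial dimension is divided by the
    reduction factor f, and the resulting grid is flattened.
    """
    # Assumption for this sketch: image sides are multiples of f.
    if height % f != 0 or width % f != 0:
        raise ValueError("height and width should be multiples of f")
    return (height // f) * (width // f)


print(num_tokens(256, 256))  # 16 * 16 grid of tokens
print(num_tokens(512, 512))  # 32 * 32 grid of tokens
```

Each token is an index into the 16,384-entry codebook, so doubling both image sides quadruples the sequence length.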
This model was ported to JAX using a checkpoint trained on ImageNet.
The checkpoint can be loaded using Suraj Patil's implementation of VQModel.
This model can be used as part of the implementation of DALL·E mini. Our report contains more details on how to leverage it in an image encoding / generation pipeline.