Conditional Image Generation with PixelCNN Decoders

Paper Review

Introduction

This paper presents Conditional PixelCNN, an autoregressive model designed for conditional image generation. It builds on PixelCNN by introducing gated convolutions, boosting performance to match PixelRNN while requiring significantly less training time.

Unlike GANs, Conditional PixelCNN provides explicit, tractable probability densities, making it suitable for tasks like compression, probabilistic planning, and image restoration. By conditioning on class labels or latent embeddings, the model can generate diverse and realistic samples, from natural scenes like coral reefs and dogs, to different poses and expressions of a face given just a single image. It also proves effective as a decoder in auto-encoder architectures, highlighting its versatility and practical potential.

Contributions

The paper's main contributions, each expanded in the sections below, are:

- Gated convolutions that close the performance gap between PixelCNN and PixelRNN at a fraction of the training cost.
- A dual-stack architecture (a vertical and a horizontal stack) that removes the blind spot in PixelCNN's receptive field.
- Conditional PixelCNN, which conditions generation on class labels or latent embeddings.
- A PixelCNN decoder for auto-encoders that encourages the encoder to learn high-level representations.

Model Architecture

Gated PixelCNN: Enhancing Convolutional Autoregressive Models

PixelRNN and PixelCNN are both autoregressive models designed to generate images by modeling pixel dependencies. While PixelCNN offers computational efficiency, PixelRNN has historically demonstrated superior generative performance. This performance gap stems from fundamental architectural differences, which the Gated PixelCNN aims to address.


Architectural Comparison and Motivation

- PixelRNN uses spatial LSTM layers, so its recurrent state gives every pixel access to the entire available context from the first layer onward.
- PixelCNN uses stacks of masked convolutions, which are fast and parallelisable during training, but whose receptive field grows only linearly with depth and therefore excludes distant pixels.

Although deeper convolutional stacks can alleviate this limitation, PixelRNN maintains an advantage due to its built-in recurrence and gating mechanisms.


Gated PixelCNN: Core Idea

To close the performance gap while retaining the efficiency of convolutions, Gated PixelCNN introduces gated activation units, inspired by mechanisms found in LSTM cells.

Activation Function

The ReLU activations in the original PixelCNN are replaced with a gated function:

\[y = \tanh(W_{k,f} * x) \circ \sigma(W_{k,g} * x)\]

Where:

- $*$ denotes convolution and $\circ$ element-wise multiplication,
- $\sigma$ is the sigmoid non-linearity,
- $k$ indexes the layer, and
- $W_{k,f}$ and $W_{k,g}$ are the filter and gate convolution kernels, respectively.

This architecture allows the network to control information flow more effectively, capturing complex interactions between pixels without relying on recurrence.
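
As a concrete illustration, here is a minimal PyTorch sketch of the gated activation unit. The class and parameter names are illustrative, not the authors' code, and the masking required for autoregressive modelling is omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ∘ σ(W_g * x), computed with one convolution
    that produces both the filter and the gate halves.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # One conv computes W_f * x and W_g * x in a single pass.
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f, g = self.conv(x).chunk(2, dim=1)  # split filter / gate halves
        return torch.tanh(f) * torch.sigmoid(g)
```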


Gated PixelCNN draws from several prior architectures that introduced gating mechanisms into feedforward networks:

- LSTM cells, whose multiplicative gates directly inspired the activation above,
- highway networks,
- grid LSTMs, and
- neural GPUs.

These models consistently show that gating enhances representational capacity and training stability.


Summary
| Model | Core Mechanism | Strength |
|---|---|---|
| PixelRNN | Spatial LSTMs with gating | Full context and expressive modelling |
| PixelCNN | Masked convolutions with ReLU | Efficient but limited context |
| Gated PixelCNN | Masked convolutions with gating | Improved interaction modelling and efficiency |

Blind Spot in the Receptive Field

PixelCNNs traditionally suffer from a “blind spot” — certain regions of the image, especially pixels to the right of the current one, are inaccessible due to masked convolutions. This limits the model’s ability to fully capture contextual information during generation.
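
To make the masking concrete, here is a minimal PyTorch sketch of the masked convolution used in the original PixelCNN (the `MaskedConv2d` name and `mask_type` argument are hypothetical, for illustration). Stacking such layers is exactly what produces the blind spot:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Masked convolution in the style of the original PixelCNN.

    Mask 'A' (first layer only) also hides the centre pixel; mask 'B'
    allows it. Stacking such layers grows the receptive field, but a
    triangular region above and to the right of each pixel is never
    covered: the blind spot.
    """

    def __init__(self, mask_type: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        # Hide future pixels: everything right of centre in the middle
        # row (plus the centre itself for mask 'A') ...
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ... and every row below the middle row.
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.weight.data *= self.mask  # re-apply the mask before each conv
        return super().forward(x)

# Usage: mask 'A' for the first layer, mask 'B' thereafter.
conv = MaskedConv2d("A", in_channels=3, out_channels=32,
                    kernel_size=7, padding=3)
```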

To address this, Gated PixelCNN introduces two parallel stacks (see the sketch below):

- a vertical stack, which conditions on all rows above the current pixel, and
- a horizontal stack, which conditions on the pixels to the left within the current row.

The two are combined at each layer, with the horizontal stack receiving information from the vertical stack. This design removes the blind spot without violating the autoregressive constraint (i.e., only conditioning on past pixels).
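
A simplified PyTorch sketch of one such layer follows. It enforces causality through asymmetric padding and cropping rather than explicit weight masks, and all names (`GatedPixelCNNLayer`, `v_conv`, `h_conv`, `v_to_h`) are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class GatedPixelCNNLayer(nn.Module):
    """One gated layer with a vertical and a horizontal stack.

    The vertical stack sees only rows strictly above the current
    pixel; the horizontal stack only pixels strictly to its left
    (mask-'A' style; deeper layers would also include the pixel
    itself). The horizontal stack receives the vertical stack's
    features through a 1x1 convolution, removing the blind spot.
    """

    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.v_conv = nn.Conv2d(ch, 2 * ch, (k // 2 + 1, k),
                                padding=(k // 2 + 1, k // 2))
        self.h_conv = nn.Conv2d(ch, 2 * ch, (1, k // 2 + 1),
                                padding=(0, k // 2 + 1))
        self.v_to_h = nn.Conv2d(2 * ch, 2 * ch, 1)  # vertical -> horizontal
        self.h_res = nn.Conv2d(ch, ch, 1)           # residual projection

    @staticmethod
    def _gate(x: torch.Tensor) -> torch.Tensor:
        f, g = x.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v: torch.Tensor, h: torch.Tensor):
        H, W = v.shape[-2:]
        v_out = self.v_conv(v)[:, :, :H, :]   # crop -> rows strictly above
        h_out = self.h_conv(h)[:, :, :, :W]   # crop -> pixels strictly left
        h_out = h_out + self.v_to_h(v_out)    # inject context from above
        return self._gate(v_out), h + self.h_res(self._gate(h_out))

# Usage: both stacks start from the same input feature map.
layer = GatedPixelCNNLayer(ch=64)
x = torch.randn(1, 64, 32, 32)
v, h = layer(x, x)
```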

Conditional PixelCNN

The aim is to model the conditional distribution of an image $x$ given a high-level description $h$, such as a class label or an embedding:

\[p(x|h) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1}, h)\]

What’s done:

- The conditioning vector $h$ is added as a bias inside every gated activation, via learned projections $V$:

\[y = \tanh(W_{k,f} * x + V_{k,f}^T h) \circ \sigma(W_{k,g} * x + V_{k,g}^T h)\]

- When $h$ carries spatial information, it is first mapped with a deconvolutional network to a spatial representation $s = m(h)$, which is combined with the feature maps through unmasked $1 \times 1$ convolutions.

Why:

- A single network can then generate images consistent with any given class label or embedding without changing the autoregressive structure, and the location-independent variant adds negligible computational cost. A sketch of the conditional gated activation follows.
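
A minimal PyTorch sketch of the conditional gated activation, assuming a global (non-spatial) conditioning vector $h$ such as a class embedding; the class and parameter names are illustrative, and masking is again omitted:

```python
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """Gated activation with a conditioning vector h:

        y = tanh(W_f * x + V_f h) ∘ σ(W_g * x + V_g h)

    V_f h and V_g h act as per-channel biases broadcast over all
    spatial positions, so this conditioning is location-independent.
    """

    def __init__(self, channels: int, h_dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.proj = nn.Linear(h_dim, 2 * channels, bias=False)  # V_f, V_g

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        out = self.conv(x) + self.proj(h)[:, :, None, None]  # broadcast h
        f, g = out.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)
```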

PixelCNN Auto-Encoders

What:

- The deterministic decoder of a conventional convolutional auto-encoder is replaced by a conditional PixelCNN, conditioned on the encoder's latent code $h$.

Why:

- Because the PixelCNN is a powerful generator on its own, the encoder is pushed to put high-level information into $h$ rather than low-level detail the decoder can fill in by itself.

Outcome:

- Trained end-to-end, the model learns qualitatively different representations from an MSE-trained auto-encoder; a sketch of the wiring follows.
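
A minimal sketch of the wiring, with hypothetical `encoder` and `decoder` modules standing in for a real convolutional encoder and a full conditional PixelCNN:

```python
import torch
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Sketch of the auto-encoder setup: the decoder is a conditional
    PixelCNN that is conditioned on the latent code h at every layer
    and trained by maximum likelihood.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # maps an image x to a latent code h
        self.decoder = decoder  # conditional PixelCNN: (x, h) -> logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        # Teacher forcing: during training the autoregressive decoder
        # predicts every pixel from the true preceding pixels plus h,
        # so a single forward pass scores the whole image.
        return self.decoder(x, h)

# The loss is the negative log-likelihood (softmax cross-entropy over
# the 256 intensity values of each sub-pixel), not pixelwise MSE.
```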

Experiments

Unconditional Modeling with Gated PixelCNN

Gated PixelCNN achieves competitive likelihood on CIFAR-10 and ImageNet, outperforming PixelCNN and matching PixelRNN with faster training. Its simplicity and parallelism allow it to scale better to large models.

Conditioning on ImageNet Classes

Conditioning on class labels ($p(x|h_i)$) had limited impact on log-likelihood but significantly improved visual quality. The model generated diverse and distinct samples across 1000 classes.

Conditioning on Portrait Embeddings

Using embeddings from a face recognition model, the conditional PixelCNN generated realistic portraits capturing facial features and variations. Linear interpolation between embeddings produced smooth identity morphs.

PixelCNN Auto-Encoders

An end-to-end PixelCNN auto-encoder trained on ImageNet patches learned high-level representations. Unlike MSE-trained auto-encoders, it captured abstract features, producing diverse reconstructions from the same latent code.

Future Work

Future directions include generating novel images of specific objects or animals from a single example, integrating Conditional PixelCNNs with variational inference to improve VAE decoders, and exploring image generation conditioned on captions rather than class labels. These extensions could further enhance controllability and realism in generative models.