Conditional Image Generation with PixelCNN Decoders

Paper Review

Introduction

This paper presents Conditional PixelCNN, an autoregressive model designed for conditional image generation. It builds on PixelCNN by introducing gated convolutions, boosting performance to match PixelRNN while requiring significantly less training time.

Unlike GANs, Conditional PixelCNN provides explicit, tractable probability densities, making it suitable for tasks like compression, probabilistic planning, and image restoration. By conditioning on class labels or latent embeddings, the model can generate diverse and realistic samples, from natural scenes like coral reefs and dogs, to different poses and expressions of a face given just a single image. It also proves effective as a decoder in auto-encoder architectures, highlighting its versatility and practical potential.

Contributions

The paper's main contributions, each expanded in the sections below, are:

- Gated convolutions that close the performance gap between PixelCNN and PixelRNN at a fraction of the training cost.
- A dual-stack architecture (a vertical and a horizontal stack) that removes the blind spot in PixelCNN's receptive field.
- Conditional PixelCNN, which conditions generation on class labels or latent embeddings.
- A PixelCNN decoder for auto-encoders that encourages the encoder to learn high-level representations.

Model Architecture

Gated PixelCNN: Enhancing Convolutional Autoregressive Models

PixelRNN and PixelCNN are both autoregressive models designed to generate images by modeling pixel dependencies. While PixelCNN offers computational efficiency, PixelRNN has historically demonstrated superior generative performance. This performance gap stems from fundamental architectural differences, which the Gated PixelCNN aims to address.


Architectural Comparison and Motivation

- PixelRNN uses spatial LSTM layers, so its recurrent state gives every pixel access to the entire available context from the first layer onward.
- PixelCNN uses stacks of masked convolutions, which are fast and parallelisable during training, but whose receptive field grows only linearly with depth and therefore excludes distant pixels.

Although deeper convolutional stacks can alleviate this limitation, PixelRNN maintains an advantage due to its built-in recurrence and gating mechanisms.


Gated PixelCNN: Core Idea

To close the performance gap while retaining the efficiency of convolutions, Gated PixelCNN introduces gated activation units, inspired by mechanisms found in LSTM cells.

Activation Function

The ReLU activations in the original PixelCNN are replaced with a gated function:

\[y = \tanh(W_{k,f} * x) \circ \sigma(W_{k,g} * x)\]

Where:

- $*$ denotes convolution and $\circ$ element-wise multiplication,
- $\sigma$ is the sigmoid non-linearity,
- $k$ indexes the layer, and
- $W_{k,f}$ and $W_{k,g}$ are the filter and gate convolution kernels, respectively.

This architecture allows the network to control information flow more effectively, capturing complex interactions between pixels without relying on recurrence.
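
As a concrete illustration, here is a minimal PyTorch sketch of the gated activation unit. The class and parameter names are illustrative, not the authors' code, and the masking required for autoregressive modelling is omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ∘ σ(W_g * x), computed with one convolution
    that produces both the filter and the gate halves.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # One conv computes W_f * x and W_g * x in a single pass.
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f, g = self.conv(x).chunk(2, dim=1)  # split filter / gate halves
        return torch.tanh(f) * torch.sigmoid(g)
```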


Gated PixelCNN draws from several prior architectures that introduced gating mechanisms into feedforward networks:

- LSTM cells, whose multiplicative gates directly inspired the activation above,
- highway networks,
- grid LSTMs, and
- neural GPUs.

These models consistently show that gating enhances representational capacity and training stability.


Summary
| Model | Core Mechanism | Strength |
|---|---|---|
| PixelRNN | Spatial LSTMs with gating | Full context and expressive modelling |
| PixelCNN | Masked convolutions with ReLU | Efficient but limited context |
| Gated PixelCNN | Masked convolutions with gating | Improved interaction modelling and efficiency |

Blind Spot in the Receptive Field

PixelCNNs traditionally suffer from a “blind spot” — certain regions of the image, especially pixels to the right of the current one, are inaccessible due to masked convolutions. This limits the model’s ability to fully capture contextual information during generation.
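
To make the masking concrete, here is a minimal PyTorch sketch of the masked convolution used in the original PixelCNN (the `MaskedConv2d` name and `mask_type` argument are hypothetical, for illustration). Stacking such layers is exactly what produces the blind spot:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Masked convolution in the style of the original PixelCNN.

    Mask 'A' (first layer only) also hides the centre pixel; mask 'B'
    allows it. Stacking such layers grows the receptive field, but a
    triangular region above and to the right of each pixel is never
    covered: the blind spot.
    """

    def __init__(self, mask_type: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        # Hide future pixels: everything right of centre in the middle
        # row (plus the centre itself for mask 'A') ...
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ... and every row below the middle row.
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.weight.data *= self.mask  # re-apply the mask before each conv
        return super().forward(x)

# Usage: mask 'A' for the first layer, mask 'B' thereafter.
conv = MaskedConv2d("A", in_channels=3, out_channels=32,
                    kernel_size=7, padding=3)
```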

To address this, Gated PixelCNN introduces two parallel stacks (see the sketch below):

- a vertical stack, which conditions on all rows above the current pixel, and
- a horizontal stack, which conditions on the pixels to the left within the current row.

The two are combined at each layer, with the horizontal stack receiving information from the vertical stack. This design removes the blind spot without violating the autoregressive constraint (i.e., only conditioning on past pixels).
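
A simplified PyTorch sketch of one such layer follows. It enforces causality through asymmetric padding and cropping rather than explicit weight masks, and all names (`GatedPixelCNNLayer`, `v_conv`, `h_conv`, `v_to_h`) are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class GatedPixelCNNLayer(nn.Module):
    """One gated layer with a vertical and a horizontal stack.

    The vertical stack sees only rows strictly above the current
    pixel; the horizontal stack only pixels strictly to its left
    (mask-'A' style; deeper layers would also include the pixel
    itself). The horizontal stack receives the vertical stack's
    features through a 1x1 convolution, removing the blind spot.
    """

    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.v_conv = nn.Conv2d(ch, 2 * ch, (k // 2 + 1, k),
                                padding=(k // 2 + 1, k // 2))
        self.h_conv = nn.Conv2d(ch, 2 * ch, (1, k // 2 + 1),
                                padding=(0, k // 2 + 1))
        self.v_to_h = nn.Conv2d(2 * ch, 2 * ch, 1)  # vertical -> horizontal
        self.h_res = nn.Conv2d(ch, ch, 1)           # residual projection

    @staticmethod
    def _gate(x: torch.Tensor) -> torch.Tensor:
        f, g = x.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v: torch.Tensor, h: torch.Tensor):
        H, W = v.shape[-2:]
        v_out = self.v_conv(v)[:, :, :H, :]   # crop -> rows strictly above
        h_out = self.h_conv(h)[:, :, :, :W]   # crop -> pixels strictly left
        h_out = h_out + self.v_to_h(v_out)    # inject context from above
        return self._gate(v_out), h + self.h_res(self._gate(h_out))

# Usage: both stacks start from the same input feature map.
layer = GatedPixelCNNLayer(ch=64)
x = torch.randn(1, 64, 32, 32)
v, h = layer(x, x)
```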

Conditional PixelCNN

The aim is to model the conditional distribution of an image $x$ given a high-level description $h$, such as a class label or an embedding:

\[p(x|h) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1}, h)\]

What’s done:

- The conditioning vector $h$ is added as a bias inside every gated activation, via learned projections $V$:

\[y = \tanh(W_{k,f} * x + V_{k,f}^T h) \circ \sigma(W_{k,g} * x + V_{k,g}^T h)\]

- When $h$ carries spatial information, it is first mapped with a deconvolutional network to a spatial representation $s = m(h)$, which is combined with the feature maps through unmasked $1 \times 1$ convolutions.

Why:

- A single network can then generate images consistent with any given class label or embedding without changing the autoregressive structure, and the location-independent variant adds negligible computational cost. A sketch of the conditional gated activation follows.
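
A minimal PyTorch sketch of the conditional gated activation, assuming a global (non-spatial) conditioning vector $h$ such as a class embedding; the class and parameter names are illustrative, and masking is again omitted:

```python
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """Gated activation with a conditioning vector h:

        y = tanh(W_f * x + V_f h) ∘ σ(W_g * x + V_g h)

    V_f h and V_g h act as per-channel biases broadcast over all
    spatial positions, so this conditioning is location-independent.
    """

    def __init__(self, channels: int, h_dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.proj = nn.Linear(h_dim, 2 * channels, bias=False)  # V_f, V_g

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        out = self.conv(x) + self.proj(h)[:, :, None, None]  # broadcast h
        f, g = out.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)
```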

PixelCNN Auto-Encoders

What:

- The deterministic decoder of a conventional convolutional auto-encoder is replaced by a conditional PixelCNN, conditioned on the encoder's latent code $h$.

Why:

- Because the PixelCNN is a powerful generator on its own, the encoder is pushed to put high-level information into $h$ rather than low-level detail the decoder can fill in by itself.

Outcome:

- Trained end-to-end, the model learns qualitatively different representations from an MSE-trained auto-encoder; a sketch of the wiring follows.
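
A minimal sketch of the wiring, with hypothetical `encoder` and `decoder` modules standing in for a real convolutional encoder and a full conditional PixelCNN:

```python
import torch
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Sketch of the auto-encoder setup: the decoder is a conditional
    PixelCNN that is conditioned on the latent code h at every layer
    and trained by maximum likelihood.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # maps an image x to a latent code h
        self.decoder = decoder  # conditional PixelCNN: (x, h) -> logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        # Teacher forcing: during training the autoregressive decoder
        # predicts every pixel from the true preceding pixels plus h,
        # so a single forward pass scores the whole image.
        return self.decoder(x, h)

# The loss is the negative log-likelihood (softmax cross-entropy over
# the 256 intensity values of each sub-pixel), not pixelwise MSE.
```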

Experiments

Unconditional Modeling with Gated PixelCNN

Gated PixelCNN achieves competitive likelihood on CIFAR-10 and ImageNet, outperforming PixelCNN and matching PixelRNN with faster training. Its simplicity and parallelism allow it to scale better to large models.

Conditioning on ImageNet Classes

Conditioning on class labels ($p(x|h_i)$) had limited impact on log-likelihood but significantly improved visual quality. The model generated diverse and distinct samples across 1000 classes.

Conditioning on Portrait Embeddings

Using embeddings from a face recognition model, the conditional PixelCNN generated realistic portraits capturing facial features and variations. Linear interpolation between embeddings produced smooth identity morphs.

PixelCNN Auto-Encoders

An end-to-end PixelCNN auto-encoder trained on ImageNet patches learned high-level representations. Unlike MSE-trained auto-encoders, it captured abstract features, producing diverse reconstructions from the same latent code.

Future Work

Future directions include generating novel images of specific objects or animals from a single example, integrating Conditional PixelCNNs with variational inference to improve VAE decoders, and exploring image generation conditioned on captions rather than class labels. These extensions could further enhance controllability and realism in generative models.