Paper Review
Conditional Image Generation with PixelCNN Decoders
Introduction
This paper presents Conditional PixelCNN, an autoregressive model designed for conditional image generation. It builds on PixelCNN by introducing gated convolutions, boosting performance to match PixelRNN while requiring significantly less training time.
Unlike GANs, Conditional PixelCNN returns explicit probability densities, making it suitable for tasks like compression, probabilistic planning, and image restoration. By conditioning on class labels or latent embeddings, the model can generate diverse and realistic samples, from natural scenes like coral reefs and dogs, to different poses and expressions of a face given just a single image. It also proves effective as a decoder in autoencoder architectures, highlighting its versatility and practical potential.
Contributions
Gated PixelCNN: Improved architecture using gating and dual CNN stacks to remove blind spots and enhance efficiency.
Conditional Image Generation: Enabled class- and embedding-conditioned image generation with high diversity and realism.
Autoencoder Integration: Used PixelCNN as a powerful decoder, achieving high-quality reconstructions.
Model Architecture
Autoregressive Pixel Modelling
- PixelCNN models the joint distribution of all pixels in an image as a product of conditional probabilities:
\[p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})\]
- Each pixel $x_i$ is generated based on all previous pixels $x_1$ to $x_{i-1}$.
- The generation order follows a raster scan: row-by-row, left to right.
Masked Convolutions
To ensure that each pixel only sees past pixels (above and to the left), PixelCNN uses masked convolutional filters.
- The masks zero out the kernel weights that would reach future pixels, so those pixels never enter the receptive field and the output for a pixel only uses valid context.
- This maintains the autoregressive property while keeping the efficiency of convolutions.
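As an illustration, here is a minimal PyTorch sketch (ours, not from the paper) of a masked convolution. Mask type "A" (used in the first layer) also hides the centre pixel, while type "B" keeps it; the per-channel RGB masking described in the next subsection would refine this further.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Masked convolution sketch: weights on 'future' positions are zeroed."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        # Zero the centre row from the centre pixel (type A) or just after it (type B)...
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ...and zero every row below the centre.
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask before every forward pass so future pixels never leak in.
        self.weight.data *= self.mask
        return super().forward(x)
```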
Color Channel Conditioning
Each pixel has three color channels: Red (R), Green (G), and Blue (B). PixelCNN models them sequentially:
- Red (R) is predicted first.
- Green (G) is conditioned on R.
- Blue (B) is conditioned on both R and G.
This is done by splitting the feature maps at each layer and using separate masks for each channel.
Output: 256-Way Softmax
Each channel is modeled as a discrete distribution over 256 values (0–255), using a softmax classifier:
Each output is a probability distribution over all possible values:
\[p(x_{i}^{(c)} \mid \text{context}) = \text{Softmax}(f_\theta(\cdot))\]
The model's output has shape $N \times N \times 3 \times 256$ for an $N \times N$ image.
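For concreteness, a small PyTorch sketch of the 256-way softmax objective, assuming a hypothetical logit layout of $(\text{batch}, 3 \cdot 256, N, N)$ (the exact layout is an implementation choice, not specified above):

```python
import torch
import torch.nn.functional as F

batch, n = 8, 32
images = torch.randint(0, 256, (batch, 3, n, n))   # ground-truth integer pixel values
logits = torch.randn(batch, 3 * 256, n, n)         # stand-in for the network's output

# Rearrange to (batch, 256, 3, N, N) so dim=1 indexes the 256 intensity classes.
logits = logits.view(batch, 3, 256, n, n).permute(0, 2, 1, 3, 4)

# Each colour channel of each pixel is a 256-way classification problem;
# cross-entropy gives the negative log-likelihood used for training.
loss = F.cross_entropy(logits, images)
```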
Training vs Sampling
Training Time:
Since all pixels are available, all conditional probabilities can be computed in parallel using the masked convolutions.
Sampling Time:
Sampling must be sequential: each pixel (and channel) is predicted one at a time, as each depends on previous outputs.
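A sketch of the sequential sampling loop (PyTorch; `model` is a hypothetical module mapping an image to logits of shape $(1, 256, 3, N, N)$, matching the loss sketch above):

```python
import torch

@torch.no_grad()
def sample(model, n, device="cpu"):
    # Start from an empty image and fill it in raster-scan order.
    img = torch.zeros(1, 3, n, n, device=device)
    for row in range(n):
        for col in range(n):
            for ch in range(3):                                 # R, then G, then B
                logits = model(img)[:, :, ch, row, col]         # (1, 256)
                probs = torch.softmax(logits, dim=-1)
                pixel = torch.multinomial(probs, num_samples=1) # draw an intensity 0-255
                # Write the sample back (assuming the model expects inputs in [0, 1]).
                img[:, ch, row, col] = pixel.squeeze(-1).float() / 255.0
    return img
```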
Gated PixelCNN: Enhancing Convolutional Autoregressive Models
PixelRNN and PixelCNN are both autoregressive models designed to generate images by modeling pixel dependencies. While PixelCNN offers computational efficiency, PixelRNN has historically demonstrated superior generative performance. This performance gap stems from fundamental architectural differences, which the Gated PixelCNN aims to address.
Architectural Comparison and Motivation
PixelRNN
- Uses: Spatial LSTM layers with recurrent connections.
- Advantage: Each layer has access to the full context of preceding pixels, enabling effective long-range dependency modeling.
PixelCNN
- Uses: Stacks of masked convolutional layers.
- Limitation: The receptive field grows linearly with depth, restricting context in shallower layers.
Although deeper convolutional stacks can alleviate this limitation, PixelRNN maintains an advantage due to its built-in recurrence and gating mechanisms.
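As a rough illustration (the numbers are ours, not from the paper): with $n$ layers of $k \times k$ masked convolutions, the receptive field reaches only about
\[n \cdot \frac{k-1}{2}\]
rows above the current pixel, so covering the 63 rows above the last pixel of a $64 \times 64$ image with $3 \times 3$ filters takes on the order of 60 layers, whereas a single spatial LSTM layer already propagates information across all previous rows.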
Gated PixelCNN: Core Idea
To close the performance gap while retaining the efficiency of convolutions, Gated PixelCNN introduces gated activation units, inspired by mechanisms found in LSTM cells.
Activation Function
The ReLU activations in the original PixelCNN are replaced with a gated function:
\[y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)\]
where:
- $*$: Convolution operator
- $\sigma$: Sigmoid non-linearity
- $\odot$: Element-wise multiplication
- $W_{k,f}, W_{k,g}$: Learned filters at layer $k$
This architecture allows the network to control information flow more effectively, capturing complex interactions between pixels without relying on recurrence.
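A minimal PyTorch sketch of the gated activation unit. The convolution is left unmasked for brevity (in the actual model it would be a masked or shifted convolution), and producing both branches from one convolution is a common implementation choice rather than something mandated by the paper.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ⊙ σ(W_g * x), with both branches from one convolution."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        f, g = self.conv(x).chunk(2, dim=1)       # split into filter and gate branches
        return torch.tanh(f) * torch.sigmoid(g)   # element-wise gated output
```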
Related Work and Inspirations
Gated PixelCNN draws from several prior architectures that introduced gating mechanisms into feedforward networks:
- Highway Networks
- Grid LSTMs
- Neural GPUs
These models consistently show that gating enhances representational capacity and training stability.
Summary
| Model | Core Mechanism | Strength |
|---|---|---|
| PixelRNN | Spatial LSTMs with gating | Full context and expressive modelling |
| PixelCNN | Masked convolutions with ReLU | Efficient but limited context |
| Gated PixelCNN | Masked convolutions with gating | Improved interaction modelling and efficiency |
Blind spot in the receptive field
PixelCNNs traditionally suffer from a “blind spot” — certain regions of the image, especially pixels to the right of the current one, are inaccessible due to masked convolutions. This limits the model’s ability to fully capture contextual information during generation.
To address this, Gated PixelCNN introduces two parallel stacks:
- A vertical stack (unmasked), which captures information from all rows above the current pixel.
- A horizontal stack (masked), which looks at the current row up to the current pixel.
The two are combined at each layer, with the horizontal stack receiving information from the vertical stack. This design removes the blind spot without violating the autoregressive constraint (i.e., only conditioning on past pixels).
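A simplified PyTorch sketch of how the two stacks can be combined in one layer. Causality is enforced with a padding-and-cropping trick rather than explicit masks, gating and residual connections are omitted, and the horizontal branch excludes the current pixel (the first-layer case); this is an illustration under those assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class TwoStackLayer(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        # Vertical stack: tall, thin kernel, padded and cropped so each output
        # position only sees rows strictly above it.
        self.v_conv = nn.Conv2d(ch, ch, (k // 2 + 1, k), padding=(k // 2 + 1, k // 2))
        # Horizontal stack: one-row kernel, padded and cropped so each output
        # position only sees pixels strictly to its left in the same row.
        self.h_conv = nn.Conv2d(ch, ch, (1, k // 2 + 1), padding=(0, k // 2 + 1))
        # 1x1 convolution linking the vertical stack into the horizontal one.
        self.v_to_h = nn.Conv2d(ch, ch, 1)

    def forward(self, v, h):
        H, W = v.shape[-2:]
        v_out = self.v_conv(v)[..., :H, :]     # crop: keep only context from above
        h_out = self.h_conv(h)[..., :W]        # crop: keep only context from the left
        h_out = h_out + self.v_to_h(v_out)     # horizontal stack receives vertical context
        return v_out, h_out
```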
Conditional PixelCNN
The aim is to model the conditional distribution
\[p(x \mid h) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1}, h)\]
What's done:
The conditioning vector $h$ is added into the gated activation unit:
\[y = \tanh(W_{k,f} * x + V_{k,f}^T h) \odot \sigma(W_{k,g} * x + V_{k,g}^T h)\]
For spatially-aware conditioning, $h$ is first mapped to a spatial representation $s = m(h)$ (e.g., by a deconvolutional network), and the projections become 1×1 convolutions:
\[y = \tanh(W_{k,f} * x + V_{k,f} * s) \odot \sigma(W_{k,g} * x + V_{k,g} * s)\]
Why:
Enables image generation based on class labels or latent embeddings.
Location-dependent variant supports spatial guidance (e.g., object placement).
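A sketch of the conditional gated unit in PyTorch, for the location-independent case where $h$ is e.g. a class embedding. The convolution is again left unmasked for brevity; the location-dependent variant would replace the linear projection with a 1×1 convolution over the spatial map $s$.

```python
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """y = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h); V_f h and V_g h are
    location-independent, so they are broadcast over the spatial dimensions."""

    def __init__(self, channels, cond_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.cond_proj = nn.Linear(cond_dim, 2 * channels, bias=False)

    def forward(self, x, h):
        f, g = self.conv(x).chunk(2, dim=1)                            # filter / gate branches
        cf, cg = self.cond_proj(h)[:, :, None, None].chunk(2, dim=1)   # broadcast over H, W
        return torch.tanh(f + cf) * torch.sigmoid(g + cg)
```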
PixelCNN Auto-Encoders
What:
- A traditional convolutional autoencoder is modified by replacing the deconvolutional decoder with a conditional PixelCNN.
Why:
PixelCNN, being a strong generative model, improves image reconstruction.
Encourages the encoder to focus on high-level abstract features, since PixelCNN can handle low-level pixel details.
Outcome:
- An end-to-end trainable model where the encoder learns richer representations and the decoder models diverse outputs from $p(x \mid h)$.
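Schematically, the setup can be wired as below, with `encoder` and `conditional_pixelcnn` as placeholder modules (hypothetical names, not from the paper):

```python
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Auto-encoder sketch: a convolutional encoder produces h, and a
    conditional PixelCNN decoder models p(x | h); both are trained end-to-end
    with the autoregressive negative log-likelihood."""

    def __init__(self, encoder, conditional_pixelcnn):
        super().__init__()
        self.encoder = encoder                 # e.g. a small CNN mapping x -> h
        self.decoder = conditional_pixelcnn    # gated PixelCNN conditioned on h

    def forward(self, x):
        h = self.encoder(x)           # high-level representation of the input
        logits = self.decoder(x, h)   # per-pixel 256-way logits, conditioned on h
        return logits
```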