Paper Review
Conditional Image Generation with PixelCNN Decoders
Introduction
This paper presents Conditional PixelCNN, an autoregressive model designed for conditional image generation. It builds on PixelCNN by introducing gated convolutions, boosting performance to match PixelRNN while requiring significantly less training time.
Unlike GANs, Conditional PixelCNN returns explicit probability densities, making it suitable for tasks like compression, probabilistic planning, and image restoration. By conditioning on class labels or latent embeddings, the model can generate diverse and realistic samples, from natural scenes like coral reefs and dogs, to different poses and expressions of a face given just a single image. It also proves effective as a decoder in autoencoder architectures, highlighting its versatility and practical potential.
Contributions
Gated PixelCNN: Improved architecture using gating and dual CNN stacks to remove blind spots and enhance efficiency.
Conditional Image Generation: Enabled class- and embedding-conditioned image generation with high diversity and realism.
Autoencoder Integration: Used PixelCNN as a powerful decoder, achieving high-quality reconstructions.
Model Architecture
Autoregressive Pixel Modelling
- PixelCNN models the joint distribution of all pixels in an image as a product of conditional probabilities:
\[p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})\]
- Each pixel $x_i$ is generated based on all previous pixels $x_1$ to $x_{i-1}$.
- The generation order follows a raster scan: row-by-row, left to right.
Masked Convolutions
To ensure that each pixel only sees past pixels (above and to the left), PixelCNN uses masked convolutional filters.
- The masks zero out the kernel weights that would reach future pixels, so those pixels never enter the receptive field and the output for a pixel only uses valid context.
- This maintains the autoregressive property while keeping the efficiency of convolutions.
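As an illustration, here is a minimal PyTorch sketch (ours, not from the paper) of a masked convolution. Mask type "A" (used in the first layer) also hides the centre pixel, while type "B" keeps it; the per-channel RGB masking described in the next subsection would refine this further.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Masked convolution sketch: weights on 'future' positions are zeroed."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        # Zero the centre row from the centre pixel (type A) or just after it (type B)...
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ...and zero every row below the centre.
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask before every forward pass so future pixels never leak in.
        self.weight.data *= self.mask
        return super().forward(x)
```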
Color Channel Conditioning
Each pixel has three color channels: Red (R), Green (G), and Blue (B). PixelCNN models them sequentially:
- Red (R) is predicted first.
- Green (G) is conditioned on R.
- Blue (B) is conditioned on both R and G.
This is done by splitting the feature maps at each layer and using separate masks for each channel.
Output: 256-Way Softmax
Each channel is modeled as a discrete distribution over 256 values (0–255), using a softmax classifier:
Each output is a probability distribution over all possible values:
\[p(x_{i}^{(c)} \mid \text{context}) = \text{Softmax}(f_\theta(\cdot))\]
The model's output has shape $N \times N \times 3 \times 256$ for an $N \times N$ image.
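For concreteness, a small PyTorch sketch of the 256-way softmax objective, assuming a hypothetical logit layout of $(\text{batch}, 3 \cdot 256, N, N)$ (the exact layout is an implementation choice, not specified above):

```python
import torch
import torch.nn.functional as F

batch, n = 8, 32
images = torch.randint(0, 256, (batch, 3, n, n))   # ground-truth integer pixel values
logits = torch.randn(batch, 3 * 256, n, n)         # stand-in for the network's output

# Rearrange to (batch, 256, 3, N, N) so dim=1 indexes the 256 intensity classes.
logits = logits.view(batch, 3, 256, n, n).permute(0, 2, 1, 3, 4)

# Each colour channel of each pixel is a 256-way classification problem;
# cross-entropy gives the negative log-likelihood used for training.
loss = F.cross_entropy(logits, images)
```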
Training vs Sampling
Training Time:
Since all pixels are available, all conditional probabilities can be computed in parallel using the masked convolutions.
Sampling Time:
Sampling must be sequential: each pixel (and channel) is predicted one at a time, as each depends on previous outputs.
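A sketch of the sequential sampling loop (PyTorch; `model` is a hypothetical module mapping an image to logits of shape $(1, 256, 3, N, N)$, matching the loss sketch above):

```python
import torch

@torch.no_grad()
def sample(model, n, device="cpu"):
    # Start from an empty image and fill it in raster-scan order.
    img = torch.zeros(1, 3, n, n, device=device)
    for row in range(n):
        for col in range(n):
            for ch in range(3):                                 # R, then G, then B
                logits = model(img)[:, :, ch, row, col]         # (1, 256)
                probs = torch.softmax(logits, dim=-1)
                pixel = torch.multinomial(probs, num_samples=1) # draw an intensity 0-255
                # Write the sample back (assuming the model expects inputs in [0, 1]).
                img[:, ch, row, col] = pixel.squeeze(-1).float() / 255.0
    return img
```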
Gated PixelCNN: Enhancing Convolutional Autoregressive Models
PixelRNN and PixelCNN are both autoregressive models designed to generate images by modeling pixel dependencies. While PixelCNN offers computational efficiency, PixelRNN has historically demonstrated superior generative performance. This performance gap stems from fundamental architectural differences, which the Gated PixelCNN aims to address.
Architectural Comparison and Motivation
PixelRNN
- Uses: Spatial LSTM layers with recurrent connections.
- Advantage: Each layer has access to the full context of preceding pixels, enabling effective long-range dependency modeling.
PixelCNN
- Uses: Stacks of masked convolutional layers.
- Limitation: The receptive field grows linearly with depth, restricting context in shallower layers.
Although deeper convolutional stacks can alleviate this limitation, PixelRNN maintains an advantage due to its built-in recurrence and gating mechanisms.
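As a rough illustration (the numbers are ours, not from the paper): with $n$ layers of $k \times k$ masked convolutions, the receptive field reaches only about
\[n \cdot \frac{k-1}{2}\]
rows above the current pixel, so covering the 63 rows above the last pixel of a $64 \times 64$ image with $3 \times 3$ filters takes on the order of 60 layers, whereas a single spatial LSTM layer already propagates information across all previous rows.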
Gated PixelCNN: Core Idea
To close the performance gap while retaining the efficiency of convolutions, Gated PixelCNN introduces gated activation units, inspired by mechanisms found in LSTM cells.
Activation Function
The ReLU activations in the original PixelCNN are replaced with a gated function:
\[y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)\]
where:
- $*$: Convolution operator
- $\sigma$: Sigmoid non-linearity
- $\odot$: Element-wise multiplication
- $W_{k,f}, W_{k,g}$: Learned filters at layer $k$
This architecture allows the network to control information flow more effectively, capturing complex interactions between pixels without relying on recurrence.
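A minimal PyTorch sketch of the gated activation unit. The convolution is left unmasked for brevity (in the actual model it would be a masked or shifted convolution), and producing both branches from one convolution is a common implementation choice rather than something mandated by the paper.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ⊙ σ(W_g * x), with both branches from one convolution."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        f, g = self.conv(x).chunk(2, dim=1)       # split into filter and gate branches
        return torch.tanh(f) * torch.sigmoid(g)   # element-wise gated output
```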
Related Work and Inspirations
Gated PixelCNN draws from several prior architectures that introduced gating mechanisms into feedforward networks:
- Highway Networks
- Grid LSTMs
- Neural GPUs
These models consistently show that gating enhances representational capacity and training stability.
Summary
| Model | Core Mechanism | Strength |
|---|---|---|
| PixelRNN | Spatial LSTMs with gating | Full context and expressive modelling |
| PixelCNN | Masked convolutions with ReLU | Efficient but limited context |
| Gated PixelCNN | Masked convolutions with gating | Improved interaction modelling and efficiency |
Blind spot in the receptive field
PixelCNNs traditionally suffer from a “blind spot” — certain regions of the image, especially pixels to the right of the current one, are inaccessible due to masked convolutions. This limits the model’s ability to fully capture contextual information during generation.
To address this, Gated PixelCNN introduces two parallel stacks:
- A vertical stack (unmasked), which captures information from all rows above the current pixel.
- A horizontal stack (masked), which looks at the current row up to the current pixel.
The two are combined at each layer, with the horizontal stack receiving information from the vertical stack. This design removes the blind spot without violating the autoregressive constraint (i.e., only conditioning on past pixels).
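A simplified PyTorch sketch of how the two stacks can be combined in one layer. Causality is enforced with a padding-and-cropping trick rather than explicit masks, gating and residual connections are omitted, and the horizontal branch excludes the current pixel (the first-layer case); this is an illustration under those assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class TwoStackLayer(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        # Vertical stack: tall, thin kernel, padded and cropped so each output
        # position only sees rows strictly above it.
        self.v_conv = nn.Conv2d(ch, ch, (k // 2 + 1, k), padding=(k // 2 + 1, k // 2))
        # Horizontal stack: one-row kernel, padded and cropped so each output
        # position only sees pixels strictly to its left in the same row.
        self.h_conv = nn.Conv2d(ch, ch, (1, k // 2 + 1), padding=(0, k // 2 + 1))
        # 1x1 convolution linking the vertical stack into the horizontal one.
        self.v_to_h = nn.Conv2d(ch, ch, 1)

    def forward(self, v, h):
        H, W = v.shape[-2:]
        v_out = self.v_conv(v)[..., :H, :]     # crop: keep only context from above
        h_out = self.h_conv(h)[..., :W]        # crop: keep only context from the left
        h_out = h_out + self.v_to_h(v_out)     # horizontal stack receives vertical context
        return v_out, h_out
```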
Conditional PixelCNN
The aim is to model the conditional distribution
\[p(x \mid h) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1}, h)\]
What's done:
The conditioning vector $h$ is added into the gated activation unit:
\[y = \tanh(W_{k,f} * x + V_{k,f}^T h) \odot \sigma(W_{k,g} * x + V_{k,g}^T h)\]
For spatially-aware conditioning, $h$ is first mapped to a spatial representation $s = m(h)$ (e.g., by a deconvolutional network), and the projections become 1×1 convolutions:
\[y = \tanh(W_{k,f} * x + V_{k,f} * s) \odot \sigma(W_{k,g} * x + V_{k,g} * s)\]
Why:
Enables image generation based on class labels or latent embeddings.
Location-dependent variant supports spatial guidance (e.g., object placement).
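A sketch of the conditional gated unit in PyTorch, for the location-independent case where $h$ is e.g. a class embedding. The convolution is again left unmasked for brevity; the location-dependent variant would replace the linear projection with a 1×1 convolution over the spatial map $s$.

```python
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """y = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h); V_f h and V_g h are
    location-independent, so they are broadcast over the spatial dimensions."""

    def __init__(self, channels, cond_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.cond_proj = nn.Linear(cond_dim, 2 * channels, bias=False)

    def forward(self, x, h):
        f, g = self.conv(x).chunk(2, dim=1)                            # filter / gate branches
        cf, cg = self.cond_proj(h)[:, :, None, None].chunk(2, dim=1)   # broadcast over H, W
        return torch.tanh(f + cf) * torch.sigmoid(g + cg)
```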
PixelCNN Auto-Encoders
What:
- A traditional convolutional autoencoder is modified by replacing the deconvolutional decoder with a conditional PixelCNN.
Why:
PixelCNN, being a strong generative model, improves image reconstruction.
Encourages the encoder to focus on high-level abstract features, since PixelCNN can handle low-level pixel details.
Outcome:
- An end-to-end trainable model where the encoder learns richer representations and the decoder models diverse outputs from $p(x \mid h)$.
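Schematically, the setup can be wired as below, with `encoder` and `conditional_pixelcnn` as placeholder modules (hypothetical names, not from the paper):

```python
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Auto-encoder sketch: a convolutional encoder produces h, and a
    conditional PixelCNN decoder models p(x | h); both are trained end-to-end
    with the autoregressive negative log-likelihood."""

    def __init__(self, encoder, conditional_pixelcnn):
        super().__init__()
        self.encoder = encoder                 # e.g. a small CNN mapping x -> h
        self.decoder = conditional_pixelcnn    # gated PixelCNN conditioned on h

    def forward(self, x):
        h = self.encoder(x)           # high-level representation of the input
        logits = self.decoder(x, h)   # per-pixel 256-way logits, conditioned on h
        return logits
```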