CTGAN and VAE
Every SMOTE variant shares a fundamental assumption: that the space between two minority samples is valid minority territory. Pick a point, pick a neighbor, interpolate between them — the result is a realistic minority sample.
This assumption holds reasonably well for purely numerical features with smooth distributions. It fails the moment your minority class has features that don't interpolate cleanly: categorical columns, integer-only values, hard physical constraints between features, or strong non-linear correlations. Interpolating between two fraud transactions where one is a card-present transaction and one is card-not-present doesn't produce a realistic fraud case — it produces a fractional transaction type that doesn't exist.
Generative models don't interpolate. They learn the underlying distribution of the minority class and draw new samples from it. CTGAN and VAEs are the two main approaches for tabular minority class generation.
The Problem SMOTE Doesn't Know About
Consider a minority class with these features:
| Age | Income | Loan-to-Income Ratio | Has Mortgage | Default |
|---|---|---|---|---|
| 34 | 45,000 | 0.72 | Yes | Yes |
| 28 | 30,000 | 1.10 | No | Yes |
SMOTE with λ=0.5:
Age = (34 + 28) / 2 = 31
Income = (45000 + 30000) / 2 = 37500
Loan-to-Income = (0.72 + 1.10) / 2 = 0.91
Has Mortgage = (1 + 0) / 2 = 0.5 ← not a valid value
The synthetic sample has a half-mortgage. The classifier sees it during training, learns that 0.5 is a valid state for a binary feature, and the learned representation degrades. The more categorical features your dataset has, the worse SMOTE's synthetic samples get.
There's also a subtler issue: feature correlations. Loan-to-Income is determined by Loan Amount and Income, and SMOTE doesn't know about that constraint. The two rows above imply loan amounts of 32,400 and 33,000, so a sample that interpolates every column independently ends up with Income=37,500, Loan Amount=32,700, and Loan-to-Income=0.91, even though 32,700 / 37,500 = 0.872. These inconsistencies add noise to the training set, and a flexible model can end up fitting them as if they were real signal.
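The arithmetic above can be reproduced in a few lines. This is a minimal sketch in plain Python with the two example rows hard-coded; the `interpolate` helper and the implied loan amounts (ratio × income) are assumptions made for illustration, not part of any SMOTE library.

```python
# Two minority rows from the table, with Has Mortgage encoded as 1/0
row_a = {"age": 34, "income": 45_000, "loan_to_income": 0.72, "has_mortgage": 1}
row_b = {"age": 28, "income": 30_000, "loan_to_income": 1.10, "has_mortgage": 0}


def interpolate(a, b, lam):
    """SMOTE-style interpolation: a + lam * (b - a), applied to every feature."""
    return {key: a[key] + lam * (b[key] - a[key]) for key in a}


synthetic = interpolate(row_a, row_b, lam=0.5)
print(synthetic)

# Loan amounts implied by each real row (ratio * income); assumed for illustration
loan_a = row_a["loan_to_income"] * row_a["income"]
loan_b = row_b["loan_to_income"] * row_b["income"]
synthetic_loan = loan_a + 0.5 * (loan_b - loan_a)

# The ratio recomputed from the interpolated loan and income
# no longer agrees with the interpolated ratio column
recomputed_ratio = synthetic_loan / synthetic["income"]
print(f"interpolated ratio: {synthetic['loan_to_income']:.3f}")
print(f"recomputed ratio:   {recomputed_ratio:.3f}")
```

Both failure modes show up at once: the binary column lands on a value that is neither 0 nor 1, and the derived ratio stops matching the columns it is supposed to be derived from.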
CTGAN: A GAN Built for Tabular Data
CTGAN (Conditional Tabular GAN) is a generative adversarial network designed specifically for mixed-type tabular data. The original paper (Xu et al., 2019) identifies two core problems with applying standard GANs to tables:
Multi-modal, non-Gaussian continuous columns. Continuous features in tabular data often have multiple modes: transaction amounts spike at round numbers, ages cluster at decade boundaries. A standard GAN tends to miss the smaller modes and collapse onto the most common one. CTGAN uses mode-specific normalization: it fits a variational Gaussian mixture to each continuous column and represents each value as the mode it belongs to plus its normalized offset within that mode.
Imbalanced categorical columns. When one category dominates a column, uniformly sampled training batches rarely contain the rare categories, so the generator never learns their conditional distribution well because it sees so few real examples. CTGAN conditions the generator on a conditional vector and samples training rows so that every category, including the rare ones, gets regular exposure.
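Both mechanisms can be sketched in plain Python. Treat this as an illustration under stated assumptions rather than CTGAN's actual code: the mixture parameters are hard-coded instead of fitted, the most likely mode is chosen deterministically (the paper samples it), the `log(count + 1)` smoothing is an assumption of this sketch, and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mode_specific_normalize(x, weights, means, stds):
    """Encode x as (alpha, beta): offset within its mode plus a one-hot mode indicator."""
    # Weighted density of x under each mixture component
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    # Simplification: take the most likely mode (the paper samples it instead)
    k = max(range(len(densities)), key=lambda i: densities[i])
    alpha = (x - means[k]) / (4 * stds[k])
    beta = [1 if i == k else 0 for i in range(len(means))]
    return alpha, beta


def log_frequency_probs(counts):
    """Sampling probability per category, proportional to log(count + 1)."""
    logs = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(logs.values())
    return {cat: value / total for cat, value in logs.items()}


# Hypothetical mixture for a transaction-amount column:
# a large mode near 20 and a smaller one near 100
alpha, beta = mode_specific_normalize(
    95.0, weights=[0.7, 0.3], means=[20.0, 100.0], stds=[5.0, 10.0]
)
print(alpha, beta)

# Hypothetical category counts: 1% of rows are card-not-present
counts = {"card_present": 9900, "card_not_present": 100}
probs = log_frequency_probs(counts)
print(probs)
```

The value 95 is encoded relative to the mode near 100 rather than squeezed into one global scale, and the rare category is drawn as a training condition far more often than its 1% raw frequency would allow.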
```python
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata

# minority_df: DataFrame holding only the minority-class rows
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(minority_df)

synthesizer = CTGANSynthesizer(
    metadata,
    epochs=300,
    batch_size=500,
    verbose=True
)
synthesizer.fit(minority_df)

synthetic_minority = synthesizer.sample(num_rows=1000)
```

The metadata object tells CTGAN which columns are categorical, which are numerical, and their types, which lets it apply the right normalization and sampling strategy per column.
VAE: Learning the Latent Distribution
A Variational Autoencoder takes a different approach. Rather than a two-player game (generator vs discriminator), it trains a single encoder-decoder network with a regularized latent space.
The encoder compresses each input sample into a probability distribution in latent space (typically Gaussian). The decoder reconstructs samples from points drawn from that distribution. The training objective balances reconstruction quality against how close the latent distributions stay to a standard normal.
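Written out, the objective described above takes the following form. This assumes a Gaussian encoder with mean $\mu$ and diagonal variance $\sigma^2$, a standard normal prior, and a squared-error reconstruction term, which are the same choices the code in this section makes; the symbols are notation introduced here for illustration.

```latex
\mathcal{L}(x) =
  \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)}_{\text{regularization}},
\qquad
D_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

Here $\hat{x}$ is the decoder output and $d$ is the latent dimension. The KL term has this closed form only because both distributions are Gaussian.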
After training on minority class samples, the VAE latent space represents the distribution of minority patterns. Sampling from this latent space and decoding gives new, realistic minority samples.
```python
import torch
import torch.nn as nn


class TabularVAE(nn.Module):
    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU()
        )
        # Two heads: mean and log-variance of the latent Gaussian
        self.mu = nn.Linear(32, latent_dim)
        self.log_var = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.mu(h), self.log_var(h)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var


def vae_loss(recon_x, x, mu, log_var):
    # Reconstruction error plus KL divergence to a standard normal prior
    recon = nn.functional.mse_loss(recon_x, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```

To generate synthetic minority samples after training:

```python
# model: a trained TabularVAE; n_samples and latent_dim are set by the caller
model.eval()
with torch.no_grad():
    z = torch.randn(n_samples, latent_dim)
    synthetic = model.decoder(z).numpy()
```

The reparameterization trick is what makes training work: by expressing the sampling as `mu + eps * std` where `eps ~ N(0,1)`, gradients can flow through the stochastic sampling step.
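The snippets above assume a model that has already been trained. A minimal training loop might look like the sketch below. It repeats a compact stand-in for the model (here called `SmallVAE`, a hypothetical name) so the block runs on its own, and it trains on random placeholder data rather than real minority rows. The layer sizes, learning rate, and epoch count are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn


class SmallVAE(nn.Module):
    """Compact stand-in for TabularVAE so this block is self-contained."""

    def __init__(self, input_dim, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU())
        self.mu = nn.Linear(16, latent_dim)
        self.log_var = nn.Linear(16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization: the randomness lives in eps, so the path
        # from the loss back to mu and log_var stays differentiable
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * log_var)
        return self.decoder(z), mu, log_var


def vae_loss(recon_x, x, mu, log_var):
    recon = nn.functional.mse_loss(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl


torch.manual_seed(0)
# Placeholder data: 200 rows, 5 numeric features (not real minority data)
X = torch.randn(200, 5)

model = SmallVAE(input_dim=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
model.train()
for epoch in range(50):
    optimizer.zero_grad()
    recon, mu, log_var = model(X)  # full-batch pass; use a DataLoader for larger data
    loss = vae_loss(recon, X, mu, log_var)
    loss.backward()                # gradients reach the encoder through mu and log_var
    optimizer.step()
    losses.append(loss.item())

print(f"first epoch loss: {losses[0]:.1f}, last epoch loss: {losses[-1]:.1f}")
```

On real data, the numeric columns would typically be standardized before training and the decoder output mapped back to the original scale after sampling.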
CTGAN vs VAE: What to Choose
These aren't interchangeable — they have different strengths.
CTGAN handles mixed data types natively. The conditional vector mechanism means it respects categorical feature distributions. The mode-specific normalization means it captures multi-modal continuous distributions. The downside: GAN training is unstable and mode collapse is still possible despite CTGAN's mitigations. Training requires tuning and patience.
VAEs train more stably than GANs. The latent space is smooth and interpolable, so you can do arithmetic in latent space to explore the minority distribution. The downside: for categorical features, a plain MSE reconstruction objective isn't appropriate. Handling mixed types requires careful output head design (MSE for continuous columns, cross-entropy for categorical ones). The SDV library's TVAESynthesizer handles this.
```python
from sdv.single_table import TVAESynthesizer

synthesizer = TVAESynthesizer(metadata, epochs=300)
synthesizer.fit(minority_df)

synthetic_minority = synthesizer.sample(num_rows=1000)
```

Validating Synthetic Sample Quality
This is where generative approaches have a problem that SMOTE variants don't: you can't just check the class distribution after fitting. You need to validate that synthetic samples are realistic.
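```python
import pandas as pd
from scipy.stats import ks_2samp

for col in minority_df.select_dtypes(include='number').columns:
    # Drop missing values so NaNs don't distort the comparison
    stat, p = ks_2samp(minority_df[col].dropna(), synthetic_minority[col].dropna())
    print(f"{col}: KS statistic={stat:.3f}, p={p:.3f}")
```

The Kolmogorov-Smirnov test checks whether the synthetic column's distribution matches the real one. The statistic is the largest gap between the two empirical CDFs, so values near 0 mean the distributions are close. A high p-value means the test found no evidence of a difference; it is not proof that the distributions match, and with large samples even small gaps produce low p-values, so read both numbers together. Large statistics with low p-values on several columns signal that the generative model hasn't learned the minority distribution well enough: train longer or adjust the architecture.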
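To make the number reported by `ks_2samp` less opaque, here is a small pure-Python sketch of the two-sample KS statistic itself: the largest vertical gap between the two empirical CDFs. The `ks_statistic` helper and the income values are made up for illustration, and the sketch leaves out the p-value calculation that SciPy also performs.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest vertical gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    gap = 0.0
    # Both CDFs are step functions, so checking every observed value is enough
    for value in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


# Made-up income samples for illustration
real_income = [30_000, 37_500, 45_000, 52_000, 61_000]
close_match = [31_000, 38_000, 44_000, 53_000, 60_000]
shifted = [80_000, 85_000, 90_000, 95_000, 99_000]

print(ks_statistic(real_income, real_income))  # identical samples
print(ks_statistic(real_income, close_match))  # similar distribution
print(ks_statistic(real_income, shifted))      # no overlap at all
```

The statistic runs from 0 for identical samples to 1 for samples that do not overlap, which makes it easier to compare across columns than the p-value alone.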
For categorical features:
```python
for col in minority_df.select_dtypes(include='object').columns:
    real_dist = minority_df[col].value_counts(normalize=True)
    synth_dist = synthetic_minority[col].value_counts(normalize=True)
    print(f"\n{col}")
    print(pd.concat([real_dist, synth_dist], axis=1, keys=['Real', 'Synthetic']))
```

Where Generative Models Break
They require enough minority samples to learn from. With fewer than ~100 minority examples, CTGAN and VAEs tend to overfit or produce degenerate outputs. SMOTE variants can work with as few as 5-10 samples (poorly, but they work). Generative models need enough data to actually learn a distribution.
Training instability (CTGAN). GANs can fail to converge, produce mode-collapsed outputs, or have the generator and discriminator drift out of sync. This isn't a theoretical concern: it happens in practice and requires monitoring loss curves during training.
Feature constraint violations. Even generative models don't automatically respect hard constraints between features (loan-to-income must equal loan/income). Post-generation validation and correction is often necessary.
Compute and iteration speed. Training 300 epochs of a GAN or VAE is significantly more expensive than running SMOTE. In rapid experimentation loops, this overhead breaks the feedback cycle.
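The post-generation correction mentioned under feature constraint violations can be as simple as recomputing derived columns and discarding rows that cannot be repaired. The sketch below uses plain dictionaries; the column names (`loan_amount`, `income`, `loan_to_income`), the tolerance, and the `repair_loan_to_income` helper are all hypothetical, chosen to match the running example.

```python
def repair_loan_to_income(rows, tolerance=0.01):
    """Recompute the derived ratio and drop rows that cannot be repaired."""
    repaired = []
    for row in rows:
        if row["income"] <= 0 or row["loan_amount"] < 0:
            continue  # impossible values: discard rather than guess
        expected = row["loan_amount"] / row["income"]
        fixed = dict(row)
        # Record whether the generator's ratio disagreed with the recomputed one
        fixed["was_inconsistent"] = abs(row["loan_to_income"] - expected) > tolerance
        fixed["loan_to_income"] = round(expected, 4)
        repaired.append(fixed)
    return repaired


# Made-up synthetic rows for illustration
synthetic_rows = [
    {"income": 37_500, "loan_amount": 32_700, "loan_to_income": 0.91},  # ratio is off
    {"income": 40_000, "loan_amount": 30_000, "loan_to_income": 0.75},  # consistent
    {"income": -5_000, "loan_amount": 10_000, "loan_to_income": 0.40},  # invalid income
]

clean = repair_loan_to_income(synthetic_rows)
for row in clean:
    print(row)
```

Tracking the share of rows flagged as inconsistent is also a cheap quality signal: a high rate suggests the generator has not learned the relationship between the columns.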
When to Reach for Generative Models
Generative models are the right choice when:
- Dataset has categorical features that SMOTE handles poorly — especially high-cardinality categoricals
- You've validated that SMOTE variants produce statistically unrealistic synthetic samples (check distributions)
- Minority class size is large enough to train a generative model (100+ samples minimum, 500+ comfortably)
- Your domain requires interpretable synthetic samples — healthcare data synthesis for testing, fraud scenario generation for red-teaming
- The inter-feature correlations in your minority class are strong enough that linear interpolation destroys them
The fundamental shift generative models offer is this: instead of asking "what's between these two minority points?", they ask "what does a minority point from this distribution look like?" That's a more honest question about the structure of your data, and on complex datasets it produces better training signal.
The tradeoff is real, though. These methods demand more data, more compute, more validation, and more expertise to use correctly. A well-chosen SMOTE variant will usually outperform a poorly trained CTGAN. Use generative models when the simpler methods have proven insufficient, not as the first tool you reach for.