Scarf: Self Supervised Learning for Tabular Data Nov 25, 2024

Machine learning often struggles with the scarcity of labeled data. While unlabeled datasets are abundant, obtaining high-quality labeled data remains expensive and time-consuming. SCARF emerges as a breakthrough methodology that transforms how we extract meaningful representations from raw, untagged information.

Innovation

SCARF introduces a novel approach to generating data views by strategically corrupting input features. Unlike domain-specific augmentation techniques, this method works universally across datasets:

Randomly select a subset of features
Replace selected features with samples from their empirical marginal distributions
Leverage contrastive learning to maximize representation quality

Technical Mechanism

The approach employs:

An encoder network (f)
A pre-training head network (g)
InfoNCE contrastive loss function

The core objective is generating representations that maintain semantic integrity while being robust to feature perturbations.

Tested across 72 real-world classification datasets, SCARF demonstrated significant improvements:

Enhanced classification accuracy
Improved performance under label noise
Effective semi-supervised learning capabilities
Domain-agnostic representation learning
Minimal sensitivity to feature scaling
Hyperparameter stability
Compatibility with existing machine learning techniques

SCARF represents a pivotal advancement in self-supervised learning, offering a flexible, powerful approach to extracting meaningful representations with minimal labeled data.