SCARF: Self-Supervised Learning for Tabular Data

Labeled data is often scarce in machine learning. Unlabeled datasets are abundant, but obtaining high-quality labels remains expensive and time-consuming. SCARF (Self-Supervised Contrastive Learning using Random Feature Corruption) addresses this gap for tabular data: it pre-trains an encoder on unlabeled rows so that useful representations are available before any labels are used.

Innovation

SCARF generates a second view of each example by corrupting a random subset of its features. Unlike domain-specific augmentations such as image cropping or color distortion, this procedure applies to any tabular dataset:

  • Randomly select a subset of features
  • Replace selected features with samples from their empirical marginal distributions
  • Train with a contrastive loss so that each example and its corrupted view map to similar representations
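
The corruption step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name scarf_corrupt, the default corruption rate of 0.6, and the choice to approximate each feature's empirical marginal by resampling from the current batch (rather than from the full training set) are all assumptions made for the example.

```python
import numpy as np

def scarf_corrupt(x, corruption_rate=0.6, rng=None):
    """Return (corrupted_view, mask) for a batch x of shape (n_samples, n_features).

    Hypothetical helper: the name, default rate, and batch-level
    marginal sampling are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    n_corrupt = int(corruption_rate * d)

    # Pick n_corrupt distinct features per row: argsort of random noise
    # yields an independent random permutation of column indices per row.
    perm = rng.random((n, d)).argsort(axis=1)
    mask = np.zeros((n, d), dtype=bool)
    np.put_along_axis(mask, perm[:, :n_corrupt], True, axis=1)

    # Sample replacements from each feature's empirical marginal:
    # entry (i, j) takes column j's value from a randomly chosen row.
    donor_rows = rng.integers(0, n, size=(n, d))
    x_random = x[donor_rows, np.arange(d)]

    # Keep original values where mask is False, replacements where True.
    return np.where(mask, x_random, x), mask
```

Because replacements are drawn column by column, every corrupted value is still a value that feature actually takes in the data, which is what makes the scheme independent of feature types and scales.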

Technical Mechanism

The approach employs:

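  • An encoder network (f) that maps each input to a representation
  • A pre-training head network (g), applied on top of f during pre-training and discarded afterward
  • An InfoNCE contrastive loss computed on the outputs of g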

The objective is to learn representations that keep an example close to its own corrupted view while pushing it away from the other examples in the batch, so that the representation is robust to feature perturbations.
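
To make the loss concrete, here is a minimal NumPy sketch of an InfoNCE objective over a batch. It assumes z_anchor holds g(f(x)) for the original examples and z_view holds g(f(x_corrupted)) for their corrupted views, with row i of each array forming the positive pair. The function name info_nce, the use of cosine similarity, and the default temperature of 1.0 are assumptions for illustration rather than a statement of the reference implementation.

```python
import numpy as np

def info_nce(z_anchor, z_view, temperature=1.0):
    """InfoNCE loss for two (n_samples, dim) arrays of projected embeddings.

    Hypothetical helper: row i of z_anchor and row i of z_view are the
    positive pair; every other row in z_view acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    z_v = z_view / np.linalg.norm(z_view, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = z_a @ z_v.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The diagonal holds the log-probability assigned to the true pair.
    return -np.mean(np.diag(log_prob))
```

The loss is low when each anchor is most similar to its own corrupted view and high when it is more similar to another example's view, which is the behavior the objective describes.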

Evaluated on 69 real-world classification datasets from the OpenML-CC18 benchmark, SCARF showed the following benefits:

  • Enhanced classification accuracy
  • Improved performance under label noise
  • Effective semi-supervised learning capabilities
  • Domain-agnostic representation learning
  • Minimal sensitivity to feature scaling
  • Hyperparameter stability
  • Compatibility with existing machine learning techniques

SCARF offers a simple, flexible approach to self-supervised learning on tabular data, producing useful representations when labeled data is scarce.