
Summary
Neuralk is releasing TabBench V2, the new version of its open evaluation suite for tabular classification 📊
LLMs have shown that diverse evaluations drive progress. In the tabular foundation model (TFM) space, the leading benchmarks today target different aspects of the problem. TabArena provides an AutoGluon-powered benchmark with tuning and ensembling, while LAMDA TALENT emphasises runtime efficiency alongside predictive performance. MultiTab broadens the lens with a rich set of auxiliary evaluation metrics, TabRed focuses primarily on numerically dominated datasets, and CARTE’s benchmark targets categorically dominated datasets.
As practitioners deploying tabular foundation models in production, we at Neuralk identified an unmet need: existing benchmarks did not adequately capture real enterprise use cases. To bring evaluation closer to the realities of business applications, we launched TabBench V1 last year with 50 carefully curated datasets. Today, we are significantly expanding that effort with TabBench V2, which brings together 189 datasets covering a wide spectrum of industries and practical machine learning tasks.
TabBench V2 includes 189 OpenML classification datasets spanning healthcare, finance and insurance, industry and science, retail and behavioral data, computer vision, gaming and synthetic data, social and public-sector applications, and more. Beyond accuracy, the gold standard of business metrics, TabBench V2 enables deeper analysis of model behavior, including calibration, robustness to class imbalance, across different industry verticals. The benchmark compares a broad range of approaches, from tree ensembles and tuned neural networks to the latest generation of pre-trained tabular foundation models.


Beyond competitive accuracy, evaluating our model, Seldon, highlights its baseline operational reliability and speed across the suite:
Benchmarks are only as valuable as the community that uses them, challenges them, and helps them grow. Did any of the TabBench results surprise you? Does the leaderboard match what you see on your own datasets? We're building TabBench as a community resource, and we'd love your input.
→ Contribute a dataset: if you know of a classification dataset that captures a missing industry, challenge, or data modality, help us make TabBench even more representative.
pip install tabbench
→ Benchmark your model and see how your approach stacks up against the current leaderboard.
Check out all the visualisations and segmentations of all the performance results across 189 model-benchmark pairs at 🔗 : https://huggingface.co/spaces/Neuralk-AI/tabbench