Overview
Foundation models (large models pretrained on broad data sources and adaptable to diverse downstream tasks) have transformed areas such as natural language processing and computer vision [1]. With the emergence of tabular foundation models, this trend has begun to percolate into tabular data analysis, a field traditionally dominated by classical statistical methods.
A leading example is TabPFN v2 [2-5], a transformer-based neural network pretrained entirely on synthetic data. It is claimed to outperform all previous methods for regression and classification on datasets with up to 10,000 samples, while also supporting data generation, density estimation, reusable embeddings, and fine-tuning. If these claims hold broadly, TabPFN has the potential to supersede existing approaches across a wide range of statistical tasks, mirroring the LLM revolution in natural language processing.
What is in-context learning?
In-context learning (ICL), originally observed as a property of LLMs, can be reframed statistically as a form of meta-learning: instead of fitting a separate model for each dataset, one pretrains a model to learn a mapping M from the space of datasets to a space of possible values for a quantity of interest (e.g., regression functions). Given a new dataset, M can be applied immediately to yield an estimate — with no retraining required.
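In symbols (a minimal formalization; the notation here is ours, not from [2-5]): writing $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ for a dataset, pretraining produces a single map

$$M \colon \bigcup_{n \ge 1} (\mathcal{X} \times \mathcal{Y})^{n} \to \mathcal{F}, \qquad \hat{f} = M(D_n),$$

so a new dataset is plugged into $M$ rather than used to refit any parameters.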
TabPFN combines ICL with a Bayesian perspective, implementing it as amortized Bayesian inference:
- A prior is placed on the space of joint distributions over covariate–label pairs.
- A given dataset is assumed to be drawn i.i.d. from a fixed joint distribution.
- Given a new test point x, the model approximates the posterior predictive distribution for its label by integrating over all plausible data-generating distributions consistent with the observed data (formalized below).
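In symbols (the standard amortized posterior-predictive target; notation as above, with $\theta$ indexing data-generating distributions under the prior $p(\theta)$):

$$p(y \mid x, D_n) = \int p(y \mid x, \theta)\, p(\theta \mid D_n)\, d\theta, \qquad p(\theta \mid D_n) \propto p(\theta) \prod_{i=1}^{n} p(x_i, y_i \mid \theta).$$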
While most posterior approximation techniques target a single fixed dataset, TabPFN is pretrained to approximate this mapping jointly over all datasets and test points. Because pretraining uses synthetic datasets generated from the prior, TabPFN is called a prior-fitted network (PFN).
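To make the pretraining recipe concrete, here is a deliberately tiny sketch in PyTorch. Everything in it (the linear-regression prior, the two-layer encoder, the squared-error head) is an illustrative stand-in rather than TabPFN's actual design, which uses a much richer structural-causal-model prior, attention masking between context and query tokens, and a distributional output head:

```python
import torch
import torch.nn as nn

def sample_dataset_from_prior(n=64, d=4):
    """Draw one synthetic task from a toy prior: a random linear function plus noise."""
    w = torch.randn(d)
    X = torch.randn(n, d)
    y = X @ w + 0.1 * torch.randn(n)
    return X, y

class TinyPFN(nn.Module):
    """Toy prior-fitted network: attends over (x, y) context pairs and query x's."""
    def __init__(self, d, width=64):
        super().__init__()
        self.embed_ctx = nn.Linear(d + 1, width)  # embed (x_i, y_i) context tokens
        self.embed_qry = nn.Linear(d, width)      # embed query x tokens (label unknown)
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(width, 1)           # point prediction; TabPFN outputs a distribution

    def forward(self, X_ctx, y_ctx, X_qry):
        ctx = self.embed_ctx(torch.cat([X_ctx, y_ctx[:, None]], dim=-1))
        qry = self.embed_qry(X_qry)
        tokens = torch.cat([ctx, qry], dim=0)[None]   # one dataset = one token sequence
        out = self.encoder(tokens)[0, ctx.shape[0]:]  # (no attention masking here, unlike real PFNs)
        return self.head(out).squeeze(-1)

model = TinyPFN(d=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):  # pretraining: a fresh synthetic dataset per step
    X, y = sample_dataset_from_prior()
    X_ctx, y_ctx, X_qry, y_qry = X[:48], y[:48], X[48:], y[48:]
    loss = ((model(X_ctx, y_ctx, X_qry) - y_qry) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The essential point is the training loop: the loss is averaged over datasets drawn from the prior, so the network is fitted to the prior itself rather than to any one dataset.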
Evaluating TabPFN beyond supervised prediction
In recent work [6], we evaluated the capabilities of TabPFN beyond supervised prediction. Strikingly, we found that when used out of the box, it outperforms specialized methods in:
- Semi-supervised parameter estimation
- Prediction under covariate shift
- Heterogeneous treatment effect estimation
It even surpasses LASSO in sparse regression, despite not being designed for linear sparsity, and breaks robustness–efficiency trade-offs in classification. These findings suggest that TabPFN, and more broadly the ICL/PFN learning paradigm, could indeed supersede existing approaches across a wide range of statistical tasks, with broad methodological and practical implications.
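A minimal sketch of the corresponding "out of the box" usage, assuming the tabpfn Python package's scikit-learn-style interface, on a sparse linear problem of the kind where TabPFN was compared against LASSO (the data-generating setup below is illustrative, not the benchmark from [6]):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from tabpfn import TabPFNRegressor

rng = np.random.default_rng(0)
n, d, k = 200, 50, 5                  # 200 samples, 50 features, only 5 active
beta = np.zeros(d)
beta[:k] = rng.normal(size=k)         # sparse linear signal
X = rng.normal(size=(n, d))
y = X @ beta + 0.5 * rng.normal(size=n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta

lasso = LassoCV().fit(X, y)
pfn = TabPFNRegressor().fit(X, y)     # in-context: "fit" stores the data, no retraining
print("LASSO  MSE:", np.mean((lasso.predict(X_test) - y_test) ** 2))
print("TabPFN MSE:", np.mean((pfn.predict(X_test) - y_test) ** 2))
```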
Key paper: Zhang, Q., Tan, Y. S., Tian, Q., and Li, P. (2025). TabPFN: One Model to Rule Them All? arXiv preprint (major revision at JASA). Paper link
PFN for clustering
Clustering is a fundamental unsupervised learning task, but it becomes substantially more challenging in realistic settings where both the number of clusters and the cluster assignments are unknown. In contrast to supervised learning, there is no ground-truth label structure available during training, and the underlying grouping of the data must be inferred entirely from the observed features. This creates a joint inference problem: one must simultaneously determine how many clusters are present and how each data point should be assigned. These two goals are tightly coupled, yet most classical methods treat them separately—for example, by fixing the number of clusters in advance or selecting it through external model selection criteria such as BIC, cross-validation, or heuristic elbow methods. In practice, however, there is no universally reliable criterion for selecting the number of clusters, and the absence of ground truth also makes it difficult to define a meaningful and consistent evaluation metric.
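For concreteness, the classical two-step route described above fixes candidate values of K, fits a model for each, and picks K by an external criterion such as BIC; a minimal scikit-learn sketch (with stand-in data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))  # stand-in data
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 11)}
k_hat = min(bic, key=bic.get)   # select K in a separate step...
labels = GaussianMixture(n_components=k_hat, random_state=0).fit_predict(X)  # ...then refit
```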
We use the prior-fitted network (PFN) framework to address this challenge from a different perspective. PFNs are trained on large collections of synthetic tasks generated from a flexible prior, allowing them to learn how to perform inference across a wide range of data-generating mechanisms. This shifts the focus from designing a single clustering algorithm to learning a general inference procedure that can adapt to new datasets. However, directly applying PFNs to clustering is nontrivial. Unlike standard supervised tasks, clustering provides no explicit labels during training, and cluster identities are only identifiable up to permutation. Moreover, the output space itself is variable, since the number of clusters changes across datasets. This makes it difficult to directly define a supervised learning objective or a fixed prediction space for a neural network.
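The permutation issue in particular means that any objective or evaluation must score partitions rather than raw label values; the adjusted Rand index is one standard permutation-invariant choice:

```python
from sklearn.metrics import adjusted_rand_score

# Relabeling the clusters does not change the partition, so the score is perfect:
print(adjusted_rand_score([0, 0, 1, 1, 2], [2, 2, 0, 0, 1]))  # 1.0
```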
In TabClustPFN [7], our key idea is to reformulate clustering as a joint structured prediction problem, where the model must infer both the latent number of clusters and the corresponding cluster memberships simultaneously. Instead of treating the number of clusters as a separate preprocessing or model selection step, we embed it directly into the prediction target. We achieve this by constructing carefully designed prior datasets in which both the number of clusters and the cluster structures vary, enabling the PFN to learn a mapping from raw data to latent partition structure in a fully amortized manner. In this framework, the model learns to implicitly compare and reason over different clustering configurations, rather than relying on external selection criteria.
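The flavor of such a prior can be illustrated with a toy sampler in which each synthetic task draws its own number of clusters and cluster shapes (a stand-in for, not a description of, the prior actually constructed in [7]):

```python
import numpy as np

def sample_clustering_task(rng, n=256, d=8, k_max=10):
    """One synthetic pretraining task: features plus latent (K, assignments) targets."""
    k = rng.integers(2, k_max + 1)              # the number of clusters varies per task
    means = rng.normal(scale=3.0, size=(k, d))  # cluster centers
    scales = rng.uniform(0.5, 1.5, size=k)      # per-cluster spread
    z = rng.integers(0, k, size=n)              # assignments (identifiable only up to permutation)
    X = means[z] + scales[z, None] * rng.normal(size=(n, d))
    return X, z, k

X, z, k = sample_clustering_task(np.random.default_rng(0))
```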
This leads to a unified inference procedure: given a new dataset, the trained PFN outputs both an estimate of the number of clusters and the corresponding cluster assignments in a single forward pass, without iterative optimization or repeated model fitting. The resulting approach is fast, adaptive, and robust across a wide range of clustering regimes, and it reframes clustering from a problem of algorithm design into a problem of learning how to infer structure directly from data.
Key paper: Zhao, T., Wang, G., Tan, Y. S., and Zhang, Q. (2025). TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering. arXiv preprint. Paper link
References
1. Bommasani, R. et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258.
2. Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). TabPFN: a transformer that solves small tabular classification problems in a second. NeurIPS 2022 First Table Representation Workshop.
3. Nagler, T. (2023). Statistical foundations of prior-data fitted networks. International Conference on Machine Learning (ICML 2023), pp. 25660–25676. PMLR.
4. Vetter, J., Gloeckler, M., Gedon, D., and Macke, J. H. (2025). Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models. arXiv:2504.17660.
5. Müller, S., Reuter, A., Hollmann, N., Rügamer, D., and Hutter, F. (2025). Position: The Future of Bayesian Prediction Is Prior-Fitted. arXiv:2505.23947.
6. Zhang, Q., Tan, Y. S., Tian, Q., and Li, P. (2025). TabPFN: One Model to Rule Them All? arXiv:2505.20003 (major revision at JASA).
7. Zhao, T., Wang, G., Tan, Y. S., and Zhang, Q. (2025). TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering. arXiv preprint.
8. Dwivedi, R., Ho, N., Khamaru, K., Wainwright, M. J., Jordan, M. I., and Yu, B. (2019). Challenges with EM in application to weakly identifiable mixture models. arXiv:1902.00194.
9. Harbrecht, H., Jakeman, J. D., and Zaspel, P. (2021). Cholesky-based experimental design for Gaussian process and kernel-based emulation and calibration. Communications in Computational Physics, 29.