Public Data, Reproducibility, and Benchmark-Building as Foundations for Generalizable Anastomotic Leak Prediction in Surgical Machine Learning
Abstract
Benchmark culture has shaped progress in several computational disciplines because it converts isolated model claims into cumulative evidence. Clinical machine learning has adopted this logic unevenly, and anastomotic leak research illustrates why. Anastomotic leak is clinically consequential, relatively infrequent, operationally heterogeneous, and often documented through imperfect combinations of diagnosis codes, procedures, laboratory trajectories, and clinician judgment. These properties make the problem suitable for machine learning while simultaneously making evaluation unusually fragile. A model can appear promising under one cohort definition, one feature extraction protocol, or one institutional coding practice, then degrade when any of those conditions change. This paper develops a technical framework for benchmark-building in anastomotic leak research centered on three propositions: public data are necessary for cumulative method comparison, reproducibility must be treated as an evaluated property rather than a rhetorical aspiration, and benchmark design should foreground transportability rather than leaderboard maximization. The discussion formalizes benchmark tasks across preoperative, early postoperative, and longitudinal surveillance horizons; analyzes label uncertainty and missingness as structural properties of the data-generating process; and proposes an evaluation architecture that combines discrimination, calibration, shift robustness, fairness, and implementation-sensitive reporting. A central argument is that benchmark quality depends less on the novelty of any one algorithm than on precise cohort construction, deterministic pipelines, patient-level temporal splitting, and sustained governance of evolving public corpora. The result is a blueprint for anastomotic leak benchmarking that can support transparent model development, rigorous cross-site comparison, and more credible claims about clinical readiness.
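To make the phrase "patient-level temporal splitting" concrete, the following is a minimal sketch of one way such a split could be implemented. It is not the paper's pipeline. The record schema (a patient identifier paired with a surgery date), the function name `patient_level_temporal_split`, the cutoff date, and the choice to discard post-cutoff encounters of training patients are all assumptions made for illustration.

```python
from datetime import date


def patient_level_temporal_split(records, cutoff):
    """Split (patient_id, surgery_date) records into train and test sets.

    Hypothetical schema and rule, for illustration only:
    - A patient whose first surgery falls on or after `cutoff` goes to test.
    - Every other patient goes to train, keeping only pre-cutoff encounters.
    - Post-cutoff encounters of training patients are dropped, so no patient
      appears on both sides and all training data precede all test data.
    """
    # Earliest surgery date observed for each patient.
    first_seen = {}
    for patient_id, surgery_date in records:
        if patient_id not in first_seen or surgery_date < first_seen[patient_id]:
            first_seen[patient_id] = surgery_date

    train, test = [], []
    for patient_id, surgery_date in records:
        if first_seen[patient_id] >= cutoff:
            test.append((patient_id, surgery_date))
        elif surgery_date < cutoff:
            train.append((patient_id, surgery_date))
        # Otherwise: a later encounter of a training patient, discarded.
    return train, test


# Toy records with made-up identifiers and dates.
records = [
    ("A", date(2019, 3, 1)),
    ("A", date(2021, 6, 1)),   # later encounter of a training patient
    ("B", date(2020, 1, 15)),
    ("C", date(2021, 2, 10)),
]
train_records, test_records = patient_level_temporal_split(records, date(2021, 1, 1))
```

Under these assumptions, patients A and B contribute only their pre-cutoff encounters to training, patient C forms the test set, and patient A's 2021 encounter is excluded. A real benchmark would also need to specify how the cutoff is chosen and whether discarded encounters are reported.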