Nine soil organic matter models were evaluated using twelve datasets from seven long-term experiments. The datasets represented three land uses (grassland, arable cropping and woodland) and a range of climatic conditions within the temperate region. Different treatments (inorganic fertilizer, organic manures and different rotations) at the same site allowed the effects of differing land management to be explored. Model simulations were evaluated against the measured data, and the performance of the models was compared both qualitatively and quantitatively. Not all models were able to simulate all datasets; only four attempted all twelve. No single model performed better than all others across all datasets. The performance of each model in simulating each dataset is discussed. A comparison of overall performance across all datasets reveals that the model errors of one group of models (RothC, CANDY, DNDC, CENTURY, DAISY and NCSOIL) did not differ significantly from each other. The models in a second group (SOMM, ITE and Verberne) also did not differ significantly from each other, but showed significantly larger model errors than the models in the first group. Possible reasons for differences in model performance are discussed in detail.