We used BLAST [basic local alignment search tool] algorithms to compare unigene sets from Medicago truncatula, Lotus japonicus [L. corniculatus var. japonicus] and soyabean (Glycine max and Glycine soja) to nonlegume (Hordeum vulgare, Chlamydomonas reinhardtii, Gossypium sp., Vitis vinifera, Mesembryanthemum crystallinum, Lactuca sativa, Pinus sp., Solanum tuberosum, Secale cereale, Sorghum bicolor, Helianthus annuus and Triticum aestivum) unigene sets, to GenBank's nonredundant and expressed sequence tag (EST) databases, and to the genomic sequences of rice (Oryza sativa) and Arabidopsis thaliana. As a working definition, putatively legume-specific genes were those with no sequence homology, at a specified significance threshold, to publicly available sequences of nonlegumes. Using this approach, 2525 legume-specific EST contigs were identified, of which fewer than 3% had clear homology to previously characterized legume genes. As a first step toward predicting function, related sequences were clustered to build motifs that could be searched against protein databases. Three families of interest were characterized in more depth: F-box related proteins, Pro-rich proteins and cysteine cluster proteins (CCPs). Of particular interest were the >300 CCPs, primarily from nodules or seeds, with predicted similarity to defensins. Motif searching also identified several previously unknown CCP-like open reading frames in A. thaliana. Evolutionary analyses of the genomic sequences of several CCPs in M. truncatula suggest that this family has evolved by local duplications and divergent selection.
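
The working definition above amounts to a filter: a contig is kept as putatively legume-specific only if none of its BLAST hits against nonlegume sequences passes the significance threshold. The following is a minimal sketch of that filter, not the authors' pipeline. It assumes BLAST results in the standard 12-column tabular format (query ID in column 1, E-value in column 11); the function names and the default cutoff of 1e-4 are hypothetical, since the abstract does not state the threshold that was used.

```python
from typing import Iterable, List, Set


def contigs_with_hits(blast_lines: Iterable[str], max_evalue: float) -> Set[str]:
    """Return query IDs with at least one hit whose E-value is <= max_evalue.

    Assumes standard 12-column tabular BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    matched: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            matched.add(query_id)
    return matched


def putatively_legume_specific(
    contig_ids: Iterable[str],
    blast_lines: Iterable[str],
    max_evalue: float = 1e-4,  # hypothetical cutoff, not taken from the source
) -> List[str]:
    """Return contigs with no significant hit to any nonlegume sequence."""
    matched = contigs_with_hits(blast_lines, max_evalue)
    return [contig for contig in contig_ids if contig not in matched]


if __name__ == "__main__":
    # Illustrative, made-up records: contig_A has a significant nonlegume hit,
    # contig_B has only a weak hit, contig_C has no hit at all.
    example_hits = [
        "contig_A\tnonlegume_1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t180",
        "contig_B\tnonlegume_2\t40.0\t50\t30\t0\t1\t50\t1\t50\t2.5\t20",
    ]
    print(putatively_legume_specific(["contig_A", "contig_B", "contig_C"], example_hits))
```

In practice the hit lines would be pooled from every nonlegume comparison (unigene sets, GenBank databases and the rice and A. thaliana genomes), so that a contig is retained only if it fails to match in all of them.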