The problem of infertility around the world is becoming more severe day by day, and about half of infertility cases are male-related. Computer Aided Sperm Analysis (CASA) systems, supported by artificial intelligence, have recently been developed to measure sperm quality. In particular, convolutional neural networks (CNNs) are among the most widely used methods for classifying sperm cells. Another approach that has become popular in recent years and has started to be applied to image classification problems is the vision transformer (ViT). In the proposed study, a comparative performance analysis of CNN and ViT architectures was performed on the open-source HuSHeM, SMIDS and SCIAN human sperm cell image datasets. The results obtained from 8 models, comprising 5 traditional CNN architectures and 3 variants of the ViT model, were compared with each other. The models were trained with data augmentation and evaluated using 5-fold cross-validation. The results were validated with a t-test and a performance analysis was carried out. Each model was compared against the other 7 models on the 3 datasets, for a total of 21 comparisons. In these comparisons, the ViT-L16 model obtained 12 wins, 9 draws and 0 losses. In terms of win rate, it achieved about 38% more wins than its closest competitor.
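The win/draw/loss tallying described above can be sketched as a paired t-test over per-fold scores: for each pair of models, the per-fold accuracies from the 5-fold cross-validation are compared, and a statistically significant difference counts as a win for the better model, otherwise the comparison is a draw. The sketch below illustrates this idea with hypothetical fold accuracies (the numbers are not taken from the paper's actual results); the critical value 2.776 is the two-tailed t threshold at alpha = 0.05 with 4 degrees of freedom (5 folds).

```python
import math

def paired_t_statistic(a, b):
    """Paired t-statistic over the per-fold scores of two models."""
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = sum(diffs) / n
    # Sample variance of the fold-wise differences (n - 1 denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

def compare(a, b, t_crit=2.776):
    """Return 'win' if model a is significantly better than b,
    'loss' if significantly worse, 'draw' otherwise."""
    t = paired_t_statistic(a, b)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "draw"

# Hypothetical per-fold accuracies for two models (illustrative only).
model_a = [0.96, 0.94, 0.97, 0.95, 0.96]
model_b = [0.90, 0.89, 0.91, 0.88, 0.90]
print(compare(model_a, model_b))  # model a wins this comparison
```

Repeating this comparison for every pair of models on every dataset yields the 21 pairwise outcomes per model reported in the study.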