Classification performance on hyperspectral images is strongly affected by their many spectral bands, high dimensionality, and scarcity of labeled training samples. This challenge can be mitigated by exploiting rich spatial information together with an effective classifier. The classifiers in this study are BERT-based (Bidirectional Encoder Representations from Transformers) models, which have recently been applied with success in natural language processing. The BERT model and its performance-improved variant, ALBERT (A Lite BERT), are used as the transformer-based models. Owing to their architecture, these models can also accept spatial information through 'segment embeddings'. In the literature, segmentation algorithms are commonly used to extract such spatial information, and superpixel methods have shown superior segmentation results because they operate at the superpixel level rather than the conventional pixel level. In this study, HyperSLIC, a version of the SLIC superpixel method modified for hyperspectral images, provides the input to the BERT-based models. In addition, spectrally similar HyperSLIC superpixels are merged with the DBSCAN algorithm to enlarge spatially homogeneous regions; this combination is called HyperSLIC-DBSCAN. The effect of segment embedding information on classification accuracy in BERT-based models is studied experimentally. Experimental results show that, when spatial information is supplied through segment embeddings, BERT-based models outperform both conventional classifiers and deep-learning-based 1D/2D convolutional neural network classifiers.