3rd International Symposium of Scientific Research and Innovative Studies, Balıkesir, Türkiye, 15 March 2023
Socioeconomic status is an essential concept for understanding a citizen's position in society,
and a society can be divided into categories on that basis. Several factors determine
socioeconomic status: education level, wealth, income level, occupation, and access to
good nutrition are among the most important. The primary purpose of this study is
the socioeconomic clustering of the districts in the province of Istanbul using machine learning
methods. In this regard, the aim is to investigate whether the districts have socioeconomic
similarities using existing data. For this purpose, district-level data on population, average
household size, number of hospitals, water consumption, domestic waste, number of public bread
buffets, literacy status (unknown, literate, illiterate), education level (preschool, primary
school, secondary school), housing sales amount, number of rail stations, and number of vehicles,
which are publicly available on the İstanbul Metropolitan Municipality website, are used for analysis.
To analyze the variables and examine the districts from a socioeconomic point of
view, the k-means method, an unsupervised learning technique, is used. In unsupervised
learning, the data set contains no response variable 𝑦. Methods of this type
are often used to explore data and draw inferences about it. Clustering, one of the main tasks
in unsupervised learning, groups observations with similar characteristics. The k-means method
is one such clustering method. It partitions the data set into k clusters, where k is the
parameter that gives the method its name. According to the results, it is observed that the
population variable is dominant in the data set, and the districts are clustered largely according
to this variable. When the population variable is removed, similar clusters
are still obtained. One reason for this is that the population variable is associated with
other variables in the data set. As a result, the socioeconomic distinction among districts
reported in existing studies could not be obtained with the current data set when a small number
of clusters is used. As the number of clusters increases, on the other hand, the districts within
each cluster are observed to be similar to one another in terms of socioeconomic structure.
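
The clustering procedure described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names (standardize, kmeans), the choice of z-score scaling, and the six rows of toy values are all assumptions made for illustration, and none of the numbers come from the İstanbul Metropolitan Municipality data. The sketch runs k-means once with a population-like column and once without it, to show how the same grouping can appear in both cases when population is associated with the other variables.

```python
import random
import statistics


def standardize(rows):
    """Z-score each column so large-scale variables do not dominate distances."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [statistics.fmean(c) for c in zip(*members)]
    return labels, centroids


# Hypothetical toy rows: (population, average household size, number of hospitals).
toy_rows = [
    [900_000, 3.9, 12],
    [850_000, 3.8, 10],
    [880_000, 4.0, 11],
    [60_000, 2.4, 2],
    [75_000, 2.5, 3],
    [50_000, 2.3, 1],
]

labels_all, _ = kmeans(standardize(toy_rows), k=2)
labels_no_pop, _ = kmeans(standardize([r[1:] for r in toy_rows]), k=2)
print("with population:   ", labels_all)
print("without population:", labels_no_pop)
```

In this toy example the population column moves together with the other two columns, so dropping it leaves the grouping unchanged. Cluster numbering is arbitrary, so only the grouping of rows should be compared between runs, not the label values themselves.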