ORIGINAL ARTICLE
Comparative analysis of the performance of selected machine learning algorithms depending on the size of the training sample
1. Faculty of Geodesy and Cartography, Warsaw University of Technology, pl. Politechniki 1, 00-661, Warsaw, Poland
2. Orbitile Ltd., Potułkały 6B/4, 02-791, Warsaw, Poland
 
A - Research concept and design; B - Collection and/or assembly of data; C - Data analysis and interpretation; D - Writing the article; E - Critical revision of the article; F - Final approval of article
 
 
Submission date: 2024-03-12
Final revision date: 2024-07-22
Acceptance date: 2024-07-29
Publication date: 2024-09-23
Corresponding author
Przemysław Kupidura, Faculty of Geodesy and Cartography, Warsaw University of Technology, pl. Politechniki 1, 00-661, Warsaw, Poland
 
 
Reports on Geodesy and Geoinformatics 2024;118:53-69
 
ABSTRACT
The article presents an analysis of the effectiveness of selected machine learning methods, Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM), in the classification of land use and land cover in satellite images. Several variants of each algorithm were tested, each adopting different values of the parameters typical for it. Each variant was run 20 times, using training samples of different sizes, from 100 to 200,000 pixels. The tests were conducted independently on three Sentinel-2 satellite images, identifying five basic land cover classes: built-up areas, soil, forest, water, and low vegetation. Standard metrics were used for the accuracy assessment: Cohen's kappa coefficient and overall accuracy (for whole images), as well as the F1 score, precision, and recall (for individual classes). The results obtained for the different images were consistent and clearly indicated an increase in classification accuracy with increasing training sample size. They also showed that, among the tested algorithms, XGB is the most sensitive to the size of the training sample, while SVM is the least sensitive, achieving relatively good results even with the smallest training samples. At the same time, while the differences between the tested variants of RF and XGB were slight, the effectiveness of SVM depended strongly on the gamma parameter: with values that were too high, the model tended to overfit, which prevented satisfactory results.
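The experimental design described above, comparing classifiers across growing training-sample sizes and scoring them with overall accuracy and Cohen's kappa, can be sketched roughly as follows. This is a hypothetical illustration on synthetic data, not the study's code: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the sample sizes, parameter values, and variant names used here are assumptions for demonstration only.

```python
# Hypothetical sketch (not the study's code): accuracy vs. training-sample
# size for RF, a gradient-boosting stand-in for XGB, and RBF SVMs with two
# gamma settings (the high-gamma variant is the overfit-prone case).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.svm import SVC

# Synthetic 5-class data as a placeholder for Sentinel-2 pixel spectra.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           n_classes=5, random_state=0)
X_pool, y_pool = X[:2000], y[:2000]   # pool to draw training samples from
X_test, y_test = X[2000:], y[2000:]

classifiers = {
    "RF":       RandomForestClassifier(n_estimators=100, random_state=0),
    "GB":       GradientBoostingClassifier(random_state=0),  # XGB stand-in
    "SVM":      SVC(kernel="rbf", gamma="scale"),            # moderate gamma
    "SVM-hi-g": SVC(kernel="rbf", gamma=10.0),               # overfit-prone
}

results = {}
for size in (100, 500, 2000):          # growing training samples
    for name, clf in classifiers.items():
        clf.fit(X_pool[:size], y_pool[:size])
        pred = clf.predict(X_test)
        results[(name, size)] = (accuracy_score(y_test, pred),
                                 cohen_kappa_score(y_test, pred))

for (name, size), (oa, kappa) in sorted(results.items()):
    print(f"{name:8s} n={size:5d}  OA={oa:.3f}  kappa={kappa:.3f}")
```

In this register, the high-gamma SVM variant typically scores well on its own training pixels but generalizes poorly to the test set, mirroring the overfitting tendency the abstract reports for too-high gamma values.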
 
eISSN:2391-8152
ISSN:2391-8365