Please use this identifier to cite or link to this item: http://www.alice.cnptia.embrapa.br/alice/handle/doc/1186547
Title: Comparison of machine learning methods for marker identification in GWAS.
Authors: COSTA, W. G. da
PEREIRA, H. D.
SILVA, G. N.
BORÉM, A.
CAIXETA, E. T.
OLIVEIRA, A. C. B. de
CRUZ, C. D.
NASCIMENTO, M.
Affiliation: WEVERTON GOMES DA COSTA, UNIVERSIDADE FEDERAL DE VIÇOSA; HÉLCIO DUARTE PEREIRA, UNIVERSIDADE ESTADUAL DE CAMPINAS; GABI NUNES SILVA, UNIVERSIDADE FEDERAL DE RONDÔNIA; ALUIZIO BORÉM, UNIVERSIDADE FEDERAL DE VIÇOSA; EVELINE TEIXEIRA CAIXETA MOURA, CNPCA; ANTONIO CARLOS BAIAO DE OLIVEIRA, CNPCA; COSME DAMIÃO CRUZ, UNIVERSIDADE FEDERAL DE VIÇOSA; MOYSES NASCIMENTO, UNIVERSIDADE FEDERAL DE VIÇOSA.
Date Issued: 2026
Citation: International Journal of Plant Biology, v. 17, n. 1, p. , 2026.
Description: Genome-wide association studies (GWAS) are essential for identifying genomic regions associated with agronomic traits, but Linear Mixed Model (LMM)-based GWAS face challenges in capturing complex gene interactions. This study explores the potential of machine learning (ML) methodologies to enhance marker identification and association modeling in plant breeding. Unlike LMM-based GWAS, ML approaches do not require prior assumptions about marker–phenotype relationships, enabling the detection of epistatic effects and non-linear interactions. The study compared five ML approaches, namely Decision Tree (DT), Bagging (BA), Random Forest (RF), Boosting (BO), and Multivariate Adaptive Regression Splines (MARS), with LMM-based GWAS. A simulated F2 population comprising 1000 individuals was analyzed using 4010 SNP markers and ten traits modeled with epistatic interactions. The simulation included quantitative trait loci (QTL) counts varying between 8 and 240, with heritability levels set at 0.5 and 0.8. These settings mimic traits of a diverse range of agronomic species, including major cereal crops (e.g., maize and wheat) and leguminous crops (e.g., soybean); examples include yield, with moderate heritability and a high number of QTLs, and plant height, with high heritability and an average number of QTLs. To validate the simulation findings, the methodologies were further applied to a real Coffea arabica population (n = 195) to identify genomic regions associated with yield, a complex polygenic trait. Results demonstrated a fundamental trade-off between sensitivity and precision. Specifically, for the most complex trait evaluated (240 QTLs under epistatic control), ensemble methods (Bagging and Random Forest) maintained a Detection Power (DP) exceeding 90%, significantly outperforming state-of-the-art GWAS methods (FarmCPU), which dropped to approximately 30%, and traditional Linear Mixed Models, which failed to detect signals (0%). However, this sensitivity came at the cost of lower precision for the ensembles. In contrast, MARS (Degree 1) and BLINK achieved exceptional Specificity (>99%) and Precision (>90%), effectively minimizing false positives. The real data analysis corroborated these trends: while standard GWAS models failed to detect significant associations, the ML framework successfully prioritized consensus genomic regions harboring functional candidates, such as SWEET sugar transporters and NAC transcription factors. In conclusion, ML ensembles are recommended for broad exploratory screening to recover missing heritability, while MARS and BLINK are the most effective methods for precise candidate gene validation.
Thesagro: Coffea Arábica
NAL Thesaurus: Genetic markers
Plant breeding
DOI: https://doi.org/10.3390/ijpb17010006
Type of Material: Journal article
Access: openAccess
Appears in Collections: Indexed journal article (SAPC)
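
Note on the metrics named in the Description: the abstract reports Detection Power, Specificity, and Precision but does not define them. The sketch below shows one common way such quantities are computed from a set of flagged markers and a set of known causal markers in a simulation. The confusion-matrix definitions, the function name `marker_metrics`, and the toy marker indices are assumptions made for illustration; they are not taken from the article, which may use different definitions (for example, counting markers within a window around each QTL).

```python
def marker_metrics(detected, causal, n_markers):
    """Confusion-matrix metrics for marker detection (assumed definitions)."""
    detected, causal = set(detected), set(causal)
    tp = len(detected & causal)    # causal markers that were flagged
    fp = len(detected - causal)    # non-causal markers that were flagged
    fn = len(causal - detected)    # causal markers that were missed
    tn = n_markers - tp - fp - fn  # non-causal markers left unflagged
    nan = float("nan")
    return {
        "detection_power": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


# Toy input: 4010 markers and 8 causal loci echo the smallest simulated
# scenario in the Description; the detected list is invented for illustration.
causal = range(8)
detected = [0, 1, 2, 3, 4, 5, 20, 21]
metrics = marker_metrics(detected, causal, n_markers=4010)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed definitions, a method that flags many markers can reach high detection power while its precision falls, which is the sensitivity versus precision trade-off the Description refers to.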

Files in This Item:
File: Comparison-of-machine.pdf
Size: 4.09 MB
Format: Adobe PDF