Posts Tagged ‘Genomic selection’

Genomic-enabled prediction with classification algorithms

Posted by Carelia Juarez in Journal Articles

Published in Heredity, 2014

Ornella, L.; Perez, P.; Tapia, E.; González-Camacho, J.M.; Burgueño, J.; Zhang, X.; Sukhwinder Singh; Vicente, F.S.; Bonnett, D.; Dreisigacker, S.; Singh, R.P.; Long, N.; Crossa, J.

Pearson’s correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait–environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen’s kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RKHS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian kernels (rbf). The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold.
For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets) and the best RE in the same 13 data sets, with values ranging from 0.393 to 0.948 (statistically significant in 12 data sets). RR produced the best mean for both κ and RE in one data set (0.148 and 0.381, respectively). Regarding the wheat data sets, SVC-lin presented the best κ in 12 of the 16 data sets, with outcomes ranging from 0.280 to 0.580 (statistically significant in 4 data sets) and the best RE in 9 data sets ranging from 0.484 to 0.821 (statistically significant in 5 data sets). SVC-rbf (0.235), RR (0.265) and RKHS (0.422) gave the best κ in one data set each, while RKHS and BL tied for the last one (0.234). Finally, BL presented the best RE in two data sets (0.738 and 0.750), RFR (0.636) and SVC-rbf (0.617) in one and RKHS in the remaining three (0.502, 0.458 and 0.586). The difference between the performance of SVC-lin and that of the rest of the models was not so pronounced at higher percentiles of the distribution. The behaviour of regression and classification algorithms varied markedly when selection was done at different thresholds, that is, κ and RE for each algorithm depended strongly on the selection percentile. Based on the results, we propose classification methods as a promising alternative for GS in plant breeding.
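
As a rough illustration of the two selection metrics named in this abstract, the sketch below computes Cohen's κ and a relative efficiency for the best α fraction of a testing set. The κ formula is the standard one. The RE formula is an assumption: the abstract only describes RE as the expected genetic gain when selecting on GS alone, so here it is read as the gain achieved by the predicted selection divided by the gain of a perfect selection. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def top_alpha_mask(values, alpha):
    """Boolean mask marking the best `alpha` fraction (largest values)."""
    values = np.asarray(values, dtype=float)
    n_select = max(1, int(round(alpha * len(values))))
    order = np.argsort(-values, kind="stable")
    mask = np.zeros(len(values), dtype=bool)
    mask[order[:n_select]] = True
    return mask

def cohen_kappa(a, b):
    """Cohen's kappa for two boolean label vectors of equal length."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    p_observed = np.mean(a == b)
    # Agreement expected by chance, from the marginal frequencies.
    p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    if p_chance == 1.0:
        return float("nan")  # undefined when both vectors are constant
    return float((p_observed - p_chance) / (1 - p_chance))

def relative_efficiency(observed, predicted, alpha):
    """ASSUMED definition: gain of the predicted selection relative to
    the gain of selecting the truly best individuals."""
    observed = np.asarray(observed, dtype=float)
    true_top = top_alpha_mask(observed, alpha)
    pred_top = top_alpha_mask(predicted, alpha)
    overall_mean = observed.mean()
    achieved = observed[pred_top].mean() - overall_mean
    attainable = observed[true_top].mean() - overall_mean
    return float(achieved / attainable)

if __name__ == "__main__":
    # Toy testing set: observed phenotypes and model predictions.
    observed = np.arange(1, 11, dtype=float)
    predicted = np.array([1, 2, 3, 4, 5, 6, 7, 10, 8, 9], dtype=float)
    alpha = 0.2
    kappa = cohen_kappa(top_alpha_mask(observed, alpha),
                        top_alpha_mask(predicted, alpha))
    print("kappa:", kappa)
    print("RE:", relative_efficiency(observed, predicted, alpha))
```

For a classifier, the predicted mask would come directly from the predicted class labels (or from ranking class probabilities) rather than from ranking regression outputs.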

The use of unbalanced historical data for genomic selection in an international wheat breeding program

Posted by Carelia Juarez in Journal Articles

Published in Field Crops Research 154: 12-22, 2013

Dawson, J.C.; Endelman, J.B.; Heslot, N.; Crossa, J.; Poland, J.; Dreisigacker, S.; Manes, Y.; Sorrells, M.E.; Jannink, J.L.

Genomic selection (GS) offers breeders the possibility of using historic data and unbalanced breeding trials to form training populations for predicting the performance of new lines. However, when using datasets that are unbalanced over time and space, there is increasing exposure to different genotype–environment combinations and interactions that may make predictions less accurate. Global cross-validated genomic prediction accuracies may be high when using large historic datasets but accuracies for individual years using a forward-prediction approach, or accuracies for individual locations, are often much lower. The objective of this study was to evaluate the overall accuracy of genomic predictions for untested genotypes using an unbalanced dataset to train a genomic selection model, and to explore ways of combining genomic selection and genotype-by-environment (G×E) interaction models to better target untested lines to different locations. Using the International Center for Maize and Wheat Improvement’s (CIMMYT) Semi-Arid Wheat Yield Trials (SAWYT) we assessed the accuracy of genomic predictions and the potential to subset these nurseries using the concept of mega-environments (ME) adapted to a genomic selection context. We found that there was no difference in accuracy between models accounting for G×E interactions and global models. Data-driven methods of clustering locations based on similarities in genomic predictions also failed to improve accuracies within clusters. Using a simulation based on the empirical SAWYT data, we found that if there were different true genotypic values between clusters, there was an advantage to modeling G×E in prediction models. In the SAWYT dataset it appears that there is not a consistent pattern of genotype-by-environment interaction among the ME, and this dataset is not balanced enough to partition into new clusters that have predictive power.

Genomic prediction in maize breeding populations with genotyping-by-sequencing

Posted by Carelia Juarez in Journal Articles

Published in G3-Genes Genomes Genetics 3 (11): 1903-1926, 2013

Crossa, J.; Beyene, Y.; Fentaye Kassa Semagn; Perez, P.; Hickey, J.M.; Chen Charles; Campos, G. de los; Burgueño, J.; Windhausen, V.S.; Buckler, E.; Jannink, J.L.; Lopez Cruz, M.A.; Babu, R.

Genotyping-by-sequencing (GBS) technologies have proven capacity for delivering large numbers of marker genotypes with potentially less ascertainment bias than standard single nucleotide polymorphism (SNP) arrays. Therefore, GBS has become an attractive alternative technology for genomic selection. However, the use of GBS data poses important challenges, and the accuracy of genomic prediction using GBS is currently undergoing investigation in several crops, including maize, wheat, and cassava. The main objective of this study was to evaluate various methods for incorporating GBS information and compare them with pedigree models for predicting genetic values of lines from two maize populations evaluated for different traits measured in different environments (experiments 1 and 2). Given that GBS data come with a large percentage of uncalled genotypes, we evaluated methods using nonimputed, imputed, and GBS-inferred haplotypes of different lengths (short or long). GBS and pedigree data were incorporated into statistical models using either the genomic best linear unbiased predictors (GBLUP) or the reproducing kernel Hilbert spaces (RKHS) regressions, and prediction accuracy was quantified using cross-validation methods. 
The following results were found: relative to pedigree or marker-only models, there were consistent gains in prediction accuracy by combining pedigree and GBS data; there was increased predictive ability when using imputed or nonimputed GBS data over inferred haplotype in experiment 1, or nonimputed GBS and information-based imputed short and long haplotypes, as compared to the other methods in experiment 2; the level of prediction accuracy achieved using GBS data in experiment 2 is comparable to those reported by previous authors who analyzed this data set using SNP arrays; and GBLUP and RKHS models with pedigree with nonimputed and imputed GBS data provided the best prediction correlations for the three traits in experiment 1, whereas for experiment 2 RKHS provided slightly better prediction than GBLUP for drought-stressed environments, and both models provided similar predictions in well-watered environments.
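
To make the GBLUP versus RKHS contrast concrete, here is a minimal sketch of the two kinds of relationship kernels and of a kernel-based prediction evaluated by cross-validation. Several things are assumptions for illustration only: markers are coded numerically with no missing calls (real GBS data would first need imputation or another treatment of uncalled genotypes), the bandwidth `h` and the shrinkage value `lam` are fixed by hand rather than estimated, and all function names are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def gblup_kernel(X):
    """Linear (genomic relationship style) kernel from centred markers."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T / X.shape[1]

def gaussian_kernel(X, h=1.0):
    """Gaussian kernel on squared Euclidean marker distances, scaled by
    the number of markers; h is an assumed bandwidth."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-h * d2 / X.shape[1])

def kernel_predict(K, y, train, test, lam=1.0):
    """Predict test records from training records with a fixed ratio
    lam of residual to genetic variance (assumed, not estimated)."""
    y = np.asarray(y, dtype=float)
    mu = y[train].mean()
    K_tt = K[np.ix_(train, train)]
    coef = np.linalg.solve(K_tt + lam * np.eye(len(train)), y[train] - mu)
    return mu + K[np.ix_(test, train)] @ coef

def cv_correlation(K, y, n_folds=5, seed=0):
    """Mean correlation between predictions and observations over folds."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred = kernel_predict(K, y, train, test)
        cors.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(cors))

if __name__ == "__main__":
    # Simulated stand-in data: 60 lines, 200 markers coded 0/1/2.
    rng = np.random.default_rng(1)
    X = rng.integers(0, 3, size=(60, 200)).astype(float)
    y = X @ rng.normal(size=200) + rng.normal(scale=5.0, size=60)
    print("linear kernel:", cv_correlation(gblup_kernel(X), y))
    print("Gaussian kernel:", cv_correlation(gaussian_kernel(X), y))
```

Pedigree information could enter the same way, as an additional relationship matrix, which is one way to read the combined pedigree plus GBS models mentioned above.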

A reaction norm model for genomic selection using high-dimensional genomic and environmental data

Posted by Carelia Juarez in Journal Articles

Published in Theoretical and Applied Genetics, 2013

Jarquin, D.; Crossa, J.; Lacaze, X.; Cheyron, P.D.; Daucourt, J.; Lorgeou, J.; Piraux, F.; Guerreiro, L.; Perez, P.; Calus, M.; Burgueño, J.; Campos, G. de los.

In most agricultural crops the effects of genes on traits are modulated by environmental conditions, leading to genetic by environmental interaction (G × E). Modern genotyping technologies allow characterizing genomes in great detail and modern information systems can generate large volumes of environmental data. In principle, G × E can be accounted for using interactions between markers and environmental covariates (ECs). However, when genotypic and environmental information is high dimensional, modeling all possible interactions explicitly becomes infeasible. In this article we show how to model interactions between high-dimensional sets of markers and ECs using covariance functions. The model presented here consists of a (random) reaction norm where the genetic and environmental gradients are described as linear functions of markers and of ECs, respectively. We assessed the proposed method using data from Arvalis, consisting of 139 wheat lines genotyped with 2,395 SNPs and evaluated for grain yield over 8 years and various locations within northern France. A total of 68 ECs, defined based on five phases of the phenology of the crop, were used in the analysis. Interaction terms accounted for a sizable proportion (16%) of the within-environment yield variance, and the prediction accuracy of models including interaction terms was substantially higher (17–34%) than that of models based on main effects only. Breeding for target environmental conditions has become a central priority of most breeding programs. Methods, like the one presented here, that can capitalize upon the wealth of genomic and environmental information available, will become increasingly important.
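
The point about avoiding explicit interactions can be illustrated numerically. The sketch below assumes that each phenotypic record has a row of marker covariates and a row of ECs, and shows that the covariance implied by all marker × EC products equals the element-wise (Hadamard) product of the marker-derived and EC-derived covariance matrices. This is one reading of how a covariance function can stand in for the full set of interactions; the function names are hypothetical, and centring and scaling of covariates are omitted for brevity.

```python
import numpy as np

def scaled_cross_product(M):
    """Return M M' divided by the number of columns of M."""
    M = np.asarray(M, dtype=float)
    return M @ M.T / M.shape[1]

def interaction_kernel(X_rec, W_rec):
    """Covariance among records for marker x EC interactions, built
    without forming any interaction column: Hadamard product of the
    marker kernel and the EC kernel."""
    return scaled_cross_product(X_rec) * scaled_cross_product(W_rec)

def explicit_interaction_kernel(X_rec, W_rec):
    """Brute-force check: form every marker x EC product per record.
    Feasible only for tiny p and q."""
    X_rec = np.asarray(X_rec, dtype=float)
    W_rec = np.asarray(W_rec, dtype=float)
    Z = np.einsum("ip,iq->ipq", X_rec, W_rec).reshape(X_rec.shape[0], -1)
    return scaled_cross_product(Z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_records, p_markers, q_ecs = 6, 5, 3
    X_rec = rng.normal(size=(n_records, p_markers))  # markers per record
    W_rec = rng.normal(size=(n_records, q_ecs))      # ECs per record
    same = np.allclose(interaction_kernel(X_rec, W_rec),
                       explicit_interaction_kernel(X_rec, W_rec))
    print("Hadamard product matches explicit interactions:", same)
```

With 2,395 SNPs and 68 ECs the explicit route would need 162,860 interaction columns per record, whereas the Hadamard route only needs two matrices whose dimension is the number of records.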

Genetic prediction of complex traits: integrating infinitesimal and marked genetic effects

Posted by Carelia Juarez in Journal Articles

Published in Genetica 141 (4-6): 239-246, 2013

Clément Carré, Fabrice Gamboa, David Cros, John Michael Hickey, Gregor Gorjanc and Eduardo Manfredi

Genetic prediction for complex traits is usually based on models including individual (infinitesimal) or marker effects. Here, we concentrate on models including both the individual and the marker effects. In particular, we develop a “Mendelian segregation” model combining infinitesimal effects for base individuals and realized Mendelian sampling in descendants described by the available DNA data. The model is illustrated with an example and the analyses of a public simulated data file. Further, the potential contribution of such models is assessed by simulation. Accuracy, measured as the correlation between true (simulated) and predicted genetic values, was similar for all models compared under different genetic backgrounds. As expected, the segregation model is worthwhile when markers capture a low fraction of total genetic variance.

Genomic prediction in CIMMYT maize and wheat breeding programs

Posted by Carelia Juarez in Journal Articles

Published in Heredity, 2013

J. Crossa, P. Pérez, J. Hickey, J. Burgueño, L. Ornella, J. Cerón-Rojas, X. Zhang, S. Dreisigacker, R. Babu, Y. Li, D. Bonnett and K. Mathews

Genomic selection (GS) has been implemented in animal and plant species, and is regarded as a useful tool for accelerating genetic gains. Varying levels of genomic prediction accuracy have been obtained in plants, depending on the prediction problem assessed and on several other factors, such as trait heritability, the relationship between the individuals to be predicted and those used to train the models for prediction, number of markers, sample size and genotype × environment interaction (GE). The main objective of this article is to describe the results of genomic prediction in International Maize and Wheat Improvement Center’s (CIMMYT’s) maize and wheat breeding programs, from the initial assessment of the predictive ability of different models using pedigree and marker information to the present, when methods for implementing GS in practical global maize and wheat breeding programs are being studied and investigated. Results show that pedigree (population structure) accounts for a sizeable proportion of the prediction accuracy when a global population is the prediction problem to be assessed. However, when the prediction uses unrelated populations to train the prediction equations, prediction accuracy becomes negligible. When genomic prediction includes modeling GE, an increase in prediction accuracy can be achieved by borrowing information from correlated environments. Several questions on how to incorporate GS into CIMMYT’s maize and wheat programs remain unanswered and subject to further investigation, for example, prediction within and between related bi-parental crosses. Further research on the quantification of breeding value components for GS in plant breeding populations is required.

High-throughput phenotyping and genomic selection: The frontiers of crop breeding converge

Posted by Carelia Juarez in Journal Articles

Published in Journal of Integrative Plant Biology 54 (5): 312-320, 2012

Llorenç Cabrera-Bosquet, José Crossa, Jarislav von Zitzewitz, María Dolors Serret, José Luis Araus

Genomic selection (GS) and high-throughput phenotyping have recently been captivating the interest of the crop breeding community from both the public and private sectors world-wide. Both approaches promise to revolutionize the prediction of complex traits, including growth, yield and adaptation to stress. Whereas high-throughput phenotyping may help to improve understanding of crop physiology, the most powerful techniques for high-throughput field phenotyping are empirical rather than analytical and comparable to genomic selection. Despite the fact that the two methodological approaches represent the extremes of what is understood as the breeding process (phenotype versus genome), they both consider the targeted traits (e.g. grain yield, growth, phenology, plant adaptation to stress) as a black box instead of dissecting them as a set of secondary traits (i.e. physiological) putatively related to the target trait. Both GS and high-throughput phenotyping have in common their empirical approach, enabling breeders to use genome profile or phenotype without understanding the underlying biology. This short review discusses the main aspects of both approaches and focuses on the case of genomic selection of maize flowering traits and near-infrared spectroscopy (NIRS) and plant spectral reflectance as high-throughput field phenotyping methods for complex traits such as crop growth and yield.

Computer simulation in plant breeding

Posted by Carelia Juarez in Journal Articles

Published in Advances in Agronomy 116: 219-264, 2012

Xin Li, Chengsong Zhu, Jiankang Wang and Jianming Yu

As a bridge between theory and experimentation, computer simulation has become a powerful tool in scientific research, providing not only preliminary validation of theories but also guidelines for empirical experiments. Plant breeding focuses on developing superior genotypes with available genetic and nongenetic resources, and improved plant-breeding methods maximize genetic gain and cost-effectiveness. Computer simulation can lay out the breeding process in silico and identify optimal candidates for various scenarios; empirical validation can then follow. Insights gained from empirical studies, in turn, can be incorporated into computer simulations. In this review, we discuss different applications of computer simulation in plant breeding. First, we briefly summarize the history of plant breeding and computer simulation and how computer simulation can facilitate the breeding process. Next, we partition the utility of computer simulation into different research areas of plant breeding, including breeding method comparison, genetic mapping, gene network and genotype-by-environment interaction simulation, and crop modeling. Then we discuss computational issues involved in simulation. Finally, we offer some perspectives on the future of computer simulation in plant breeding.

Genome-enabled prediction of genetic values using radial basis function neural networks

Posted by Carelia Juarez in Journal Articles

Published in Theoretical and Applied Genetics 125 (4): 759-771, 2012

J. M. González-Camacho, G. de los Campos, P. Pérez, D. Gianola, J. E. Cairns, G. Mahuku, R. Babu and J. Crossa

The availability of high density panels of molecular markers has prompted the adoption of genomic selection (GS) methods in animal and plant breeding. In GS, parametric, semi-parametric and non-parametric regression models are used for predicting quantitative traits. This article shows how to use neural networks with radial basis functions (RBFs) for prediction with dense molecular markers. We illustrate the use of the linear Bayesian LASSO regression model and of two non-linear regression models, reproducing kernel Hilbert spaces (RKHS) regression and radial basis function neural networks (RBFNN) on simulated data and real maize lines genotyped with 55,000 markers and evaluated for several trait–environment combinations. The empirical results of this study indicated that the three models showed similar overall prediction accuracy, with a slight and consistent superiority of RKHS and RBFNN over the additive Bayesian LASSO model. Results from the simulated data indicate that RKHS and RBFNN models captured epistatic effects; however, adding non-signal (redundant) predictors (interaction between markers) can adversely affect the predictive accuracy of the non-linear regression models.
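
For readers unfamiliar with radial basis function networks, the following is a minimal sketch of the general idea: a hidden layer of Gaussian basis functions centred at chosen points, followed by a linear output layer. How the centres and bandwidth are chosen, and fitting the output weights by ridge-penalised least squares, are assumptions made here for illustration; the abstract does not describe the training procedure used in the article, and the function names are hypothetical.

```python
import numpy as np

def rbf_features(X, centers, bandwidth):
    """Hidden-layer activations: exp(-bandwidth * squared distance)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-bandwidth * d2)

def design_matrix(X, centers, bandwidth):
    """Intercept column plus one column per basis function."""
    H = rbf_features(X, centers, bandwidth)
    return np.column_stack([np.ones(H.shape[0]), H])

def fit_rbfnn(X, y, centers, bandwidth, ridge=1e-6):
    """Output-layer weights by ridge-penalised least squares (assumed)."""
    H = design_matrix(X, centers, bandwidth)
    A = H.T @ H + ridge * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ np.asarray(y, dtype=float))

def predict_rbfnn(X, centers, bandwidth, weights):
    """Network output for new marker profiles."""
    return design_matrix(X, centers, bandwidth) @ weights

if __name__ == "__main__":
    # Toy marker profiles; in practice rows would be genotyped lines.
    rng = np.random.default_rng(2)
    X = rng.integers(0, 2, size=(40, 30)).astype(float)
    y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=40)  # epistatic toy signal
    centers = X[:10]            # assumed: a subset of lines as centres
    bandwidth = 1.0 / X.shape[1]
    weights = fit_rbfnn(X[:30], y[:30], centers, bandwidth)
    print(predict_rbfnn(X[30:], centers, bandwidth, weights))
```

Because each basis function depends on the distance between whole marker profiles, the network output is a non-linear function of the markers, which is how such models can pick up interactions between loci without listing them as predictors.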

Genomic Selection and Prediction in Plant Breeding

Posted in Journal Articles

Published in Journal of Crop Improvement 25 (3): 239-261, 2011

José Crossa, Paulino Pérez, Gustavo de los Campos, George Mahuku, Susanne Dreisigacker and Cosmos Magorokosho

The availability of thousands of genome-wide molecular markers has made possible the use of genomic selection in plants and animals. However, the evaluation of models for genomic selection in plant breeding populations remains limited. In this study, we provide an overview of several models for genomic selection, whose predictive ability we investigate using two plant data sets. The first data set comprises historical phenotypic records of a series of wheat (Triticum aestivum L.) trials evaluated in 10 environments and recently generated genomic data. The second data set pertains to international maize (Zea mays L.) trials in which two disease traits (Exserohilum turcicum and Cercospora zeae-maydis) of maize lines evaluated in five environments were measured. Results showed that models including marker information yielded important gains in predictive ability relative to that of a pedigree-based model, even with a modest number of markers. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction was an important component of genetic variability. Overall, the study provided evidence from real populations indicating that genomic selection could be an effective tool for improving traits of economic importance in commercial crops.