--- title: "Testing RRphylo methods overfit" author: "Silvia Castiglione, Carmela Serio, Giorgia Girardi, Pasquale Raia" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{overfit} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} if (!requireNamespace("rmarkdown", quietly = TRUE) || !rmarkdown::pandoc_available()) { warning(call. = FALSE, "Pandoc not found, the vignettes is not built") knitr::knit_exit() } knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval=FALSE ) options(rmarkdown.html_vignette.check_title = FALSE) ``` ## Index 1. [overfit- functions basics](#basics) 2. [overfitRR](#overfitRR) 3. [overfitSS](#overfitSS) 4. [overfitST](#overfitST) 5. [overfitSC](#overfitSC) 6. [overfitPGLS](#overfitPGLS) 7. [Guided examples](#examples) ## overfit- functions basics {#basics} Methods using a large number of parameters risk being overfit. This usually translates in poor fitting with data and trees other than the those originally used. With RRphylo methods this risk is usually very low. However, the user can assess how robust the results got by applying [`search.shift`](search.shift.html), [`search.trend`](search.trend.html), [`search.conv`](search.conv.html), or `PGLS_fossil` are by running the `overfit-` functions. The basic idea of `overfit-` is using alternative tree topologies to test for both phylogenetic and sampling uncertainty at the same time. Such alternative phylogenies can be provided by the user as a `list`/`multiPhylo` object, otherwise can be generated through [`resampleTree`](Alternative-trees.html). In this latter case, the original tree and data are subsampled by specifying a `s` parameter, that is the proportion of tips to be removed from the tree, and species positions are shuffled by using the function [`swapONE`](Alternative-trees.html#swapone). When generating phylogenies to test the robustness of `search.shift`, `search.trend` at individual clades, and `search.conv`, it is advisable to constrain `resampleTree` so that it preserves the identity of clades/groups under testing. This is achieved by providing as `node`/`categories` arguments of `resampleTree` the `node`/`nodes`/`state` arguments in the main functions. Similarly, using the phylogenetic tree returned by `RRphylo` is recommended. ## overfitRR {#overfitRR} `overfitRR` tests the robustness of the robustness of ancestral state and rate estimates. It takes as input an object generated by [`RRphylo`](RRphylo.html), a list of alternative phylogenetic trees, and all the data used to produce it (besides necessary phenotypic data, any other argument such as covariate, predictor, and so on, passed to `RRphylo`). The output is a `RRphyloList` object including: the list of `RRphylo` per simulation (`$RR.list`), the phenotypic estimate at the tree root per simulation (`$root.est`), the 95% confidence interval around the original phenotypic value at the tree root (`$rootCI`) and the regression parameters describing the relation between the original values at internal nodes and the corresponding figure after subsampling and swapping (`$ace.regressions`). A regression slope close to one indicates a better matching between original and subsampled values, suggesting the estimation is robust to phylogenetic uncertainty and subsampling. Since the ancestral state estimation within `RRphylo` is unbounded, sometimes altering the tree topology leads to implausible root estimate (n.b. setting the `rootV` does not bound the estimation, it does not solve this problem). In this case it is advisable to discard such questionable outputs from `$RR.list` before running further analyses. ### Guided examples ```{r eval=FALSE} ## overfitRR routine # load the RRphylo example dataset including Ornithodirans tree and data library(ape) data("DataOrnithodirans") DataOrnithodirans$treedino->treedino log(DataOrnithodirans$massdino)->massdino DataOrnithodirans$statedino->statedino # extract Pterosaurs tree and data extract.clade(treedino,746)->treeptero massdino[match(treeptero$tip.label,names(massdino))]->massptero massptero[match(treeptero$tip.label,names(massptero))]->massptero # peform RRphylo on body mass RRphylo(tree=treeptero,y=massptero,clus=cc)->RRptero # generate a list of subsampled and swapped phylogenies to test tree.list<-resampleTree(RRptero$tree,s = 0.25,swap.si = 0.1,swap.si2 = 0.1,nsim=10) # test the robustness of RRphylo ofRRptero<-overfitRR(RR = RRptero,y=massptero,phylo.list=tree.list,clus=cc) ## overfitRR routine on multiple RRphylo # load the RRphylo example dataset including Cetaceans tree and data data("DataCetaceans") DataCetaceans$treecet->treecet DataCetaceans$masscet->masscet DataCetaceans$brainmasscet->brainmasscet DataCetaceans$aceMyst->aceMyst # cross-reference the phylogenetic tree and body and brain mass data. Remove from # both the tree and vector of body sizes the species whose brain size is missing drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet),treecet$tip.label)])->treecet.multi masscet[match(treecet.multi$tip.label,names(masscet))]->masscet.multi # peform RRphylo on the variable (body mass) to be used as additional predictor RRphylo(tree=treecet.multi,y=masscet.multi,clus=cc)->RRmass.multi RRmass.multi$aces[,1]->acemass.multi # create the predictor vector: retrieve the ancestral character estimates # of body size at internal nodes from the RR object ($aces) and collate them # to the vector of species' body sizes to create c(acemass.multi,masscet.multi)->x1.mass # peform RRphylo on brain mass by using body mass as additional predictor RRphylo(tree=treecet.multi,y=brainmasscet,x1=x1.mass,clus=cc)->RRmulti # generate a list of subsampled and swapped phylogenies to test treecet.list<-resampleTree(RRmulti$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10) # test the robustness of multiple RRphylo ofRRcet<-overfitRR(RR = RRmulti,y=brainmasscet,phylo.list=treecet.list,clus=cc,x1 =x1.mass) ``` ## overfitSS {#overfitSS} `overfitSS` tests the robustness of `search.shift`. It takes as input an object generated by `RRphylo` (`RR`), an object generated by `overfitRR` performed on `RR` (`oveRR`), and `node` and/or `state` as passed to `search.shift` (they can be provided at the same time). The output is a `RRphyloList` object including: the results of `search.shift` under clade condition (`$SSclade.list`) and sparse condition (`$SSsparse.list`) per simulation, and a `$shift.results` summary object. The latter element returns separate results for [`clade`](search.shift.html#clade) and [`sparse`](search.shift.html#sparse) conditions (`$shift.results`). The first (**clade**) includes the proportion of simulations producing significant and positive (**p.shift+**) or significant and negative (**p.shift-**) rate shifts for each single node (either compared to the rest of the tree - **\$single.clades\$singles** - or to the rest of the tree after removing other shifting clades - **\$single.clades\$no.others**), and for all the clades taken as a whole (**$all.clades.together** - see [*Testing rate shifts pertaining to entire clades*](search.shift.html#clade) for further details). Under the `sparse` condition (**sparse**), the same figures as before are reported for each state category compared to the rest of the tree and for all possible pair of categories (see [*Testing rate shifts pertaining to phylogenetically unrelated species*](search.shift.html#sparse) for further details). ### Guided examples ```{r } # load the RRphylo example dataset including Ornithodirans tree and data data("DataOrnithodirans") DataOrnithodirans$treedino->treedino log(DataOrnithodirans$massdino)->massdino DataOrnithodirans$statedino->statedino # peform RRphylo on Ornithodirans tree and data RRphylo(tree=treedino,y=massdino,clus=cc)->dinoRates # perform search.shift under both "clade" and "sparse" condition search.shift(RR=dinoRates, status.type= "clade")->SSauto search.shift(RR=dinoRates, status.type= "sparse", state=statedino)->SSstate ## overfitSS routine # generate a list of subsampled and swapped phylogenies, setting as categories/node # the state/node under testing tree.list<-resampleTree(dinoRates$tree,s = 0.25,categories=statedino, node=rownames(SSauto$single.clades),swap.si = 0.1,swap.si2 = 0.1,nsim=10) # test the robustness of search.shift ofRRdino<-overfitRR(RR = dinoRates,y=massdino,phylo.list=tree.list,clus=cc) ofSS<-overfitSS(RR = dinoRates,oveRR = ofRRdino,state=statedino, node=rownames(SSauto$single.clades)) ``` ## overfitST {#overfitST} `overfitST` tests the robustness of `search.trend`. It takes as input an object generated by `RRphylo` (`RR`), the phenotypic data (`y`), an object generated by `overfitRR` performed on `RR` (`oveRR`), and the arguments `x1`, `x1.residuals`, `node`, and `cov` as passed to `search.trend`. The output is a `RRphyloList` object including: the results of `search.trend` per simulation (`$ST.list`) and a `$trend.results` summary object. Within the latter object, results for the entire tree (**tree**) summarize the proportion of simulations producing positive (**slope+**) or negative (**slope-**) slopes significantly higher (**p.up**) or lower (**p.down**) than BM simulations for either phenotypes or rescaled rates versus time regressions. Such evaluations is based on **p.random** only (see [*Temporal trends on the entire tree*](search.trend.html#tree),for further details). When specific clades are under testing, the same set of results as for the whole tree is returned for each node (**node**). In this case, for phenotype versus age regression through nodes, the proportion of significant and positive/negative slopes (**slope+p.up**,**slope+p.down**,**slope-p.up**,**slope-p.down**) is accompanied by the same figures for the estimated marginal mean differences (**p.emm+** and **p.emm-**). As for the temporal trend in absolute rates through node, the proportion of significant and positive/negative estimated marginal means differences (**p.emm+** and **p.emm-**) and the same figure for slope difference (**p.slope+** and **p.slope-**) are reported (see [*Temporal trends at clade level*](search.trend.html#nodes)). Finally when more than one node is tested, the `$trend.results` object also includes results for the pairwise comparison between nodes. ### Guided examples ```{r } library(ape) ## Case 1 # load the RRphylo example dataset including Ornithodirans tree and data data("DataOrnithodirans") DataOrnithodirans$treedino->treedino log(DataOrnithodirans$massdino)->massdino DataOrnithodirans$statedino->statedino # extract Pterosaurs tree and data extract.clade(treedino,746)->treeptero massdino[match(treeptero$tip.label,names(massdino))]->massptero massptero[match(treeptero$tip.label,names(massptero))]->massptero # perform RRphylo and search.trend on body mass data RRphylo(tree=treeptero,y=massptero,clus=cc)->RRptero search.trend(RR=RRptero, y=massptero,node=143,clus=cc)->ST ## overfitST routine # generate a list of subsampled and swapped phylogenies setting as node # the clade under testing treeptero.list<-resampleTree(RRptero$tree,s = 0.25,node=143, swap.si = 0.1,swap.si2 = 0.1,nsim=10) # test the robustness of search.trend ofRRptero<-overfitRR(RR = RRptero,y=massptero,phylo.list=treeptero.list,clus=cc) ofSTptero<-overfitST(RR=RRptero,y=massptero,oveRR=ofRRptero,node=143,clus=cc) ## Case 2 # load the RRphylo example dataset including Cetaceans tree and data data("DataCetaceans") DataCetaceans$treecet->treecet DataCetaceans$masscet->masscet DataCetaceans$brainmasscet->brainmasscet # cross-reference the phylogenetic tree and body and brain mass data. Remove from # both the tree and vector of body sizes the species whose brain size is missing drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet), treecet$tip.label)])->treecet.multi masscet[match(treecet.multi$tip.label,names(masscet))]->masscet.multi # peform RRphylo on the variable (body mass) to be used as additional predictor RRphylo(tree=treecet.multi,y=masscet.multi,clus=cc)->RRmass.multi RRmass.multi$aces[,1]->acemass.multi # create the predictor vector: retrieve the ancestral character estimates # of body size at internal nodes from the RR object ($aces) and collate them # to the vector of species' body sizes to create c(acemass.multi,masscet.multi)->x1.mass # peform RRphylo and search.trend on brain mass by using body mass as additional predictor RRphylo(tree=treecet.multi,y=brainmasscet,x1=x1.mass,clus=cc)->RRmulti search.trend(RR=RRmulti, y=brainmasscet,x1=x1.mass,clus=cc)->STcet ## overfitST routine # generate a list of subsampled and swapped phylogenies to test treecet.list<-resampleTree(RRmulti$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10) # test the robustness of search.trend with and without x1.residuals ofRRcet<-overfitRR(RR = RRmulti,y=brainmasscet,phylo.list=treecet.list,clus=cc,x1 =x1.mass) ofSTcet1<-overfitST(RR=RRmulti,y=brainmasscet,oveRR=ofRRcet,x1 =x1.mass,clus=cc) ofSTcet2<-overfitST(RR=RRmulti,y=brainmasscet,oveRR=ofRRcet,x1 =x1.mass,x1.residuals = TRUE,clus=cc) ``` ## overfitSC {#overfitSC} `overfitSC` is the last step of a short routine the user should apply to test the robustness of `search.conv` results (see the [examples](#examples) below). 1) Generating a number of altered (i.e. subsampled and swapped) phylogenies by using `resampleTree`. 2) In case the multivariate shape data used as input to `search.conv` were reduced (e.g. by SVD), such data must be matched with each of the subsampled-swapped generated tree and the resulting dataset must be reduced as well. 3) The list of phylogenetic trees and multivariate shape data can be fed to `overfitSC` for testing. Similarly to `overfitRR`, `overfitSC` returns `RRphyloList` objects including the outputs of `RRphylo` and, depending on the analysis performed, the outputs of `search.conv` under `clade` or `state` conditions applied on the modified trees. Results for robustness of `search.conv` (`$conv.results`) include separate objects for convergence between [`clades`](search.conv.html#nodes) or between/within [`states`](search.conv.html#state). Under the first case (**clade**), the proportion of simulations producing significant instance of convergence (**p.ang.bydist**) or convergence and parallelism (**p.ang.conv**) between selected clades are returned (see [*Morphological convergence between clades*](search.conv.html#nodes) for further details). As for convergence between/within discrete categories (**state**), `overfitRR` returns the proportion of simulations producing significant instance of convergence either accounting (**p.ang.state.time**) or not accounting (**p.ang.state**) for the time intervening between the tips in the focal state [*Morphological convergence within/between categories*](search.conv.html#state) for explanations). ### Guided examples ```{r } library(Morpho) ## Testing search.conv # load the RRphylo example dataset including Felids tree and data data("DataFelids") DataFelids$treefel->treefel DataFelids$statefel->statefel DataFelids$landfel->feldata # perform data reduction via Procrustes superimposition (in this case) procSym(feldata)->pcafel pcafel$PCscores->PCscoresfel # perform RRphylo on Felids tree and data RRphylo(treefel,PCscoresfel)->RRfel # search for morphologicl convergence between clades (automatic mode) # and within the category search.conv(RR=RRfel, y=PCscoresfel, min.dim=5, min.dist="time38")->sc1 search.conv(tree=treefel, y=PCscoresfel, state=statefel, declust=TRUE)->sc2 # select converging clades returned in sc1 felnods<-c(85,155) ## overfitSC routine # generate a list of subsampled and swapped phylogenies to test for search.conv # robustness. Use as reference tree the phylogeny returned by RRphylo. # Set the nodes and the categories under testing as arguments of # resampleTree so that it maintains no less than 5 species in each clade/state. tree.list<-resampleTree(RRfel$tree,s=0.15,nodes=felnods,categories=statefel, nsim=15,swap.si=0.1,swap.si2=0.1) # match the original data with each subsampled-swapped phylogeny in tree.list # and repeat data reduction y.list<-lapply(tree.list,function(k){ treedataMatch(k,feldata)[[1]]->ynew procSym(ynew)$PCscores }) # test for robustness of search.conv results by overfitSC oSC<-overfitSC(RR=RRfel,phylo.list=tree.list,y.list=y.list, nodes = felnods,state=statefel) ``` ## overfitPGLS {#overfitPGLS} `overfitPGLS` tests the robustness of `PGLS_fossil`. Other than the regression formula (`modform`), it takes as input an object generated by `overfitRR` (if `PGLS_fossil` should be performed on trees rescaled according to `RRphylo` rates) and/or a list of alternative phylogenies (if no tree rescaling should be performed). These arguments are not exclusive though, they can be provided at the same time. The output includes separate objects for the analyses performed on the original tree (`$tree`) and on the tree rescaled according to `RRphylo` rates (`$RR`). ## Guided examples ```{r } library(phytools) library(ape) # generate fictional data to test the function rtree(100)->tree fastBM(tree)->resp fastBM(tree,nsim=3)->resp.multi fastBM(tree)->pred1 fastBM(tree)->pred2 data.frame(y1=resp,x2=pred1,x1=pred2)->dato # perform RRphylo and PGLS_fossil with univariate/multivariate phenotypic data PGLS_fossil(modform=y1~x1+x2,data=dato,tree=tree)->pgls_noRR RRphylo(tree,resp,clus=cc)->RR PGLS_fossil(modform=resp~pred1+pred2,RR=RR)->pgls_RR PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp.multi,x2=pred1,x1=pred2),tree=tree)->pgls2_noRR RRphylo(tree,resp.multi,clus=cc)->RR2 PGLS_fossil(modform=resp.multi~pred1+pred2,tree=tree,RR=RR)->pgls2_RR ## overfitPGLS routine # generate a list of subsampled and swapped phylogenies to test tree.list<-resampleTree(RR$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10) # test the robustnes of PGLS_fossil with univariate/multivariate phenotypic data ofRR<-overfitRR(RR = RR,y=resp,phylo.list=tree.list,clus=cc) ofPGLS<-overfitPGLS(oveRR = ofRR,phylo.list=tree.list,modform = y1~x1+x2,data=dato) ofRR2<-overfitRR(RR = RR2,y=resp.multi,phylo.list=tree.list,clus=cc) ofPGLS2<-overfitPGLS(oveRR = ofRR2,phylo.list=tree.list,modform = y1~x1+x2, data=list(y1=resp.multi,x2=pred1,x1=pred2)) ```