The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

Department of Computer Science, Iowa State University, Ames, IA 50011, USA

Department of Biology, University of Florida, Gainesville, FL 32611, USA

Abstract

Background

Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well-established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.

Results

We introduce efficient, local-search-based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n^{2}) respectively (where n is the number of taxa, or leaves, in the supertree).

Conclusions

Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.

Introduction

Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants

Although supertrees can support large-scale evolutionary and ecological analyses, there are still numerous concerns about the performance of existing supertree methods (e.g.,

By far the most commonly used supertree method is matrix representation with parsimony (MRP), which works by solving the parsimony problem on a binary matrix representation of the input trees
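
As a rough illustration of the matrix representation step only (not the parsimony search), the sketch below encodes each non-trivial cluster of each rooted input tree as one binary character: `1` for taxa inside the cluster, `0` for taxa in that tree but outside the cluster, and `?` for taxa absent from the tree. The nested-tuple tree encoding and the function names `clusters` and `mrp_matrix` are assumptions made for this example; the all-zero outgroup row that MRP analyses typically add for rooting is omitted.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):          # a leaf is any non-tuple label
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)                        # cluster of this internal node
    return leaves, found


def mrp_matrix(trees):
    """Map each taxon to its row of the matrix representation (a string over 0/1/?)."""
    taxa = sorted(set().union(*(clusters(t)[0] for t in trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in trees:
        leaves, found = clusters(tree)
        # Singletons and the full leaf set carry no grouping information.
        informative = [c for c in found if 1 < len(c) < len(leaves)]
        for cluster in sorted(informative, key=sorted):   # deterministic column order
            for taxon in taxa:
                if taxon in cluster:
                    rows[taxon].append("1")
                elif taxon in leaves:
                    rows[taxon].append("0")
                else:
                    rows[taxon].append("?")
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


trees = [(("a", "b"), "c"), (("b", "d"), "c")]
for taxon, row in mrp_matrix(trees).items():
    print(taxon, row)
```

Each input tree contributes one column per informative cluster, so taxa that never co-occur in an input tree are related only through `?` entries.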

Since evolutionary biologists rarely, if ever, know the true relationships for a group of species, it is difficult to assess the accuracy of supertree, or any phylogenetic, methods. One approach to evaluate the accuracy of supertrees is with simulations (e.g.,

Numerous metrics exist to measure the similarity of input trees to a supertree, and the Robinson-Foulds (RF) distance metric

The RF distance metric between two rooted trees is defined to be a normalized count of the symmetric difference between the sets of clusters of the two trees. In the supertree setting, the input trees will often have only a strict subset of the taxa present in the supertree. Thus, a high RF distance between an input tree and a supertree does not necessarily correspond to conflicting evolutionary histories; it can also indicate incomplete phylogenetic information. Consequently, to compute the RF distance between the supertree and an input tree that has only a strict subset of its taxa, we first restrict the supertree to the leaf set of the input tree. This adapted version of the RF distance is not a metric, or even a distance measure (mathematically speaking). However, for convenience, we refer to it by the same name.
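
The restriction step can be made concrete with a small sketch. It assumes rooted trees written as nested tuples of leaf labels and that every leaf of the input tree also occurs in the supertree; the function names are illustrative. It returns the raw size of the symmetric difference of the cluster sets and leaves out any normalization. Restricting the supertree is done implicitly: the clusters of the restricted supertree are exactly the non-empty intersections of its clusters with the input tree's leaf set.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):          # a leaf is any non-tuple label
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)                        # cluster of this internal node
    return leaves, found


def rf_distance(input_tree, supertree):
    """Count clusters found in exactly one of: input tree, restricted supertree."""
    leaves, input_clusters = clusters(input_tree)
    _, super_clusters = clusters(supertree)
    # Clusters of the supertree restricted to the input tree's leaf set.
    restricted = {c & leaves for c in super_clusters if c & leaves}
    return len(input_clusters ^ restricted)


supertree = ((("a", "b"), "d"), "c")
print(rf_distance((("a", "b"), "c"), supertree))   # input tree agrees with the supertree
print(rf_distance((("a", "c"), "b"), supertree))   # input tree conflicts with the supertree
```

In the first call the missing taxon `d` causes no penalty, which is the point of restricting the supertree before comparing.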

Previous work

Supertree methods are a generalization of consensus methods, in which all the input trees have the same leaf set. The problem of finding an optimal median tree under the RF distance in such a consensus setting is well-studied. In particular, it is known that the majority-rule consensus of the input trees must be a median tree
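
For the consensus setting, the clusters of the majority-rule consensus can be computed by simple counting. The sketch below assumes all input trees share one leaf set and are written as nested tuples; the function names are illustrative. It keeps every cluster that appears in more than half of the input trees; such clusters are pairwise compatible, so together they define a tree.

```python
from collections import Counter


def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):          # a leaf is any non-tuple label
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)                        # cluster of this internal node
    return leaves, found


def majority_rule_clusters(trees):
    """Return the clusters present in strictly more than half of the trees."""
    counts = Counter()
    for tree in trees:
        counts.update(clusters(tree)[1])     # each tree counts a cluster at most once
    return {c for c, n in counts.items() if n > len(trees) / 2}


trees = [(("a", "b"), ("c", "d")), ((("a", "b"), "c"), "d"), (("a", "b"), ("c", "d"))]
for cluster in sorted(majority_rule_clusters(trees), key=sorted):
    print(sorted(cluster))
```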

Our definition of the RF distance between two trees, where one has only a strict subset of the taxa in the other, corresponds to the distance measure used to define "majority-rule(-) supertrees" by Cotton and Wilkinson

The RF distance between two trees on the same size

In the case of unrooted trees, the RF distance metric is sometimes also known as the splits metric (e.g.,

Local Search

We use a heuristic approach for the RF supertree problem. Local search is the basis of effective heuristics for many phylogenetic problems. These heuristics iteratively search through the space of possible supertrees guided, at each step, by solutions to some local search problem. More formally, in these heuristics, a

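Two of the most extensively used tree edit operations for supertrees are rooted Subtree Prune and Regraft (SPR) and rooted Tree Bisection and Reconnection (TBR). The best known (naïve) algorithms for the SPR and TBR local search problems require O(kn^{3}) and O(kn^{4}) time respectively, where k is the number of input trees and n is the number of leaves in the supertree.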
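
A naïve hill-climbing search over the rooted SPR neighbourhood can be sketched as follows. This is the brute-force baseline that rescores every neighbour from scratch, not the faster algorithm developed in this paper. It assumes binary rooted trees written as nested 2-tuples, and all function names are illustrative.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)
    return leaves, found


def total_rf(profile, supertree):
    """Sum, over input trees, of the RF distance to the restricted supertree."""
    super_clusters = clusters(supertree)[1]
    total = 0
    for tree in profile:
        leaves, tree_clusters = clusters(tree)
        restricted = {c & leaves for c in super_clusters if c & leaves}
        total += len(tree_clusters ^ restricted)
    return total


def paths(tree, prefix=()):
    """Yield the child-index path of every node, root first."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    head, rest = path[0], path[1:]
    return tuple(replace(c, rest, new) if i == head else c for i, c in enumerate(tree))


def spr_neighbors(tree):
    """Yield every tree obtained by pruning one subtree and regrafting it."""
    for p in paths(tree):
        if not p:
            continue                                  # the root cannot be pruned
        pruned = get(tree, p)
        sibling = get(tree, p[:-1])[1 - p[-1]]        # binary tree: the other child
        rest = replace(tree, p[:-1], sibling)         # suppress the old parent
        for q in paths(rest):                         # regraft above each node
            yield replace(rest, q, (get(rest, q), pruned))


def hill_climb(profile, start):
    """Move to the best SPR neighbour until no neighbour improves the score."""
    current, best = start, total_rf(profile, start)
    while True:
        candidate = min(spr_neighbors(current), key=lambda t: total_rf(profile, t))
        score = total_rf(profile, candidate)
        if score >= best:
            return current, best
        current, best = candidate, score


profile = [(("a", "b"), ("c", "d"))]
tree, score = hill_climb(profile, ((("a", "c"), "b"), "d"))
print(tree, score)
```

Each step scores a quadratic number of neighbours, and each score costs time linear in the tree size per input tree, which is where the cubic cost of the naïve SPR search comes from.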

Our Contribution

We describe efficient hill-climbing heuristics for the RF supertree problem. These heuristics are based on novel non-trivial algorithms that can solve the corresponding local search problems for both SPR and TBR in O(kn^{2}) time, yielding speed-ups of Θ(n) and Θ(n^{2}) over the best known solutions respectively. These new algorithms are inspired by fast local search algorithms for the gene duplication problem

Basic Notation and Preliminaries

A _{
T
}to be the partial order on _{
T
}
_{
T
}is denoted by ℒ(_{
T
}
_{
T
}(_{
T
}(_{
T
}. The _{
y
}, is the tree induced by {_{
v
}; i.e. _{
v
}). We denote the set of all clusters of a tree _{1}, ..., _{
k
}).

The RF Supertree Problem

Given a profile

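**Definition 1** (RF Distance). Given a profile P = (T_{1}, ..., T_{k}) and a supertree T* on P, the RF distance between an input tree T_{i} and T* is

RF(T_{i}, T*) = |ℋ(T_{i}) Δ ℋ(T*[ℒ(T_{i})])|,

where T*[ℒ(T_{i})] denotes T* restricted to the leaf set of T_{i}. The RF distance from the profile to T* is the sum of these values over all input trees.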

**Problem 1 **(RF Supertree).

Instance:

Find: _{
opt
}
_{
opt
}) =

Recall that the RF Supertree problem is NP-hard

Local Search Problems

Here we first provide definitions for the re-rooting operation (denoted

**Definition 2 **(_{
x ∈ V(T)}{

For technical reasons, before we can define the

**Definition 3 **(Planted tree).

**Definition 4 **(_{
T
}(

**TBR Operation**_{S}(

**Notation**. We define the following:

_{
T
}(_{
y ∈ Y
}{_{
T
}(

_{
T
}(_{
x ∈ X
}
_{
T
}(

_{
T
}= ∪_{(u, v) ∈ E(T) }
_{
T
}(

**Definition 5 **(_{
T
}(_{
T
}(_{
T
}(_{
v
}

**Notation**. We define the following:

_{
T
}(_{
y ∈ Y
}{_{
T
}(

_{
T
}= ∪_{(u, v) ∈ E(T) }
_{
T
}(

Note that an

We now define the relevant local search problems based on the

**Problem 2 **(_{1}, ..., _{
k
}) _{
T
}

**Problem 3 **(_{1}, ..., _{
k
}), _{
T
}(

The problems

Throughout the remainder of this manuscript,

**Observation 1**.

We show how to solve the ^{2}) time. Since _{
T
}⊆ _{
T
}this also implies an ^{2}) solution for the ^{2}) and Θ(

In particular, we first show that any instance of the ^{2}) time algorithm for the

Note that the size of the set _{
T
}is Θ(^{3}). Thus, for each tree in the input profile the time complexity of computing and enumerating the RF distances of all trees in _{
T
}is Ω(^{3}). However, to solve the _{
T
}. In fact, after the initial ^{2}) preprocessing step, our algorithm can output the RF distance of any tree in _{
T
}in

Structural Properties

Throughout this section, we limit our attention to one tree

Our algorithm makes use of the LCA mapping from

**Definition 6 **(LCA Mapping). _{
T', T
}: _{
T', T
}(_{
T
}(ℒ(
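
To illustrate how an LCA mapping can be used to compare clusters, the sketch below maps each node of an input tree to the least common ancestor, in the restricted supertree, of that node's leaves. Because the mapped-to cluster always contains the input cluster, the two are equal exactly when their sizes are equal, so the RF distance can be read off from cluster sizes alone. This is a naïve quadratic version for illustration, assuming nested-tuple trees and illustrative function names; it is not the efficient procedure used by the algorithm.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)
    return leaves, found


def lca_mapping(input_tree, supertree):
    """Map each input cluster to the cluster of its LCA in the restricted supertree."""
    leaves, input_clusters = clusters(input_tree)
    restricted = {c & leaves for c in clusters(supertree)[1] if c & leaves}
    # Supertree clusters that contain a given cluster are nested,
    # so the smallest one belongs to the least common ancestor.
    mapping = {c: min((d for d in restricted if c <= d), key=len) for c in input_clusters}
    return mapping, restricted


def rf_via_lca(input_tree, supertree):
    """RF distance computed by comparing cluster sizes under the LCA mapping."""
    mapping, restricted = lca_mapping(input_tree, supertree)
    matched = sum(1 for c, d in mapping.items() if len(c) == len(d))
    return len(mapping) + len(restricted) - 2 * matched


input_tree = (("a", "c"), "b")
supertree = ((("a", "b"), "d"), "c")
mapping, _ = lca_mapping(input_tree, supertree)
print(sorted(mapping[frozenset("ac")]), rf_via_lca(input_tree, supertree))
```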

**Notation**. We define a boolean function _{
T
}: _{
T
}(_{
T
}(_{
T
}(_{
T
}= {_{
T
}(_{
T
}is the set of all nodes

The following lemma associates the value _{
T
}.

**Lemma 1**. _{
T
}|.

_{
T
}(_{
T
}| = _{
T
}|. □

**Lemma 2**. _{
T
}(_{
S, T
}(

_{
S, T
}(_{
S, T
}(_{
T
}(_{
S, T
}(_{
S, T
}(_{
T
}(

The LCA mapping from _{
T
}(v) such that _{
T
}(

We now show that the

As before, we limit our attention to one tree _{
S, T
}(_{
v
})}.

**Lemma 3**. _{
T'
}(_{
T
}(_{
T
}(_{
T'
}(_{
T
}(_{
T
}(_{
v
}).

_{
T
}(_{
v
}), the subtrees _{
v
}and _{
S, T
}(_{
v
}and, consequently, _{
T'
}(_{
T
}(

Now consider the case when _{
v
}). Thus, for any node _{
v
}), we must have ℒ_{
T
}(_{
T'
}(_{
S, T
}(_{
S, T'
}(_{
T'
}(_{
T
}(

Lemma 3 implies that a tree in _{
T
}(_{
T
}(_{
T
}(^{2}) trees. It is interesting to note that this ability to decompose the

Thus, to solve the _{
v
}for which ℱ_{
T'
}is minimized, and (ii) a regraft location _{
v
}which minimizes |ℱ_{SPR}
_{(v, y)}|. Observe that the problem in part (ii) is simply the

**Problem 4 **(Rooting). _{1}, ..., _{
k
}), _{
v
}) _{
T
}(

Note that the problem in part (i) is the Rooting problem on the input instance ⟨(

**Theorem 1**.

**Theorem 2**. ^{2})

Solving the Rooting Problem

To solve the Rooting problem on instance ⟨(_{
T'
}(_{
v
}) and any _{
S, T
}(_{
T
}(_{
v
}). Depending on _{
T
}(_{
v
}), (ii) _{
v
}) and _{
T
}(_{
v
}) and _{
T
}(_{
v
})\_{
v
}) and _{
T
}(_{
v
})\_{
v
}) and _{
T
}(_{
T'
}(

**Lemma 4**. _{
v
}), _{
T'
}(_{
T
}(_{
v
}).

**Lemma 5**. _{
v
}) _{
T
}(_{
T'
}(_{
v
}).

_{
v
}) and _{
T
}(_{
u
}) = ℒ(_{
v
}). Thus, for any _{
v
}), ℳ_{
S, T'
}(_{
v
},

**Lemma 6**. _{
v
})\ℒ(_{
u
}), _{
v
}) _{
T
}(

_{
T'
}(

_{
T'
}(_{
b
})|.

_{
v
}) and _{
T
}(_{
u
}) ≠ ℒ(_{
v
}). We analyze each part of the lemma separately.

1. _{
v
}). Therefore, let _{
T'
}(_{
u
}) ≠ ∅. Therefore, we must have _{
T'
}ℳ_{
S, T'
}(_{
T'
}(

2.

(a) |_{
b
})|: In this case we must have _{
v
}). Therefore, let _{
v
}. Now consider the tree _{
u
}). Hence, _{
T'
}(

(b) |_{
b
})|: We claim that there does not exist any edge (_{
v
}) such that ℒ(_{
w
}) is either ℒ(_{
u
}) or _{
w
}) = ℒ(_{
u
}) then we must have _{
v
}). If ℒ(_{
w
}) = g _{
b
})|, which is, again, a contradiction. Thus, such an edge (_{
v
}) cannot exist. Hence, we must have _{
T'
}(_{
v
}) in this case.

The lemma follows. □

**Lemma 7**. _{
v
})\_{
v
}) _{
T
}(_{
T'
}(_{
v
}

_{
u
}) = ℒ(_{
a
}). We have two cases:

1. _{
S, T'
}(_{
a
}) = ℒ(_{
u
}) = ℒ(_{
T'
}(

2. _{
S, T'
}(_{
v
}, _{
v
}, _{
v
}), and ℒ(_{
u
}) ≠ ℒ(_{
v
}), Lemma 2 implies that _{
T'
}(

The lemma follows. □

**Lemma 8**. _{
v
})\_{
v
}) _{
T
}(_{
T'
}(_{
v
}).

_{
u
}) ≠ ℒ(_{
a
}). We have two possible cases:

1. _{
S, T'
}(_{
a
}) = ℒ(_{
u
}) ≠ ℒ(_{
T'
}(

2. _{
S, T'
}(_{
v
}, _{
v
}, _{
v
}), and ℒ(_{
u
}) ≠ ℒ(_{
v
}), Lemma 2 implies that _{
T'
}(

The lemma follows.

**The Algorithm**. For any _{
v
}) let

{_{
T
}(_{
T'
}(

{_{
T
}(_{
T'
}(_{
T
}(

By definition, to solve the Rooting problem we must find a node _{
v
}) for which |_{
v
}), the values

In a preprocessing step, our algorithm computes the mapping ℳ_{
S, T
}as well as the size of each cluster in _{
v
}). This takes _{
v
}) will be the values

Recall that, given _{
S, T
}(

The algorithm then traverses through

1. If _{
T'
}(_{
T
}(

2. If _{
a
})\{_{
v
}can be used to compute the correct values of _{
v
}).

3. If _{
v
}) such that _{
b
})| = |_{
b
}); otherwise, if such a _{
v
}suffices to compute the correct values of _{
v
}). In order to prove the _{
v
})]. Observe that, given any candidate _{
v
}If such an _{
v
}This edge _{
v
}. Every edge in this strict consensus corresponds to an edge in _{
v
}that induce the same bi-partitions in the two trees.

Thus, for all candidate _{
v
}), and for all candidate _{
v
}can be precomputed with-in

Hence, the Rooting problem for a profile consisting of a single tree can be solved in

**Theorem 3**.

Solving the

We will show how to solve the _{
T
}(_{
R
}(_{
S
}(_{
T'
}(_{
T'
}(_{
R
}(

For brevity, let _{
S, R
}(_{
v
}) ∪ {_{
R
}(

Depending on _{
R
}(_{
v
}), (ii) _{
R
}(_{
R
}(_{
T'
}(

**Lemma 9**. _{
v
}), _{
T'
}(_{
R
}(

_{
R
}(_{
R
}(

**Lemma 10**. _{
R
}(

_{
T'
}(_{
R
}

_{
T'
}(

_{
R
}(_{
R
}(

1. _{
R
}
_{
S, T'
}(_{
T'
}(

2. _{
R
}
_{
S, T'
}(_{
T'
}(

The lemma follows. □

**Lemma 11**. _{
R
}(_{
T'
}(

_{
R
}(_{
R
}(

1. _{
R
}
_{
S, T'
}(_{
T'
}(

2. _{
R
}
_{
S, T'
}(_{
T'
}(

The lemma follows. □

For the next lemma, let _{
S, R
}(_{
v
}.

**Lemma 12**. _{
S', R
}(_{
T'
}(_{
R
}
_{
b
})| + |ℒ(_{
v
})| = |ℒ(_{
u
})|.

_{
S', R
}(_{
S', R
}(_{
b
}), which implies that ℒ(_{
u
}) ⊆ ℒ(_{
v
}) ⊆ ℒ(_{
b
}). We now have the following three cases:

1. _{
R
}
_{
S, T'
}(_{
T'
}(_{
T'
}(_{
S, T'
}(_{
u
}) ⊆ ℒ(_{
v
}) ⊆ ℒ(_{
b
}), and _{
R
}
_{
S, T'
}(_{
T'
}(

2. _{
R
}
**and **|ℒ(_{
b
})| + |ℒ(_{
v
})| ≠ |ℒ(_{
u
})|: In this case we must have ℳ_{
S, T'
}(_{
b
})| + |ℒ(_{
v
})| ≠ |ℒ(_{
u
})|, we must have ℒ(_{
u
}) ⊂ ℒ(_{
v
}) ∪ ℒ(_{
b
}), which implies that |_{
S, T'
}(_{
T'
}(

3. _{
R
}
**and **|ℒ(_{
b
})| + |ℒ(_{
v
})| = |ℒ(_{
u
})|: In this case we must have ℳ_{
S, T'
}(_{
b
})| + |ℒ(_{
v
})| = |ℒ(_{
u
})|, we must have |_{
S, T'
}(_{
T'
}(

The lemma follows. □

**The Algorithm**. Note that _{
T
}(_{
R
}(_{
x ∈ Q
}
_{
R
}(_{
R
}(_{
T'
}(_{
R
}(_{
T'
}(_{
R
}(

In a preprocessing step, our algorithm first constructs the tree _{
S, R
}as well as the size of each cluster in

Recall that, given _{
S, R
}(

The algorithm then traverses through

1. If _{
T'
}(_{
R
}(

2. If _{
a
})\{

3. If _{
b
})| + |ℒ(_{
v
})| = |ℒ(_{
u
})|, then we increment the value of _{
b
})\{

Again, to do this efficiently, we increment a counter at node _{
S', R
}can be computed in _{
b
})| + |ℒ(_{
v
})| = |ℒ(_{
u
})| is also verifiable in

Hence, the

**Theorem 4**.

**Remark**. To improve the performance of local search heuristics in phylogeny construction, the starting tree for the first local search step is often constructed using a greedy 'stepwise addition' procedure. This greedy procedure builds a starting species tree step-by-step by adding one taxon at a time at its locally optimal position. In the context of RF supertrees, our algorithm for the
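
The greedy stepwise addition procedure described in this remark can be sketched in a brute-force form. The version below tries every attachment point for each new taxon and rescores from scratch, so it only illustrates the procedure and does not use the faster local search algorithm. It assumes nested-tuple trees, a caller-supplied taxon order, and a partial-tree score that restricts both trees to their shared leaves; these choices and all function names are assumptions made for the example.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)
    return leaves, found


def rf(tree_a, tree_b):
    """Symmetric-difference count after restricting both trees to shared leaves."""
    leaves_a, clusters_a = clusters(tree_a)
    leaves_b, clusters_b = clusters(tree_b)
    common = leaves_a & leaves_b
    side_a = {c & common for c in clusters_a if c & common}
    side_b = {c & common for c in clusters_b if c & common}
    return len(side_a ^ side_b)


def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` above one node of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            for new_child in insertions(child, taxon):
                yield tree[:i] + (new_child,) + tree[i + 1:]


def stepwise_addition(profile, taxa):
    """Add taxa one at a time, each at its locally best position."""
    tree = (taxa[0], taxa[1])
    for taxon in taxa[2:]:
        tree = min(insertions(tree, taxon),
                   key=lambda t: sum(rf(s, t) for s in profile))
    return tree


profile = [(("a", "b"), ("c", "d"))]
start_tree = stepwise_addition(profile, ["a", "b", "c", "d"])
print(start_tree, rf(profile[0], start_tree))
```

The resulting tree would then serve as the starting point for the first local search step.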

Experimental Evaluation

In order to evaluate the performance of the RF supertree method, we implemented an RF heuristic based on the

Experimental Results.

| Data Set | Supertree Method | RF-Distance | Parsimony Score |
|---|---|---|---|
| Marsupial (272 taxa; 158 trees) | RF-Ratchet | 1514 | 2528 |
| | RF-MRP | **1502** | 2513 |
| | MRP-TBR | 1514 | **2509** |
| | MRP-Ratchet | 1514 | **2509** |
| | Triplet | 1604 | 2569 |
| Sea Birds (121 taxa; 7 trees) | RF-Ratchet | **61** | 223 |
| | RF-MRP | **61** | 223 |
| | MRP-TBR | 63 | **221** |
| | MRP-Ratchet | 63 | **221** |
| | Triplet | **61** | 223 |
| Placental Mammals (116 taxa; 726 trees) | RF-Ratchet | **5686** | 8926 |
| | RF-MRP | 5690 | 8890 |
| | MRP-TBR | 5694 | **8878** |
| | MRP-Ratchet | 5694 | **8878** |
| | Triplet | 6032 | 9064 |
| Legumes (571 taxa; 22 trees) | RF-Ratchet | 1556 | 965 |
| | RF-MRP | **1534** | 882 |
| | MRP-TBR | 1554 | 856 |
| | MRP-Ratchet | 1552 | **854** |
| | Triplet | N/A | N/A |

Experimental results comparing the performance of the RF supertree method to MRP and triplet supertree methods. We used five different supertree analyses: RF supertrees using our

There are a number of ways to implement any local search algorithm. Preliminary analyses of the RF heuristic based on the

For our MRP analyses, we also tried two heuristic search methods, both implemented using PAUP*

Our analyses demonstrate the effectiveness of our local search heuristics for the RF supertree problem. In all four data sets, one of the RF supertree searches (RF-Ratchet or RF-MRP) found a supertree with the lowest total RF distance to the input trees (Table

All of the data sets used in this analysis are from published studies that used MRP. Therefore, it is not surprising that MRP performed well (but see

Interestingly, while the MRP trees tend to have relatively low RF-distance scores, in some cases, such as the legume data set, trees with low RF-distance scores have high parsimony scores (Table

Our program for computing RF supertrees is freely available (for Windows, Linux, and Mac OS X) at

Discussion and Conclusion

There is a growing interest in using supertrees for large-scale evolutionary and ecological analyses. Yet there are many concerns about the performance of existing supertree methods, and the great majority of published supertree analyses have relied on only MRP

There are numerous alternate metrics to compare phylogenetic trees besides the RF distance, and any of these can be used for supertree methods (see, for example,

The results also suggest several future directions for research. Although heuristics guided by local search problems, especially ^{2}·min{

In some cases it might be desirable to remove the restriction that the supertree be binary. In the consensus setting, such a median tree can be obtained within polynomial time

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MSB was responsible for algorithm design and program implementation, contributed to the experimental evaluation, and wrote major parts of the manuscript. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the manuscript. OE and DFB supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Harris Lin for providing software for the triplet supertree analyses. This work was supported in part by NESCent and by NSF grants DEB-0334832 and DEB-0829674. MSB was supported in part by a postdoctoral fellowship from the Edmond J. Safra Bioinformatics program at Tel Aviv University.