MNHN-Tree-Tools: Creating Trees from Sequences

Abstract

As the number of nucleic sequences available grows exponentially and faster than Moore's law, the endeavor to find more efficient and flexible algorithms is both crucial and infinite. MNHN-Tree-Tools allows you to cluster DNA or protein sequences and build a phylogenetic tree from a set of sequences. MNHN-Tree-Tool is based on an iterative adaptive version of the "discovering clusters in large spatial databases with noise" (DBSCAN) algorithm and is optimized to take advantage of Single Instruction Multiple Data (SIMD) instructions, Graphics Processing Unit (GPU) computing, multi-threading and multi-node high performance computing (HPC) architectures.

Installation

- UBUNTU 20.04 AND SIMILAR
Open a terminal and run the following:

sudo apt-get install git build-essential libpng-dev libsdl2-dev \
liblapack-dev libopenmpi-dev libpocl-dev ocl-icd-opencl-dev pocl-opencl-icd

git clone https://gitlab.in2p3.fr/mnhn-tools/mnhn-tree-tools.git

cd mnhn-tree-tools
mkdir bin
make all

cd bin
#make MNHN-TREE-TOOLS available from any folder this is temporary and you
#may modify your .bashrc and similar files to make this permanent
export PATH=$PATH:$PWD

- MACOS
We provide a tarball containing MacOS (Intel) binaries, that have been compiled and tested on MacOS Catalina. These binaries might work on later versions and are built against the Accelerate Framework from Apple, and an implementation of OpenCL that comes with MacOS. To have all tools that are mentioned in the manual at hand we ship the tarball with binaries of Gnuplot and Newick Tools.

Download treetools.tar.bz2
(sha256 checksum)
1f442f97799593282d20a4b882955803ce7bf6fd2ef1d90b5ba57b21003fb070

open a terminal and become root: i.e. by typing sudo su
move the dowoaded tarball to /opt
mv treetools.tar.bz2 /opt/ and extract the tarball

cd /opt
tar xjf treetools.tar.bz2

Add the extracted binaries to your path: export PATH=$PATH:/opt/treetools/bin You may need to rerun the last command in any new terminal you may open or adjust your shell initialzation file, i.e. .bashrc If the binaries do not work for you we refere you to the general UNIX installation outlined in the manual.

- WINDOWS
One way to install MNHN-Tree-Tools on Windows is to install the Windows Subsystem for Linux version 2 with Ubuntu and follow the Ubuntu installation. Other possibilities might exist.

- ON OTHER PLATFORMS / OPTIMZED VERSIONS
a general installation for POSIX / UNIX / UNIX-like systems is outlined in our manual. This also allows you to link this tools to high performance implementations of BLAS / LAPACK OpenCL, and MPI.

Frequently Asked Questions

- UNDER WHAT LICENSE / CONDITIONS MAY I USE MNHN-TREE-TOOLS?
You may use MNHN-Tree-Tools under the conditions of the Zlib License. MNHN-Tree-Tools development was sponsored by the French government and its tax payers and is hence, free and open source software world wide.

- AS A RESEARCHER USING MNHN-TREE-TOOLS, HOW SHOULD I INTERPRET THE DENSITY OF CLUSTERS?
Similar sequences will be grouped into clusters by the algorithm. For a given similarity, a high density of sequences thus corresponds to a high number of similar sequences. The final interpretation then depends on the type of sequences that are used together with MNHN-Tree-Tools. In the paper, we outline two different datasets: The first one is a dataset of alpha-satellites, DNA repeats found in the human centromeric regions. These repeats evolve through a multiplication and diversification mechanism. As such, a dense cluster would corresponds to a recent amplification of one repeat instance. In our second dataset, each sequence corresponds to the ribosomal RNA of a given specie. In this context, a high density cluster corresponds to a large number of species that have similar ribosomal RNA sequences, reflecting their relatedness in the course of evolution. For an in depth description of density peaks we refere the interested reader to our algorithmic suppliment.

- HOW TO CHOOSE BETWEEN THE K-MER+PCA BASED DISTANCE OR THE SMITH-WATERMAN BASED DISTANCE?
The Smith-Waterman based distance is directly related to the the number of point mutations or sequence gaps between two sequences, and hence can be interpreted in a straightforward manner. The k-mer + PCA based distance reflects a change in composition of the different k-mers between two sequences. This difference is down projected onto a low dimensional subspace spun by principal components and the euclidean norm is used to provide a direct estimation of the distance between sequences. This distance measure can be more difficult to grasp than the classical Smith-Waterman distance. However, the reduction of dimensionality has two important advantages: First, it makes the whole calculation way more efficient and rapid, and second it selects k-mers of high variance over k-mers with low variance and hence inherently performs a feature selection. This selection yields an improved capture of evolutionary related sequence clusters. This performance increase is illustrated using simulated data in our Algorithmic Supplement.

- IS-THERE ANY INTUITION OR JUSTIFICATION FOR THE CHOICE OF THE INPUT PARAMETERS EPSILON, DELTA EPSILON AND MINPOINTS
Initial ε defines the starting radius in an adaptive clustering run. The optimal starting value is the value where most clusters are captured. minpts is the second parameter used to define the density at which each clustering step is performed. Δε reflects the step at which this density changes to built hierarchically nested clusters and form our tree. The smallest density value (initial ε) that we wish to use during the tree building procedure in a density for none of the sequence for any cluster. This value is best guessed to a density where most or a least many clusters are estimated. The highest density value (final ε) on the other hand is the smallest value that yields only one cluster. It will depend on the chosen dataset and on the distance measure used.
The Δε value will define the sensitivity of the obtained tree (i.e. the number of ramification levels). This value often depends on the users needs and on the number of input sequences, but we advice to take a value that corresponds to (final-initial ε)/50 so that there will be at least 50 ramification levels. Note that in the final results ε levels where the number of clusters remains unchanged are not reported due to computational efficiency and storage gains. Clusters or tree layers for specific ε values can nevertheless be calculated with special tools (c.f. cluster_dbscan_X in the manual). The number of compute nodes available and their computational performance yield the trade-off between efficiency and precision.
A good way estimate the three input parameters is to create 2D or 3D plots of the different dimensions of the PCA using Gnuplot as outlined in section Investigating k-mers and Principal Components in the manual.

- WHEN THE DENSITY DECREASES IS IT ENSURED THAT THE DBSCAN CLUSTERS ARE ALWAYS MERGED INTO LARGER CLUSTERS ?
YES. This is due to the fact that clusters found at a given density value will always be included into clusters found at a lower density value. In other words clusters can only grow when the density decreases. When lowering the density, if a point of a given cluster A is found into another cluster B, the two clusters are merged by the algorithm to form a new larger cluster C that will encompass A and B clusters. In other words, decreasing the density either discovers new clusters or merges clusters found at a higher density. This behavior has been verified in all simulations as well as real world applications described in the paper and supplementary materials.

- WHY ARE WE USING A DENSITY BASED APPROACH INSTEAD OF OTHER CLUSTERING ALGORITHMS?
From a purely practical standpoint, density-based clustering offers the possibility to treat a large number of sequences without holding full matrices of pairwise distances in the computer memory. Contrary to many existing methods in the fields our method does not rely on a multiple alignment strategy. For further an in depth description on how we detect tree branches, evolutionary families and statistical clusters and how we relate our method to the evolution of sequences we refer the interested reader to our Algorithmic Supplement. Our method is a statistical, opposed to a hypothetico-deductive method, where related/similar sequences form detectable, statistical clusters of higher densities in the sequence space. [Felsenstein 1985, Felsenstein 2001]

- WHY ARE WE USING THE DBSCAN ALGORITHM AND NOT OTHER MORE RECENT DENSITY-BASED ALGORITHMS?
During the development of the algorithm we evaluated the possibility to use two other density based clustering algorithms: OPTICS [Ankerst and Breunig 1999] and SUBCLU [Kailing et al. 2004]. We decided that the original DBSCAN [Ester et al. 1996] algorithm was an optimal choice at this stage of the development of our tool since DBSCAN uses a simple and easy to interpret parameter to cluster data points: the density of points, as defined by ε and minpts parameters. OPTICS by contrast searches the space for clusters of varying densities, which would not allow us to perform a straightforward tree-building. We implemented a custom C version of DBSCAN which allowed us to work around memory constraints for large datasets and speed up the algorithm by using single instruction multiple data (SIMD) processor instructions and parallelization using graphics processors (GPUs) and message passing (MPI). The use of SUBCLU could prove advantageous to DBSCAN, but would increase the implementation complexity and related run-time.

- HOW DO THE RESULTS OF MNHN-TREE-TOOLS COMPARE TO OTHER TREES/TOOLS AVAILABLE?
In our Algorithmic Supplement we outline a comparison to clustering results obtained by SWARM2 [Mahé et al. 2015] and to ground truth data for human α-satellites [Uralsky et al. 2019] and a Tree of Life [Munoz et al. 2011]. We have developed a whole set of tools to compare different clusterings to the results that one may obtain using MNHN-Tree-Tools. For further details the reader is referred to our manual and its tree_purness section.

- WHY DO WE PREFERE A TREE BASED OVER A NETWORK BASED SOLUTION?
Due to horizontal transfers of DNA or to recombination events, the sub-sequences found within a sequence may have different evolutionary histories. This is the reason why grouping sequences into trees may sometimes be misleading, since a tree structure is based on the existence of a phylogeny. Representations of networks have thus been developed to deal with this potential issue. In our case, the sequence we deal with, being the centromeric DNA repeat monomer units or ribosomal RNA sequences are not the result of recombination events and can be classified into trees. Nevertheless, the user should keep this potential shortcoming in mind when using our tool on long genomic sequences that may not have evolved following a mutations/deletions and selection events only.

MNHN-Tree-Tools: The Density Based Phylogeny from Sequences

Abstract

Installation

Frequently Asked Questions