Installation
- UBUNTU 20.04 AND SIMILAR
Open a terminal and run the following:
sudo apt-get install git build-essential libpng-dev libsdl2-dev \
liblapack-dev libopenmpi-dev libpocl-dev ocl-icd-opencl-dev pocl-opencl-icd
git clone https://gitlab.in2p3.fr/mnhn-tools/mnhn-tree-tools.git
cd mnhn-tree-tools
mkdir bin
make all
cd bin
#Make MNHN-TREE-TOOLS available from any folder. This change is
#temporary; modify your .bashrc or similar file to make it
#permanent, e.g. as shown below.
export PATH=$PATH:$PWD
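For example, to make the change permanent under bash, run the
following from within the bin directory:
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc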
- MACOS
We provide a tarball containing MacOS (Intel) binaries that have
been compiled and tested on MacOS Catalina. These binaries might
work on later versions and are built against the Accelerate
Framework from Apple and the OpenCL implementation that comes with
MacOS. So that all tools mentioned in the manual are at hand, the
tarball also ships with binaries of Gnuplot and Newick Tools.
Download treetools.tar.bz2
(sha256 checksum:
1f442f97799593282d20a4b882955803ce7bf6fd2ef1d90b5ba57b21003fb070)
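You can verify the integrity of the download, for example with:
shasum -a 256 treetools.tar.bz2
and check that the printed checksum matches the one given above.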
Open a terminal and become root, e.g. by typing
sudo su
Move the downloaded tarball to /opt:
mv treetools.tar.bz2 /opt/
and extract the tarball
cd /opt
tar xjf treetools.tar.bz2
Add the extracted binaries to your path:
export PATH=$PATH:/opt/treetools/bin
You may need to rerun the last command in any new terminal you
open, or adjust your shell initialization file, e.g. .bashrc or
.zshrc, to make the change permanent.
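For example, to make the change permanent under zsh, the default
shell since MacOS Catalina:
echo 'export PATH=$PATH:/opt/treetools/bin' >> ~/.zshrc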
If the binaries do not work for you, we refer you to the general
UNIX installation outlined in the manual.
- WINDOWS
One way to install MNHN-Tree-Tools on Windows is to install the
Windows Subsystem for Linux version 2 with Ubuntu and then follow
the Ubuntu installation above. Other possibilities might exist.
- ON OTHER PLATFORMS / OPTIMIZED VERSIONS
A general installation for POSIX / UNIX / UNIX-like systems is
outlined in our manual. This also allows you to link the tools
against high-performance implementations of BLAS / LAPACK, OpenCL,
and MPI.
Frequently Asked Questions
- UNDER WHAT LICENSE / CONDITIONS MAY I USE MNHN-TREE-TOOLS?
You may use MNHN-Tree-Tools under the conditions of the Zlib
License. MNHN-Tree-Tools development was sponsored by the French
government and its taxpayers and is hence free and open-source
software worldwide.
- AS A RESEARCHER USING MNHN-TREE-TOOLS, HOW SHOULD I INTERPRET THE DENSITY OF CLUSTERS?
Similar sequences will be grouped into clusters by the
algorithm. For a given similarity, a high density of
sequences thus corresponds to a high number of similar sequences.
The final interpretation then depends on the type of sequences that are
used together with MNHN-Tree-Tools.
In the paper, we describe two different datasets. The first one
is a dataset of alpha-satellites, DNA repeats found in the human
centromeric regions. These repeats evolve through a multiplication
and diversification mechanism. As such, a dense cluster would
correspond to a recent amplification of one repeat instance.
In our second dataset, each sequence corresponds to the ribosomal
RNA of a given species. In this context, a high-density cluster
corresponds to a large number of species that have similar
ribosomal RNA sequences, reflecting their relatedness in the
course of evolution.
For an in-depth description of density peaks we refer the
interested reader to our Algorithmic Supplement.
- HOW TO CHOOSE BETWEEN THE K-MER+PCA BASED DISTANCE AND THE
SMITH-WATERMAN BASED DISTANCE?
The Smith-Waterman based distance is directly related to the
number of point mutations or sequence gaps between two sequences,
and hence can be interpreted in a straightforward manner.
The k-mer + PCA based distance reflects a change in the
composition of the different k-mers between two sequences. This
difference is projected down onto a low-dimensional subspace
spanned by principal components, and the Euclidean norm is used to
provide a direct estimation of the distance between sequences.
This distance measure can be more difficult to grasp than the
classical Smith-Waterman distance. However, the reduction of
dimensionality has two important advantages: first, it makes the
whole calculation considerably more efficient and rapid; second,
it selects k-mers of high variance over k-mers of low variance and
hence inherently performs a feature selection. This selection
yields an improved capture of evolutionarily related sequence
clusters. This performance increase is illustrated using simulated
data in our Algorithmic Supplement.
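Schematically, and in notation of our own rather than the paper's,
this distance between two sequences s and t can be written as
d(s, t) = ‖ P (v(s) − v(t)) ‖₂
where v(·) is the vector of k-mer counts of a sequence and P is
the projection onto the first few principal components.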
- IS THERE ANY INTUITION OR JUSTIFICATION FOR THE CHOICE OF THE
INPUT PARAMETERS EPSILON, DELTA EPSILON AND MINPOINTS?
Initial ε defines the starting radius in an adaptive clustering run.
The
optimal starting value is the value where most clusters are captured.
minpts is the second parameter used to define the density at
which each clustering step is performed.
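(In DBSCAN terms, a point is considered dense, i.e. a core point,
if its ε-neighborhood contains at least minpts points; clusters
are grown by connecting such core points.)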
Δε reflects the step by which this density changes in order to
build hierarchically nested clusters and form our tree.
Ideally, the smallest density value (initial ε) that we wish to
use during the tree-building procedure is one at which none of the
sequences form any cluster yet; in practice, it is best guessed as
a value at which most, or at least many, clusters are detected.
The highest density value (final ε), on the other hand, is the
smallest value that yields only one cluster. It will depend on the
chosen dataset and on the distance measure used.
The Δε value defines the sensitivity of the obtained tree
(i.e. the number of ramification levels). This value often depends
on the user's needs and on the number of input sequences, but we
advise taking a value that corresponds to (final ε − initial
ε)/50, so that there will be at least 50 ramification levels. Note
that in the final results, ε levels where the number of clusters
remains unchanged are not reported, for reasons of computational
efficiency and storage.
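As a purely illustrative example with made-up numbers: if the
first clusters appear around ε = 0.1 and all sequences merge into
a single cluster at ε = 0.6, this rule of thumb suggests
Δε = (0.6 − 0.1)/50 = 0.01.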
Clusters or tree layers for specific ε values can nevertheless be
calculated with dedicated tools (cf. cluster_dbscan_X in the
manual).
The number of compute nodes available and their computational
performance determine the trade-off between efficiency and
precision.
A good way to estimate the three input parameters is to create 2D
or 3D plots of the different dimensions of the PCA using Gnuplot,
as outlined in the section Investigating k-mers and Principal
Components in the manual.
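A minimal sketch, assuming the PCA projection has been written to
a whitespace-separated file pca.dat (a hypothetical name) with one
sequence per row and one principal component per column:
gnuplot -p -e "splot 'pca.dat' using 1:2:3 with points"
This opens a persistent window with a 3D scatter plot of the first
three principal components; a 2D view of the first two is obtained
with plot 'pca.dat' using 1:2 instead.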
- WHEN THE DENSITY DECREASES, IS IT ENSURED THAT THE DBSCAN
CLUSTERS ARE ALWAYS MERGED INTO LARGER CLUSTERS?
Yes. Clusters found at a given density value will always be
included in clusters found at a lower density value; in other
words, clusters can only grow when the density decreases. This
follows from the fact that, at fixed minpts, a point that is dense
enough at a given ε remains dense at any larger ε, so existing
clusters can only be preserved or extended. When lowering the
density, if a point of a given cluster A is found in another
cluster B, the two clusters are merged by the algorithm to form a
new, larger cluster C that encompasses both A and B. Decreasing
the density thus either discovers new clusters or merges clusters
found at a higher density. This behavior has been verified in all
simulations as well as in the real-world applications described in
the paper and the supplementary materials.
- WHY ARE WE USING A DENSITY-BASED APPROACH INSTEAD OF OTHER
CLUSTERING ALGORITHMS?
From a purely practical standpoint, density-based clustering
offers the possibility to treat a large number of sequences
without holding full matrices of pairwise distances in the
computer memory. Contrary to many existing methods in the field,
our method does not rely on a multiple-alignment strategy. For an
in-depth description of how we detect tree branches, evolutionary
families and statistical clusters, and of how we relate our method
to the evolution of sequences, we refer the interested reader to
our Algorithmic Supplement.
Our method is a statistical, as opposed to a hypothetico-deductive,
method: related or similar sequences form detectable, statistical
clusters of higher density in the sequence space [Felsenstein 1985,
Felsenstein 2001].
- WHY ARE WE USING THE DBSCAN ALGORITHM AND NOT OTHER MORE RECENT
DENSITY-BASED ALGORITHMS?
During the development of the algorithm we evaluated the
possibility of using two other density-based clustering
algorithms: OPTICS [Ankerst et al. 1999] and SUBCLU [Kailing et
al. 2004]. We decided that the original DBSCAN algorithm [Ester et
al. 1996] was the optimal choice at this stage of the development
of our tool, since DBSCAN uses a simple and easy-to-interpret
criterion to cluster data points: the density of points, as
defined by the ε and minpts parameters. OPTICS, by contrast,
searches the space for clusters of varying densities, which would
not allow us to perform a straightforward tree-building. We
implemented a custom C version of DBSCAN, which allowed us to work
around memory constraints for large datasets and to speed up the
algorithm using single instruction multiple data (SIMD) processor
instructions as well as parallelization on graphics processors
(GPUs) and via message passing (MPI). The use of SUBCLU could
prove advantageous over DBSCAN, but would increase the
implementation complexity and the related run-time.
- HOW DO THE RESULTS OF MNHN-TREE-TOOLS COMPARE TO OTHER TREES/TOOLS
AVAILABLE?
In our Algorithmic Supplement we outline a comparison to
clustering results obtained by SWARM2 [Mahé et al. 2015], to
ground-truth data for human α-satellites [Uralsky et al. 2019],
and to a Tree of Life [Munoz et al. 2011]. We have developed a
whole set of tools to compare different clusterings to the results
that one may obtain using MNHN-Tree-Tools. For further details the
reader is referred to our manual and its tree_purness section.
- WHY DO WE PREFER A TREE-BASED OVER A NETWORK-BASED SOLUTION?
Due to horizontal transfers of DNA or to recombination events,
the sub-sequences found within a sequence may have different
evolutionary histories. This is why grouping sequences into trees
may sometimes be misleading, since a tree structure assumes the
existence of a phylogeny. Network representations have thus been
developed to deal with this potential issue. In our case, the
sequences we deal with, namely centromeric DNA repeat monomer
units and ribosomal RNA sequences, are not the result of
recombination events and can be classified into trees.
Nevertheless, the user should keep this potential shortcoming in
mind when using our tool on long genomic sequences that may not
have evolved through mutation/deletion and selection events only.
(c) 2019-2021 Thomas Haschka, Loïc Ponger, Christophe Escudé and Julien Mozziconacci