Motivation: Sampling the conformational space of biological macromolecules generates large sets of data with considerable complexity.
Data-mining techniques, such as clustering, can extract meaningful information.
Among them, the Self-Organizing Maps (SOMs) algorithm has shown great promise, in particular since its computation time rises only linearly with the size of the data set.
Whereas SOMs are generally used with few neurons, we investigate here their behavior with large numbers of neurons.
Results: We present here a Python library implementing the full SOM analysis workflow.
Large SOMs can readily be applied to heavy data sets.
Coupled with visualization tools, they have very interesting properties.
Descriptors for each conformation of a trajectory are calculated and mapped onto a 3D landscape, the U-matrix, reporting the distance between neighboring neurons.
To delineate clusters, we developed the flooding algorithm, which hierarchically identifies local basins of the U-matrix from the global minimum to the maximum.
Availability: The Python implementation of the SOM library is freely available on GitHub.
Some technical remarks on how the figure has been generated…
Compute the descriptors from the trajectory in DCD format
We start from the DCD trajectory file (traj.dcd) of the folding trajectory of the G protein.
Only the atoms were kept.
This is sufficient for the conformational clustering.
First we create a configuration file (makeVectorsFromdcd_PCA.conf) for the python script applications/makeVectorsFromdcd_PCA.py with the following content:
Running the script with this configuration file generates the conformational descriptors for each frame of the trajectory, as described in the publication.
Learn the SOM
Now we can open an interactive Python session, for example with IPython or an IPython notebook, and import the required Python modules.
Then, we load the input matrix containing the conformational descriptors.
After that we can learn the SOM with som.learn.
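For readers unfamiliar with what the learning step does, here is a minimal NumPy sketch of online SOM training on a periodic 2D grid. It is not the library's som.learn implementation: the function name train_som, the linear decay of the learning rate and neighborhood radius, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def train_som(data, shape, n_iter=1000, seed=0):
    """Hypothetical helper: train a small periodic SOM on `data` (n_frames, dim)."""
    rng = np.random.default_rng(seed)
    nx, ny = shape
    weights = rng.random((nx, ny, data.shape[1]))
    ii, jj = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                          # assumed learning-rate decay
        sigma = max(nx, ny) / 2.0 * (1.0 - frac) + 0.5   # assumed radius decay
        x = data[rng.integers(len(data))]                # pick one random frame
        # Best matching unit: the neuron closest to the input vector
        dist = np.linalg.norm(weights - x, axis=2)
        bi, bj = np.unravel_index(np.argmin(dist), dist.shape)
        # Grid distance to the BMU, taking the periodic boundaries into account
        di = np.abs(ii - bi)
        di = np.minimum(di, nx - di)
        dj = np.abs(jj - bj)
        dj = np.minimum(dj, ny - dj)
        h = np.exp(-(di ** 2 + dj ** 2) / (2.0 * sigma ** 2))
        # Pull every neuron toward the input, weighted by the neighborhood
        weights += lr * h[..., None] * (x - weights)
    return weights
```

The resulting array has shape (nx, ny, dim), which is the layout assumed by the other sketches in this document.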
The U-matrix
Now we can compute the U-matrix corresponding to the map:
We decompose the U-matrix according to the four eigenvectors to compute the U-matrix in Angstrom.
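To make the definition concrete, the sketch below computes a U-matrix as the mean Euclidean distance between each neuron and its four grid neighbors on a periodic map. This is an illustration under assumptions, not the library's routine: the function name umatrix_periodic, the choice of four neighbors, and the (nx, ny, dim) weight layout are all hypothetical.

```python
import numpy as np

def umatrix_periodic(weights):
    """Hypothetical helper: mean distance of each neuron to its 4 neighbors.

    weights: array of shape (nx, ny, dim) holding the neuron vectors.
    """
    dists = []
    for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
        # np.roll wraps around the edges, which matches a periodic map
        neighbor = np.roll(weights, shift, axis=axis)
        dists.append(np.linalg.norm(weights - neighbor, axis=2))
    return np.mean(dists, axis=0)
```

High values mark neurons that are far from their neighbors (barriers), low values mark basins.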
The Best Matching Units (BMUs)
The BMUs give the coordinates of each frame in the SOM map.
They allow the projection of data onto the map.
We use the k-d tree algorithm to compute the BMUs to avoid memory errors.
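As an illustration of why a k-d tree helps, the sketch below finds the BMUs with scipy.spatial.cKDTree instead of building the full frames-by-neurons distance matrix, which is what exhausts memory for long trajectories and large maps. The function name find_bmus and the (nx, ny, dim) weight layout are assumptions; the library's own function may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def find_bmus(weights, data):
    """Hypothetical helper: return the (i, j) BMU of each row of `data`.

    weights: (nx, ny, dim) neuron vectors; data: (n_frames, dim) descriptors.
    """
    nx, ny, dim = weights.shape
    # Build the tree once over the flattened list of neurons
    tree = cKDTree(weights.reshape(nx * ny, dim))
    # Nearest neuron for every frame; only the flat indices are needed
    _, flat = tree.query(data, k=1)
    # Convert flat neuron indices back to 2D map coordinates
    return np.column_stack(np.unravel_index(flat, (nx, ny)))
```

The result is an integer array of shape (n_frames, 2), one map coordinate pair per frame.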
The density
The density gives the number of frames for each unit of the SOM map.
It is easily computed from the BMUs:
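One possible way to do this with plain NumPy is sketched below. It assumes the BMUs are stored as an integer array of shape (n_frames, 2); the function name density_from_bmus is hypothetical.

```python
import numpy as np

def density_from_bmus(bmus, shape):
    """Hypothetical helper: count how many frames fall on each map unit."""
    density = np.zeros(shape, dtype=int)
    # np.add.at accumulates correctly even when the same unit appears many times
    np.add.at(density, (bmus[:, 0], bmus[:, 1]), 1)
    return density
```

The sum of the returned array equals the number of frames in the trajectory.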
Data projection
In the publication we project the RMSD onto the map.
You can compute the RMSD with the program of your choice and then save the data in a text file (one column and one value per line for each frame).
This part is very flexible: you can project any property you want.
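A projection of this kind can be sketched as a per-unit average of the property over the frames assigned to each unit. The sketch assumes that averaging is the desired reduction (the publication may use another one) and that units without any frame should be left as NaN; the function name project_property and the file name rmsd.txt are hypothetical.

```python
import numpy as np

def project_property(bmus, values, shape):
    """Hypothetical helper: mean of `values` over the frames of each map unit."""
    total = np.zeros(shape)
    count = np.zeros(shape)
    idx = (bmus[:, 0], bmus[:, 1])
    np.add.at(total, idx, values)   # sum of the property per unit
    np.add.at(count, idx, 1)        # number of frames per unit
    projection = np.full(shape, np.nan)
    occupied = count > 0
    projection[occupied] = total[occupied] / count[occupied]
    return projection

# Example usage with a one-column text file (file name is an assumption):
# rmsd = np.loadtxt("rmsd.txt")
# rmsd_map = project_property(bmus, rmsd, (nx, ny))
```

Any per-frame quantity stored in the same one-value-per-line format can be projected the same way.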
Flooding the map
The SOM map has periodic boundaries.
To deal with the periodicity, the flooding algorithm is used.
See the publication for more details.
Below is the code to flood the map and unwrap it:
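The idea of flooding and unwrapping can be illustrated with the simplified sketch below. It is not the library's flooding code: starting from the global minimum of the U-matrix, it repeatedly visits the lowest unvisited neighboring cell (a priority queue over a 4-connected periodic grid) and records, for each cell, coordinates accumulated step by step without applying the modulo. The function name flood_unwrap and the 4-neighbor connectivity are assumptions.

```python
import heapq
import numpy as np

def flood_unwrap(umatrix):
    """Hypothetical helper: map each wrapped cell (i, j) to unwrapped (u, v)."""
    nx, ny = umatrix.shape
    i0, j0 = np.unravel_index(np.argmin(umatrix), umatrix.shape)
    start = (int(i0), int(j0))
    visited = np.zeros((nx, ny), dtype=bool)
    unwrapped = {}
    # Heap entries: (U-value, wrapped cell, unwrapped coordinates)
    heap = [(float(umatrix[start]), start, start)]
    while heap:
        _, (i, j), (u, v) = heapq.heappop(heap)
        if visited[i, j]:
            continue
        visited[i, j] = True
        unwrapped[(i, j)] = (u, v)
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            # The wrapped index uses the modulo, the unwrapped one does not
            ni, nj = (i + di) % nx, (j + dj) % ny
            if not visited[ni, nj]:
                heapq.heappush(
                    heap, (float(umatrix[ni, nj]), (ni, nj), (u + di, v + dj))
                )
    return unwrapped
```

With this scheme, a basin that straddles the periodic boundary ends up contiguous in the unwrapped coordinates, because its cells are reached through low-lying neighbors rather than across high barriers.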
Plot the U-matrix and projection with matplotlib
Sausage representation of an ensemble of structures from RMSF values with PyMOL
The sausage representations were obtained with PyMOL.
sausage.py
This script requires these PyMOL scripts:
align_all.py
rmsf_states.py
If you want to ask me a question or leave me a message, mention @bougui505 in your comment.