I played a lot with Self-Organizing Maps (SOM). A 2D SOM distributes the
input vectors onto a 2D plane. Mathematically, the SOM is a 3D matrix whose
third dimension has the length of the input vectors. To visualize the SOM,
it is usual to compute the Unified distance matrix (U-matrix), which gives,
for each neuron of the SOM, the mean Euclidean distance between that neuron
and its neighbors.
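The U-matrix computation can be sketched as follows. This is a minimal version, assuming the SOM is stored as a NumPy array of shape (x, y, n); the `periodic` flag handles the toroidal topology mentioned later:

```python
import numpy as np

def umatrix(smap, periodic=True):
    """U-matrix of a SOM stored as an array of shape (x, y, n):
    mean Euclidean distance between each neuron and its grid neighbors."""
    x, y, n = smap.shape
    umat = np.zeros((x, y))
    neighbors = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                 if (di, dj) != (0, 0)]
    for i in range(x):
        for j in range(y):
            dists = []
            for di, dj in neighbors:
                ni, nj = i + di, j + dj
                if periodic:
                    # wrap indices around the torus
                    ni, nj = ni % x, nj % y
                elif not (0 <= ni < x and 0 <= nj < y):
                    continue
                dists.append(np.linalg.norm(smap[i, j] - smap[ni, nj]))
            umat[i, j] = np.mean(dists)
    return umat
```

On a map where all neurons are identical, the U-matrix is uniformly zero; clusters show up as basins of low values separated by high-value ridges.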
Another way to visualize the SOM is to count the number of input data points
falling in each bin of the SOM. However, the SOM algorithm tries to spread the
data homogeneously across the map. The basic idea here is to smooth the
histogram of the SOM: each input data point is attributed not to a unique cell
but to an ensemble of cells. The number of cells in the ensemble is determined
by the smoothing parameter s. This idea comes from this document: Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps.
The IPython notebook imports:
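A plausible set of imports for what follows (NumPy, Matplotlib and SciPy are assumptions based on the code discussed below, not the notebook's actual cell):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.spatial.distance
from scipy.spatial import cKDTree
```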
A 2D potential is constructed from 4 randomly chosen points.
This potential is then sampled using a Monte-Carlo algorithm (MCMC).
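A minimal sketch of this setup, assuming the potential is a sum of Gaussian wells centered on the 4 random points and sampled with a Metropolis scheme (the functional form, `beta` and the step size are illustrative assumptions, not the notebook's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0, 10, size=(4, 2))  # 4 randomly chosen points

def potential(p):
    # Assumed form: sum of inverted Gaussian wells centered on the 4 points
    d2 = ((centers - p) ** 2).sum(axis=1)
    return -np.exp(-d2 / 2.0).sum()

def mcmc(nsteps=10000, beta=5.0, step=0.5):
    # Metropolis sampling of the Boltzmann distribution exp(-beta * potential)
    p = rng.uniform(0, 10, size=2)
    e = potential(p)
    samples = []
    for _ in range(nsteps):
        q = p + rng.normal(0, step, size=2)  # random move
        eq = potential(q)
        # accept downhill moves always, uphill moves with Metropolis probability
        if eq <= e or rng.random() < np.exp(-beta * (eq - e)):
            p, e = q, eq
        samples.append(p.copy())
    return np.asarray(samples)

samples = mcmc()
```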
And then we plot the distributions obtained from the MCMC:
A Self-organizing map is trained with the MCMC sampling:
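A minimal SOM training loop in this spirit (a sketch under stated assumptions, not the actual implementation: it uses online training with a Gaussian neighborhood and periodic grid distances, since the map has periodic boundaries; the decay schedules are illustrative):

```python
import numpy as np

def train_som(data, x=20, y=20, niter=5000, seed=0):
    """Minimal online SOM with periodic (toroidal) boundaries."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    smap = rng.normal(size=(x, y, n))
    # grid coordinates of each neuron, shape (x, y, 2)
    grid = np.dstack(np.meshgrid(np.arange(x), np.arange(y), indexing='ij'))
    for t in range(niter):
        alpha = 0.5 * (1 - t / niter)                # learning-rate decay
        sigma = max(1.0, (x / 2) * (1 - t / niter))  # neighborhood-radius decay
        v = data[rng.integers(len(data))]            # pick a random sample
        # best matching unit (closest neuron in input space)
        bmu = np.unravel_index(np.argmin(((smap - v) ** 2).sum(axis=2)), (x, y))
        # periodic (wrap-around) grid distance to the BMU
        d = np.abs(grid - np.asarray(bmu))
        d = np.minimum(d, np.asarray([x, y]) - d)
        # Gaussian neighborhood function on the grid
        h = np.exp(-(d ** 2).sum(axis=2) / (2 * sigma ** 2))
        smap += alpha * h[..., None] * (v - smap)
    return smap
```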
The resulting U-matrix.
(Here I used an algorithm to unwrap the U-matrix, as the SOM was built with
periodic boundaries. The visualization is simpler with this representation.)
The gray-scale dots show the original density of the data in the SOM space.
And the U-matrix in the input space in comparison with the 2D histogram of the
input data:
Now we compute the smoothed data histogram (SDH).
The membership degree is s/c_s to the closest bin, (s-1)/c_s to the second closest, (s-2)/c_s to the third, and so forth.
The membership to all but the s closest bins is 0.
The normalization constant c_s is defined as: c_s = 1 + 2 + … + s = s(s+1)/2.
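This voting rule can be sketched as follows (a naive version using the full distance matrix; the array shapes match the SOM layout used above):

```python
import numpy as np
import scipy.spatial.distance

def sdh(inputmat, smap, s=3):
    """Smoothed data histogram: each data point votes for its s closest
    map units with weights s/c_s, (s-1)/c_s, ..., 1/c_s (0 elsewhere)."""
    x, y, n = smap.shape
    codebook = smap.reshape(x * y, n)
    cs = s * (s + 1) / 2.0                       # normalization constant
    dmat = scipy.spatial.distance.cdist(inputmat, codebook)
    hist = np.zeros(x * y)
    for row in dmat:
        order = np.argsort(row)[:s]              # s best matching units
        hist[order] += (s - np.arange(s)) / cs   # rank-based votes
    return hist.reshape(x, y)
```

Since each point's votes sum to 1, the smoothed histogram sums to the number of data points, just like the raw histogram.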
For large datasets you can run out of memory when computing the full distance matrix dmat with scipy.spatial.distance.cdist(inputmat, smap.reshape(x*y,n)).
Instead of computing dmat you can use KDTree or cKDTree.
This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.
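A sketch of the replacement: build a cKDTree on the flattened map once, then query only the s nearest units per data point (the array sizes here are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
x, y, n = 10, 10, 3
smap = rng.normal(size=(x, y, n))
inputmat = rng.normal(size=(1000, n))

# Build the tree once on the flattened codebook...
tree = cKDTree(smap.reshape(x * y, n))
# ...then query the s nearest map units per data point, without ever
# materializing the full (1000, 100) distance matrix
s = 3
dists, idx = tree.query(inputmat, k=s)
```

`dists` and `idx` have shape (len(inputmat), s), with distances sorted in increasing order, so the rank-based SDH votes can be applied directly.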
In the function below I've also added a data parameter to project a weighted sum of the data with the rule exposed before.
I've slightly changed the function to apply a distance cutoff
distances_threshold instead of a cutoff on the number of bins (s):
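A possible version of such a function (the name `sdh_threshold` and the rank-based weighting inside the cutoff, normalized so that each point contributes 1 in total, are my assumptions, not the original code):

```python
import numpy as np
from scipy.spatial import cKDTree

def sdh_threshold(inputmat, smap, distances_threshold, data=None):
    """SDH variant with a distance cutoff instead of a fixed number of bins.
    If data is given, also project a weighted sum of data onto the map."""
    x, y, n = smap.shape
    codebook = smap.reshape(x * y, n)
    tree = cKDTree(codebook)
    hist = np.zeros(x * y)
    proj = None if data is None else np.zeros((x * y,) + data.shape[1:])
    for k, point in enumerate(inputmat):
        # all map units within the distance cutoff
        idx = tree.query_ball_point(point, distances_threshold)
        if not idx:
            continue
        d = np.linalg.norm(codebook[idx] - point, axis=1)
        order = np.argsort(d)
        m = len(idx)
        # rank-based weights summing to 1 (assumed scheme)
        w = (m - np.arange(m)) / (m * (m + 1) / 2.0)
        ids = np.asarray(idx)[order]
        hist[ids] += w
        if proj is not None:
            proj[ids] += np.multiply.outer(w, data[k])
    if proj is None:
        return hist.reshape(x, y)
    return hist.reshape(x, y), proj.reshape((x, y) + data.shape[1:])
```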
And we choose a smoothing parameter s of 16.
We compare below the original density and the smoothed density:
If you want to ask me a question or leave me a message, add @bougui505 to your comment.