# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2001-2020 NLTK Project
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
- """
- This module contains a number of basic clustering algorithms. Clustering
- describes the task of discovering groups of similar items with a large
- collection. It is also describe as unsupervised machine learning, as the data
- from which it learns is unannotated with class information, as is the case for
- supervised learning. Annotated data is difficult and expensive to obtain in
- the quantities required for the majority of supervised learning algorithms.
- This problem, the knowledge acquisition bottleneck, is common to most natural
- language processing tasks, thus fueling the need for quality unsupervised
- approaches.

This module contains a k-means clusterer, an E-M clusterer and a group average
agglomerative clusterer (GAAC). All these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means and allocates each
vector to the cluster with the closest mean. It then recalculates the means of
each cluster as the centroid of the vectors in the cluster. This process
repeats until the cluster memberships stabilise. This is a hill-climbing
algorithm which may converge to a local optimum, so the clustering is
often repeated with random initial means and the most commonly occurring
output means are chosen.
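
For instance, a minimal sketch using the ``KMeansClusterer`` exported below;
the data are arbitrary, and the ``repeats`` keyword reruns the algorithm from
fresh random means and keeps the most representative resulting means::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
    # run 10 randomised trials and keep the most representative means
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)           # e.g. [0, 0, 1, 1] (cluster numbering may vary)
    print(clusterer.means())  # the two learned centroids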

The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives rise
to a dendrogram: a tree with the earlier merges lower than later merges. The
membership of a given number of clusters *c*, *1 <= c <= N*, can be found by
cutting the dendrogram at depth *c*.
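
A minimal sketch with the ``GAAClusterer`` exported below; the vectors are
arbitrary, ``num_clusters`` selects where the dendrogram is cut, and
``dendrogram().show()`` prints the merge tree::

    from numpy import array
    from nltk.cluster import GAAClusterer

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0], [2, 3], [3, 1]]]
    clusterer = GAAClusterer(2)  # cut the dendrogram into 2 clusters
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)
    clusterer.dendrogram().show()  # ASCII rendering of the merge order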

The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters; this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster membership
probabilities. This process continues until the likelihood of the data does
not significantly increase.
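
A minimal sketch with the ``EMClusterer`` exported below; the vectors and the
initial means are arbitrary, and ``bias`` is a small variance bias that keeps
the covariance matrices non-singular::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]
    means = [[4, 2], [4, 2.01]]  # one initial mean per cluster
    clusterer = EMClusterer(means, bias=0.1)
    # trace=True prints the data likelihood after each E-M iteration
    clusters = clusterer.cluster(vectors, assign_clusters=True, trace=True)
    print(clusters)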

They all extend the ClusterI interface, which defines the common operations
available with each clusterer. These operations include:

- cluster: clusters a sequence of vectors
- classify: assigns a vector to a cluster
- classification_probdist: gives the probability distribution over cluster memberships
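
For example, each of these operations on a fitted k-means clusterer (note that
for a hard clusterer such as k-means the default ``classification_probdist``
degenerates to a 0/1 distribution; the EM clusterer yields graded
probabilities)::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
    clusterer = KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors)                # cluster: fit the model
    print(clusterer.classify(array([3, 3])))  # classify: hard assignment
    pdist = clusterer.classification_probdist(array([3, 3]))
    print([pdist.prob(c) for c in clusterer.cluster_names()])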

The existing clusterers also extend cluster.VectorSpaceClusterer, an
abstract class which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space in
such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors fall in the unit
hypersphere.
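
Both behaviours are exposed as constructor keywords on the vector-space
clusterers; a minimal sketch on arbitrary 3-dimensional data::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3, 1], [1, 2, 1], [4, 2, 6], [4, 0, 5]]]
    # unit-normalise the vectors, then project onto the 2 most
    # significant SVD dimensions before clustering
    clusterer = KMeansClusterer(2, euclidean_distance,
                                normalise=True, svd_dimensions=2)
    print(clusterer.cluster(vectors, assign_clusters=True))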

Usage example (see also demo())::

    from nltk import cluster
    from nltk.cluster import euclidean_distance
    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must be numpy array-like objects.
nltk_contrib.unimelb.tacohn.SparseArrays may be used for efficiency when
required.
"""

from nltk.cluster.util import (
    VectorSpaceClusterer,
    Dendrogram,
    euclidean_distance,
    cosine_distance,
)
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.gaac import GAAClusterer
from nltk.cluster.em import EMClusterer