# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2001-2020 NLTK Project
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
  7. """
  8. This module contains a number of basic clustering algorithms. Clustering
  9. describes the task of discovering groups of similar items with a large
  10. collection. It is also describe as unsupervised machine learning, as the data
  11. from which it learns is unannotated with class information, as is the case for
  12. supervised learning. Annotated data is difficult and expensive to obtain in
  13. the quantities required for the majority of supervised learning algorithms.
  14. This problem, the knowledge acquisition bottleneck, is common to most natural
  15. language processing tasks, thus fueling the need for quality unsupervised
  16. approaches.
This module contains a k-means clusterer, E-M clusterer and a group average
agglomerative clusterer (GAAC). All these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
means of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local optimum. Hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.

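For example, the k-means clusterer accepts a ``repeats`` argument that reruns
the algorithm from several random initialisations (a small sketch; the input
vectors here are arbitrary)::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[2, 1], [1, 3], [4, 7], [6, 7]]]

    # rerun k-means from 10 random initialisations and keep the most
    # representative set of output means
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)           # cluster index of each input vector
    print(clusterer.means())  # the two learned means
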
The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives rise
to a dendrogram - a tree with the earlier merges lower than later merges. The
membership of a given number of clusters *c*, *1 <= c <= N*, can be found by
cutting the dendrogram at depth *c*.

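For instance, to build the dendrogram and read off a two-cluster grouping (a
sketch with arbitrary input vectors)::

    from numpy import array
    from nltk.cluster import GAAClusterer

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0], [2, 3], [3, 1]]]

    # merge until the requested number of clusters remains
    clusterer = GAAClusterer(2)
    clusters = clusterer.cluster(vectors, True)
    clusterer.dendrogram().show()  # ASCII rendering of the merge tree
    print(clusters)
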
The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster membership
probabilities. This process continues until the likelihood of the data does
not significantly increase.

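The EM clusterer is seeded with initial guesses for the k means; priors and
covariance matrices may also be supplied, otherwise defaults are used (a small
sketch with arbitrary data)::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]

    # two initial means; the E and M steps alternate until convergence
    clusterer = EMClusterer([[4, 2], [4, 2.01]], bias=0.1)
    clusters = clusterer.cluster(vectors, True)
    print(clusters)
    print(clusterer.classify(array([2, 2])))
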
All these clusterers extend the ClusterI interface, which defines the common
operations available with each clusterer. These operations include:

- cluster: clusters a sequence of vectors
- classify: assign a vector to a cluster
- classification_probdist: give the probability distribution over cluster
  memberships

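For example, ``classification_probdist`` returns an NLTK probability
distribution over the cluster identifiers; with the EM clusterer this gives
soft memberships (a sketch reusing the toy data above)::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]
    clusterer = EMClusterer([[4, 2], [4, 2.01]], bias=0.1)
    clusterer.cluster(vectors)

    # probability of each cluster having generated the query vector
    pdist = clusterer.classification_probdist(array([2, 2]))
    for sample in pdist.samples():
        print(sample, pdist.prob(sample))
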
The clusterers currently implemented also extend VectorSpaceClusterer, an
abstract class which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space in
such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors fall in the unit
hypersphere.

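Both facilities are exposed as constructor arguments on the vector-space
clusterers, for example (a sketch; the dimensionality and vectors are
arbitrary)::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3, 1], [1, 2, 2], [4, 2, 1], [4, 0, 3]]]

    # normalise each vector to unit length and project the data onto its
    # first two SVD dimensions before clustering
    clusterer = KMeansClusterer(2, euclidean_distance,
                                normalise=True, svd_dimensions=2)
    print(clusterer.cluster(vectors, True))
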
Usage example (see also demo())::

    from nltk import cluster
    from nltk.cluster import euclidean_distance
    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like
objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
efficiency when required.
"""
from nltk.cluster.util import (
    VectorSpaceClusterer,
    Dendrogram,
    euclidean_distance,
    cosine_distance,
)
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.gaac import GAAClusterer
from nltk.cluster.em import EMClusterer