.. Copyright (C) 2001-2020 NLTK Project
.. For license information, see LICENSE.TXT

=============
 Classifiers
=============

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``).  In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface:

    >>> import nltk
    >>> nltk.usage(nltk.classify.ClassifierI)
    ClassifierI supports the following operations:
      - self.classify(featureset)
      - self.classify_many(featuresets)
      - self.labels()
      - self.prob_classify(featureset)
      - self.prob_classify_many(featuresets)

NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a training
corpus.
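
For example (an illustrative sketch with made-up data, not one of the
regression tests below), a trained classifier can be evaluated against
labeled data with ``nltk.classify.accuracy``:

    >>> toy_train = [(dict(outlook='sunny'), 'no'),
    ...              (dict(outlook='rainy'), 'yes')]
    >>> toy_classifier = nltk.classify.NaiveBayesClassifier.train(toy_train)
    >>> toy_classifier.classify(dict(outlook='sunny'))
    'no'
    >>> nltk.classify.accuracy(toy_classifier, toy_train)
    1.0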

Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with 3 binary features: ['a',
'b', 'c'], and two labels: ['x', 'y'].  We use a simple feature set so
that the correct answers can be calculated analytically (although we
haven't done this yet for all tests).

    >>> train = [
    ...     (dict(a=1,b=1,c=1), 'y'),
    ...     (dict(a=1,b=1,c=1), 'x'),
    ...     (dict(a=1,b=1,c=0), 'y'),
    ...     (dict(a=0,b=1,c=1), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=0,b=0,c=1), 'y'),
    ...     (dict(a=0,b=1,c=0), 'x'),
    ...     (dict(a=0,b=0,c=0), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=None,b=1,c=0), 'x'),
    ...     ]
    >>> test = [
    ...     (dict(a=1,b=0,c=1)), # unseen
    ...     (dict(a=1,b=0,c=0)), # unseen
    ...     (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
    ...     (dict(a=0,b=1,c=0)), # seen 1 time, label=x
    ...     ]

Test the Naive Bayes classifier:

    >>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> classifier.classify_many(test)
    ['y', 'x', 'y', 'x']
    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    0.2500 0.7500
    0.5833 0.4167
    0.3571 0.6429
    0.7000 0.3000
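
The last of these distributions can be checked by hand.  Assuming the
default expected-likelihood estimate (0.5 is added to every count; the
feature 'a' takes three observed values, while 'b' and 'c' take two
each), the posterior for ``dict(a=0,b=1,c=0)`` works out to:

    >>> px = 0.5 * (3.5/6.5) * (4.5/6) * (3.5/6)  # P(x) * P(a=0|x) * P(b=1|x) * P(c=0|x)
    >>> py = 0.5 * (3.5/6.5) * (4.5/6) * (1.5/6)  # P(y) * P(a=0|y) * P(b=1|y) * P(c=0|y)
    >>> print('%.4f %.4f' % (px/(px+py), py/(px+py)))
    0.7000 0.3000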

    >>> classifier.show_most_informative_features()
    Most Informative Features
                 c = 0                   x : y      =      2.3 : 1.0
                 c = 1                   y : x      =      1.8 : 1.0
                 a = 1                   y : x      =      1.7 : 1.0
                 a = 0                   x : y      =      1.0 : 1.0
                 b = 0                   x : y      =      1.0 : 1.0
                 b = 1                   x : y      =      1.0 : 1.0
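
The ``2.3 : 1.0`` ratio reported for ``c = 0`` is the ratio of the
smoothed conditionals P(c=0|x) and P(c=0|y); a quick check, under the
same expected-likelihood assumption as above:

    >>> p_c0_x = (3 + 0.5) / (5 + 1)  # 3 of the 5 'x' examples have c=0; 'c' has two bins
    >>> p_c0_y = (1 + 0.5) / (5 + 1)  # 1 of the 5 'y' examples has c=0
    >>> print('%.1f : 1.0' % (p_c0_x / p_c0_y))
    2.3 : 1.0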

Test the Decision Tree classifier (without None):

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train[:-1], entropy_cutoff=0,
    ...     support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print(classifier)
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
    c=1? .................................................. y
    <BLANKLINE>
    >>> classifier.classify_many(test)
    ['y', 'y', 'y', 'x']

Decision trees assign a single label to each input rather than a
probability distribution, so ``prob_classify`` is not implemented:

    >>> for pdist in classifier.prob_classify_many(test):
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
    Traceback (most recent call last):
      . . .
    NotImplementedError

Test the Decision Tree classifier (with None):

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train, entropy_cutoff=0,
    ...     support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print(classifier)
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
      a=None? ............................................. x
    c=1? .................................................. y
    <BLANKLINE>
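
Because the last training example has ``a=None``, the tree above learned
an explicit branch for that value, which a featureset with ``a=None``
will follow:

    >>> classifier.classify(dict(a=None, b=1, c=0))
    'x'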

Test SklearnClassifier, which requires the scikit-learn package.

    >>> from nltk.classify import SklearnClassifier
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> from sklearn.svm import SVC
    >>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
    ...               ({"a": 5, "b": 2, "c": 1}, "ham"),
    ...               ({"a": 0, "b": 3, "c": 4}, "spam"),
    ...               ({"a": 5, "b": 1, "c": 1}, "ham"),
    ...               ({"a": 1, "b": 4, "c": 3}, "spam")]
    >>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
    >>> test_data = [{"a": 3, "b": 2, "c": 1},
    ...              {"a": 0, "b": 3, "c": 7}]
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
    >>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
    >>> classif.classify_many(test_data)
    ['ham', 'spam']
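
Other scikit-learn estimators with the usual fit/predict interface can be
wrapped the same way; a sketch with logistic regression follows (the
expected output is indicative only, since the fitted model can vary
across scikit-learn versions, so the test is skipped):

    >>> from sklearn.linear_model import LogisticRegression
    >>> classif = SklearnClassifier(LogisticRegression()).train(train_data)
    >>> classif.classify_many(test_data)  # doctest: +SKIP
    ['ham', 'spam']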

Test the Maximum Entropy classifier training algorithms; they should all
generate the same results.

    >>> def print_maxent_test_header():
    ...     print(' '*11+''.join(['      test[%s]  ' % i
    ...         for i in range(len(test))]))
    ...     print(' '*11+'     p(x)  p(y)'*len(test))
    ...     print('-'*(11+15*len(test)))

    >>> def test_maxent(algorithm):
    ...     print('%11s' % algorithm, end=' ')
    ...     try:
    ...         classifier = nltk.classify.MaxentClassifier.train(
    ...             train, algorithm, trace=0, max_iter=1000)
    ...     except Exception as e:
    ...         print('Error: %r' % e)
    ...         return
    ...
    ...     for featureset in test:
    ...         pdist = classifier.prob_classify(featureset)
    ...         print('%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')), end=' ')
    ...     print()

    >>> print_maxent_test_header(); test_maxent('GIS'); test_maxent('IIS')
                     test[0]        test[1]        test[2]        test[3]
                    p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
    -----------------------------------------------------------------------
            GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

    >>> test_maxent('MEGAM'); test_maxent('TADM') # doctest: +SKIP
          MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
           TADM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

(The MEGAM and TADM algorithms call the external ``megam`` and ``tadm``
binaries, which must be installed separately; these tests are therefore
skipped by default.)

Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    >>> from nltk.classify import maxent
    >>> train = [
    ...     ({'a': 1, 'b': 1, 'c': 1}, 'y'),
    ...     ({'a': 5, 'b': 5, 'c': 5}, 'x'),
    ...     ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
    ...     ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
    ...     ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
    ...     ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
    ... ]
    >>> test = [
    ...     {'a': 1, 'b': 0.8, 'c': 1.2},
    ...     {'a': 5.2, 'b': 5.1, 'c': 5}
    ... ]
    >>> encoding = maxent.TypedMaxentFeatureEncoding.train(
    ...     train, count_cutoff=3, alwayson_features=True)
    >>> classifier = maxent.MaxentClassifier.train(
    ...     train, bernoulli=False, encoding=encoding, trace=0)
    >>> classifier.classify_many(test)
    ['y', 'x']
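
As with the other classifiers, label probabilities can be inspected with
``prob_classify_many``; the exact values depend on how far the optimizer
has converged, so they are not checked here:

    >>> for pdist in classifier.prob_classify_many(test):  # doctest: +SKIP
    ...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))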