# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2020 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
Classes and interfaces for labeling tokens with category labels (or
"class labels"). Typically, labels are represented with strings
(such as ``'health'`` or ``'sports'``). Classifiers can be used to
perform a wide range of classification tasks. For example,
classifiers can be used...

- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author

Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token. These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision. For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.
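
For example, the word-count features for a short document might be
represented as follows (the feature names and values below are
invented purely for illustration):

>>> word_counts = {'count(the)': 2, 'count(cat)': 1, 'count(sat)': 1}
>>> word_counts['count(the)']
2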

Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values". Feature names are unique strings that indicate what aspect
of the token is encoded by the feature. Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a document
contains the word ``'library'``. Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
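
For example, a featureset for a token whose previous word is ``'the'``,
drawn from a document that mentions ``'library'``, might look like this
(the exact features depend entirely on the feature detector in use):

>>> featureset = {'prevword': 'the', 'contains-word(library)': True}
>>> featureset['contains-word(library)']
True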

Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor"). A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:

>>> # Define a feature detector function.
>>> def document_features(document):
...     return dict([('contains-word(%s)' % w, True) for w in document])
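
Applied to a small toy document, for instance, it produces one boolean
feature per distinct word:

>>> sorted(document_features(['the', 'cat', 'sat']))
['contains-word(cat)', 'contains-word(sat)', 'contains-word(the)']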

Feature detectors are typically applied to each token before it is fed
to the classifier:

>>> # Classify each Gutenberg document.
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids(): # doctest: +SKIP
...     doc = gutenberg.words(fileid) # doctest: +SKIP
...     print(fileid, classifier.classify(document_features(doc))) # doctest: +SKIP

The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector. For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word. The following feature detector
for WSD includes features describing the left and right contexts of
the target word:

>>> def wsd_features(sentence, index):
...     featureset = {}
...     for i in range(max(0, index-3), index):
...         featureset['left-context(%s)' % sentence[i]] = True
...     for i in range(index, min(index+3, len(sentence))):
...         featureset['right-context(%s)' % sentence[i]] = True
...     return featureset
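
Applied to a toy sentence, for example, this detector yields features
for up to three words of left context and, starting at the target word
itself, up to three words of right context:

>>> sentence = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money']
>>> sorted(wsd_features(sentence, 4))  # doctest: +NORMALIZE_WHITESPACE
['left-context(the)', 'left-context(to)', 'left-context(went)',
 'right-context(bank)', 'right-context(deposit)', 'right-context(to)']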

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set". Training sets are represented
as lists of ``(featuredict, label)`` tuples.
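
For illustration, the toy training set below (two invented featuresets
with made-up labels) is enough to train a ``NaiveBayesClassifier``:

>>> from nltk.classify import NaiveBayesClassifier
>>> train = [({'contains-word(python)': True}, 'tech'),
...          ({'contains-word(election)': True}, 'politics')]
>>> classifier = NaiveBayesClassifier.train(train)
>>> classifier.classify({'contains-word(python)': True})
'tech'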
  72. """

from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (
    MaxentClassifier,
    BinaryMaxentFeatureEncoding,
    TypedMaxentFeatureEncoding,
    ConditionalExponentialClassifier,
)
from nltk.classify.senna import Senna
from nltk.classify.textcat import TextCat