# Natural Language Toolkit: Language Models
#
# Copyright (C) 2001-2020 NLTK Project
# Authors: Ilia Kurenkov <ilia.kurenkov@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
"""
NLTK Language Modeling Module.
------------------------------

Currently this module covers only ngram language models, but it should be easy
to extend to neural models.


Preparing Data
==============

Before we train our ngram models it is necessary to make sure the data we put in
them is in the right format.
Let's say we have a text that is a list of sentences, where each sentence is
a list of strings. For simplicity we just consider a text consisting of
characters instead of words.

    >>> text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we need to turn this text into bigrams.
Here's what the first sentence of our text would look like if we use a function
from NLTK for this.

    >>> from nltk.util import bigrams
    >>> list(bigrams(text[0]))
    [('a', 'b'), ('b', 'c')]

Notice how "b" occurs both as the first and second member of different bigrams
but "a" and "c" don't? Wouldn't it be nice to somehow indicate how often sentences
start with "a" and end with "c"?
A standard way to deal with this is to add special "padding" symbols to the
sentence before splitting it into ngrams.
Fortunately, NLTK also has a function for that. Let's see what it does to the
first sentence.

    >>> from nltk.util import pad_sequence
    >>> list(pad_sequence(text[0],
    ...                   pad_left=True,
    ...                   left_pad_symbol="<s>",
    ...                   pad_right=True,
    ...                   right_pad_symbol="</s>",
    ...                   n=2))
    ['<s>', 'a', 'b', 'c', '</s>']

Note the `n` argument, which tells the function we need padding for bigrams.
Now, passing all these parameters every time is tedious and in most cases they
can be safely assumed as defaults anyway.
Thus our module provides a convenience function that has all these arguments
already set while the other arguments remain the same as for `pad_sequence`.

    >>> from nltk.lm.preprocessing import pad_both_ends
    >>> list(pad_both_ends(text[0], n=2))
    ['<s>', 'a', 'b', 'c', '</s>']

Combining the two parts discussed so far we get the following preparation steps
for one sentence.

    >>> list(bigrams(pad_both_ends(text[0], n=2)))
    [('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]
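
To make the two steps concrete, here is a minimal pure-Python sketch of padding
followed by bigram extraction. It is an illustration only, not how `nltk.util`
implements these functions; the helper names `pad` and `to_bigrams` are made up
for this example.

```python
def pad(sentence, n, left="<s>", right="</s>"):
    # Hypothetical helper: add n - 1 padding symbols on each side.
    return [left] * (n - 1) + list(sentence) + [right] * (n - 1)


def to_bigrams(tokens):
    # Hypothetical helper: pair every token with its successor.
    return list(zip(tokens, tokens[1:]))


padded = pad(["a", "b", "c"], n=2)
pairs = to_bigrams(padded)
print(pairs)
```

With `n=2` a single symbol is added on each side, so every real token appears
once as the first and once as the second member of some bigram.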

To make our model more robust we could also train it on unigrams (single words)
as well as bigrams, its main source of information.
NLTK once again helpfully provides a function called `everygrams`.
While not the most efficient, it is conceptually simple.

    >>> from nltk.util import everygrams
    >>> padded_bigrams = list(pad_both_ends(text[0], n=2))
    >>> list(everygrams(padded_bigrams, max_len=2))
    [('<s>',),
     ('a',),
     ('b',),
     ('c',),
     ('</s>',),
     ('<s>', 'a'),
     ('a', 'b'),
     ('b', 'c'),
     ('c', '</s>')]
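
Conceptually, collecting everygrams amounts to sliding a window of every length
from 1 up to `max_len` over the sequence. A rough pure-Python sketch of that idea
follows; `all_ngrams` is a hypothetical helper, and the real `everygrams` may
differ in details such as output order.

```python
def all_ngrams(tokens, max_len):
    # Hypothetical helper: collect ngrams of length 1..max_len, shortest first.
    tokens = list(tokens)
    result = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            result.append(tuple(tokens[start:start + size]))
    return result


grams = all_ngrams(["<s>", "a", "b", "c", "</s>"], max_len=2)
print(len(grams))
```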

We are almost ready to start counting ngrams, just one more step left.
During training and evaluation our model will rely on a vocabulary that
defines which words are "known" to the model.
To create this vocabulary we need to pad our sentences (just like for counting
ngrams) and then combine the sentences into one flat stream of words.

    >>> from nltk.lm.preprocessing import flatten
    >>> list(flatten(pad_both_ends(sent, n=2) for sent in text))
    ['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']

In most cases we want to use the same text as the source for both vocabulary
and ngram counts.
Now that we understand what this means for our preprocessing, we can simply import
a function that does everything for us.

    >>> from nltk.lm.preprocessing import padded_everygram_pipeline
    >>> train, vocab = padded_everygram_pipeline(2, text)

So as to avoid re-creating the text in memory, both `train` and `vocab` are lazy
iterators. They are evaluated on demand at training time.
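
A practical consequence of laziness is that such an iterator can be consumed
only once. The sketch below shows this with a plain generator; `lazy_tokens` is
a hypothetical stand-in, and it is an assumption that the objects returned by
`padded_everygram_pipeline` behave the same way, so re-create them if you need
a second pass.

```python
def lazy_tokens(sentences):
    # Hypothetical stand-in for a lazy preprocessing pipeline.
    for sentence in sentences:
        for token in sentence:
            yield token


stream = lazy_tokens([["a", "b"], ["c"]])
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing left to yield
print(first_pass, second_pass)
```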


Training
========

Having prepared our data we are ready to start training a model.
As a simple example, let us train a Maximum Likelihood Estimator (MLE).
We only need to specify the highest ngram order to instantiate it.

    >>> from nltk.lm import MLE
    >>> lm = MLE(2)

This automatically creates an empty vocabulary...

    >>> len(lm.vocab)
    0

... which gets filled as we fit the model.

    >>> lm.fit(train, vocab)
    >>> print(lm.vocab)
    <Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items>
    >>> len(lm.vocab)
    9

The vocabulary helps us handle words that have not occurred during training.

    >>> lm.vocab.lookup(text[0])
    ('a', 'b', 'c')
    >>> lm.vocab.lookup(["aliens", "from", "Mars"])
    ('<UNK>', '<UNK>', '<UNK>')

Moreover, in some cases we want to ignore words that we did see during training
but that didn't occur frequently enough to provide us with useful information.
You can tell the vocabulary to ignore such words.
To find out how that works, check out the docs for the `Vocabulary` class.
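
The idea behind such a cutoff can be sketched with a plain
`collections.Counter`: words whose count falls below a threshold are replaced by
the unknown label. This is a simplified illustration, not the `Vocabulary`
implementation; `lookup_with_cutoff` is a hypothetical helper.

```python
from collections import Counter


def lookup_with_cutoff(words, counts, cutoff, unk_label="<UNK>"):
    # Hypothetical helper: keep a word only if it was seen at least
    # `cutoff` times, otherwise map it to the unknown label.
    return tuple(w if counts[w] >= cutoff else unk_label for w in words)


counts = Counter(["a", "b", "c", "a", "c", "d", "c", "e", "f"])
mapped = lookup_with_cutoff(["a", "b", "c", "z"], counts, cutoff=2)
print(mapped)
```

Here "b" was seen once and "z" never, so with a cutoff of 2 both are treated as
unknown, while "a" and "c" are kept.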


Using a Trained Model
=====================

When it comes to ngram models the training boils down to counting up the ngrams
from the training corpus.

    >>> print(lm.counts)
    <NgramCounter with 2 ngram orders and 24 ngrams>

This provides a convenient interface to access counts for unigrams...

    >>> lm.counts['a']
    2

...and bigrams (in this case "a b")

    >>> lm.counts[['a']]['b']
    1

And so on. However, the real purpose of training a language model is to have it
score how probable words are in certain contexts.
This being MLE, the model returns the item's relative frequency as its score.

    >>> lm.score("a")
    0.15384615384615385

Items that are not seen during training are mapped to the vocabulary's
"unknown label" token. This is "<UNK>" by default.

    >>> lm.score("<UNK>") == lm.score("aliens")
    True

Here's how you get the score for a word given some preceding context.
For example, we want to know the chance that "b" is preceded by "a".

    >>> lm.score("b", ["a"])
    0.5
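
The two scores above can be reproduced by hand from raw counts. The following
sketch recomputes them from the padded training text with
`collections.Counter`; it mirrors the arithmetic only and is not how the `MLE`
class is implemented.

```python
from collections import Counter

padded = [
    ["<s>", "a", "b", "c", "</s>"],
    ["<s>", "a", "c", "d", "c", "e", "f", "</s>"],
]

unigrams = Counter(tok for sent in padded for tok in sent)
bigrams = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))

# Unigram score: count of the word divided by the total number of tokens.
score_a = unigrams["a"] / sum(unigrams.values())

# Bigram score: count of ("a", "b") divided by the number of bigrams
# whose first member is "a".
context_total = sum(c for (first, _), c in bigrams.items() if first == "a")
score_b_given_a = bigrams[("a", "b")] / context_total

print(score_a, score_b_given_a)
```

"a" occurs 2 times among 13 padded tokens, giving 2/13, and it is followed by
"b" in 1 of its 2 occurrences, giving 0.5.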

To avoid underflow when working with many small score values it makes sense to
take their logarithm.
For convenience this can be done with the `logscore` method.

    >>> lm.logscore("a")
    -2.700439718141092

Building on this method, we can also evaluate our model's cross-entropy and
perplexity with respect to sequences of ngrams.

    >>> test = [('a', 'b'), ('c', 'd')]
    >>> lm.entropy(test)
    1.292481250360578
    >>> lm.perplexity(test)
    2.449489742783178

It is advisable to preprocess your test text exactly the same way as you did
the training text.
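
For reference, the entropy and perplexity values above follow from averaging the
negative base-2 log scores of the test ngrams and then exponentiating. The
sketch below redoes that arithmetic with the `math` module, taking the two
bigram probabilities (1/2 for "b" after "a", 1/3 for "d" after "c") from the
training counts; treat it as a worked example rather than the library's code.

```python
import math

probabilities = [1 / 2, 1 / 3]  # P(b | a) and P(d | c) under the MLE model

# Cross-entropy: mean negative log2 probability over the test ngrams.
entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)

# Perplexity: 2 raised to the cross-entropy.
perplexity = 2 ** entropy

print(entropy, perplexity)
```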

One cool feature of ngram models is that they can be used to generate text.

    >>> lm.generate(1, random_seed=3)
    '<s>'
    >>> lm.generate(5, random_seed=3)
    ['<s>', 'a', 'b', 'c', 'd']

Provide `random_seed` if you want to consistently reproduce the same text all
other things being equal. Here we are using it to test the examples.

You can also condition your generation on some preceding text with the
`text_seed` argument.

    >>> lm.generate(5, text_seed=['c'], random_seed=3)
    ['</s>', 'c', 'd', 'c', 'd']

Note that an ngram model is restricted in how much preceding context it can
take into account. For example, a trigram model can only condition its output
on 2 preceding words. If you pass in a 4-word context, the first two words
will be ignored.
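
The truncation described above can be sketched as a simple slice that keeps only
the last `order - 1` words. `truncate_context` is a hypothetical helper written
for this illustration, not part of the `nltk.lm` API.

```python
def truncate_context(context, order):
    # Hypothetical helper: an ngram model of the given order sees at most
    # order - 1 preceding words, so keep only the most recent ones.
    keep = order - 1
    return tuple(context[-keep:]) if keep > 0 else ()


trimmed = truncate_context(["w1", "w2", "w3", "w4"], order=3)
print(trimmed)
```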
"""

from nltk.lm.models import (
    MLE,
    Lidstone,
    Laplace,
    WittenBellInterpolated,
    KneserNeyInterpolated,
)
from nltk.lm.counter import NgramCounter
from nltk.lm.vocabulary import Vocabulary

__all__ = [
    "Vocabulary",
    "NgramCounter",
    "MLE",
    "Lidstone",
    "Laplace",
    "WittenBellInterpolated",
    "KneserNeyInterpolated",
]