- # Natural Language Toolkit: Language Models
- #
- # Copyright (C) 2001-2020 NLTK Project
- # Authors: Ilia Kurenkov <ilia.kurenkov@gmail.com>
- # URL: <http://nltk.org/>
- # For license information, see LICENSE.TXT
- """
- NLTK Language Modeling Module.
- ------------------------------
- Currently this module covers only ngram language models, but it should be easy
- to extend to neural models.
- Preparing Data
- ==============
- Before we train our ngram models it is necessary to make sure the data we put in
- them is in the right format.
- Let's say we have a text that is a list of sentences, where each sentence is
- a list of strings. For simplicity we just consider a text consisting of
- characters instead of words.
- >>> text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
- If we want to train a bigram model, we need to turn this text into bigrams.
- Here's what the first sentence of our text would look like if we use a function
- from NLTK for this.
- >>> from nltk.util import bigrams
- >>> list(bigrams(text[0]))
- [('a', 'b'), ('b', 'c')]
- Notice how "b" occurs both as the first and second member of different bigrams
- but "a" and "c" don't? Wouldn't it be nice to somehow indicate how often sentences
- start with "a" and end with "c"?
- A standard way to deal with this is to add special "padding" symbols to the
- sentence before splitting it into ngrams.
- Fortunately, NLTK also has a function for that, let's see what it does to the
- first sentence.
- >>> from nltk.util import pad_sequence
- >>> list(pad_sequence(text[0],
- ... pad_left=True,
- ... left_pad_symbol="<s>",
- ... pad_right=True,
- ... right_pad_symbol="</s>",
- ... n=2))
- ['<s>', 'a', 'b', 'c', '</s>']
- Note the `n` argument, which tells the function we need padding for bigrams.
- Now, passing all these parameters every time is tedious and in most cases they
- can safely be treated as defaults anyway.
- Thus our module provides a convenience function that has all these arguments
- already set while the other arguments remain the same as for `pad_sequence`.
- >>> from nltk.lm.preprocessing import pad_both_ends
- >>> list(pad_both_ends(text[0], n=2))
- ['<s>', 'a', 'b', 'c', '</s>']
- Combining the two parts discussed so far we get the following preparation steps
- for one sentence.
- >>> list(bigrams(pad_both_ends(text[0], n=2)))
- [('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]
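- The same two steps work for any sentence; here they are applied to the second
- sentence of our toy text.
- >>> list(bigrams(pad_both_ends(text[1], n=2)))
- [('<s>', 'a'), ('a', 'c'), ('c', 'd'), ('d', 'c'), ('c', 'e'), ('e', 'f'), ('f', '</s>')]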
- To make our model more robust we could also train it on unigrams (single words)
- as well as bigrams, its main source of information.
- NLTK once again helpfully provides a function called `everygrams`.
- While not the most efficient, it is conceptually simple.
- >>> from nltk.util import everygrams
- >>> padded_bigrams = list(pad_both_ends(text[0], n=2))
- >>> list(everygrams(padded_bigrams, max_len=2))
- [('<s>',), ('a',), ('b',), ('c',), ('</s>',), ('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]
- We are almost ready to start counting ngrams, just one more step left.
- During training and evaluation our model will rely on a vocabulary that
- defines which words are "known" to the model.
- To create this vocabulary we need to pad our sentences (just like for counting
- ngrams) and then combine the sentences into one flat stream of words.
- >>> from nltk.lm.preprocessing import flatten
- >>> list(flatten(pad_both_ends(sent, n=2) for sent in text))
- ['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']
- In most cases we want to use the same text as the source for both vocabulary
- and ngram counts.
- Now that we understand what this means for our preprocessing, we can simply import
- a function that does everything for us.
- >>> from nltk.lm.preprocessing import padded_everygram_pipeline
- >>> train, vocab = padded_everygram_pipeline(2, text)
- So as to avoid re-creating the text in memory, both `train` and `vocab` are lazy
- iterators. They are evaluated on demand at training time.
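- If you would like to inspect what the pipeline yields, build a separate,
- throwaway pipeline just for that; consuming the iterators you are about to
- train on would leave them empty. A quick sketch (the names `peek_train` and
- `peek_vocab` are only for illustration): the second iterator is exactly the
- padded, flattened stream we built by hand above.
- >>> peek_train, peek_vocab = padded_everygram_pipeline(2, text)
- >>> list(peek_vocab)
- ['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']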
- Training
- ========
- Having prepared our data we are ready to start training a model.
- As a simple example, let us train a Maximum Likelihood Estimator (MLE).
- We only need to specify the highest ngram order to instantiate it.
- >>> from nltk.lm import MLE
- >>> lm = MLE(2)
- This automatically creates an empty vocabulary...
- >>> len(lm.vocab)
- 0
- ... which gets filled as we fit the model.
- >>> lm.fit(train, vocab)
- >>> print(lm.vocab)
- <Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items>
- >>> len(lm.vocab)
- 9
- The vocabulary helps us handle words that have not occurred during training.
- >>> lm.vocab.lookup(text[0])
- ('a', 'b', 'c')
- >>> lm.vocab.lookup(["aliens", "from", "Mars"])
- ('<UNK>', '<UNK>', '<UNK>')
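- Looking up a single string (rather than an iterable of strings) works too and
- returns either the word itself or the unknown label.
- >>> lm.vocab.lookup("aliens")
- '<UNK>'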
- Moreover, in some cases we want to ignore words that we did see during training
- but that didn't occur frequently enough to provide us with useful information.
- You can tell the vocabulary to ignore such words.
- To find out how that works, check out the docs for the `Vocabulary` class.
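- As a quick sketch of what that looks like, a standalone `Vocabulary` built with
- `unk_cutoff=2` (the name `strict_vocab` is just for illustration) maps anything
- seen fewer than two times to the unknown label.
- >>> from nltk.lm import Vocabulary
- >>> strict_vocab = Vocabulary(["a", "a", "b"], unk_cutoff=2)
- >>> strict_vocab.lookup(["a", "b"])
- ('a', '<UNK>')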
- Using a Trained Model
- =====================
- When it comes to ngram models the training boils down to counting up the ngrams
- from the training corpus.
- >>> print(lm.counts)
- <NgramCounter with 2 ngram orders and 24 ngrams>
- This provides a convenient interface to access counts for unigrams...
- >>> lm.counts['a']
- 2
- ...and bigrams (in this case "a b")
- >>> lm.counts[['a']]['b']
- 1
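- An ngram that never occurred during training simply has a count of zero; for
- example, "a" is never followed by "d".
- >>> lm.counts[['a']]['d']
- 0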
- And so on. However, the real purpose of training a language model is to have it
- score how probable words are in certain contexts.
- This being MLE, the model returns the item's relative frequency as its score.
- >>> lm.score("a")
- 0.15384615384615385
- Items that are not seen during training are mapped to the vocabulary's
- "unknown label" token. This is "<UNK>" by default.
- >>> lm.score("<UNK>") == lm.score("aliens")
- True
- Here's how you get the score for a word given some preceding context.
- For example, we want to know the chance of seeing "b" right after "a".
- >>> lm.score("b", ["a"])
- 0.5
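- Since this is an unsmoothed MLE model, an ngram that never occurred in training
- gets a score of zero; "d" never follows "a" in our text.
- >>> lm.score("d", ["a"])
- 0.0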
- To avoid underflow when working with many small score values it makes sense to
- take their logarithm.
- For convenience this can be done with the `logscore` method.
- >>> lm.logscore("a")
- -2.700439718141092
- Building on this method, we can also evaluate our model's cross-entropy and
- perplexity with respect to sequences of ngrams.
- >>> test = [('a', 'b'), ('c', 'd')]
- >>> lm.entropy(test)
- 1.292481250360578
- >>> lm.perplexity(test)
- 2.449489742783178
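- Perplexity is simply two raised to the power of the cross-entropy, so the two
- numbers above are consistent with each other.
- >>> 2 ** lm.entropy(test) == lm.perplexity(test)
- True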
- It is advisable to preprocess your test text exactly the same way as you did
- the training text.
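- For example, if your held-out data arrives as plain sentences (the hypothetical
- `test_sents` below), pad it and extract ngrams exactly as we did for training
- before handing it to `entropy` or `perplexity`.
- >>> test_sents = [['a', 'b'], ['c', 'd']]
- >>> test_ngrams = [list(bigrams(pad_both_ends(sent, n=2))) for sent in test_sents]
- >>> test_ngrams[0]
- [('<s>', 'a'), ('a', 'b'), ('b', '</s>')]
- Keep in mind that an unsmoothed MLE model assigns zero probability to any ngram
- it has not seen, so evaluating on realistic held-out text usually calls for one
- of the smoothed models exported at the bottom of this module.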
- One cool feature of ngram models is that they can be used to generate text.
- >>> lm.generate(1, random_seed=3)
- '<s>'
- >>> lm.generate(5, random_seed=3)
- ['<s>', 'a', 'b', 'c', 'd']
- Provide `random_seed` if you want to consistently reproduce the same text all
- other things being equal. Here we are using it to test the examples.
- You can also condition your generation on some preceding text with the `text_seed`
- argument.
- >>> lm.generate(5, text_seed=['c'], random_seed=3)
- ['</s>', 'c', 'd', 'c', 'd']
- Note that an ngram model is restricted in how much preceding context it can
- take into account. For example, a trigram model can only condition its output
- on 2 preceding words. If you pass in a 4-word context, the first two words
- will be ignored.
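- For our bigram model that means only the final word of the seed matters, so a
- longer seed ending in "c" reproduces the generation above.
- >>> lm.generate(5, text_seed=['a', 'b', 'c'], random_seed=3)
- ['</s>', 'c', 'd', 'c', 'd']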
- """
- from nltk.lm.models import (
- MLE,
- Lidstone,
- Laplace,
- WittenBellInterpolated,
- KneserNeyInterpolated,
- )
- from nltk.lm.counter import NgramCounter
- from nltk.lm.vocabulary import Vocabulary
- __all__ = [
- "Vocabulary",
- "NgramCounter",
- "MLE",
- "Lidstone",
- "Laplace",
- "WittenBellInterpolated",
- "KneserNeyInterpolated",
- ]
|