# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2020 NLTK Project
# Author: Steven Bird <stevenbird1@gmail.com>
#         Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
#
- """
- Classes and interfaces for identifying non-overlapping linguistic
- groups (such as base noun phrases) in unrestricted text. This task is
- called "chunk parsing" or "chunking", and the identified groups are
- called "chunks". The chunked text is represented using a shallow
- tree called a "chunk structure." A chunk structure is a tree
- containing tokens and chunks, where each chunk is a subtree containing
- only tokens. For example, the chunk structure for base noun phrase
- chunks in the sentence "I saw the big dog on the hill" is::
- (SENTENCE:
- (NP: <I>)
- <saw>
- (NP: <the> <big> <dog>)
- <on>
- (NP: <the> <hill>))
- To convert a chunk structure back to a list of tokens, simply use the
- chunk structure's ``leaves()`` method.
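
For instance, a chunk structure built by hand with ``nltk.Tree`` (a
minimal sketch; the tree and tags below are illustrative) can be
flattened like this::

    from nltk import Tree

    chunk_struct = Tree('S', [Tree('NP', [('I', 'PRP')]),
                              ('saw', 'VBD'),
                              Tree('NP', [('the', 'DT'), ('dog', 'NN')])])
    chunk_struct.leaves()
    # [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('dog', 'NN')]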

This module defines ``ChunkParserI``, a standard interface for
chunking texts; and ``RegexpChunkParser``, a regular-expression based
implementation of that interface. It also defines ``ChunkScore``, a
utility class for scoring chunk parsers.

RegexpChunkParser
=================

``RegexpChunkParser`` is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text. Its
``parse()`` method first constructs a ``ChunkString``, which encodes a
particular chunking of the input text. Initially, nothing is
chunked. ``RegexpChunkParser.parse()`` then applies a sequence of
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
the chunking that it encodes. Finally, the ``ChunkString`` is
transformed back into a chunk structure, which is returned.

``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
For example, you can use a ``RegexpChunkParser`` to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text. (This is a limitation of ``RegexpChunkParser``, not of
chunk parsers in general.)
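
For example, noun phrases can be chunked via ``RegexpParser``, the
grammar-string front end that builds ``RegexpChunkParser`` stages from
tag patterns (a minimal sketch; the grammar and tagged sentence are
illustrative)::

    from nltk import RegexpParser

    grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'   # chunk sequences of DT/JJ/NN tags
    parser = RegexpParser(grammar)
    sent = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
    print(parser.parse(sent))
    # expected: (S (NP the/DT big/JJ dog/NN) barked/VBD)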

RegexpChunkRules
----------------

A ``RegexpChunkRule`` is a transformational rule that updates the
chunking of a text by modifying its ``ChunkString``. Each
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
the chunking encoded by a ``ChunkString``. The
``RegexpChunkRule`` class itself can be used to implement any
transformational rule based on regular expressions. There are
also a number of subclasses, which can be used to implement
simpler types of rules (a short sketch using two of them follows
this list):

  - ``ChunkRule`` chunks anything that matches a given regular
    expression.
  - ``ChinkRule`` chinks anything that matches a given regular
    expression.
  - ``UnChunkRule`` will un-chunk any chunk that matches a given
    regular expression.
  - ``MergeRule`` can be used to merge two contiguous chunks.
  - ``SplitRule`` can be used to split a single chunk into two
    smaller chunks.
  - ``ExpandLeftRule`` will expand a chunk to incorporate new
    unchunked material on the left.
  - ``ExpandRightRule`` will expand a chunk to incorporate new
    unchunked material on the right.
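
For example, a ``RegexpChunkParser`` can be built directly from rule
objects (a minimal sketch; the tag patterns and tagged sentence are
illustrative)::

    from nltk.chunk.regexp import RegexpChunkParser, ChunkRule, ChinkRule

    rules = [
        ChunkRule('<DT|JJ|NN.*>+', 'Chunk determiner/adjective/noun sequences'),
        ChinkRule('<VB.*>', 'Chink (remove) verbs from any chunk'),
    ]
    parser = RegexpChunkParser(rules, chunk_label='NP')
    sent = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'), ('sat', 'VBD'),
            ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
    print(parser.parse(sent))
    # expected: (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))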

Tag Patterns
~~~~~~~~~~~~

A ``RegexpChunkRule`` uses a modified version of regular
expression patterns, called "tag patterns". Tag patterns are
used to match sequences of tags. Examples of tag patterns are::

    r'(<DT>|<JJ>|<NN>)+'
    r'<NN>+'
    r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

  - In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
    ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
    ``'<NN'`` followed by one or more repetitions of ``'>'``.
  - Whitespace in tag patterns is ignored. So
    ``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
  - In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
    ``'<NN.*>'`` matches any single tag starting with ``'NN'``.

The function ``tag_pattern2re_pattern`` can be used to transform
a tag pattern to an equivalent regular expression pattern.
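
For example (a minimal sketch; the exact regular expression returned
may vary between NLTK versions)::

    from nltk.chunk.regexp import tag_pattern2re_pattern

    tag_pattern2re_pattern('<DT>?<NN.*>+')
    # returns an ordinary regexp string matching the same tag sequences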

Efficiency
----------

Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if ``RegexpChunkParser`` is used with more than
5,000 tokens at a time. In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth. We have attempted to minimize
these problems, but it is impossible to avoid them completely. We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
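
For example (a sketch; ``tagged_sents`` stands for any iterable of
POS-tagged sentences)::

    chunked_sents = [parser.parse(sent) for sent in tagged_sents]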

Emacs Tip
---------

If you evaluate the following elisp expression in emacs, it will
colorize a ``ChunkString`` when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"``C-x C-e``". You should evaluate it before running the interactive
session. The change will last until you close emacs.

Unresolved Issues
-----------------

If we use the ``re`` module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion. We
therefore use the ``pre`` module instead. But note that ``pre``
does not include Unicode support, so this module will not work
with unicode strings. Note also that ``pre`` regular expressions
are not quite as advanced as ``re`` ones (e.g., no leftward
zero-length assertions).

:type CHUNK_TAG_PATTERN: regexp
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
    pattern is valid.
"""

from nltk.data import load

from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import (
    ChunkScore,
    accuracy,
    tagstr2tree,
    conllstr2tree,
    conlltags2tree,
    tree2conlltags,
    tree2conllstr,
    ieerstr2tree,
)
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser

# Standard named-entity chunker models
_BINARY_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_binary.pickle"
_MULTICLASS_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_multiclass.pickle"


def ne_chunk(tagged_tokens, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to
    chunk the given list of tagged tokens.
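
    A minimal usage sketch (the sentence is illustrative; requires the
    ``maxent_ne_chunker`` and ``words`` NLTK data packages)::

        from nltk import word_tokenize, pos_tag, ne_chunk

        tree = ne_chunk(pos_tag(word_tokenize("John works for the United Nations.")))
        print(tree)   # named entities appear as subtrees, e.g. (PERSON John/NNP)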
- """
- if binary:
- chunker_pickle = _BINARY_NE_CHUNKER
- else:
- chunker_pickle = _MULTICLASS_NE_CHUNKER
- chunker = load(chunker_pickle)
- return chunker.parse(tagged_tokens)


def ne_chunk_sents(tagged_sentences, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to chunk the
    given list of tagged sentences, each consisting of a list of tagged tokens.
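
    A minimal usage sketch (assumes ``tagged_sents`` is a list of
    POS-tagged sentences, e.g. produced by ``nltk.pos_tag_sents``)::

        from nltk import ne_chunk_sents

        for tree in ne_chunk_sents(tagged_sents, binary=True):
            print(tree)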
- """
- if binary:
- chunker_pickle = _BINARY_NE_CHUNKER
- else:
- chunker_pickle = _MULTICLASS_NE_CHUNKER
- chunker = load(chunker_pickle)
- return chunker.parse_sents(tagged_sentences)
|