__init__.py 7.2 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199
  1. # Natural Language Toolkit: Chunkers
  2. #
  3. # Copyright (C) 2001-2020 NLTK Project
  4. # Author: Steven Bird <stevenbird1@gmail.com>
  5. # Edward Loper <edloper@gmail.com>
  6. # URL: <http://nltk.org/>
  7. # For license information, see LICENSE.TXT
  8. #
  9. """
  10. Classes and interfaces for identifying non-overlapping linguistic
  11. groups (such as base noun phrases) in unrestricted text. This task is
  12. called "chunk parsing" or "chunking", and the identified groups are
  13. called "chunks". The chunked text is represented using a shallow
  14. tree called a "chunk structure." A chunk structure is a tree
  15. containing tokens and chunks, where each chunk is a subtree containing
  16. only tokens. For example, the chunk structure for base noun phrase
  17. chunks in the sentence "I saw the big dog on the hill" is::
  18. (SENTENCE:
  19. (NP: <I>)
  20. <saw>
  21. (NP: <the> <big> <dog>)
  22. <on>
  23. (NP: <the> <hill>))
  24. To convert a chunk structure back to a list of tokens, simply use the
  25. chunk structure's ``leaves()`` method.
  26. This module defines ``ChunkParserI``, a standard interface for
  27. chunking texts; and ``RegexpChunkParser``, a regular-expression based
  28. implementation of that interface. It also defines ``ChunkScore``, a
  29. utility class for scoring chunk parsers.
  30. RegexpChunkParser
  31. =================
  32. ``RegexpChunkParser`` is an implementation of the chunk parser interface
  33. that uses regular-expressions over tags to chunk a text. Its
  34. ``parse()`` method first constructs a ``ChunkString``, which encodes a
  35. particular chunking of the input text. Initially, nothing is
  36. chunked. ``parse.RegexpChunkParser`` then applies a sequence of
  37. ``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
  38. the chunking that it encodes. Finally, the ``ChunkString`` is
  39. transformed back into a chunk structure, which is returned.
  40. ``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
  41. For example, you can use an ``RegexpChunkParser`` to chunk the noun
  42. phrases in a text, or the verb phrases in a text; but you can not
  43. use it to simultaneously chunk both noun phrases and verb phrases in
  44. the same text. (This is a limitation of ``RegexpChunkParser``, not of
  45. chunk parsers in general.)
  46. RegexpChunkRules
  47. ----------------
  48. A ``RegexpChunkRule`` is a transformational rule that updates the
  49. chunking of a text by modifying its ``ChunkString``. Each
  50. ``RegexpChunkRule`` defines the ``apply()`` method, which modifies
  51. the chunking encoded by a ``ChunkString``. The
  52. ``RegexpChunkRule`` class itself can be used to implement any
  53. transformational rule based on regular expressions. There are
  54. also a number of subclasses, which can be used to implement
  55. simpler types of rules:
  56. - ``ChunkRule`` chunks anything that matches a given regular
  57. expression.
  58. - ``ChinkRule`` chinks anything that matches a given regular
  59. expression.
  60. - ``UnChunkRule`` will un-chunk any chunk that matches a given
  61. regular expression.
  62. - ``MergeRule`` can be used to merge two contiguous chunks.
  63. - ``SplitRule`` can be used to split a single chunk into two
  64. smaller chunks.
  65. - ``ExpandLeftRule`` will expand a chunk to incorporate new
  66. unchunked material on the left.
  67. - ``ExpandRightRule`` will expand a chunk to incorporate new
  68. unchunked material on the right.
  69. Tag Patterns
  70. ~~~~~~~~~~~~
  71. A ``RegexpChunkRule`` uses a modified version of regular
  72. expression patterns, called "tag patterns". Tag patterns are
  73. used to match sequences of tags. Examples of tag patterns are::
  74. r'(<DT>|<JJ>|<NN>)+'
  75. r'<NN>+'
  76. r'<NN.*>'
  77. The differences between regular expression patterns and tag
  78. patterns are:
  79. - In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
  80. ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
  81. ``'<NN'`` followed by one or more repetitions of ``'>'``.
  82. - Whitespace in tag patterns is ignored. So
  83. ``'<DT> | <NN>'`` is equivalant to ``'<DT>|<NN>'``
  84. - In tag patterns, ``'.'`` is equivalant to ``'[^{}<>]'``; so
  85. ``'<NN.*>'`` matches any single tag starting with ``'NN'``.
  86. The function ``tag_pattern2re_pattern`` can be used to transform
  87. a tag pattern to an equivalent regular expression pattern.
  88. Efficiency
  89. ----------
  90. Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
  91. rate of about 300 tokens/second, with a moderately complex rule set.
  92. There may be problems if ``RegexpChunkParser`` is used with more than
  93. 5,000 tokens at a time. In particular, evaluation of some regular
  94. expressions may cause the Python regular expression engine to
  95. exceed its maximum recursion depth. We have attempted to minimize
  96. these problems, but it is impossible to avoid them completely. We
  97. therefore recommend that you apply the chunk parser to a single
  98. sentence at a time.
  99. Emacs Tip
  100. ---------
  101. If you evaluate the following elisp expression in emacs, it will
  102. colorize a ``ChunkString`` when you use an interactive python shell
  103. with emacs or xemacs ("C-c !")::
  104. (let ()
  105. (defconst comint-mode-font-lock-keywords
  106. '(("<[^>]+>" 0 'font-lock-reference-face)
  107. ("[{}]" 0 'font-lock-function-name-face)))
  108. (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))
  109. You can evaluate this code by copying it to a temporary buffer,
  110. placing the cursor after the last close parenthesis, and typing
  111. "``C-x C-e``". You should evaluate it before running the interactive
  112. session. The change will last until you close emacs.
  113. Unresolved Issues
  114. -----------------
  115. If we use the ``re`` module for regular expressions, Python's
  116. regular expression engine generates "maximum recursion depth
  117. exceeded" errors when processing very large texts, even for
  118. regular expressions that should not require any recursion. We
  119. therefore use the ``pre`` module instead. But note that ``pre``
  120. does not include Unicode support, so this module will not work
  121. with unicode strings. Note also that ``pre`` regular expressions
  122. are not quite as advanced as ``re`` ones (e.g., no leftward
  123. zero-length assertions).
  124. :type CHUNK_TAG_PATTERN: regexp
  125. :var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
  126. pattern is valid.
  127. """
  128. from nltk.data import load
  129. from nltk.chunk.api import ChunkParserI
  130. from nltk.chunk.util import (
  131. ChunkScore,
  132. accuracy,
  133. tagstr2tree,
  134. conllstr2tree,
  135. conlltags2tree,
  136. tree2conlltags,
  137. tree2conllstr,
  138. tree2conlltags,
  139. ieerstr2tree,
  140. )
  141. from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
  142. # Standard treebank POS tagger
  143. _BINARY_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_binary.pickle"
  144. _MULTICLASS_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_multiclass.pickle"
  145. def ne_chunk(tagged_tokens, binary=False):
  146. """
  147. Use NLTK's currently recommended named entity chunker to
  148. chunk the given list of tagged tokens.
  149. """
  150. if binary:
  151. chunker_pickle = _BINARY_NE_CHUNKER
  152. else:
  153. chunker_pickle = _MULTICLASS_NE_CHUNKER
  154. chunker = load(chunker_pickle)
  155. return chunker.parse(tagged_tokens)
  156. def ne_chunk_sents(tagged_sentences, binary=False):
  157. """
  158. Use NLTK's currently recommended named entity chunker to chunk the
  159. given list of tagged sentences, each consisting of a list of tagged tokens.
  160. """
  161. if binary:
  162. chunker_pickle = _BINARY_NE_CHUNKER
  163. else:
  164. chunker_pickle = _MULTICLASS_NE_CHUNKER
  165. chunker = load(chunker_pickle)
  166. return chunker.parse_sents(tagged_sentences)