.. Copyright (C) 2001-2020 NLTK Project
.. For license information, see LICENSE.TXT

=========================================
Loading Resources From the Data Package
=========================================

    >>> import nltk.data

Overview
~~~~~~~~

The `nltk.data` module contains functions that can be used to load
NLTK resource files, such as corpora, grammars, and saved processing
objects.

Loading Data Files
~~~~~~~~~~~~~~~~~~

Resources are loaded using the function `nltk.data.load()`, which
takes as its first argument a URL specifying what file should be
loaded.  The ``nltk:`` protocol loads files from the NLTK data
distribution:

    >>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
    >>> tokenizer.tokenize('Hello. This is a test. It works!')
    ['Hello.', 'This is a test.', 'It works!']

It is important to note that there should be no space following the
colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will
not work!

The ``nltk:`` protocol is used by default if no protocol is specified:

    >>> nltk.data.load('tokenizers/punkt/english.pickle') # doctest: +ELLIPSIS
    <nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>

But it is also possible to load resources from ``http:``, ``ftp:``,
and ``file:`` URLs, e.g.
``cfg = nltk.data.load('http://example.com/path/to/toy.cfg')``.

    >>> # Load a grammar using an absolute path.
    >>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg')
    >>> url.replace('\\', '/') # doctest: +ELLIPSIS
    'file:...toy.cfg'
    >>> print(nltk.data.load(url)) # doctest: +ELLIPSIS
    Grammar with 14 productions (start state = S)
    S -> NP VP
    PP -> P NP
    ...
    P -> 'on'
    P -> 'in'

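Loading over HTTP works the same way.  A minimal sketch, skipped here
because it requires network access (the URL points at the copy of
toy.cfg in the NLTK repository, the same one used in the regression
tests below):

    >>> cfg = nltk.data.load('https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg') # doctest: +SKIP
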
The second argument to the `nltk.data.load()` function specifies the
file format, which determines how the file's contents are processed
before they are returned by ``load()``.  The formats that are
currently supported by the data module are described by the dictionary
`nltk.data.FORMATS`:

    >>> for format, descr in sorted(nltk.data.FORMATS.items()):
    ...     print('{0:<7} {1:}'.format(format, descr)) # doctest: +NORMALIZE_WHITESPACE
    cfg     A context free grammar.
    fcfg    A feature CFG.
    fol     A list of first order logic expressions, parsed with
            nltk.sem.logic.Expression.fromstring.
    json    A serialized python object, stored using the json module.
    logic   A list of first order logic expressions, parsed with
            nltk.sem.logic.LogicParser.  Requires an additional logic_parser
            parameter
    pcfg    A probabilistic CFG.
    pickle  A serialized python object, stored using the pickle
            module.
    raw     The raw (byte string) contents of a file.
    text    The raw (unicode string) contents of a file.
    val     A semantic valuation, parsed by
            nltk.sem.Valuation.fromstring.
    yaml    A serialized python object, stored using the yaml module.

`nltk.data.load()` will raise a ValueError if a bad format name is
specified:

    >>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar')
    Traceback (most recent call last):
      ...
    ValueError: Unknown format type!

By default, the ``"auto"`` format is used, which chooses a format
based on the filename's extension.  The mapping from file extensions
to format names is specified by `nltk.data.AUTO_FORMATS`:

    >>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
    ...     print('.%-7s -> %s' % (ext, format))
    .cfg     -> cfg
    .fcfg    -> fcfg
    .fol     -> fol
    .json    -> json
    .logic   -> logic
    .pcfg    -> pcfg
    .pickle  -> pickle
    .text    -> text
    .txt     -> text
    .val     -> val
    .yaml    -> yaml

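For instance, the ``.fcfg`` extension causes ``load()`` to parse the
file as a feature grammar.  A quick check against one of the sample
grammars (assuming, as elsewhere in this file, that the sample
grammars are installed):

    >>> type(nltk.data.load('grammars/sample_grammars/np.fcfg')).__name__
    'FeatureGrammar'
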
If `nltk.data.load()` is unable to determine the format based on the
filename's extension, it will raise a ValueError:

    >>> nltk.data.load('foo.bar')
    Traceback (most recent call last):
      ...
    ValueError: Could not determine format for foo.bar based on its file
    extension; use the "format" argument to specify the format explicitly.

Note that by explicitly specifying the ``format`` argument, you can
override the load method's default processing behavior.  For example,
to get the unprocessed contents of any file as a string, simply use
``format="text"``:

    >>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text')
    >>> print(s) # doctest: +ELLIPSIS
    S -> NP VP
    PP -> P NP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    ...

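Similarly, ``format="raw"`` returns the same file's contents as an
unprocessed byte string (a quick check, using the documented behavior
of the ``raw`` format above):

    >>> b = nltk.data.load('grammars/sample_grammars/toy.cfg', 'raw')
    >>> isinstance(b, bytes)
    True
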
Making Local Copies
~~~~~~~~~~~~~~~~~~~

.. This will not be visible in the html output: create a tempdir to
   play in.

    >>> import tempfile, os
    >>> tempdir = tempfile.mkdtemp()
    >>> old_dir = os.path.abspath('.')
    >>> os.chdir(tempdir)

The function `nltk.data.retrieve()` copies a given resource to a local
file.  This can be useful, for example, if you want to edit one of the
sample grammars.

    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg'

    >>> # Simulate editing the grammar.
    >>> with open('toy.cfg') as inp:
    ...     s = inp.read().replace('NP', 'DP')
    >>> with open('toy.cfg', 'w') as out:
    ...     _bytes_written = out.write(s)

    >>> # Load the edited grammar, & display it.
    >>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg'))
    >>> print(cfg) # doctest: +ELLIPSIS
    Grammar with 14 productions (start state = S)
    S -> DP VP
    PP -> P DP
    ...
    P -> 'on'
    P -> 'in'

The second argument to `nltk.data.retrieve()` specifies the filename
for the new copy of the file.  By default, the source file's filename
is used.

    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg')
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg'
    >>> os.path.isfile('./mytoy.cfg')
    True
    >>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg')
    Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg'
    >>> os.path.isfile('./np.fcfg')
    True

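As a quick sanity check, the fresh copy matches the packaged original,
byte for byte.  (Note that ``toy.cfg`` itself would *not* match, since
we edited it above.)

    >>> with open('mytoy.cfg', 'rb') as f:
    ...     f.read() == nltk.data.find('grammars/sample_grammars/toy.cfg').open().read()
    True
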
If a file with the specified (or default) filename already exists in
the current directory, then `nltk.data.retrieve()` will raise a
ValueError exception.  It will *not* overwrite the file:

    >>> os.path.isfile('./toy.cfg')
    True
    >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') # doctest: +ELLIPSIS
    Traceback (most recent call last):
      ...
    ValueError: File '...toy.cfg' already exists!

.. This will not be visible in the html output: clean up the tempdir.

    >>> os.chdir(old_dir)
    >>> for f in os.listdir(tempdir):
    ...     os.remove(os.path.join(tempdir, f))
    >>> os.rmdir(tempdir)

Finding Files in the NLTK Data Package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `nltk.data.find()` function searches the NLTK data package for a
given file, and returns a pointer to that file.  This pointer can
either be a `FileSystemPathPointer` (whose `path` attribute gives the
absolute path of the file); or a `ZipFilePathPointer`, specifying a
zipfile and the name of an entry within that zipfile.  Both pointer
types define the `open()` method, which can be used to read the byte
contents of the file.

    >>> path = nltk.data.find('corpora/abc/rural.txt')
    >>> str(path) # doctest: +ELLIPSIS
    '...rural.txt'
    >>> print(path.open().read(60).decode())
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

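Which pointer type you get depends on how the resource is installed.
A sketch, skipped here because it assumes the punkt models are still
packed in their zipfile (as ``nltk.download()`` leaves them by
default):

    >>> zipped = nltk.data.find('tokenizers/punkt/english.pickle') # doctest: +SKIP
    >>> type(zipped).__name__ # doctest: +SKIP
    'ZipFilePathPointer'
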
Alternatively, the `nltk.data.load()` function can be used with the
keyword argument ``format="raw"``:

    >>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
    >>> print(s.decode())
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

Similarly, ``format="text"`` returns the contents as a string, with
no decoding required:

    >>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
    >>> print(s)
    PM denies knowledge of AWB kickbacks
    The Prime Minister has

Resource Caching
~~~~~~~~~~~~~~~~

NLTK maintains a cache of the resources that have been loaded.  If
you load a resource that is already stored in the cache, then the
cached copy will be returned.  This behavior can be seen by the trace
output generated when verbose=True:

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>
    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>

If you wish to load a resource from its source, bypassing the cache,
use the ``cache=False`` argument to `nltk.data.load()`.  This can be
useful, for example, if the resource is loaded from a local file, and
you are actively editing that file:

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', cache=False, verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>

The cache *no longer* uses weak references.  A resource will not be
automatically expunged from the cache when no more objects are using
it.  In the following example, when we clear the variable ``feat0``,
the reference count for the feature grammar object drops to zero.
However, the object remains cached:

    >>> del feat0
    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',
    ...                        verbose=True)
    <<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>

You can clear the entire contents of the cache, using
`nltk.data.clear_cache()`:

    >>> nltk.data.clear_cache()

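After clearing, a subsequent verbose load reads the resource from its
source again (a quick check, reusing the grammar loaded above):

    >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
    <<Loading nltk:grammars/book_grammars/feat0.fcfg>>
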
Retrieving other Data Sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    >>> formulas = nltk.data.load('grammars/book_grammars/background.fol')
    >>> for f in formulas: print(str(f))
    all x.(boxerdog(x) -> dog(x))
    all x.(boxer(x) -> person(x))
    all x.-(dog(x) & person(x))
    all x.(married(x) <-> exists y.marry(x,y))
    all x.(bark(x) -> dog(x))
    all x y.(marry(x,y) -> (person(x) & person(y)))
    -(Vincent = Mia)
    -(Vincent = Fido)
    -(Mia = Fido)

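The loaded items are `nltk.sem.logic` ``Expression`` objects, so they
can be inspected programmatically.  A quick check that the first,
universally quantified formula has no free variables:

    >>> formulas[0].free()
    set()
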
Regression Tests
~~~~~~~~~~~~~~~~

Create a temp dir for tests that write files:

    >>> import tempfile, os
    >>> tempdir = tempfile.mkdtemp()
    >>> old_dir = os.path.abspath('.')
    >>> os.chdir(tempdir)

The `retrieve()` function accepts all url types:

    >>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
    ...         'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
    ...         'nltk:grammars/sample_grammars/toy.cfg',
    ...         'grammars/sample_grammars/toy.cfg']
    >>> for i, url in enumerate(urls):
    ...     nltk.data.retrieve(url, 'toy-%d.cfg' % i) # doctest: +ELLIPSIS
    Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg'
    Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg'
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg'
    Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg'

Clean up the temp dir:

    >>> os.chdir(old_dir)
    >>> for f in os.listdir(tempdir):
    ...     os.remove(os.path.join(tempdir, f))
    >>> os.rmdir(tempdir)

Lazy Loader
-----------

A lazy loader is a wrapper object that defers loading a resource until
it is accessed or used in any way.  This is mainly intended for
internal use by NLTK's corpus readers.

    >>> # Create a lazy loader for toy.cfg.
    >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')

    >>> # Show that it's not loaded yet:
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.data.LazyLoader object at ...>'

    >>> # printing it is enough to cause it to be loaded:
    >>> print(ll)
    <Grammar with 14 productions>

    >>> # Show that it's now been loaded:
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.grammar.CFG object at ...>'

    >>> # Test that accessing an attribute also loads it:
    >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
    >>> ll.start()
    S
    >>> object.__repr__(ll) # doctest: +ELLIPSIS
    '<nltk.grammar.CFG object at ...>'

Buffered Gzip Reading and Writing
---------------------------------

Write performance to gzip-compressed files is extremely poor when the
files become large.  File creation can become a bottleneck in those
cases.

Read performance from large gzipped pickle files was improved in data.py
by buffering the reads.  A similar fix can be applied to writes by first
buffering the writes in memory.

This is mainly intended for internal use.  The test simply checks that
reading and writing work as intended; it does not measure how much
improvement the buffering provides.

    >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10)
    >>> ans = []
    >>> for i in range(10000):
    ...     ans.append(str(i).encode('ascii'))
    ...     test.write(str(i).encode('ascii'))
    >>> test.close()
    >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb')
    >>> test.read() == b''.join(ans)
    True
    >>> test.close()
    >>> import os
    >>> os.unlink('testbuf.gz')

JSON Encoding and Decoding
--------------------------

JSON serialization is used instead of pickle for some classes.

    >>> from nltk import jsontags
    >>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag

    >>> @jsontags.register_tag
    ... class JSONSerializable:
    ...     json_tag = 'JSONSerializable'
    ...
    ...     def __init__(self, n):
    ...         self.n = n
    ...
    ...     def encode_json_obj(self):
    ...         return self.n
    ...
    ...     @classmethod
    ...     def decode_json_obj(cls, obj):
    ...         n = obj
    ...         return cls(n)
    ...

    >>> JSONTaggedEncoder().encode(JSONSerializable(1))
    '{"!JSONSerializable": 1}'
    >>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n
    1
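
A round trip through the encoder and decoder should reconstruct an
equivalent object (a quick check using the class defined above):

    >>> obj = JSONTaggedDecoder().decode(JSONTaggedEncoder().encode(JSONSerializable(5)))
    >>> isinstance(obj, JSONSerializable), obj.n
    (True, 5)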