# -*- coding: utf-8 -*-
# Natural Language Toolkit: IBM Model 4
#
# Copyright (C) 2001-2020 NLTK Project
# Author: Tah Wei Hoon <hoon.tw@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
  8. """
  9. Translation model that reorders output words based on their type and
  10. distance from other related words in the output sentence.
  11. IBM Model 4 improves the distortion model of Model 3, motivated by the
  12. observation that certain words tend to be re-ordered in a predictable
  13. way relative to one another. For example, <adjective><noun> in English
  14. usually has its order flipped as <noun><adjective> in French.
  15. Model 4 requires words in the source and target vocabularies to be
  16. categorized into classes. This can be linguistically driven, like parts
  17. of speech (adjective, nouns, prepositions, etc). Word classes can also
  18. be obtained by statistical methods. The original IBM Model 4 uses an
  19. information theoretic approach to group words into 50 classes for each
  20. vocabulary.

Terminology:
Cept:
    A source word with non-zero fertility, i.e. aligned to one or more
    target words.
Tablet:
    The set of target word(s) aligned to a cept.
Head of cept:
    The first word of the tablet of that cept.
Center of cept:
    The average position of the words in that cept's tablet. If the
    value is not an integer, the ceiling is taken.
    For example, for a tablet with words in positions 2, 5, 6 in the
    target sentence, the center of the corresponding cept is
    ceil((2 + 5 + 6) / 3) = 5 (see the worked snippet after this list).
Displacement:
    For a head word, defined as (position of head word - position of
    previous cept's center). Can be positive or negative.
    For a non-head word, defined as (position of non-head word -
    position of previous word in the same tablet). Always positive,
    because successive words in a tablet are assumed to appear to the
    right of the previous word.
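
A minimal numeric check of these definitions in plain Python (the
tablet positions and the head-word position are invented for
illustration; this is not part of the model's API):

    >>> from math import ceil
    >>> tablet_positions = [2, 5, 6]  # target positions aligned to one cept
    >>> ceil(sum(tablet_positions) / len(tablet_positions))  # center of cept
    5
    >>> 4 - 5  # head word at j=4, previous cept's center at 5: displacement
    -1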

In contrast to Model 3, which reorders words in a tablet independently of
other words, Model 4 distinguishes between three cases.
(1) Words generated by NULL are distributed uniformly.
(2) For a head word t, its position is modeled by the probability
    d_head(displacement | word_class_s(s), word_class_t(t)),
    where s is the previous cept, and word_class_s and word_class_t map
    s and t to a source and target language word class respectively.
(3) For a non-head word t, its position is modeled by the probability
    d_non_head(displacement | word_class_t(t))
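
A minimal sketch of how such distortion probabilities are stored in
this module: nested dictionaries keyed by displacement, then word
class(es). The probability value below is invented:

    >>> from collections import defaultdict
    >>> d_head = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    >>> d_head[-1][0][2] = 0.7  # hypothetical P(dj=-1 | src class 0, trg class 2)
    >>> d_head[-1][0][2]
    0.7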

The EM algorithm used in Model 4 is:
E step - In the training data, collect counts, weighted by prior
         probabilities.
         (a) count how many times a source language word is translated
             into a target language word
         (b) for a particular word class, count how many times a head
             word is located at a particular displacement from the
             previous cept's center
         (c) for a particular word class, count how many times a
             non-head word is located at a particular displacement from
             the previous target word
         (d) count how many times a source word is aligned to phi number
             of target words
         (e) count how many times NULL is aligned to a target word

M step - Estimate new probabilities based on the counts from the E step
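
For example, if two sampled alignments have model probabilities in the
ratio 1:3, the E step collects their counts weighted 0.25 and 0.75
(the probabilities below are invented):

    >>> sample_probs = [2 ** -20, 3 * 2 ** -20]  # hypothetical alignment probabilities
    >>> total = sum(sample_probs)
    >>> [p / total for p in sample_probs]  # normalized weights for collected counts
    [0.25, 0.75]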

Like Model 3, there are too many possible alignments to consider. Thus,
a hill climbing approach is used to sample good candidates.

Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
l: Number of words in the source sentence, excluding NULL
m: Number of words in the target sentence
s: A word in the source language
t: A word in the target language
phi: Fertility, the number of target words produced by a source word
p1: Probability that a target word produced by a source word is
    accompanied by another target word that is aligned to NULL
p0: 1 - p1
dj: Displacement, Δj

References:
Philipp Koehn. 2010. Statistical Machine Translation.
Cambridge University Press, New York.

Peter E. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Linguistics, 19 (2),
263-311.
"""

import warnings
from collections import defaultdict
from math import factorial

from nltk.translate import AlignedSent
from nltk.translate import Alignment
from nltk.translate import IBMModel
from nltk.translate import IBMModel3
from nltk.translate.ibm_model import Counts
from nltk.translate.ibm_model import longest_target_sentence_length


class IBMModel4(IBMModel):
    """
    Translation model that reorders output words based on their type and
    their distance from other related words in the output sentence

    >>> bitext = []
    >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big']))
    >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
    >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
    >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
    >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
    >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
    >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
    >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 }
    >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }

    >>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)

    >>> print(round(ibm4.translation_table['buch']['book'], 3))
    1.0
    >>> print(round(ibm4.translation_table['das']['book'], 3))
    0.0
    >>> print(round(ibm4.translation_table['ja'][None], 3))
    1.0

    >>> print(round(ibm4.head_distortion_table[1][0][1], 3))
    1.0
    >>> print(round(ibm4.head_distortion_table[2][0][1], 3))
    0.0
    >>> print(round(ibm4.non_head_distortion_table[3][6], 3))
    0.5

    >>> print(round(ibm4.fertility_table[2]['summarize'], 3))
    1.0
    >>> print(round(ibm4.fertility_table[1]['book'], 3))
    1.0

    >>> print(ibm4.p1)
    0.033...

    >>> test_sentence = bitext[2]
    >>> test_sentence.words
    ['das', 'buch', 'ist', 'ja', 'klein']
    >>> test_sentence.mots
    ['the', 'book', 'is', 'small']
    >>> test_sentence.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
    """

    def __init__(
        self,
        sentence_aligned_corpus,
        iterations,
        source_word_classes,
        target_word_classes,
        probability_tables=None,
    ):
        """
        Train on ``sentence_aligned_corpus`` and create a lexical
        translation model, distortion models, a fertility model, and a
        model for generating NULL-aligned words.

        Translation direction is from ``AlignedSent.mots`` to
        ``AlignedSent.words``.

        :param sentence_aligned_corpus: Sentence-aligned parallel corpus
        :type sentence_aligned_corpus: list(AlignedSent)

        :param iterations: Number of iterations to run training algorithm
        :type iterations: int

        :param source_word_classes: Lookup table that maps a source word
            to its word class, the latter represented by an integer id
        :type source_word_classes: dict[str]: int

        :param target_word_classes: Lookup table that maps a target word
            to its word class, the latter represented by an integer id
        :type target_word_classes: dict[str]: int

        :param probability_tables: Optional. Use this to pass in custom
            probability values. If not specified, probabilities will be
            set to a uniform distribution, or some other sensible value.
            If specified, all the following entries must be present:
            ``translation_table``, ``alignment_table``,
            ``fertility_table``, ``p1``, ``head_distortion_table``,
            ``non_head_distortion_table``. See ``IBMModel`` and
            ``IBMModel4`` for the type and purpose of these tables.
        :type probability_tables: dict[str]: object
        """
        super(IBMModel4, self).__init__(sentence_aligned_corpus)
        self.reset_probabilities()
        self.src_classes = source_word_classes
        self.trg_classes = target_word_classes

        if probability_tables is None:
            # Get probabilities from IBM Model 3
            ibm3 = IBMModel3(sentence_aligned_corpus, iterations)
            self.translation_table = ibm3.translation_table
            self.alignment_table = ibm3.alignment_table
            self.fertility_table = ibm3.fertility_table
            self.p1 = ibm3.p1
            self.set_uniform_probabilities(sentence_aligned_corpus)
        else:
            # Set user-defined probabilities
            self.translation_table = probability_tables["translation_table"]
            self.alignment_table = probability_tables["alignment_table"]
            self.fertility_table = probability_tables["fertility_table"]
            self.p1 = probability_tables["p1"]
            self.head_distortion_table = probability_tables["head_distortion_table"]
            self.non_head_distortion_table = probability_tables[
                "non_head_distortion_table"
            ]

        for n in range(0, iterations):
            self.train(sentence_aligned_corpus)

    def reset_probabilities(self):
        super(IBMModel4, self).reset_probabilities()

        self.head_distortion_table = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: self.MIN_PROB))
        )
        """
        dict[int][int][int]: float. Probability(displacement of head
        word | word class of previous cept, target word class).
        Values accessed as
        ``head_distortion_table[dj][src_class][trg_class]``.
        """

        self.non_head_distortion_table = defaultdict(
            lambda: defaultdict(lambda: self.MIN_PROB)
        )
        """
        dict[int][int]: float. Probability(displacement of non-head
        word | target word class).
        Values accessed as ``non_head_distortion_table[dj][trg_class]``.
        """

    def set_uniform_probabilities(self, sentence_aligned_corpus):
        """
        Set distortion probabilities uniformly to
        1 / cardinality of displacement values
        """
        max_m = longest_target_sentence_length(sentence_aligned_corpus)

        # The maximum displacement is m-1, when a word is in the last
        # position m of the target sentence and the previously placed
        # word is in the first position.
        # Conversely, the minimum displacement is -(m-1).
        # Thus, the displacement range spans (m-1) - (-(m-1)), i.e.
        # 2 * (m-1) distinct values. Note that a displacement of zero
        # is impossible and is not included in the range.
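        # For example, if max_m is 4, the possible displacements are
        # {-3, -2, -1, 1, 2, 3}, so each is initialized to 1 / 6.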
        if max_m <= 1:
            initial_prob = IBMModel.MIN_PROB
        else:
            initial_prob = 1 / (2 * (max_m - 1))
        if initial_prob < IBMModel.MIN_PROB:
            warnings.warn(
                "A target sentence is too long ("
                + str(max_m)
                + " words). Results may be less accurate."
            )

        for dj in range(1, max_m):
            self.head_distortion_table[dj] = defaultdict(
                lambda: defaultdict(lambda: initial_prob)
            )
            self.head_distortion_table[-dj] = defaultdict(
                lambda: defaultdict(lambda: initial_prob)
            )
            self.non_head_distortion_table[dj] = defaultdict(lambda: initial_prob)
            self.non_head_distortion_table[-dj] = defaultdict(lambda: initial_prob)

    def train(self, parallel_corpus):
        counts = Model4Counts()
        for aligned_sentence in parallel_corpus:
            m = len(aligned_sentence.words)

            # Sample the alignment space
            sampled_alignments, best_alignment = self.sample(aligned_sentence)
            # Record the most probable alignment
            aligned_sentence.alignment = Alignment(
                best_alignment.zero_indexed_alignment()
            )

            # E step (a): Compute normalization factors to weigh counts
            total_count = self.prob_of_alignments(sampled_alignments)

            # E step (b): Collect counts
            for alignment_info in sampled_alignments:
                count = self.prob_t_a_given_s(alignment_info)
                normalized_count = count / total_count

                for j in range(1, m + 1):
                    counts.update_lexical_translation(
                        normalized_count, alignment_info, j
                    )
                    counts.update_distortion(
                        normalized_count,
                        alignment_info,
                        j,
                        self.src_classes,
                        self.trg_classes,
                    )

                counts.update_null_generation(normalized_count, alignment_info)
                counts.update_fertility(normalized_count, alignment_info)

        # M step: Update probabilities with maximum likelihood estimates
        # If any probability is less than MIN_PROB, clamp it to MIN_PROB
        existing_alignment_table = self.alignment_table
        self.reset_probabilities()
        self.alignment_table = existing_alignment_table  # don't retrain

        self.maximize_lexical_translation_probabilities(counts)
        self.maximize_distortion_probabilities(counts)
        self.maximize_fertility_probabilities(counts)
        self.maximize_null_generation_probabilities(counts)

    def maximize_distortion_probabilities(self, counts):
        head_d_table = self.head_distortion_table
        for dj, src_classes in counts.head_distortion.items():
            for s_cls, trg_classes in src_classes.items():
                for t_cls in trg_classes:
                    estimate = (
                        counts.head_distortion[dj][s_cls][t_cls]
                        / counts.head_distortion_for_any_dj[s_cls][t_cls]
                    )
                    head_d_table[dj][s_cls][t_cls] = max(estimate, IBMModel.MIN_PROB)

        non_head_d_table = self.non_head_distortion_table
        for dj, trg_classes in counts.non_head_distortion.items():
            for t_cls in trg_classes:
                estimate = (
                    counts.non_head_distortion[dj][t_cls]
                    / counts.non_head_distortion_for_any_dj[t_cls]
                )
                non_head_d_table[dj][t_cls] = max(estimate, IBMModel.MIN_PROB)

    def prob_t_a_given_s(self, alignment_info):
        """
        Probability of target sentence and an alignment given the
        source sentence
        """
        return IBMModel4.model4_prob_t_a_given_s(alignment_info, self)

    @staticmethod  # exposed for Model 5 to use
    def model4_prob_t_a_given_s(alignment_info, ibm_model):
        probability = 1.0
        MIN_PROB = IBMModel.MIN_PROB

        def null_generation_term():
            # Binomial distribution: B(m - null_fertility, p1)
            value = 1.0
            p1 = ibm_model.p1
            p0 = 1 - p1
            null_fertility = alignment_info.fertility_of_i(0)
            m = len(alignment_info.trg_sentence) - 1
            value *= pow(p1, null_fertility) * pow(p0, m - 2 * null_fertility)
            if value < MIN_PROB:
                return MIN_PROB

            # Combination: (m - null_fertility) choose null_fertility
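            # e.g. for m = 5 and null_fertility = 2, the loop below
            # multiplies value by C(3, 2) = 3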
            for i in range(1, null_fertility + 1):
                value *= (m - null_fertility - i + 1) / i
            return value

        def fertility_term():
            value = 1.0
            src_sentence = alignment_info.src_sentence
            for i in range(1, len(src_sentence)):
                fertility = alignment_info.fertility_of_i(i)
                value *= (
                    factorial(fertility)
                    * ibm_model.fertility_table[fertility][src_sentence[i]]
                )
                if value < MIN_PROB:
                    return MIN_PROB
            return value

        def lexical_translation_term(j):
            t = alignment_info.trg_sentence[j]
            i = alignment_info.alignment[j]
            s = alignment_info.src_sentence[i]
            return ibm_model.translation_table[t][s]

        def distortion_term(j):
            t = alignment_info.trg_sentence[j]
            i = alignment_info.alignment[j]
            if i == 0:
                # case 1: t is aligned to NULL
                return 1.0
            if alignment_info.is_head_word(j):
                # case 2: t is the first word of a tablet
                previous_cept = alignment_info.previous_cept(j)
                src_class = None
                if previous_cept is not None:
                    previous_s = alignment_info.src_sentence[previous_cept]
                    src_class = ibm_model.src_classes[previous_s]
                trg_class = ibm_model.trg_classes[t]
                dj = j - alignment_info.center_of_cept(previous_cept)
                return ibm_model.head_distortion_table[dj][src_class][trg_class]

            # case 3: t is a subsequent word of a tablet
            previous_position = alignment_info.previous_in_tablet(j)
            trg_class = ibm_model.trg_classes[t]
            dj = j - previous_position
            return ibm_model.non_head_distortion_table[dj][trg_class]

        # end nested functions

        # Abort computation whenever probability falls below MIN_PROB at
        # any point, since MIN_PROB can be considered as zero
        probability *= null_generation_term()
        if probability < MIN_PROB:
            return MIN_PROB
        probability *= fertility_term()
        if probability < MIN_PROB:
            return MIN_PROB

        for j in range(1, len(alignment_info.trg_sentence)):
            probability *= lexical_translation_term(j)
            if probability < MIN_PROB:
                return MIN_PROB
            probability *= distortion_term(j)
            if probability < MIN_PROB:
                return MIN_PROB

        return probability


class Model4Counts(Counts):
    """
    Data object to store counts of various parameters during training.
    Includes counts for distortion.
    """

    def __init__(self):
        super(Model4Counts, self).__init__()
        self.head_distortion = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: 0.0))
        )
        self.head_distortion_for_any_dj = defaultdict(lambda: defaultdict(lambda: 0.0))
        self.non_head_distortion = defaultdict(lambda: defaultdict(lambda: 0.0))
        self.non_head_distortion_for_any_dj = defaultdict(lambda: 0.0)

    def update_distortion(self, count, alignment_info, j, src_classes, trg_classes):
        i = alignment_info.alignment[j]
        t = alignment_info.trg_sentence[j]
        if i == 0:
            # case 1: t is aligned to NULL
            pass
        elif alignment_info.is_head_word(j):
            # case 2: t is the first word of a tablet
            previous_cept = alignment_info.previous_cept(j)
            if previous_cept is not None:
                previous_src_word = alignment_info.src_sentence[previous_cept]
                src_class = src_classes[previous_src_word]
            else:
                src_class = None
            trg_class = trg_classes[t]
            dj = j - alignment_info.center_of_cept(previous_cept)
            self.head_distortion[dj][src_class][trg_class] += count
            self.head_distortion_for_any_dj[src_class][trg_class] += count
        else:
            # case 3: t is a subsequent word of a tablet
            previous_j = alignment_info.previous_in_tablet(j)
            trg_class = trg_classes[t]
            dj = j - previous_j
            self.non_head_distortion[dj][trg_class] += count
            self.non_head_distortion_for_any_dj[trg_class] += count