16_3_NLP RNNs Encoder Decoder Multi-Head Attention_complexity_max path length_sequential

<p id="articleContentId">16_NLP stateful CharRNN_window_Tokenizer_stationary_celab_ResetState_character word level_regex_IMDb: <a href="https://blog.csdn.net/Linli522362242/article/details/115388298">https://blog.csdn.net/Linli522362242/article/details/115388298</a></p>
<p>16_2NLP RNN_colab tensorboard_os.curdir_Pretrained Embed_TrainingSampler_Encoder–Decoder_Greedy Search_Exhaustive Search_Beam search_masking : <a href="https://blog.csdn.net/Linli522362242/article/details/115518150">https://blog.csdn.net/Linli522362242/article/details/115518150</a></p>
<p><strong>(batch size, number of time steps or sequence length in tokens, d dimensions)</strong></p>
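<p>As a quick illustration of this shape convention, the minimal sketch below (the vocabulary size, batch size, sequence length, and embedding dimension are made-up values) embeds a batch of token IDs and prints the resulting (batch size, sequence length, d) tensor shape:</p>
<pre>
import tensorflow as tf

# Illustrative sizes only: 4 sentences, 7 tokens each, 128-dimensional embeddings
batch_size, seq_len, d = 4, 7, 128
token_ids = tf.random.uniform((batch_size, seq_len), maxval=1000, dtype=tf.int32)

embed = tf.keras.layers.Embedding(input_dim=1000, output_dim=d)
print(embed(token_ids).shape)   # (4, 7, 128) = (batch size, sequence length, d)
</pre>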
<h1>Attention Mechanisms</h1>
<p><img alt="" height="376" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-30b1d3317b2a9b64d03c7486a2a4400c.png" width="554">Figure 16-3. A simple machine translation model(<strong>just sending the encoder’s final hidden state to the decoder</strong>)</p>
<p>     Consider the path from the word “milk” to its translation “lait” in Figure 16-3: it is quite long! This means that a representation of this word (along with all the other words) needs to be carried over many steps before it is actually used. Can’t we make this path shorter? (Each word’s translation in the sequence (or sentence) is one time step. When beam search is used for translation, every word after the first must choose its translation with an eye to the highest score of the <strong>final combination of translations</strong> of all the words (the choice for the current word can only be settled after the next few words are translated, sometimes not until the end of the sentence &lt;eos&gt;). This shows that the translations of the individual words are interdependent, and here that dependence is turned directly into weights. In other words, when each word is translated, the weights point directly to the translation that lets the <strong>final combination of translations</strong> of all the words achieve the highest score.)</p>
<p>     This was the core idea in a groundbreaking 2014 paper(<em>Dzmitry Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint arXiv:1409.0473 (2014).</em>) by Dzmitry Bahdanau et al. They introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where <strong><span style="color:#7c79e5;">the decoder needs to output the word “</span><span style="color:#e579b6;">lait</span><span style="color:#7c79e5;">,” it will focus its attention on the word “</span><span style="color:#86ca5e;">milk</span><span style="color:#7c79e5;">.” This means that the path from an</span><span style="color:#86ca5e;"> input</span><span style="color:#7c79e5;"> word to its </span><span style="color:#e579b6;">translation</span><span style="color:#7c79e5;"> is now much shorter</span>, </strong><strong>so the short-term memory limitations of RNNs have much less impact.</strong> Attention mechanisms revolutionized neural machine translation (and NLP in general), allowing a significant improvement in the state of the art, especially for long sentences (over 30 words)(<em>The most common metric used in NMT is the BiLingual Evaluation Understudy (BLEU) score, which compares each translation produced by the model with several good translations produced by humans: it counts the number of n-grams (sequences of n words) that appear in any of the target translations and adjusts the score to take into account the frequency of the produced n-grams in the target translations.</em>).</p>
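<p>In code, the core of Bahdanau’s idea is an <em>alignment model</em>: a small network that scores how well each encoder output matches the decoder’s current state, turns the scores into weights with a softmax, and returns a weighted sum of the encoder outputs (the context vector). The sketch below is a minimal, hand-rolled Keras layer written for illustration only; the layer name, the <code>units</code> size, and the tensor shapes are assumptions, not the paper’s or the book’s exact implementation.</p>
<pre>
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(s, h) = v · tanh(W1·h + W2·s)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects the encoder outputs h
        self.W2 = tf.keras.layers.Dense(units)   # projects the decoder state s
        self.v  = tf.keras.layers.Dense(1)       # collapses to one score per source token

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, d_dec) -> (batch, 1, units) so it broadcasts over src_len
        query = tf.expand_dims(decoder_state, 1)
        # scores: (batch, src_len, 1), one alignment score per source position
        scores = self.v(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        weights = tf.nn.softmax(scores, axis=1)               # attention weights over the source
        # context vector: weighted sum of all encoder outputs, (batch, d_enc)
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)
        return context, weights
</pre>
<p>At decoding time step t, the decoder would call such a layer with its current hidden state and all the encoder outputs, then combine the returned context vector with its input before producing the next token.</p>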
<p><img alt="" height="296" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-db0fc7288958b6669d4c23176877aa27.png" width="465">Figure 16-6. <span style="color:#7c79e5;"><strong>N</strong></span>eural <span style="color:#7c79e5;"><strong>m</strong></span>achine <span style="color:#7c79e5;"><strong>t</strong></span>ranslation using an Encoder–Decoder network with an attention model</p>
<p>     Figure 16-6 shows this model’s architecture (slightly simplified, as we will see). On the left, you have the<span style="color:#f33b45;"><strong> encoder</strong></span> and the decoder. Instead of just sending the encoder’s final hidden state to the decoder (which is still done, although it is not shown in the figure), <span style="color:#7c79e5;"><strong>we now send all of </strong></span><span style="color:#f33b45;"><strong>its</strong></span><span style="color:#7c79e5;"><strong> outputs</strong></span><img alt="" height="25" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-8968232a83043777b252a518335a7c98.png" width="236"><span style="color:#7c79e5;"><strong> to the decoder</strong>. <strong>At each time step, </strong></span><strong>the decoder’s memory cell computes </strong><span style="color:#7c79e5;"><strong>a weighted sum of all these encoder outputs</strong></span><strong>: this determines which words it will focus on at this step. The weight α<sub>(t,i)</sub> is the weight of the i-th encoder output at the t-th decoder time step.</strong></p>
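<p>Keras ships a built-in layer for exactly this weighted sum, <code>tf.keras.layers.AdditiveAttention</code> (Bahdanau-style). The short sketch below is only meant to show the tensor shapes involved; the batch size, sentence lengths, and state dimension are made-up values. The layer takes the decoder states as queries and all the encoder outputs as values, and returns one context vector per decoder time step:</p>
<pre>
import tensorflow as tf

# Made-up shapes: 64 sentence pairs, 10 source tokens, 8 target steps, 512-dim states
encoder_outputs = tf.random.normal((64, 10, 512))   # all encoder outputs (keys/values)
decoder_states  = tf.random.normal((64,  8, 512))   # decoder state at each step (queries)

attention = tf.keras.layers.AdditiveAttention()      # Bahdanau-style additive attention
# For each of the 8 decoder steps: a weighted sum over the 10 encoder outputs
context = attention([decoder_states, encoder_outputs])
print(context.shape)                                  # (64, 8, 512)
</pre>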