16_3_NLP RNNs Encoder Decoder Multi-Head Attention_complexity_max path length_sequential

<p id="articleContentId">16_NLP stateful CharRNN_window_Tokenizer_stationary_celab_ResetState_character word level_regex_IMDb: <a href="https://blog.csdn.net/Linli522362242/article/details/115388298">https://blog.csdn.net/Linli522362242/article/details/115388298</a></p>
<p>16_2NLP RNN_colab tensorboard_os.curdir_Pretrained Embed_TrainingSampler_Encoder–Decoder_Greedy Search_Exhaustive Search_Beam search_masking : <a href="https://blog.csdn.net/Linli522362242/article/details/115518150">https://blog.csdn.net/Linli522362242/article/details/115518150</a></p>
<p><strong>(batch size, number of time steps or sequence length in tokens, d dimensions)</strong></p>
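<p>As a quick illustration of this shape convention, the minimal sketch below (the vocabulary size, batch size, sequence length, and embedding dimension are made-up values) embeds a batch of token IDs and prints the resulting (batch size, sequence length, d) tensor shape:</p>
<pre>
import tensorflow as tf

# Illustrative sizes only: 4 sentences, 7 tokens each, 128-dimensional embeddings
batch_size, seq_len, d = 4, 7, 128
token_ids = tf.random.uniform((batch_size, seq_len), maxval=1000, dtype=tf.int32)

embed = tf.keras.layers.Embedding(input_dim=1000, output_dim=d)
print(embed(token_ids).shape)   # (4, 7, 128) = (batch size, sequence length, d)
</pre>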
<h1>Attention Mechanisms</h1>
<p><img alt="" height="376" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-30b1d3317b2a9b64d03c7486a2a4400c.png" width="554">Figure 16-3. A simple machine translation model(<strong>just sending the encoder’s final hidden state to the decoder</strong>)</p>
<p>     Consider the path from the word “milk” to its translation “lait” in Figure 16-3: it is quite long! This means that a representation of this word (along with all the other words) needs to be carried over many steps before it is actually used. Can’t we make this path shorter? (Each word’s translation in the sequence (or sentence) is one time step. When beam search is used for translation, every word after the first must choose its translation with an eye to the highest score of the <strong>final combination of translations</strong> of all the words (the choice for the current word can only be settled after the next few words are translated, sometimes not until the end of the sentence &lt;eos&gt;). This shows that the translations of the individual words are interdependent, and here that dependence is turned directly into weights. In other words, when each word is translated, the weights point directly to the translation that lets the <strong>final combination of translations</strong> of all the words achieve the highest score.)</p>
<p>     This was the core idea in a groundbreaking 2014 paper(<em>Dzmitry Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint arXiv:1409.0473 (2014).</em>) by Dzmitry Bahdanau et al. They introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where <strong><span style="color:#7c79e5;">the decoder needs to output the word “</span><span style="color:#e579b6;">lait</span><span style="color:#7c79e5;">,” it will focus its attention on the word “</span><span style="color:#86ca5e;">milk</span><span style="color:#7c79e5;">.” This means that the path from an</span><span style="color:#86ca5e;"> input</span><span style="color:#7c79e5;"> word to its </span><span style="color:#e579b6;">translation</span><span style="color:#7c79e5;"> is now much shorter</span>, </strong><strong>so the short-term memory limitations of RNNs have much less impact.</strong> Attention mechanisms revolutionized neural machine translation (and NLP in general), allowing a significant improvement in the state of the art, especially for long sentences (over 30 words)(<em>The most common metric used in NMT is the BiLingual Evaluation Understudy (BLEU) score, which compares each translation produced by the model with several good translations produced by humans: it counts the number of n-grams (sequences of n words) that appear in any of the target translations and adjusts the score to take into account the frequency of the produced n-grams in the target translations.</em>).</p>
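<p>In code, the core of Bahdanau’s idea is an <em>alignment model</em>: a small network that scores how well each encoder output matches the decoder’s current state, turns the scores into weights with a softmax, and returns a weighted sum of the encoder outputs (the context vector). The sketch below is a minimal, hand-rolled Keras layer written for illustration only; the layer name, the <code>units</code> size, and the tensor shapes are assumptions, not the paper’s or the book’s exact implementation.</p>
<pre>
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(s, h) = v · tanh(W1·h + W2·s)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects the encoder outputs h
        self.W2 = tf.keras.layers.Dense(units)   # projects the decoder state s
        self.v  = tf.keras.layers.Dense(1)       # collapses to one score per source token

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, d_dec) -> (batch, 1, units) so it broadcasts over src_len
        query = tf.expand_dims(decoder_state, 1)
        # scores: (batch, src_len, 1), one alignment score per source position
        scores = self.v(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        weights = tf.nn.softmax(scores, axis=1)               # attention weights over the source
        # context vector: weighted sum of all encoder outputs, (batch, d_enc)
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)
        return context, weights
</pre>
<p>At decoding time step t, the decoder would call such a layer with its current hidden state and all the encoder outputs, then combine the returned context vector with its input before producing the next token.</p>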
<p><img alt="" height="296" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-db0fc7288958b6669d4c23176877aa27.png" width="465">Figure 16-6. <span style="color:#7c79e5;"><strong>N</strong></span>eural <span style="color:#7c79e5;"><strong>m</strong></span>achine <span style="color:#7c79e5;"><strong>t</strong></span>ranslation using an Encoder–Decoder network with an attention model</p>
<p>     Figure 16-6 shows this model’s architecture (slightly simplified, as we will see). On the left, you have the<span style="color:#f33b45;"><strong> encoder</strong></span> and the decoder. Instead of just sending the encoder’s final hidden state to the decoder (which is still done, although it is not shown in the figure), <span style="color:#7c79e5;"><strong>we now send all of </strong></span><span style="color:#f33b45;"><strong>its</strong></span><span style="color:#7c79e5;"><strong> outputs</strong></span><img alt="" height="25" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-8968232a83043777b252a518335a7c98.png" width="236"><span style="color:#7c79e5;"><strong> to the decoder</strong>. <strong>At each time step, </strong></span><strong>the decoder’s memory cell computes </strong><span style="color:#7c79e5;"><strong>a weighted sum of all these encoder outputs</strong></span><strong>: this determines which words it will focus on at this step. The weight α<sub>(t,i)</sub> is the weight of the i-th encoder output at the t-th decoder time step.</strong></p>
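<p>Keras ships a built-in layer for exactly this weighted sum, <code>tf.keras.layers.AdditiveAttention</code> (Bahdanau-style). The short sketch below is only meant to show the tensor shapes involved; the batch size, sentence lengths, and state dimension are made-up values. The layer takes the decoder states as queries and all the encoder outputs as values, and returns one context vector per decoder time step:</p>
<pre>
import tensorflow as tf

# Made-up shapes: 64 sentence pairs, 10 source tokens, 8 target steps, 512-dim states
encoder_outputs = tf.random.normal((64, 10, 512))   # all encoder outputs (keys/values)
decoder_states  = tf.random.normal((64,  8, 512))   # decoder state at each step (queries)

attention = tf.keras.layers.AdditiveAttention()      # Bahdanau-style additive attention
# For each of the 8 decoder steps: a weighted sum over the 10 encoder outputs
context = attention([decoder_states, encoder_outputs])
print(context.shape)                                  # (64, 8, 512)
</pre>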