{"id":71,"date":"2023-02-21T22:53:52","date_gmt":"2023-02-21T22:53:52","guid":{"rendered":"https:\/\/thedataprocessblog.com\/?p=71"},"modified":"2023-02-21T23:44:46","modified_gmt":"2023-02-21T23:44:46","slug":"why-transformers-changed-everything-lstms","status":"publish","type":"post","link":"https:\/\/thedataprocessblog.com\/?p=71","title":{"rendered":"Why transformers changed everything: LSTMs"},"content":{"rendered":"\n<p>After the problems of simple RNNs discussed previously, researchers proposed a new approach to memory-related models: the LSTM.<\/p>\n\n\n\n<h3>Long Short-Term Memory (LSTM) architecture<\/h3>\n\n\n\n<p>Although LSTM models are a variant of Recurrent Neural Networks, they introduce important changes. The central idea is to separate the cell state from the current input values. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/img\/LSTM3-chain.png\" alt=\"\" width=\"522\" height=\"196\"\/><figcaption class=\"wp-element-caption\">Source: https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/<\/figcaption><\/figure><\/div>\n\n\n<p>The top horizontal line is called the cell state; it is considered the long-term &#8220;memory&#8221; of the cell. The other special components of the architecture are the gates: whenever you see a multiplication sign in the diagram, it marks a gate. 
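<\/p>

<p>To make the gating mechanism concrete, here is a minimal NumPy sketch (the values are illustrative, not taken from the post): the sigmoid squashes a gate pre-activation into the interval (0, 1), and the elementwise multiplication then decides how much of each component of the state passes through.<\/p>

```python
import numpy as np

def sigmoid(x):
    # maps any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical cell state and gate pre-activations (illustrative values)
cell_state = np.array([2.0, -1.0, 0.5])
gate_preact = np.array([8.0, 0.0, -8.0])  # large -> keep, very negative -> forget

gate = sigmoid(gate_preact)      # roughly [1.0, 0.5, 0.0]
gated_state = gate * cell_state  # the multiplication acts as the gate
```

<p>Components whose gate value is near 1 pass through almost unchanged, while components whose gate value is near 0 are effectively forgotten.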
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-1.png\" alt=\"\" class=\"wp-image-72\" width=\"383\" height=\"71\" srcset=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-1.png 950w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-1-300x56.png 300w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-1-768x143.png 768w\" sizes=\"(max-width: 383px) 100vw, 383px\" \/><figcaption class=\"wp-element-caption\">Source: https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/<\/figcaption><\/figure><\/div>\n\n\n<h4>Gates<\/h4>\n\n\n\n<p>Each gate is preceded by a sigmoid activation function. Since the sigmoid output ranges from 0 to 1, the multiplication operation works as a gate, letting some information pass to the next stage while the rest is forgotten. This way, the cell state can be updated so that the model remembers only what is important at the moment.<\/p>\n\n\n\n<h4>The output<\/h4>\n\n\n\n<p>The output of the model is the value <em>h<\/em>, which can be viewed as a filtered version of the cell state. 
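<\/p>

<p>Putting the pieces of the diagram together, one step of an LSTM cell can be sketched in NumPy as follows (the weight layout, sizes, and names here are illustrative assumptions, not the post's notation). The last line of the update computes <em>h<\/em> by filtering the squashed cell state with the output gate.<\/p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step; W maps [h_prev; x] to the four gate pre-activations."""
    z = np.concatenate([h_prev, x]) @ W + b       # shape (4 * hidden,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)               # update the long-term cell state
    h = o * np.tanh(c)                            # h: filtered view of the cell state
    return h, c

# illustrative sizes: input dim 3, hidden dim 2
rng = np.random.default_rng(0)
hidden, inp = 2, 3
W = rng.normal(size=(hidden + inp, 4 * hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), W, b)
```

<p>Because the output gate lies in (0, 1) and <em>tanh<\/em> lies in (-1, 1), every component of <em>h<\/em> stays strictly inside (-1, 1).<\/p>

<p>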
It is filtered by the last gate on the right, which takes as input the cell state passed through a <em>tanh<\/em> function.<\/p>\n\n\n\n<h4>Other variants<\/h4>\n\n\n\n<h5>Peephole connections<\/h5>\n\n\n\n<p>Introduced by <a href=\"ftp:\/\/ftp.idsia.ch\/pub\/juergen\/TimeCount-IJCNN2000.pdf\">Gers &amp; Schmidhuber (2000)<\/a>, the peephole approach provides the gates with the cell state itself as an additional input.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-2-1024x409.png\" alt=\"\" class=\"wp-image-73\" width=\"555\" height=\"221\" srcset=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-2-1024x409.png 1024w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-2-300x120.png 300w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-2-768x307.png 768w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-2.png 1032w\" sizes=\"(max-width: 555px) 100vw, 555px\" \/><figcaption class=\"wp-element-caption\">Source: https:\/\/d3i71xaburhd42.cloudfront.net\/545a4e23bf00ddbc1d3325324b4c61f57cf45081\/2-Figure1-1.png<\/figcaption><\/figure><\/div>\n\n\n<h5>Gated Recurrent Unit (GRU)<\/h5>\n\n\n\n<p>Introduced by <a href=\"http:\/\/arxiv.org\/pdf\/1406.1078v3.pdf\">Cho, et al. (2014)<\/a>, it merges the input and forget gates into a single gate, called the &#8220;update gate&#8221;. The result is a simpler architecture that is easier to train and computationally cheaper. 
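<\/p>

<p>The &#8220;computationally cheaper&#8221; claim is easy to quantify by counting parameters: an LSTM has four weight blocks (three gates plus the cell candidate), while a GRU has only three (update gate, reset gate, candidate state). The sketch below assumes the textbook formulation with one bias vector per block; exact counts vary slightly between libraries.<\/p>

```python
def lstm_params(input_size, hidden_size):
    # 4 blocks (input, forget, output gates + cell candidate), each with
    # an input weight, a recurrent weight, and a bias
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def gru_params(input_size, hidden_size):
    # 3 blocks (update gate, reset gate, candidate state)
    return 3 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256))   # 295680
```

<p>For the same layer sizes, the GRU needs exactly 25% fewer parameters, which translates directly into less computation per step.<\/p>

<p>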
This variant has gained a lot of popularity over the years.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3-1024x316.png\" alt=\"\" class=\"wp-image-74\" width=\"551\" height=\"169\" srcset=\"https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3-1024x316.png 1024w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3-300x93.png 300w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3-768x237.png 768w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3-1536x474.png 1536w, https:\/\/thedataprocessblog.com\/wp-content\/uploads\/2023\/02\/image-3.png 1826w\" sizes=\"(max-width: 551px) 100vw, 551px\" \/><figcaption class=\"wp-element-caption\">Source: https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/<\/figcaption><\/figure><\/div>\n\n\n<h3>Conclusion<\/h3>\n\n\n\n<p>The introduction of the LSTM architecture opened up far more possibilities for solving problems. The proposed model and its variants have been used successfully in many memory-related applications.<\/p>\n\n\n\n<p>The next big step, which will be discussed in the next article, is the Attention architecture introduced by Google: an approach that, again, changed everything in the machine learning world and created many more possibilities.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After the problems of simple RNNs discussed previously, researchers proposed a new approach to memory-related models: the LSTM. Long Short-Term Memory (LSTM) architecture Although LSTM models are a variant of Recurrent Neural Networks, they introduce important changes. 
The whole idea is to separate &hellip; <a href=\"https:\/\/thedataprocessblog.com\/?p=71\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Why transformers changed everything: LSTMs<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/posts\/71"}],"collection":[{"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=71"}],"version-history":[{"count":1,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/posts\/71\/revisions"}],"predecessor-version":[{"id":75,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=\/wp\/v2\/posts\/71\/revisions\/75"}],"wp:attachment":[{"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=71"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=71"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thedataprocessblog.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=71"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}