<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[kozistr | Feed]]></title><description><![CDATA[all-rounder]]></description><link>http://kozistr.tech</link><generator>GatsbyJS</generator><lastBuildDate>Wed, 03 Sep 2025 12:18:06 GMT</lastBuildDate><item><title><![CDATA[2023 Retrospective (feat. the end of my alternative military service)]]></title><description><![CDATA[Prologue As I start writing this, less than a week of the year is left. Looking back, it was a year in which one important chapter closed and a new one began, and in which I felt a lot and took stock of what I had missed along the way. All in all, it was a fairly satisfying year. Thinking of the big events first…]]></description><link>http://kozistr.tech/2023-12-26-review-2023/</link><guid isPermaLink="false">http://kozistr.tech/2023-12-26-review-2023/</guid><pubDate>Tue, 26 Dec 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;prologue&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#prologue&quot; aria-label=&quot;prologue permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Prologue&lt;/h2&gt;
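&lt;p&gt;As I start writing this, less than a week of the year is left. Looking back, it was a year in which one important chapter closed and a new one began, and in which I felt a lot and took stock of what I had missed along the way. All in all, it was a fairly satisfying year.&lt;/p&gt;
&lt;p&gt;Thinking of the big events first, there were three: Kaggle, finishing my alternative military service (plus leaving my job), and moving to a new company.&lt;/p&gt;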
&lt;h2 id=&quot;kaggle&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#kaggle&quot; aria-label=&quot;kaggle permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Kaggle&lt;/h2&gt;
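&lt;p&gt;Lately I have had neither the hardware nor the time for Kaggle because work has been so busy, but early this year I put in a lot of effort and got fairly good results. My mid-term goal was a solo gold medal in a competition. In one competition I missed it right in front of me because of my final submission choice, and in the others I also finished close to the gold medal zone. That took me into the 200s in the global ranking, for a highest competition rank in the &lt;strong&gt;top 0.1%&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Regardless of the results, my ceiling used to be around 20th to 30th place, whereas this year I could produce a 10th to 20th place solution even in a short time. That is where I felt I had grown, and it is what meant the most to me. My ultimate goal is to reach the level where I can produce top-10 solutions, and in concrete terms to become a competition grandmaster, so I need to keep working at it.&lt;/p&gt;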
&lt;h2 id=&quot;드디어-병역특례-끝&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%93%9C%EB%94%94%EC%96%B4-%EB%B3%91%EC%97%AD%ED%8A%B9%EB%A1%80-%EB%81%9D&quot; aria-label=&quot;드디어 병역특례 끝 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Alternative military service, finally over&lt;/h2&gt;
&lt;h3 id=&quot;다사다난한-병특&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%8B%A4%EC%82%AC%EB%8B%A4%EB%82%9C%ED%95%9C-%EB%B3%91%ED%8A%B9&quot; aria-label=&quot;다사다난한 병특 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;An eventful service&lt;/h3&gt;
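&lt;p&gt;My actual service was &lt;code class=&quot;language-text&quot;&gt;2 years and 10 months&lt;/code&gt;, but counting from when I started preparing, it feels like I served close to 4 years. Even my friends ask, &quot;Didn&apos;t you serve for about 4 years?&quot; The service that seemed like it would never end is finally over.&lt;/p&gt;
&lt;p&gt;A lot has happened between the moment I decided to do the service and now. To list the big ones: the company I joined expecting to serve there suddenly lost its designated-company status because of a mistaken whistleblower report; I nearly ended up at a company (?) that would probably have changed my life quite a bit; thankfully, the company I moved to collected the &lt;em&gt;Dragon Balls&lt;/em&gt; and secured a slot (TO) so I could start; and I even changed jobs in the middle of the service. Even the people around me agree it was an eventful one.&lt;/p&gt;
&lt;p&gt;Even looking back now that it is over, all I can think is that I am glad it is over, lol.&lt;/p&gt;
&lt;p&gt;% Dragon Balls: a scheme (as of 2021) under which a company that gathers three grade-4 alternative service personnel is granted one slot (TO) for active-duty alternative service.&lt;/p&gt;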
&lt;h3 id=&quot;느낀-점-feat-다녔던-회사들&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%8A%90%EB%82%80-%EC%A0%90-feat-%EB%8B%A4%EB%85%94%EB%8D%98-%ED%9A%8C%EC%82%AC%EB%93%A4&quot; aria-label=&quot;느낀 점 feat 다녔던 회사들 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;What I learned (feat. the companies I worked at)&lt;/h3&gt;
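&lt;p&gt;At the end of September this year my service ended and I resigned right away. The reasons would take long to write down, but in short I wanted to do something new, and as described below, I joined a new company at the end of October. I had never really taken a proper break, so I spent a month resting at home and looking back, and a lot went through my mind. It may sound obvious, and I did not see it at the time, but in hindsight there was always someone at each turning point to whom I am grateful for shaping my growth, both inwardly and outwardly.&lt;/p&gt;
&lt;p&gt;Starting with VoyagerX, my first company in 2019: looking back, it was close to a perfect company by my standards. Everyone was smart and a colleague who felt like a good friend. Honestly, I still have not found a place that feels like that team, and it is the one I miss the most. Back then I was very introverted and had almost no social skills, which worried me, and there are things I did that I still regret. It was a place that made up for those shortcomings, and I think I grew a lot as a person there.&lt;/p&gt;
&lt;p&gt;Next was Banksalad, which I joined specifically for the service (&lt;del&gt;though in the end I could not do it there&lt;/del&gt;). If VoyagerX was mostly about inner growth, here the team culture and the technology were shaped by people who had worked in Silicon Valley, and I learned a great deal about engineering in general.&lt;/p&gt;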
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ca94bbd51330587f890638e028de3d54/37523/dunning_kruger.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 66.89189189189189%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAANCAIAAAAmMtkJAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABcUlEQVR42pWS23KDIBCG89S96lWfos/QF+ltp9Ek1uIRUcwoKnJQIxTJYZpOk2n/C2GEb/ffZVd93yulBqNxnKxGq+kPWgkhCCFhGA6D1P/UApslyzLKuDyo7XbrOI7v+/M838GU1QnOcYERxmuvJE0G06qqblEEk4rwq8yoLDkAyvPsBZNT/QKOo84g3PngM8yLwhS7wD2lQZIMUaTDSI/TTa8IazEkKHNdN01TKeUCd20LomhIUlXCPn6lAgyyXhyY0vRSeUc3m7fHePOAipfFk1LXDcOFBPHMYxg9584TY+/29NKzg0p3aokoLt36BqepCILTzY5rVGjajYdZzkpPk85zXdW3nyoIjrCap8XtOOi6bmKEAVT7vabUWlU/GnmGAZDHzOd6zJCt16672VHa3RuSnrHA9ynGZiYF58KKMQYhTJKkLEszu+KbTFzzNT9XpuMmBuP8w/cRQk3bEqumaQAA5lXMhlzL8zwTl1L6BVOx6kYV8GbOAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;dunning-kruger&quot;
        title=&quot;&quot;
        src=&quot;/static/ca94bbd51330587f890638e028de3d54/fcda8/dunning_kruger.png&quot;
        srcset=&quot;/static/ca94bbd51330587f890638e028de3d54/12f09/dunning_kruger.png 148w,
/static/ca94bbd51330587f890638e028de3d54/e4a3f/dunning_kruger.png 295w,
/static/ca94bbd51330587f890638e028de3d54/fcda8/dunning_kruger.png 590w,
/static/ca94bbd51330587f890638e028de3d54/37523/dunning_kruger.png 720w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
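&lt;p&gt;My skill and mindset at the time can be summed up in this one picture: I was at the peak of &lt;strong&gt;Mount Stupid&lt;/strong&gt;, with confidence running ahead of ability, and this seems to be the place that helped me climb down from it.&lt;/p&gt;
&lt;p&gt;I am especially grateful to Gyeoul (겨울님), who reviewed so much of my code. The culture, skills, and mindset I picked up at Banksalad have had a big influence on who I am now.&lt;/p&gt;
&lt;p&gt;After that I moved to Watcha for the service. Watcha was a good place to raise my professional skills, and more importantly, they collected the &lt;em&gt;Dragon Balls&lt;/em&gt; and secured a service slot (TO) for me. I had my shortcomings, but because the CTO and my teammates trusted me, I got to try a lot of the work I wanted to do. It is a company with real romance to it, and the freedom in how we worked, at least, was great.&lt;/p&gt;
&lt;p&gt;Finally, I moved to Toss and finished my service there. Honestly, the main reason for the move was curiosity: what is a big company like? The short conclusion is that a startup suits me best.&lt;/p&gt;
&lt;p&gt;Still, it was a place where I met many different kinds of people and made good connections, to the point that the reason I ended up at my current company is someone I met at Toss.&lt;/p&gt;
&lt;p&gt;To wrap up what I learned across four companies over this period: each one filled in something I was missing, and I grew both inwardly and outwardly. I will give it a happy ending and call it a service that was worth more than the trouble it cost.&lt;/p&gt;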
&lt;h2 id=&quot;다시-스타트업으로-그런데-초기-멤버로&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%8B%A4%EC%8B%9C-%EC%8A%A4%ED%83%80%ED%8A%B8%EC%97%85%EC%9C%BC%EB%A1%9C-%EA%B7%B8%EB%9F%B0%EB%8D%B0-%EC%B4%88%EA%B8%B0-%EB%A9%A4%EB%B2%84%EB%A1%9C&quot; aria-label=&quot;다시 스타트업으로 그런데 초기 멤버로 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Back to a startup, this time as an early member&lt;/h2&gt;
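&lt;p&gt;I only noticed recently when I looked at the calendar: it has already been about two months since I changed jobs.&lt;/p&gt;
&lt;p&gt;At my current company we build LLM applications such as chatbots. There is no best practice in the market yet, and having to pioneer both the market and the technology ourselves sounded fun, so I joined, and I am enjoying the work and am deeply immersed in it.&lt;/p&gt;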
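&lt;p&gt;As shown below, I have been writing up and sharing, as documents, what I develop for the RAG pipeline, driven by product needs and commercialization. Recently the product has become fast and accurate enough to be commercially viable, so I plan to push even harder. Still, I hope it is more or less wrapped up within 10 installments of &quot;Pain of Creation&quot;.&lt;/p&gt;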
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 181px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a2a71e34a40e89ca096c175181bd0bfe/74c4e/pain_of_creation.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 79.05405405405406%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAIAAACZeshMAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACNklEQVR42p1Ty3abMBD1//9DN121XbTdJH0lp+fUeScGYmxjx2BiECCMQEiYhwQ2HZzUTdIueqrFcGfOXJi5V/Tats2yLI4JAFnL7XZbSwm4qsqMsxzOel0JyKpNXddSeMiJadrU9Waz6UFfEHhDXc/SaHDR/3jw7fD9W31smHfGxcWJPlR1/fZW1+azOfZcYzo+PvqsG3fa4HoVxx2ZELJYWEWx5pT8OD2zLTNJ05QmjuuWRU5IxDhL4oSlKecM+wi+DLiqqo4MA8C0ANqH2D7A7R49pPvKHnfkmBDbthtZIseJYrpcmIzxEPumObXvF467RJ4b4lXOWRgGV5enpr0MA3+dFx3Z85B2q7EkGquXr16/OfzwbjiawCtubq7ubWs6nczuDNu0A+RYULw+n9vLwdUZjpLeTtgqy3jTNDBRkiRC1lAEaUFpGFFKsKCVQkDDZrspihz6f48NSdcHBCnrBhyR9c6J3WrwbMBIISSQt09EeSRjjEcjXYhyNFQVTVNVZW6Z5yf9hDHkLGeT8ZdPBy5CxmSUl+Kplh05DPF0apRlCTYsHYfSBIfYQ27KWRytYhJZ1hzMY4y1z0/vqfQQfoFdfJz8pVXPyJRS5CEAKxz4GFzwY0p9D5WiytcZjYmmKknKSLQCMV6SwSpFUeASj4bKQFVg/9l8dn7aJ5TeWybs/P34yA/846OvjGcvd+ach2EIYhb5mvOsA0UhRAUqwB8hpYC7Wu3AX9T+79P7897+O/knc5yFq0hXY6cAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;pain_of_creation&quot;
        title=&quot;&quot;
        src=&quot;/static/a2a71e34a40e89ca096c175181bd0bfe/74c4e/pain_of_creation.png&quot;
        srcset=&quot;/static/a2a71e34a40e89ca096c175181bd0bfe/12f09/pain_of_creation.png 148w,
/static/a2a71e34a40e89ca096c175181bd0bfe/74c4e/pain_of_creation.png 181w&quot;
        sizes=&quot;(max-width: 181px) 100vw, 181px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
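&lt;p&gt;Other than that, a suspiciously large majority of people at our company are extroverts, which is a slightly curious experience for an INTJ-A like me. It is a rare situation even across my whole life, and in any case I am observing it with interest.&lt;/p&gt;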
&lt;h2 id=&quot;개발&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EA%B0%9C%EB%B0%9C&quot; aria-label=&quot;개발 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Development&lt;/h2&gt;
&lt;h3 id=&quot;pytorch-optimizer&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#pytorch-optimizer&quot; aria-label=&quot;pytorch optimizer permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;pytorch-optimizer&lt;/h3&gt;
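&lt;p&gt;Around April this year I gave it a makeover. I originally built it for my own use, but as the number of users grew and feature requests started coming in from time to time, I changed strategy a little. Before, it had few features and I did not implement anything that already existed elsewhere. The new strategy is to support every implementation, improve usability, and test it rigorously.&lt;/p&gt;
&lt;p&gt;After the rebranding, the user count kept growing, and recently it peaked at &lt;strong&gt;2K downloads / day, 10K downloads / week, 55K downloads / month&lt;/strong&gt; and reached 155 stars. There were big jumps in users in April and in October, and those were thrilling moments.&lt;/p&gt;
&lt;p&gt;The goals I set last year, documentation and support for many implementations, are both done. The current version is &lt;code class=&quot;language-text&quot;&gt;v2.12.0&lt;/code&gt;, and next year I will aim for the &lt;a href=&quot;https://github.com/kozistr/pytorch_optimizer/issues/164&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;v3&lt;/code&gt;&lt;/a&gt; release.&lt;/p&gt;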
&lt;h3 id=&quot;open-source-contributions&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#open-source-contributions&quot; aria-label=&quot;open source contributions permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;open source contributions&lt;/h3&gt;
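&lt;p&gt;Lately I have been contributing to open source projects that are close to my daily life. For example, I use &lt;a href=&quot;https://github.com/huggingface/text-embeddings-inference&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;text-embeddings-inference&lt;/a&gt; to serve embedding models, and I send PRs myself for the features I need as I go.&lt;/p&gt;
&lt;p&gt;Personally I like Rust, Hugging Face uses it a lot, and vector DBs do too. Many good Rust-based products are coming out these days, and I expect to keep contributing whenever I need something while using them.&lt;/p&gt;
&lt;p&gt;As an aside, a lot of really good Rust-based implementations seem to be appearing lately. At my current company we use Rust-based products such as TEI and Qdrant, and they are really good. If I ever find time to indulge a bit of romance, one of my goals is to rewrite our current Python backend entirely in Rust.&lt;/p&gt;
&lt;p&gt;Conclusion: Rust is the future!&lt;/p&gt;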
&lt;h3 id=&quot;leetcode&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#leetcode&quot; aria-label=&quot;leetcode permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Leetcode&lt;/h3&gt;
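&lt;p&gt;I started solving one LeetCode problem a day around April this year, I think. I do not remember exactly why, but it turned out to be fun, so I kept going. These days I am busy, so I spend less time thinking and look at the answer quickly; I should set aside some time and solve them at a relaxed pace.&lt;/p&gt;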
&lt;h2 id=&quot;클라이밍&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%ED%81%B4%EB%9D%BC%EC%9D%B4%EB%B0%8D&quot; aria-label=&quot;클라이밍 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Climbing&lt;/h2&gt;
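&lt;p&gt;It has been over a year since I started climbing, and I go once a week without fail. Last year&apos;s goal was to become a purple-grade climber (on the rainbow difficulty scale). I am not yet good enough to send any purple problem, but I flash most indigo problems and try purple ones when I have strength left.&lt;/p&gt;
&lt;p&gt;The biggest reason I have not gotten bored is grip strength. The absolute scale differs from one hand gripper to another, but my grip has gone up a lot, by 20 to 30 or more since last year, to around 60. As it got stronger, moves I could not do before started to work, and that was as much fun as topping out and kept me motivated.&lt;/p&gt;
&lt;p&gt;Now it feels like I get stuck on unfamiliar moves and slopers rather than on grip strength, so this year my goal is to fill in the details and work toward the day I can flash a purple problem.&lt;/p&gt;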
&lt;h2 id=&quot;마무리&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%A7%88%EB%AC%B4%EB%A6%AC&quot; aria-label=&quot;마무리 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;Every year brings no shortage of events. At this point I figure I should just accept it and enjoy it; it is a bit stressful in the moment, but honestly I do not think I mind.&lt;/p&gt;
&lt;p&gt;Looking forward to whatever happens next year, I will wrap up this year&apos;s retrospective here.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[(Kaggle) BirdCLEF 2023 - 24th (top 2%) place solution]]></title><description><![CDATA[Original Post : https://www.kaggle.com/competitions/birdclef-2023/discussion/412996 Architecture Here's the pipeline. pre-train on 2020, 20…]]></description><link>http://kozistr.tech/2023-05-26-birdcelf-2023/</link><guid isPermaLink="false">http://kozistr.tech/2023-05-26-birdcelf-2023/</guid><pubDate>Fri, 26 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;Original Post : &lt;a href=&quot;https://www.kaggle.com/competitions/birdclef-2023/discussion/412996&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;https://www.kaggle.com/competitions/birdclef-2023/discussion/412996&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;Here&apos;s the pipeline.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;pre-train on 2020, 2021, 2022, xeno-canto datasets.&lt;/li&gt;
&lt;li&gt;fine-tune on 2023 dataset (based on the pre-trained weight).
&lt;ul&gt;
&lt;li&gt;minor classes (&amp;#x3C;= 5 samples) are included in all folds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I applied the same training recipe (e.g. augmentations, loss functions, ...) at each step.&lt;/p&gt;
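&lt;p&gt;The fold rule from the pipeline above (minor classes are included in all folds) could be sketched roughly as follows. This is a minimal pure-Python illustration and not the code used in the competition: the helper name &lt;code class=&quot;language-text&quot;&gt;make_folds&lt;/code&gt;, its parameters, and the reading that minor classes always stay in the training split are assumptions.&lt;/p&gt;

```python
import random
from collections import defaultdict


def make_folds(labels, n_folds=5, minor_max=5, seed=42):
    """Return one (train_idx, valid_idx) pair per fold.

    Assumption: classes with at most `minor_max` samples are never held out;
    they are added to the training split of every fold.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = {}       # sample index -> fold in which it is used for validation
    always_train = []  # samples that belong to minor classes
    for label in sorted(by_class):
        members = by_class[label]
        if minor_max >= len(members):
            always_train.extend(members)
            continue
        rng.shuffle(members)
        # Deal the samples of this class round-robin so each fold gets a similar share.
        for pos, idx in enumerate(members):
            fold_of[idx] = pos % n_folds

    folds = []
    for fold in range(n_folds):
        valid = sorted(i for i, f in fold_of.items() if f == fold)
        train = sorted([i for i, f in fold_of.items() if f != fold] + always_train)
        folds.append((train, valid))
    return folds
```

&lt;p&gt;With this reading, a species with only a handful of recordings is seen by every fold model during training, at the cost of never being scored in validation.&lt;/p&gt;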
&lt;h3 id=&quot;cv&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#cv&quot; aria-label=&quot;cv permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;CV&lt;/h3&gt;
&lt;p&gt;Although this is based on only a few experiments, my CV score and LB/PB scores are roughly correlated.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&quot;center&quot;&gt;Exp&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;CV&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;LB&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;PB&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;effnetb0&lt;/code&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.7720&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.82438&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.73641&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;multiple losses, 5 folds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;effnetb0&lt;/code&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.7693&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.82402&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.73604&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;clipwise loss, 5 folds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;eca_nfnet_l0&lt;/code&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.7753&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.80731&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.71845&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;clipwise loss, single fold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;model&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#model&quot; aria-label=&quot;model permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Model&lt;/h3&gt;
&lt;p&gt;I used the SED architecture with the &lt;code class=&quot;language-text&quot;&gt;efficientnet_b0&lt;/code&gt; backbone. I also tested the &lt;code class=&quot;language-text&quot;&gt;eca_nfnet_l0&lt;/code&gt; backbone; it has a better CV score, but I couldn&apos;t use it because of its latency.&lt;/p&gt;
&lt;h3 id=&quot;training-recipe&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#training-recipe&quot; aria-label=&quot;training recipe permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Training recipe&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;[&lt;strong&gt;Important&lt;/strong&gt;] pre-training&lt;/li&gt;
&lt;li&gt;[&lt;strong&gt;Important&lt;/strong&gt;] augmentations
&lt;ul&gt;
&lt;li&gt;waveform-level
&lt;ul&gt;
&lt;li&gt;[Important] or mixup on a raw waveform&lt;/li&gt;
&lt;li&gt;gaussian &amp;#x26; uniform noise&lt;/li&gt;
&lt;li&gt;pitch shift&lt;/li&gt;
&lt;li&gt;[Important] background noise&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;spectrogram-level
&lt;ul&gt;
&lt;li&gt;spec augment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;log-mel spectrogram
&lt;ul&gt;
&lt;li&gt;n_fft &amp;#x26; window size 1024, hop size 320, min/max freq 20/14000, num_mels 256, top_db 80. (Actually, I wanted n_fft to be 2048, but I set it to 1024 by mistake.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;trained on 5-second clips&lt;/li&gt;
&lt;li&gt;stratified k fold (5 folds, on primary_label)&lt;/li&gt;
&lt;li&gt;label smoothing 0.1&lt;/li&gt;
&lt;li&gt;multiple losses (from &lt;a href=&quot;https://www.kaggle.com/competitions/birdclef-2021/discussion/243351&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;BirdCLEF 2021 top 5&lt;/a&gt;)
&lt;ul&gt;
&lt;li&gt;bce loss on clip-wise output w/ weight 1.0&lt;/li&gt;
&lt;li&gt;bce loss on max of segment-wise outputs w/ weight 0.5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;fp32&lt;/li&gt;
&lt;li&gt;AdamW + cosine annealing (w/o warmup)
&lt;ul&gt;
&lt;li&gt;50 epochs (usually converged between 40 ~ 50)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
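&lt;p&gt;As a rough illustration of the &quot;multiple losses&quot; item in the recipe above, here is a minimal pure-Python sketch: BCE on the clip-wise output with weight 1.0, plus BCE on the per-class maximum over the segment-wise outputs with weight 0.5. The function names are hypothetical, the inputs are assumed to be probabilities (after sigmoid), and both the label-smoothing convention and applying it to both terms are assumptions, not details taken from the original code.&lt;/p&gt;

```python
import math


def smooth(targets, eps=0.1):
    # Assumed multi-label smoothing convention: y * (1 - eps) + eps / 2.
    return [y * (1.0 - eps) + 0.5 * eps for y in targets]


def bce(probs, targets, tiny=1e-7):
    # Mean binary cross-entropy over classes; probabilities are clamped for stability.
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, tiny), 1.0 - tiny)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def sed_loss(clipwise, segmentwise, targets, w_clip=1.0, w_seg=0.5, eps=0.1):
    """clipwise: [n_classes]; segmentwise: [n_segments][n_classes]; targets: [n_classes]."""
    y = smooth(targets, eps)
    # Per-class maximum over all segments of the clip.
    seg_max = [max(column) for column in zip(*segmentwise)]
    return w_clip * bce(clipwise, y) + w_seg * bce(seg_max, y)
```

&lt;p&gt;The second term pushes at least one segment per positive class toward a high probability, which is one way to tie the segment-wise head to the clip-level labels.&lt;/p&gt;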
&lt;h2 id=&quot;inference&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#inference&quot; aria-label=&quot;inference permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Inference&lt;/h2&gt;
&lt;p&gt;With PyTorch, I could ensemble up to 4 models (it took nearly 2 hrs). To fit more models, I used ONNX with graph optimization, which made room for one more model in the ensemble. In the end, I could ensemble 5 models (a single model, 5 folds). Also, to use the full CPU, I did some multi-processing.&lt;/p&gt;
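&lt;p&gt;A minimal sketch of the fold ensemble described above, under stated assumptions: each fold model is represented by a plain Python callable that maps one segment to a list of class probabilities (in practice each would wrap an inference session), the ensemble is a simple mean, and the work is split across workers round-robin. The names &lt;code class=&quot;language-text&quot;&gt;ensemble_mean&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;split_round_robin&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
def ensemble_mean(fold_models, segment):
    """Average the class probabilities of all fold models for one segment."""
    outputs = [model(segment) for model in fold_models]
    n_models = len(outputs)
    return [sum(column) / n_models for column in zip(*outputs)]


def split_round_robin(items, n_workers):
    """Split a list of audio files into one chunk per worker process."""
    return [items[i::n_workers] for i in range(n_workers)]
```

&lt;p&gt;Each chunk from &lt;code class=&quot;language-text&quot;&gt;split_round_robin&lt;/code&gt; could then be handed to its own process, with every process running &lt;code class=&quot;language-text&quot;&gt;ensemble_mean&lt;/code&gt; over the segments of its files.&lt;/p&gt;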
&lt;h2 id=&quot;not-worked-perhaps-i-might-be-wrong&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#not-worked-perhaps-i-might-be-wrong&quot; aria-label=&quot;not worked perhaps i might be wrong permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;What didn&apos;t work (perhaps I might be wrong)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;secondary labels (both hard labels and soft labels (e.g. 0.3, 0.5))&lt;/li&gt;
&lt;li&gt;focal loss&lt;/li&gt;
&lt;li&gt;longer clips (e.g. 15s)&lt;/li&gt;
&lt;li&gt;post-processing (proposed in the BirdCLEF 2021 and 2022 competitions)
&lt;ul&gt;
&lt;li&gt;aggregate the probs of the previous and next segments.&lt;/li&gt;
&lt;li&gt;if there&apos;s a bird above the threshold, multiply constants on all segments of the bird.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
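&lt;p&gt;For reference, the two post-processing ideas listed above (which did not help in my case) could look roughly like this in pure Python. The weights, threshold, and boost factor are placeholder assumptions, and the function names are hypothetical; the actual variants proposed in the 2021 and 2022 competitions may differ.&lt;/p&gt;

```python
def smooth_neighbors(probs, w_prev=0.25, w_cur=0.5, w_next=0.25):
    """Blend each segment with its previous and next segment.

    probs: [n_segments][n_classes]. Weights are renormalised at the clip edges.
    """
    n = len(probs)
    out = []
    for t in range(n):
        picks = [(w_cur, probs[t])]
        if t > 0:
            picks.append((w_prev, probs[t - 1]))
        if n - 1 > t:
            picks.append((w_next, probs[t + 1]))
        norm = sum(w for w, _ in picks)
        out.append([sum(w * row[c] for w, row in picks) / norm
                    for c in range(len(probs[t]))])
    return out


def boost_detected(probs, threshold=0.5, factor=1.5):
    """If a class exceeds `threshold` in any segment, scale it in all segments."""
    n_classes = len(probs[0])
    peak = [max(row[c] for row in probs) for c in range(n_classes)]
    return [
        [min(1.0, row[c] * factor) if peak[c] > threshold else row[c]
         for c in range(n_classes)]
        for row in probs
    ]
```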
&lt;p&gt;I hope this could help!&lt;/p&gt;
&lt;p&gt;Thanks : )&lt;/p&gt;</content:encoded></item><item><title><![CDATA[(Kaggle) Screening Mammography Breast Cancer Detection - 16th (top 1%) place solution]]></title><description><![CDATA[Original Post : https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/391133 Data Preprocessing My preprocessing code…]]></description><link>http://kozistr.tech/2023-02-28-screening-mammography-breast-cancer-detection/</link><guid isPermaLink="false">http://kozistr.tech/2023-02-28-screening-mammography-breast-cancer-detection/</guid><pubDate>Tue, 28 Feb 2023 00:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;Original Post : &lt;a href=&quot;https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/391133&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/391133&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;data&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#data&quot; aria-label=&quot;data permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Data&lt;/h2&gt;
&lt;h3 id=&quot;preprocessing&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#preprocessing&quot; aria-label=&quot;preprocessing permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Preprocessing&lt;/h3&gt;
&lt;p&gt;My preprocessing code relies heavily on the public notebooks (e.g. removing letters, cropping the breast via contours).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;decode &lt;code class=&quot;language-text&quot;&gt;.jpeg&lt;/code&gt; with &lt;code class=&quot;language-text&quot;&gt;dicomsdl&lt;/code&gt; &amp;#x26; &lt;code class=&quot;language-text&quot;&gt;nvjpeg2000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;crop edge (margin pixel 10)&lt;/li&gt;
&lt;li&gt;extract breast with &lt;code class=&quot;language-text&quot;&gt;opencv2&lt;/code&gt; (contour based)&lt;/li&gt;
&lt;li&gt;resize to 1536x960. (I roughly guess that resizing into a 1.5 ~ 2.0 aspect ratio is fine.)&lt;/li&gt;
&lt;/ol&gt;
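&lt;p&gt;The edge crop and breast extraction steps can be illustrated with a small pure-Python sketch. Note the assumptions: the image is a plain list of pixel rows, the function names are made up for this example, and a bounding box of non-background pixels stands in for the contour-based extraction done with OpenCV in the real pipeline.&lt;/p&gt;

```python
def crop_edge(image, margin=10):
    """Drop `margin` pixels from every side of a 2D image (list of rows)."""
    return [row[margin:len(row) - margin] for row in image[margin:len(image) - margin]]


def crop_to_foreground(image, threshold=0):
    """Crop to the bounding box of pixels brighter than `threshold`.

    A simplified stand-in for contour-based breast extraction,
    not the method used in the actual pipeline.
    """
    rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return image  # nothing above the threshold, keep the image as-is
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]
```

&lt;p&gt;After these two steps the crop would be resized to the target resolution (1536x960, an aspect ratio of 1.6).&lt;/p&gt;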
&lt;p&gt;In my experiments, windowing didn&apos;t improve the score, so I decided not to use it.&lt;/p&gt;
&lt;h3 id=&quot;augmentation&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#augmentation&quot; aria-label=&quot;augmentation permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Augmentation&lt;/h3&gt;
&lt;p&gt;Heavy augmentation works well. With light augmentation, the model tends to overfit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;v/hflip&lt;/li&gt;
&lt;li&gt;scale / rotate&lt;/li&gt;
&lt;li&gt;brightness / contrast&lt;/li&gt;
&lt;li&gt;cutout (coarse dropout with large patch size)&lt;/li&gt;
&lt;li&gt;mixup&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;I couldn&apos;t run many experiments due to a lack of time &amp;#x26; computing resources, so I only tested a few backbones &amp;#x26; training recipes. (About 70% of my submissions were runtime errors &amp;#x26; mistakes lol)&lt;/p&gt;
&lt;p&gt;Here&apos;s the full pipeline.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;pre-train segmentation model with the &lt;code class=&quot;language-text&quot;&gt;cbis-ddsm&lt;/code&gt; &amp;#x26; &lt;code class=&quot;language-text&quot;&gt;vindr&lt;/code&gt; datasets.
&lt;ul&gt;
&lt;li&gt;segment: provided RoI image.&lt;/li&gt;
&lt;li&gt;label: &lt;code class=&quot;language-text&quot;&gt;malignant&lt;/code&gt; to cancer / &lt;code class=&quot;language-text&quot;&gt;BIRADS 5&lt;/code&gt; to cancer.&lt;/li&gt;
&lt;/ul&gt;
Note: of course, these labels don&apos;t perfectly match the competition&apos;s standard, but I figured they could still help train the model in some way.&lt;/li&gt;
&lt;li&gt;train with competition data (initialize the weight with the pre-trained model)
&lt;ul&gt;
&lt;li&gt;segment: inferred with the pre-trained model.&lt;/li&gt;
&lt;li&gt;auxiliary: given meta-features (total 11 features).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;re-label the external data with the &lt;code class=&quot;language-text&quot;&gt;step 2&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;re-train with competition data (initialize with &lt;code class=&quot;language-text&quot;&gt;step 3&lt;/code&gt; model)&lt;/li&gt;
&lt;li&gt;train a meta-classifier (oof + meta-features (e.g. laterality, age, ...))&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For a baseline, I ran steps 1, 2, and 5 and achieved CV 0.4885, LB 0.59 (PB 0.46). I also tried pre-training with only the &lt;code class=&quot;language-text&quot;&gt;cbis-ddsm&lt;/code&gt; dataset, which dropped CV &amp;#x26; LB by about 0.02 but gave the same PB score (CV 0.4656, LB 0.57, PB 0.46).&lt;/p&gt;
&lt;p&gt;A week before the deadline, I finished everything up to step 5 and got CV 0.5012, LB 0.55 (PB 0.51). Sadly, I didn&apos;t choose it as a final submission : (&lt;/p&gt;
&lt;p&gt;On the last day of the competition, I ensembled in an &lt;code class=&quot;language-text&quot;&gt;effnet_v2_s&lt;/code&gt; backbone and got CV 0.5063, LB 0.56 (PB 0.49).&lt;/p&gt;
&lt;p&gt;In the end, I chose the best-LB and best-CV submissions as my final submissions.&lt;/p&gt;
&lt;h3 id=&quot;meta-classifier&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#meta-classifier&quot; aria-label=&quot;meta classifier permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Meta-Classifier&lt;/h3&gt;
&lt;p&gt;I built a meta-classifier with meta-features like age, laterality, and the (per-breast) statistics of the predictions. It usually gives an improvement of about 0.02 on CV &amp;#x26; LB (and also PB).&lt;/p&gt;
&lt;p&gt;I was worried about overfitting to some meta-features (e.g. machine id, (predicted) density, ...), so I decided to use only 3 auxiliary features (age, site_id, laterality) to train the model.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;feature: age, site_id, laterality, (mean, std, min, max) of the predictions.&lt;/li&gt;
&lt;li&gt;cv: stratified k fold (5 folds)&lt;/li&gt;
&lt;li&gt;model: CatBoost&lt;/li&gt;
&lt;/ul&gt;
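&lt;p&gt;The feature table for the meta-classifier could be assembled along these lines. This is a minimal sketch: the input row keys (&lt;code class=&quot;language-text&quot;&gt;patient_id&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;pred&lt;/code&gt;, ...) and the function name are assumptions for illustration, and the resulting rows would then be passed to the CatBoost model.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_breast_features(rows):
    """Aggregate image-level predictions into one feature row per breast.

    Each input row is a dict with hypothetical keys:
    patient_id, laterality, age, site_id, pred (out-of-fold prediction).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["patient_id"], row["laterality"])].append(row)

    features = []
    for (patient_id, laterality), items in sorted(groups.items()):
        preds = [item["pred"] for item in items]
        features.append({
            "patient_id": patient_id,
            "laterality": laterality,
            "age": items[0]["age"],
            "site_id": items[0]["site_id"],
            "pred_mean": mean(preds),
            "pred_std": pstdev(preds),  # population std; 0.0 for a single image
            "pred_min": min(preds),
            "pred_max": max(preds),
        })
    return features
```

&lt;p&gt;Grouping by (patient, laterality) gives one row per breast, which matches the level at which the competition target is defined.&lt;/p&gt;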
&lt;h2 id=&quot;works&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#works&quot; aria-label=&quot;works permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Works&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;higher resolution (1536x768 ~ 1024) is good.&lt;/li&gt;
&lt;li&gt;external data
&lt;ul&gt;
&lt;li&gt;it gives a boost of about +0.02.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;architecture
&lt;ul&gt;
&lt;li&gt;encoder: the &lt;code class=&quot;language-text&quot;&gt;effnet-b3&lt;/code&gt; backbone works best&lt;/li&gt;
&lt;li&gt;decoder: u-net++&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;augmentation&lt;/li&gt;
&lt;li&gt;mixup (alpha 1.0)&lt;/li&gt;
&lt;li&gt;loss
&lt;ul&gt;
&lt;li&gt;0.6 * cls_loss (cross-entropy) + 0.4 * seg_loss (dice) + 0.1 * aux_loss (cross-entropy)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;stratified group k fold (4 folds)&lt;/li&gt;
&lt;li&gt;meta-classifier&lt;/li&gt;
&lt;li&gt;TTA&lt;/li&gt;
&lt;/ul&gt;
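&lt;p&gt;The combined loss listed above can be sketched in plain Python. This is only an illustration on flat lists: a real training loop would use framework tensors, and the binary form of the cross-entropy, the clamping epsilon, and the dice smoothing term are assumptions of this sketch.&lt;/p&gt;

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss over a flattened mask; `smooth` is an assumed stabilizer."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def total_loss(cls_loss, seg_loss, aux_loss):
    """Weighted sum using the weights from the list above (0.6, 0.4, 0.1)."""
    return 0.6 * cls_loss + 0.4 * seg_loss + 0.1 * aux_loss
```

&lt;p&gt;The weights are taken as written in the post, so they sum to 1.1 rather than 1.0.&lt;/p&gt;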
&lt;p&gt;Thanks for reading! I hope this helps :)&lt;/p&gt;</content:encoded></item><item><title><![CDATA[(Kaggle) Detecting Continuous Gravitational Waves - 22nd (top 2%) place solution]]></title><description><![CDATA[Original Post : https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves/discussion/375927 Data Pre-Processing In…]]></description><link>http://kozistr.tech/2023-01-03-detecting-continuous-gravitational-waves/</link><guid isPermaLink="false">http://kozistr.tech/2023-01-03-detecting-continuous-gravitational-waves/</guid><pubDate>Tue, 03 Jan 2023 00:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;Original Post : &lt;a href=&quot;https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves/discussion/375927&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves/discussion/375927&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;data&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#data&quot; aria-label=&quot;data permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Data&lt;/h2&gt;
&lt;h3 id=&quot;pre-processing&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#pre-processing&quot; aria-label=&quot;pre processing permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Pre-Processing&lt;/h3&gt;
&lt;p&gt;In my experiment, &lt;a href=&quot;https://www.kaggle.com/code/laeyoung/g2net-large-kernel-inference&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;preprocessing&lt;/a&gt; (&lt;code class=&quot;language-text&quot;&gt;normalize&lt;/code&gt; function) works better than the power spectrogram. It improves the score by about +0.02 on CV/LB. After normalizing the signal, take a mean over the time axis. The final shape is (360, 360).&lt;/p&gt;
&lt;h3 id=&quot;simulation&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#simulation&quot; aria-label=&quot;simulation permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Simulation&lt;/h3&gt;
&lt;p&gt;Generating samples is the most crucial part of boosting the score. With the right simulated data, I got 0.761 on the LB with a single model.&lt;/p&gt;
&lt;p&gt;In short, signal depth (&lt;code class=&quot;language-text&quot;&gt;sqrtSX / h0&lt;/code&gt;) has a huge impact. I generated 100K samples (50K positives, 50K negatives) and sampled the signal depth uniformly between 10 and 100. The &lt;code class=&quot;language-text&quot;&gt;cosi&lt;/code&gt; parameter is sampled uniformly from (-1, 1).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;signal depth&lt;/th&gt;
&lt;th&gt;LB score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 ~ 50&lt;/td&gt;
&lt;td&gt;0.73x ~ 0.74x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 ~ 80&lt;/td&gt;
&lt;td&gt;0.75x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 ~ 100&lt;/td&gt;
&lt;td&gt;0.761&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
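&lt;p&gt;The parameter sampling described above can be sketched as follows. This only draws the parameters and does not run the signal simulation itself. The function name and the dictionary keys are hypothetical, and the amplitude is derived from the depth definition given above (depth = sqrtSX / h0).&lt;/p&gt;

```python
import random


def sample_signal_params(sqrt_sx, rng, depth_range=(10.0, 100.0)):
    """Draw one set of injection parameters for a simulated positive sample.

    depth = sqrtSX / h0, so the amplitude is h0 = sqrtSX / depth.
    `sqrt_sx` is the noise level of the sample (a value the caller must supply).
    """
    depth = rng.uniform(*depth_range)
    return {
        "depth": depth,
        "h0": sqrt_sx / depth,
        "cosi": rng.uniform(-1.0, 1.0),
    }
```

&lt;p&gt;Passing a seeded &lt;code class=&quot;language-text&quot;&gt;random.Random&lt;/code&gt; instance as &lt;code class=&quot;language-text&quot;&gt;rng&lt;/code&gt; keeps the generated dataset reproducible, and changing &lt;code class=&quot;language-text&quot;&gt;depth_range&lt;/code&gt; reproduces the other rows of the table.&lt;/p&gt;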
&lt;h3 id=&quot;augmentations&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#augmentations&quot; aria-label=&quot;augmentations permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Augmentations&lt;/h3&gt;
&lt;p&gt;I also spent a lot of time on the augmentations. Here&apos;s the list.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;v/hflip&lt;/li&gt;
&lt;li&gt;shuffle channel&lt;/li&gt;
&lt;li&gt;shift on freq-axis&lt;/li&gt;
&lt;li&gt;denoise a signal (subtract corresponding noise from the signal)&lt;/li&gt;
&lt;li&gt;add noises
&lt;ul&gt;
&lt;li&gt;Gaussian N(0, 1e-2)&lt;/li&gt;
&lt;li&gt;mixed (add or concatenate) with another (stationary) noise(s)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;add vertical line artifact(s).&lt;/li&gt;
&lt;li&gt;SpecAugment&lt;/li&gt;
&lt;li&gt;mixup (alpha 5.0)
&lt;ul&gt;
&lt;li&gt;perform &lt;code class=&quot;language-text&quot;&gt;or&lt;/code&gt; mixup&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
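&lt;p&gt;A minimal sketch of mixup with an &lt;code class=&quot;language-text&quot;&gt;or&lt;/code&gt; label is shown below. It assumes that &lt;code class=&quot;language-text&quot;&gt;or&lt;/code&gt; mixup means the inputs are blended as usual while the label becomes the logical OR (maximum) of the two labels, so a mixed sample is positive if either source contains a signal. That reading, the function name, and the flat-list inputs are assumptions of this example.&lt;/p&gt;

```python
import random


def or_mixup(x1, y1, x2, y2, alpha=5.0, rng=random):
    """Mix two samples; the label is the logical OR (max) of both labels.

    x1/x2 are flat lists of floats, y1/y2 are labels in [0, 1].
    The mixing ratio is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = max(y1, y2)
    return mixed_x, mixed_y, lam
```

&lt;p&gt;With alpha 5.0 the Beta distribution concentrates around 0.5, so most mixed samples are close to an even blend of the two inputs.&lt;/p&gt;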
&lt;h2 id=&quot;model&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#model&quot; aria-label=&quot;model permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Model&lt;/h2&gt;
&lt;p&gt;First, I searched over backbones (effnet, nfnet, resnest, convnext, vit-based) and found that &lt;code class=&quot;language-text&quot;&gt;convnext&lt;/code&gt; works best on both CV &amp;#x26; LB. After selecting a baseline backbone, I experimented with customizing the stem layer (e.g. large kernel &amp;#x26; pool sizes, a multi-convolution stem with various kernel sizes) to detect the long-lasting signal effectively, but none of these improved performance.&lt;/p&gt;
&lt;h2 id=&quot;ensemble&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#ensemble&quot; aria-label=&quot;ensemble permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Ensemble&lt;/h2&gt;
&lt;p&gt;Most of the models in the ensemble are &lt;code class=&quot;language-text&quot;&gt;convnext-xlarge&lt;/code&gt;, each trained with a different variation (e.g. augmentations, simulated samples, ...), plus one model each of &lt;code class=&quot;language-text&quot;&gt;eca-nfnet-l2&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;efficientnetv2-xl&lt;/code&gt;. Every model was trained on a different dataset and the LB score seemed reliable, so I adjusted the ensemble weights by LB score.&lt;/p&gt;
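&lt;p&gt;One simple way to weight an ensemble by LB score is sketched below. The post only says the weights were adjusted by LB score, so the proportional weighting used here and the function name are assumptions.&lt;/p&gt;

```python
def lb_weighted_average(predictions, lb_scores):
    """Average per-model predictions with weights proportional to each LB score.

    predictions: list of per-model prediction lists (same length each).
    lb_scores: one LB score per model.
    Proportional weighting is an assumed scheme, not the one from the post.
    """
    total = sum(lb_scores)
    weights = [score / total for score in lb_scores]
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        for i in range(n_samples)
    ]
```

&lt;p&gt;Because LB scores of strong models are close to each other, proportional weights end up nearly uniform. A sharper scheme (for example, weighting by rank) would be another option.&lt;/p&gt;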
&lt;p&gt;I selected the two best LB submissions (LB 0.768, PB 0.771). The best PB among the submissions I didn&apos;t select was 0.778 (LB 0.766), from a blend of all my experiments.&lt;/p&gt;
&lt;h2 id=&quot;works&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#works&quot; aria-label=&quot;works permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Works&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;convnext&lt;/code&gt; family backbone&lt;/li&gt;
&lt;li&gt;signal depth 10 ~ 100&lt;/li&gt;
&lt;li&gt;hard augmentation&lt;/li&gt;
&lt;li&gt;pair stratified k fold
&lt;ul&gt;
&lt;li&gt;8 folds&lt;/li&gt;
&lt;li&gt;stratified on the target&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;pair&lt;/code&gt; means the pair (corresponding noise &amp;#x26; signal) must be in the same fold.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;pseudo label (smooth label)&lt;/li&gt;
&lt;li&gt;segmentation (but hard to converge in my experiments)&lt;/li&gt;
&lt;li&gt;TTA&lt;/li&gt;
&lt;/ul&gt;
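&lt;p&gt;The pair stratified k fold split can be sketched as below. It is a simplified, deterministic version: pairs are identified by a hypothetical &lt;code class=&quot;language-text&quot;&gt;pair_id&lt;/code&gt;, stratified by whether the pair contains a positive target, and dealt round-robin across folds without shuffling.&lt;/p&gt;

```python
from collections import defaultdict


def pair_stratified_kfold(pair_ids, targets, n_splits=8):
    """Assign a fold to every sample so that pairs stay together.

    Samples sharing a pair_id (a signal and its corresponding noise) get the
    same fold. Pairs are stratified by whether they contain a positive target
    and dealt round-robin across the folds.
    """
    pair_has_positive = defaultdict(bool)
    for pid, target in zip(pair_ids, targets):
        pair_has_positive[pid] = pair_has_positive[pid] or target == 1

    fold_of_pair = {}
    for label in (False, True):
        members = sorted(pid for pid, pos in pair_has_positive.items() if pos == label)
        for i, pid in enumerate(members):
            fold_of_pair[pid] = i % n_splits

    return [fold_of_pair[pid] for pid in pair_ids]
```

&lt;p&gt;Keeping a signal and its noise in the same fold prevents the model from being validated on a sample whose noise background it has already seen during training.&lt;/p&gt;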
&lt;h2 id=&quot;not-works&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#not-works&quot; aria-label=&quot;not works permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Not Works&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;segmentation with classification head (0.6 * bce + 0.4 * dice)
&lt;ul&gt;
&lt;li&gt;Actually, seg with cls works slightly better than cls only, but it was hard to train without the loss diverging, so I used cls only.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cosi == 0&lt;/code&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cosi&lt;/code&gt; is also a critical parameter in determining the SNR. I generated more samples where &lt;code class=&quot;language-text&quot;&gt;cosi&lt;/code&gt; is 0, but the score dropped.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;augmentations (not worked)
&lt;ul&gt;
&lt;li&gt;swap with random negatives (proposed in a past competition)&lt;/li&gt;
&lt;li&gt;random resized crop&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Customize a stem layer with large kernel &amp;#x26; pool sizes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope this could help you :)&lt;/p&gt;
&lt;p&gt;Happy new year!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[2022 Retrospective]]></title><description><![CDATA[TL;DR Before starting this year's retrospective, I reread last year's, and right from its first line, things did not go the way I had hoped. My alternative military service had settled down and I thought 2022 would pass quietly, but there were big changes at work and plenty of other events. As we like to say, "it should be fine now…]]></description><link>http://kozistr.tech/2022-12-17-Review2022/</link><guid isPermaLink="false">http://kozistr.tech/2022-12-17-Review2022/</guid><pubDate>Fri, 16 Dec 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
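&lt;p&gt;Before starting this year&apos;s retrospective, I reread last year&apos;s, and right from its first line, things did not go the way I had hoped. My alternative military service had settled down and I thought 2022 would pass quietly, but there were big changes at work and plenty of other events. As with what we call &lt;code class=&quot;language-text&quot;&gt;that line&lt;/code&gt;, it really does seem that the moment you say &quot;it should be fine now...?&quot;, it stops being fine. So that is as far as this year&apos;s TL;DR goes.&lt;/p&gt;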
&lt;h2 id=&quot;kaggle&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#kaggle&quot; aria-label=&quot;kaggle permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Kaggle&lt;/h2&gt;
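&lt;p&gt;In the first half of the year I spent so much time on work that I could barely do any Kaggle, but I recently got some breathing room and am now in the middle of one competition. It is a &lt;a href=&quot;https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;challenge&lt;/a&gt; on detecting continuous gravitational-wave signals from neutron stars. The topic is fun, so I am putting my heart into it. It ends on January 3 next year, so I will work hard for the remaining three weeks and aim for a gold medal. &lt;del&gt;It really does feel about 100 times more fun than work&lt;/del&gt;.&lt;/p&gt;
&lt;p&gt;Besides, although I do Kaggle partly for fun, the bigger purpose is training, and I am satisfied that my growth shows every year in my rankings, ideas, and speed. Still, I am not yet skilled enough to land reliably in the gold medal zone in every competition, so I need to keep working at it.&lt;/p&gt;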
&lt;h2 id=&quot;programming&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#programming&quot; aria-label=&quot;programming permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Programming&lt;/h2&gt;
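&lt;p&gt;Going by commit count alone, I made about 1.7k contributions this year, similar to last year. To be honest, though, I have an action that pushes one commit every day, so 365 of those were freebies and the real number is about 1.3k commits. I thought about why the count went down. One reason is that, apart from studying &lt;code class=&quot;language-text&quot;&gt;Rust&lt;/code&gt; and working on &lt;code class=&quot;language-text&quot;&gt;pytorch_optimizer&lt;/code&gt;, nothing really caught my interest. The other is that my habits changed: rather than running personal projects, I seem to have spent more time browsing open source and contributing to it.&lt;/p&gt;
&lt;p&gt;It is not a lot, but I contributed to 5 projects. Personally, I have become more interested in contributing to open source projects that already have many users than in maintaining my own packages, so I plan to keep doing this next year.&lt;/p&gt;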
&lt;h3 id=&quot;pytorch_optimizer&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#pytorch_optimizer&quot; aria-label=&quot;pytorch_optimizer permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;pytorch_optimizer&lt;/h3&gt;
&lt;p&gt;I started it because I was tired of re-implementing optimizers every time, and it has already been a year. As of December 16, it has reached &lt;code class=&quot;language-text&quot;&gt;total 41.6K downloads&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;2.1k/month downloads&lt;/code&gt;. The version is now &lt;code class=&quot;language-text&quot;&gt;v2.0.1&lt;/code&gt;, so the major version has been bumped to &lt;code class=&quot;language-text&quot;&gt;2&lt;/code&gt; as well. Beyond just sharing code in a repo, making it easy for others to use and seeing that people actually use it a lot has motivated me to keep maintaining the project. Thankfully, some people also sent PRs, which improved usability, and I was a little touched.&lt;/p&gt;
&lt;p&gt;One regret is that I did not work on the documentation properly. I should treat it as a high priority.&lt;/p&gt;
&lt;p&gt;Briefly, my plans going forward are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;decide on a docstring format&lt;/li&gt;
&lt;li&gt;write the documentation&lt;/li&gt;
&lt;li&gt;implement more lr schedulers&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That is what I plan to work on.&lt;/p&gt;
&lt;h3 id=&quot;gatsby-blog&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#gatsby-blog&quot; aria-label=&quot;gatsby blog permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Gatsby Blog&lt;/h3&gt;
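&lt;p&gt;Last winter, or maybe the one before, I moved the blog from Ruby + Jekyll to React + Gatsby. I had always been far from frontend work, but adding small features and tending the site turned out to be fun, so I have kept maintaining it. Roughly, the work included adding &lt;code class=&quot;language-text&quot;&gt;time to read&lt;/code&gt; to posts, using Giscus for comments, migrating to Gatsby v5, optimizing the CI/CD pipeline, and getting a Lighthouse score of 100. I tried many things and learned a lot along the way, which was fun. I have only ever studied React casually, so I developed with the sole goal of making things work rather than fully understanding the code. Next year, partly to broaden my frontend knowledge, I should study React and something like Svelte.&lt;/p&gt;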
&lt;h3 id=&quot;rust&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#rust&quot; aria-label=&quot;rust permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Rust&lt;/h3&gt;
&lt;p&gt;Apart from briefly studying it last year, I had nowhere to use it. Then, while working, I wondered whether serving ML models with Rust + gRPC would be fast, so I &lt;a href=&quot;https://github.com/kozistr/catboost-server-rs&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;built&lt;/a&gt; a gRPC server that serves a CatBoost model. As expected, since it uses gRPC and does not go through Python bindings, the speed difference from a RESTful API server seems large. I went around evangelizing(?) Rust, but Toss does not yet have much need for serving tree-based models in real time, and it is a JVM-friendly place, so shipping this to production will be hard. Still, I would like to try someday if I get the chance.&lt;/p&gt;
&lt;h2 id=&quot;회사&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%ED%9A%8C%EC%82%AC&quot; aria-label=&quot;회사 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Company&lt;/h2&gt;
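&lt;p&gt;Here too my title is Data Scientist, but there were few boundaries to the work. Beyond data analysis and model development, I did everything from managing data-generation pipelines to data analysis, model development, and server development. Personally, being able to handle the engineering as well as the modeling was fun, both for the completeness of the work and for how much I could learn. But as time passed, I also grew afraid that I had fewer chances to use what I am best at and less time to study. In fact, I had carried this worry through my previous companies too, and it was a puzzle I never solved. Moving to Toss was meant to settle it, and after experiencing several companies and organizations, I was able to find the answer to some extent this time.&lt;/p&gt;
&lt;p&gt;So the work-related change I mentioned above is this &lt;code class=&quot;language-text&quot;&gt;choice&lt;/code&gt;. I joined my current company with many worries and hypotheses, and after watching and testing them against many incidents(?) over the past year, I feel I have roughly found the answer that eluded me for years. More concretely, it is an answer to how well, and in what way, I can endure and compromise on the trade-off between what I want and the strengths and weaknesses of the company and team. What changes once I know the answer? I think it makes possible a practical remedy for the fear mentioned above, and more satisfying choices. It is hard to write in detail, but I need to push myself harder so that I do not sink into a comfort zone. It is a riskier choice, but I picked what I believe is best given the trade-offs, and I have moved to a new team.&lt;/p&gt;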
&lt;h2 id=&quot;취미&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%B7%A8%EB%AF%B8&quot; aria-label=&quot;취미 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Hobbies&lt;/h2&gt;
&lt;h3 id=&quot;운동&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9A%B4%EB%8F%99&quot; aria-label=&quot;운동 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Exercise&lt;/h3&gt;
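&lt;p&gt;To be honest, this was a goal from last year that I had not given any thought to, but in the fall I started climbing on a friend&apos;s recommendation. At first I just tagged along without thinking much about it, but solving bouldering problems is fun and it wrecks(?) my forearms nicely, so these days I show up almost once a week. I tend to go all in once I get hooked on something, so lately I even look up climbing videos on YouTube, and people around me say that on top of being a gym rat I am now a climbing rat(?), which is quite something.&lt;/p&gt;
&lt;p&gt;I mostly go to the Wangsimni and Seongsu branches of Hook Climbing and am working on navy-grade problems. The next grade, purple, feels like an insurmountable wall, so my goal for next year is to keep at it and become a purple climber.&lt;/p&gt;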
&lt;h3 id=&quot;천체관측&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%B2%9C%EC%B2%B4%EA%B4%80%EC%B8%A1&quot; aria-label=&quot;천체관측 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Stargazing&lt;/h3&gt;
&lt;p&gt;As it happens, someone on my team at work does astronomical observation as a hobby, so in the fall I tagged along on an observing trip to Hongcheon. We planned to shoot either &lt;a href=&quot;https://ko.wikipedia.org/wiki/%EC%82%BC%EA%B0%81%ED%98%95%EC%9E%90%EB%A6%AC_%EC%9D%80%ED%95%98&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;M33&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/Pleiades&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;M45&lt;/a&gt;, but M33 would not come into view until around dawn, so we shot M45, which was already visible. Because it was a day trip we only managed about 100 minutes of exposure, yet the result was roughly the picture I wanted and I was extremely pleased.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/984854a5da9fa85cd0ff7d036c5d5c95/c3299/m45.jpg&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 91.8918918918919%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAASABQDASIAAhEBAxEB/8QAGAABAAMBAAAAAAAAAAAAAAAAAAECAwX/xAAUAQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIQAxAAAAHiXpqZgrISD//EABgQAQEAAwAAAAAAAAAAAAAAAAECABAg/9oACAEBAAEFAsRN1bXP/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAwEBPwEf/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAgEBPwEf/8QAGhAAAAcAAAAAAAAAAAAAAAAAAAEQESAhQf/aAAgBAQAGPwIWpPkf/8QAGRAAAgMBAAAAAAAAAAAAAAAAAREAECBh/9oACAEBAAE/IYdQK+SFn//aAAwDAQACAAMAAAAQ9xgB/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAwEBPxAf/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAgEBPxAf/8QAGxABAAICAwAAAAAAAAAAAAAAAQAQESFBYXH/2gAIAQEAAT8QDeOYKQkH2hxDdSiXoi0V/9k=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/984854a5da9fa85cd0ff7d036c5d5c95/1c72d/m45.jpg&quot;
        srcset=&quot;/static/984854a5da9fa85cd0ff7d036c5d5c95/a80bd/m45.jpg 148w,
/static/984854a5da9fa85cd0ff7d036c5d5c95/1c91a/m45.jpg 295w,
/static/984854a5da9fa85cd0ff7d036c5d5c95/1c72d/m45.jpg 590w,
/static/984854a5da9fa85cd0ff7d036c5d5c95/a8a14/m45.jpg 885w,
/static/984854a5da9fa85cd0ff7d036c5d5c95/fbd2c/m45.jpg 1180w,
/static/984854a5da9fa85cd0ff7d036c5d5c95/c3299/m45.jpg 2816w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
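&lt;p&gt;Even without doing any observing, just lying back in a chair and staring blankly at the sky, you can see a huge number of stars with the naked eye. It felt like zoning out at the stars(?) the way you zone out in front of a campfire. Below is a quick shot of the sky taken with an iPhone; in person there are far more stars.&lt;/p&gt;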
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/61f1e4e61ef069ead18c355da7f5428f/12609/sky.jpg&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 133.1081081081081%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAbABQDASIAAhEBAxEB/8QAFwABAQEBAAAAAAAAAAAAAAAAAAIBBf/EABYBAQEBAAAAAAAAAAAAAAAAAAABAv/aAAwDAQACEAMQAAAB56m5LC7sCgf/xAAWEAEBAQAAAAAAAAAAAAAAAAARIAD/2gAIAQEAAQUCpzP/xAAUEQEAAAAAAAAAAAAAAAAAAAAg/9oACAEDAQE/AR//xAAUEQEAAAAAAAAAAAAAAAAAAAAg/9oACAECAQE/AR//xAAUEAEAAAAAAAAAAAAAAAAAAAAw/9oACAEBAAY/Ak//xAAZEAEAAgMAAAAAAAAAAAAAAAABABARIFH/2gAIAQEAAT8hvESKdhQg6f/aAAwDAQACAAMAAAAQ/wDljf/EABURAQEAAAAAAAAAAAAAAAAAABAR/9oACAEDAQE/ECH/xAAUEQEAAAAAAAAAAAAAAAAAAAAg/9oACAECAQE/EB//xAAYEAEBAQEBAAAAAAAAAAAAAAABABExIf/aAAgBAQABPxDtlsaNsIDiBkL32Q1JiOX/2Q==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/61f1e4e61ef069ead18c355da7f5428f/1c72d/sky.jpg&quot;
        srcset=&quot;/static/61f1e4e61ef069ead18c355da7f5428f/a80bd/sky.jpg 148w,
/static/61f1e4e61ef069ead18c355da7f5428f/1c91a/sky.jpg 295w,
/static/61f1e4e61ef069ead18c355da7f5428f/1c72d/sky.jpg 590w,
/static/61f1e4e61ef069ead18c355da7f5428f/a8a14/sky.jpg 885w,
/static/61f1e4e61ef069ead18c355da7f5428f/fbd2c/sky.jpg 1180w,
/static/61f1e4e61ef069ead18c355da7f5428f/12609/sky.jpg 3000w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2 id=&quot;마무리&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%A7%88%EB%AC%B4%EB%A6%AC&quot; aria-label=&quot;마무리 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;Still, I achieved the goals I set last year, and personally I think it was a productive year. My alternative military service ends next fall, and I will have decisions to make about both school and work. I am curious what fun(?) things will happen next year, but for now let&apos;s leave it at that.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[MaxViT - Multi-Axis Vision Transformer]]></title><description><![CDATA[TL;DR paper : arXiv code : github Related Work GC ViT Introduction Looking at recent vision transformer research, many ViT studies focus on capturing global context well, and this…]]></description><link>http://kozistr.tech/2022-08-24-maxvit/</link><guid isPermaLink="false">http://kozistr.tech/2022-08-24-maxvit/</guid><pubDate>Wed, 24 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;paper : &lt;a href=&quot;https://arxiv.org/pdf/2204.01697.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;code : &lt;a href=&quot;https://github.com/google-research/maxvit&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;related-work&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#related-work&quot; aria-label=&quot;related work permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Related Work&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2206.09959.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;GC ViT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;introduction&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#introduction&quot; aria-label=&quot;introduction permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Looking at recent vision transformer research, many ViT studies focus on capturing global context well. This work proposes an efficient and scalable multi-axis attention that, according to the authors, runs in linear complexity even for arbitrary image sizes while still capturing global context well.&lt;/p&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/49869b48d19d414a8dc7612cb39abea3/84ee5/architecture.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 59.45945945945946%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAIAAADtbgqsAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAB9ElEQVR42i1S2XLbMAzU/39P+9I8eiYzmTjjJK7iQ4cpUjwkUgd1H7balRI8QBAIcIFdOv8269ou8IKmbmCu+5UXhZTycrlUVRXHjJCormtCiNZ6nmcuxLJ1OcPQt02rtDiFLlMUzcYYu1mWZegpNluDvLC4zBSG8MZWwzg6mTHe1f+8Hv68/P70DkZrxhgQeBxzzhFQSgGVpimNolTr2/F83b2Ra4CkM45j1/VSiys5xYpW1v6g5TmmqDcDXlmWCNq2bXJbMLUhD86yLNM0jeOcmQI4fT90XTdPM+qEENg8z/NhGFCDS6SQWqf90E/ztCwPBwQYoxNJKfEJjXQiUF3VteJccE4iKrjEHFmeM66kFBGhiUoxyP1+R/M9UTw6Px3ffr0eDzJ6TRJdlKX//KwiEgbx6S/JMs253O1vWZG7H97Nj4GHRmeaZqMV83bu+9PB/UzouzFZaW243yeMMqouXzcQIGXy8kEgoXcm5MaRwSLO47FgvcoWlS1LW8JjZyxZlXalq2nAVrdZaau1cuWv7vseZDnYAzz5vh9RqpQKgyDw/TiOoQqkAmGQHT5J1OrV6kGkUgkGdBChFKriJeGDZxQzJrcSCP59Ir5NokchB+VxQBlztn8KEGQRbwGaV4QgDPE2MCGmbbfJsc4qddP0m/0HDcCSskwGo+EAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/49869b48d19d414a8dc7612cb39abea3/fcda8/architecture.png&quot;
        srcset=&quot;/static/49869b48d19d414a8dc7612cb39abea3/12f09/architecture.png 148w,
/static/49869b48d19d414a8dc7612cb39abea3/e4a3f/architecture.png 295w,
/static/49869b48d19d414a8dc7612cb39abea3/fcda8/architecture.png 590w,
/static/49869b48d19d414a8dc7612cb39abea3/efc66/architecture.png 885w,
/static/49869b48d19d414a8dc7612cb39abea3/84ee5/architecture.png 1076w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The architecture is a hierarchical design that does not differ much from other work. The difference is in the block module, which consists of three main components: &lt;code class=&quot;language-text&quot;&gt;MBConv&lt;/code&gt; -&gt; &lt;code class=&quot;language-text&quot;&gt;Block Attention&lt;/code&gt; -&gt; &lt;code class=&quot;language-text&quot;&gt;Grid Attention&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In short, &lt;code class=&quot;language-text&quot;&gt;Block Attention&lt;/code&gt; is the module for local context and &lt;code class=&quot;language-text&quot;&gt;Grid Attention&lt;/code&gt; is the module for global context.&lt;/p&gt;
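&lt;p&gt;A minimal sketch, in plain Python, of how the two attention types might group positions on a feature map. The helper names block_groups and grid_groups, the 4 x 4 map, and the window and grid size of 2 are illustrative assumptions rather than code from the MaxViT repository. The grid grouping assumes a uniform grid whose members are evenly strided across the map, as the paper describes.&lt;/p&gt;

```python
def block_groups(height, width, p):
    """Group (row, col) positions into non-overlapping p x p windows."""
    groups = {}
    for r in range(height):
        for c in range(width):
            key = (r // p, c // p)  # index of the window this position falls in
            groups.setdefault(key, []).append((r, c))
    return list(groups.values())


def grid_groups(height, width, g):
    """Group positions so each group holds g x g positions spread over the map."""
    stride_r, stride_c = height // g, width // g
    groups = {}
    for r in range(height):
        for c in range(width):
            key = (r % stride_r, c % stride_c)  # offset of the position inside one stride
            groups.setdefault(key, []).append((r, c))
    return list(groups.values())


# Positions that attend to each other in the first group of each scheme.
print(block_groups(4, 4, 2)[0])  # neighbouring positions: local context
print(grid_groups(4, 4, 2)[0])   # strided positions: global context
```

&lt;p&gt;In this toy setting the first block group holds four adjacent positions, while the first grid group holds four positions spaced two apart, which is one way to picture why the former mixes local information and the latter mixes information across the whole map.&lt;/p&gt;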
&lt;h3 id=&quot;attention&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#attention&quot; aria-label=&quot;attention permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Attention&lt;/h3&gt;
&lt;p&gt;The self-attention operation is location-unaware (i.e. it is not translation equivariant, an inductive bias that convolutions have). To address this, earlier work replaced vanilla self-attention with relative self-attention, which mitigates the problem to some degree. This work also uses a pre-normalized relative self-attention module.&lt;/p&gt;
&lt;h3 id=&quot;multi-axis-attention&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#multi-axis-attention&quot; aria-label=&quot;multi axis attention permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Multi-axis Attention&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/9ae350be573c6c4006c0a7eb87b9b383/18539/multi_axis_self_attention.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 49.32432432432432%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAKCAIAAAA7N+mxAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABmElEQVR42j1R23KbMBTk//8m05d02kybOLUNcZKm9oOdYMmAbohBIAMZbBM7K5HpPhxJe86em4Lz+Xy5XGpbt9ZmYcSiGVvearps2u79vRuGAadm6eZxvv33tHmOTKGbth1VQdM0xhhCiNGa/PhJb76T8JtYh1JpznldVWmWpeQ1mvz6G00X00mapEmS1LVF3uB4PPaHfkuILvLH1V34Mr2+Wy9WiRRcqbxtW0JoRt6eprfLh9lL+CehFEzf9048NqCkrOoqXN9PlrOr33G4EpUprbWn06mqKpbScHa/iOaLcC6lKsvy7PElHq21tZDw5ofD8T8JnIZBlybloqzth+e+ZhZCFEWxjWNKKWcOGAqtuUuyA4kJEWDKUud5oTXKIr8xJTYVcAehlMoYwyoyj9QtZQcePuTyDJD5YJ76NyTBbkdB+5IMO4eIbB1AYjNxHG82r8iBOyJVnr/FMZoFzzhzlXFwIXKFrWHHHBaF3fBSjEkRgmIYCmKkAIOnFNKLnZujE+5ZFIeu8cBXdV3XNHtn987uPUbXJ80sHDo4+szCAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/9ae350be573c6c4006c0a7eb87b9b383/fcda8/multi_axis_self_attention.png&quot;
        srcset=&quot;/static/9ae350be573c6c4006c0a7eb87b9b383/12f09/multi_axis_self_attention.png 148w,
/static/9ae350be573c6c4006c0a7eb87b9b383/e4a3f/multi_axis_self_attention.png 295w,
/static/9ae350be573c6c4006c0a7eb87b9b383/fcda8/multi_axis_self_attention.png 590w,
/static/9ae350be573c6c4006c0a7eb87b9b383/efc66/multi_axis_self_attention.png 885w,
/static/9ae350be573c6c4006c0a7eb87b9b383/18539/multi_axis_self_attention.png 1074w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Running full self-attention over the entire space (all patches at once) is costly in complexity, so to address this the attention is reportedly decomposed into two sparse forms, &lt;strong&gt;local&lt;/strong&gt; and &lt;strong&gt;global&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Given an input feature map &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;∈&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;X \in \mathbb{R}^{H \times W \times C}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7224em;vertical-align:-0.0391em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07847em;&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;∈&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8413em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8413em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; 
style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, where earlier approaches applied attention over the flattened spatial dimension &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;HW&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, this work splits the map into partitions of size &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;P \times P&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span 
class=&quot;strut&quot; style=&quot;height:0.7667em;vertical-align:-0.0833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, reshaping it to &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mfrac&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;/mfrac&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mfrac&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;/mfrac&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;(\frac{H}{P} \times \frac{W}{P}, P \times P, C)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2173em;vertical-align:-0.345em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mopen nulldelimiter&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mfrac&quot;&gt;&lt;span 
class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8723em;&quot;&gt;&lt;span style=&quot;top:-2.655em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.23em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;frac-line&quot; style=&quot;border-bottom-width:0.04em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.394em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.345em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose nulldelimiter&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2173em;vertical-align:-0.345em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mopen nulldelimiter&quot;&gt;&lt;/span&gt;&lt;span 
class=&quot;mfrac&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8723em;&quot;&gt;&lt;span style=&quot;top:-2.655em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.23em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;frac-line&quot; style=&quot;border-bottom-width:0.04em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.394em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.345em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose nulldelimiter&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, and applies attention within each partition. In this way, block attention is used to capture local context.&lt;/p&gt;
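&lt;p&gt;The partition described above can be sketched in plain Python as follows. The function names, the toy 4 x 4 x 1 feature map, and P = 2 are assumptions made for illustration, not the authors&apos; implementation. The pair count only tallies query-key scores and ignores channels and heads.&lt;/p&gt;

```python
def block_partition(feature_map, p):
    """Rearrange an H x W x C nested list into (H/p * W/p, p * p, C)."""
    height, width = len(feature_map), len(feature_map[0])
    if height % p or width % p:
        raise ValueError("H and W must be divisible by the window size")
    windows = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            # Collect the p * p tokens of one window in row-major order.
            window = [feature_map[top + dr][left + dc]
                      for dr in range(p) for dc in range(p)]
            windows.append(window)
    return windows


def attention_pairs(num_groups, tokens_per_group):
    """Number of query-key scores when attention stays inside each group."""
    return num_groups * tokens_per_group ** 2


# Toy 4 x 4 map with C = 1; each token stores its flat index.
H, W, P = 4, 4, 2
x = [[[r * W + c] for c in range(W)] for r in range(H)]
blocks = block_partition(x, P)

print(len(blocks), len(blocks[0]), len(blocks[0][0]))  # 4 4 1
print(attention_pairs(1, H * W))            # full attention over all HW tokens
print(attention_pairs(len(blocks), P * P))  # block attention inside each window
```

&lt;p&gt;With these toy sizes full attention scores 256 pairs and block attention scores 64. In general the block form costs HW * P^2 scores, which grows linearly in HW for a fixed P.&lt;/p&gt;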
&lt;p&gt;However, using only local attention like this does not work well on huge-scale datasets, so the authors devised a simple yet effective way to perform sparse global attention (&lt;code class=&quot;language-text&quot;&gt;grid attention&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Unlike local attention, which partitions &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;HW&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; into small windows, grid attention partitions over a grid. input feature map &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;X&lt;/mi&gt;&lt;mo&gt;∈&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;X \in \mathbb{R}^{H \times W \times C}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7224em;vertical-align:-0.0391em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07847em;&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;∈&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8413em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8413em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;가 있을 때, &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;G \times G&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span 
class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7667em;vertical-align:-0.0833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;G&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; uniform grid로 &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mfrac&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;/mfrac&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mfrac&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;/mfrac&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;(G \times G, \frac{H}{G} \times \frac{W}{G}, C)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span 
class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2173em;vertical-align:-0.345em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mopen nulldelimiter&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mfrac&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8723em;&quot;&gt;&lt;span style=&quot;top:-2.655em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;G&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.23em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;frac-line&quot; style=&quot;border-bottom-width:0.04em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.394em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.345em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose 
nulldelimiter&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2173em;vertical-align:-0.345em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mopen nulldelimiter&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mfrac&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8723em;&quot;&gt;&lt;span style=&quot;top:-2.655em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;G&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.23em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;frac-line&quot; style=&quot;border-bottom-width:0.04em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.394em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.345em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose 
nulldelimiter&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;To keep the computation balanced between the two attention types, the authors say they adopted &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;P&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; = &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;G&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;G&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; = 7, as in Swin Transformer.&lt;/p&gt;
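&lt;p&gt;The difference between the two partitioning schemes described above can be sketched with plain reshapes. This is only a minimal NumPy illustration, not the authors&apos; code: the function names &lt;code class=&quot;language-text&quot;&gt;window_partition&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;grid_partition&lt;/code&gt; are hypothetical, H and W are assumed to be divisible by P and G, and the final axis order (groups first, attended tokens second) is an assumption made so that both functions return the same layout.&lt;/p&gt;

```python
import numpy as np


def window_partition(x, p):
    """Local attention: split an (H, W, C) map into non-overlapping P x P windows."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    # Result: (num_windows, P*P, C); each window holds spatially adjacent pixels.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)


def grid_partition(x, g):
    """Grid attention: split an (H, W, C) map with a uniform G x G grid."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    # Result: (H/G * W/G, G*G, C); each group holds pixels that are H/G rows
    # and W/G columns apart, i.e. a sparse sample of the whole map.
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)


if __name__ == "__main__":
    feature_map = np.arange(16).reshape(4, 4, 1)  # toy 4 x 4 map, one channel
    print(window_partition(feature_map, 2)[0, :, 0])  # neighbouring pixels
    print(grid_partition(feature_map, 2)[0, :, 0])    # strided pixels
```

&lt;p&gt;On the toy 4 x 4 map, the first window contains a 2 x 2 block of neighbours, while the first grid group contains pixels taken with stride 2 across the whole map, which is why attending within a grid group mixes information globally at the same cost as a local window when P = G.&lt;/p&gt;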
&lt;h2 id=&quot;performance&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#performance&quot; aria-label=&quot;performance permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Performance&lt;/h2&gt;
&lt;h3 id=&quot;imagenet-1k-benchmark&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#imagenet-1k-benchmark&quot; aria-label=&quot;imagenet 1k benchmark permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;ImageNet-1K benchmark&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ea0fea681df81601b961119a68c6defa/e515d/imagenet1k_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 37.16216216216216%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAIAAACHqfpvAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABE0lEQVR42l2Q226EIBCGff+n22zSpGnazYqInFFRlIHRjrtX7XcBhP8QhibGuOd8HscSwuzCJIbNecM6xVgtZSHiDGmdlInWOS7SOAchdPtMKTVSStoka+/32/3j5ic7jl4MgkrP8+z73lsnuexYx54P9vjp2+ek5cC7K2yMWdf18/urVV0/8IKw455rRkQKG6O18yr4jrcUQchYCx5YX2rjrJ3jYkZ/1i1OsdR6HjTE8Q4rpWiIeVYheIJuEC+JIE8jhMg5Y8G6l+MvZOWcL8tcADAX4p/hejaNB3uGArXSc5BW8gEAhbXWwQXYqP9S38ALOjTk27YtA6wphRDeXfQLVyMA1ZNKEq3WWlKdc8uLWusv7m2PX3YIzJEAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/ea0fea681df81601b961119a68c6defa/fcda8/imagenet1k_benchmark.png&quot;
        srcset=&quot;/static/ea0fea681df81601b961119a68c6defa/12f09/imagenet1k_benchmark.png 148w,
/static/ea0fea681df81601b961119a68c6defa/e4a3f/imagenet1k_benchmark.png 295w,
/static/ea0fea681df81601b961119a68c6defa/fcda8/imagenet1k_benchmark.png 590w,
/static/ea0fea681df81601b961119a68c6defa/efc66/imagenet1k_benchmark.png 885w,
/static/ea0fea681df81601b961119a68c6defa/c83ae/imagenet1k_benchmark.png 1180w,
/static/ea0fea681df81601b961119a68c6defa/e515d/imagenet1k_benchmark.png 1430w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It achieves the best performance among models of comparable size.&lt;/p&gt;
&lt;h3 id=&quot;pretrained-on-the-large-scale-datasets&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#pretrained-on-the-large-scale-datasets&quot; aria-label=&quot;pretrained on the large scale datasets permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Pretrained on the large-scale datasets&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/48b5f16b66bc6ada200477be909080f4/9b379/large_scale_pretrained.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 56.75675675675676%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAIAAADwazoUAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABnklEQVR42jVR23KsIBD0/78u5yUP2XU1XiIgdxAR1E2jOVhlQff0dDNU3vtlWXJOi5Ji+NGMe8EtY5qQFGMIwTl3HMe+rsFaIaQL0Vi77/v7/a4YY9bZGLax+fn899HWn5pT79xNE0K0Mas2lkuj3cxmqTSo8zyLWAjhvNfKKD7FqBfv4Hb+X4zNWkmvTMwHys/36UPpiyyXmItZ8m1fjmNzPghlKJtzSqAvMTPGoA7HLUWLL8e7bxETQpWmW9ogg3jLe4FhctHTNFlrsclHDjnAOcYIxBqbc6601kbPOCLtse+YHOh1XfFPKSmlUAAKEZDxpkJYMMWccoUG27WAYzywklLeRVjwBwWxd54SOgyDFHKNK94IeIXWmMo8czjQS0wpxR+eiINGAAklzllSNvTqXuIs3pfYkJaz0ugHDSaEq9wDg16BkxI5gdtC/T0V6PLOePq+65qmbZoGtrB/PB6v+gWjcRwBdt/f4zD2fd+27auuAT6+vshEqr7vns8noK4rXPNqxmGo6xqBueBYiCSkwO2AICNjFGZSCGPML/YgbUkhSG2eAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/48b5f16b66bc6ada200477be909080f4/fcda8/large_scale_pretrained.png&quot;
        srcset=&quot;/static/48b5f16b66bc6ada200477be909080f4/12f09/large_scale_pretrained.png 148w,
/static/48b5f16b66bc6ada200477be909080f4/e4a3f/large_scale_pretrained.png 295w,
/static/48b5f16b66bc6ada200477be909080f4/fcda8/large_scale_pretrained.png 590w,
/static/48b5f16b66bc6ada200477be909080f4/efc66/large_scale_pretrained.png 885w,
/static/48b5f16b66bc6ada200477be909080f4/9b379/large_scale_pretrained.png 951w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;With ImageNet-21K pre-training, MaxViT performs better, but on the larger JFT-300M, CoAtNet comes out ahead. (At larger image resolutions, the two are comparable.)&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#conclusion&quot; aria-label=&quot;conclusion permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;(I may be wrong, but) earlier work also applied attention over &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;P \times P&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7667em;vertical-align:-0.0833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; partitions, as in local attention; what I found interesting is that this paper handles global context through dilated global attention.&lt;/p&gt;
&lt;p&gt;Conclusion: good, good.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[GC ViT - Global Context Vision Transformers]]></title><description><![CDATA[TL;DR Looking at recent computer vision architectures, approaches keep appearing that boost performance by using text information as extra training data rather than images alone, or that ensemble multiple models …]]></description><link>http://kozistr.tech/2022-08-19-gcvit/</link><guid isPermaLink="false">http://kozistr.tech/2022-08-19-gcvit/</guid><pubDate>Fri, 19 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
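&lt;p&gt;Looking at recent computer vision architectures, approaches keep appearing that boost performance by using text information as extra training data rather than images alone, or that ensemble multiple models, such as &lt;code class=&quot;language-text&quot;&gt;Model soups&lt;/code&gt;. This paper is yet another hybrid model, and since the title says it is a ViT that takes Global Context into account, I read it out of curiosity about how it differs from the existing Swin Transformer and Focal Transformer.&lt;/p&gt;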
&lt;ul&gt;
&lt;li&gt;paper : &lt;a href=&quot;https://arxiv.org/pdf/2206.09959.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;code : &lt;a href=&quot;https://github.com/NVlabs/GCVit&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;related-work&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#related-work&quot; aria-label=&quot;related work permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Related Work&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2103.14030.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Swin Transformer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://proceedings.neurips.cc/paper/2021/file/fc1a36821b02abbd2503fd949bfc9131-Paper.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Focal Transformer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/1309ba7ef90df79051e499783b09a858/af756/architecture.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 25%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAIAAADKYVtkAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABEElEQVR42iXKPU7DMACG4RyGG3AFJk7AjlgYGREH4AyIGSEWTkGrSi1QCq0bO7bjJCTEwanbxPFPmhCVd3o+6fN8AOBquRNc8C1ZMufavu9KLvEXs8YNrrYKfdBGma7vtNJwQeud2ndd3/fe5fX5xdnx8+MD/qYYjkgwtfmK0Lm/fgnpq+OAkE+ExjiYmNxPwkWAZwEaZWnMuPJub65OT47unu6z4md4UPq2F0GSAIgmLHw/2MdkSsjMCZynaxrOERwXPEuF8Yo4i+JIyN/WtarRWhtjjbW2Vkpr+19V1XWtmqYZrFQzzMHOOc+HMI7jiEUAAMZYSCmEKDiEEKI0LMtSSimE4EWxkXJzqOC8qqo/h1cI0CkWc/4AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/1309ba7ef90df79051e499783b09a858/fcda8/architecture.png&quot;
        srcset=&quot;/static/1309ba7ef90df79051e499783b09a858/12f09/architecture.png 148w,
/static/1309ba7ef90df79051e499783b09a858/e4a3f/architecture.png 295w,
/static/1309ba7ef90df79051e499783b09a858/fcda8/architecture.png 590w,
/static/1309ba7ef90df79051e499783b09a858/efc66/architecture.png 885w,
/static/1309ba7ef90df79051e499783b09a858/c83ae/architecture.png 1180w,
/static/1309ba7ef90df79051e499783b09a858/af756/architecture.png 1639w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The overall design is a hierarchical structure in the vein of FocalNet and Swin Transformer; the differences are as follows.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Global Token Generator module&lt;/li&gt;
&lt;li&gt;Local / Global MSA&lt;/li&gt;
&lt;li&gt;Downsample module&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;global-query-generator&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#global-query-generator&quot; aria-label=&quot;global query generator permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Global Query Generator&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/31a3728d0a2d4f8e33cba4b3dcc962ee/2d2d6/global_query_generator.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 44.5945945945946%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAIAAAC9o5sfAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABT0lEQVR42k1R2XLCMAzM//9WHwppywOdKbSBXM5Jgq/Et2PqEK4deS3LXkkjB5crnHNa2WlydnJK64FJiIiSarl6xeUFwV08RSBCA8SDqXsqKJBKdf155BxhfD5Dj6qum6YdBvbIcBNrq9fZ2679Tiq03Sd9/o4p3WRh1PxWVVs3bVnWcZxmGaibbrqrH5VdhYoK5cWJ7o8FLFaMs59m+9ftCCFd11NKo+hYlNVr50+xMdYbYQaP2hompPaLScm4oAOjw9ieOt+/tfYpvg5hTuasspJoBuUIFSNGYM2x4XiS1EfgqRScKW20MZeluHPBsnmaJBzzEKUhStYMfMp6A+OV90X5RbKQ5h8+9dLls/I4jsx3JsTMUgkh/ZyV0t6XV7sFZ/ZvGOd8YY/geDjEcQIASJK4KADI8zSdj1mWYYQoIZQShCC5AiG8OBj736D/Hy7/w2Gk8dsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/31a3728d0a2d4f8e33cba4b3dcc962ee/fcda8/global_query_generator.png&quot;
        srcset=&quot;/static/31a3728d0a2d4f8e33cba4b3dcc962ee/12f09/global_query_generator.png 148w,
/static/31a3728d0a2d4f8e33cba4b3dcc962ee/e4a3f/global_query_generator.png 295w,
/static/31a3728d0a2d4f8e33cba4b3dcc962ee/fcda8/global_query_generator.png 590w,
/static/31a3728d0a2d4f8e33cba4b3dcc962ee/efc66/global_query_generator.png 885w,
/static/31a3728d0a2d4f8e33cba4b3dcc962ee/c83ae/global_query_generator.png 1180w,
/static/31a3728d0a2d4f8e33cba4b3dcc962ee/2d2d6/global_query_generator.png 1205w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This is the Global Token concept that the paper proposes for capturing Global Context better: instead of working on local patches, it compresses the entire input feature map to produce a global feature.&lt;/p&gt;
&lt;p&gt;It is computed at the start of each stage and fed in as the query for the Global Attention introduced below. The module design is simple: a Fused-MBConv followed by max pooling.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msup&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;F&lt;/mi&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x^{i} = FusedMBConv(x^{i - 1})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0747em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;se&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;MBC&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msup&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi&gt;P&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;l&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;g&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msup&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x^{i} = MaxPooling(x^{i})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0747em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;oo&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.01968em;&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8247em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3 id=&quot;global-self-attention&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#global-self-attention&quot; aria-label=&quot;global self attention permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Global Self Attention&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/2f0d3e7e6a222907ba38ce60e9072780/b5dee/global_self_attention.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 30.405405405405407%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAIAAABM9SnKAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABHElEQVR42iVPTU/DMAzt//8XPXABIY6IC+w64MRYWdd1TUqnjdLme8kSJxlpsewnW89P9stCDD54iNGFAME78CFGmPt4jXHqvY9Baq20TpSf03lIwsw4oMoZ0lhxPI2/i/eX1X4llRopnVZDGIWJ16sxth/Ht+J1WSwJo4RSY23GlSnxoE+fjmB07G6e8tvnO9y2COGkBA8bTBTp7FAKJR8W9/ljvqm3qMHpQKatO1Et+1LTb6HNx369aSsuBWHs/3IvrCSd/Pm6OKgPqMAlEUxIeXEuAwBr7WTbw/mstb4kn84l735CADdHGhPLmPB+eidJEpUdDl2xLtq2bZpmN0VVp6oqjNC2qnZ1PQwDpYxzzhgjJJmlgvOESso/qjROkp8xLK8AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/2f0d3e7e6a222907ba38ce60e9072780/fcda8/global_self_attention.png&quot;
        srcset=&quot;/static/2f0d3e7e6a222907ba38ce60e9072780/12f09/global_self_attention.png 148w,
/static/2f0d3e7e6a222907ba38ce60e9072780/e4a3f/global_self_attention.png 295w,
/static/2f0d3e7e6a222907ba38ce60e9072780/fcda8/global_self_attention.png 590w,
/static/2f0d3e7e6a222907ba38ce60e9072780/efc66/global_self_attention.png 885w,
/static/2f0d3e7e6a222907ba38ce60e9072780/c83ae/global_self_attention.png 1180w,
/static/2f0d3e7e6a222907ba38ce60e9072780/b5dee/global_self_attention.png 1237w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Looking at the architecture, each stage performs local attention followed by global attention. The difference is in the global attention step: its query is the set of query tokens that the &lt;code class=&quot;language-text&quot;&gt;Global Query Generator&lt;/code&gt; produces at the start of the stage.&lt;/p&gt;
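&lt;p&gt;To make the difference concrete, here is a minimal pure-Python sketch of the two attention calls. It is only an illustration, not the paper&apos;s implementation: the learned projections, multiple heads, and relative position bias are omitted, and the function names are hypothetical. It also assumes that the keys and values of global attention still come from the local window, while only the query is swapped for the global query tokens.&lt;/p&gt;

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    # Scaled dot-product attention over lists of token vectors.
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def local_attention(window):
    # Queries, keys and values all come from the same local window.
    return attention(window, window, window)


def global_attention(global_queries, window):
    # Assumption: only the query is replaced by the stage-level
    # global query tokens; keys and values stay local.
    return attention(global_queries, window, window)
```

&lt;p&gt;In this sketch the only thing that changes between the two calls is the first argument, which mirrors the description above: the same window supplies keys and values, and the query source decides whether the block is local or global.&lt;/p&gt;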
&lt;h3 id=&quot;downsample-module&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#downsample-module&quot; aria-label=&quot;downsample module permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Downsample module&lt;/h3&gt;
&lt;p&gt;The downsample module is largely the same as the module used in earlier work (EfficientNetV2). The difference is that it downsamples with a strided convolution rather than max pooling. The Fused-MBConv design is as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;D&lt;/mi&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\hat{x} = DW-Conv_{3 \times 3}(x)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7667em;vertical-align:-0.0833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3011em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.2083em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;G&lt;/mi&gt;&lt;mi&gt;E&lt;/mi&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mi&gt;U&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\hat{x} = GELU(\hat{x})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.05764em;&quot;&gt;GE&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;LU&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;S&lt;/mi&gt;&lt;mi&gt;E&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\hat{x} = SE(\hat{x})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.05764em;&quot;&gt;SE&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x = Conv_{1 \times 1}(\hat{x}) + x&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span 
class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3011em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2083em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.2222em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x = Conv_{3 \times 3, stride 2}(x)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;C&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;span 
class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;mpunct mtight&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x = LN(x)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;% DW-Conv : Depth-Wise Convolution
% SE : Squeeze and Excitation
% LN : Layer Normalization&lt;/p&gt;
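&lt;p&gt;As a small worked example of the strided step, the sketch below applies the standard convolution output-size formula to the 3x3, stride-2 convolution. The padding of 1 and the 56x56 starting resolution are assumptions made for illustration (they are not stated in the post), and the helper names are hypothetical.&lt;/p&gt;

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # Standard convolution output-size formula (floor division).
    # padding=1 is an assumption; the post only gives kernel and stride.
    return (size + 2 * padding - kernel) // stride + 1


def stage_resolutions(size, num_downsamples=3):
    # Track the spatial size after each downsample module.
    sizes = [size]
    for _ in range(num_downsamples):
        size = conv_out_size(size)
        sizes.append(size)
    return sizes


# 56 is a hypothetical feature-map size at the first stage.
print(stage_resolutions(56))  # prints [56, 28, 14, 7]
```

&lt;p&gt;Under these assumptions each downsample module halves the spatial resolution. The blocks before the strided convolution keep the size unchanged, which is what lets the residual addition in the 1x1 convolution step line up.&lt;/p&gt;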
&lt;h2 id=&quot;performance&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#performance&quot; aria-label=&quot;performance permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Performance&lt;/h2&gt;
&lt;h3 id=&quot;imagenet-1k-benchmark&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#imagenet-1k-benchmark&quot; aria-label=&quot;imagenet 1k benchmark permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;ImageNet-1K benchmark&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/c699a67ad9e5738a071352fb144e98c7/19a15/imagenet_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 50.67567567567568%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAKCAIAAAA7N+mxAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABn0lEQVR42jWR2Y7bMAxF/f9f1rcCSTHOGKl3K7Yca7d2WaUmKB8ESOeSIi8r55wxJqUUYnT6dFpZyfe+U5z7EKy1HxpTMqdSnEV3esGnR21PXTHGQGG1NuR4NW3/+9tigVqEpjmEQI7Dea/P0zMi0fb4dXNYHiNu/jRO22pZFowxmee+/quo3jFO6bpy5koofU7TxIXYx2Gs25M5chCplE8XlRKKVuj1OoUwlG541Za55FwKLnoXgPp5mb3WBC1UCKHexisQeBD4QqtpGvG2sR0zSVLOMCe4kK8SMca26wj0dWBleM65+OJ8zhe4UJLneeK8AJB6B35xzghcP8l932tj4Op9gMmtloweMNV1JXCkArOcs1JISnZO9xTDJ63UAqu0BgFjlHNySnqlCBSsD6HQ6udD7x2UMFBn33d4LZsLIV/woYdWYRlAKaVAORexNO9AUx3HgRZECJiC2rZ9Pp/DMHZdNwwD9AxuwzrgBNQ037fbDdD7/V7XlVJWIYTu93v9qL++aoxhUxv4D7FuJdb/MU/ThnF5X9cfiBlj/wBr8DRwVd1SrAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/c699a67ad9e5738a071352fb144e98c7/fcda8/imagenet_benchmark.png&quot;
        srcset=&quot;/static/c699a67ad9e5738a071352fb144e98c7/12f09/imagenet_benchmark.png 148w,
/static/c699a67ad9e5738a071352fb144e98c7/e4a3f/imagenet_benchmark.png 295w,
/static/c699a67ad9e5738a071352fb144e98c7/fcda8/imagenet_benchmark.png 590w,
/static/c699a67ad9e5738a071352fb144e98c7/efc66/imagenet_benchmark.png 885w,
/static/c699a67ad9e5738a071352fb144e98c7/c83ae/imagenet_benchmark.png 1180w,
/static/c699a67ad9e5738a071352fb144e98c7/19a15/imagenet_benchmark.png 1229w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Relative to its FLOPs and parameter count, GC ViT shows the best performance among the compared models.&lt;/p&gt;
&lt;h3 id=&quot;mscoco-benchmark&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#mscoco-benchmark&quot; aria-label=&quot;mscoco benchmark permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;MSCOCO benchmark&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/2d5743da26ed366ffbc3072880d5216b/58213/mscoco_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 59.45945945945946%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAIAAADtbgqsAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAByklEQVR42j1S2XKjMBDU//9YYu/D5sFJFRih++AQYAEGU856W7CbeZhSSTM93dMiTdMwxoZ+mMbRWss5r6pKCFEUhZTSOc9Y6ZzDE+7dHsZYYwwaSYzRenubb49tXZa16/vn8xlCAKLWGqWU0tse8zTd7/dpmnAYxxEHYox+P5245iGGMY6AXJYFPSF03mMsw0wMARfvnBAS839YEEDWdR2aEPqAc9/3yCBPaVlci5JSpXRZlrgfhqHrUj4ClAk68zynpujvw7ZtIPx6vUJoUYzntg1t24Bk13XIwI0x5eNMrLHnX2duWTd2oK2NAQQQ2xCQFUIrYGmtQP6HMBDTwiDy98eHqMQ0zwBD7bqu2C46UYedob1p6tTmPfbf1A3E+xQVAcTlclGVHB/TY1mNtaANVIiMt5jw2wb8gQWdyLiHtEM/UUqezmfu+CEMxsADTLA2eYttY8+YpKSEL9g2TAYZPGE08d69vb9xy5dtmecZpd9/vt2uDViJbV1hGlbTtu1hG9qQwYJIpUpalqwUVla+Sqr2xRidQkiB34RqrEbsYf5F+j8kzzOQybLs6/PrmuVMMF5xwcX1WtBksgJzBp93b6Eo/vcpxvgXW1qOj3tsUNAAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/2d5743da26ed366ffbc3072880d5216b/fcda8/mscoco_benchmark.png&quot;
        srcset=&quot;/static/2d5743da26ed366ffbc3072880d5216b/12f09/mscoco_benchmark.png 148w,
/static/2d5743da26ed366ffbc3072880d5216b/e4a3f/mscoco_benchmark.png 295w,
/static/2d5743da26ed366ffbc3072880d5216b/fcda8/mscoco_benchmark.png 590w,
/static/2d5743da26ed366ffbc3072880d5216b/efc66/mscoco_benchmark.png 885w,
/static/2d5743da26ed366ffbc3072880d5216b/58213/mscoco_benchmark.png 902w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;On the object detection task as well, it performs well compared with models of a similar scale. For the base model, though, ConvNeXt is better.&lt;/p&gt;
&lt;h3 id=&quot;various-components-benchmark&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#various-components-benchmark&quot; aria-label=&quot;various components benchmark permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Various Components benchmark&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8f345a3bdca829a972cf828f8a6aee03/1ddef/various_components_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 67.56756756756756%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAIAAACgpqunAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACRUlEQVR42jVTa3PaMBD0//87LZ+SfmgaZghJ/JCfkixjybaMscGmTIxxB9IVaY8BxHlv73Z1WLvdrq7r7XZrPutt0zRhGAa+7xHiui7yRaFs2yaE6LpWSnqu+46w7XmerRvi07y+4nq9aq2DIIjCqKp013UiRQgw5lK2bceS2POIEBnKLBD0fY+GaNK27TiOp9OJc+46zmWeOaNRnCCZJEme50BWZbF+fd3vD+hkDcNgoK5LKQU9uk3TVJYFId48/wFlGIXn8xmiskx8fIzHYXBd7/fp9FXcSymTOE5TXqKoKOf5UmsNxuvttmsa9JwuF7AoKcfzGcUmM02mGO+h75VSPE0PfQ8HLoA2DTI4d22bCoFiiIdbH+N4PJpJocgU66p6fXtbr9fL5RJihBCHw+HlZfXw8LDf7+H7YrGQSjm2/ePxsaw0o/Tb9wWGBbWFad/v3hPiG5OjCMbArdVqBSLc0PPzL8aY4zhgZJz7JvMMgzCg9fPpKcNg01TdWUWWwXYc2q7TuoIJ21rzVMAOkMK2tt1BAmcMAMtx7CCMwL3JNpQm0ANQHMewEBFFoSqKTKRwxGA2uczzlHOsTVlVFr6wOkKYZ/AJ3bAVGUBK8nvUZrEU5BRFwWgcRvEmz33iYxYLdyvukXIGzZgHIEhCZ0oZWsMU3w+kKkAaBD72DFcFLtyqhQGxfUAzmuD2YTvxYQrxPA85LGmSUEDBjoaAYe0xIw74U1gJZbDly498s6kq8wvrnf0LoVRR1xrFRm3KpVSf/+MvPwLgnwLrNt8AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/8f345a3bdca829a972cf828f8a6aee03/fcda8/various_components_benchmark.png&quot;
        srcset=&quot;/static/8f345a3bdca829a972cf828f8a6aee03/12f09/various_components_benchmark.png 148w,
/static/8f345a3bdca829a972cf828f8a6aee03/e4a3f/various_components_benchmark.png 295w,
/static/8f345a3bdca829a972cf828f8a6aee03/fcda8/various_components_benchmark.png 590w,
/static/8f345a3bdca829a972cf828f8a6aee03/1ddef/various_components_benchmark.png 635w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;They also checked how much the proposed Global Token contributes: removing global self-attention reportedly caused the largest performance drop.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#conclusion&quot; aria-label=&quot;conclusion permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Personally, I enjoyed this study: generating a Global Token per stage and then applying global attention is intuitive and really simple, and the performance relative to FLOPs is much better too. I also think this design is cleaner than Swin Transformer or Focal Transformer.&lt;/p&gt;
&lt;p&gt;Conclusion : good good&lt;/p&gt;</content:encoded></item><item><title><![CDATA[TitaNet - Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context]]></title><description><![CDATA[TL;DR paper : arXiv code : github Related Work contextnet paper ecapa-tdnn paper angular softmax paper Architecture   architecture와 비슷한데, d…]]></description><link>http://kozistr.tech/2022-08-18-titanet/</link><guid isPermaLink="false">http://kozistr.tech/2022-08-18-titanet/</guid><pubDate>Thu, 18 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;paper : &lt;a href=&quot;https://arxiv.org/pdf/2110.04410v1.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;code : &lt;a href=&quot;https://github.com/NVIDIA/NeMo&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;related-work&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#related-work&quot; aria-label=&quot;related work permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Related Work&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2005.03191&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;contextnet paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2005.07143&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;ecapa-tdnn paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1806.03464&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;angular softmax paper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 582px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/011c0e30e8c5868e8e9afcd77fca3909/7c1cd/architecture.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 127.02702702702702%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAZCAIAAAC+dZmEAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAEp0lEQVR42oVU+2/TVhTuHzYhlYn9hqYxDQ2hlYptCNB+QEKTJqbuPYFg7SisUNrSdtDybEFAH9CmadI83DRPSBrHduwmcVI/8rCdOH473nFaxiZN2pVjSyfnu/c793zf6XEcp9PpOB3H0CymUm/UWo7ldGw39r+rB34NiV3M/jQf/25u64dHyPfzyQvrxIiqKt19Lbtj7b8d+z/AnFCaTZ28G/tidK1v6Pkn48in86+/qdb4UqnEVKqQoLQVy3KRlrtsy7bhA9xcMC8WZ1L9dxL9o77j44HPphPHn2cHFKVtmQ7O+17mLi6lfl9GL2bZFaiui3WXYZguuFIlJgL9Y57TY2tfjnlOja1/vpD+udlqaqoTpe+NRT6+6T1+CzkSLIy7hTj7l7EPrtVq+XzesjRdU1RFdjqmoeuy0rZNJ1y8fQ3pve796Dpy0Ede64L3KzdMs0dVVYbZJXeowDY950s9DWTSBV4Uag1RtG2HFSh0d2M1Nvu64CnXsY7tQL17nXBPVlWN4ziW5RI79cUQvhQhcUYSGnWp2YK7URVdaevheKBSZhp1sd1ui6KoKMo72qIopTMZEsNX/DOJlK9conM4LreBvxMqTAz5e/9YPzK0ftBHubR1XQOy78A8z0ejWyRBrgaeIlE/iedzuRzcNlT3uvLs7ubZKe+5O8iZaOmB2za1JbebtmPqhtED9OtV3htOjLzCbyyiwy+2J9ewLIopqguO048nN/tuLJ2aCPVtFu90T9ah0v2TgTNToZcCyQtPyPO3A1/PJH59hr3JYiqAO06Enh7yHrr05MNBz6FV4sqjzJmLL9/77dUHw+EDtJhyaQtCI4hEloPJBX9sBUlvRNNZNAeFGbqdwvwI/ihdXEe51V0pw8sYwYVwNkRLcVkVXHCr1aryfEusCzVeqHFyU2RYVpZlaEwmH05Sy1QtQtbDbDPnLVyeDp6d9p2by37FNom34Gq13mjw1RrHV6EQluOlpgR/hYrjw8GDgy8PXw28781f3a4trKLDq+mRcPlWXS53XdVoJBPx3DZ6/8WgP/y8tFPEcEyUJKg5Rj8AzU9unIb3VnH2bau0f8kznkjGM3lfNO6LJOJpHNQKRLrgh3/GuuB4f6QLVjW5rbTAIAa0qtWSOabs2Xwz8IQYmI19ez91eZFAsbwgCqDtLXrmpv/oqKdvNHg0XJgCsPZPkYAGazy7EoxdfoH/OBv65XHy+kIaw12woTux8r2JyLGRpZOTW8eC1BRcoWFAn1WQt2sM2EOSJJAUgWO5HEaS+WKhQNM0WFLXnHj54SRy4ubyqcnNE0C7UZe3sxkURSEBLN0DWgdj0OUygvNz/sz8RnabFttyE7Stq06wMHrFc+DS08ODa71e4lrHcmA8AcQ9GWiD3BiGyefJOMGtbOW8yR2sLDQlUdN0IGZYumlrHM/UBWie6yqguecqV9vwgRDgTcP4e7LtOdZ2ZxXMUUhQNA22AjNAlmlalunOMrtnb6xB6yCoACeYJaoKdGCYGG6iCee4kO7Su8G9+P70VNotBNlkWG53t5JDUYraKRVLJEmWuosgiEKxSFFUpVKGIAUPRYGLHcf5C3RTC47oUmZSAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/011c0e30e8c5868e8e9afcd77fca3909/7c1cd/architecture.png&quot;
        srcset=&quot;/static/011c0e30e8c5868e8e9afcd77fca3909/12f09/architecture.png 148w,
/static/011c0e30e8c5868e8e9afcd77fca3909/e4a3f/architecture.png 295w,
/static/011c0e30e8c5868e8e9afcd77fca3909/7c1cd/architecture.png 582w&quot;
        sizes=&quot;(max-width: 582px) 100vw, 582px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The architecture is similar to &lt;code class=&quot;language-text&quot;&gt;ContextNet&lt;/code&gt;. Looking only at the decoder, it applies attentive pooling, then two projections, and finally AAM (Additive Angular Margin).&lt;/p&gt;
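&lt;p&gt;To make the AAM step concrete, here is a minimal sketch of the usual additive angular margin formulation: the target-class logit becomes s * cos(theta + m), while the other logits stay s * cos(theta). The function names and the values of &lt;code class=&quot;language-text&quot;&gt;margin&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;scale&lt;/code&gt; are assumptions for illustration, not the paper&apos;s settings.&lt;/p&gt;

```python
import math


def aam_logits(cosines, target, margin=0.2, scale=30.0):
    """Apply an additive angular margin to the target-class cosine.

    cosines: cosine similarity between one embedding and each class weight.
    margin and scale are illustrative values, not taken from the paper.
    """
    logits = []
    for idx, cos_theta in enumerate(cosines):
        if idx == target:
            # clamp for numerical safety before acos
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.3, -0.1]
plain = cross_entropy([30.0 * c for c in cosines], target=0)
with_margin = cross_entropy(aam_logits(cosines, target=0), target=0)
print(plain, with_margin)
```

&lt;p&gt;The margin lowers the target logit, so the loss with the margin is larger for the same embedding. That pushes training toward tighter same-speaker clusters.&lt;/p&gt;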
&lt;h3 id=&quot;encoder&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#encoder&quot; aria-label=&quot;encoder permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Encoder&lt;/h3&gt;
&lt;p&gt;There is almost no difference, but the notable characteristics are as follows.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;uses 1D time-channel depth-wise separable convolution
&lt;ul&gt;
&lt;li&gt;1d depth-wise conv + point-wise conv&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;applies SE (Squeeze &amp;#x26; Excitation) before the residual connection&lt;/li&gt;
&lt;/ol&gt;
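&lt;p&gt;The appeal of splitting a convolution into a depth-wise and a point-wise part shows up in a quick weight count. The sketch below compares the two; the channel count and kernel size are made-up numbers, not TitaNet&apos;s configuration.&lt;/p&gt;

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights of a plain 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in, c_out, kernel):
    """Depth-wise (one kernel per input channel) + point-wise (1x1) weights."""
    depthwise = c_in * kernel
    pointwise = c_in * c_out
    return depthwise + pointwise


# channel count and kernel size are illustrative, not the paper's configuration
c_in, c_out, kernel = 256, 256, 7
print(standard_conv1d_params(c_in, c_out, kernel))   # 458752
print(separable_conv1d_params(c_in, c_out, kernel))  # 67328
```

&lt;p&gt;With these numbers the separable version needs roughly one seventh of the weights, and the gap grows with the kernel size.&lt;/p&gt;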
&lt;h3 id=&quot;decoder--embeddings&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#decoder--embeddings&quot; aria-label=&quot;decoder  embeddings permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Decoder &amp;#x26; Embeddings&lt;/h3&gt;
&lt;p&gt;The decoder has nothing special compared with previous work either.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;applies attentive statistics pooling&lt;/li&gt;
&lt;/ol&gt;
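&lt;p&gt;Attentive statistics pooling turns a variable number of frames into one fixed-size vector by taking an attention-weighted mean and standard deviation over time. Below is a simplified sketch with one scalar attention score per frame; in a real model the scores come from a small learned network (and may be per channel), so passing them in as plain inputs is an assumption of this example.&lt;/p&gt;

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool variable-length frame features into one fixed-size vector.

    frames: list of T feature vectors (each a list of D floats).
    scores: list of T unnormalized attention scores; in a real model these
            come from a small learned network, here they are plain inputs.
    Returns the attention-weighted mean and std concatenated (length 2 * D).
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax over time

    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    second = [sum(a * f[d] ** 2 for a, f in zip(alphas, frames)) for d in range(dim)]
    # clamp tiny negative values caused by floating-point rounding
    std = [math.sqrt(max(m2 - m ** 2, 0.0)) for m2, m in zip(second, mean)]
    return mean + std


frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0]]
pooled = attentive_stats_pooling(frames, scores=[0.0, 0.0, 0.0])
print(pooled)
```

&lt;p&gt;With equal scores this reduces to the plain mean and standard deviation over time; non-uniform scores let the model emphasize the more informative frames.&lt;/p&gt;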
&lt;h3 id=&quot;recipes&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#recipes&quot; aria-label=&quot;recipes permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Recipes&lt;/h3&gt;
&lt;p&gt;There is nothing particularly special about the recipe either.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No SAD (speech activity detection) was applied as pre-processing.&lt;/li&gt;
&lt;li&gt;Audio longer than 3 secs was split into chunks of 1.5, 2, and 3 secs.&lt;/li&gt;
&lt;li&gt;frame window : 25 ms, hop window : 10 ms, mel features : 80, num FFT : 512
&lt;ul&gt;
&lt;li&gt;normalized along the frequency axis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;augmentation applied
&lt;ul&gt;
&lt;li&gt;RIR impulse corpora&lt;/li&gt;
&lt;li&gt;speed perturbation&lt;/li&gt;
&lt;li&gt;spec augment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
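&lt;p&gt;As a rough feel for the numbers in this recipe, the sketch below counts how many 25 ms frames with a 10 ms hop fit into each chunk length, and how a long clip splits into chunks. It assumes no padding and drops the trailing remainder; both are simplifications of this example, and real feature extractors often pad and so return a slightly different frame count.&lt;/p&gt;

```python
def num_frames(duration_sec, window_ms=25, hop_ms=10):
    """Number of full analysis windows in a clip (no padding assumed)."""
    duration_ms = int(round(duration_sec * 1000))
    if window_ms > duration_ms:
        return 0
    return 1 + (duration_ms - window_ms) // hop_ms


def split_into_chunks(duration_sec, chunk_sec):
    """Return (start, end) times of consecutive full-length chunks.

    Dropping the trailing remainder is an assumption of this sketch.
    """
    count = int(duration_sec // chunk_sec)
    return [(i * chunk_sec, (i + 1) * chunk_sec) for i in range(count)]


for chunk_sec in (1.5, 2.0, 3.0):
    chunks = split_into_chunks(7.0, chunk_sec)
    print(chunk_sec, len(chunks), num_frames(chunk_sec))
```

&lt;p&gt;Each frame carries 80 mel features, so a 3 sec chunk becomes roughly a 298 x 80 input under these assumptions.&lt;/p&gt;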
&lt;h2 id=&quot;performance&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#performance&quot; aria-label=&quot;performance permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Performance&lt;/h2&gt;
&lt;h3 id=&quot;eer-on-voxceleb1&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#eer-on-voxceleb1&quot; aria-label=&quot;eer on voxceleb1 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;EER on VoxCeleb1&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/c1f0f3c0516171ac70ebc8d93c4ea405/940c5/voxceleb1_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 58.10810810810811%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAIAAADtbgqsAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAB+0lEQVR42iVS2XKjMBDU/3+SSbIPLntdZXYd43CIwxwSAmIOcRlMoCqNPQ+qkWZ6plszRAjh+34cxzg554yz2+2Wpsl+t9tut19felXJ+/3Ooujvfn84HPAYC9H3fV3XJOZcVVXbcQzDuGja5+kUx2IcxzAMd7td3bRw3t8/Tqf/x+MR1d8UZbPZwC/ynKAnwlEUmabpuu71CgZ+URayqkQs2q4DkUQIQ9fzvJBScsYul4tIkrVzUzeWZTmO43lXx3a867UoimWZu7bN83ye52makIecYRznZUEhRVFuefEYR4I+mnbWDdNzPdPQIZKxCLRFzC1KoRbypKwA7roe1yzLVPWYphkCpG0b0DifNetplNKmaQAoy8L1PLRt2xZsoQgOwGmaUmohB1cy/0ymuWL8IMAZBOH0sxqofmfZ9DT4juu83quypJZVSTkOA0EMXDVNw095ngvx9/sAy7I0CIKXj3Eqbwr6Q072nf1T1aqqwI7gV2RdgwPuCA/Dmv6S+jpXqf3TutXGZ7Tr+8fjQf58fBiGWUuJnpTamCSmohuGba8+TkwBggPfxyJgFtgLm1J8E0ZI8IQJW5Zp2w61acQYqgAPAJDAQC0WAdIwCAg5nz+pbQNSFCVBBmKMsTRJ+HNhsCcc9TnHF0IIXrCtWNyXhWGAlUrSdFmWX/cPgY6q1Z0sAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/c1f0f3c0516171ac70ebc8d93c4ea405/fcda8/voxceleb1_benchmark.png&quot;
        srcset=&quot;/static/c1f0f3c0516171ac70ebc8d93c4ea405/12f09/voxceleb1_benchmark.png 148w,
/static/c1f0f3c0516171ac70ebc8d93c4ea405/e4a3f/voxceleb1_benchmark.png 295w,
/static/c1f0f3c0516171ac70ebc8d93c4ea405/fcda8/voxceleb1_benchmark.png 590w,
/static/c1f0f3c0516171ac70ebc8d93c4ea405/940c5/voxceleb1_benchmark.png 772w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The performance is almost comparable to the previous work (&lt;code class=&quot;language-text&quot;&gt;ECAPA&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&quot;det-cruve&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#det-cruve&quot; aria-label=&quot;det cruve permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;DET Curve&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/d4d37e8dd848c40204c9bec602cd6896/cc8d6/det_curve_benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 74.32432432432432%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAIAAABr+ngCAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAB10lEQVR42o1TWa6jMBDM/a+WEyTSJEDYQXi38cqU8ejxnmY+phVFbburqjdux2la640QKYVSQmsJ5/xlE5xLpSRnap2GYZjnGTf7vgN1K2Dv3Uak0EmaZN1hXQwh4T6lFEOEE0IwytRNDVvX9QLHiNDAKNXz7Bi3NnKVhIpcRLPnR5A4H6T2hBCKMK1/gJVSlDEj2L5O+9jv6+jtDm0hA2HO+ZgS+D0SQfx+2h/wtm24BV8CDZ5TskKYZfaMwHcuSnXsNldQarTWXuCu65CMECL4gOJzojGbpURPo/dZkPIgFRyQJ2PMBV6WBaFQPoVTKQRZ5D4pqYcunUYZWvOXMjRBhqmcehmP+goeTI5R4HEwJnBRav4GRjQO/wCn3ISAI2eya9AwxnNCP5RLUEk7ntl+5V+I4DhKzNQrg404nAPcXqOCLGPMn5Zncv4XLvioFZS6b9XGuIRybtm1YWV6mBnH6lkLB1ygxxTg4wn32zQaxijHKi1gzGB9mndeaU0pyYOJ6I0ex7Ft26qqsI9t2yEGAdY6pAUXvEjt1jT1/X7HV/F8PhH3fv16vV5A9l2PIy6naarrBg7W4f1+4wn+4/Hoh+GGakH/+XygM0Dt07yrCiUf/2G/AePjYiUGHNp6AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/d4d37e8dd848c40204c9bec602cd6896/fcda8/det_curve_benchmark.png&quot;
        srcset=&quot;/static/d4d37e8dd848c40204c9bec602cd6896/12f09/det_curve_benchmark.png 148w,
/static/d4d37e8dd848c40204c9bec602cd6896/e4a3f/det_curve_benchmark.png 295w,
/static/d4d37e8dd848c40204c9bec602cd6896/fcda8/det_curve_benchmark.png 590w,
/static/d4d37e8dd848c40204c9bec602cd6896/cc8d6/det_curve_benchmark.png 791w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It shows almost the same performance as the previous work.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#conclusion&quot; aria-label=&quot;conclusion permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Conclusion&lt;/h2&gt;
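&lt;p&gt;There is no big performance gap over previous work, and most of the architecture design does not differ from earlier studies either, so the paper itself was not that interesting. I just moved on thinking, so this is roughly how well speaker verification models perform these days.&lt;/p&gt;
&lt;p&gt;My first project at my first company was building a speaker diarization model, and since it left me with a lot of regrets, it is still a field and a project I am attached to. So I would still like to work at a company dealing with the speech domain (synthesis is fine too, but preferably diarization) lol&lt;/p&gt;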
&lt;p&gt;Conclusion : good&lt;/p&gt;</content:encoded></item><item><title><![CDATA[(Kaggle) Default Prediction - 135th (top 3%) place solution]]></title><description><![CDATA[Original Post : https://www.kaggle.com/competitions/amex-default-prediction/discussion/347996 TL;DR I couldn't spend lots of time on the co…]]></description><link>http://kozistr.tech/2022-08-17-default-prediction/</link><guid isPermaLink="false">http://kozistr.tech/2022-08-17-default-prediction/</guid><pubDate>Wed, 17 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;Original Post : &lt;a href=&quot;https://www.kaggle.com/competitions/amex-default-prediction/discussion/347996&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;https://www.kaggle.com/competitions/amex-default-prediction/discussion/347996&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;I couldn&apos;t spend much time on the competition (I only made 30 submissions :(). Meanwhile, the competition metric is kind of noisy, and we also expected a shake-up/down (not a planet-scale one, but for some cases). So my strategy focused on protecting against a shake-down as much as I could (instead of building new features).&lt;/p&gt;
&lt;h2 id=&quot;overview&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#overview&quot; aria-label=&quot;overview permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Overview&lt;/h2&gt;
&lt;p&gt;My strategy is &lt;code class=&quot;language-text&quot;&gt;building various datasets, folds, seeds, models&lt;/code&gt;. I&apos;ll explain them one by one.&lt;/p&gt;
&lt;h3 id=&quot;data-pre-processing&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#data-pre-processing&quot; aria-label=&quot;data pre processing permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Data (Pre-Processing)&lt;/h3&gt;
&lt;p&gt;My base dataset is built on raddar&apos;s dataset (huge thanks to @raddar). Also, most of the pre-processing logic can be found in the &lt;code class=&quot;language-text&quot;&gt;Code&lt;/code&gt; section.&lt;/p&gt;
&lt;p&gt;The differences are...&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;using more lag features (up to 3 months)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;using multiple datasets instead of a single one (I just added features incrementally) for diversity.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A dataset&lt;/li&gt;
&lt;li&gt;B dataset = A dataset + (features)&lt;/li&gt;
&lt;li&gt;C dataset = B dataset + (another features)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I didn&apos;t check the exact effect of using these datasets on multiple models; however, in my experiments it seemed to have a positive effect when ensembling.&lt;/p&gt;
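&lt;p&gt;The lag features mentioned above can be sketched roughly as follows: for one customer and one feature, take the difference between the latest statement and the statement k months earlier, for k up to 3. The function name, the feature names, and the choice of a plain difference are assumptions of this example; the actual pre-processing may differ.&lt;/p&gt;

```python
def lag_features(history, max_lag=3):
    """Difference between the latest value and the value k statements earlier.

    history: chronologically ordered values of one feature for one customer.
    Missing lags (history too short) are returned as None.
    """
    latest = history[-1]
    features = {}
    for lag in range(1, max_lag + 1):
        if len(history) > lag:
            features[f"last_minus_lag{lag}"] = latest - history[-1 - lag]
        else:
            features[f"last_minus_lag{lag}"] = None
    return features


print(lag_features([10.0, 12.0, 9.0, 15.0]))
print(lag_features([4.0, 6.0]))
```

&lt;p&gt;Customers with a short history simply get missing values for the longer lags, which tree-based models can handle natively.&lt;/p&gt;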
&lt;h3 id=&quot;model&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#model&quot; aria-label=&quot;model permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Model&lt;/h3&gt;
&lt;p&gt;I built 6 models (3 GBDT, 3 NN) to secure diversity and robustness. Also, a few models (LightGBM, CatBoost) are trained on multiple seeds (1, 42, 1337) with the same training recipe. Lastly, some models are trained with 10 or 20 folds.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;CatBoost&lt;/li&gt;
&lt;li&gt;LightGBM (w/ dart, w/o dart)&lt;/li&gt;
&lt;li&gt;5-layers NN&lt;/li&gt;
&lt;li&gt;stacked bi-GRU&lt;/li&gt;
&lt;li&gt;Transformer&lt;/li&gt;
&lt;/ul&gt;
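&lt;p&gt;The seed and fold averaging described above can be sketched like this: for every seed, split the data into folds, train one model per fold, and average all test predictions. The mean predictor stands in for a real LightGBM or CatBoost model, and all function names are hypothetical; only the seeds (1, 42, 1337) and the 10 folds come from the setup above.&lt;/p&gt;

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle indices with the given seed and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_predict(train_targets, n_test):
    """Placeholder model: predicts the training mean for every test row."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test


def seed_fold_average(targets, n_test, seeds=(1, 42, 1337), n_folds=10):
    """Average test predictions over every (seed, fold) model."""
    runs = []
    for seed in seeds:
        for held_out in kfold_indices(len(targets), n_folds, seed):
            skip = set(held_out)
            train = [t for i, t in enumerate(targets) if i not in skip]
            runs.append(fit_predict(train, n_test))
    return [sum(col) / len(runs) for col in zip(*runs)]


targets = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0] * 5
blended = seed_fold_average(targets, n_test=3)
print(len(blended), blended[0])
```

&lt;p&gt;Averaging over 3 seeds x 10 folds gives 30 models here. Each one sees a slightly different training set, which is what smooths out a noisy metric.&lt;/p&gt;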
&lt;p&gt;Here&apos;s the best CV score for each model (sorry, there are no LB/PB scores because I rarely submitted a single model).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&quot;center&quot;&gt;Model&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;CV&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;bi-GRU&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.787006&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;Transformer&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.785647&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;NN&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.789874&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;XGBoost&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.795940&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;only using the given(?) cat features as &lt;code class=&quot;language-text&quot;&gt;cat_features&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;CatBoost&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.797058&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;using all &lt;code class=&quot;language-text&quot;&gt;np.int8&lt;/code&gt; features as &lt;code class=&quot;language-text&quot;&gt;cat_features&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;LightGBM&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;0.798410&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;w/ dart&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The CV scores of the single neural network models aren&apos;t good. Nevertheless, they work well with the tree-based models when ensembling.&lt;/p&gt;
&lt;h3 id=&quot;blend-ensemble&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#blend-ensemble&quot; aria-label=&quot;blend ensemble permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Blend (Ensemble)&lt;/h3&gt;
&lt;p&gt;Inspired by the discussion &lt;a href=&quot;https://www.kaggle.com/competitions/amex-default-prediction/discussion/329103&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;log-odds&lt;/a&gt;, I found that a weighted ensemble of log-odds is better than a plain weighted ensemble of probabilities (I tuned the weights with the &lt;code class=&quot;language-text&quot;&gt;Optuna&lt;/code&gt; library based on the OOF predictions). One difference is that I used &lt;code class=&quot;language-text&quot;&gt;log10&lt;/code&gt; instead of &lt;code class=&quot;language-text&quot;&gt;ln&lt;/code&gt;. In my experiments, optimizing the weights with &lt;code class=&quot;language-text&quot;&gt;log10&lt;/code&gt; worked better, although it brings only a small boost (a difference in the 4th decimal place).&lt;/p&gt;
&lt;p&gt;I ensembled about 50 models, and there&apos;s no post-processing logic.&lt;/p&gt;
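&lt;p&gt;A minimal sketch of a weighted blend in base-10 log-odds space, as described above. The function names, the clipping constant &lt;code class=&quot;language-text&quot;&gt;eps&lt;/code&gt;, and the normalization by the sum of the weights are assumptions for illustration, not my exact competition code; in practice the weights would come from an optimizer such as Optuna run on the OOF predictions.&lt;/p&gt;

```python
import math


def to_log_odds(p, eps=1e-7):
    """Convert a probability to base-10 log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log10(p / (1.0 - p))


def from_log_odds(z):
    """Invert base-10 log-odds back to a probability."""
    return 1.0 / (1.0 + 10.0 ** (-z))


def blend_log_odds(predictions, weights):
    """Weighted average of per-model probabilities in log-odds space.

    predictions: one list of probabilities per model, all of the same length.
    weights:     one weight per model (normalized here; an assumption of this sketch).
    """
    total = sum(weights)
    blended = []
    for row in zip(*predictions):
        z = sum(w * to_log_odds(p) for w, p in zip(weights, row)) / total
        blended.append(from_log_odds(z))
    return blended
```

&lt;p&gt;For example, &lt;code class=&quot;language-text&quot;&gt;blend_log_odds([nn_oof, lgbm_oof], [0.3, 0.7])&lt;/code&gt; blends two hypothetical OOF prediction lists; the weights here are placeholders, not tuned values.&lt;/p&gt;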
&lt;h2 id=&quot;summary&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#summary&quot; aria-label=&quot;summary permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Summary&lt;/h2&gt;
&lt;p&gt;The final score is&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&quot;center&quot;&gt;Model&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;CV&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Public LB&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Private LB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;Ensemble&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;0.8009&lt;/code&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;0.7992&lt;/code&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;0.8075&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On the last day of the competition, I selected a solution that ranked about 1600th on the Public LB (my best CV solution). Luckily, &lt;code class=&quot;language-text&quot;&gt;Trust CV score&lt;/code&gt; wins again :) (Actually, my best CV is also my best LB, and the LB score increased whenever the CV score increased, so there&apos;s little difference between best CV &amp;#x26; LB in my case.)&lt;/p&gt;
&lt;p&gt;After the competition, I checked the correlation among the scores (CV vs Private LB, CV vs Public LB) and found that, in my case, the CV score is more correlated with the Private LB than with the Public LB.&lt;/p&gt;
&lt;h3 id=&quot;works&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#works&quot; aria-label=&quot;works permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Works&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;blending diverse models (GBDT + NN), even when there are large CV gaps
&lt;ul&gt;
&lt;li&gt;e.g. nn 0.790, lgbm 0.798&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;(maybe) diverse datasets, models, and seeds make the prediction more robust&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;didnt-work&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#didnt-work&quot; aria-label=&quot;didnt work permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Didn&apos;t work&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;pseudo labeling (w/ hard label)
&lt;ul&gt;
&lt;li&gt;maybe &lt;code class=&quot;language-text&quot;&gt;soft-label&lt;/code&gt; or &lt;code class=&quot;language-text&quot;&gt;hard label&lt;/code&gt; with a stricter threshold could have worked, I guess.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;deeper NN models
&lt;ul&gt;
&lt;li&gt;a 5-layer NN is enough&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;increasing the number of folds (5 folds are enough)
&lt;ul&gt;
&lt;li&gt;there&apos;s no significant difference between 5 folds and 20 folds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;rank weighted ensemble&lt;/li&gt;
&lt;/ul&gt;
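&lt;p&gt;For reference, the pseudo-labeling variants mentioned above (hard labels with a stricter threshold, or soft labels) could be sketched as follows. The function name, the default threshold of 0.95, and the binary-probability input are assumptions for illustration; this is not the code I actually ran.&lt;/p&gt;

```python
def select_pseudo_labels(probabilities, threshold=0.95, soft=False):
    """Pick confidently predicted test rows and assign them pseudo labels.

    probabilities: predicted probability of the positive class per test row.
    threshold:     minimum confidence, max(p, 1 - p), needed to keep a row.
    soft:          keep the raw probability as the label instead of 0.0 / 1.0.
    Returns a list of (row_index, label) pairs.
    """
    selected = []
    for index, p in enumerate(probabilities):
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            hard_label = 1.0 if p >= 0.5 else 0.0
            selected.append((index, p if soft else hard_label))
    return selected
```

&lt;p&gt;The selected rows would then be appended to the training set with their pseudo labels before retraining; raising &lt;code class=&quot;language-text&quot;&gt;threshold&lt;/code&gt; trades the number of extra rows for label quality.&lt;/p&gt;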
&lt;p&gt;I hope this helps :) Thank you!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[NaturalSpeech - End-to-End Text to Speech Synthesis with Human-Level Quality]]></title><description><![CDATA[TL;DR While reading speech-synthesis papers for the first time in a while, I came across research that reaches human-level MOS and CMOS metrics (on the LJSpeech dataset), and on top of that, the recently popular diffusion appr…]]></description><link>http://kozistr.tech/2022-08-15-natural-speech/</link><guid isPermaLink="false">http://kozistr.tech/2022-08-15-natural-speech/</guid><pubDate>Mon, 15 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;tldr&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#tldr&quot; aria-label=&quot;tldr permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;TL;DR&lt;/h2&gt;
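&lt;p&gt;While reading speech-synthesis papers for the first time in a while, I came across research that reaches human-level MOS and CMOS metrics (on the LJSpeech dataset). It was also quite interesting that it does not rely on the recently popular diffusion approach.&lt;/p&gt;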
&lt;ul&gt;
&lt;li&gt;paper : &lt;a href=&quot;https://arxiv.org/pdf/2205.04421v2.pdf&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;code : &lt;a href=&quot;https://github.com/microsoft/NeuralSpeech&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It looks like the code hasn&apos;t been released yet.&lt;/p&gt;
&lt;h2 id=&quot;related-work&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#related-work&quot; aria-label=&quot;related work permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Related Work&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1711.00937v2&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;VQ-VAE paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2103.14574&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Parallel Tacotron 2 paper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;architecture&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#architecture&quot; aria-label=&quot;architecture permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/f8bd83072b80faad6549f96976411644/321ea/architecture.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 64.86486486486486%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAANCAIAAAAmMtkJAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACNklEQVR42o1SW1PaUBDO/39pH+qDMFJUsFZatV4GrJdWOlQjBUJIiMGcXMgViLmHQFJJF9IynelL9yHznT3fl/1292BJkjiOY5rmeDI2DEPXdfguFov0n4jj+SPDkGQXIeS6DmSwMAzhIMuypmkAEIfYASsP5Xk0f3l5+bmKxQo8cTxB9Kq1zzje7FGM+WxhwIjjGO4AgIsVGajJIv1dPHPh+8H55V1+5/Bt+XizeFDYO22RLLZmQIB+GobzeP48sQYM12kRFEn3CIpj+aEoy4quagbebCFeVDTDsh1s/W9VVTmEaLpP0VTjGv9y2Nh6vbufO9rZ2L/9hNcvGggNJFHkeX40GmUFsVk09V07DFzPsVmGDjw3Cn2RQ228TbRJotVt3v8g2z3w4oehH4SDJw66DcLpNIow1O8e5F6dlnKnpc1KYaO2v3VWzn89LumyuG4n6zlXfJ8rVop7J9vvTt7k9mpXdcycGHSv9djvsgx5c1VlaIIF3O86tg2abJCgB6CohqqPOl1KkOShoi+nPRD6598+7h4Wtiv5D9Vy6ahQrOTrxKUoC+tx/Ikl9jwXlpKdMVmR6nc3D527JnF/Xb9okw94q3H7/WZijv8WZ7vINp8ByGDp6hoeGY94mqINYxTHSfp/gWUjSZJYVRRRFGENnuc5ruN6XhRFfhD4vg/AtiwwbNmWbdtAmM1mS/HaGxQXBHEyMV3XlSTYqDQej1VVUVUNkrqmSUOJ5wV4wYIgeL6fpukvAZ66bGHc3ZwAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/f8bd83072b80faad6549f96976411644/fcda8/architecture.png&quot;
        srcset=&quot;/static/f8bd83072b80faad6549f96976411644/12f09/architecture.png 148w,
/static/f8bd83072b80faad6549f96976411644/e4a3f/architecture.png 295w,
/static/f8bd83072b80faad6549f96976411644/fcda8/architecture.png 590w,
/static/f8bd83072b80faad6549f96976411644/321ea/architecture.png 786w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This work makes contributions in four areas.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;large-scale language model pre-training on phoneme sequences&lt;/li&gt;
&lt;li&gt;differentiable durator&lt;/li&gt;
&lt;li&gt;bi-directional prior/posterior module&lt;/li&gt;
&lt;li&gt;memory-based VAE (memory bank)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;phoneme-encoder&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#phoneme-encoder&quot; aria-label=&quot;phoneme encoder permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Phoneme Encoder&lt;/h3&gt;
&lt;p&gt;As the name suggests, the phoneme encoder is a module that encodes the phoneme sequence &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;y&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.625em;vertical-align:-0.1944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. According to the paper, previous work used LMs trained either on general datasets or only on phonemes, which either did not fit the phoneme domain or, because of capacity issues, failed to give a positive boost.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/dd14fd65e64060eba639e1da1ef7c25a/6af66/phoneme_pretraining.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 53.37837837837838%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAIAAADwazoUAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAB2UlEQVR42oVSWW/aQBD2/3/KS/rQNCEXDthcNncIhTSFAIFwxMZ4vWbX6wMIbkQAG16g3YKUplWrfhqNZj7NJc3HfP83ttst9eOxw4lsoZRnY+eIoDeegvmj+j02mw0lTYscHh9wCfbw5EBWJMpQfl/A/Hfzy+yl1i1H0mwHNgwT/30znQcxUHEfYAXggQKlPpS0XXDfqkRFrvJwp+iSZgwoORjKs9fZr+b1al3tlHuk3kH3T6Rx+5i7iH8KCidXyUCyzLNCIFtJyFazi2o9o94EdzqCzO6G7f4YTIZDrOnUjJ0hoEJFG6oKkK+LOUWTkaHTFBEd6APPWzLr1arb6VK02+1YMswLXCh2mcoLLH+WKaTC8cvS15tEli/eXov5iJCOR5Ih6k+vjqbulHHdqWla48nENE1eDMUy4XPuOJIK8algLMl/PPuQSEfj2TAnBMViVMjELyIBXmRPw0e2bTGyLKkAqECVJOmhVzNcUKxkG1KVfNPaTw0xFy1XPkvw8UujoNn9Zqfe6lczJeEJtp6nk99e5fsr99m1LXugqAYmY4rRyJ26tu0QQjA2fN9fLpaL+WL+Ov/5570c3iTh+R5t0XXdsiyMkYGR7Ywcx0FoCCH0PO+9Cn4AXpM7HeQy+gMAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/dd14fd65e64060eba639e1da1ef7c25a/fcda8/phoneme_pretraining.png&quot;
        srcset=&quot;/static/dd14fd65e64060eba639e1da1ef7c25a/12f09/phoneme_pretraining.png 148w,
/static/dd14fd65e64060eba639e1da1ef7c25a/e4a3f/phoneme_pretraining.png 295w,
/static/dd14fd65e64060eba639e1da1ef7c25a/fcda8/phoneme_pretraining.png 590w,
/static/dd14fd65e64060eba639e1da1ef7c25a/6af66/phoneme_pretraining.png 640w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;So, rather than training only on phonemes, this work uses mixed-phoneme (phoneme + sup-phoneme) pre-training.&lt;/p&gt;
&lt;p&gt;In addition, the MLM objective is applied to both the phoneme tokens and the sup-phoneme tokens during pre-training.&lt;/p&gt;
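&lt;p&gt;A rough sketch of what masking both token types could look like, where each merged-phoneme token (called sup-phoneme in the paper) covers a span of phonemes. Masking a merged token together with the phonemes it covers, the &lt;code class=&quot;language-text&quot;&gt;[MASK]&lt;/code&gt; token, the 0.15 masking rate, and all names below are assumptions of this sketch; the official code is not available to confirm the details.&lt;/p&gt;

```python
import random

MASK = "[MASK]"  # hypothetical mask token for this sketch


def mask_mixed_phonemes(phonemes, sup_spans, mask_prob=0.15, seed=0):
    """Mask whole merged-phoneme spans together with the phonemes they cover.

    phonemes:  list of phoneme tokens, e.g. ["HH", "AH", "L", "OW"].
    sup_spans: list of (sup_token, start, end), where phonemes[start:end]
               are the phonemes merged into that token.
    Returns (masked_phonemes, masked_sups, targets); targets holds the
    original tokens the model would have to predict at masked positions.
    """
    rng = random.Random(seed)
    masked_phonemes = list(phonemes)
    masked_sups = []
    targets = []
    for sup_token, start, end in sup_spans:
        if mask_prob > rng.random():
            # Hide both views of the span so neither one leaks the other.
            targets.append((sup_token, list(phonemes[start:end])))
            masked_sups.append(MASK)
            for i in range(start, end):
                masked_phonemes[i] = MASK
        else:
            masked_sups.append(sup_token)
    return masked_phonemes, masked_sups, targets
```

&lt;p&gt;The MLM loss would then be computed on both the masked phoneme positions and the masked merged-token positions.&lt;/p&gt;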
&lt;h3 id=&quot;differentiable-durator&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#differentiable-durator&quot; aria-label=&quot;differentiable durator permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Differentiable Durator&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 327px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/6cd4a82fd97dc8e72eb18e678a3c8182/00e65/differentiable_durator.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 74.32432432432432%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAIAAABr+ngCAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACrUlEQVR42m1S60/acBTlb923JduXxZnMbBqH2eaQzBFBKYo8BIQWkFdpseWh4AN5o6DgI05h41WKa6EtD/dTnGjizc3pvTf3nNs2R3T7UnCdTi6Xu7y8PCkUQZTK5RfXRI/VYDB4xN/lstPhyGYyxeJZMOA/KRSecoY7z8gcx/O8AIDjOLbd6XTAk1erV5s03e12QTucgJ0ReVi1aDqMu7SKeQO0AGuUsBYCiWihD+/egHZdrQCI6FSmZQX0UxKP7g3vP5DzmUQMh0upreMIipuUJKLBTJDXoETXV7xGVQyDt50Gt17xK7lVyW4bl+W9/mB0+eQwdRYlQg4jCWtV0hndgsS4JNXKZmVfJn+IJ/RyiVI6A8193kR0BKzxIgZe6I7IuXS8uOvbRS0HuDVJOhLExmHI49Auvn/9Sijn+eujbjkPkrlIMxep40SEYdtPLmdT1mUZrJK515QeA+TSL7nXII1sFpoTRzEEKEbc5j0UBqI7LlPABXO8MCIXjtIGuSQdcJmX5qTijytSsWR6Qj3/FbesKr5PS6bGpVNjs5/GrauL2YDTrFEK3SevfZSKkbA6isFJ0r7rMUe9ln0M3nGvy79N7aOWPdQc8yH7Xkt8057wwSEP8uyba9WKUbUA/semVU/Y9IRVT9oNOKybHHt7HguWsjtX6fBVOnKdiZxG/blUbOiVkUmAL6rVWqPRqNZqlWoVZLVWvzg/bzZpimrW63WKouqNBk23+v3+g0l6vR5wEzBOpfKnSVGlUvnm5qZ3HyzLCoJAN6k7wmDQarVA2xWEodmAhIhhmGAgEE8kE7EDEkc9OOHcsIVCIQ/mwzAsHA6jbofdZssXTlHUQxCkn8C9XtQfCP5lOyIgwDLM/XHhzrhAuN0GJm7/RzDkeA7MQQvWwAQAKHq9/j/s/e0uq4RMDQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/6cd4a82fd97dc8e72eb18e678a3c8182/00e65/differentiable_durator.png&quot;
        srcset=&quot;/static/6cd4a82fd97dc8e72eb18e678a3c8182/12f09/differentiable_durator.png 148w,
/static/6cd4a82fd97dc8e72eb18e678a3c8182/e4a3f/differentiable_durator.png 295w,
/static/6cd4a82fd97dc8e72eb18e678a3c8182/00e65/differentiable_durator.png 327w&quot;
        sizes=&quot;(max-width: 327px) 100vw, 327px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;As shown in the architecture, the &lt;em&gt;phoneme-level phoneme representation&lt;/em&gt; produced by the phoneme encoder above is fed as input to the durator (&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\theta_{dur}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8444em;vertical-align:-0.15em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;), which outputs the prior distribution &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;msup&gt;&lt;mrow&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot; lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/msup&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z^{&apos;}|y)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.1925em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9425em;&quot;&gt;&lt;span style=&quot;top:-2.9425em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5795em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span 
class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8278em;&quot;&gt;&lt;span style=&quot;top:-2.931em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;It can be written as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;msup&gt;&lt;mrow&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot; lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/msup&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z^{&apos;}|y;\theta_{pri})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2286em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9425em;&quot;&gt;&lt;span style=&quot;top:-2.9425em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5795em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span 
class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8278em;&quot;&gt;&lt;span style=&quot;top:-2.931em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; where &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;[&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;u&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;]&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\theta_{pri} = [\theta_{pho},\theta_{dur}]&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.9805em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span 
class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;o&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span 
class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Concretely, the durator plays three roles.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Predicts the duration of each phoneme&lt;/li&gt;
&lt;li&gt;An up-sampling module that upsamples the &lt;code class=&quot;language-text&quot;&gt;phoneme-level&lt;/code&gt; sequence to the &lt;code class=&quot;language-text&quot;&gt;frame-level&lt;/code&gt; sequence&lt;/li&gt;
&lt;li&gt;A module that computes the mean and variance of the prior distribution (following the VAE scheme, the prior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.625em;vertical-align:-0.1944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is a standard isotropic multivariate Gaussian)
&lt;ul&gt;
&lt;li&gt;The goal is to minimize the mismatch in predicted duration between training and inference time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
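&lt;p&gt;The second role, turning phoneme-level features into frame-level ones, can be illustrated with a minimal sketch. It assumes the simplest possible scheme, hard repetition by integer durations, which is a stand-in for illustration and not the up-sampling module from the paper. The function name &lt;code class=&quot;language-text&quot;&gt;upsample_by_duration&lt;/code&gt; and the toy values are hypothetical.&lt;/p&gt;

```python
import numpy as np


def upsample_by_duration(phoneme_hidden, durations):
    """Repeat each phoneme vector durations[i] times along the time axis.

    Assumption: durations are non-negative integers (frames per phoneme).
    This hard repetition is only an illustration of phoneme-level to
    frame-level expansion, not the module described in the paper.
    """
    phoneme_hidden = np.asarray(phoneme_hidden, dtype=float)
    durations = np.asarray(durations, dtype=np.int64)
    if phoneme_hidden.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    return np.repeat(phoneme_hidden, durations, axis=0)


# Three phonemes with 2-dimensional hidden states (toy values).
phoneme_hidden = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durations = [2, 0, 3]  # the second phoneme is skipped entirely

frames = upsample_by_duration(phoneme_hidden, durations)
print(frames.shape)  # number of frames equals sum(durations)
```

&lt;p&gt;The number of output frames equals the sum of the durations, which is why a duration error at inference time directly changes the length of the generated speech.&lt;/p&gt;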
&lt;h3 id=&quot;bi-directional-priorposterior-module&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#bi-directional-priorposterior-module&quot; aria-label=&quot;bi directional priorposterior module permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Bi-Directional Prior/Posterior Module&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8b7334dade24a4266237805bad0b501f/bca35/bidirectional_prior_posterior.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 25%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAIAAADKYVtkAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA70lEQVR42lWO0W6CQBBF9/9/o2mb9FWtiSFgNAGlNIhFBnGFXfpQbUgT1F0WAksn+NKeTDL34U7mkH6gbdtLWWYZA4jCEHx3EwXxbgM7HzCEfsRS1jRNXdfYlFKWl1JKQTApVXPOPe8N4jg57BfmcvQwHT8Zsxdr/DibPBvbeWKb6/wzZ4wFQeDYNqXHr9OJ5JynaZplWVEUPwNxFH+swRhZ1nSJY74utivwnHdn5biui02lFMpqrcl99X9AMSFEhT6oOUylKhS8CYHaWO66Tg+QXv8Dj2/XK+oBAD1QDJRSNEPBJEkwHCk9n7/vb34B7E4MM9K1jRsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/8b7334dade24a4266237805bad0b501f/fcda8/bidirectional_prior_posterior.png&quot;
        srcset=&quot;/static/8b7334dade24a4266237805bad0b501f/12f09/bidirectional_prior_posterior.png 148w,
/static/8b7334dade24a4266237805bad0b501f/e4a3f/bidirectional_prior_posterior.png 295w,
/static/8b7334dade24a4266237805bad0b501f/fcda8/bidirectional_prior_posterior.png 590w,
/static/8b7334dade24a4266237805bad0b501f/bca35/bidirectional_prior_posterior.png 683w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;According to the paper, the bidirectional prior/posterior module is designed to reduce the information gap between the distribution obtained from the phoneme sequence &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;y&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.625em;vertical-align:-0.1944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, namely the prior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;msup&gt;&lt;mrow&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot; lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/msup&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z^{&apos;}|y;\theta_{pri})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2286em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span 
class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9425em;&quot;&gt;&lt;span style=&quot;top:-2.9425em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5795em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8278em;&quot;&gt;&lt;span style=&quot;top:-2.931em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span 
class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, and the distribution obtained from the speech &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;x&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, namely the posterior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo 
stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z|x;\phi)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;As shown in the figure above, the module is trained so that a KL divergence loss is optimized in each direction (forward and backward).&lt;/p&gt;
&lt;p&gt;The module adopts a flow model because, according to the paper, it must be invertible and easy to optimize.&lt;/p&gt;
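&lt;p&gt;To make the invertibility requirement concrete, here is a minimal sketch of a single affine coupling layer, a common building block of flow models. It is an assumption that a layer of this kind is representative; the post does not say which flow architecture the paper uses. The toy conditioner with scalar parameters &lt;code class=&quot;language-text&quot;&gt;w&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;b&lt;/code&gt; is hypothetical and stands in for a neural network.&lt;/p&gt;

```python
import numpy as np


def coupling_forward(z, w, b):
    """Forward mapping f: keep the first half, transform the second half.

    The scale and shift depend only on the untouched half, which is what
    makes the layer exactly invertible. w and b are hypothetical scalars
    standing in for a learned conditioner network.
    """
    za, zb = np.split(z, 2)
    log_scale = np.tanh(w * za)
    shift = b * za
    out_b = zb * np.exp(log_scale) + shift
    log_det = float(np.sum(log_scale))  # log |det Jacobian| of f
    return np.concatenate([za, out_b]), log_det


def coupling_inverse(y, w, b):
    """Backward mapping f^{-1}: recompute scale/shift from the kept half."""
    ya, yb = np.split(y, 2)
    log_scale = np.tanh(w * ya)
    shift = b * ya
    zb = (yb - shift) * np.exp(-log_scale)
    return np.concatenate([ya, zb])


z = np.array([0.3, -1.2, 0.8, 2.0])
y, log_det = coupling_forward(z, w=0.7, b=-0.4)
z_back = coupling_inverse(y, w=0.7, b=-0.4)
print(np.allclose(z, z_back))
```

&lt;p&gt;Because the inverse is available in closed form and the log-determinant is a simple sum, both directions of the mapping can be evaluated and trained cheaply.&lt;/p&gt;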
&lt;p&gt;The KL divergence is minimized in both directions: the posterior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z|x;\phi)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is reduced through the backward mapping &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation 
encoding=&quot;application/x-tex&quot;&gt;f^{-1}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0085em;vertical-align:-0.1944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10764em;&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8141em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, while the prior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;msup&gt;&lt;mrow&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot; lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/msup&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo 
stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z^{&apos;}|y;\theta_{pri})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2286em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9425em;&quot;&gt;&lt;span style=&quot;top:-2.9425em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5795em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8278em;&quot;&gt;&lt;span style=&quot;top:-2.931em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; 
style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is enhanced through the forward mapping &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;f&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:0.8889em;vertical-align:-0.1944em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10764em;&quot;&gt;f&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. The detailed derivation is given in the paper.&lt;/p&gt;
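&lt;p&gt;KL divergence is asymmetric, which is one way to see why the two directions give two different loss terms. The sketch below uses the closed-form KL between two diagonal Gaussians; treating both distributions as diagonal Gaussians is an assumption made for illustration, and the helper &lt;code class=&quot;language-text&quot;&gt;kl_diag_gaussians&lt;/code&gt; is hypothetical rather than the loss from the paper.&lt;/p&gt;

```python
import numpy as np


def kl_diag_gaussians(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(P || Q) for diagonal Gaussians, summed over dimensions.

    Per dimension:
        log(std_q / std_p) + (std_p^2 + (mean_p - mean_q)^2) / (2 std_q^2) - 1/2
    """
    mean_p, std_p, mean_q, std_q = (
        np.asarray(a, dtype=float) for a in (mean_p, std_p, mean_q, std_q)
    )
    per_dim = (
        np.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )
    return float(np.sum(per_dim))


# Toy one-dimensional example: same mean, different spread.
kl_narrow_to_wide = kl_diag_gaussians([0.0], [1.0], [0.0], [2.0])
kl_wide_to_narrow = kl_diag_gaussians([0.0], [2.0], [0.0], [1.0])
print(kl_narrow_to_wide, kl_wide_to_narrow)  # the two directions differ
```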
&lt;h3 id=&quot;memory-based-vae&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#memory-based-vae&quot; aria-label=&quot;memory based vae permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;memory-based VAE&lt;/h3&gt;
&lt;p&gt;The posterior &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;p(z|x;\phi)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is more complex than the prior because, in the original VAE, it is used to reconstruct the speech waveform; a memory-based VAE is proposed to simplify it.&lt;/p&gt;
&lt;p&gt;Instead of feeding &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mo&gt;∼&lt;/mo&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;z \sim p(z|x;\phi)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;∼&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span 
class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; directly into speech reconstruction, the idea is to use &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;z&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; as the attention query and to use the attention output for waveform reconstruction. In other words, the posterior sample &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;z&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.4306em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is used only to compute the attention weights over the memory bank, as in the figure below.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 416px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/2201ce3a3347d769a3c83d3c1e3ddca1/b0122/memory_bank.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 83.1081081081081%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAARCAIAAABSJhvpAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACqklEQVR42m2SaXOaUBSG84f7D9rv/dB+iJNO0yw1S20w0UmiUeMKuLMoILhcRASD0Wkd1wTcwF512lGbd5h7hnvPyzk85+4t/sqyTIqk4jgWiUTC4TCGoiRFJ3AMx7FA4DGVSi/+0946GIbx8vIyGo0Gg0Gn09F1I5PJ5POMruvdXm86nY7HY3g6mUy2zJa1DHgqj6aF787o8XXiIUL4o9nDnzGb3ReKs2GcMudTvix/RZJoml8szJ3KFsmK0rOBPBYcQUAKz5TQ8CS1q2CpKPdITu78auf4GhLTMLI6n403Kq9CkihAg8NLXj5waKaEZwVXtHLpoTOMnKLK+uuIZsGRm4mkBMucbra9tPd6PYhk33Zw5by5cbls+7YT+8UVcn1qPz/6dsiwrM8fYLjS7053hdbaAgZlmiYEM5vBZwaPlXpd05qmZUFIEJthjP95dmlDmMNBH0Oj4WAgHA4tBxZ6RJzO+7tbDMMqAPCFwsR4nUzGO7SX30sTLMGAU4fv89Ht5fWD49p77sYO7J6TC5f7PtjrdoVyDfHTMGeL9trsDcTOEN/7j6efTvzIbcTtRb/8iNrOAp5g+s6Hwqb8Ifzdh+MAxljmbLft4XBIURQeT0SisWgsxnGFHMPSNFy4RCIBRAB3sgQ1HI3eBgbV1DR4veSalCWIOI4nk0maJhVFqVbFuqJs2rbM612WycHiLJuHY2NZplgq5XN0PJ4gKUqW6+v7vw1s5TRX5rGuY2gslc7wgtBqt4bDAUFSgiBUKmA+n1ubMpfL3uItbaZttrY7ZzjhSqUC/4ogSI7jRFGEr61Wq6GqAIBisSQIPM8LklRTVoJEyuVyNktAOkuzoqgwW1VVSZJkWZZqNU3T2u22CECj8QRXAERFVZtNCPSp2+22Wu2qKPb7/T+N5X/pl79oPAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/2201ce3a3347d769a3c83d3c1e3ddca1/b0122/memory_bank.png&quot;
        srcset=&quot;/static/2201ce3a3347d769a3c83d3c1e3ddca1/12f09/memory_bank.png 148w,
/static/2201ce3a3347d769a3c83d3c1e3ddca1/e4a3f/memory_bank.png 295w,
/static/2201ce3a3347d769a3c83d3c1e3ddca1/b0122/memory_bank.png 416w&quot;
        sizes=&quot;(max-width: 416px) 100vw, 416px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
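&lt;p&gt;The memory-bank lookup can be sketched as a single-head scaled dot-product attention in which the query is the posterior sample and the memory bank serves as both keys and values. This is a simplified illustration under stated assumptions: the projection matrices and multiple heads a real model would likely have are omitted, the memory bank is random here rather than learned, and the names &lt;code class=&quot;language-text&quot;&gt;memory_attention&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;z&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;memory&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def memory_attention(z, memory):
    """Attention(z, M, M): z is the query, the memory bank M is key and value.

    z:      (frames, dim) posterior samples
    memory: (slots, dim)  memory bank (learned in a real model, random here)
    Returns the attention output (frames, dim) and the weights (frames, slots).
    """
    scale = np.sqrt(memory.shape[-1])
    weights = softmax(z @ memory.T / scale, axis=-1)
    return weights @ memory, weights


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))        # 4 frames, 8-dimensional latent
memory = rng.normal(size=(16, 8))  # 16 memory slots

out, weights = memory_attention(z, memory)
print(out.shape, weights.shape)
```

&lt;p&gt;Every output frame is a convex combination of memory slots, so the decoder only ever sees vectors that lie in the span of the memory bank; the posterior sample influences the result solely through the attention weights.&lt;/p&gt;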
&lt;p&gt;The reconstruction loss can be written as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;msub&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;E&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mo&gt;∼&lt;/mo&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;[&lt;/mo&gt;&lt;mi&gt;l&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;g&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo 
separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;]&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;L_{rec}(\phi, \theta_{dec}) = -\mathbb{E}_{z \sim q(z|x;\phi)} [log p(x|Attention(z, M, M);\theta_{dec})]&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.1514em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;rec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; 
style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.1052em;vertical-align:-0.3552em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;E&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3448em;&quot;&gt;&lt;span style=&quot;top:-2.5198em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord 
mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mrel mtight&quot;&gt;∼&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.03588em;&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;mopen mtight&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mpunct mtight&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mclose mtight&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3552em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.01968em;&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;tt&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;mord 
mathnormal&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;Q&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;K&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;[&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;mi&gt;Q&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;Q&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;K&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;K&lt;/mi&gt;&lt;/msub&gt;&lt;msup&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mi&gt;T&lt;/mi&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;msqrt&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;/msqrt&gt;&lt;/mfrac&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;]&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;O&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;Attention(Q, K, V) = [softmax(\frac{QW_{Q}(KW_{K})^{T}}{\sqrt{h}})VW_{V}]W_{O}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;tt&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.07153em;&quot;&gt;K&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.6889em;vertical-align:-0.538em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;so&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10764em;&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ma&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span 
class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mopen nulldelimiter&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mfrac&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:1.1509em;&quot;&gt;&lt;span style=&quot;top:-2.5335em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord sqrt mtight&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9378em;&quot;&gt;&lt;span class=&quot;svg-align&quot; style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mtight&quot; style=&quot;padding-left:0.833em;&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-2.8978em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;hide-tail mtight&quot; style=&quot;min-width:0.853em;height:1.08em;&quot;&gt;&lt;svg xmlns=&quot;http://www.w3.org/2000/svg&quot; width=&quot;400em&quot; height=&quot;1.08em&quot; viewBox=&quot;0 0 400000 1080&quot; preserveAspectRatio=&quot;xMinYMin slice&quot;&gt;&lt;path d=&quot;M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.1022em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.23em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;frac-line&quot; style=&quot;border-bottom-width:0.04em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.5075em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3448em;&quot;&gt;&lt;span style=&quot;top:-2.3567em;margin-left:-0.1389em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2822em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen mtight&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.07153em;&quot;&gt;K&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3448em;&quot;&gt;&lt;span style=&quot;top:-2.3567em;margin-left:-0.1389em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.07153em;&quot;&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.1433em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose mtight&quot;&gt;&lt;span class=&quot;mclose mtight&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.9191em;&quot;&gt;&lt;span style=&quot;top:-2.931em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.13889em;&quot;&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.538em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose nulldelimiter&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; 
style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3283em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3283em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; 
style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;O&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
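&lt;p&gt;The attention over the memory bank can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes the standard scaled dot-product form, in which the softmax over the scaled query-key scores weights the projected values before the output projection. The function name memory_attention, the helper names, the frame count T, and the toy matrices in the usage lines are all hypothetical.&lt;/p&gt;

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def memory_attention(z, memory, w_q, w_k, w_v, w_o):
    """Attention(z, M, M): z is (T, h), memory is (L, h), each weight is (h, h)."""
    h = len(w_q)
    q = matmul(z, w_q)            # queries come from the latent z, shape (T, h)
    k = matmul(memory, w_k)       # keys come from the memory bank, shape (L, h)
    v = matmul(memory, w_v)       # values come from the memory bank, shape (L, h)
    scores = matmul(q, transpose(k))                          # shape (T, L)
    scaled = [[s / math.sqrt(h) for s in row] for row in scores]
    weights = softmax_rows(scaled)                            # each row sums to 1
    return matmul(matmul(weights, v), w_o), weights


# Hypothetical toy values: T = 3 latent frames, L = 4 memory slots, h = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
out, weights = memory_attention(z, memory, identity, identity, identity, identity)
```

&lt;p&gt;Each row of weights is a distribution over the L memory slots, so what the decoder receives for a frame is a mixture of projected memory entries rather than z itself.&lt;/p&gt;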
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\theta_{dec}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8444em;vertical-align:-0.15em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; = waveform decoder&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; (&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo&gt;∈&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M \in \mathbb{R}^{L \times h}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.7224em;vertical-align:-0.0391em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;∈&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; 
style=&quot;height:0.8491em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8491em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;) = memory bank&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;Q&lt;/mi&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;K&lt;/mi&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W_{Q}, W_{K}, W_{V}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.9694em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3283em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3283em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.07153em;&quot;&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3283em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; (&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;∗&lt;/mo&gt;&lt;/msub&gt;&lt;mo&gt;∈&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W_{*} \in \mathbb{R}^{h \times h}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8333em;vertical-align:-0.15em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.1757em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;∗&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;∈&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8491em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8491em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;) = attention parameters&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;L&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; = size of the &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, &lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;h&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;h&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6944em;&quot;&gt;&lt;/span&gt;&lt;span 
class=&quot;mord mathnormal&quot;&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; = hidden dimensions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;training-recipe&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#training-recipe&quot; aria-label=&quot;training recipe permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Training Recipe&lt;/h3&gt;
&lt;p&gt;The overall loss is as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;msub&gt;&lt;mi mathvariant=&quot;double-struck&quot;&gt;E&lt;/mi&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;msup&gt;&lt;mrow&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot; lspace=&quot;0em&quot; rspace=&quot;0em&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/msup&gt;&lt;mi mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;[&lt;/mo&gt;&lt;mi&gt;l&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;g&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mi 
mathvariant=&quot;normal&quot;&gt;∣&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;z&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo separator=&quot;true&quot;&gt;;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;]&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;L_{e2e} (\theta_{pri}, \theta_{bpp}, \theta_{dec}) = -\mathbb{E}_{z^{&apos;}|y;\theta_{pri})} [log p(x|Attention(z, M, M);\theta_{dec})]&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3011em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;mord mathnormal 
mtight&quot;&gt;e&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;pp&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.2275em;vertical-align:-0.4775em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathbb&quot;&gt;E&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3448em;&quot;&gt;&lt;span style=&quot;top:-2.4198em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8928em;&quot;&gt;&lt;span style=&quot;top:-2.8928em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.6068em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.8496em;&quot;&gt;&lt;span style=&quot;top:-2.8496em;margin-right:0.1em;&quot;&gt;&lt;span class=&quot;pstrut&quot; 
style=&quot;height:2.5556em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mpunct mtight&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3281em;&quot;&gt;&lt;span style=&quot;top:-2.357em;margin-left:-0.0278em;margin-right:0.0714em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.5em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size3 size1 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2819em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose mtight&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; 
style=&quot;height:0.4775em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.01968em;&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.03588em;&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;∣&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;tt&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.04398em;&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span 
class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math math-inline&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;w&lt;/mi&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mi&gt;w&lt;/mi&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mi&gt;ϕ&lt;/mi&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo 
stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;θ&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;L = L_{bwd}(\phi, \theta_{pri}, \theta_{bpp}) + L_{fwd}(\phi, \theta_{pri}, \theta_{bpp}) + L_{rec}(\phi, \theta_{dec}) + L_{e2e} (\theta_{pri}, \theta_{bpp}, \theta_{dec})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.6833em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span 
class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02691em;&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal 
mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;pp&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span 
class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.10764em;&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02691em;&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord 
mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;pp&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mspace&quot; 
style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.1514em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;rec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;ϕ&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord 
mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0361em;vertical-align:-0.2861em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3011em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:0em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;e&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3117em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot; style=&quot;margin-right:0.02778em;&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal 
mtight&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;pp&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.2861em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.1667em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathnormal&quot; style=&quot;margin-right:0.02778em;&quot;&gt;θ&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.3361em;&quot;&gt;&lt;span style=&quot;top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;mord mathnormal mtight&quot;&gt;ec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The gradient flow looks complicated when written out as equations, but the figure below makes it much easier to follow.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 367px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/b25fce9956cd7d1865902b49073fef80/46684/gradient_flows.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 131.75675675675674%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAaCAIAAAA44esqAAAACXBIWXMAAA7DAAAOwwHHb6hkAAADp0lEQVR42o2U63OiSBDA/d/vw23dt726L7dfrq5St5tUcrdqYnyAiCI+QAgigvhARZSsioD4ijzmBo1uKle5TVM1Nc3Mb7qnp7sj4IcSBOGgjbyJDnZbY7UfOd7O9fwARMA7xTSB3AKiuN/tJy4Qbf/JC94L73xPphmxxg6HwwIv8z0VehN5h9eh24Ig3tzmyw8KWuTxqnwdRebz+ZuwfxLP86DKCxJRa+cpSXl80i2AkXVdn0T+3+B3y6KcxOi/rpP/3BMlrp8tcrquh3AAglek67ptuS0JUkuURUFSh2qVZjFK7j3uytxAVMxEpjSZ/Mdy4Lpw1NTR3ed0h1KZbIMvtNPXWENolpIpNH5PVujLNJGtsPBmES/wF3sn8P3QuqoGVQp0lVGOIJNU4a4yay+fJoDDRSVXAAxvKzrcJVreeOvDSWTvu5zV45aDpvdtKrJWV+v0HLom4jES+Ru/+5wikzSF8mSRs+1gPt4uZKvSXU+egu9PxfeacRbFkZvbVDRO5PLNMhYtGp21VFb67AS6kKLz0SZO8TyDccmaJKihCyFs2nY5HVsrnCnRlszIRJJn6Xa7WyYrDM3SlZrUlApEucT0GHFcbz1SjVEqS2426xCeG4ZYznYrmT6DA3v42CCEh9rhkfxzILE82Z/uOWmcIeqjBcCKrGUujvCiWULkYiJ19YdEpsdcoVln4H/PdY95As/BcHJkgCRa/vMqqdsAIxjzCJuWVUHiYNaxu4yjcH0KbRwsH7BjUUHLRZrX+M5MUObQ8xRCrFZO5JhAOSL308efv/zy4dPHX3//7ZPjLM+5dRyXpn2Bxr8iaOYmd4GWqmL3OWCOs0rhifZMaA2Y1riO0xleqJ8th5MDn7E5Th3qpWnaBt1V8AwbxoJgUBjlhsqavlbvVx545lXAYAa7sEb88ION4HjcCa4hcfTq8vYCLd+VhFxdYF/D4FQALyohhBcLM5mLTZw2N6q1p3xRQGi2+tLttyRyPCxfIy7zX5nYFY6kYrHE3Ji9rMo34eMO0VJv5w0QjetUhycmIekHP2wyz7kN47D3XB9mBQwGjMbpkY7KKVXAeR6cA7Zer/d79+znbreDq9vtxj3UNlSDg0B1vV69vA6EfZLIpzMZDC/gOUyUWh25pY40kiQRFK1UqcR9IptJZZCs0BTj8ThFUUQh3+sPnu+83WwGg35TlMbj8WJhQHW73dq2BVucpmmwV2mjIVxaOg7smNPZzDAMy7J83/8XS8CJdrcglScAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/b25fce9956cd7d1865902b49073fef80/46684/gradient_flows.png&quot;
        srcset=&quot;/static/b25fce9956cd7d1865902b49073fef80/12f09/gradient_flows.png 148w,
/static/b25fce9956cd7d1865902b49073fef80/e4a3f/gradient_flows.png 295w,
/static/b25fce9956cd7d1865902b49073fef80/46684/gradient_flows.png 367w&quot;
        sizes=&quot;(max-width: 367px) 100vw, 367px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Training and inference flows&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/c4eae91f82db2ea9fb132d1e8555381e/146da/train_inference.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 35.13513513513513%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAIAAACHqfpvAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABI0lEQVR42m1Q2W6DMBDk//+KPkURBWLOoJBD5jLmMA6XW0I7UOWt87Dyzs5qZ6wdj0dd1w+Hw/kckdPpRIht267rovq+H4YBIcQ0Pw3DIMThnAsh2je0rut8z/vQdcu05PM57pjneXxjVmqa5mnHC1hff1jXVev7HtLL5eJ5XhRFOBWE58D3b/c7yDi+pmkKgVLqS6nnjmEYsNUKoaHBGIZdx2GMFUXBCpZlKa/qqqrqum6aBmTJeVPXGEEMEi2ldFvGDGFNy0KPJN/L0g+DEF1Zlk3TYhkmf/7DlhmWrnGM5LZlOwQmyOPxSCjdbF8xieES3yGlhHIcJ7hG/mVZNHwJKBzHWVaWeZ5vD8ZQsyyHiPMqzzJKE5ok8IEskBVFLqX8BRpBfdtNgXdYAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/c4eae91f82db2ea9fb132d1e8555381e/fcda8/train_inference.png&quot;
        srcset=&quot;/static/c4eae91f82db2ea9fb132d1e8555381e/12f09/train_inference.png 148w,
/static/c4eae91f82db2ea9fb132d1e8555381e/e4a3f/train_inference.png 295w,
/static/c4eae91f82db2ea9fb132d1e8555381e/fcda8/train_inference.png 590w,
/static/c4eae91f82db2ea9fb132d1e8555381e/efc66/train_inference.png 885w,
/static/c4eae91f82db2ea9fb132d1e8555381e/146da/train_inference.png 943w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2 id=&quot;performance&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#performance&quot; aria-label=&quot;performance permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Performance&lt;/h2&gt;
&lt;h3 id=&quot;moscmos-on-ljspeech&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#moscmos-on-ljspeech&quot; aria-label=&quot;moscmos on ljspeech permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;MOS/CMOS on LJSpeech&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a674b5ae798a735d09ab89071c19717d/3c492/mos_cmos_on_ljspeech.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 35.810810810810814%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAIAAACHqfpvAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA9ElEQVR42m1Q2XKDMBDz//8daTqAr0BduwM+IMwkgAO0iknSPlQPHnntlVZLiqKoqoozVpYlT0CFUpplWV3XPoSmabz3bWu7rgMPocM1hDAMA1ludyzLMo6jlKIsKZqFEMYYyIFYa8E/7lDg339AXmxdV+ip9MtaN5zPcMZQaIAJilobkG3bfptjjLCd5zk+kCa5xb2Cp2madj4/sTNUCGNMKYWQIIxSKSXjHOfb4bBvQWsNcwRukdtZnwCO8GSOD5Pr5YKvLG1OcGTWeZ6/H4+vzEp9ouf/zJgQktinMV/Oub7vjTbydMIinHcQhTT8sZrtiR9kNYRpSPdvrgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/a674b5ae798a735d09ab89071c19717d/fcda8/mos_cmos_on_ljspeech.png&quot;
        srcset=&quot;/static/a674b5ae798a735d09ab89071c19717d/12f09/mos_cmos_on_ljspeech.png 148w,
/static/a674b5ae798a735d09ab89071c19717d/e4a3f/mos_cmos_on_ljspeech.png 295w,
/static/a674b5ae798a735d09ab89071c19717d/fcda8/mos_cmos_on_ljspeech.png 590w,
/static/a674b5ae798a735d09ab89071c19717d/efc66/mos_cmos_on_ljspeech.png 885w,
/static/a674b5ae798a735d09ab89071c19717d/c83ae/mos_cmos_on_ljspeech.png 1180w,
/static/a674b5ae798a735d09ab89071c19717d/3c492/mos_cmos_on_ljspeech.png 1300w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The MOS and CMOS metrics show no statistically significant difference between the distributions.&lt;/p&gt;
&lt;h3 id=&quot;benchmark-on-ljspeech&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#benchmark-on-ljspeech&quot; aria-label=&quot;benchmark on ljspeech permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Benchmark on LJSpeech&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/f7e170384762c8317ced79a60d21ce3e/690c6/benchmark.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 26.351351351351354%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAIAAADKYVtkAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAAyklEQVR42kWQ23KEIBBE/f+/M4BWbQwoFwVELpYvPsim3SSbfpgaqDndA00IYd937/04jo/Px2x+5ZxblsVaOwwDIWSeZymnD0KVUimlbdtANc8/KSkpo+tL3q9aa5CwxkEIEWMqpcAIphiutaI29dYTLab7vu9YxzknlLZtyxhDOBitpVI6p+Sc9esK7LouYHdy/UlWqu86rJdzxlYxRjTneaIao7UxJWdrlxC2/+T32tM0kVsU/HEc+0vlJg3nX1wI3OMT8OA38g1fzBQ/NL6UHwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/f7e170384762c8317ced79a60d21ce3e/fcda8/benchmark.png&quot;
        srcset=&quot;/static/f7e170384762c8317ced79a60d21ce3e/12f09/benchmark.png 148w,
/static/f7e170384762c8317ced79a60d21ce3e/e4a3f/benchmark.png 295w,
/static/f7e170384762c8317ced79a60d21ce3e/fcda8/benchmark.png 590w,
/static/f7e170384762c8317ced79a60d21ce3e/efc66/benchmark.png 885w,
/static/f7e170384762c8317ced79a60d21ce3e/c83ae/benchmark.png 1180w,
/static/f7e170384762c8317ced79a60d21ce3e/690c6/benchmark.png 1201w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Even in absolute terms, the gap is fairly large.&lt;/p&gt;
&lt;h3 id=&quot;modules&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#modules&quot; aria-label=&quot;modules permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Modules&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/14ed0910e9b4b5e9b9914f102906eddd/9a86a/module_performances.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 38.513513513513516%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAIAAAB2/0i6AAAACXBIWXMAAA7DAAAOwwHHb6hkAAABOUlEQVR42l1Q13KDMBDk/z/ML7aosgqQUDxmwIhiA55QDFkgZZJ9QOJOe3u7mhA8S7MwDMnpJKTwfd80Ddd1gyBgjEkhKKW27bwHAedcSte2LNuhnuc1TatN4/R6vVwpj8cjSo/Ho67rsiyVUnme41IUqiiK7HbTCTkcDlEUTdM0DsM8z9qyAcqe57/5fp6vnK7rhg0fG/q+fz6faZomSYLW8g0NA3Cgaq0wCdGpQ/HbNE2uFDjYC1LLvICWJFdM/CXvxzSOtmXrhikFP7MVjmMTQrANurtAWVacs6Ztfypf5KHvGeMGyFKahknpWQih64SeGbaFf7xBFlKKtu3+k5HALgWy57owgFmqUGVVZWmK2Fby/Y6oq7r+Q4YltOM4DoIwjiMYTq7Xy+XSboBVfCGLIPFmH7R7/gRn5bbi1RXYPgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/14ed0910e9b4b5e9b9914f102906eddd/fcda8/module_performances.png&quot;
        srcset=&quot;/static/14ed0910e9b4b5e9b9914f102906eddd/12f09/module_performances.png 148w,
/static/14ed0910e9b4b5e9b9914f102906eddd/e4a3f/module_performances.png 295w,
/static/14ed0910e9b4b5e9b9914f102906eddd/fcda8/module_performances.png 590w,
/static/14ed0910e9b4b5e9b9914f102906eddd/9a86a/module_performances.png 792w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
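&lt;p&gt;When the model is trained with each of the ideas proposed in this work removed one at a time, every one of them turns out to have a large effect on the metric.&lt;/p&gt;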
&lt;h3 id=&quot;inference-speed&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#inference-speed&quot; aria-label=&quot;inference speed permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Inference speed&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/da5d100b2cd45b9d5579e26e37e3bfcf/844cc/inference_speed.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 31.756756756756754%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAIAAABM9SnKAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAA0UlEQVR42oWQ2ZKDIBBF/f+fs4xGxGhGDEgMMC4hca0wt/TFmZc5BV3Ncpu+eJQmhMQxicMw/ALXa8GKPMuyPMcq3yJjBcPcyC4XVpY4r6T0mqbpuq5tWyTW2sdDIX8PbzCO4zAMrw1jTF3XnwPOOc8dWJYliiJCkqqqyq08ohBimiYob5zjzi7b8Y6V5nmmlEbnc0xIeDoFQeD7Pjz1fa+1Rut/xceX13WFpTRN4UdrBYHg4n6X1j6/jUEX/4g55/gVKSW6gAvsIGDgR5RS7jc/qXVQIbzAq48AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;img&quot;
        title=&quot;&quot;
        src=&quot;/static/da5d100b2cd45b9d5579e26e37e3bfcf/fcda8/inference_speed.png&quot;
        srcset=&quot;/static/da5d100b2cd45b9d5579e26e37e3bfcf/12f09/inference_speed.png 148w,
/static/da5d100b2cd45b9d5579e26e37e3bfcf/e4a3f/inference_speed.png 295w,
/static/da5d100b2cd45b9d5579e26e37e3bfcf/fcda8/inference_speed.png 590w,
/static/da5d100b2cd45b9d5579e26e37e3bfcf/efc66/inference_speed.png 885w,
/static/da5d100b2cd45b9d5579e26e37e3bfcf/c83ae/inference_speed.png 1180w,
/static/da5d100b2cd45b9d5579e26e37e3bfcf/844cc/inference_speed.png 1306w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
        decoding=&quot;async&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The RTF is also comparable to FastSpeech 2 + HiFiGAN and VITS, which makes it fast.&lt;/p&gt;
&lt;h3 id=&quot;latency&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#latency&quot; aria-label=&quot;latency permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Latency&lt;/h3&gt;
&lt;h2 id=&quot;conclusion&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#conclusion&quot; aria-label=&quot;conclusion permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Reaching human-level metrics on the LJSpeech dataset is promising, and personally I liked both the novelties and the architecture. I am also curious about benchmark results on other datasets, and it would be nice if they were included.&lt;/p&gt;
&lt;p&gt;Verdict: good stuff.&lt;/p&gt;</content:encoded></item></channel></rss>