{"componentChunkName":"component---src-templates-post-template-js","path":"/clamav-scan-fmap-en","result":{"data":{"markdownRemark":{"id":"cdfe7bf1-bae8-5a6d-9d62-7d4d45df7361","html":"<blockquote>\n<p>This page has been machine-translated from the <a href=\"/clamav-scan-fmap\">original page</a>.</p>\n</blockquote>\n<p>In the <a href=\"/clamav-clamscan\">previous article</a>, I briefly traced the processing flow up to the point where the <code class=\"language-text\">cli_scan_fmap</code> function is called when scanning an EICAR test file with <code class=\"language-text\">clamscan</code>.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\">perf_start(ctx, PERFT_RAW);\nret = cli_scan_fmap(ctx,\n                    type == CL_TYPE_TEXT_ASCII ? CL_TYPE_ANY : type,\n                    false,\n                    &amp;ftoffset,\n                    acmode,\n                    NULL);\nperf_stop(ctx, PERFT_RAW);</code></pre></div>\n<p>This time, starting from the implementation of <code class=\"language-text\">cli_scan_fmap</code>, the core function for file scanning in ClamAV, I will outline the Aho-Corasick algorithm used for signature matching in antivirus software.</p>\n<!-- omit in toc -->\n<h2 id=\"table-of-contents\" style=\"position:relative;\"><a href=\"#table-of-contents\" aria-label=\"table of contents permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Table of Contents</h2>\n<ul>\n<li><a 
href=\"#background-pattern-matching-in-clamav\">Background: Pattern Matching in ClamAV</a></li>\n<li><a href=\"#what-is-the-aho-corasick-algorithm\">What Is the Aho-Corasick Algorithm?</a></li>\n<li>\n<p><a href=\"#implementing-the-aho-corasick-algorithm\">Implementing the Aho-Corasick Algorithm</a></p>\n<ul>\n<li><a href=\"#building-a-trie\">Building a Trie</a></li>\n<li><a href=\"#implementing-a-trie\">Implementing a Trie</a></li>\n<li><a href=\"#about-the-aho-corasick-automaton\">About the Aho-Corasick Automaton</a></li>\n<li><a href=\"#implementing-aho-corasick\">Implementing Aho-Corasick</a></li>\n</ul>\n</li>\n<li><a href=\"#aho-corasick-in-clamav\">Aho-Corasick in ClamAV</a></li>\n<li><a href=\"#summary\">Summary</a></li>\n</ul>\n<h2 id=\"background-pattern-matching-in-clamav\" style=\"position:relative;\"><a href=\"#background-pattern-matching-in-clamav\" aria-label=\"background pattern matching in clamav permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Background: Pattern Matching in ClamAV</h2>\n<p>Roughly speaking, <code class=\"language-text\">cli_scan_fmap</code>, the core function for file scanning in ClamAV, scans a memory map that is abstracted by the <code class=\"language-text\">cl_fmap</code> structure introduced in the previous article.</p>\n<p>Internally, it performs processing such as hash-based checks and pattern matching against the loaded signature database.</p>\n<p>After initializing the scan, <code class=\"language-text\">cli_scan_fmap</code> 
appears to read up to SCANBUFF (0x20000 bytes) from the file map as a single chunk and then scan it with the <code class=\"language-text\">matcher_run</code> function.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/9227e830595fb5eaa841331b3333b68c/807a0/image-20251214111115528.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 56.25%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAYAAAB/Ca1DAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB+ElEQVQoz2VSybKbMBDkkrJBgIQQYveCWWzw8hwvtyT//1WdkSCuVHLomhkto+4eOZt6g7qqESuFvChQliXyPEccx5BSIiIIIcA5h+d5cF33gz+1ib7vgzEGJy1TZGWG1WoNXZRohhHD8YhD09iHbFMR2ahkjCyl8zqF1hqKSCSJRhRFtplp6uR1hqLO4a7XxIQjMcxiids04XG/o+87nM8TpmnEOAw49j36tkV7OGC73dDDexRFbpn6PjHUOkeaFlit6QWhkNUHVNUGVVEvoEvtQExyCJmBBQJrz4drEdjosYCacTA/hDOdbxinK9q9wqllOHbGyxoiSuiAmC+QR+aSyyTVoc09FtI+hx/wpdkM5/1+4f164TbGeE7fcB18nPoSz68KzSYkVtQ41MQsguA+FNmhVExDk1YiYzSQv+BcLl8wLNu9xLFZ47DjNOmC/DGoEHANL8ypaQEuS8QkXSUp/QpN7MSH2YehjBM0TYfr7Y6up+m2HcURu303y7b++Fa+FyiSzT++Gc/+axhFAuN4wq+fP/B8PnC7XvH4frfTm2UwK23OzSS9uf5gkbtEhwtJjE54PN44niYM9A8v1xvJrhbzg88QZiy19+/eHJ1oq1Bsa9T1znqTJBlEXCKUFUKhkdAnNutKzb4lOrO5OTeva7tucmPfbyJkE0mFLdABAAAAAElFTkSuQmCC'); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/9227e830595fb5eaa841331b3333b68c/8ac56/image-20251214111115528.webp 240w,\n/static/9227e830595fb5eaa841331b3333b68c/d3be9/image-20251214111115528.webp 480w,\n/static/9227e830595fb5eaa841331b3333b68c/e46b2/image-20251214111115528.webp 960w,\n/static/9227e830595fb5eaa841331b3333b68c/f992d/image-20251214111115528.webp 
1440w,\n/static/9227e830595fb5eaa841331b3333b68c/60dd2/image-20251214111115528.webp 1652w\"\n              sizes=\"(max-width: 960px) 100vw, 960px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/9227e830595fb5eaa841331b3333b68c/8ff5a/image-20251214111115528.png 240w,\n/static/9227e830595fb5eaa841331b3333b68c/e85cb/image-20251214111115528.png 480w,\n/static/9227e830595fb5eaa841331b3333b68c/d9199/image-20251214111115528.png 960w,\n/static/9227e830595fb5eaa841331b3333b68c/07a9c/image-20251214111115528.png 1440w,\n/static/9227e830595fb5eaa841331b3333b68c/807a0/image-20251214111115528.png 1652w\"\n            sizes=\"(max-width: 960px) 100vw, 960px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/9227e830595fb5eaa841331b3333b68c/d9199/image-20251214111115528.png\"\n            alt=\"image-20251214111115528\"\n            title=\"image-20251214111115528\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<p>This <code class=\"language-text\">matcher_run</code> function is called with the following parameters.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\"><span class=\"token keyword\">static</span> <span class=\"token keyword\">inline</span> <span class=\"token class-name\">cl_error_t</span> <span class=\"token function\">matcher_run</span><span class=\"token punctuation\">(</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_matcher</span> <span class=\"token operator\">*</span>root<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">unsigned</span> <span class=\"token 
keyword\">char</span> <span class=\"token operator\">*</span>buffer<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> length<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">char</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>virname<span class=\"token punctuation\">,</span> <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_data</span> <span class=\"token operator\">*</span>mdata<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">uint32_t</span> offset<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_target_info</span> <span class=\"token operator\">*</span>tinfo<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">cli_file_t</span> ftype<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_matched_type</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>ftoffset<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">unsigned</span> <span class=\"token keyword\">int</span> acmode<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">unsigned</span> <span class=\"token keyword\">int</span> pcremode<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_result</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>acres<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">fmap_t</span> <span class=\"token operator\">*</span>map<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span 
class=\"token class-name\">cli_bm_off</span> <span class=\"token operator\">*</span>offdata<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_pcre_off</span> <span class=\"token operator\">*</span>poffdata<span class=\"token punctuation\">,</span>\n    cli_ctx <span class=\"token operator\">*</span>ctx\n<span class=\"token punctuation\">)</span></code></pre></div>\n<p>The <code class=\"language-text\">matcher_run</code> function scans the buffer it receives. It first performs a Boyer-Moore-based scan with <code class=\"language-text\">cli_bm_scanbuff</code> (in practice its behavior is closer to the Wu-Manber method), and then performs pattern matching with the Aho-Corasick algorithm in <code class=\"language-text\">cli_ac_scanbuff</code>.</p>\n<h2 id=\"what-is-the-aho-corasick-algorithm\" style=\"position:relative;\"><a href=\"#what-is-the-aho-corasick-algorithm\" aria-label=\"what is the aho corasick algorithm permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>What Is the Aho-Corasick Algorithm?</h2>\n<p>The Aho-Corasick algorithm is a string-search algorithm originally devised to efficiently search for specific keywords in documents. 
It can be viewed as performing what is sometimes called a “Common Prefix Search” from every position of a given text, extracting every registered keyword that occurs anywhere in that text.</p>\n<p>Reference: <a href=\"https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Aho–Corasick algorithm - Wikipedia</a></p>\n<p>Reference: <a href=\"https://naoya-2.hatenadiary.org/entry/20090405/aho_corasick\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Aho Corasick 法 - naoyaのはてなダイアリー</a></p>\n<p>Reference: <a href=\"https://tech.legalforce.co.jp/entry/2022/02/24/140316#AhoCorasick%E6%B3%95\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">高速な文字列探索：Daachorseの技術解説 - LegalOn Technologies Engineering Blog</a></p>\n<p>In the Aho-Corasick algorithm, an automaton (a model made up of states and the transitions between them) is built in advance from the keyword patterns you want to search for, and pattern matching is then performed against a given buffer.</p>\n<p>For antivirus pattern matching such as in ClamAV, the automaton can be built beforehand from the preloaded signature database, so the cost of matching grows linearly with the length of the input buffer (plus the number of matches reported).</p>\n<p>In this way, the Aho-Corasick algorithm can efficiently check whether any of the many signatures defined in a signature database appear in a given data stream, which makes it well suited to implementing pattern matching in antivirus software such as ClamAV.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/eb9edd8d99656ad8833adecba87795db/87a80/image-20251214132434653.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    
style=\"padding-bottom: 54.166666666666664%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAYAAAB/Ca1DAAAACXBIWXMAAAsTAAALEwEAmpwYAAACyElEQVQoz2WSS28TVxiG53+w4RewYNUF6gJ1xZZtVxWihEVLpfQiEGpRgSQ0LIgqQiTapikktPgSx2M7nvgWe2zHc3fGGdvjOImdOHFiE+AXPD1xUTcsXn0z58x55znv90khw2VC0ZlOm0yG4sxEFWZXM2idQ2YLLtcWszxMiv1slW8iGj/EDL6T9f81HtX5arnCotGk5O8i6ft9brxU+FkuknIbRDSHVL2NKdavLuQ4NxHlyrM4ny/muTCzymfP01yaS3H5eZZP59J88lTh/KMYDxSTQq0hDLt90sLoj5UYL+JJ/owmmJcTJC2XsdAGV+ez/F72mEw7XA+U+FrQXH+1zpevcoy9VrkZKvPF30XmKx7FM0Kre0it/wb3aEipuTOSttOl2GyLuk/WsmkN37El9r3+kNrRAKPdoeLv4HR6NAen+IO3eMenIx+pejhks39KQrdZq3qsCrJoxRLUTRK2x1zOJO/vES7qxI1NlGodxfFIbTbIbPnImkVcnD2jy3stpLzfIWbWCJdNZNGgVM0nVCiTsav8KJe5+Osaz1RhpJto3SPs3gBdVFUYFBo7mAcnxM3NUf7GwQBpY+9wRBDVbHLiA7W1R7LRYbXeYUF0bmwpw2vbH70nvD2CTouf5BK3gzm+/SfN1JohgFyygs7sDZHK7S7u8VtimonTO6G03eFxxuL7cJ47y3nuBlLcT9ncW7NG9WHaZjpXZUbdYlY0a8FojQzPrlwV0UlmtyfItgmoGkpjl7+sFk/WbZ5mNvhtXedF2SHs1ImLTJMijpQgydS3WRdnVL9NZbfLSsUQcTli1I6QIlaNicAKt5cLjEfK3AoVGV+KM70c43EkwS/hGE+iSWZkhamgzFToP02K54kPmgxGubcU5KWAkiq7B9SH78Wf2yypJgHNRd8/wRu8G0VRE3I/qHbysbzBeywxy6qgVsXI/QttRMsNUwfHcQAAAABJRU5ErkJggg=='); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/eb9edd8d99656ad8833adecba87795db/8ac56/image-20251214132434653.webp 240w,\n/static/eb9edd8d99656ad8833adecba87795db/d3be9/image-20251214132434653.webp 480w,\n/static/eb9edd8d99656ad8833adecba87795db/e46b2/image-20251214132434653.webp 960w,\n/static/eb9edd8d99656ad8833adecba87795db/0ea8a/image-20251214132434653.webp 973w\"\n              sizes=\"(max-width: 960px) 100vw, 960px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/eb9edd8d99656ad8833adecba87795db/8ff5a/image-20251214132434653.png 240w,\n/static/eb9edd8d99656ad8833adecba87795db/e85cb/image-20251214132434653.png 
480w,\n/static/eb9edd8d99656ad8833adecba87795db/d9199/image-20251214132434653.png 960w,\n/static/eb9edd8d99656ad8833adecba87795db/87a80/image-20251214132434653.png 973w\"\n            sizes=\"(max-width: 960px) 100vw, 960px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/eb9edd8d99656ad8833adecba87795db/d9199/image-20251214132434653.png\"\n            alt=\"image-20251214132434653\"\n            title=\"image-20251214132434653\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<p>Reference: <a href=\"https://www.mdpi.com/1999-4893/18/12/742#\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">The Aho-Corasick Paradigm in Modern Antivirus Engines: A Cornerstone of Signature-Based Malware Detection</a></p>\n<h2 id=\"implementing-the-aho-corasick-algorithm\" style=\"position:relative;\"><a href=\"#implementing-the-aho-corasick-algorithm\" aria-label=\"implementing the aho corasick algorithm permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Implementing the Aho-Corasick Algorithm</h2>\n<h3 id=\"building-a-trie\" style=\"position:relative;\"><a href=\"#building-a-trie\" aria-label=\"building a trie permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path 
fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Building a Trie</h3>\n<p>To implement the Aho-Corasick algorithm, we first create an automaton based on a Trie built from a fixed set of keywords.</p>\n<p>A Trie is a tree structure that represents a set of strings.</p>\n<p>As shown below, a Trie uses the empty string as the Root and manages multiple strings together by grouping those that share the same prefix.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 336px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/610cb0459285f2e700d8b6cb61689d4b/d99f2/image-20251214151725777.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 120%; position: relative; bottom: 0; left: 0; background-image: 
url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAYCAYAAAD6S912AAAACXBIWXMAAAsTAAALEwEAmpwYAAADZklEQVQ4y31Vi3LaSBDk/38rqXKwnZSdMwe2kJEQ6IGeCIQeCM11jywCNrmtGrSP2d6Znt5lVDeNHA6llGUlhxtWVjW+pez2hVR1c9Pn0kYEattWuq6T0wnWfdhH/4g13/dlvV7Lfr8XttPp9MWP1gNWlTp8bjyALcsyBeKhQRDI/7XyMyBBBiNAiVSTJBHbtmW5XIrrutKAorquv/j/FfCyWZYlu91OXl5exHEc9eOXB93KpryVMgE2m43MZjPljo0pM1K2Cv6TyUSjjeP4IojuNuBqtVLeuGGIhAcQ5NKnKAr4eBrAEOUVYNf1oARiRT3PUyduZJqmaZ6jTNNUwnADYEdBrgHxM4RN8unMFMMwlMVioSmyIARM0+Qc5ffvlhiGD1AbRWp0TmVTQbisWp7nuul4POoivxwzpTiOEHEo376Rjr7CQVDiQpyQ0Rr9jdJTAWfEm0J9sQgEOBwOuoHyeH19lel0JlEUCmn2PNyWqr2qMOkwjLnStM13PSBJN00DqdlnQEbNlHkIN13L5I9UqATbflfAfLdnUWqc3srbm4dIturEgjBVRmgYhnLKLPa4z5e6YzGjKAYVMy0IuRwdtEodODxqWnSiRLbbLZwjFTcbwTm+bPRjhGHYgHNUufqQDa/ZIGBu5M2gXIYx1wnMeWqQdFAF87mpBw9Rnx+Htu01SOXf34/l4WGs/el0KuPxD+WRgD9/PoCaN6Xj7u4O6xNVwBcdDhFS3JblYdNaeSXQ/f0jCPcBcgC4AS3mKinbtuTx8V/IqH/SutM5wvqKlzxvAFCc76dtJ0izgIw6AGykvVCNae4kiavzmJoe8bUOIJsMXKTZFt8cluHkXDnkOEm30FgOy+CT6eudpuinseR8yatGnNVaojiBDlHq9/eFeH6gepvMDN3w9PSMIkwk3e6Uy19Pv6GzQoz5HP1ncJfoDYkSUtACLO6FzTDtpSMLa9kXYvaqY2ftiotD1nhRZphbub7KZAxOvYAX4V0McyGu52OvLUEY6//PiOhLZ4W0IylwSxzIIkoyBQ/CSOIk1UhSpO5C8Ja9FD/gS+PKCocyqg38SJcC8lI///5H9uCybo74UzqhAEf0W2lwGOcazA1rLarJefaPH/Mc0wpWuYEE+PdI9N6qi/5nq1Rrf/yufUnffz5PKwvsGYhzAAAAAElFTkSuQmCC'); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/610cb0459285f2e700d8b6cb61689d4b/8ac56/image-20251214151725777.webp 240w,\n/static/610cb0459285f2e700d8b6cb61689d4b/a6cd2/image-20251214151725777.webp 336w\"\n              sizes=\"(max-width: 336px) 100vw, 336px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/610cb0459285f2e700d8b6cb61689d4b/8ff5a/image-20251214151725777.png 240w,\n/static/610cb0459285f2e700d8b6cb61689d4b/d99f2/image-20251214151725777.png 336w\"\n            sizes=\"(max-width: 336px) 100vw, 336px\"\n            type=\"image/png\"\n          />\n          
<img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/610cb0459285f2e700d8b6cb61689d4b/d99f2/image-20251214151725777.png\"\n            alt=\"image-20251214151725777\"\n            title=\"image-20251214151725777\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<p>In the example above, the keyword set <code class=\"language-text\">\"A\", \"to\", \"tea\", \"ted\", \"ten\", \"i\", \"in\", \"inn\"</code> is represented as a Trie, and the nodes directly below the Root hold the distinct first characters <code class=\"language-text\">\"t\", \"A\", \"i\"</code>.</p>\n<p>From there, the Trie is completed by adding one node per character, so that keywords sharing a common prefix share the same path from the Root.</p>\n<p>Reference: <a href=\"https://qiita.com/minaminao/items/caf6d8147c7e70b6ae63\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">すごいTrie #アルゴリズム - Qiita</a></p>\n<p>Reference: <a href=\"https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%A9%E3%82%A4_(%E3%83%87%E3%83%BC%E3%82%BF%E6%A7%8B%E9%80%A0)\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">トライ (データ構造) - Wikipedia</a></p>\n<h3 id=\"implementing-a-trie\" style=\"position:relative;\"><a href=\"#implementing-a-trie\" aria-label=\"implementing a trie permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Implementing a Trie</h3>\n<p>Next, we will implement a Trie in Python.</p>\n<p>First, 
we define the <code class=\"language-text\">TrieNode</code> class as a node in the Trie.</p>\n<p>Each node holds the following two pieces of information.</p>\n<ol>\n<li>The child nodes (<code class=\"language-text\">TrieNode</code>) reachable from this node (stored in this example as a dictionary keyed by character)</li>\n<li>A flag indicating whether a keyword ends at this node</li>\n</ol>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token decorator annotation punctuation\">@dataclass</span>\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">TrieNode</span><span class=\"token punctuation\">:</span>\n    children<span class=\"token punctuation\">:</span> Dict<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"TrieNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">dict</span><span class=\"token punctuation\">)</span>\n    is_end<span class=\"token punctuation\">:</span> <span class=\"token builtin\">bool</span> <span class=\"token operator\">=</span> <span class=\"token boolean\">False</span></code></pre></div>\n<p>Here, <code class=\"language-text\">field(default_factory=dict)</code> tells the constructor generated by <code class=\"language-text\">@dataclass</code> to call <code class=\"language-text\">dict</code> each time an instance is created, so that every node gets its own new dictionary instead of sharing one mutable default. 
(See the references below for details.)</p>\n<p>Reference: <a href=\"https://note.com/shunk031/n/nc1106f2ef926\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">dataclass で万物に型を付けよう｜しゅんけー</a></p>\n<p>Reference: <a href=\"https://qiita.com/kinakomochi_/items/3a86552dad6c6d768702\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">【Python】dataclassで初心者が必ずハマる default_factory の罠を徹底解説 #Python - Qiita</a></p>\n<p>With the Trie node defined, we next implement the <code class=\"language-text\">Trie</code> class, which registers keywords and looks them up.</p>\n<p>As described above, the Root of the Trie represents the empty string, so when instantiating the class we assign an empty <code class=\"language-text\">TrieNode</code> to <code class=\"language-text\">self.root</code>.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">class</span> <span class=\"token class-name\">Trie</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> TrieNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>The <code class=\"language-text\">insert</code> method, which registers a keyword in the Trie, reads the string <code class=\"language-text\">keyword</code> one character at a time and descends the Trie from the Root in that order.</p>\n<p>At each step, if the current character does not yet exist among the child nodes, a newly created <code class=\"language-text\">TrieNode</code> instance is registered in 
the dictionary as a child node. Once the last character has been processed, <code class=\"language-text\">is_end</code> is set to <code class=\"language-text\">True</code> on the final node to mark the end of a keyword.</p>\n<p>Next, we implement the <code class=\"language-text\">is_contains</code> method, which checks whether a specific keyword is registered in the Trie, as follows. The helper <code class=\"language-text\">_find</code> returns the node reached by following the keyword, or <code class=\"language-text\">None</code> if the path breaks off, and <code class=\"language-text\">is_contains</code> then checks <code class=\"language-text\">is_end</code>, so a string that is only a prefix of a registered keyword is not reported as contained.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">def</span> <span class=\"token function\">is_contains</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">bool</span><span class=\"token punctuation\">:</span>\n    node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>_find<span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">return</span> node<span class=\"token punctuation\">.</span>is_end <span class=\"token keyword\">if</span> node <span class=\"token keyword\">else</span> <span class=\"token boolean\">False</span>\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">_find</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> Optional<span class=\"token punctuation\">[</span>TrieNode<span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span> <span class=\"token comment\"># TrieNode | None</span>\n    node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n    <span class=\"token keyword\">for</span> ch <span class=\"token 
keyword\">in</span> keyword<span class=\"token punctuation\">:</span>\n        node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span>ch<span class=\"token punctuation\">)</span>\n        <span class=\"token keyword\">if</span> node <span class=\"token keyword\">is</span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">return</span> <span class=\"token boolean\">None</span>\n    <span class=\"token keyword\">return</span> node</code></pre></div>\n<p>The final code is shown below.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">from</span> dataclasses <span class=\"token keyword\">import</span> dataclass<span class=\"token punctuation\">,</span> field\n<span class=\"token keyword\">from</span> typing <span class=\"token keyword\">import</span> Dict<span class=\"token punctuation\">,</span> Optional\n\n\n<span class=\"token decorator annotation punctuation\">@dataclass</span>\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">TrieNode</span><span class=\"token punctuation\">:</span>\n    children<span class=\"token punctuation\">:</span> Dict<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"TrieNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">dict</span><span class=\"token punctuation\">)</span>\n    is_end<span class=\"token punctuation\">:</span> <span class=\"token builtin\">bool</span> <span class=\"token operator\">=</span> <span class=\"token 
boolean\">False</span>\n\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">Trie</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> TrieNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">insert</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n        <span class=\"token keyword\">for</span> ch <span class=\"token keyword\">in</span> keyword<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">if</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n                node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> TrieNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n            node <span class=\"token 
operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span>\n        node<span class=\"token punctuation\">.</span>is_end <span class=\"token operator\">=</span> <span class=\"token boolean\">True</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">is_contains</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">bool</span><span class=\"token punctuation\">:</span>\n        node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>_find<span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span>\n        <span class=\"token keyword\">return</span> node<span class=\"token punctuation\">.</span>is_end <span class=\"token keyword\">if</span> node <span class=\"token keyword\">else</span> <span class=\"token boolean\">False</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">_find</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> Optional<span class=\"token punctuation\">[</span>TrieNode<span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span> <span class=\"token comment\"># TrieNode | None</span>\n        node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n        <span class=\"token keyword\">for</span> ch <span class=\"token 
keyword\">in</span> keyword<span class=\"token punctuation\">:</span>\n            node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span>ch<span class=\"token punctuation\">)</span>\n            <span class=\"token keyword\">if</span> node <span class=\"token keyword\">is</span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n                <span class=\"token keyword\">return</span> <span class=\"token boolean\">None</span>\n        <span class=\"token keyword\">return</span> node\n\n\n<span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n    trie <span class=\"token operator\">=</span> Trie<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">for</span> w <span class=\"token keyword\">in</span> <span class=\"token punctuation\">[</span><span class=\"token string\">\"he\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"she\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"his\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"hers\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"I\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"my\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"ME\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"mine\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n        trie<span class=\"token punctuation\">.</span>insert<span class=\"token punctuation\">(</span>w<span class=\"token punctuation\">)</span>\n\n    <span 
class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"her\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"hero\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"she\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"I\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"i\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span>\n    <span 
class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"ME\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">assert</span><span class=\"token punctuation\">(</span>trie<span class=\"token punctuation\">.</span>is_contains<span class=\"token punctuation\">(</span><span class=\"token string\">\"Me\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span></code></pre></div>\n<h3 id=\"about-the-aho-corasick-automaton\" style=\"position:relative;\"><a href=\"#about-the-aho-corasick-automaton\" aria-label=\"about the aho corasick automaton permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>About the Aho-Corasick Automaton</h3>\n<p>With the Trie implementation above, we can now determine whether “a specific keyword exists in a predefined set,” but a simple Trie cannot be used to detect the “partial matches within a byte stream” that AntiVirus pattern matching requires.</p>\n<p>So, based on the Trie structure, we create a finite automaton for searching with the Aho-Corasick method.</p>\n<p>Specifically, we add Failure Links and Output to the Trie implemented as a tree 
structure.</p>\n<p>Failure Links are links that let the search transition to the next state that may match, instead of restarting the search, when matching a keyword character fails partway through.</p>\n<p>As a result, for example, when a pattern that has matched <code class=\"language-text\">...ABCD</code> mismatches on the next character, instead of starting over it can use the fact that the suffix <code class=\"language-text\">...BCD</code> may still match, transition to the <code class=\"language-text\">B -> C -> D</code> nodes in the Trie structure, and continue searching efficiently.</p>\n<p>Reference: <a href=\"https://medium.com/@nagasaiviraj_tammana/aho-corasick-algorithm-2372861bc650\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">AHO CORASICK ALGORITHM. The Aho-Corasick algorithm is a… | by Viraj Tammana | Medium</a></p>\n<p>Reference: <a href=\"https://www.upgrad.com/blog/aho-corasick-algorithm/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Aho-Corasick Algorithm: Key Concepts, Code &#x26; Real Examples</a></p>\n<p>This mechanism prevents the search position from ever rewinding within the input byte stream, so pattern matching can be performed in linear time with respect to the input.</p>\n<p>The Output values, meanwhile, contain the keywords that match when that state is reached.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 398px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/3f7035306caee189c0bfa3f0bb84d846/692d4/image-20251214133135610.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 161.66666666666666%; position: relative; bottom: 0; left: 0; background-image: 
url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAgCAYAAAASYli2AAAACXBIWXMAAAsTAAALEwEAmpwYAAAFYElEQVRIx4VWaXPaSBDl//+VzZdUNhU7lY1jHNv4wGAOSdyHQAgJhAAhIQl4+3pADpVjV1VTumZ6Xne/fj25dLdDHCfYHw5wXQdXV19RqWlot9soFAp4fHzi/QEvL2XomoYPf3/Et+s8notF3q/R7vaw2+2RJAlHitwmjBBtY4TRFjE/yLU/ACkn8Qbe0O50UavXEWwiZNfhdJd5slZsrIMNciENbolQRmZY7rKRGB6aJvL5b7i5yUPT6kjSvfon87K52fpgEx4RZsbOxzZOsVqv0Wz18OlTHl+vvuPpWaORkP9/N/8PBuVZdhZ0o/GYMezg7s5jvEp4ehqj1x+qeIeRzI3fjP/RoBhL0h3jETBuU7q3g+NuoektxngH3Zhj6niM7U4ZPQeyPjcoQxYIsmCzQaVqwVvEDH6CsRVhNg9oJOUGCY273MTjv4Pa/JcYSnblxVv4XDxBuTKCvyQN0oRG9mg0bZTKdeWquBlGKZHO0OmamHseYx1wbkqvzhCORmM4jo1isY6pveHeqUqMLPjy5R8m5gKDoakQxUmM5WpLbhqw7YmKtQAKaCuXpX9sWfAXC+46wt3rAMvlDgdFmxHev/+Ai4sL1DX9zUVB5cwsBIy147rKO7GjEAo5TdPCY+kWHx9vMXRc9Il2OFmiVktQLjWRv7mC5y9VeJJ4Dy9y8a2os3raaLGqxIYithgU//3lGnW9Dt3VMZkKqWM0+x4apouB5aPRGKlFKREuQg/NoUMvgG6vB2tin5jxE20W3gqLaIbqoAtrdMBqlSLdb2EvptD6A5iTDaZ0rTOe0RiRplu6O1Oof+GhjEZTuJagYLTQG26YYaHSnovXMEdD2BsL+UoTjc4a6zUZsJM675wol/7gobzITo2Gjm7HhTkjInOAKDyQdyGrY0CaaGgNpljS0Dxw0bFc1PQ+Pn/+zBh2frisskx3B4MhqtUqms0mwnADY9qE5ZCb3pwbNdFjrEYjW/EwIepNtMY1E/Xur3e4vLwkjdb8tv2BULgk1BF1iVgNXuigTpSOE8CyRoqn7mymuBnFojI7XOcNIrxEqVT+fS3P5h6yDSIiGfo9dEYLolyj8HDPf9tTvGNunmDh71Gt1chdiixr9o02mcGMtPIcE4nQY+xbMPouxuM59vtjlYwtMsKPVdKsyVQZkmSu/1sP6UK0UTRy/TnKTVLJDZggKvOKIrI/ZlZq/48Cm/04R+lFHkzLZJZ1NAcLaL0petYM9ixk+UUol19ZFKvfE/t8ZLo4mdqqgjqdPsyhrdz0gxW6kwkqWgPD4YCq3lJui623LKuy2h1O970yJto4ny9gGAZ0XWO8LNW0RDSkXAeDPinVULQSdzeZ2ohyjMaWiocEeeo4mNg2bIqEqE2311c8lGeZN+I8l4UghNd0XSEUG9IVc4IuZDU0Wm32jAfcPz7DkF1JcKkAXTdYXj3Y06lCUnqtsDXoqLEIKtWaepfebU+do3w5RPP9+y0K/CiKczj15Yg/Rf86HcoTDRpE0iWiWl1Ti6F69kGp+F6F4JQUnxokaAwKgxjMsizfn4sv1MIXyn+VLbSI8mtVfRuaYxX3rD9nhwVFG3nIrvMTxIq1WaVLer1Cyhg0WobG00OtVlUh2lFqBJUMSab0HoVwwszJeeWZZ5fsaCGltQpitM0A2iCgmAbQ+wGGFF6jz41aC3LSV7Esso4fCvf0sK3aas5jE7rO3+C+8KgyZlIEEnVeYXktQ8z843AXIRYrtlPevWV0HBRbGRZFxXHnR7URg8/FErvdC25v75gIA9tTfKKtiEE24tN7fHw/qyzp5ypMpE5OjmI/X/HpaBb/70h+Gf8CwmhogjhSogAAAAAASUVORK5CYII='); background-size: cover; 
display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/3f7035306caee189c0bfa3f0bb84d846/8ac56/image-20251214133135610.webp 240w,\n/static/3f7035306caee189c0bfa3f0bb84d846/579c2/image-20251214133135610.webp 398w\"\n              sizes=\"(max-width: 398px) 100vw, 398px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/3f7035306caee189c0bfa3f0bb84d846/8ff5a/image-20251214133135610.png 240w,\n/static/3f7035306caee189c0bfa3f0bb84d846/692d4/image-20251214133135610.png 398w\"\n            sizes=\"(max-width: 398px) 100vw, 398px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/3f7035306caee189c0bfa3f0bb84d846/692d4/image-20251214133135610.png\"\n            alt=\"image-20251214133135610\"\n            title=\"image-20251214133135610\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<p>Reference: <a href=\"https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Aho–Corasick algorithm - Wikipedia</a></p>\n<h3 id=\"implementing-aho-corasick\" style=\"position:relative;\"><a href=\"#implementing-aho-corasick\" aria-label=\"implementing aho corasick permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Implementing 
Aho-Corasick</h3>\n<p>Based on the Trie implementation above, I will build an automaton and try implementing the Aho-Corasick string-search algorithm.</p>\n<p>First, as before, I define the Trie node with Failure Links added as the <code class=\"language-text\">ACNode</code> class.</p>\n<p>The string list <code class=\"language-text\">outputs</code> stores the keywords that are recognized as matched when that node is reached.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token decorator annotation punctuation\">@dataclass</span>\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">ACNode</span><span class=\"token punctuation\">:</span>\n    children<span class=\"token punctuation\">:</span> Dict<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"ACNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">dict</span><span class=\"token punctuation\">)</span>\n    outputs<span class=\"token punctuation\">:</span> List<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">list</span><span class=\"token punctuation\">)</span> <span class=\"token comment\"># matched keywords list</span>\n    failure_link<span class=\"token punctuation\">:</span> Optional<span class=\"token punctuation\">[</span><span class=\"token string\">\"ACNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token boolean\">None</span> <span 
class=\"token comment\"># ACNode | None</span></code></pre></div>\n<p>Next, I define the <code class=\"language-text\">AhoCorasick</code> class, which creates the automaton and performs searches with the Aho-Corasick method.</p>\n<p>At initialization time, it sets an empty node at the Root just as before. (The Root’s <code class=\"language-text\">failure_link</code> is set to the Root itself.)</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">class</span> <span class=\"token class-name\">AhoCorasick</span><span class=\"token punctuation\">:</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> ACNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n        self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root</code></pre></div>\n<p>Next, we define the <code class=\"language-text\">insert</code> method that creates a Trie from predefined keywords as shown below.</p>\n<p>The implementation is almost the same as the earlier <code class=\"language-text\">insert</code> method; the only difference is that, instead of setting <code class=\"language-text\">is_end</code>, it appends the keyword to the terminal node’s <code class=\"language-text\">outputs</code>.</p>\n<div class=\"gatsby-highlight\" data-language=\"\\tpython\"><pre class=\"language-\\tpython\"><code class=\"language-\\tpython\">def insert(self, keyword: str) -&gt; None:\n    node = self.root\n    for ch in keyword:\n        if ch not in node.children:\n            node.children[ch] = ACNode()\n        node = node.children[ch]\n    node.outputs.append(keyword)</code></pre></div>\n<p>Once the Trie has been created, we next implement the <code class=\"language-text\">build</code> method, which constructs the Aho-Corasick finite automaton from it.</p>\n<p>To create the Aho-Corasick finite automaton, we traverse the Trie with BFS and add the required information to each node.</p>\n<p>First, define <code class=\"language-text\">q</code>, a <code class=\"language-text\">deque</code> (imported from <code class=\"language-text\">collections</code>) that stores <code class=\"language-text\">ACNode</code> instances; then, while adding the nodes directly under the Root to <code class=\"language-text\">q</code>, set each of their <code class=\"language-text\">failure_link</code> values to the Root.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\">q<span class=\"token punctuation\">:</span> deque<span class=\"token punctuation\">[</span>ACNode<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> deque<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n<span class=\"token keyword\">for</span> child <span class=\"token keyword\">in</span> self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>values<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    child<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n    q<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>child<span class=\"token punctuation\">)</span></code></pre></div>\n<p>Next, we take nodes out of the queue with <code class=\"language-text\">popleft</code> and examine the dictionary keys pointing to child 
nodes together with the <code class=\"language-text\">ACNode</code> instances.</p>\n<p>Here, for each child node, the code follows the chain of <code class=\"language-text\">failure_link</code> values starting from the parent node until it reaches a node that has a child for the same character, and then assigns that child (or the Root, if no such node exists) to the child node’s <code class=\"language-text\">failure_link</code>.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">while</span> q<span class=\"token punctuation\">:</span>\n    cursor <span class=\"token operator\">=</span> q<span class=\"token punctuation\">.</span>popleft<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token keyword\">for</span> ch<span class=\"token punctuation\">,</span> next_item <span class=\"token keyword\">in</span> cursor<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>items<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        q<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>next_item<span class=\"token punctuation\">)</span>\n\n        f <span class=\"token operator\">=</span> cursor<span class=\"token punctuation\">.</span>failure_link\n        <span class=\"token keyword\">while</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">and</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> self<span class=\"token punctuation\">.</span>root <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n            f <span 
class=\"token operator\">=</span> f<span class=\"token punctuation\">.</span>failure_link\n\n        <span class=\"token keyword\">if</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n            next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span>\n        <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n            next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n        <span class=\"token keyword\">if</span> next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">and</span> next_item<span class=\"token punctuation\">.</span>failure_link<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">:</span>\n            next_item<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span>next_item<span class=\"token punctuation\">.</span>failure_link<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span></code></pre></div>\n<p>Finally, the <code class=\"language-text\">extend</code> method is used to concatenate the <code class=\"language-text\">failure_link</code> outputs into <code class=\"language-text\">outputs</code>, which is a <code class=\"language-text\">List[str]</code>.</p>\n<p>The reason the node’s <code class=\"language-text\">outputs</code> is extended with the <code 
class=\"language-text\">failure_link</code> outputs is that the failure destination represents a suffix of the string leading to that node, so every keyword that ends at the failure destination has also been matched whenever that node is reached.</p>\n<p>Reference: <a href=\"https://qiita.com/michi1750/items/c499f9ae8c6a1982caa4\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">【Python】appendとextendでよく間違えるところ #Python3 - Qiita</a></p>\n<p>Now that the finite automaton is complete, we finally implement the <code class=\"language-text\">search</code> method, which searches the received text for the predefined keywords in O(N) time, where N is the length of the text.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">def</span> <span class=\"token function\">search</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n    res<span class=\"token punctuation\">:</span> List<span class=\"token punctuation\">[</span>Tuple<span class=\"token punctuation\">[</span><span class=\"token builtin\">int</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">int</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n    node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n    <span class=\"token keyword\">for</span> i<span class=\"token punctuation\">,</span> ch <span class=\"token keyword\">in</span> <span class=\"token 
builtin\">enumerate</span><span class=\"token punctuation\">(</span>text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">while</span> node <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> self<span class=\"token punctuation\">.</span>root <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n            node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">if</span> node<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">else</span> self<span class=\"token punctuation\">.</span>root\n\n        node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span>ch<span class=\"token punctuation\">,</span> self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">)</span>\n\n        <span class=\"token keyword\">if</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>node<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span> <span class=\"token operator\">></span> <span class=\"token number\">0</span><span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">,</span> node<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span></code></pre></div>\n<p>When you run the above, it displays which keywords 
matched at the position of the matched terminal character.</p>\n<p>(There are also cases where two keywords match at the same position, such as <code class=\"language-text\">\"she\"</code> and <code class=\"language-text\">\"he\"</code>.)</p>\n<p>The final implementation looks like this.</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">from</span> __future__ <span class=\"token keyword\">import</span> annotations\n\n<span class=\"token keyword\">from</span> dataclasses <span class=\"token keyword\">import</span> dataclass<span class=\"token punctuation\">,</span> field\n<span class=\"token keyword\">from</span> collections <span class=\"token keyword\">import</span> deque\n<span class=\"token keyword\">from</span> typing <span class=\"token keyword\">import</span> Dict<span class=\"token punctuation\">,</span> List<span class=\"token punctuation\">,</span> Optional<span class=\"token punctuation\">,</span> Tuple\n\n\n<span class=\"token decorator annotation punctuation\">@dataclass</span>\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">ACNode</span><span class=\"token punctuation\">:</span>\n    children<span class=\"token punctuation\">:</span> Dict<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"ACNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">dict</span><span class=\"token punctuation\">)</span>\n    outputs<span class=\"token punctuation\">:</span> List<span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> field<span class=\"token 
punctuation\">(</span>default_factory<span class=\"token operator\">=</span><span class=\"token builtin\">list</span><span class=\"token punctuation\">)</span> <span class=\"token comment\"># matched keywords list</span>\n    failure_link<span class=\"token punctuation\">:</span> Optional<span class=\"token punctuation\">[</span><span class=\"token string\">\"ACNode\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token boolean\">None</span> <span class=\"token comment\"># ACNode | None</span>\n\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">AhoCorasick</span><span class=\"token punctuation\">:</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> ACNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n        self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">insert</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> keyword<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        node <span class=\"token operator\">=</span> self<span class=\"token 
punctuation\">.</span>root\n        <span class=\"token keyword\">for</span> ch <span class=\"token keyword\">in</span> keyword<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">if</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n                node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> ACNode<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n            node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span>\n        node<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span>\n\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">build</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        q<span class=\"token punctuation\">:</span> deque<span class=\"token punctuation\">[</span>ACNode<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> deque<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n        <span class=\"token keyword\">for</span> child <span class=\"token keyword\">in</span> self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>values<span class=\"token 
punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n            child<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n            q<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>child<span class=\"token punctuation\">)</span>\n\n        <span class=\"token keyword\">while</span> q<span class=\"token punctuation\">:</span>\n            cursor <span class=\"token operator\">=</span> q<span class=\"token punctuation\">.</span>popleft<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n            <span class=\"token keyword\">for</span> ch<span class=\"token punctuation\">,</span> next_item <span class=\"token keyword\">in</span> cursor<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>items<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                q<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>next_item<span class=\"token punctuation\">)</span>\n\n                f <span class=\"token operator\">=</span> cursor<span class=\"token punctuation\">.</span>failure_link\n                <span class=\"token keyword\">while</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">and</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> self<span class=\"token punctuation\">.</span>root <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n                    f <span 
class=\"token operator\">=</span> f<span class=\"token punctuation\">.</span>failure_link\n\n                <span class=\"token keyword\">if</span> f <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n                    next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> f<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">[</span>ch<span class=\"token punctuation\">]</span>\n                <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                    next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n                <span class=\"token keyword\">if</span> next_item<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">and</span> next_item<span class=\"token punctuation\">.</span>failure_link<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">:</span>\n                    next_item<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span>next_item<span class=\"token punctuation\">.</span>failure_link<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span>\n\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">search</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span 
class=\"token operator\">></span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">:</span>\n        res<span class=\"token punctuation\">:</span> List<span class=\"token punctuation\">[</span>Tuple<span class=\"token punctuation\">[</span><span class=\"token builtin\">int</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">int</span><span class=\"token punctuation\">,</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n        node <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n        <span class=\"token keyword\">for</span> i<span class=\"token punctuation\">,</span> ch <span class=\"token keyword\">in</span> <span class=\"token builtin\">enumerate</span><span class=\"token punctuation\">(</span>text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">while</span> node <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> self<span class=\"token punctuation\">.</span>root <span class=\"token keyword\">and</span> ch <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">:</span>\n                node <span class=\"token operator\">=</span> node<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">if</span> node<span class=\"token punctuation\">.</span>failure_link <span class=\"token keyword\">is</span> <span class=\"token keyword\">not</span> <span class=\"token boolean\">None</span> <span class=\"token keyword\">else</span> self<span class=\"token punctuation\">.</span>root\n\n            node <span class=\"token 
operator\">=</span> node<span class=\"token punctuation\">.</span>children<span class=\"token punctuation\">.</span>get<span class=\"token punctuation\">(</span>ch<span class=\"token punctuation\">,</span> self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">)</span>\n\n            <span class=\"token keyword\">if</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>node<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span> <span class=\"token operator\">></span> <span class=\"token number\">0</span><span class=\"token punctuation\">:</span>\n                <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">,</span> node<span class=\"token punctuation\">.</span>outputs<span class=\"token punctuation\">)</span>\n\n\n<span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n    ac <span class=\"token operator\">=</span> AhoCorasick<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">for</span> w <span class=\"token keyword\">in</span> <span class=\"token punctuation\">[</span><span class=\"token string\">\"he\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"she\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"his\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"hers\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"I\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"my\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"ME\"</span><span class=\"token punctuation\">,</span> <span class=\"token 
string\">\"mine\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n        ac<span class=\"token punctuation\">.</span>insert<span class=\"token punctuation\">(</span>w<span class=\"token punctuation\">)</span>\n    ac<span class=\"token punctuation\">.</span>build<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    text <span class=\"token operator\">=</span> <span class=\"token string\">\"asdfahishersIadfsamaMEandOhers-mefsadfasmines\"</span>\n    ac<span class=\"token punctuation\">.</span>search<span class=\"token punctuation\">(</span>text<span class=\"token punctuation\">)</span></code></pre></div>\n<p>When you run this, you can get the following result.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 632px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/c6e147d69e26390337ec209a04b9e422/084e2/image-20251216221634682.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 44.583333333333336%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAAAsTAAALEwEAmpwYAAABE0lEQVQoz52S2W6DMBBF+Q/A2GwJQRhjvLCEqE+Vqvap6v//yu1gtZES9QH14coz8ujozhLV2wvEq4X4WMA/N/CvG9ibR/ruUGwOgzKYpgnX9QprHcqyxLKsGMeRpOGcp38PQ7mUEpHIcyg9QO8FdkTTtWCcgwsBlmVgLAOnPKM4Czm7x38pKssKeiDgoGGNRS5yGG2gehlAaZIgTdPDivKiCLZ72WOlthTB67oKTnbtRb/xEQWgdw6y66CUQlVVd8jzewhYBIc0fGp5nmc0zfkB8C+Hxoyh5X2Gl/YShvsMOgoOS1nImewkOVygjcH5VD+0G8fxYWhU0F1Za9DRDL2foLXG6Qe4S9D5tG2LhLZ9xOE3rtYAkbDmYVgAAAAASUVORK5CYII='); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              
srcset=\"/static/c6e147d69e26390337ec209a04b9e422/8ac56/image-20251216221634682.webp 240w,\n/static/c6e147d69e26390337ec209a04b9e422/d3be9/image-20251216221634682.webp 480w,\n/static/c6e147d69e26390337ec209a04b9e422/59680/image-20251216221634682.webp 632w\"\n              sizes=\"(max-width: 632px) 100vw, 632px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/c6e147d69e26390337ec209a04b9e422/8ff5a/image-20251216221634682.png 240w,\n/static/c6e147d69e26390337ec209a04b9e422/e85cb/image-20251216221634682.png 480w,\n/static/c6e147d69e26390337ec209a04b9e422/084e2/image-20251216221634682.png 632w\"\n            sizes=\"(max-width: 632px) 100vw, 632px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/c6e147d69e26390337ec209a04b9e422/084e2/image-20251216221634682.png\"\n            alt=\"image-20251216221634682\"\n            title=\"image-20251216221634682\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<h2 id=\"aho-corasick-in-clamav\" style=\"position:relative;\"><a href=\"#aho-corasick-in-clamav\" aria-label=\"aho corasick in clamav permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Aho-Corasick in ClamAV</h2>\n<p>Now that we have an overview of the Aho–Corasick algorithm, let’s briefly look at how scanning 
based on it is performed in ClamAV.</p>\n<p>In ClamAV, Aho–Corasick (AC) scanning is performed by the <code class=\"language-text\">cli_ac_scanbuff</code> function.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\"><span class=\"token class-name\">cl_error_t</span> <span class=\"token function\">cli_ac_scanbuff</span><span class=\"token punctuation\">(</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">unsigned</span> <span class=\"token keyword\">char</span> <span class=\"token operator\">*</span>buffer<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">uint32_t</span> length<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">char</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>virname<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">void</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>customdata<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_result</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>res<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">const</span> <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_matcher</span> <span class=\"token operator\">*</span>root<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_data</span> <span class=\"token operator\">*</span>mdata<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">uint32_t</span> offset<span class=\"token punctuation\">,</span>\n    <span class=\"token class-name\">cli_file_t</span> ftype<span class=\"token 
punctuation\">,</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_matched_type</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>ftoffset<span class=\"token punctuation\">,</span>\n    <span class=\"token keyword\">unsigned</span> <span class=\"token keyword\">int</span> mode<span class=\"token punctuation\">,</span>\n    cli_ctx <span class=\"token operator\">*</span>ctx<span class=\"token punctuation\">)</span></code></pre></div>\n<p>The <code class=\"language-text\">buffer</code> argument passed to this function holds the data to be scanned, such as the Eicar text.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/a240acf52e99455fb9a257df19b89c2a/b41cb/image-20251217204907446.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 3.3333333333333335%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAABCAYAAADeko4lAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAQ0lEQVQI1z2LOw6AMAzFmFGHMhVF+Ue5/xEf0IHB8mD5ICLca2FMxjn9JTGugAiju+HuMLNtZoaqoqp+Z+ZuEd8jeAAJKBqpbn13jAAAAABJRU5ErkJggg=='); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/a240acf52e99455fb9a257df19b89c2a/8ac56/image-20251217204907446.webp 240w,\n/static/a240acf52e99455fb9a257df19b89c2a/d3be9/image-20251217204907446.webp 480w,\n/static/a240acf52e99455fb9a257df19b89c2a/e46b2/image-20251217204907446.webp 960w,\n/static/a240acf52e99455fb9a257df19b89c2a/f992d/image-20251217204907446.webp 1440w,\n/static/a240acf52e99455fb9a257df19b89c2a/e0991/image-20251217204907446.webp 1661w\"\n              
sizes=\"(max-width: 960px) 100vw, 960px\"\n              type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/a240acf52e99455fb9a257df19b89c2a/8ff5a/image-20251217204907446.png 240w,\n/static/a240acf52e99455fb9a257df19b89c2a/e85cb/image-20251217204907446.png 480w,\n/static/a240acf52e99455fb9a257df19b89c2a/d9199/image-20251217204907446.png 960w,\n/static/a240acf52e99455fb9a257df19b89c2a/07a9c/image-20251217204907446.png 1440w,\n/static/a240acf52e99455fb9a257df19b89c2a/b41cb/image-20251217204907446.png 1661w\"\n            sizes=\"(max-width: 960px) 100vw, 960px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/a240acf52e99455fb9a257df19b89c2a/d9199/image-20251217204907446.png\"\n            alt=\"image-20251217204907446\"\n            title=\"image-20251217204907446\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<p>The nodes of the AC automaton are defined by the <code class=\"language-text\">cli_ac_node</code> structure, shown here together with the related <code class=\"language-text\">cli_ac_patt</code> and <code class=\"language-text\">cli_ac_list</code> structures.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\"><span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_patt</span> <span class=\"token punctuation\">{</span>\n    <span class=\"token class-name\">uint16_t</span> <span class=\"token operator\">*</span>pattern<span class=\"token punctuation\">,</span> <span class=\"token operator\">*</span>prefix<span class=\"token punctuation\">,</span> length<span class=\"token punctuation\">[</span><span class=\"token number\">3</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> prefix_length<span class=\"token punctuation\">[</span><span class=\"token number\">3</span><span class=\"token punctuation\">]</span><span 
class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint32_t</span> mindist<span class=\"token punctuation\">,</span> maxdist<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint32_t</span> sigid<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint32_t</span> lsigid<span class=\"token punctuation\">[</span><span class=\"token number\">3</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint16_t</span> ch<span class=\"token punctuation\">[</span><span class=\"token number\">2</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">char</span> <span class=\"token operator\">*</span>virname<span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">void</span> <span class=\"token operator\">*</span>customdata<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint16_t</span> ch_mindist<span class=\"token punctuation\">[</span><span class=\"token number\">2</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint16_t</span> ch_maxdist<span class=\"token punctuation\">[</span><span class=\"token number\">2</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint16_t</span> parts<span class=\"token punctuation\">,</span> partno<span class=\"token punctuation\">,</span> special<span class=\"token punctuation\">,</span> special_pattern<span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_special</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>special_table<span class=\"token punctuation\">;</span>\n    <span class=\"token 
class-name\">uint16_t</span> rtype<span class=\"token punctuation\">,</span> type<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint32_t</span> offdata<span class=\"token punctuation\">[</span><span class=\"token number\">4</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> offset_min<span class=\"token punctuation\">,</span> offset_max<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint32_t</span> boundary<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint8_t</span> depth<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint8_t</span> sigopts<span class=\"token punctuation\">;</span>\n<span class=\"token punctuation\">}</span><span class=\"token punctuation\">;</span>\n\n<span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_list</span> <span class=\"token punctuation\">{</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_patt</span> <span class=\"token operator\">*</span>me<span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">union</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_node</span> <span class=\"token operator\">*</span>node<span class=\"token punctuation\">;</span>\n        <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_list</span> <span class=\"token operator\">*</span>next<span class=\"token punctuation\">;</span>\n    <span class=\"token punctuation\">}</span><span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_list</span> <span class=\"token operator\">*</span>next_same<span class=\"token punctuation\">;</span>\n<span class=\"token punctuation\">}</span><span class=\"token 
punctuation\">;</span>\n\n<span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_node</span> <span class=\"token punctuation\">{</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_list</span> <span class=\"token operator\">*</span>list<span class=\"token punctuation\">;</span>\n    <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_node</span> <span class=\"token operator\">*</span><span class=\"token operator\">*</span>trans<span class=\"token punctuation\">,</span> <span class=\"token operator\">*</span>fail<span class=\"token punctuation\">;</span>\n<span class=\"token punctuation\">}</span><span class=\"token punctuation\">;</span></code></pre></div>\n<p>The scan starts at the root node, <code class=\"language-text\">root->ac_root</code>, and for each input byte <code class=\"language-text\">buffer[i]</code> follows the corresponding entry of <code class=\"language-text\">trans</code> (the array of next nodes) to the next node. When it reaches a final node, it begins the pattern-matching process.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\">current <span class=\"token operator\">=</span> root<span class=\"token operator\">-></span>ac_root<span class=\"token punctuation\">;</span>\n\n<span class=\"token keyword\">for</span> <span class=\"token punctuation\">(</span>i <span class=\"token operator\">=</span> <span class=\"token number\">0</span><span class=\"token punctuation\">;</span> i <span class=\"token operator\">&lt;</span> length<span class=\"token punctuation\">;</span> i<span class=\"token operator\">++</span><span class=\"token punctuation\">)</span> <span class=\"token punctuation\">{</span>\n    current <span class=\"token operator\">=</span> current<span class=\"token operator\">-></span>trans<span class=\"token punctuation\">[</span>buffer<span class=\"token 
punctuation\">[</span>i<span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">;</span>\n\n    <span class=\"token keyword\">if</span> <span class=\"token punctuation\">(</span><span class=\"token function\">UNLIKELY</span><span class=\"token punctuation\">(</span><span class=\"token function\">IS_FINAL</span><span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_list</span> <span class=\"token operator\">*</span>faillist <span class=\"token operator\">=</span> current<span class=\"token operator\">-></span>fail<span class=\"token operator\">-></span>list<span class=\"token punctuation\">;</span>\n        pattN <span class=\"token operator\">=</span> current<span class=\"token operator\">-></span>list<span class=\"token punctuation\">;</span>\n        <span class=\"token comment\">/* omitted */</span>\n    <span class=\"token punctuation\">}</span>\n<span class=\"token punctuation\">}</span></code></pre></div>\n<p>Various checks are performed here, but the part I want to focus on is the <code class=\"language-text\">ac_forward_match_branch</code> function, which is called in turn from the <code class=\"language-text\">ac_findmatch</code> function.</p>\n<div class=\"gatsby-highlight\" data-language=\"c\"><pre class=\"language-c\"><code class=\"language-c\"><span class=\"token comment\">/* state should reset on call, recursion depth = number of alternate specials */</span>\n<span class=\"token comment\">/* each loop iteration starts on the NEXT sequence to validate */</span>\n<span class=\"token keyword\">static</span> <span class=\"token keyword\">int</span> <span class=\"token 
function\">ac_forward_match_branch</span><span class=\"token punctuation\">(</span><span class=\"token keyword\">const</span> <span class=\"token keyword\">unsigned</span> <span class=\"token keyword\">char</span> <span class=\"token operator\">*</span>buffer<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> bp<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> offset<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> fileoffset<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> length<span class=\"token punctuation\">,</span>\n                                   <span class=\"token keyword\">const</span> <span class=\"token keyword\">struct</span> <span class=\"token class-name\">cli_ac_patt</span> <span class=\"token operator\">*</span>pattern<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> pp<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint16_t</span> specialcnt<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> <span class=\"token operator\">*</span>start<span class=\"token punctuation\">,</span> <span class=\"token class-name\">uint32_t</span> <span class=\"token operator\">*</span>end<span class=\"token punctuation\">)</span>\n<span class=\"token punctuation\">{</span>\n    <span class=\"token keyword\">int</span> match<span class=\"token punctuation\">;</span>\n    <span class=\"token class-name\">uint16_t</span> wc<span class=\"token punctuation\">,</span> i<span class=\"token punctuation\">;</span>\n\n    match <span class=\"token operator\">=</span> <span class=\"token number\">1</span><span class=\"token punctuation\">;</span>\n\n    <span class=\"token comment\">/* forward (pattern) validation; determines end */</span>\n    <span class=\"token keyword\">for</span> <span class=\"token 
punctuation\">(</span>i <span class=\"token operator\">=</span> pp<span class=\"token punctuation\">;</span> i <span class=\"token operator\">&lt;</span> pattern<span class=\"token operator\">-></span>length<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">&amp;&amp;</span> bp <span class=\"token operator\">&lt;</span> length<span class=\"token punctuation\">;</span> i<span class=\"token operator\">++</span><span class=\"token punctuation\">)</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token function\">AC_MATCH_CHAR</span><span class=\"token punctuation\">(</span>pattern<span class=\"token operator\">-></span>pattern<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> buffer<span class=\"token punctuation\">[</span>bp<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span> <span class=\"token number\">0</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">;</span>\n        <span class=\"token keyword\">if</span> <span class=\"token punctuation\">(</span><span class=\"token operator\">!</span>match<span class=\"token punctuation\">)</span>\n            <span class=\"token keyword\">return</span> <span class=\"token number\">0</span><span class=\"token punctuation\">;</span>\n\n        bp<span class=\"token operator\">++</span><span class=\"token punctuation\">;</span>\n    <span class=\"token punctuation\">}</span>\n    <span class=\"token operator\">*</span>end <span class=\"token operator\">=</span> bp<span class=\"token punctuation\">;</span></code></pre></div>\n<p>The loop above is where the scanned <code class=\"language-text\">buffer</code> and the corresponding signature pattern are finally compared: <code class=\"language-text\">AC_MATCH_CHAR</code> checks one pattern element against one buffer byte per iteration, and the function returns 0 at the first mismatch.</p>\n<p>If you actually debug an Eicar scan, you can confirm that at this point the 
signature pattern in <code class=\"language-text\">pattern->pattern</code> and the scan data in <code class=\"language-text\">buffer</code> are being compared one byte at a time.</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/a0641cbbd514ae1d8ded0a70d07a95a6/c0786/image-20251217210820193.png\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 42.083333333333336%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAYAAAD5nd/tAAAACXBIWXMAAAsTAAALEwEAmpwYAAABkUlEQVQoz0VSSXLcMBDTIXZK0kjcJW7al/F47EMqdjk+pfL/RyEtKssBhSbZBAGSWTtEaDtCmghrDWKUcF5C+QGi8ZCuJw5QcYUwDipMELaDPGAsRBupd4TQDZg0yIx9wqP8wFf5HY2bcB0fsE8lusGhCR3UIUibVZzBhURtPLhuUXMJplowoYibk7VF1nRPqPwbak+CYcW8B0xLgOlm6OkF7fycnLAmgh8bDxGqmdRg5JhTisMZP3pOwTuK5hOF/YQjF9eVIywbzHBLUY7I0sbkSlCdXJJwXVcoixxlWRKKf5yF4Ru+qF94UD/R+hX3Ncc8KvS9gg0etdAp0iHIyVVVMwI/mauTmaArEKiqCll7Cxi3F/TTHV0/0cPQ6aKFlAo+BFgf0Dp/wjp6MJr7UxvrUy1NQ6Icl8sFWelewdwb1u2K675hWXYwOi3EnuoFgUSNMeS4x/PtBsYY/YSY1sZxoNgFCf2PneXqHY/qB3x3NuzbBiEEvPc0HsmxhdY6jfd9T7Gcc2mt6zrkeZ6c/cVv0p/LfsjWHUkAAAAASUVORK5CYII='); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/a0641cbbd514ae1d8ded0a70d07a95a6/8ac56/image-20251217210820193.webp 240w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/d3be9/image-20251217210820193.webp 480w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/e46b2/image-20251217210820193.webp 960w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/f992d/image-20251217210820193.webp 1440w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/06e3f/image-20251217210820193.webp 1873w\"\n              sizes=\"(max-width: 960px) 100vw, 960px\"\n    
          type=\"image/webp\"\n            />\n          <source\n            srcset=\"/static/a0641cbbd514ae1d8ded0a70d07a95a6/8ff5a/image-20251217210820193.png 240w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/e85cb/image-20251217210820193.png 480w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/d9199/image-20251217210820193.png 960w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/07a9c/image-20251217210820193.png 1440w,\n/static/a0641cbbd514ae1d8ded0a70d07a95a6/c0786/image-20251217210820193.png 1873w\"\n            sizes=\"(max-width: 960px) 100vw, 960px\"\n            type=\"image/png\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/a0641cbbd514ae1d8ded0a70d07a95a6/d9199/image-20251217210820193.png\"\n            alt=\"image-20251217210820193\"\n            title=\"image-20251217210820193\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<h2 id=\"summary\" style=\"position:relative;\"><a href=\"#summary\" aria-label=\"summary permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Summary</h2>\n<p>While tracing ClamAV’s scan logic, I ended up reaching the implementation of an information retrieval algorithm.</p>\n<p>It was worthwhile because it deepened my understanding of how AV can scan large amounts of data at high 
speed.</p>","fields":{"slug":"/clamav-scan-fmap-en","tagSlugs":["/tag/clam-av-en/","/tag/malware-en/","/tag/linux-en/","/tag/english/"]},"frontmatter":{"date":"2025-12-17","description":"Using ClamAV as a reference, this article summarizes the Aho–Corasick algorithm, an information retrieval algorithm that underpins AntiVirus software.","tags":["ClamAV (en)","Malware (en)","Linux (en)","English"],"title":"Information Retrieval Algorithms Behind AntiVirus, Part 1 - The Aho–Corasick Algorithm","socialImage":{"publicURL":"/static/62f636075fbd5a68a816ecc949e6994b/clamav-scan-fmap.png"}}}},"pageContext":{"slug":"/clamav-scan-fmap-en"}},"staticQueryHashes":["251939775","401334301","825871152"]}