Notes on How clamscan Works Before It Performs a File Scan (unfinished)

This page has been machine-translated from the original page.

I have been reading through the ClamAV codebase with no particular goal in mind. These notes cover clamscan, the command-line scanner of the open-source antivirus software ClamAV, and what it does before it performs a file scan.

Previously, in the article below, I summarized what I found when checking how clamdscan detects the Eicar file. This time, I wanted to look into how libclamav is initialized and how signatures are loaded, so I decided to use clamscan, a one-shot scanner that does everything within a single process.

Reference: Notes on tracing the scan behavior until ClamAV detects the Eicar test file - かえるのひみつきち

Overview of clamscan

The clamscan tool is a command-line tool for scanning files by using libclamav.

Unlike clamdscan, clamscan can perform scans even when the clamd daemon is not running on the system.

Because no daemon is keeping the signatures in memory, clamscan loads the signature database each time it runs, so there is a short delay between invoking the command and the start of the actual scan.

Reference: Scanning - ClamAV Documentation

In this article, I want to understand how ClamAV malware scanning works by analyzing the implementation of clamscan.

clamscan Options

The clamscan tool in ClamAV 1.5.0, which I use here, provides many options as shown below.

$ clamscan --help

                       Clam AntiVirus: Scanner 1.5.0
           By The ClamAV Team: https://www.clamav.net/about.html#credits
           (C) 2025 Cisco Systems, Inc.

    clamscan [options] [file/directory/-]

    --help                -h             Show this help.
    --version             -V             Print version number.
    --verbose             -v             Be verbose.
    --archive-verbose     -a             Show filenames inside scanned archives.
    --debug                              Enable libclamav's debug messages.
    --quiet                              Only output error messages.
    --stdout                             Write to stdout instead of stderr. Does not affect 'debug' messages.
    --no-summary                         Disable summary at end of scanning.
    --infected            -i             Only print infected files.
    --suppress-ok-results -o             Skip printing OK files.
    --bell                               Sound bell on virus detection.

    --tempdir=DIRECTORY                  Create temporary files in DIRECTORY.
    --leave-temps[=yes/no(*)]            Do not remove temporary files.
    --force-to-disk[=yes/no(*)]          Create temporary files for nested file scans that would otherwise be in-memory only.
    --gen-json[=yes/no(*)]               Generate JSON metadata for the scanned file(s). For testing & development use ONLY.
                                         JSON will be printed if --debug is enabled.
                                         A JSON file will dropped to the temp directory if --leave-temps is enabled.
    --json-store-html-uris[=yes(*)/no]   Store html URIs in metadata.
                                         URIs will be written to the metadata.json file in an array called 'URIs'.
    --json-store-pdf-uris[=yes(*)/no]    Store pdf URIs in metadata.
                                         URIs will be written to the metadata.json file in an array called 'URIs'.
    --json-store-extra-hashes[=yes(*)/no] Store md5 and sha1 in addition to sha2-256 in metadata.
    --database=FILE/DIR   -d FILE/DIR    Load virus database from FILE or load all supported db files from DIR.
    --official-db-only[=yes/no(*)]       Only load official signatures.
    --fail-if-cvd-older-than=days        Return with a nonzero error code if virus database outdated.
    --log=FILE            -l FILE        Save scan report to FILE.
    --recursive[=yes/no(*)]  -r          Scan subdirectories recursively.
    --allmatch[=yes/no(*)]   -z          Continue scanning within file after finding a match.
    --cross-fs[=yes(*)/no]               Scan files and directories on other filesystems.
    --follow-dir-symlinks[=0/1(*)/2]     Follow directory symlinks (0 = never, 1 = direct, 2 = always).
    --follow-file-symlinks[=0/1(*)/2]    Follow file symlinks (0 = never, 1 = direct, 2 = always).
    --file-list=FILE      -f FILE        Scan files from FILE.
    --remove[=yes/no(*)]                 Remove infected files. Be careful!
    --move=DIRECTORY                     Move infected files into DIRECTORY.
    --copy=DIRECTORY                     Copy infected files into DIRECTORY.
    --exclude=REGEX                      Don't scan file names matching REGEX.
    --exclude-dir=REGEX                  Don't scan directories matching REGEX.
    --include=REGEX                      Only scan file names matching REGEX.
    --include-dir=REGEX                  Only scan directories matching REGEX.

    --bytecode[=yes(*)/no]               Load bytecode from the database.
    --bytecode-unsigned[=yes/no(*)]      Load unsigned bytecode.
                                         **Caution**: You should NEVER run bytecode signatures from untrusted sources.
                                         Doing so may result in arbitrary code execution.
    --bytecode-timeout=N                 Set bytecode timeout (in milliseconds).
    --statistics[=none(*)/bytecode/pcre] Collect and print execution statistics.
    --detect-pua[=yes/no(*)]             Detect Possibly Unwanted Applications.
    --exclude-pua=CAT                    Skip PUA sigs of category CAT.
    --include-pua=CAT                    Load PUA sigs of category CAT.
    --detect-structured[=yes/no(*)]      Detect structured data (SSN, Credit Card).
    --structured-ssn-format=X            SSN format (0=normal,1=stripped,2=both).
    --structured-ssn-count=N             Min SSN count to generate a detect.
    --structured-cc-count=N              Min CC count to generate a detect.
    --structured-cc-mode=X               CC mode (0=credit debit and private label, 1=credit cards only.
    --scan-mail[=yes(*)/no]              Scan mail files.
    --phishing-sigs[=yes(*)/no]          Enable email signature-based phishing detection.
    --phishing-scan-urls[=yes(*)/no]     Enable URL signature-based phishing detection.
    --heuristic-alerts[=yes(*)/no]       Heuristic alerts.
    --heuristic-scan-precedence[=yes/no(*)] Stop scanning as soon as a heuristic match is found.
    --normalize[=yes(*)/no]              Normalize html, script, and text files. Use normalize=no for yara compatibility.
    --scan-pe[=yes(*)/no]                Scan PE files.
    --scan-elf[=yes(*)/no]               Scan ELF files.
    --scan-ole2[=yes(*)/no]              Scan OLE2 containers.
    --scan-pdf[=yes(*)/no]               Scan PDF files.
    --scan-swf[=yes(*)/no]               Scan SWF files.
    --scan-html[=yes(*)/no]              Scan HTML files.
    --scan-xmldocs[=yes(*)/no]           Scan xml-based document files.
    --scan-hwp3[=yes(*)/no]              Scan HWP3 files.
    --scan-onenote[=yes(*)/no]           Scan OneNote files.
    --scan-archive[=yes(*)/no]           Scan archive files (supported by libclamav).
    --scan-image[=yes(*)/no]             Scan image (graphics) files.
    --scan-image-fuzzy-hash[=yes(*)/no]  Detect files by calculating image (graphics) fuzzy hashes.
    --alert-broken[=yes/no(*)]           Alert on broken executable files (PE & ELF).
    --alert-broken-media[=yes/no(*)]     Alert on broken graphics files (JPEG, TIFF, PNG, GIF).
    --alert-encrypted[=yes/no(*)]        Alert on encrypted archives and documents.
    --alert-encrypted-archive[=yes/no(*)] Alert on encrypted archives.
    --alert-encrypted-doc[=yes/no(*)]    Alert on encrypted documents.
    --alert-macros[=yes/no(*)]           Alert on OLE2 files containing VBA macros.
    --alert-exceeds-max[=yes/no(*)]      Alert on files that exceed max file size, max scan size, or max recursion limit.
    --alert-phishing-ssl[=yes/no(*)]     Alert on emails containing SSL mismatches in URLs.
    --alert-phishing-cloak[=yes/no(*)]   Alert on emails containing cloaked URLs.
    --alert-partition-intersection[=yes/no(*)] Alert on raw DMG image files containing partition intersections.
    --nocerts                            Disable authenticode certificate chain verification in PE files.
    --dumpcerts                          Dump authenticode certificate chain in PE files.

    --max-scantime=#n                    Scan time longer than this will be skipped and assumed clean (milliseconds).
    --max-filesize=#n                    Files larger than this will be skipped and assumed clean.
    --max-scansize=#n                    The maximum amount of data to scan for each container file (**).
    --max-files=#n                       The maximum number of files to scan for each container file (**).
    --max-recursion=#n                   Maximum archive recursion level for container file (**).
    --max-dir-recursion=#n               Maximum directory recursion level.
    --max-embeddedpe=#n                  Maximum size file to check for embedded PE.
    --max-htmlnormalize=#n               Maximum size of HTML file to normalize.
    --max-htmlnotags=#n                  Maximum size of normalized HTML file to scan.
    --max-scriptnormalize=#n             Maximum size of script file to normalize.
    --max-ziptypercg=#n                  Maximum size zip to type reanalyze.
    --max-partitions=#n                  Maximum number of partitions in disk image to be scanned.
    --max-iconspe=#n                     Maximum number of icons in PE file to be scanned.
    --max-rechwp3=#n                     Maximum recursive calls to HWP3 parsing function.
    --pcre-match-limit=#n                Maximum calls to the PCRE match function.
    --pcre-recmatch-limit=#n             Maximum recursive calls to the PCRE match function.
    --pcre-max-filesize=#n               Maximum size file to perform PCRE subsig matching.
    --disable-cache                      Disable caching and cache checks for hash sums of scanned files.
    --hash-hint                          The file hash so that libclamav does not need to calculate it.
                                         The type of hash must match the '--hash-alg'.
    --log-hash                           Print the file hash after each file scanned.
                                         The type of hash printed will match the '--hash-alg'.
    --hash-alg                           The hashing algorithm used for either '--hash-hint' or '--log-hash'.
                                         Supported algorithms are 'md5', 'sha1', 'sha2-256'.
                                         If not specified, the default is 'sha2-256'.
    --file-type-hint                     The file type hint so that libclamav can optimize scanning.
                                         E.g. 'pe', 'elf', 'zip', etc.
                                         You may also use ClamAV type names such as 'CL_TYPE_PE'.
                                         ClamAV will ignore the hint if it is not familiar with the specified type.
                                         See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
    --log-file-type                      Print the file type after each file scanned.
    --cvdcertsdir=DIRECTORY              Specify a directory containing the root
                                         CA cert needed to verify detached CVD digital signatures.
                                         If not provided, then clamscan will look in the default directory.
    --fips-limits                        Enforce FIPS-like limits on using hash algorithms for
                                         cryptographic purposes. Will disable MD5 & SHA1.
                                         FP sigs and will require '.sign' files to verify CVD
                                         authenticity.

Environment Variables:

    LD_LIBRARY_PATH                      May be used on startup to find the libclamunrar_iface
                                         shared library module to enable RAR archive support.
    CVD_CERTS_DIR                        Specify a directory containing the root CA cert needed
                                         to verify detached CVD digital signatures.
                                         If not provided, then clamscan will look in the default directory.

Pass in - as the filename for stdin.

(*) Default scan settings
(**) Certain files (e.g. documents, archives, etc.) may in turn contain other
   files inside. The above options ensure safe processing of this kind of data.

Detecting the Eicar File with clamscan

First, I will save the debug log showing how clamscan detects the Eicar test malware to a file with the command below.

Because clamscan’s debug information is output to stderr rather than stdout, I redirect stderr to stdout with 2>&1 and then pipe it onward.

(The --stdout option, which changes the normal output destination to stdout, does not affect debug messages.)

clamscan --debug --disable-cache --verbose --stdout eicar 2>&1 | tee logfile.txt

When debugging with gdb, you can use gdb --args as shown below.

gdb --args clamscan --debug --disable-cache --verbose --stdout eicar 2>&1

The debug log produced this way contained roughly 1,500 lines of information.

Also, because I did not use an option to disable the summary this time, the final output included a summary like the one below together with a message indicating that clamscan detected the Eicar file.

[Screenshot: clamscan scan summary with the Eicar detection message]

From here on, I will follow the sequence that leads up to clamscan detecting the Eicar file.

Calling the scanmanager Function

The very first debug messages that appeared were the following lines.

LibClamAV debug: searching for unrar, user-searchpath: /usr/local/lib
LibClamAV debug: unrar support loaded from /usr/local/lib/libclamunrar_iface.so.12.1.0
LibClamAV debug: Initialized 1.5.0 engine
LibClamAV debug: Initializing phishcheck module
LibClamAV debug: Phishcheck: Compiling regex: ^ *(http|https|ftp:(//)?)?[0-9]{1,3}(\.[0-9]{1,3}){3}[/?:]? *$
LibClamAV debug: Phishcheck module initialized
LibClamAV debug: Bytecode initialized in interpreter mode
LibClamAV debug: clean_cache_init: Caching disabled.
LibClamAV debug: clean_cache_init: Cache initialized successfully.
LibClamAV debug: Adding certificate to verifier store: X509 { serial_number: "0493F2B851C5D5BED1", signature_algorithm: sha512WithRSAEncryption, issuer: [organizationalUnitName = "Arbor", organizationName = "Cisco", commonName = "Cisco Software Identity Root CA RSA 4096 SHA512 2099"], subject: [organizationalUnitName = "Arbor", organizationName = "Cisco", commonName = "Cisco Software Identity Root CA RSA 4096 SHA512 2099"], not_before: Jan 24 18:45:25 2024 GMT, not_after: Jan 24 18:45:25 2099 GMT, public_key: PKey { algorithm: "RSA" } }
LibClamAV debug: Verifier created successfully

Judging from the message content, the output on the first line seems to come from the following location inside the load_module(const char *name, const char *featurename) function.

/*
* Search in "<prefix>/lib" checking with each of the different possible suffixes.
*/
cli_dbgmsg("searching for %s, user-searchpath: %s\n", featurename, SEARCH_LIBDIR);

Reference: others.c - libclamav - Cisco-Talos/clamav - Sourcegraph

It also appears that this function is called from the rarload function in the same source file.

[Screenshot: the rarload function calling load_module]

Furthermore, when I captured the call stack up to this point with the debugger, I confirmed that these libclamav functions are called from the scanmanager function in clamscan.

[Screenshot: call stack captured in the debugger]

I will not describe this entire sequence in detail here, but at a high level it seems to work as follows.

  • In clamscan's main function, the command-line arguments are stored in the opts structure, and individual options are then looked up and enabled through the optget function.
struct optstruct {
    char *name;
    char *cmd;
    char *strarg;
    long long numarg;
    int enabled;
    int active;
    int flags;
    int idx;
    struct optstruct *nextarg;
    struct optstruct *next;

    char **filename; /* cmdline */
};
  • The global info variable, of type struct s_info, is initialized with memset.
struct s_info {
    unsigned int sigs;         /* number of signatures */
    unsigned int dirs;         /* number of scanned directories */
    unsigned int files;        /* number of scanned files */
    unsigned int ifiles;       /* number of infected files */
    unsigned int errors;       /* number of errors */
    unsigned long int blocks;  /* number of *scanned* 16kb blocks */
    unsigned long int rblocks; /* number of *read* 16kb blocks */
};
  • The scanmanager function is called together with the command-line arguments.
  • The scanmanager function initializes libclamav, determines the scan target, and performs the scan.

In the flow above, the scanmanager function effectively serves as the main driver that requests file scanning in clamscan.

For that reason, once scanmanager returns its result, the clamscan execution also ends.

Reference: clamscan.c - clamscan - Cisco-Talos/clamav - Sourcegraph

Initializing libclamav

Retrieving Scan Options

When the scanmanager function is called, it first initializes the options variable of type cl_scan_options.

/* Initalize scan options struct */
memset(&options, 0, sizeof(struct cl_scan_options));

This structure stores the scan options used during scanning.

The cl_scan_options structure defines members such as general and heuristic, and it appears that options are changed by manipulating these bitmasks.

/*** scan options ***/
struct cl_scan_options {
    uint32_t general;
    uint32_t parse;
    uint32_t heuristic;
    uint32_t mail;
    uint32_t dev;
};

The first set of options, general, seems to be manipulated mainly by the following code.

/* general */
#define CL_SCAN_GENERAL_ALLMATCHES                  0x1  /* scan in all-match mode */
#define CL_SCAN_GENERAL_COLLECT_METADATA            0x2  /* collect metadata (--gen-json) */
#define CL_SCAN_GENERAL_HEURISTICS                  0x4  /* option to enable heuristic alerts */
#define CL_SCAN_GENERAL_HEURISTIC_PRECEDENCE        0x8  /* allow heuristic match to take precedence. */
#define CL_SCAN_GENERAL_UNPRIVILEGED                0x10 /* scanner will not have read access to files. */

/* set scan options */
if (optget(opts, "allmatch")->enabled) {
    options.general |= CL_SCAN_GENERAL_ALLMATCHES;
}

if (optget(opts, "heuristic-scan-precedence")->enabled)
    options.general |= CL_SCAN_GENERAL_HEURISTIC_PRECEDENCE;

/* TODO: Remove deprecated option in a future feature release */
if ((optget(opts, "algorithmic-detection")->enabled) && /* && used due to default-yes for both options */
    (optget(opts, "heuristic-alerts")->enabled)) {
    options.general |= CL_SCAN_GENERAL_HEURISTICS;
}

/* JSON check to prevent engine loading if specified without libjson-c  */
if (optget(opts, "gen-json")->enabled)
    options.general |= CL_SCAN_GENERAL_COLLECT_METADATA;

Among the options above, heuristic-scan-precedence specifies that scanning should stop immediately when a heuristic match is found.

Also, the gen-json option is intended for testing and development, and instructs clamscan to generate JSON-formatted information including metadata for the scanned file.

When this option is used in debug mode, JSON-formatted information like the following seems to be displayed.

[Screenshot: JSON metadata printed by --gen-json in debug mode]

The second group, the parse option, appears to be a set of flags specifying which file types the engine should parse.

/* parsing capabilities options */
#define CL_SCAN_PARSE_ARCHIVE                       0x1
#define CL_SCAN_PARSE_ELF                           0x2
#define CL_SCAN_PARSE_PDF                           0x4
#define CL_SCAN_PARSE_SWF                           0x8
#define CL_SCAN_PARSE_HWP3                          0x10
#define CL_SCAN_PARSE_XMLDOCS                       0x20
#define CL_SCAN_PARSE_MAIL                          0x40
#define CL_SCAN_PARSE_OLE2                          0x80
#define CL_SCAN_PARSE_HTML                          0x100
#define CL_SCAN_PARSE_PE                            0x200

The heuristic option then enables several heuristic alerts.

/* heuristic alerting options */
#define CL_SCAN_HEURISTIC_BROKEN                    0x2    /* alert on broken PE and broken ELF files */
#define CL_SCAN_HEURISTIC_EXCEEDS_MAX               0x4    /* alert when files exceed scan limits (filesize, max scansize, or max recursion depth) */
#define CL_SCAN_HEURISTIC_PHISHING_SSL_MISMATCH     0x8    /* alert on SSL mismatches */
#define CL_SCAN_HEURISTIC_PHISHING_CLOAK            0x10   /* alert on cloaked URLs in emails */
#define CL_SCAN_HEURISTIC_MACROS                    0x20   /* alert on OLE2 files containing macros */
#define CL_SCAN_HEURISTIC_ENCRYPTED_ARCHIVE         0x40   /* alert if archive is encrypted (rar, zip, etc) */
#define CL_SCAN_HEURISTIC_ENCRYPTED_DOC             0x80   /* alert if a document is encrypted (pdf, docx, etc) */
#define CL_SCAN_HEURISTIC_PARTITION_INTXN           0x100  /* alert if partition table size doesn't make sense */
#define CL_SCAN_HEURISTIC_STRUCTURED                0x200  /* data loss prevention options, i.e. alert when detecting personal information */
#define CL_SCAN_HEURISTIC_STRUCTURED_SSN_NORMAL     0x400  /* alert when detecting social security numbers */
#define CL_SCAN_HEURISTIC_STRUCTURED_SSN_STRIPPED   0x800  /* alert when detecting stripped social security numbers */
#define CL_SCAN_HEURISTIC_STRUCTURED_CC             0x1000 /* alert when detecting credit card numbers */
#define CL_SCAN_HEURISTIC_BROKEN_MEDIA              0x2000 /* alert if a file does not match the identified file format, works with JPEG, TIFF, GIF, PNG */

And the mail and dev options enable the following capabilities, respectively.

/* mail scanning options */
#define CL_SCAN_MAIL_PARTIAL_MESSAGE                0x1

/* dev options */
#define CL_SCAN_DEV_COLLECT_SHA                     0x1 /* Enables hash output in sha-collect builds - for internal use only */
#define CL_SCAN_DEV_COLLECT_PERFORMANCE_INFO        0x2 /* collect performance timings */

Configuring dboptions

Inside the scanmanager function, a bitmask is also configured in dboptions, which is initialized with unsigned int dboptions = 0.

The flags that can be set in dboptions are defined as follows.

/* db options */
// clang-format off
#define CL_DB_PHISHING          0x2
#define CL_DB_PHISHING_URLS     0x8
#define CL_DB_PUA               0x10
#define CL_DB_CVDNOTMP          0x20    /* obsolete */
#define CL_DB_OFFICIAL          0x40    /* internal */
#define CL_DB_PUA_MODE          0x80
#define CL_DB_PUA_INCLUDE       0x100
#define CL_DB_PUA_EXCLUDE       0x200
#define CL_DB_COMPILED          0x400   /* internal */
#define CL_DB_DIRECTORY         0x800   /* internal */
#define CL_DB_OFFICIAL_ONLY     0x1000
#define CL_DB_BYTECODE          0x2000
#define CL_DB_SIGNED            0x4000  /* internal */
#define CL_DB_BYTECODE_UNSIGNED 0x8000  /* Caution: You should never run bytecode signatures from untrusted sources. Doing so may result in arbitrary code execution. */
#define CL_DB_UNSIGNED          0x10000 /* internal */
#define CL_DB_BYTECODE_STATS    0x20000
#define CL_DB_ENHANCED          0x40000
#define CL_DB_PCRE_STATS        0x80000
#define CL_DB_YARA_EXCLUDE      0x100000
#define CL_DB_YARA_ONLY         0x200000

/* recommended db settings */
#define CL_DB_STDOPT (CL_DB_PHISHING | CL_DB_PHISHING_URLS | CL_DB_BYTECODE)

These dboptions values are used as arguments when scanmanager calls the cl_load function to load the virus database.

if ((opt = optget(opts, "database"))->active) {
    while (opt) {
        if ((ret = cl_load(opt->strarg, engine, &info.sigs, dboptions))) {
            logg("!%s\n", cl_strerror(ret));

            ret = 2;
            goto done;
        }

        opt = opt->nextarg;
    }
} else {
    char *dbdir = freshdbdir();

    if ((ret = cl_load(dbdir, engine, &info.sigs, dboptions))) {
        logg("!%s\n", cl_strerror(ret));

        free(dbdir);
        ret = 2;
        goto done;
    }

    free(dbdir);
}

For example, when CL_DB_PHISHING is enabled, phishing signatures are loaded, and when CL_DB_PUA is enabled, PUA signatures are loaded.

Reference: libclamav - ClamAV Documentation

cl_init and cl_engine_new

Inside the scanmanager function, the following functions are called, in this order, to initialize libclamav.

int cl_init(unsigned int options);
struct cl_engine *cl_engine_new(void);

The resources of the allocated engine are later freed with cl_engine_free(struct cl_engine *engine).

The engine returned by cl_engine_new is a cl_engine structure, defined as shown below.

struct cl_engine {
    uint32_t refcount; /* reference counter */
    uint32_t sdb;
    uint32_t dboptions;
    uint32_t dbversion[2];
    uint32_t ac_only;
    uint32_t ac_mindepth;
    uint32_t ac_maxdepth;
    char *tmpdir;
    uint32_t keeptmp;
    uint64_t engine_options;

    /* Limits */
    uint32_t maxscantime; /* Time limit (in milliseconds) */
    uint64_t maxscansize; /* during the scanning of archives this size
           * will never be exceeded
           */
    uint64_t maxfilesize; /* compressed files will only be decompressed
           * and scanned up to this size
           */
    uint32_t maxreclevel; /* maximum recursion level for archives */
    uint32_t maxfiles;    /* maximum number of files to be scanned
           * within a single archive
           */
    /* This is for structured data detection.  You can set the minimum
     * number of occurrences of an CC# or SSN before the system will
     * generate a notification.
     */
    uint32_t min_cc_count;
    uint32_t min_ssn_count;

    /* Roots table */
    struct cli_matcher **root;

    /* hash matcher for standard MD5 sigs */
    struct cli_matcher *hm_hdb;
    /* hash matcher for MD5 sigs for PE sections */
    struct cli_matcher *hm_mdb;
    /* hash matcher for MD5 sigs for PE import tables */
    struct cli_matcher *hm_imp;
    /* hash matcher for allow list db */
    struct cli_matcher *hm_fp;

    /* Container metadata */
    struct cli_cdb *cdb;

    /* Phishing .pdb and .wdb databases*/
    struct regex_matcher *allow_list_matcher;
    struct regex_matcher *domain_list_matcher;
    struct phishcheck *phishcheck;

    /* Dynamic configuration */
    struct cli_dconf *dconf;

    /* Filetype definitions */
    struct cli_ftype *ftypes;
    struct cli_ftype *ptypes;

    /* Container password storage */
    struct cli_pwdb **pwdbs;

    /* Pre-loading test matcher
     * Test for presence before using; cleared on engine compile.
     */
    struct cli_matcher *test_root;

    /* Ignored signatures */
    struct cli_matcher *ignored;

    /* PUA categories (to be included or excluded) */
    char *pua_cats;

    /* Icon reference storage */
    struct icon_matcher *iconcheck;

    /* Negative cache storage */
    struct CACHE *cache;

    /* Database information from .info files */
    struct cli_dbinfo *dbinfo;

    /* Signature counting, for progress callbacks */
    size_t num_total_signatures;

    /* Used for memory pools */
    mpool_t *mempool;

    /* crtmgr stuff */
    crtmgr cmgr;

    /* Callback(s) */
    clcb_pre_cache cb_pre_cache;
    clcb_pre_scan cb_pre_scan;
    clcb_post_scan cb_post_scan;
    clcb_virus_found cb_virus_found;
    clcb_sigload cb_sigload;
    void *cb_sigload_ctx;
    clcb_hash cb_hash;
    clcb_meta cb_meta;
    clcb_file_props cb_file_props;
    clcb_progress cb_sigload_progress;
    void *cb_sigload_progress_ctx;
    clcb_progress cb_engine_compile_progress;
    void *cb_engine_compile_progress_ctx;
    clcb_progress cb_engine_free_progress;
    void *cb_engine_free_progress_ctx;

    /* Used for bytecode */
    struct cli_all_bc bcs;
    unsigned *hooks[_BC_LAST_HOOK - _BC_START_HOOKS];
    unsigned hooks_cnt[_BC_LAST_HOOK - _BC_START_HOOKS];
    unsigned hook_lsig_ids;
    enum bytecode_security bytecode_security;
    uint32_t bytecode_timeout;
    enum bytecode_mode bytecode_mode;

    /* Engine max settings */
    uint64_t maxembeddedpe;      /* max size to scan MSEXE for PE */
    uint64_t maxhtmlnormalize;   /* max size to normalize HTML */
    uint64_t maxhtmlnotags;      /* max size for scanning normalized HTML */
    uint64_t maxscriptnormalize; /* max size to normalize scripts */
    uint64_t maxziptypercg;      /* max size to re-do zip filetype */

    /* Statistics/intelligence gathering */
    void *stats_data;
    clcb_stats_add_sample cb_stats_add_sample;
    clcb_stats_remove_sample cb_stats_remove_sample;
    clcb_stats_decrement_count cb_stats_decrement_count;
    clcb_stats_submit cb_stats_submit;
    clcb_stats_flush cb_stats_flush;
    clcb_stats_get_num cb_stats_get_num;
    clcb_stats_get_size cb_stats_get_size;
    clcb_stats_get_hostid cb_stats_get_hostid;

    /* Raw disk image max settings */
    uint32_t maxpartitions; /* max number of partitions to scan in a disk image */

    /* Engine max settings */
    uint32_t maxiconspe; /* max number of icons to scan for PE */
    uint32_t maxrechwp3; /* max recursive calls for HWP3 parsing */

    /* PCRE matching limitations */
    uint64_t pcre_match_limit;
    uint64_t pcre_recmatch_limit;
    uint64_t pcre_max_filesize;

#ifdef HAVE_YARA
    /* YARA */
    struct _yara_global *yara_global;
#endif
};

Reference: libclamav - ClamAV Documentation

Assigning the Virus-Detection Callback

After engine initialization is complete, cl_engine_set_clcb_virus_found can be used to assign the callback function that the engine executes when it detects a virus.

By default, the following clamscan_virus_found_cb function is assigned as the callback, and it contains code that displays the file and detection name.

static void clamscan_virus_found_cb(int fd, const char *virname, void *context)
{
    struct clamscan_cb_data *data = (struct clamscan_cb_data *)context;
    const char *filename;

    UNUSEDPARAM(fd);

    if (data == NULL)
        return;
    if (data->filename != NULL)
        filename = data->filename;
    else
        filename = "(filename not set)";
    logg("~%s: %s FOUND\n", filename, virname);
    return;
}

cl_engine_set_clcb_virus_found(engine, clamscan_virus_found_cb);

Displaying the Progress Bar

The following code registers progress callbacks so that time-consuming tasks, such as loading signatures and compiling the engine, show a progress bar. It does so only when stdout is a terminal and none of the debug, quiet, infected, and no-summary options is enabled.

if (isatty(fileno(stdout)) &&
    !optget(opts, "debug")->enabled &&
    !optget(opts, "quiet")->enabled &&
    !optget(opts, "infected")->enabled &&
    !optget(opts, "no-summary")->enabled) {
    /* set progress callbacks */
    cl_engine_set_clcb_sigload_progress(engine, sigload_callback, &sigload_progress_ctx);
    cl_engine_set_clcb_engine_compile_progress(engine, engine_compile_callback, &engine_compile_progress_ctx);
#ifdef ENABLE_ENGINE_FREE_PROGRESSBAR
    cl_engine_set_clcb_engine_free_progress(engine, engine_free_callback, &engine_free_progress_ctx);
#endif
}

The progress bar is displayed like this.

[Screenshot: clamscan progress bar]

Setting Engine Options

In the code called after that, various options are set on the engine by using functions such as cl_engine_set_str and cl_engine_set_num.

Reference: libclamav - ClamAV Documentation

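The fields that can be set this way are listed in the cl_engine_field enum. The excerpt below also includes the cl_engine_new declaration that accompanies it.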

/**
 * @brief Allocate a new scanning engine and initialize default settings.
 *
 * The engine should be freed with `cl_engine_free()`.
 *
 * @return struct cl_engine* Pointer to the scanning engine.
 */
extern struct cl_engine *cl_engine_new(void);

enum cl_engine_field {
    CL_ENGINE_MAX_SCANSIZE,        /* uint64_t */
    CL_ENGINE_MAX_FILESIZE,        /* uint64_t */
    CL_ENGINE_MAX_RECURSION,       /* uint32_t */
    CL_ENGINE_MAX_FILES,           /* uint32_t */
    CL_ENGINE_MIN_CC_COUNT,        /* uint32_t */
    CL_ENGINE_MIN_SSN_COUNT,       /* uint32_t */
    CL_ENGINE_PUA_CATEGORIES,      /* (char *) */
    CL_ENGINE_DB_OPTIONS,          /* uint32_t */
    CL_ENGINE_DB_VERSION,          /* uint32_t */
    CL_ENGINE_DB_TIME,             /* time_t */
    CL_ENGINE_AC_ONLY,             /* uint32_t */
    CL_ENGINE_AC_MINDEPTH,         /* uint32_t */
    CL_ENGINE_AC_MAXDEPTH,         /* uint32_t */
    CL_ENGINE_TMPDIR,              /* (char *) */
    CL_ENGINE_KEEPTMP,             /* uint32_t */
    CL_ENGINE_BYTECODE_SECURITY,   /* uint32_t */
    CL_ENGINE_BYTECODE_TIMEOUT,    /* uint32_t */
    CL_ENGINE_BYTECODE_MODE,       /* uint32_t */
    CL_ENGINE_MAX_EMBEDDEDPE,      /* uint64_t */
    CL_ENGINE_MAX_HTMLNORMALIZE,   /* uint64_t */
    CL_ENGINE_MAX_HTMLNOTAGS,      /* uint64_t */
    CL_ENGINE_MAX_SCRIPTNORMALIZE, /* uint64_t */
    CL_ENGINE_MAX_ZIPTYPERCG,      /* uint64_t */
    CL_ENGINE_FORCETODISK,         /* uint32_t */
    CL_ENGINE_DISABLE_CACHE,       /* uint32_t */
    CL_ENGINE_DISABLE_PE_STATS,    /* uint32_t */
    CL_ENGINE_STATS_TIMEOUT,       /* uint32_t */
    CL_ENGINE_MAX_PARTITIONS,      /* uint32_t */
    CL_ENGINE_MAX_ICONSPE,         /* uint32_t */
    CL_ENGINE_MAX_RECHWP3,         /* uint32_t */
    CL_ENGINE_MAX_SCANTIME,        /* uint32_t */
    CL_ENGINE_PCRE_MATCH_LIMIT,    /* uint64_t */
    CL_ENGINE_PCRE_RECMATCH_LIMIT, /* uint64_t */
    CL_ENGINE_PCRE_MAX_FILESIZE,   /* uint64_t */
    CL_ENGINE_DISABLE_PE_CERTS,    /* uint32_t */
    CL_ENGINE_PE_DUMPCERTS,        /* uint32_t */
};
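To illustrate why a single setter keyed by an enum is convenient, here is a minimal sketch of a field-keyed numeric setter. It is not the libclamav implementation: mini_field, mini_limits, and mini_set_num are hypothetical names, and the range checks are my own assumption about what such a setter might reasonably do for 64-bit and 32-bit fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of fields, modeled loosely on enum cl_engine_field. */
enum mini_field {
    MINI_MAX_SCANSIZE,  /* uint64_t */
    MINI_MAX_FILESIZE,  /* uint64_t */
    MINI_MAX_RECURSION  /* uint32_t */
};

struct mini_limits {
    uint64_t max_scansize;
    uint64_t max_filesize;
    uint32_t max_recursion;
};

/* Stores `num` in the field selected by `field`.
 * Returns 0 on success, -1 for a negative value, a value that does not
 * fit the field's type, or an unknown field. */
static int mini_set_num(struct mini_limits *limits, enum mini_field field, long long num)
{
    assert(limits != NULL);
    if (num < 0)
        return -1;

    switch (field) {
        case MINI_MAX_SCANSIZE:
            limits->max_scansize = (uint64_t)num;
            return 0;
        case MINI_MAX_FILESIZE:
            limits->max_filesize = (uint64_t)num;
            return 0;
        case MINI_MAX_RECURSION:
            if (num > UINT32_MAX)
                return -1; /* would not fit in a 32-bit field */
            limits->max_recursion = (uint32_t)num;
            return 0;
        default:
            return -1;
    }
}
```

The type comments next to each enum value in the real header play the same role as the per-case handling here: they tell the caller which width the value is stored in.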

Performing the File Scan

After initializing libclamav and the engine, clamscan performs the scan.

In this case, I specify the eicar file directly on the command line, so opts->filename is set and the scan appears to be carried out by the scan_files function.

if (optget(opts, "file-list")->enabled || opts->filename) {
    /* scan the files listed in the --file-list, or it that's not specified, then
     * scan the list of file arguments (including data from stdin, if `-` specified) */
    ret = scan_files(engine, opts, &options, dirlnk, filelnk);

#ifdef _WIN32
} else if (optget(opts, "memory")->enabled) {
    /* scan only memory */
    ret = scan_memory(engine, opts, &options);

#endif
} else {
    /* No list of files provided to scan, and no request to scan memory,
     * so just scan the current directory. */
    char cwd[1024];

    /* Get the current working directory.
     * we need full path for some reasons (eg. archive handling) */
    if (!getcwd(cwd, sizeof(cwd))) {
        logg(LOGG_ERROR, "Can't get absolute pathname of current working directory\n");
        ret = 2;
    } else {
        CLAMSTAT(cwd, &sb);
        scandirs(cwd, engine, opts, &options, 1, sb.st_dev);
    }
}

scan_files

The scan_files function receives the following arguments and scans the files.

/**
* @brief Scan the files from the --file-list option, or scan the files listed as individual arguments.
*
* If the user uses both --file-list <LISTFILE> AND one or more files, then clam will only
* scan the files listed in the LISTFILE and emit a warning about not scanning the other file parameters.
*
* @param opts
* @param options
* @return int
*/
static int scan_files(struct cl_engine *engine, const struct optstruct *opts, struct cl_scan_options *options, unsigned int dirlnk, unsigned int filelnk)

For a single-file scan like this one, scanfile is called to scan the file.

static void scanfile(const char *filename, struct cl_engine *engine, const struct optstruct *opts, struct cl_scan_options *options)

Resolving the File Path

Inside scanfile, the code first resolves the actual path of the scan target with cli_realpath and stores it in real_filename.

ret = cli_realpath((const char *)filename, &real_filename);
if (CL_SUCCESS != ret) {
    logg(LOGG_DEBUG, "Failed to determine real filename of %s.\n", filename);
    logg(LOGG_DEBUG, "Quarantine of the file may fail if file path contains symlinks.\n");
} else {
    filename = real_filename;
}

In this case, filename is "eicar", the value I passed on the command line.

From there, cli_realpath is called, the library function realpath is used to obtain the absolute path of the scan target file, and then the data in the filename variable is replaced with that resolved full path.
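The "resolve, and fall back to the original name on failure" behavior can be sketched with the POSIX realpath function alone. This is a minimal sketch, not the cli_realpath implementation: effective_name is a hypothetical helper, and it assumes a POSIX system where realpath(path, NULL) returns a malloc-allocated string.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper mirroring the fallback in scanfile:
 * on success, *resolved_out receives a malloc'ed absolute path (the caller
 * must free it) and that path is returned; on failure, *resolved_out is
 * NULL and the name the caller passed in is returned unchanged. */
static const char *effective_name(const char *given, char **resolved_out)
{
    assert(given != NULL && resolved_out != NULL);

    *resolved_out = realpath(given, NULL);
    return (*resolved_out != NULL) ? *resolved_out : given;
}
```

Keeping the resolved string in a separate variable, as scanfile does with real_filename, makes it clear which pointer has to be freed later, while filename can keep pointing at either string.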

Scanning the File

Once the scan target path has been obtained, the code performs several checks such as access permissions, gets the file descriptor for the target file, and then calls cl_scandesc_ex.

struct metachain {
    char **chains;
    size_t lastadd;
    size_t lastvir;
    size_t level;
    size_t nchains;
};

struct clamscan_cb_data {
    struct metachain *chain;
    const char *filename;
};


logg(LOGG_DEBUG, "Scanning %s\n", filename);

if ((fd = safe_open(filename, O_RDONLY | O_BINARY)) == -1) {
    logg(LOGG_WARNING, "Can't open file %s: %s\n", filename, strerror(errno));
    info.errors++;
    goto done;
}

data.chain    = &chain;
data.filename = filename;
ret = cl_scandesc_ex(
    fd,
    filename,
    &verdict,
    &alert_name,
    &info.bytes_scanned,
    engine, options,
    &data,
    hash_hint,
    hash_out,
    hash_alg,
    file_type_hint,
    file_type_out);

The cl_scandesc_ex function scans the file behind the received file descriptor and writes the result to the location pointed to by its verdict_out parameter (here, &verdict).

/**
 * @brief Scan a file, given a file descriptor.
 *
 * This callback variant allows the caller to provide a context structure that
 * caller provided callback functions can interpret.
 *
 * This extended version of cl_scanmap_callback allows the caller to provide
 * additional hints to the scanning engine, such as a file hash and file type.
 *
 * This variant also upgrades the `scanned` output parameter to a 64-bit integer.
 *
 * @param desc               File descriptor of an open file. The caller must provide this or the map.
 * @param filename           (Optional) Filepath of the open file descriptor or file map.
 * @param[out] verdict_out   A pointer to a cl_verdict_t that will be set to the scan verdict.
 *                           You should check the verdict even if the function returns an error.
 * @param[out] last_alert_out Will be set to a statically allocated (i.e. needs not be freed) signature name if the scan
 *                           matches against a signature.
 * @param[out] scanned_out   The (exact) number of bytes scanned.
 * @param engine             The scanning engine.
 * @param scanoptions        Scanning options.
 * @param[in,out] context    (Optional) An application-defined context struct, opaque to libclamav.
 *                           May be used within your callback functions.
 * @param hash_hint          (Optional) A NULL terminated string of the file hash so that
 *                           libclamav does not need to calculate it.
 * @param[out] hash_out      (Optional) A NULL terminated string of the file hash.
 *                           The caller is responsible for freeing the string.
 * @param hash_alg           The hashing algorithm used for either `hash_hint` or `hash_out`.
 *                           Supported algorithms are "md5", "sha1", "sha2-256".
 *                           If not specified, the default is "sha2-256".
 * @param file_type_hint     (Optional) A NULL terminated string of the file type hint.
 *                           E.g. "pe", "elf", "zip", etc.
 *                           You may also use ClamAV type names such as "CL_TYPE_PE".
 *                           ClamAV will ignore the hint if it is not familiar with the specified type.
 *                           See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
 * @param[out] file_type_out (Optional) A NULL terminated string of the file type
 *                           of the top layer as determined by ClamAV.
 *                           Will take the form of the standard ClamAV file type format. E.g. "CL_TYPE_PE".
 *                           See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
 * @return cl_error_t        CL_SUCCESS if no error occured.
 *                           Otherwise a CL_E* error code.
 *                           Does NOT return CL_VIRUS for a signature match. Check the `verdict_out` parameter instead.
 */
extern cl_error_t cl_scandesc_ex(
    int desc,
    const char *filename,
    cl_verdict_t *verdict_out,
    const char **last_alert_out,
    uint64_t *scanned_out,
    const struct cl_engine *engine,
    struct cl_scan_options *scanoptions,
    void *context,
    const char *hash_hint,
    char **hash_out,
    const char *hash_alg,
    const char *file_type_hint,
    char **file_type_out);

The value written through verdict_out is one of the following cl_verdict_t enum values.

/**
 * @brief Scan verdicts for cl_scanmap_ex(), cl_scanfile_ex(), and cl_scandesc_ex().
 */
typedef enum cl_verdict_t {
    CL_VERDICT_NOTHING_FOUND = 0,    /**< No alerting signatures matched. */
    CL_VERDICT_TRUSTED,              /**< The scan target has been deemed trusted (e.g. by FP signature or Authenticode). */
    CL_VERDICT_STRONG_INDICATOR,     /**< One or more strong indicator signatures matched. */
    CL_VERDICT_POTENTIALLY_UNWANTED, /**< One or more potentially unwanted signatures matched. */
} cl_verdict_t;
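Because the return value and the verdict are reported separately, the caller has to combine them. The following is a minimal sketch of one way to do that; it is not taken from clamscan. mini_verdict_t is a hypothetical mirror of cl_verdict_t so that the code compiles without clamav.h, and the rule that an alert takes precedence over an error is my own assumption. The exit codes (0 = no virus found, 1 = virus found, 2 = error) are the ones documented for clamscan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of cl_verdict_t; not the real type. */
typedef enum {
    V_NOTHING_FOUND = 0,
    V_TRUSTED,
    V_STRONG_INDICATOR,
    V_POTENTIALLY_UNWANTED
} mini_verdict_t;

/* True when the verdict is something the user should be alerted about. */
static bool verdict_is_alert(mini_verdict_t v)
{
    return v == V_STRONG_INDICATOR || v == V_POTENTIALLY_UNWANTED;
}

/* Maps one scan result to an exit code: 0 clean, 1 alert, 2 error.
 * Assumption: the verdict is checked first, even if the scan call failed. */
static int exit_code_for(bool scan_failed, mini_verdict_t verdict)
{
    if (verdict_is_alert(verdict))
        return 1;
    return scan_failed ? 2 : 0;
}

/* Combines per-file codes for a multi-file run (assumed policy):
 * any alert wins, otherwise any error, otherwise clean. */
static int combine_exit_codes(const int *codes, size_t n)
{
    int result = 0;

    assert(codes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (codes[i] == 1)
            return 1;
        if (codes[i] == 2)
            result = 2;
    }
    return result;
}
```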

Inside cl_scandesc_ex, the received file descriptor is first passed to fmap_new and mapped into the map variable of type cl_fmap_t (cl_fmap), and then scan_common is called.

if (NULL == (map = fmap_new(desc, 0, sb.st_size, filename_base, filename))) {
    cli_errmsg("CRITICAL: fmap_new() failed\n");
    status = CL_EMEM;
    goto done;
}

status = scan_common(
    map,
    filename,
    verdict_out,
    last_alert_out,
    scanned_out,
    engine,
    scanoptions,
    context,
    hash_hint,
    hash_out,
    hash_alg,
    file_type_hint,
    file_type_out);

The cl_fmap structure is defined as follows, and it seems to be used to abstract file maps so they can be accessed from various APIs.

struct cl_fmap {
    /* handle interface */
    void *handle;
    clcb_pread pread_cb;

    /* memory interface */
    const void *data;

    /* internal */
    uint64_t mtime;
    uint64_t pages;
    uint64_t pgsz;
    uint64_t paged;
    bool aging;           /** Indicates if we should age off memory mapped pages */
    bool dont_cache_flag; /** Indicates if we should not cache scan results for this fmap. Used if limits exceeded */
    bool handle_is_fd;    /** Non-zero if `map->handle` is an fd. This is needed so that `fmap_fd()` knows if it can
                              return a file descriptor. If it's some other kind of handle, then `fmap_fd()` has to return -1. */
    size_t offset;        /** File offset representing start of original fmap, if the fmap created reading from a file starting at offset other than 0.
                              `offset` & `len` are critical information for anyone using the file descriptor/handle */
    size_t nested_offset; /** Offset from start of original fmap (data) for nested scan. 0 for orig fmap. */
    size_t real_len;      /** Length from start of original fmap (data) to end of current (possibly nested) map.
                              `real_len == nested_offset + len`.
                              `real_len` is needed for nested maps because we only reference the original mapping data.
                              We convert caller's fmap offsets & lengths to real data offsets using `nested_offset` & `real_len`. */

    /* external */
    size_t len; /** Length of data from nested_offset, accessible via current fmap */

    /* real_len = nested_offset + len
     * file_offset = offset + nested_offset + need_offset
     * maximum offset, length accessible via fmap API: len
     * offset in cached buffer: nested_offset + need_offset
     *
     * This allows scanning a portion of an already mapped file without dumping
     * to disk and remapping (for uncompressed archives for example) */

    /* vtable for implementation */
    void (*unmap)(fmap_t *);
    const void *(*need)(fmap_t *, size_t at, size_t len, int lock);
    const void *(*need_offstr)(fmap_t *, size_t at, size_t len_hint);
    const void *(*gets)(fmap_t *, char *dst, size_t *at, size_t max_len);
    void (*unneed_off)(fmap_t *, size_t at, size_t len);
    void *windows_file_handle;
    void *windows_map_handle;

    /* flags to indicate if we should calculate a hash next time we calculate any hashes */
    bool will_need_hash[CLI_HASH_AVAIL_TYPES];

    /* flags to indicate if we have calculated a hash */
    bool have_hash[CLI_HASH_AVAIL_TYPES];

    /* hash values */
    uint8_t hash[CLI_HASH_AVAIL_TYPES][CLI_HASHLEN_MAX];

    uint64_t *bitmap;
    char *name; /* name of the file, e.g. as recorded in a zip file entry record */
    char *path; /* path to the file/tempfile, if fmap was created from a file descriptor */
};
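The offset relationships written in the struct comments (real_len = nested_offset + len, and file_offset = offset + nested_offset + need_offset) are easier to follow with concrete numbers. Here is a minimal sketch that keeps only those four fields. mini_fmap, mini_nested, and mini_file_offset are hypothetical names, and the bounds checks are my own assumption rather than the actual fmap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of cl_fmap to the fields used in the offset math. */
struct mini_fmap {
    size_t offset;        /* file offset where the original map starts */
    size_t nested_offset; /* start of this view inside the original map */
    size_t len;           /* bytes accessible through this view */
    size_t real_len;      /* always nested_offset + len */
};

/* Creates a view of `len` bytes starting `at` bytes into `parent`,
 * without copying any data. Returns false if the range does not fit. */
static bool mini_nested(const struct mini_fmap *parent, size_t at, size_t len,
                        struct mini_fmap *child)
{
    assert(parent != NULL && child != NULL);
    if (at > parent->len || len > parent->len - at)
        return false;

    child->offset        = parent->offset;
    child->nested_offset = parent->nested_offset + at;
    child->len           = len;
    child->real_len      = child->nested_offset + len;
    return true;
}

/* Converts an offset inside the view into an offset in the underlying file. */
static bool mini_file_offset(const struct mini_fmap *map, size_t need_offset, size_t *out)
{
    assert(map != NULL && out != NULL);
    if (need_offset >= map->len)
        return false;

    *out = map->offset + map->nested_offset + need_offset;
    return true;
}
```

With a 1000-byte original map, a view at 200 with length 300 has nested_offset 200 and real_len 500, and a view nested inside it at 100 with length 50 has nested_offset 300 and real_len 350. Both views still refer to the original mapping, which matches the comment about scanning a portion of an already mapped file without writing it to disk.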

scan_common

The file abstracted as a cl_fmap is then passed to scan_common.

This function starts a scan of the cl_fmap.

/**
 * @brief   The main function to initiate a scan of an fmap.
 *
 * @param map                 File map.
 * @param filepath            (optional, recommended) filepath of the open file descriptor or file map.
 * @param[out] verdict_out    A pointer to a cl_verdict_t that will be set to the scan verdict.
 *                            You should check the verdict even if the function returns an error.
 * @param[out] last_alert_out Will be set to a statically allocated (i.e. needs not be freed) signature name if the scan matches against a signature.
 * @param[out] scanned_out    (Optional) The number of bytes scanned.
 * @param engine              The scanning engine.
 * @param scanoptions         Scanning options.
 * @param[in,out] context     (Optional) An application-defined context struct, opaque to libclamav.
 *                            May be used within your callback functions.
 * @param hash_hint           (Optional) A NULL terminated string of the file hash so that
 *                            libclamav does not need to calculate it.
 * @param[out] hash_out       (Optional) A NULL terminated string of the file hash.
 *                            The caller is responsible for freeing this string.
 * @param hash_alg            The hashing algorithm used for either `hash_hint` or `hash_out`.
 *                            Supported algorithms are "md5", "sha1", "sha2-256".
 *                            Required only if you provide a `hash_hint` or want to receive a `hash_out`.
 * @param file_type_hint      (Optional) A NULL terminated string of the file type hint.
 *                            E.g. "pe", "elf", "zip", etc.
 *                            You may also use ClamAV type names such as "CL_TYPE_PE".
 *                            ClamAV will ignore the hint if it is not familiar with the specified type.
 * @param file_type_out       (Optional) A NULL terminated string of the file type
 *                            of the top layer as determined by ClamAV.
 *                            Will take the form of the standard ClamAV file type format. E.g. "CL_TYPE_PE".
 *                            The caller is responsible for freeing this string.
 * @return cl_error_t         CL_SUCCESS if no error occured.
 *                            Otherwise a CL_E* error code.
 *                            Does NOT return CL_VIRUS for a signature match. Check the `verdict_out` parameter instead.
 */
static cl_error_t scan_common(
    cl_fmap_t *map,
    const char *filepath,
    cl_verdict_t *verdict_out,
    const char **last_alert_out,
    uint64_t *scanned_out,
    const struct cl_engine *engine,
    struct cl_scan_options *scanoptions,
    void *context,
    const char *hash_hint,
    char **hash_out,
    const char *hash_alg,
    const char *file_type_hint,
    char **file_type_out)

Inside this function, ClamAV context information defined as the cli_ctx structure is initialized, and information such as the engine and scan options received as arguments is registered into that context.

/* internal clamav context */
typedef struct cli_ctx_tag {
    char *target_filepath;   /* (optional) The filepath of the original scan target. */
    char *this_layer_tmpdir; /* Pointer to current temporary directory, MAY vary with recursion depth. For convenience. */
    uint64_t *scanned;
    const struct cli_matcher *root;
    const struct cl_engine *engine;
    uint64_t scansize;
    struct cl_scan_options *options;
    uint32_t scannedfiles;
    unsigned int corrupted_input;      /* Setting this flag will prevent the PE parser from reporting "broken executable" for unpacked/reconstructed files that may not be 100% to spec. */
    cli_scan_layer_t *recursion_stack; /* Array of recursion levels used as a stack. */
    uint32_t recursion_stack_size;     /* stack size must == engine->max_recursion_level */
    uint32_t recursion_level;          /* Index into recursion_stack; current fmap recursion level from start of scan. */
    evidence_t this_layer_evidence;    /* Pointer to current evidence in recursion_stack, varies with recursion depth. For convenience. */
    fmap_t *fmap;                      /* Pointer to current fmap in recursion_stack, varies with recursion depth. For convenience. */
    size_t object_count;               /* Counter for number of unique entities/contained files (including normalized files) processed. */
    struct cli_dconf *dconf;
    bitset_t *hook_lsig_matches;
    void *cb_ctx;
    cli_events_t *perf;
    struct json_object *metadata_json;            /* Top level metadata JSON object for the whole scan. */
    struct json_object *this_layer_metadata_json; /* Pointer to current metadata JSON object in recursion_stack, varies with recursion depth. For convenience. */
    struct timeval time_limit;
    bool limit_exceeded; /* To guard against alerting on limits exceeded more than once, or storing that in the JSON metadata more than once. */
    bool abort_scan;     /* So we can guarantee a scan is aborted, even if CL_ETIMEOUT/etc. status is lost in the scan recursion stack. */
} cli_ctx;

After several initialization steps and checks, it calls cli_magic_scan with the initialized context information.

/*
 * DO THE SCAN!
 */
status = cli_magic_scan(&ctx, file_type);

cli_magic_scan

cli_magic_scan handles the processing that leads directly into the actual signature-based scan.

After this function is called and some checks are performed, the pre_hash callback is invoked.

/*
 * Run the pre_hash callback.
 */
ret = cli_dispatch_scan_callback(ctx, CL_SCAN_CALLBACK_PRE_HASH);
if (CL_SUCCESS != ret) {
    status = ret;
    goto done;
}

The pre_hash callback is called by passing CL_SCAN_CALLBACK_PRE_HASH as the second argument to cli_dispatch_scan_callback.

In cli_dispatch_scan_callback, a predefined callback function is selected and executed according to the specified flag as follows.

/*
 * Determine which callback to use.
 */
switch (location) {
    case CL_SCAN_CALLBACK_PRE_HASH:
        callback = ctx->engine->cb_scan_pre_hash;
        break;
    case CL_SCAN_CALLBACK_PRE_SCAN:
        callback = ctx->engine->cb_scan_pre_scan;
        break;
    case CL_SCAN_CALLBACK_POST_SCAN:
        callback = ctx->engine->cb_scan_post_scan;
        break;
    case CL_SCAN_CALLBACK_ALERT:
        callback = ctx->engine->cb_scan_alert;
        break;
    case CL_SCAN_CALLBACK_FILE_TYPE:
        callback = ctx->engine->cb_scan_file_type;
        break;
    default:
        status = CL_EARG;
        cli_errmsg("dispatch_scan_callback: Invalid callback location\n");
        goto done;
}

// (omitted)

if (NULL == callback) {
    /*
     * Callback is not set.
     */
    if (location == CL_SCAN_CALLBACK_ALERT) {
        // Accept the alert.
        status = CL_VIRUS;
    } else {
        // Keep scanning.
        status = CL_SUCCESS;
    }
    goto done;
}

current_layer = (cl_scan_layer_t *)&ctx->recursion_stack[ctx->recursion_level];

/*
 * Call the callback function.
 */
// TODO: Add performance measurements around the new callback specific to each callback location.
// perf_start(ctx, PERFT_PRECB);
status = callback(
    current_layer, // current scan layer
    ctx->cb_ctx    // application context
);
// perf_stop(ctx, PERFT_PRECB);

However, the callback invoked here apparently has to be registered by the application that uses libclamav. When clamscan reached the cb_scan_pre_hash dispatch in this run, NULL == callback was true, so in practice no callback was called.
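The "select by location, then fall back to a default when nothing is registered" logic can be reduced to a few lines. The following is a minimal sketch with hypothetical names (mini_location_t, mini_status_t, mini_callbacks, mini_dispatch, count_call); it only mirrors the defaults visible in the excerpt, where an unset alert callback accepts the alert and any other unset callback lets the scan continue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the callback locations and status codes. */
typedef enum {
    LOC_PRE_HASH,
    LOC_PRE_SCAN,
    LOC_POST_SCAN,
    LOC_ALERT,
    LOC_FILE_TYPE,
    LOC_COUNT /* number of valid locations; not a location itself */
} mini_location_t;

typedef enum { ST_SUCCESS = 0, ST_VIRUS, ST_EARG } mini_status_t;

typedef mini_status_t (*mini_scan_cb)(void *layer, void *app_ctx);

struct mini_callbacks {
    mini_scan_cb cb[LOC_COUNT]; /* NULL means "not registered" */
};

/* Picks the callback for `loc` and runs it, applying the defaults when unset. */
static mini_status_t mini_dispatch(const struct mini_callbacks *cbs, mini_location_t loc,
                                   void *layer, void *app_ctx)
{
    assert(cbs != NULL);
    if ((int)loc < 0 || (int)loc >= (int)LOC_COUNT)
        return ST_EARG; /* invalid callback location */

    if (cbs->cb[loc] == NULL) {
        /* Not set: accept the alert, or keep scanning for other locations. */
        return (loc == LOC_ALERT) ? ST_VIRUS : ST_SUCCESS;
    }
    return cbs->cb[loc](layer, app_ctx);
}

/* Example application callback: counts how often it was invoked. */
static mini_status_t count_call(void *layer, void *app_ctx)
{
    (void)layer;
    if (app_ctx != NULL)
        (*(int *)app_ctx)++;
    return ST_SUCCESS;
}
```

With an all-NULL table, which corresponds to what I observed for clamscan at the pre_hash location, the dispatch returns the default without calling anything.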

After that, several similar calls are made, the file type is checked, and execution finally reaches a call to scanraw.

/*
 * Perform pattern matching for malware detections AND embedded file type recognition.
 * Embedded file type recognition may re-assign the current file as a new type, or
 * it may detect embedded files. E.g. ZIP entries in a PE file (i.e. self-extracting ZIP).
 */
if ((type != CL_TYPE_IGNORED) &&
    /* CL_TYPE_HTML: raw HTML files are not scanned, unless safety measure activated via DCONF */
    (type != CL_TYPE_HTML || !(SCAN_PARSE_HTML) || !(DCONF_DOC & DOC_CONF_HTML_SKIPRAW)) &&
    (!ctx->engine->sdb)) {

    cli_dbgmsg("cli_magic_scan: Performing raw scan to pattern match and/or detect embedded files\n");

    ret = scanraw(ctx, type, typercg, &dettype);

    // Evaluate the result from the scan to see if it end the scan of this layer early,
    // and to decid if we should propagate an error or not.
    if (result_should_goto_done(ctx, ret, &status)) {
        goto done;
    }
}

image-20251123212552823

scanraw

This function appears to perform a raw scan of the current file map.

/**
 * @brief Perform raw scan of current fmap.
 *
 * @param ctx           Current scan context.
 * @param type          File type
 * @param typercg       Enable type recognition (file typing scan results).
 *                      If 0, will be a regular ac-mode scan.
 * @param[out] dettype  If typercg enabled and scan detects HTML or MAIL types,
 *                      will output HTML or MAIL types after performing HTML/MAIL scans
 * @return cl_error_t
 */
static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_file_t *dettype)

The actual implementation is about 1,000 lines long, but the scan itself is carried out by calling cli_scan_fmap further down.

perf_start(ctx, PERFT_RAW);
ret = cli_scan_fmap(ctx,
                    type == CL_TYPE_TEXT_ASCII ? CL_TYPE_ANY : type,
                    false,
                    &ftoffset,
                    acmode,
                    NULL);
perf_stop(ctx, PERFT_RAW);

Summary

I wrote this pretty roughly, but I am exhausted for now, so I will stop the article here.

I plan to cover the next function, cli_scan_fmap, in a later update to this article.