Implementing RC4 Encryption in C and Reversing It with Ghidra and WinDbg

This page has been machine-translated from the original page.

This time I wanted to implement the RC4 cipher in C and reverse-engineer it with WinDbg.

At Harekaze mini CTF 2021, held on 2021/12/24, there was a Rev challenge that required identifying RC4 usage from static analysis alone. Unfortunately, I was unable to identify it from the decompiler output at the time.

Reference: harekaze-mini-ctf-2021 Pack

As a review exercise, I decided to implement RC4 myself and reverse-engineer it.

What Is RC4 Encryption?
Source Code
Reversing the RC4 Program
Summary
References

What Is RC4 Encryption?

RC4 is a type of stream cipher that was used in protocols such as WEP and SSL.

It is now widely known to have vulnerabilities and its use is not recommended.

Reference: RC4 - Wikipedia

Stream Ciphers

A stream cipher is a type of symmetric cipher (one that uses the same key for both encryption and decryption) that encrypts data sequentially, bit by bit or byte by byte.

Block ciphers are another form of symmetric cipher and are frequently compared with stream ciphers.

Reference: Practical Guide to Applied Cryptography

RC4 Cipher

Because RC4 is a stream cipher, it encrypts plaintext by XORing it with a generated pseudorandom sequence.

The encryption flow with RC4 is roughly as follows:

Using the given key, KSA generates an initial state S used to derive the encryption key stream.
Using the initial state S, PRGA (the Pseudo-Random Generation Algorithm) generates the key stream for encrypting the plaintext.
The generated key stream is used to XOR-encrypt the plaintext.

KSA

In RC4, a predefined array S is initialized using the KSA (Key Scheduling Algorithm).

The implementation follows the Wikipedia sample.

Reference: RC4 - Wikipedia

#define N 256

void swap(unsigned char *a, unsigned char *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
    return;
}

int KSA(char *key, unsigned char *S)
{

    // Initialize array S (S[0]=0, S[1]=1...S[255]=255)
    for (int i = 0; i < N; i++) S[i] = i;
    
    // Generate the initial stream with KSA
    int j = 0;
    int len = strlen(key);
    for (int i = 0; i < N; i++)
    {
        j = (j + S[i] + key[i % len]) % N;
        swap(&S[i], &S[j]);
    }

    return 0;
}

As shown above, the initial stream is generated by KSA based on the input key.

Generating the Key Stream and Encrypting with PRGA

The steps from PRGA to encryption are implemented as follows:

Increment i: i = (i + 1) % 256
Add S[i] to j: j = (j + S[i]) % 256
Swap S[i] and S[j]
Compute key stream K as S[(S[i] + S[j]) % 256] and XOR it with the plaintext

The implementation is very straightforward.

int PRGA(unsigned char *S, char *text, unsigned char *encrypted_text)
{

    int i = 0;
    int j = 0;
    for (size_t n = 0, len = strlen(text); n < len; n++)
    {
        i = (i + 1) % N;
        j = (j + S[i]) % N;
        swap(&S[i], &S[j]);
        int K = S[(S[i] + S[j]) % N];
        encrypted_text[n] = K ^ text[n];
    }

    return 0;
}

This completes the RC4 encryption implementation.

Decryption

RC4 encrypts data using XOR with a pseudorandom sequence.

Because the pseudorandom generation algorithm is deterministic, the same key always produces the same pseudorandom sequence.

Therefore, you can decrypt ciphertext by XORing it again with the pseudorandom sequence generated from the same key that was used for encryption.

Source Code

The full source code is shown below.

It is also available in the following repository.

Reference: Try2WinDbg/rc4_encrypt.c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define N 256

void swap(unsigned char *a, unsigned char *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
    return;
}

int KSA(char *key, unsigned char *S)
{

    // Initialize array S (S[0]=0, S[1]=1...S[255]=255)
    for (int i = 0; i < N; i++) S[i] = i;
    
    // Generate the initial stream with KSA
    int j = 0;
    int len = strlen(key);
    for (int i = 0; i < N; i++)
    {
        j = (j + S[i] + key[i % len]) % N;
        swap(&S[i], &S[j]);
    }

    return 0;
}

int PRGA(unsigned char *S, char *text, unsigned char *encrypted_text)
{

    int i = 0;
    int j = 0;
    for (size_t n = 0, len = strlen(text); n < len; n++)
    {
        i = (i + 1) % N;
        j = (j + S[i]) % N;
        swap(&S[i], &S[j]);
        int K = S[(S[i] + S[j]) % N];
        encrypted_text[n] = K ^ text[n];
    }

    return 0;
}

int RC4(char *key, char *text, unsigned char *encrypted_text)
{

    unsigned char S[N];

    KSA(key, S);
    PRGA(S, text, encrypted_text);
    return 0;
}

int main(int argc, char *argv[])
{
    printf("RC4 module\n");

    // Note: a short RC4 key is easily guessable; exercise caution in production use
    char *key = "testkey";
    char *text = "this is test.";

    unsigned char *encrypted_text = malloc(sizeof(int) * strlen(text));
    RC4(key, text, encrypted_text);

    printf("This is encrypted text\n");
    printf("==> ");
    for (size_t i = 0, len = strlen(text); i < len; i++)
    {
        printf("%02hhX", encrypted_text[i]);
    }
    printf("\n");

    unsigned char *decrypted_text = malloc(sizeof(int) * strlen(encrypted_text));
    RC4(key, encrypted_text, decrypted_text);

    printf("This is decrypted text\n");
    printf("==> ");
    for (size_t i = 0, len = strlen(text); i < len; i++)
    {
        printf("%c", decrypted_text[i]);
    }

    return 0;
}

Finally, let’s reverse-engineer this program.

Reversing the RC4 Program

First, I loaded the compiled binary into Ghidra for analysis.

I will skip over the process of locating the main function.

Decompiling the Calling Function

Let’s start by looking at the RC4 calling function.

In terms of the source code, this corresponds to the following section:

int RC4(char *key, char *text, unsigned char *encrypted_text)
{
    unsigned char S[N];
    KSA(key, S);
    PRGA(S, text, encrypted_text);
    return 0;
}

Looking at the decompiled output, something seems off:

void __cdecl RC4(char *param_1,char *param_2,int param_3)
{
  uint uVar1;
  undefined extraout_DL;
  undefined in_stack_fffffef8;
  
  uVar1 = DAT_00471090 ^ (uint)&stack0xfffffffc;
  thunk_FUN_004071b0(param_1,(int)&stack0xfffffef8);
  thunk_FUN_00407260((int)&stack0xfffffef8,param_2,param_3);
  thunk_FUN_00407648(uVar1 ^ (uint)&stack0xfffffffc,extraout_DL,in_stack_fffffef8);
  return;
}

I renamed the functions and variables to make analysis easier:

void __cdecl RC4(char *key,char *text,int encrypted_text)
{
  uint uVar1;
  undefined extraout_DL;
  undefined S;
  
  uVar1 = DAT_00471090 ^ (uint)&stack0xfffffffc;
  KSA(key,(int)&stack0xfffffef8);
  PRGA((int)&stack0xfffffef8,text,encrypted_text);
  thunk_FUN_00407648(uVar1 ^ (uint)&stack0xfffffffc,extraout_DL,S);
  return;
}

The following two lines do not appear in the source code and it’s unclear what they’re doing:

uVar1 = DAT_00471090 ^ (uint)&stack0xfffffffc;
/* ... (omitted) ... */
thunk_FUN_00407648(uVar1 ^ (uint)&stack0xfffffffc,extraout_DL,S);

So I analyzed this section using WinDbg’s TTD trace.

Analyzing the TTD Trace

I captured a TTD trace as usual.

The trace file is available here:

Reference: Try2WinDbg/rc4_encrypt.zip

I loaded the trace file and set a breakpoint at the address corresponding to the uVar1 = DAT_00471090 ^ (uint)&stack0xfffffffc; instruction.

> bu 0086734e

After loading it in WinDbg, I found that the data shown as DAT_00471090 in Ghidra is actually the Security Cookie.

__security_cookie is defined in the data section and is used to detect buffer overflows.

Reference: ida - _securitycookie for function pointers in Windows 10

It turns out those two mysterious lines implement the stack-cookie mechanism for detecting buffer overflows:

uVar1 = DAT_00471090 ^ (uint)&stack0xfffffffc;
/* ... (omitted) ... */
thunk_FUN_00407648(uVar1 ^ (uint)&stack0xfffffffc,extraout_DL,S);

That clears things up a bit. Let’s move on to analyzing KSA and PRGA.

Decompiling the KSA Function

Here is the result of decompiling the KSA function in Ghidra:

undefined4 __cdecl KSA(char *param_1,int param_2)
{
  size_t sVar1;
  uint local_10;
  int local_c;
  int local_8;
  
  for (local_c = 0; local_c < 0x100; local_c = local_c + 1) {
    *(undefined *)(param_2 + local_c) = (undefined)local_c;
  }
  local_10 = 0;
  sVar1 = _strlen(param_1);
  for (local_8 = 0; local_8 < 0x100; local_8 = local_8 + 1) {
    local_10 = *(byte *)(param_2 + local_8) + local_10 + (int)param_1[local_8 % (int)sVar1] &
               0x800000ff;
    if ((int)local_10 < 0) {
      local_10 = (local_10 - 1 | 0xffffff00) + 1;
    }
    thunk_FUN_00867180((undefined *)(param_2 + local_8),(undefined *)(param_2 + local_10));
  }
  return 0;
}

This closely resembles the decompiled output of the binary from the Harekaze mini CTF 2021 Rev challenge.

Reference: harekaze-mini-ctf-2021 Pack

It’s frustrating to think that if I had caught this right away, I could have solved the challenge.

Decompiling the PRGA Function

Here is the decompiled output:

undefined4 __cdecl PRGA(int param_1,char *param_2,int param_3)
{
  size_t sVar1;
  uint local_10;
  uint local_c;
  uint local_8;
  
  local_8 = 0;
  local_10 = 0;
  local_c = 0;
  sVar1 = _strlen(param_2);
  for (; local_c < sVar1; local_c = local_c + 1) {
    local_8 = local_8 + 1 & 0x800000ff;
    if ((int)local_8 < 0) {
      local_8 = (local_8 - 1 | 0xffffff00) + 1;
    }
    local_10 = *(byte *)(param_1 + local_8) + local_10 & 0x800000ff;
    if ((int)local_10 < 0) {
      local_10 = (local_10 - 1 | 0xffffff00) + 1;
    }
    thunk_FUN_00867180((undefined *)(param_1 + local_8),(undefined *)(param_1 + local_10));
    *(byte *)(param_3 + local_c) =
         param_2[local_c] ^
         *(byte *)(param_1 +
                  ((uint)*(byte *)(param_1 + local_8) + (uint)*(byte *)(param_1 + local_10) &
                  0x800000ff));
  }
  return 0;
}

One interesting point is how (i + 1) % 256 in the source code appears differently in the decompiled output:

Note: In C, + and & have the same operator precedence, so i + 1 is evaluated first.

i = i + 1 & 0x800000ff;
if ((int)i < 0) {
	i = (i - 1 | 0xffffff00) + 1;
}

I wrote a small script to verify, and confirmed that this expression produces exactly the same result as % 256:

def mod256(n):
    n = n & 0x800000ff
    if n < 0:
        n = (n - 1 | 0xffffff00)

    return n

def main():
    for i in range(250, 260):
        print(mod256(i), end=" ")
    print("")

    return

if __name__ == "__main__":
    main()

That’s an interesting find.

In future reversing work, when I see this kind of decompiled pattern I can recognize it as a modulo operation.

Addendum (2022/1/2)

I initially assumed that all mod 256 code would produce the same output as shown above, but it turns out the output varies considerably depending on the compiler. Someone kindly pointed this out in a Twitter comment.

Source: original comment

Summary

This time, as a review of Harekaze mini CTF 2021, I implemented RC4 encryption and reverse-engineered it.

I made some new discoveries along the way, so it was well worth trying.

References

Practical Guide to Applied Cryptography

Published Jan 1, 2022

Aspiring Reverse Engineer and CTF Player (Team: 0nePadding). Passionate about WinDbg and Anti-Virus internals. OSCP / CISSP. Working at Microsoft Japan, but all views expressed are my own.かしわば(@kash1064) on Twitter