The Story of How Python's Dictionary 'in' and Array 'in' Are Completely Different

This page has been machine-translated from the original page.

This article describes the surprising discovery that Python’s in for checking if a key exists in a dictionary and the in for array search are completely different.

When doing competitive programming, you tend to be concerned about the computational complexity of each method.

For example, when using Python, as you probably know, the computational complexity of the Value in List syntax for checking if a value exists in a list is O(N), making it very time-consuming.

In my local environment, the execution time of the following code was about 600ms.

Arr = [i for i in range(pow(10, 7))]
print(pow(10, 5) in Arr)

This is because the implementation checks one by one whether a certain value exists in an array Arr with 10^7 elements defined by list comprehension.

On the other hand, Python’s dictionary type also has the Key in Dict notation for checking whether a key exists in a dictionary.

Embarrassingly, I had been under the impression that this computational complexity was also O(N) and time-consuming, so I thought it would be better to use the get method like Dict.get(Key) for key existence checking.

However, I learned for the first time that Python’s in for checking if a key exists in a dictionary and the in for array search have completely different implementations, and moreover, Key in Dict runs faster than Dict.get(Key), so I decided to write this article.

About the Key in Dict Implementation
Bonus: About the Value in List Implementation
Summary

About the Key in Dict Implementation

First, let’s look at the PyDict_Contains() function called by the Key in Dict notation.

/* Return 1 if `key` is in dict `op`, 0 if not, and -1 on error. */
int
PyDict_Contains(PyObject *op, PyObject *key)
{
    Py_hash_t hash;
    Py_ssize_t ix;
    PyDictObject *mp = (PyDictObject *)op;
    PyObject *value;

    if (!PyUnicode_CheckExact(key) ||
        (hash = ((PyASCIIObject *) key)->hash) == -1) {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return -1;
    }
    ix = _Py_dict_lookup(mp, key, hash, &value);
    if (ix == DKIX_ERROR)
        return -1;
    return (ix != DKIX_EMPTY && value != NULL);
}

Reference: cpython/dictobject.c python/cpython · GitHub

In Python, all objects are accessed through pointers of type PyObject. The PyObject *op given as an argument here points to the dictionary object being searched.

Looking at the PyDictContains function processing, it first calculates a hash, then uses the _Pydictlookup function with that hash as an argument to determine if the key exists in the dictionary. The processing of _Pydict_lookup() is omitted here.

In other words, unlike the list type’s Value in List syntax, key existence can be checked in O(logN) logarithmic time rather than O(N).

Now let’s check what processing the Dict.get(Key) syntax I had been using performs.

static PyObject *
dict_get_impl(PyDictObject *self, PyObject *key, PyObject *default_value)
/*[clinic end generated code: output=bba707729dee05bf input=279ddb5790b6b107]*/
{
    PyObject *val = NULL;
    Py_hash_t hash;
    Py_ssize_t ix;

    if (!PyUnicode_CheckExact(key) ||
        (hash = ((PyASCIIObject *) key)->hash) == -1) {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return NULL;
    }
    ix = _Py_dict_lookup(self, key, hash, &val);
    if (ix == DKIX_ERROR)
        return NULL;
    if (ix == DKIX_EMPTY || val == NULL) {
        val = default_value;
    }
    Py_INCREF(val);
    return val;
}

Reference: cpython/dictobject.c python/cpython · GitHub

Surprisingly, we can see there is almost no difference in implementation from the PyDict_Contains function called when using the Key in Dict syntax.

Taking a diff shows there is almost no difference in the main implementation parts.

Also, what surprised me was that the dictgetimpl function called by the Dict.get(Key) syntax calls the Py_INCREF() macro processing, making the execution time slightly longer.

Reference: Reference Counting — Python 3.9.4 Documentation

When I actually implemented a competitive programming problem using Python’s dictionary type with both the Key in Dict syntax and the Dict.get(Key) syntax, I confirmed that the Key in Dict syntax had an execution time a few milliseconds shorter.

Bonus: About the Value in List Implementation

I also took a look at the implementation of the Value in List syntax for checking if a value exists in a list.

As it appears, it’s an O(N) process that uses a for loop to compare values from the starting address of the list object.

static int
list_contains(PyListObject *a, PyObject *el)
{
    PyObject *item;
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i) {
        item = PyList_GET_ITEM(a, i);
        Py_INCREF(item);
        cmp = PyObject_RichCompareBool(item, el, Py_EQ);
        Py_DECREF(item);
    }
    return cmp;
}

Reference: cpython/listobject.c python/cpython · GitHub

Summary

While solving competitive programming problems, I had the question “Wouldn’t implementing key search with in cause TLE even though I’m using an associative array??” which led me to dig into CPython’s source code.

I wrote this article as a memorandum since I investigated seriously, but I feel my understanding of language specifications deepened by tracing CPython’s source, so I’d like to continue doing similar things in the future.

By the way, if you just want to know the computational complexity, TimeComplexity - Python Wiki is useful.

Published Jul 23, 2021

Aspiring Reverse Engineer and CTF Player (Team: 0nePadding). Passionate about WinDbg and Anti-Virus internals. OSCP / CISSP. Working at Microsoft Japan, but all views expressed are my own.かしわば(@kash1064) on Twitter

The Story of How Python's Dictionary 'in' and Array 'in' Are Completely Different

Table of Contents

About the Key in Dict Implementation

Bonus: About the Value in List Implementation

Summary