MicroPython string interning

MicroPython uses string interning to save both RAM and ROM. This avoids having to store duplicate copies of the same string. Primarily, this applies to identifiers in your code, as something like a function or variable name is very likely to appear in multiple places in the code. In MicroPython an interned string is called a QSTR (uniQue STRing).

A QSTR value (with type qstr) is a index into a linked list of QSTR pools. QSTRs store their length and a hash of their contents for fast comparison during the de-duplication process. All bytecode operations that work with strings use a QSTR argument.

Compile-time QSTR generation

In the MicroPython C code, any strings that should be interned in the final firmware are written as MP_QSTR_Foo. At compile time this will evaluate to a qstr value that points to the index of "Foo" in the QSTR pool.

A multi-step process in the Makefile makes this work. In summary this process has three parts:

  1. Find all MP_QSTR_Foo tokens in the code.

  2. Generate a static QSTR pool containing all the string data (including lengths and hashes).

  3. Replace all MP_QSTR_Foo (via the preprocessor) with their corresponding index.

MP_QSTR_Foo tokens are searched for in two sources:

  1. All files referenced in $(SRC_QSTR). This is all C code (i.e. py, extmod, ports/stm32) but not including third-party code such as lib.

  2. Additional $(QSTR_GLOBAL_DEPENDENCIES) (which includes mpconfig*.h).

Note: frozen_mpy.c (generated by mpy-tool.py) has its own QSTR generation and pool.

Some additional strings that can’t be expressed using the MP_QSTR_Foo syntax (e.g. they contain non-alphanumeric characters) are explicitly provided in qstrdefs.h and qstrdefsport.h via the $(QSTR_DEFS) variable.

Processing happens in the following stages:

  1. qstr.i.last is the concatenation of putting every single input file through the C pre-processor. This means that any conditionally disabled code will be removed, and macros expanded. This means we don’t add strings to the pool that won’t be used in the final firmware. Because at this stage (thanks to the NO_QSTR macro added by QSTR_GEN_CFLAGS) there is no definition for MP_QSTR_Foo it passes through this stage unaffected. This file also includes comments from the preprocessor that include line number information. Note that this step only uses files that have changed, which means that qstr.i.last will only contain data from files that have changed since the last compile.

  2. qstr.split is an empty file created after running makeqstrdefs.py split on qstr.i.last. It’s just used as a dependency to indicate that the step ran. This script outputs one file per input C file, genhdr/qstr/...file.c.qstr, which contains only the matched QSTRs. Each QSTR is printed as Q(Foo). This step is necessary to combine the existing files with the new data generated from the incremental update in qstr.i.last.

  3. qstrdefs.collected.h is the output of concatenating genhdr/qstr/* using makeqstrdefs.py cat. This is now the full set of MP_QSTR_Foo’s found in the code, now formatted as Q(Foo), one-per-line, with duplicates. This file is only updated if the set of qstrs has changed. A hash of the QSTR data is written to another file (qstrdefs.collected.h.hash) which allows it to track changes across builds.

  4. Generate an enumeration, each entry of which maps a MP_QSTR_Foo to it’s corresponding index. It concatenates qstrdefs.collected.h with qstrdefs*.h, then it transforms each line from Q(Foo) to "Q(Foo)" so they pass through the preprocessor unchanged. Then the preprocessor is used to deal with any conditional compilation in qstrdefs*.h. Then the transformation is undone back to Q(Foo), and saved as qstrdefs.preprocessed.h.

  5. qstrdefs.generated.h is the output of makeqstrdata.py. For each Q(Foo) in qstrdefs.preprocessed.h (plus some extra hard-coded ones), it outputs QDEF(MP_QSTR_Foo, (const byte*)"hash" "Foo").

Then in the main compile, two things happen with qstrdefs.generated.h:

  1. In qstr.h, each QDEF becomes an entry in an enum, which makes MP_QSTR_Foo available to code and equal to the index of that string in the QSTR table.

  2. In qstr.c, the actual QSTR data table is generated as elements of the mp_qstr_const_pool->qstrs.

Run-time QSTR generation

Additional QSTR pools can be created at runtime so that strings can be added to them. For example, the code:

foo[x] = 3

Will need to create a QSTR for the value of x so it can be used by the “load attr” bytecode.

Also, when compiling Python code, identifiers and literals need to have QSTRs created. Note: only literals shorter than 10 characters become QSTRs. This is because a regular string on the heap always takes up a minimum of 16 bytes (one GC block), whereas QSTRs allow them to be packed more efficiently into the pool.

QSTR pools (and the underlying “chunks” that store the string data) are allocated on-demand on the heap with a minimum size.