py: Add support for nested f-strings within f-strings#18588

Merged
dpgeorge merged 6 commits into micropython:master from
dpgeorge:py-lexer-support-nested-fstrings
Feb 4, 2026

Conversation

@dpgeorge
Member

Summary

During the discussion about t-string support (#17557), I thought that it might not be much effort to support nested f-strings with the current f-string parser. It turned out to be relatively simple.

The way the MicroPython f-string parser works is:

  1. it extracts the f-string arguments (things in curly braces) into a temporary buffer (a vstr)
  2. once the f-string ends (reaches its closing quote) the lexer switches to tokenizing the temporary buffer
  3. once the buffer is empty it switches back to the stream.

The temporary buffer can easily hold f-strings itself (i.e. nested f-strings) and they can be re-parsed by the lexer using the same algorithm. The only thing stopping that from working is that the temporary buffer can't be reused for the nested f-string because it's currently being parsed.
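
As a rough illustration of what this extraction amounts to (a sketch of the assumed equivalence, not the lexer's literal output), an f-string behaves like its literal text with `{}` placeholders, followed by a `.format(...)` call that holds the extracted arguments. The variable names below are made up for the example.

```python
name, count = "lexer", 2

# The f-string and the str.format() call it is assumed to be equivalent to:
# the literal keeps "{}" placeholders, the arguments live in a separate buffer.
via_fstring = f"{name} uses {count} buffers"
via_format = "{} uses {} buffers".format(name, count)

# A nested f-string nests the .format() calls in the same way. Different
# quote styles are used here so the example also runs on older CPython.
nested_fstring = f"[{f'<{name}>'}]"
nested_format = "[{}]".format("<{}>".format(name))

print(via_fstring == via_format, nested_fstring == nested_format)
```

The nested case is the one that needs the argument buffer to contain an f-string of its own.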

This PR fixes that by adding a second temporary buffer, which is the "injection" buffer. That allows an arbitrary number of nesting levels with a simple modification to the original algorithm:

  1. when an f-string is encountered the string is parsed and its arguments are extracted into fstring_args
  2. when the f-string finishes, fstring_args is inserted into the current position in inject_chrs (which is the start of that buffer if no injection is ongoing)
  3. fstring_args is now cleared and ready for any further f-strings (nested or not)
  4. the lexer switches to inject_chrs if it's not already reading from it
  5. if an f-string appeared inside the f-string then it is in inject_chrs and can be processed as before, extracting its arguments into fstring_args, which can then be inserted again into inject_chrs
  6. once inject_chrs is exhausted (meaning that all levels of f-strings have been fully processed) the lexer switches back to tokenizing the stream

Amazingly, this scheme supports arbitrary numbers of nestings of f-strings using the same quote style.
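
The six steps above can be modelled with a small Python toy. This is only a sketch under simplifying assumptions, not the code in py/lexer.c: the function name `desugar_fstrings` and its helpers are hypothetical, the "tokenizer" just collects text, and the rewrite to `.format(...)` is assumed. Only `inject` and `args` correspond to `inject_chrs` and `fstring_args`.

```python
def desugar_fstrings(src: str) -> str:
    """Toy model: rewrite f-strings in src into str.format() calls."""
    out = []       # text handed on to the "tokenizer" (here: just collected)
    inject = []    # models inject_chrs
    ipos = 0       # read position inside inject
    spos = 0       # read position inside the source stream

    def peek() -> str:
        if ipos < len(inject):
            return inject[ipos]
        return src[spos] if spos < len(src) else ""

    def next_char() -> str:
        nonlocal ipos, spos
        if ipos < len(inject):
            ipos += 1
            return inject[ipos - 1]
        if inject:
            # Step 6: injection buffer exhausted, go back to the stream.
            inject.clear()
            ipos = 0
        if spos < len(src):
            spos += 1
            return src[spos - 1]
        return ""  # end of input

    def parse_fstring(quote: str) -> None:
        literal = []   # string text with each argument replaced by "{}"
        args = []      # models fstring_args
        depth = 0      # brace nesting depth inside the current argument
        while True:
            c = next_char()
            if c == "":
                raise ValueError("unterminated f-string")
            if depth == 0:
                if c == quote:
                    break
                if c == "{":
                    depth = 1
                    literal.append("{}")
                    args.append("")
                else:
                    literal.append(c)
            else:
                if c == "{":
                    depth += 1
                elif c == "}":
                    depth -= 1
                if depth > 0:
                    args[-1] += c  # step 1: copy argument text verbatim
        out.append(quote + "".join(literal) + quote)
        # Steps 2-4: insert the arguments at the current read position of
        # the injection buffer; reading then continues from that buffer.
        inject[ipos:ipos] = ".format(" + ", ".join(args) + ")"

    while True:
        c = next_char()
        if c == "":
            return "".join(out)
        if c == "f" and peek() in ("'", '"'):
            parse_fstring(next_char())  # step 5 when read from inject
        else:
            out.append(c)


print(desugar_fstrings('f"{f"{x}"}"'))
```

Because quotes are only treated as terminators outside braces, the same quote style can be reused at every nesting level in this toy. It deliberately ignores `{{` escapes, format specs, conversions, and braces or quotes inside string literals within an argument.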

Testing

A new test is added which will run under CI.

Trade-offs and Alternatives

This adds some code size and a bit more memory usage for the lexer. In particular, for a single (non-nested) f-string it now makes an extra copy of the fstring_args data when copying it across to inject_chrs. That could possibly be optimized to reuse the same buffer (inject_chrs would steal the memory from fstring_args).

Otherwise, memory use only goes up with the complexity of nested f-strings.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Dec 18, 2025
@github-actions

github-actions bot commented Dec 18, 2025

Code size report:

Reference:  tests/run-tests.py: Abort test run if enter_raw_repl fails many times. [6436f8b]
Comparison: py/lexer: Move f-string completion code to more logical location. [merge of c9f747c]
  mpy-cross:   -32 -0.008% 
   bare-arm:   -12 -0.021% 
minimal x86:   -94 -0.050% 
   unix x64:   -80 -0.009% standard
      stm32:   +44 +0.011% PYBV10
      esp32:   +20 +0.001% ESP32_GENERIC
     mimxrt:   +48 +0.013% TEENSY40
        rp2:   +56 +0.006% RPI_PICO_W
       samd:   +44 +0.016% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +66 +0.014% VIRT_RV32

@dpgeorge dpgeorge force-pushed the py-lexer-support-nested-fstrings branch from aba1901 to b51b102 Compare December 18, 2025 13:28
@codecov

codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.41%. Comparing base (6436f8b) to head (c9f747c).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #18588   +/-   ##
=======================================
  Coverage   98.41%   98.41%           
=======================================
  Files         171      171           
  Lines       22324    22326    +2     
=======================================
+ Hits        21971    21973    +2     
  Misses        353      353           

@dpgeorge dpgeorge force-pushed the py-lexer-support-nested-fstrings branch 4 times, most recently from b0e571f to 6043a9e Compare December 22, 2025 02:36
Contributor

@dlech dlech left a comment

Impressive that this can be done without increasing code size much. The commit messages are a bit short on the "why" reasoning behind the changes.

py/lexer.c Outdated
@@ -409,7 +417,7 @@ static void parse_string_literal(mp_lexer_t *lex, bool is_raw, bool is_fstring)
// note: "c" can never be MP_LEXER_EOF because next_char
Contributor

Comment still references MP_LEXER_EOF.

Member Author

I turned this into a goto and eliminated the sentinel.

py/lexer.c Outdated
// always inserts a newline at the end of the input stream
case '\n':
- c = MP_LEXER_EOF;
+ c = (unichar)(-1);
Contributor

I guess this could be given a name like MP_LEXER_SENTINEL?

Member Author

This is now a goto.

py/lexer.h Outdated
mp_reader_t reader; // stream source

- unichar chr0, chr1, chr2; // current cached characters from source
+ uint8_t chr0, chr1, chr2; // current cached characters from source
Contributor

Comment sounds wrong now. I guess these are bytes of a multi-byte character (or used to test if they are part of a multi-byte character or not).

Member Author

Comment updated.

@dpgeorge
Member Author

> Impressive that this can be done without increasing code size much. The commit messages are a bit short on the "why" reasoning behind the changes.

Thanks for the review. Yes, the commit messages (and commits themselves) are not finalised because I went back and forth with a failing unix-qemu CI that I couldn't reproduce locally, and at the same time trying to reduce the code size.

@dpgeorge dpgeorge force-pushed the py-lexer-support-nested-fstrings branch 2 times, most recently from 12cc3e5 to 5413044 Compare January 4, 2026 14:52
@dpgeorge
Member Author

dpgeorge commented Jan 5, 2026

I spent a lot of time tuning the commits here for code size. Locally I get quite different results to the CI's code size report. I'm using gcc 15.2.1 and arm-none-eabi-gcc 14.2.0. I get for this whole PR:

  mpy-cross:   -64 -0.017%
   bare-arm:   -12 -0.021%
minimal x86:   -74 -0.040%
   unix x64:  -176 -0.021% standard
      stm32:   +28 +0.007% PYBV10
    esp8266:  +500 +0.071% ESP8266_GENERIC
        rp2:   +40 +0.004% RPI_PICO_W

That's pretty good, except for esp8266, which increases by a lot due to its need to load/store all 8/16-bit values using a 32-bit load/store and masking (the -mforce-l32 compiler option). I could add a minor workaround for that to cut it down by 464 bytes (change uint8_t chr1, chr2 to uint32_t chr1, chr2 in mp_lexer_t). Is that worth doing?

@dpgeorge
Member Author

dpgeorge commented Jan 7, 2026

This has been rebased on latest master to update the code_size CI to use ubuntu-latest, which changes the GCC version used for code size comparison, which changes the results. The code size diff is now slightly better, and matches more closely what I get locally.

This saves about 4 bytes on ARM Cortex-M, and about 50-60 bytes on x86-64.
It also allows the upcoming `vstr_ins_strn()` function to be inline as
well, and have less of a code-size impact when used.

Signed-off-by: Damien George <damien@micropython.org>
This is now an easy function to define as inline, so it does not impact
code size unless it's used.

Signed-off-by: Damien George <damien@micropython.org>
Having this check takes code size and execution time, and it's not
necessary: all callers of this function pass a non-zero value for
`byte_len` already.  And even if `byte_len` was zero, the code would still
perform correctly.

Signed-off-by: Damien George <damien@micropython.org>
The null byte cannot exist in source code (per CPython), so use it to
indicate the end of the input stream (instead of `(mp_uint_t)-1`).  This
allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit
bytes, making it clear that they are not `unichar` values.  It also saves a
bit of memory in the `mp_lexer_t` data structure.  (And in a future commit
allows the saved cache chars to be eliminated entirely by storing them in
a vstr instead.)

In order to keep code size down, the frequently used `chr0` is still of
type `uint32_t`.  Having it 32-bit means that machine instructions to load
it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed
to `uint8_t`).

Also add tests for invalid bytes in the input stream to make sure there are
no regressions in this regard.

Signed-off-by: Damien George <damien@micropython.org>
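
The sentinel idea in the commit message above can be sketched in Python as follows. This is a simplified model, not the lexer's code: the classes `ByteReader` and `Lookahead` are hypothetical, and only the names `chr0`, `chr1` and `chr2` come from the commit message.

```python
class ByteReader:
    """Hypothetical reader: yields source bytes, then 0 once input ends."""

    END = 0  # sentinel, relying on null bytes not occurring in valid source

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def read_byte(self) -> int:
        if self._pos >= len(self._data):
            return self.END
        b = self._data[self._pos]
        self._pos += 1
        return b


class Lookahead:
    """Three cached bytes, loosely modelled on chr0/chr1/chr2."""

    def __init__(self, reader: ByteReader):
        self._reader = reader
        self.chr0 = reader.read_byte()
        self.chr1 = reader.read_byte()
        self.chr2 = reader.read_byte()

    def advance(self) -> None:
        # Shift the cache along by one byte and refill the last slot.
        self.chr0, self.chr1 = self.chr1, self.chr2
        self.chr2 = self._reader.read_byte()

    def at_end(self) -> bool:
        return self.chr0 == ByteReader.END


la = Lookahead(ByteReader(b"ab"))
consumed = []
while not la.at_end():
    consumed.append(la.chr0)
    la.advance()
print(bytes(consumed))
```

Every cached value, including the end marker, fits in the range 0..255, which is what lets the cache use 8-bit storage. In this sketch a stray null byte in the input would simply look like the end of input; how the real lexer reports invalid bytes is not modelled here.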
It turns out that it's relatively simple to support nested f-strings, which
is what this commit implements.

The way the MicroPython f-string parser works at the moment is:
1. it extracts the f-string arguments (things in curly braces) into a
   temporary buffer (a vstr)
2. once the f-string ends (reaches its closing quote) the lexer switches to
   tokenizing the temporary buffer
3. once the buffer is empty it switches back to the stream.

The temporary buffer can easily hold f-strings itself (ie nested f-strings)
and they can be re-parsed by the lexer using the same algorithm.  The only
thing stopping that from working is that the temporary buffer can't be
reused for the nested f-string because it's currently being parsed.

This commit fixes that by adding a second temporary buffer, which is the
"injection" buffer.  That allows an arbitrary number of nestings with a simple
modification to the original algorithm:
1. when an f-string is encountered the string is parsed and its arguments
   are extracted into `fstring_args`
2. when the f-string finishes, `fstring_args` is inserted into the current
   position in `inject_chrs` (which is the start of that buffer if no
   injection is ongoing)
3. `fstring_args` is now cleared and ready for any further f-strings
   (nested or not)
4. the lexer switches to `inject_chrs` if it's not already reading from it
5. if an f-string appeared inside the f-string then it is in `inject_chrs`
   and can be processed as before, extracting its arguments into
   `fstring_args`, which can then be inserted again into `inject_chrs`
6. once `inject_chrs` is exhausted (meaning that all levels of f-strings
   have been fully processed) the lexer switches back to tokenizing the
   stream.

Amazingly, this scheme supports arbitrary numbers of nestings of f-strings
using the same quote style.

This adds some code size and a bit more memory usage for the lexer.  In
particular for a single (non-nested) f-string it now makes an extra copy of
the `fstring_args` data, when copying it across to `inject_chrs`.
Otherwise, memory use only goes up with the complexity of nested f-strings.

Signed-off-by: Damien George <damien@micropython.org>
This way, the use of `lex->fstring_args` is fully self contained within the
string literal parsing section of `mp_lexer_to_next()`.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge dpgeorge force-pushed the py-lexer-support-nested-fstrings branch from 1fbeb16 to c9f747c Compare February 4, 2026 12:22
@dpgeorge dpgeorge merged commit c9f747c into micropython:master Feb 4, 2026
76 checks passed
@dpgeorge dpgeorge deleted the py-lexer-support-nested-fstrings branch February 4, 2026 12:45

Labels

py-core Relates to py/ directory in source

2 participants