py: Add support for nested f-strings within f-strings#18588
py: Add support for nested f-strings within f-strings#18588dpgeorge merged 6 commits intomicropython:masterfrom
Conversation
|
Code size report: |
aba1901 to
b51b102
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18588 +/- ##
=======================================
Coverage 98.41% 98.41%
=======================================
Files 171 171
Lines 22324 22326 +2
=======================================
+ Hits 21971 21973 +2
Misses 353 353 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
b0e571f to
6043a9e
Compare
dlech
left a comment
There was a problem hiding this comment.
Impressive that this can be done without increasing code size much. The commit messages are a bit short on the "why" reasoning behind the changes.
py/lexer.c
Outdated
| @@ -409,7 +417,7 @@ static void parse_string_literal(mp_lexer_t *lex, bool is_raw, bool is_fstring) | |||
| // note: "c" can never be MP_LEXER_EOF because next_char | |||
There was a problem hiding this comment.
Comment still references MP_LEXER_EOF.
There was a problem hiding this comment.
I turned this into a goto and eliminated the sentinel.
py/lexer.c
Outdated
| // always inserts a newline at the end of the input stream | ||
| case '\n': | ||
| c = MP_LEXER_EOF; | ||
| c = (unichar)(-1); |
There was a problem hiding this comment.
I guess this could be given a name like MP_LEXER_SENTINEL?
py/lexer.h
Outdated
| mp_reader_t reader; // stream source | ||
|
|
||
| unichar chr0, chr1, chr2; // current cached characters from source | ||
| uint8_t chr0, chr1, chr2; // current cached characters from source |
There was a problem hiding this comment.
Comment sounds wrong now. I guess these are bytes of a multi-byte character (or used to test it they are part of a multi-byte character or not).
Thanks for the review. Yes, the commit messages (and commits themselves) are not finalised because I went back and forth with a failing unix-qemu CI that I couldn't reproduce locally, and at the same time trying to reduce the code size. |
12cc3e5 to
5413044
Compare
|
I spent a lot of time tuning the commits here for code size. Locally I get quite different results to the CI's code size report. I'm using gcc 15.2.1 and arm-none-eabi-gcc 14.2.0. I get for this whole PR: That's pretty good, except esp8266 which increases by a lot due to it's need to load/store all 8/16-bit values using a 32-bit load/store and masking (the |
928f161 to
1fbeb16
Compare
|
This has been rebased on latest master to update the code_size CI to use |
This saves about 4 bytes on ARM Cortex-M, and about 50-60 bytes on x86-64. It also allows the upcoming `vstr_ins_strn()` function to be inline as well, and have less of a code-size impact when used. Signed-off-by: Damien George <damien@micropython.org>
This is now an easy function to define as inline, so it does not impact code size unless it's used. Signed-off-by: Damien George <damien@micropython.org>
Having this check takes code size and execution time, and it's not necessary: all callers of this function pass a non-zero value for `byte_len` already. And even if `byte_len` was zero, the code would still perform correctly. Signed-off-by: Damien George <damien@micropython.org>
The null byte cannot exist in source code (per CPython), so use it to indicate the end of the input stream (instead of `(mp_uint_t)-1`). This allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit bytes, making it clear that they are not `unichar` values. It also saves a bit of memory in the `mp_lexer_t` data structure. (And in a future commit allows the saved cache chars to be eliminated entirely by storing them in a vstr instead.) In order to keep code size down, the frequently used `chr0` is still of type `uint32_t`. Having it 32-bit means that machine instructions to load it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed to `uint8_t`). Also add tests for invalid bytes in the input stream to make sure there are no regressions in this regard. Signed-off-by: Damien George <damien@micropython.org>
It turns out that it's relatively simple to support nested f-strings, which is what this commit implements. The way the MicroPython f-string parser works at the moment is: 1. it extracts the f-string arguments (things in curly braces) into a temporary buffer (a vstr) 2. once the f-string ends (reaches its closing quote) the lexer switches to tokenizing the temporary buffer 3. once the buffer is empty it switches back to the stream. The temporary buffer can easily hold f-strings itself (ie nested f-strings) and they can be re-parsed by the lexer using the same algorithm. The only thing stopping that from working is that the temporary buffer can't be reused for the nested f-string because it's currently being parsed. This commit fixes that by adding a second temporary buffer, which is the "injection" buffer. That allows arbitrary number of nestings with a simple modification to the original algorithm: 1. when an f-string is encountered the string is parsed and its arguments are extracted into `fstring_args` 2. when the f-string finishes, `fstring_args` is inserted into the current position in `inject_chrs` (which is the start of that buffer if no injection is ongoing) 3. `fstring_args` is now cleared and ready for any further f-strings (nested or not) 4. the lexer switches to `inject_chrs` if it's not already reading from it 5. if an f-string appeared inside the f-string then it is in `inject_chrs` and can be processed as before, extracting its arguments into `fstring_args`, which can then be inserted again into `inject_chrs` 6. once `inject_chrs` is exhausted (meaning that all levels of f-strings have been fully processed) the lexer switched back to tokenizing the stream. Amazingly, this scheme supports arbitrary numbers of nestings of f-strings using the same quote style. This adds some code size and a bit more memory usage for the lexer. In particular for a single (non-nested) f-string it now makes an extra copy of the `fstring_args` data, when copying it across to `inject_chrs`. Otherwise, memory use only goes up with the complexity of nested f-strings. Signed-off-by: Damien George <damien@micropython.org>
This way, the use of `lex->fstring_args` is fully self contained within the string literal parsing section of `mp_lexer_to_next()`. Signed-off-by: Damien George <damien@micropython.org>
1fbeb16 to
c9f747c
Compare
Summary
During the discussion about t-string support #17557, I thought that it might not be much effort to support nested f-strings with the current f-string parser. And it turned out relatively simple.
The way the MicroPython f-string parser works is:
The temporary buffer can easily hold f-strings itself (ie nested f-strings) and they can be re-parsed by the lexer using the same algorithm. The only thing stopping that from working is that the temporary buffer can't be reused for the nested f-string because it's currently being parsed.
This PR fixes that by adding a second temporary buffer, which is the "injection" buffer. That allows arbitrary number of nestings with a simple modification to the original algorithm:
fstring_argsfstring_argsis inserted into the current position ininject_chrs(which is the start of that buffer if no injection is ongoing)fstring_argsis now cleared and ready for any further f-strings (nested or not)inject_chrsif it's not already reading from itinject_chrsand can be processed as before, extracting its arguments intofstring_args, which can then be inserted again intoinject_chrsinject_chrsis exhausted (meaning that all levels of f-strings have been fully processed) the lexer switched back to tokenizing the streamAmazingly, this scheme supports arbitrary numbers of nestings of f-strings using the same quote style.
Testing
A new test is added which will run under CI.
Trade-offs and Alternatives
This adds some code size and a bit more memory usage for the lexer. In particular for a single (non-nested) f-string it now makes an extra copy of the
fstring_argsdata, when copying it across toinject_chrs. That could possibly be optimized to reuse the same buffer (inject_chrswould steal the memory fromfstring_args).Otherwise, memory use only goes up with the complexity of nested f-strings.