gh-101178: refactor base64.b85encode to be memory friendly #112248
romuald wants to merge 7 commits into python:main from …
Conversation
Force-pushed from eb34187 to 54793e7

I've changed the algorithm to use a dedicated generator, using …
Force-pushed from 54793e7 to 39b1d7e

Tested on a Linux VM, the new code is actually faster with the 5 MB dataset 🤔 Memory gain is as follows: branch: …
@pitrou any chance you could spare a little time to review this?
Could you post timeit numbers for both small and large inputs? (please be sure to compile in non-debug mode)
Lib/base64.py (outdated)

```python
for c in unpack512(b[offset:offset+512]):
    yield c
```

This can be simplified to:

```python
yield from unpack512(b[offset:offset+512])
```

Changed (I'm not yet used to `yield from` ^^)
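For context, here is roughly what the dedicated-generator approach looks like with the suggestion applied. This is a minimal sketch of mine, assuming the `!128I` struct format that appears later in this thread, with the sub-512-byte tail omitted:

```python
import struct

_unpack512 = struct.Struct("!128I").unpack  # 128 big-endian 32-bit words

def _iter_words(b):
    # Lazily yield 32-bit words instead of materializing one huge
    # list for the whole input, which is the memory-friendly part.
    offset = 0
    while offset + 512 <= len(b):
        yield from _unpack512(b[offset:offset + 512])
        offset += 512
    # ...the remaining <512-byte tail would be unpacked word by word here
```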
Lib/test/test_base64.py (outdated)

```python
# since the result is too large to fit inside a test,
# use a hash method to validate the test
self.assertEqual(len(result), 784)
self.assertEqual(hashlib.md5(result).hexdigest(), …)
```

I'm not sure md5 is always available. WDYT @gpshead ?

Good catch, forgot about this.
According to https://docs.python.org/3/library/hashlib.html#hashlib.algorithms_guaranteed md5 may not be present, I'll switch to a sha1
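A hedged sketch of what the switch to sha1 could look like; the input, expected digest, and test name here are placeholders of mine, not the PR's actual values:

```python
import hashlib
import unittest
from base64 import b85encode

# Precomputed once from a known-good implementation; placeholder here.
EXPECTED_SHA1 = "<precomputed hexdigest>"

class TestB85EncodeLarge(unittest.TestCase):
    def test_large_input(self):
        # The encoded result is too large to embed in the test source,
        # so validate it through a hash. Unlike md5, sha1 is listed in
        # hashlib.algorithms_guaranteed and is always available.
        result = b85encode(bytes(range(256)) * 10)
        self.assertEqual(hashlib.sha1(result).hexdigest(), EXPECTED_SHA1)
```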
Force-pushed from 899c88d to d0e7691
@pitrou here is a timeit script / results on a MacBook M1 (I don't have access to a Linux version right now):

```python
import timeit
import random
from base64 import b85encode

REPEAT = 5
COUNT = 10_000

SMALL_INPUT: bytes = b"hello world"
MEDIUM_INPUT: bytes  # 200 bytes
BIG_INPUT: bytes  # 5000 bytes

SMALL_COUNT = 500_000
MEDIUM_COUNT = 100_000
BIG_COUNT = 20_000

def init():
    global MEDIUM_INPUT, BIG_INPUT
    rnd = random.Random()
    rnd.seed(42)
    MEDIUM_INPUT = rnd.randbytes(200)
    BIG_INPUT = rnd.randbytes(5000)

def main():
    init()
    for name in "SMALL", "MEDIUM", "BIG":
        timer = timeit.Timer(f"b85encode({name}_INPUT)", globals=globals())
        count = globals()[f"{name}_COUNT"]
        values = timer.repeat(REPEAT, count)
        values = ", ".join("%.3fs" % x for x in values)
        print(f"Timeit {name} ({count} iterations): {values}")

if __name__ == '__main__':
    main()
```

Results: …
The performance decrease is a bit unfortunate. Instead of defining a separate generator, perhaps something like this:

```python
def _85encode(b, chars, chars2, pad=False, foldnuls=False, foldspaces=False):
    # Helper function for a85encode and b85encode
    if not isinstance(b, bytes_types):
        # TODO can this be `memoryview(b).cast('B')` instead?
        b = memoryview(b).tobytes()

    def encode_words(words):
        # Encode a sequence of 32-bit words, excluding padding
        chunks = [b'z' if foldnuls and not word else
                  b'y' if foldspaces and word == 0x20202020 else
                  (chars2[word // 614125] +
                   chars2[word // 85 % 7225] +
                   chars[word % 85])
                  for word in words]
        return b''.join(chunks)

    n1 = len(b) // 512  # number of 512-byte unpacks
    n2 = (len(b) - n1 * 512) // 4  # number of 4-byte unpacks
    padding = (-len(b)) % 4
    unpack512 = struct.Struct("!128I").unpack
    unpack4 = struct.Struct("!I").unpack

    offset = 0
    blocks = []
    for _ in range(n1):
        blocks.append(encode_words(unpack512(b[offset:offset+512])))
        offset += 512
    for _ in range(n2):
        blocks.append(encode_words(unpack4(b[offset:offset+4])))
        offset += 4
    if padding:
        # TODO deal with last bytes and padding...
        ...
    return b''.join(blocks)
```
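For what it's worth, a hedged sketch of how the TODO's tail handling could work, following the semantics of the existing helper in Lib/base64.py (pad the last partial word with NULs, encode it, then trim the padding characters unless pad=True). It deliberately skips the z/y folding for the padded word, since a folded one-character chunk could not be trimmed, the same pitfall as the b85encode regression mentioned below:

```python
if padding:
    # Pad the trailing 1-3 bytes up to a full 32-bit word with NULs
    word = unpack4(b[offset:] + b'\0' * padding)[0]
    # Encode without 'z'/'y' folding: a folded final chunk could not
    # have its padding characters trimmed afterwards
    tail = (chars2[word // 614125] +
            chars2[word // 85 % 7225] +
            chars[word % 85])
    # Keep only the characters that encode real input bytes,
    # unless padded output was requested
    blocks.append(tail if pad else tail[:-padding])
```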
Performance regression also on Linux amd64: …

I find it a bit strange, because in my initial tests a few months back the execution was slower on macOS but slightly faster on Linux. I attributed that to the memory allocation being costly enough to be a factor.
@pitrou sorry for the huge lag. I gave up after spending a lot of time trying to find a way to be both CPU and memory friendly for small and large datasets, until last week, when I realized that I could simply rewrite the function in C (which I have done) to get both performance and memory improvements.

My question is: should I create a new PR or …
You surely can do both :)
Initially done to reduce the huge memory consumption of the previous implementation for large inputs, after no memory-friendly Python approach was found that did not include a performance regression. This implementation also greatly improves performance in all cases.

Signed-off-by: Romuald Brunet <romuald@chivil.com>

A regression was found while testing the new C implementation, when foldspaces was used with b85encode (since a chunk could end in z without having been folded).
Force-pushed from d0e7691 to 74fc245

@pitrou / @sobolevn I rewrote this PR from scratch to use a C implementation instead of a Python one.

Note that I do not consider myself a seasoned C developer, so the implementation may be lacking. I've tried to maximize compatibility with the previous implementation; even if the _85encode method is private, we could drop …

I've also added a test to check for a regression I found while testing the new implementation with random data.
Modules/binascii.c (outdated)

```c
}

/*[clinic input]
binascii.b2a_base85
```

It would feel weird not to have binascii.a2b_base85, so I would suggest keeping it private for now. Ideally, base64 should have its own C accelerator module, but maybe it's overkill.

Renamed to private.
That was what delayed me initially, because I had no idea how to add a module dedicated to base64; I only found out last week that it uses the binascii one.
By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.
Apply suggestions

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

Thanks, I didn't know where to look when I started (the base algorithm originally came from an LLM :/)

Taking inspiration from git's code, this could be used instead: …
There is a 20% performance gain for large inputs (starting at 5 kB) since the potential padding does not need to be computed for each chunk. Shall I use this method instead?

Yes, since base85 can be used for encoding large inputs it's worth it, I think.
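To illustrate the technique from git's base85.c that the commits below adopt: full 4-byte chunks are encoded in a tight loop with no per-chunk size check, each 5-character group is filled by dividing by 85 from the last position backwards, and the short tail is padded and handled once after the loop. A rough Python transcription of mine, not the PR's actual C code:

```python
import struct

# Standard base85 alphabet, as used by Lib/base64.py and git
_B85CHARS = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")

def b85encode_gitstyle(data: bytes) -> bytes:
    out = bytearray()
    n_full = len(data) // 4
    # Hot loop: full chunks only, so no size check per iteration
    for (word,) in struct.iter_unpack("!I", data[:n_full * 4]):
        pos = len(out)
        out += b"\0" * 5
        for i in range(4, -1, -1):      # fill digits from the end
            out[pos + i] = _B85CHARS[word % 85]
            word //= 85
    rest = len(data) - n_full * 4
    if rest:                            # padding handled once, outside the loop
        word = int.from_bytes(data[-rest:] + b"\0" * (4 - rest), "big")
        digits = bytearray(5)
        for i in range(4, -1, -1):
            digits[i] = _B85CHARS[word % 85]
            word //= 85
        out += digits[:rest + 1]        # drop padding-only characters
    return bytes(out)
```

For this alphabet the output should match base64.b85encode; the point of the structure is that the hot loop stays free of length checks, which is where the large-input gain comes from.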
Inspired from git source https://github.com/git/git/blob/03944513488db4a81fdb4c21c3b515e4cb260b05/base85.c#L79

This avoids checking the chunk size on every iteration and thus improves performance.

Since j is not unsigned anymore, we can reverse the table lookup loop.
```c
size_t i = 0;
int padding = 0;

while (i < bin_len) {
```

I would also credit the git implementation for this one, as it's heavily based on it.

Credit added. I don't know if I should phrase it differently?
I've also added some other comments to try to explain the logic more.
Current description

Rewrote the base64._85encode method logic in C, by plugging into the binascii module (which already takes care of the base64 methods).

By using C and a single buffer, the memory use is reduced to a minimum, addressing the initial issue.

It also greatly improves performance as a bonus (main vs. branch numbers: …).

Script used to test: https://gist.github.com/romuald/7aeba5f40693bb351da4abe62ad7321d
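A minimal way to eyeball the memory claim (my own sketch, separate from the gist above; tracemalloc hooks CPython's allocators, so it should also see the bytes objects created by the C code):

```python
import tracemalloc
from base64 import b85encode

data = b"\x55" * (5 * 1024 * 1024)  # 5 MB of arbitrary input

tracemalloc.start()
encoded = b85encode(data)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The output itself is len(data) * 5 / 4 bytes; anything much beyond
# that in `peak` is overhead from intermediate allocations.
print(f"output: {len(encoded)} bytes, peak traced: {peak} bytes")
```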
Previous description (python refactor)

Not up to date with the current PR.

Refactor the code to make use of generators instead of allocating 2 potentially huge lists for large datasets.

Memory gain only measured using macOS and a 5 MB input.

Using main: … With refactor: …

The execution time is more than doubled, which may not be acceptable. However, the memory used is reduced by more than 90%.
edit: changed the algorithm to be more efficient, the performance decrease now seems to be negligible

I also have no idea how (and if) I should test this.

Here is the script I've used to measure the execution time; the memdebug can probably be adapted to read /proc/{pid} on Linux.
edit: updated to work on Linux too
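For reference, a hedged sketch of such a Linux adaptation; the helper name and the VmHWM approach are my own illustration, reading the process's peak resident set size from /proc:

```python
import re

def peak_rss_kib(pid="self"):
    # VmHWM ("high water mark") is the peak resident set size of the
    # process, reported by the kernel in kilobytes.
    with open(f"/proc/{pid}/status") as f:
        match = re.search(r"^VmHWM:\s+(\d+)\s+kB", f.read(), re.MULTILINE)
    return int(match.group(1))

if __name__ == "__main__":
    from base64 import b85encode
    b85encode(b"\x00" * (5 * 1024 * 1024))
    print(f"peak RSS: {peak_rss_kib()} KiB")
```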