Conversation
brianhelba
left a comment
Relevant links:
- https://github.com/jschneier/django-storages/blob/master/storages/backends/s3.py#L549
- https://github.com/jschneier/django-storages/blob/ca89a94a7462a2423df460e7bfd5f847457042ca/storages/utils.py#L172
- https://github.com/jschneier/django-storages/blob/ca89a94a7462a2423df460e7bfd5f847457042ca/storages/backends/s3.py#L565
- https://docs.aws.amazon.com/boto3/latest/reference/services/s3/client/upload_fileobj.html
From a high-level perspective, this is definitely more correct.
At a low level, I'm not 100% certain whether this is doing true streaming, or just making half as many in-memory copies. Note that Boto3's `upload_fileobj` is passed a `ReadBytesWrapper`. The `ReadBytesWrapper.read` method passes its arguments through to the underlying file object, so if Boto3 calls `.read(size=...)` to read only a limited number of bytes at a time, then `size` will be passed through to the underlying Python file object and only a limited amount of memory will be used.
However, Boto3 is such a mess internally that I can't verify it does this; we could run it with a debug breakpoint to see, which I haven't yet done.
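Short of a debug breakpoint, one cheap way to check would be a small logging wrapper around `.read()` that records each requested size. Here is a self-contained sketch; `ReadLogger` and `chunked_consumer` are hypothetical stand-ins for illustration, not actual django-storages or Boto3 code:

```python
import io

class ReadLogger:
    """Forwards .read() to the wrapped file and records each requested size."""
    def __init__(self, fileobj):
        self._file = fileobj
        self.requested_sizes = []

    def read(self, size=-1):
        # Because size is forwarded unchanged, a bounded request stays
        # bounded: this wrapper never buffers the whole file itself.
        self.requested_sizes.append(size)
        return self._file.read(size)

def chunked_consumer(fileobj, chunk_size=8192):
    """Stands in for a streaming uploader that pulls bounded chunks."""
    total = 0
    while chunk := fileobj.read(chunk_size):
        total += len(chunk)
    return total

wrapped = ReadLogger(io.BytesIO(b"x" * 20_000))
total = chunked_consumer(wrapped)
print(total)                    # 20000
print(wrapped.requested_sizes)  # [8192, 8192, 8192, 8192]: bounded reads only
```

If a real run against Boto3 showed a single `read(-1)` (or `read()` with no argument) in the log, that would mean the whole file is still being materialized in memory.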
Also note that the `File` constructor reads the `.name` property from its argument, whereas `ContentFile` has no way to know the filename (it only receives `bytes`), so the files will now know their filenames. Whether or how the filename is used when determining the storage path is up to the `FileField.upload_to`.
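A stdlib-only illustration of why: an open file object carries its path in `.name`, which `File` can pick up, while a bare `bytes` buffer (all that `ContentFile` receives) has no name at all:

```python
import io
import os
import tempfile

# Create a real on-disk file to open.
with tempfile.NamedTemporaryFile(suffix=".tif", delete=False) as tmp:
    tmp.write(b"imagery")
    path = tmp.name

# An open file object knows its own path via .name.
with open(path, "rb") as f:
    file_name = f.name

# A bytes buffer has no .name attribute to inherit a filename from.
buffer_has_name = hasattr(io.BytesIO(b"imagery"), "name")

print(file_name == path, buffer_has_name)  # True False
os.unlink(path)
```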
Regardless, this is a clear correctness improvement.
Deploying geodatalytics with

| | |
| --- | --- |
| Latest commit: | 6d8ed37 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3c38e8da.geodatalytics.pages.dev |
| Branch Preview URL: | https://file-streams.geodatalytics.pages.dev |
This PR replaces instances of `ContentFile(f.read())` with `File(f)`, so that local file content can be streamed to file fields rather than loaded entirely into memory and then dumped. I tested this branch on the production EC2 worker by running the conversion task on the large "boston orthoimagery" dataset. On master, that task crashes with a memory error. On this branch, the task succeeds.