Fix for the issues with >2GB files and changesets #6068
Conversation
```diff
 		c = *p;
 		use(1);
-		size += (c & 0x7f) << shift;
+		size += (size_t)(c & 0x7f) << shift;
```
This is one of the main fixes: shifting by >= 32 results in undefined behavior and error -5 (buffer error) on >2GB streams.
It may be clearer to move the cast inside the parentheses, so that readers won't have to remember whether the cast or the << operator takes precedence.
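For illustration, a minimal sketch of the size-decoding loop with the cast placed inside the parentheses; this is not the actual Git function, and the names (`decode_size`, `p`) are illustrative:

```c
#include <stddef.h>

/* Sketch of pack object-header size decoding with the suggested cast
 * placement; names are illustrative, not Git's actual code. */
static size_t decode_size(const unsigned char **p)
{
	unsigned char c = *(*p)++;
	size_t size = c & 0x0f;   /* low 4 bits of the first byte */
	unsigned shift = 4;

	while (c & 0x80) {        /* high bit set: more bytes follow */
		c = *(*p)++;
		/* Cast before shifting: (c & 0x7f) is an int, and shifting
		 * a 32-bit int by >= 32 is undefined behavior. The extra
		 * parentheses make the evaluation order explicit. */
		size += ((size_t)(c & 0x7f)) << shift;
		shift += 7;
	}
	return size;
}
```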
```c
static void zlib_post_call(git_zstream *s, int status)
{
	unsigned long bytes_consumed;
	/* ... */
```
This function is, in my opinion, the only slightly "risky" change in the entire PR: since we can't change the zlib structure, where these counters are defined as uLong, I had to modify the flow a bit. But I believe it will behave correctly; at least from Git's point of view the total_out number will be consistent, and we will not get a BUG report due to overflow in the calculation.
Git does wrap the zlib code, so many of those ancillary variables are within Git's control.
There is a tight inner loop that uses the zlib API and quickly passes bigger blocks on to the wider Git API.
[from my memory]
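To make that pattern concrete, here is a hedged sketch of such a wrapper, not Git's actual `git_zstream` code (assume `inflateInit()` has already been called on `s->z`): the wrapper keeps `size_t` totals of its own, hands zlib only chunks that fit its 32-bit `uInt` fields, and measures progress by pointer movement rather than by zlib's wrap-prone `uLong` counters.

```c
#include <limits.h>
#include <stddef.h>
#include <zlib.h>

/* Illustrative wrapper only; not Git's actual zlib glue code. */
struct big_zstream {
	z_stream z;
	size_t total_in;    /* wrapper-side totals, safe past 4GB */
	size_t total_out;
};

static int big_inflate(struct big_zstream *s,
		       const unsigned char *in, size_t in_len,
		       unsigned char *out, size_t out_len)
{
	int status = Z_OK;

	while (in_len && out_len && status == Z_OK) {
		size_t consumed, produced;

		/* Hand zlib a bounded chunk of the remaining buffers so
		 * its 32-bit uInt fields never overflow. */
		s->z.next_in = (Bytef *)in;
		s->z.avail_in = in_len > UINT_MAX ? UINT_MAX : (uInt)in_len;
		s->z.next_out = out;
		s->z.avail_out = out_len > UINT_MAX ? UINT_MAX : (uInt)out_len;

		status = inflate(&s->z, Z_NO_FLUSH);

		/* Measure progress by pointer movement instead of the
		 * uLong total_in/total_out counters, which wrap once a
		 * 32-bit-long platform passes 4GB. */
		consumed = (size_t)((const unsigned char *)s->z.next_in - in);
		produced = (size_t)(s->z.next_out - out);
		s->total_in += consumed;
		s->total_out += produced;
		in += consumed;
		in_len -= consumed;
		out += produced;
		out_len -= produced;
	}
	return status;
}
```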
```diff
 		c = *pack;
 		use(1);
-		size += (c & 0x7f) << shift;
+		size += (size_t)(c & 0x7f) << shift;
```
This is one of the main fixes: shifting by >= 32 results in undefined behavior and error -5 (buffer error) on >2GB streams.
This choice of whether to cast the result of the bitwise AND (&) to size_t, or to change any of the other variable types, should be explained/rationalised.
Maybe also consider whether the shift limit should be tested first (against the right sizeof() value) and a BUG thrown there (see the sketch below). This could be a precursor patch that at least detects your initial problem?
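For illustration, a sketch of such a precursor guard, assuming Git's `BUG()` macro and the `size`/`shift`/`c` variables from the decoding loop quoted above (this fragment would sit inside that loop; `CHAR_BIT` comes from `<limits.h>`):

```c
/* Hypothetical precursor check: detect an overlong shift before it
 * turns into undefined behavior. Using sizeof(size) keeps the limit
 * correct for whatever type `size` ends up having. */
if (shift >= CHAR_BIT * sizeof(size))
	BUG("object size varint shifts past %zu bits",
	    CHAR_BIT * sizeof(size));
size += (size_t)(c & 0x7f) << shift;
```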
dscho left a comment
@LordKiRon Thank you for taking on this gargantuan task.
Seeing as this recapitulates the changes of #2179, the very same issue as I have pointed out in that PR applies here: The enormous amount of changes in so many files is prone to hide obvious bugs. Remember, there are changes so simple that there are obviously no bugs, and there are changes that are so complex, or so extensive, that there are no obvious bugs. I much prefer the former to the latter.
This pull request is basically impossible to review. Even if I had the time, which I don't, I could not really make sure that there is no bug in there. I have a quite long attention span even when it comes to experienced software engineers, but even I would tire before I reached so much as the half-way point.
What's worse, these changes will actually need to go upstream, and since upstream Git is a very fast-moving target, it is highly unlikely that this commit will apply without merge conflicts. In fact, it is even highly unlikely that such a large commit would be within the size limit of the mailing list. It would probably be rejected before arriving there. And even if it weren't, potential reviewers would reject the idea of reviewing it. But even if they didn't, merge conflicts would be much easier to resolve if these changes were split into much more fine-grained commits.
Having said all that, in this particular instance, I think the most important part that we need to focus on is that there's something crucial missing, despite the fact that this PR contains so many changes in so many files: A regression test to confirm that there is a problem right now that is then fixed by those changes.
The challenge with such a regression test, of course, is that it will be quite expensive to test that a >4GB pack file can be cloned successfully. It not only takes a lot of time to clone it, but it is also quite expensive to generate a pack file that large to begin with. The tools that Git offers in its test framework are insufficient for that. At least I, for one, would not want to wait for, I don't know, maybe thirty minutes? just to have the Bash script churn out a >4GB packfile that it will happily throw away immediately after the test succeeds.
So this is what I suggest to do going forward with this Pull Request:
- Implement a regression test that fails without the changes of this pull request, but succeeds with them.
- This test should probably live in `t/t5608-clone-2gb.sh` (potentially renaming that file to something like `t5608-clone-large.sh`).
- To generate the large packfile that is needed for this test, I'd suggest to cherry-pick dscho@b6d4561 and use it.
- Once we have that regression test that verifies that the changes in this Pull Request fix the underlying problem, we need to start splitting up the vast commit into very, very small changes.
- Ideally, we come up with a very small subset of the changes that already lets the regression test pass, and which are structured into a patch series where each intermediate revision compiles successfully and each intermediate commit contains a logical change that we can easily document in the commit message. Let https://github.com/git-for-windows/git/pull/3487/commits serve as an example for a very small subset of the changes in that huge #2179, structured into bite-sized patches ready for upstream Git's reviewers' pleasure.
- Once we've fixed the immediate issue in a much smaller and easier-to-review PR, we can then leisurely tackle the rest of the changes in this PR, break them down into many more PRs with fine-grained splits into neat little patches, and then upstream all of that in a steady stream of patch series.
@LordKiRon are you up to the challenge? 😁
Have to admit that investing too much time into Git was not my initial intention; for me it's easier to drop Windows and move to Linux as my dev environment :) After all, what I started with was an attempt to work with my own code, which for some reason refused to fetch the latest changes and almost gave me a heart attack, as I suspected the server repository had become corrupted ;) Anyway, one of the reasons I opened this PR rather than using #2179 is that I wanted to make the changes as minimal as possible. That is still a lot of changes, most of them non-functional, but I agree they are still potentially dangerous. Personally I find it outrageous ;) that such a widely used (including in the commercial market) application as Git can't work with large files, and I do believe action needs to be taken. If merges are a concern, I even think the change should be made in original (upstream) Git: just "ask" them to use size_t instead of unsigned long. This would have no effect on Git for Linux, as long and size_t are both 64 bits there, but it would do wonders for Git for Windows :) As a temporary solution, I might try to create another PR that fixes the one specific shift issue I see for sure, which can be fixed in a few lines; the problem is that it will probably fail somewhere else later :)
For now, I will close this PR and concentrate on the most minimal fix possible (not to mention it fails the whitespace check :)).
This is to fix the issue of Git for Windows not being able to work with >2GB files and changesets.
The original Git works correctly; however, due to differences between Linux and Windows, where long is 64-bit on Linux but 32-bit on Windows, you can't process files >4GB and in some cases (such as the one I started with) even >2GB.
The main issue with 2GB is that we have this in the code:

```c
size += (c & 0x7f) << shift;
```

Since `size` is an unsigned long, i.e. 32 bits on Windows, we get garbage when `shift` is >= 32.
(BTW: exactly the same thing exists in the `unpack_one()` function in `builtin/unpack-objects.c`.)
So the proper fix is something like this:
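```c
/* Cast before shifting, as in the diff hunks quoted above, so the
 * addend is computed at size_t width (64 bits on 64-bit Windows)
 * rather than as a 32-bit int. */
size += (size_t)(c & 0x7f) << shift;
```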
However, while this is the root cause, it is not the entire fix; the problem is losing data further down the path. For this to work properly, the line

```c
obj->size = size;
```

should also store the real size, not a truncated one, which means the type of the size field needs to change as well (a sketch follows below). But this is not the full fix, and probably only covers the 2GB issue; for 4GB+ we need to update the entire code base to use 64-bit sizes.
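A minimal sketch of that type change, with a hypothetical struct name (the real field lives in Git's object-parsing code):

```c
/* Hypothetical illustration only; not the actual Git struct. */
struct parsed_object {
	size_t size;   /* was: unsigned long, which is 32-bit on Windows */
};
```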
This pull request concentrates on the minimal changes needed to make fetching of >4GB files and changesets work. I tried to minimize the scope of the changes needed for it just to work and compile, so this does not do a full 4GB adaptation in other areas.
The main goal was: the minimal changes needed for fetch/clone to work, with maximum attention paid to not changing/breaking <4GB behavior.
The main approach used is changing "unsigned long" to size_t where applicable and possible in every place where sizes are concerned.