-
-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for parsing author lines with no angle brackets #927
base: master
Are you sure you want to change the base?
Conversation
This commit adds support for parsing author lines with no angle brackets. It uses a regexp to avoid hand-writing some backtracking logic. This also adds tests for other types of brokenness that I found in existing git repos, and that were already supported. I tried to use the original example as test data when possible.
Codecov Report
@@ Coverage Diff @@
## master #927 +/- ##
==========================================
- Coverage 84.11% 84.11% -0.01%
==========================================
Files 92 92
Lines 22679 22752 +73
Branches 3315 3317 +2
==========================================
+ Hits 19076 19137 +61
+ Misses 3129 3119 -10
- Partials 474 496 +22
Continue to review full report at Codecov.
|
Thanks for the PR. It looks like some of these will break roundtripping, i.e. parsing some bytes into a Commit object and then getting the exact same set of bytes back out when calling Commit.as_raw_bytes() - which is a property I'd like to maintain. Have you actually seen all of these in the wild? We could probably offer some sort of alternative for cases like this, if we don't want to add attributes for each one. |
There were already some breakages on commits with weird timezones. For example, missing leading zeros in timezones (I don't know how this happened, I assume home-made Git implementations) are fixed by Dulwich: >>> b = b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +100\ncommitter John Doe <[email protected]> 1614159930 +100\n\nfoo'
>>> c = dulwich.objects.Commit.from_string(b)
>>> c.author = c.author # reset c._needs_serialization
>>> c.as_raw_string()
b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +0100\ncommitter John Doe <[email protected]> 1614159930 +0100\n\nfoo' The solution we decided to use at @SoftwareHeritage was to keep the original timezone bytes somewhere. I could do it for Dulwich (I actually have a commit ready to do this, based on the one in this PR)
Yes. I have seen about 14k commits with missing leading zeros as above. 232 with the timezone
Dulwich could store the raw bytes of the original timezone. This requires a new attribute, but it allows the
|
FWIW Rather than changing the default parser, I'm planning to lazily load these event lines (user identity + timestamp + timezone). That way, you can get access to the raw data and parse that with a less strict parser if you like. The added minor benefit could be that we get to avoid parsing them when we don't need them. My PoC is in #1135; that approach works, but adds two extra object instances per commit which is not ideal. |
f1ae053
to
cd30df4
Compare
This commit adds support for parsing author lines with no angle brackets,
as I found an example of this on GitHub.
It uses a regexp to avoid hand-writing some backtracking logic.
This also adds tests for other types of brokenness that I found in existing
git repos, and that were already supported. I tried to use the original example
as test data when possible.