-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the regressions introduced in the fix for #89 #120
Conversation
There's another possible fix: https://github.com/ShoshinNikita/go-diff/commit/c8591cf97f43b198b269258a0c3f3b1fc07990df. It is also based on the previous approach but uses |
I'm also hitting this regression. Because the lineHash map is no longer shared, the diff assigns different chars/runes to lines that are the same. |
Hi! We (the gopls team) spoke to @sergi and will help review and test this PR. Specifically, I'll do my best to review the change, and we'll test the fix in gopls. Will probably need a couple days. |
Thanks for helping @findleyr and team!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for this!
This is my first time reading this source, so some of my questions / observations were general in nature. Will follow-up with another pass once you've responded.
// checkLineArray checks the size of the slice and ensures that the index of the next element | ||
// will be the valid rune. | ||
func checkLineArray(a *[]string) { | ||
// Runes in this range are invalid, utf8.ValidRune() returns false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very subtle. Please explain further in this comment why this is a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment describes why runes must be valid: #89 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I think it's worth explaining here (you can reference #89).
return dmp.DiffLinesToRunes(string(text1), string(text2)) | ||
} | ||
|
||
// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The words 'array', 'list', and 'slice' are used interchangeably throughout, but array is really incorrect in this context.
|
||
// checkLineArray checks the size of the slice and ensures that the index of the next element | ||
// will be the valid rune. | ||
func checkLineArray(a *[]string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have functions
// lineRune returns a valid rune value for the line at the given index
func lineRune(idx int) rune
// runeLine looks up the given rune hash in the lines slice.
func runeLine(idxRune rune, lines []string) string
Thereby avoiding the padding here. It could also be used to avoid the prepended "".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can break the user code because DiffLinesToRunes
returns the lineArray
. If a user has something like DiffCharsToLines
, it can panic with index out of range
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that DiffLinesToRunes could use lineRune to convert to rune values, and DiffCharsToLines can use runeLine to convert to line indexes. These functions would be inverses of eachother.
} | ||
|
||
// diffLinesToStringsMunge splits a text into an array of strings, and reduces the texts to a []string. | ||
func (dmp *DiffMatchPatch) diffLinesToStringsMunge(text string, lineArray *[]string) []uint32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please reorganize to minimize the diff with this refactored method.
} | ||
|
||
func (dmp *DiffMatchPatch) diffLinesToRunes(text1, text2 []rune) ([]rune, []rune, []string) { | ||
return dmp.DiffLinesToRunes(string(text1), string(text2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about all this unnecessary allocation converting to string, when it really shouldn't be necessary.
Did you compare benchmark performance?
|
||
// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line. | ||
// We use strings instead of []runes as input mainly because you can't use []rune as a map key. | ||
func (dmp *DiffMatchPatch) diffLinesToRunesMunge(text string, lineArray *[]string, lineHash map[string]int) []rune { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm staring at this, and I don't see how the previous algorithm worked without passing in lineHash. Is that the root cause of the bug?
Can you explain somewhere the exact nature of the bug in the fix for #89?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the root cause of the bug is described in the PR description:
The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...so array indexes could be corrupted/split in the diff output, right?
It's worth explaining in a comment.
@@ -392,28 +390,88 @@ func (dmp *DiffMatchPatch) diffBisectSplit(runes1, runes2 []rune, x, y int, | |||
// DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/educes/reduces
@@ -392,28 +390,88 @@ func (dmp *DiffMatchPatch) diffBisectSplit(runes1, runes2 []rune, x, y int, | |||
// DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line. | |||
// It's slightly faster to call DiffLinesToRunes first, followed by DiffMainRunes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment still accurate?
lineEnd = len(text) - 1 | ||
} | ||
|
||
line := text[lineStart : lineEnd+1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd be interested to know how performance would be affected by using an actual hash to look up the rune value in the index, rather than a string. Then we could convert to []byte at API boundaries and avoid allocating again.
But that doesn't need to be done for this PR.
By this you mean modifying the diff algorithm to operate on []int, rather than string, right? |
I am not very familiar with the codebase. I just could determine the cause of the regressions and fix #89 with another approach based on code in v1.1.0 (the second commit in this PR partially reverts the previous fix). So, the real diff is v1.1.0...d20955a, lines 431-465 (or just the last commit). I think using |
I would suggest you guys look at my PR #128. It fixes a serious panic issue. It may also fix the issue you were disucssing (I'm not sure). |
This PR can probably be closed now since my PR which fixes the same thing was merged yesterday #136. |
This PR partially reverts commits db1b095 - 0a651d5 and fixes the issue #89 with a different approach. The current solution doesn't work well with
(*DiffMatchPatch).diffMainRunes
method because array indexes in the index string occupy multiple runes.This fix is based on the previous approach. But elements of the
lineArray
with indexes from0xD800
(55296) to0xDFFF
(57343) are skipped because runes in this range are invalid. It requires additional 32KB of memory ([2048]string{}
) but allows us to safely encode line indexes in a string.It doesn't completely fix #89 but increases the panic limit to ~1114111 lines (
0x10FFFF
is the maximum valid unicode code point). The complete fix will require a lot of changes. At the same time the current approach has a bug. So, I believe it's better to use this fix.Fixes #115