Simhash #6

Closed
wants to merge 10 commits into from
25 changes: 13 additions & 12 deletions .github/workflows/test.yml
@@ -1,5 +1,6 @@
name: Test


on:
push:
branches: [main]
@@ -10,16 +11,16 @@ jobs:
vendor:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-go@v2
- uses: actions/checkout@v4
- uses: actions/setup-go@v4
with:
go-version: 1.14.x
go-version: 1.20.x
- name: get dependencies
run: go get -v -t -d ./...
- name: vendoring
run: go mod vendor

- uses: actions/upload-artifact@v2
- uses: actions/upload-artifact@v3
with:
name: repository
path: .
@@ -28,21 +29,21 @@ jobs:
runs-on: ubuntu-latest
needs: vendor
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
# https://github.com/golangci/golangci-lint-action#how-to-use
- name: golangci-lint
uses: golangci/golangci-lint-action@v2
uses: golangci/golangci-lint-action@v3
with:
version: latest

test:
runs-on: ubuntu-latest
needs: vendor
steps:
- uses: actions/setup-go@v2
- uses: actions/setup-go@v4
with:
go-version: 1.14.x
- uses: actions/download-artifact@v2
go-version: 1.20.x
- uses: actions/download-artifact@v3
with:
name: repository
path: .
@@ -53,10 +54,10 @@ jobs:
runs-on: ubuntu-latest
needs: vendor
steps:
- uses: actions/setup-go@v2
- uses: actions/setup-go@v4
with:
go-version: 1.14.x
- uses: actions/download-artifact@v2
go-version: 1.20.x
- uses: actions/download-artifact@v3
with:
name: repository
path: .
7 changes: 0 additions & 7 deletions .goreleaser.yml
@@ -14,13 +14,6 @@ builds:
- linux
- windows
- darwin
archives:
- replacements:
darwin: Darwin
linux: Linux
windows: Windows
386: i386
amd64: x86_64
checksum:
name_template: 'checksums.txt'
snapshot:
2 changes: 2 additions & 0 deletions .tool-versions
@@ -0,0 +1,2 @@
golang 1.20.3
goreleaser 1.18.2
2 changes: 1 addition & 1 deletion .whitesource
@@ -1,3 +1,3 @@
{
"settingsInheritedFrom": "SmartBear/whitesource-config@main"
"settingsInheritedFrom": "oselvar/whitesource-config@main"
}
Binary file added AsaduzzamanICSM2013LineDraft.pdf
Binary file not shown.
20 changes: 10 additions & 10 deletions CHANGELOG.md
@@ -30,7 +30,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [0.0.3] - 2022-02-03
### Changed
- Change module name from `github.com/aslakhellesoy/lhdiff` to `github.com/SmartBear/lhdiff`
- Change module name from `github.com/aslakhellesoy/lhdiff` to `github.com/oselvar/lhdiff`

## [0.0.2] - 2022-02-03
### Added
@@ -43,12 +43,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- First functional version

[Unreleased]: https://github.com/SmartBear/lhdiff/compare/v0.1.2...HEAD
[0.1.2]: https://github.com/SmartBear/lhdiff/compare/v0.1.1...v0.1.2
[0.1.1]: https://github.com/SmartBear/lhdiff/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/SmartBear/lhdiff/compare/v0.0.5...v0.1.0
[0.0.5]: https://github.com/SmartBear/lhdiff/compare/v0.0.4...v0.0.5
[0.0.4]: https://github.com/SmartBear/lhdiff/compare/v0.0.3...v0.0.4
[0.0.3]: https://github.com/SmartBear/lhdiff/compare/v0.0.2...v0.0.3
[0.0.2]: https://github.com/SmartBear/lhdiff/compare/v0.0.1...v0.0.2
[0.0.1]: https://github.com/SmartBear/lhdiff/compare/6084d5de2ec3dbb25767433e79ab840d5941c2de...v0.0.1
[Unreleased]: https://github.com/oselvar/lhdiff/compare/v0.1.2...HEAD
[0.1.2]: https://github.com/oselvar/lhdiff/compare/v0.1.1...v0.1.2
[0.1.1]: https://github.com/oselvar/lhdiff/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/oselvar/lhdiff/compare/v0.0.5...v0.1.0
[0.0.5]: https://github.com/oselvar/lhdiff/compare/v0.0.4...v0.0.5
[0.0.4]: https://github.com/oselvar/lhdiff/compare/v0.0.3...v0.0.4
[0.0.3]: https://github.com/oselvar/lhdiff/compare/v0.0.2...v0.0.3
[0.0.2]: https://github.com/oselvar/lhdiff/compare/v0.0.1...v0.0.2
[0.0.1]: https://github.com/oselvar/lhdiff/compare/6084d5de2ec3dbb25767433e79ab840d5941c2de...v0.0.1
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -2,4 +2,4 @@

## Build

goreleaser build --single-target --snapshot --rm-dist
goreleaser build --single-target --snapshot --clean
4 changes: 2 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
[![Test](https://github.com/SmartBear/lhdiff/actions/workflows/test.yml/badge.svg)](https://github.com/SmartBear/lhdiff/actions/workflows/test.yml)
[![Test](https://github.com/oselvar/lhdiff/actions/workflows/test.yml/badge.svg)](https://github.com/oselvar/lhdiff/actions/workflows/test.yml)
# lhdiff

A Lightweight Hybrid Approach for Tracking Source Lines.
@@ -8,7 +8,7 @@ Unix diff, and works independently of the file contents (programming language).

## Install

go get github.com/SmartBear/lhdiff
go get github.com/oselvar/lhdiff

To install from source, see [CONTRIBUTING.md](./CONTRIBUTING.md)

8 changes: 4 additions & 4 deletions cmd/lhdiff/main.go
@@ -3,18 +3,18 @@ package main
import (
"flag"
"fmt"
"github.com/SmartBear/lhdiff"
"io/ioutil"
"os"

"github.com/oselvar/lhdiff"
)

func main() {
compact := flag.Bool("compact", false, "Exclude identical lines from output")
flag.Parse()
leftFile := flag.Arg(0)
rightFile := flag.Arg(1)
left, _ := ioutil.ReadFile(leftFile)
right, _ := ioutil.ReadFile(rightFile)
left, _ := os.ReadFile(leftFile)
right, _ := os.ReadFile(rightFile)
mappings, err := lhdiff.Lhdiff(string(left), string(right), 4, !*compact)

if err != nil {
8 changes: 4 additions & 4 deletions go.mod
@@ -1,10 +1,10 @@
module github.com/SmartBear/lhdiff
module github.com/oselvar/lhdiff

go 1.17
go 1.20

require (
github.com/ianbruene/go-difflib v1.2.0
github.com/ka-weihe/fast-levenshtein v0.0.0-20201227151214-4c99ee36a1ba
github.com/mongodb-forks/go-difflib v1.3.1
github.com/rexsimiloluwah/distance_metrics v0.0.0-20211020112549-67979eee6077
github.com/sourcegraph/go-diff v0.6.1
github.com/sourcegraph/go-diff v0.7.0
)
8 changes: 4 additions & 4 deletions go.sum
@@ -7,14 +7,14 @@ github.com/dgryski/trifles v0.0.0-20200830180326-aaf60a07f6a3 h1:JibukGTEjdN4VMX
github.com/dgryski/trifles v0.0.0-20200830180326-aaf60a07f6a3/go.mod h1:if7Fbed8SFyPtHLHbg49SI7NAdJiC5WIA09pe59rfAA=
github.com/google/go-cmp v0.5.2 h1:X2ev0eStA3AbceY54o37/0PQ/UWqKEiiO2dKL5OPaFM=
github.com/google/go-cmp v0.5.2/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/ianbruene/go-difflib v1.2.0 h1:iARmgaCq6nW5QptdoFm0PYAyNGix3xw/xRgEwphJSZw=
github.com/ianbruene/go-difflib v1.2.0/go.mod h1:uJbrQ06VPxjRiRIrync+E6VcWFGW2dWqw2gvQp6HQPY=
github.com/ka-weihe/fast-levenshtein v0.0.0-20201227151214-4c99ee36a1ba h1:keZ4vJpYOVm6yrjLzZ6QgozbEBaT0GjfH30ihbO67+4=
github.com/ka-weihe/fast-levenshtein v0.0.0-20201227151214-4c99ee36a1ba/go.mod h1:kaXTPU4xitQT0rfT7/i9O9Gm8acSh3DXr0p4y3vKqiE=
github.com/mongodb-forks/go-difflib v1.3.1 h1:e+DVrR/0m+1i3rAhqHcXlJAsyNl71w3RGtUgYAmCctU=
github.com/mongodb-forks/go-difflib v1.3.1/go.mod h1:HQwBVyCQe8Qoqa5oARs7axPr/BNmWpXPS0ozENUCaS0=
github.com/rexsimiloluwah/distance_metrics v0.0.0-20211020112549-67979eee6077 h1:UARAHYmaBmaZFFgO/3gdyMaw6ZJw7sGM2vF5NWUsDNM=
github.com/rexsimiloluwah/distance_metrics v0.0.0-20211020112549-67979eee6077/go.mod h1:c9cZ1im6joocUOHKTdfD5H8iLrG6yMFyzQQ0iVv/nog=
github.com/shurcooL/go v0.0.0-20180423040247-9e1955d9fb6e/go.mod h1:TDJrrUr11Vxrven61rcy3hJMUqaf/CLWYhHNPmT14Lk=
github.com/shurcooL/go-goon v0.0.0-20170922171312-37c2f522c041/go.mod h1:N5mDOmsrJOB+vfqUK+7DmDyjhSLIIBnXo9lvZJj3MWQ=
github.com/sourcegraph/go-diff v0.6.1 h1:hmA1LzxW0n1c3Q4YbrFgg4P99GSnebYa3x8gr0HZqLQ=
github.com/sourcegraph/go-diff v0.6.1/go.mod h1:iBszgVvyxdc8SFZ7gm69go2KDdt3ag071iBaWPF6cjs=
github.com/sourcegraph/go-diff v0.7.0 h1:9uLlrd5T46OXs5qpp8L/MTltk0zikUGi0sNNyCpA8G0=
github.com/sourcegraph/go-diff v0.7.0/go.mod h1:iBszgVvyxdc8SFZ7gm69go2KDdt3ag071iBaWPF6cjs=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
53 changes: 38 additions & 15 deletions lhdiff.go
@@ -3,14 +3,15 @@ package lhdiff
import (
"bytes"
"fmt"
"github.com/ianbruene/go-difflib/difflib"
levenshtein "github.com/ka-weihe/fast-levenshtein"
"github.com/sourcegraph/go-diff/diff"
"math"
"regexp"
"sort"
"strconv"
"strings"

levenshtein "github.com/ka-weihe/fast-levenshtein"
"github.com/mongodb-forks/go-difflib/difflib"
"github.com/sourcegraph/go-diff/diff"
)

type LineInfo struct {
@@ -55,7 +56,23 @@ const ContextSimilarityFactor = 0.4
const ContentSimilarityFactor = 0.6
const SimilarityThreshold = 0.45

func Lhdiff(left string, right string, contextSize int, includeIdenticalLines bool) ([][]int, error) {
// Lhdiff returns a list of mappings between the lines of the left and right file.
// Each mapping is a pair of 1-based line numbers, where 0 indicates that the line is not present in that file.
// The mappings are sorted by the line number of the left file; lines that exist only in the right file come last.
// If includeIdenticalLines is true, lines that are identical in both files and keep their line number are included in the mappings.
// Otherwise, those lines are omitted.
// The contextSize parameter determines how many lines of context are used to determine the similarity of lines.
// Context lines are lines that are not blank and do not consist of only curly braces or parentheses.
// The context lines are not included in the mappings.
// The similarity of two lines combines the normalized Levenshtein distance of their content
// with the cosine similarity of their context.
// The similarity is only considered if it is above SimilarityThreshold.
// The mappings are determined by first finding the unchanged lines using the difflib library.
// Then, for each line in the right file, the most similar line in the left file is found.
// The most similar line is the line with the highest combined similarity.
func Lhdiff(left string, right string, contextSize int, includeIdenticalLines bool) ([][]uint32, error) {
leftLines := ConvertToLinesWithoutNewLine(left)
rightLines := ConvertToLinesWithoutNewLine(right)

@@ -88,6 +105,11 @@ func Lhdiff(left string, right string, contextSize int, includeIdenticalLines bo
leftLineInfos := MakeLineInfos(leftLineNumbers, leftLines, contextSize)
rightLineInfos := MakeLineInfos(rightLineNumbers, rightLines, contextSize)

// TODO: This nested loop compares every right line with every left line (quadratic cost).
// See section D in the paper about simhash.
// We need to compute simhash here to narrow down the candidates.
// See HDiffSHMatching.match - line 287-293
// Maybe do this in parallel?
for _, rightLineInfo := range rightLineInfos {
var similarPairCandidates []LinePair
for _, leftLineInfo := range leftLineInfos {
@@ -124,25 +146,26 @@ func Lhdiff(left string, right string, contextSize int, includeIdenticalLines bo
rightLineNumbers = append(rightLineNumbers, rightLineNumber)
}
}
return lineMappings(allPairs, len(leftLines), rightLineNumbers, includeIdenticalLines), nil
return makeLineMappings(allPairs, len(leftLines), rightLineNumbers, includeIdenticalLines), nil
}

func lineMappings(linePairs map[int]LinePair, leftLineCount int, newRightLines []int, includeIdenticalLines bool) [][]int {
lines := make([][]int, 0)
func makeLineMappings(linePairs map[int]LinePair, leftLineCount int, newRightLines []int, includeIdenticalLines bool) [][]uint32 {
fmt.Println("linePairs:", len(linePairs))
lineMappings := make([][]uint32, 0)
for leftLineNumber := 0; leftLineNumber < leftLineCount; leftLineNumber++ {
pair, exists := linePairs[leftLineNumber]
if !exists {
lines = append(lines, []int{leftLineNumber, -1})
lineMappings = append(lineMappings, []uint32{uint32(leftLineNumber + 1), 0})
} else {
if includeIdenticalLines || !(pair.left.content == pair.right.content && leftLineNumber == pair.right.lineNumber) {
lines = append(lines, []int{leftLineNumber, pair.right.lineNumber})
lineMappings = append(lineMappings, []uint32{uint32(leftLineNumber + 1), uint32(pair.right.lineNumber + 1)})
}
}
}
for _, rightLine := range newRightLines {
lines = append(lines, []int{-1, rightLine})
lineMappings = append(lineMappings, []uint32{0, uint32(rightLine + 1)})
}
return lines
return lineMappings
}

func MakeLineInfos(lineNumbers []int, lines []string, contextSize int) []*LineInfo {
@@ -299,7 +322,7 @@ func RemoveMultipleSpaceAndTrim(s string) string {
return strings.TrimSpace(re.ReplaceAllString(s, " ")) + "\n"
}

func PrintMappings(mappings [][]int) error {
func PrintMappings(mappings [][]uint32) error {
for _, mapping := range mappings {
_, err := fmt.Printf("%s,%s\n", toString(mapping[0]), toString(mapping[1]))
if err != nil {
@@ -309,12 +332,12 @@ func PrintMappings(mappings [][]int) error {
return nil
}

func toString(i int) string {
func toString(i uint32) string {
var left string
if i == -1 {
if i == 0 {
left = "_"
} else {
left = strconv.Itoa(i + 1)
left = strconv.FormatUint(uint64(i), 10)
}
return left
}
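With this change the mappings hold 1-based `uint32` line numbers and use `0` in place of the former `-1` to mark a line that has no counterpart. A small sketch of how such a mapping renders, assuming hypothetical helpers `formatLine` and `formatMapping` that mirror `toString` and `PrintMappings` but are not part of the PR:

```go
package main

import (
	"fmt"
	"strconv"
)

// formatLine renders a 1-based line number, or "_" for the 0 sentinel
// that marks "no counterpart in this file".
func formatLine(n uint32) string {
	if n == 0 {
		return "_"
	}
	return strconv.FormatUint(uint64(n), 10)
}

// formatMapping renders one {left, right} pair as "left,right".
func formatMapping(m []uint32) string {
	return formatLine(m[0]) + "," + formatLine(m[1])
}

func main() {
	mappings := [][]uint32{
		{1, 1}, // line 1 maps to line 1
		{2, 0}, // left line 2 was deleted
		{0, 3}, // right line 3 was added
	}
	for _, m := range mappings {
		fmt.Println(formatMapping(m))
	}
	// Prints:
	// 1,1
	// 2,_
	// _,3
}
```

Because an unsigned type cannot hold `-1`, the sentinel only works if real line numbers start at 1, which is why `makeLineMappings` adds 1 to every index.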
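The TODO in `Lhdiff` points to simhash as a way to avoid comparing every right line with every left line. A minimal sketch of the general simhash technique, assuming whitespace tokenization and 64-bit FNV-1a token hashes. The function names are hypothetical, and this is neither the code in this PR nor the paper's reference implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash computes a 64-bit fingerprint of text. Each token votes +1 or -1
// on every bit position according to its hash; the fingerprint keeps the
// bits with a positive total.
func simhash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<i) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fingerprint uint64
	for i, v := range votes {
		if v > 0 {
			fingerprint |= uint64(1) << i
		}
	}
	return fingerprint
}

// hammingDistance counts the bits in which two fingerprints differ.
// Similar texts tend to have a small distance.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	left := simhash("left , _ := os . ReadFile ( leftFile )")
	right := simhash("left , err := os . ReadFile ( leftFile )")
	other := simhash("return lineMappings")
	fmt.Println("near pair:", hammingDistance(left, right))
	fmt.Println("far pair: ", hammingDistance(left, other))
}
```

Fingerprints are cheap to compare, so one possible use is to rank left lines by Hamming distance and run the more expensive Levenshtein and cosine comparison only on the closest few.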