B44.1 WT Content Management, Such- und Texttechnologien (SL) - 3. Zug- WiSe2020/21

Credits

Original owner of project/code

  • Prof. Dr. Hendrik Gärtner

Student contributors:

  • Björn Uhlig
  • Lennart Döring

Intro

This project combines two closely related fields:

  • (A) duplicate detection using entity resolution
  • (B) data reduction using the min hashing algorithm

Avoid Not Serializable Errors

  • Passing a function parameter straight through to another call can fail with a "not serializable" error. Copy the parameter into a local val first and pass the copy:
import Utils.tokenize
def myFunc(s: String, param: Any) = { Utils.tokenize(s, param) } // does not work: not serializable

def myFunc(s: String, param: Any) = { val mp = param; Utils.tokenize(s, mp) } // works
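A common cause of this kind of error is a closure that captures its enclosing, non-serializable object instead of just the value it needs. The sketch below illustrates that general mechanism with plain Java serialization; it is an assumption about the underlying cause, not taken from the project code, and `Analyzer`, `stopWords`, and `isSerializable` are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationSketch {
  // Hypothetical helper: true if `obj` survives Java serialization.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Hypothetical class that is deliberately NOT Serializable.
  class Analyzer(val stopWords: Set[String]) {
    // Reads the field through `this`, so the closure captures the whole Analyzer.
    def badClosure: String => Boolean = s => stopWords.contains(s)

    // Copies the field into a local val first, so only the Set is captured.
    def goodClosure: String => Boolean = {
      val sw = stopWords
      s => sw.contains(s)
    }
  }
}
```

The local copy is the same trick as `val mp = param` above: the closure then holds only a serializable value rather than a reference to its surroundings.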

Task A: Create basic functions for text analysis (entity resolution)

See Chapter 3 (pp. 73-103) of Mining of Massive Datasets for min hashing and locality-sensitive hashing.

Entity Resolution (Text Analysis)

Implement functions in: src/main/scala/textanalyse/EntityResolution.scala

Implement TF-IDF:

$\mathrm{TF}_{t,d}$ is the number of occurrences of term $t$ in document $d$.
$\mathrm{DF}_t$ is the number of documents containing the term $t$.
$N$ is the total number of documents in the corpus.

$$W_{t,d} = \mathrm{TF}_{t,d} \cdot \log\left(\frac{N}{\mathrm{DF}_t}\right)$$
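The weight formula can be sketched on plain Scala collections as follows. This is a minimal illustration, not the expected solution for `EntityResolution.scala`: the object and method names are hypothetical, documents are assumed to be already tokenized, and the natural logarithm is assumed because the formula does not state a base.

```scala
object TfIdfSketch {
  // TF: raw number of occurrences of each term in one tokenized document.
  def termFrequencies(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (term, occ) => term -> occ.size }

  // DF: number of documents that contain each term at least once.
  def documentFrequencies(corpus: Seq[Seq[String]]): Map[String, Int] =
    corpus.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }

  // W(t,d) = TF(t,d) * log(N / DF(t)), using the natural logarithm (assumption).
  def tfIdf(corpus: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = corpus.size.toDouble
    val df = documentFrequencies(corpus)
    corpus.map { doc =>
      termFrequencies(doc).map { case (term, tf) => term -> tf * math.log(n / df(term)) }
    }
  }
}
```

A term that occurs in every document gets weight 0, because $\log(N/N) = 0$; terms that are rare across the corpus but frequent in one document get the largest weights.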

Task B

Min Hashing

Implement functions in: src/main/scala/minhash/JaccardSimilarity.scala
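As orientation, here is a minimal sketch of Jaccard similarity and a min hash signature that estimates it. It is not the interface expected in `JaccardSimilarity.scala`: all names are hypothetical, the hash family `(a * x + b) mod p` with `p = 2^31 - 1` is one common choice, and treating two empty sets as fully similar is an assumption.

```scala
import scala.util.Random

object MinHashSketch {
  private val Prime = 2147483647L // 2^31 - 1

  // Exact Jaccard similarity |A ∩ B| / |A ∪ B|.
  // Assumption: two empty sets count as identical (similarity 1.0).
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else a.intersect(b).size.toDouble / a.union(b).size

  // k pseudo-random (a, b) pairs defining hash functions h(x) = (a * x + b) mod Prime.
  def hashParams(k: Int, seed: Long): Seq[(Long, Long)] = {
    val rnd = new Random(seed)
    Seq.fill(k)((1L + rnd.nextInt(Int.MaxValue - 1), rnd.nextInt(Int.MaxValue).toLong))
  }

  // Min hash signature: for each hash function, the minimum hash over all elements.
  def signature[A](set: Set[A], params: Seq[(Long, Long)]): Seq[Long] = {
    require(set.nonEmpty, "signature is undefined for an empty set")
    params.map { case (a, b) =>
      set.iterator.map(x => (a * (x.hashCode & 0x7fffffff) + b) % Prime).min
    }
  }

  // Estimated Jaccard similarity: fraction of signature positions that agree.
  def estimate(s1: Seq[Long], s2: Seq[Long]): Double = {
    require(s1.size == s2.size && s1.nonEmpty, "signatures must have equal, non-zero length")
    s1.zip(s2).count(p => p._1 == p._2).toDouble / s1.size
  }
}
```

The signature has a fixed length `k` regardless of set size, which is where the data reduction comes from; a larger `k` gives a closer estimate at the cost of more hashing.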

Locality-Sensitive Hashing

Scalable Entity Resolution (Text Analysis)

Implement functions in: src/main/scala/textanalyse/ScalableEntityResolution.scala