Original owner of project/code:
- Prof. Dr. Hendrik Gärtner
Student Contributors:
- Björn Uhlig
- Lennart Döring
The project combines two closely related fields:
- (A) duplicate detection using entity resolution
- (B) data reduction using the min hashing algorithm
- Function parameters cannot be passed on directly:
import Utils.tokenize
def myFunc(s: String, param: Any) = { Utils.tokenize(s, param) } // does not work
def myFunc(s: String, param: Any) = { val mp = param; Utils.tokenize(s, mp) } // works
=> The first variant fails because it is not serializable; copying the parameter into a local val first avoids this.
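The note above can be illustrated without the project code. The following is a minimal sketch of one common cause of such errors, under the assumption that the parameter is a field of a class that is not itself serializable: a function literal that reads the field captures the whole enclosing instance, while one that reads a local copy captures only the value. `ClosureCaptureSketch`, `Job`, `isSerializable` and the `tokenize` stand-in are hypothetical names introduced for illustration, and the behaviour described assumes Scala 2.12 or later.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical stand-in for Utils.tokenize; not the project's implementation.
  def tokenize(s: String, stopWords: Set[String]): List[String] =
    s.toLowerCase.split("\\W+").filter(w => w.nonEmpty && !stopWords(w)).toList

  // Deliberately NOT Serializable (assumption: this mirrors the failing setup).
  class Job(val stopWords: Set[String]) {
    // Reads the field inside the function body, so the function captures `this`.
    def capturing: String => List[String] =
      s => tokenize(s, stopWords)

    // Copies the field into a local val first, so only the Set is captured.
    def localCopy: String => List[String] = {
      val sw = stopWords
      s => tokenize(s, sw)
    }
  }

  // Tries plain Java serialization and reports whether it succeeded.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With `val job = new ClosureCaptureSketch.Job(Set("the"))`, calling `isSerializable(job.capturing)` should fail and `isSerializable(job.localCopy)` should succeed, which matches the local val workaround shown above.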
See Chapter 3 (pp. 73-103) of Mining of Massive Datasets for min hashing and locality-sensitive hashing.
Implement functions in: src/main/scala/textanalyse/EntityResolution.scala
Implement TF-IDF:
TF_{t,d} is the number of occurrences of term t in document d.
DF_t is the number of documents containing the term t.
N is the total number of documents in the corpus.
W_{t,d} = TF_{t,d} * log(N / DF_t)
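The weighting above can be sketched on plain Scala collections as follows. This is an illustration under stated assumptions, not the project's solution: documents are taken to be already tokenized sequences of strings, the logarithm is assumed to be the natural logarithm (the formula does not fix a base), and `TfIdfSketch` and its method names are hypothetical.

```scala
object TfIdfSketch {
  type Doc = Seq[String]

  // TF_{t,d}: raw number of occurrences of each term in one document.
  def termFrequencies(doc: Doc): Map[String, Int] =
    doc.groupBy(identity).map { case (t, occ) => t -> occ.size }

  // DF_t: number of documents that contain each term at least once.
  def documentFrequencies(corpus: Seq[Doc]): Map[String, Int] =
    corpus.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

  // W_{t,d} = TF_{t,d} * log(N / DF_t); terms unknown to the corpus are skipped.
  def tfIdf(doc: Doc, corpus: Seq[Doc]): Map[String, Double] = {
    val n = corpus.size.toDouble
    val df = documentFrequencies(corpus)
    termFrequencies(doc).collect {
      case (t, tf) if df.contains(t) => t -> tf * math.log(n / df(t))
    }
  }
}
```

For the two-document corpus `Seq(Seq("a", "b", "a"), Seq("b", "c"))`, the term "b" occurs in every document, so N / DF_t is 1 and its weight in the first document is 0, while "a" gets the weight 2 * log(2).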
- tokenize
- getTokens
- countTokens
- findBiggestRecord
- calculateTF_IDF
- computeSimilarity
- calculateDotProduct
- calculateNorm
- calculateCosinusSimilarity
- calculateDocumentSimilarity
- computeSimilarityWithBroadcast
- getTermFrequencies
- createCorpus
- calculateIDF
- simpleSimilarityCalculation
- findSimilarity
- simpleSimilarityCalculationWithBroadcast
- evaluateModel
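Several of the functions listed above (calculateDotProduct, calculateNorm, calculateCosinusSimilarity) relate to cosine similarity between weight vectors. A minimal sketch, assuming each document is represented as a sparse map from term to TF-IDF weight; `CosineSketch` and its signatures are hypothetical and need not match the ones expected in EntityResolution.scala:

```scala
object CosineSketch {
  type Vec = Map[String, Double]

  // Sum of products over the terms that both vectors share.
  def dotProduct(a: Vec, b: Vec): Double =
    a.iterator.collect { case (t, w) if b.contains(t) => w * b(t) }.sum

  // Euclidean length of a vector.
  def norm(v: Vec): Double = math.sqrt(dotProduct(v, v))

  // cos(a, b) = (a . b) / (|a| * |b|); defined here as 0 if either vector is all zeros.
  def cosineSimilarity(a: Vec, b: Vec): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dotProduct(a, b) / denom
  }
}
```

Returning 0 for a zero vector is a design choice of this sketch that avoids a division by zero; the assignment may expect different handling.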
Implement functions in: src/main/scala/minhash/JaccardSimilarity.scala
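As orientation for part (B), the following sketch shows exact Jaccard similarity next to a min hash estimate of it. It is an illustration under assumptions rather than the expected solution: elements are integers, the hash functions are random linear functions modulo a prime, input sets are non-empty, and all names (`MinHashSketch`, `hashFunctions`, `signature`, `estimate`) are hypothetical.

```scala
import scala.util.Random

object MinHashSketch {
  // Exact Jaccard similarity: |A intersect B| / |A union B|.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  val prime = 2147483647L // 2^31 - 1

  // k random hash functions of the form h(x) = (a * x + b) mod prime.
  def hashFunctions(k: Int, seed: Long): Seq[Int => Long] = {
    val rnd = new Random(seed)
    Seq.fill(k) {
      val a = 1 + rnd.nextInt(Int.MaxValue - 1).toLong
      val b = rnd.nextInt(Int.MaxValue).toLong
      (x: Int) => Math.floorMod(a * x + b, prime)
    }
  }

  // Signature: for each hash function, the minimum hash value over the set.
  // Assumes a non-empty set.
  def signature(s: Set[Int], hs: Seq[Int => Long]): Seq[Long] =
    hs.map(h => s.map(h).min)

  // Fraction of signature positions that agree, an estimate of the Jaccard similarity.
  def estimate(sigA: Seq[Long], sigB: Seq[Long]): Double =
    sigA.zip(sigB).count { case (x, y) => x == y }.toDouble / sigA.size
}
```

The signatures are much smaller than the original sets, which is the data reduction aspect: two sets can be compared through k numbers each, and the estimate approaches the exact Jaccard value as k grows.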
Implement functions in: src/main/scala/textanalyse/ScalableEntityResolution.scala