TFIDF calculated not correctly? #41

lubomirkrcmar · 2013-05-20T13:13:51Z

Hi SSpace team,

I believe TF-IDF is not calculated correctly. Am I right?

Why is the value in the following code divided by docTermCount[column]?
I think it should be just tf = value, since
tf stands for the term frequency in a given document, not for the probability of the term in the document (the division case). At least, the Wikipedia article referred to in the code says so.

class TfIdfTransform in edu.ucla.sspace.matrix:

public double transform(int row, int column, double value) {
    double tf = value / docTermCount[column];
    double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
    return tf * idf;
}

the same in the following method:
public double transform(int row, DoubleVector column) {
...
}

Cheers,
Luboš

davidjurgens · 2013-05-20T13:44:38Z

Hi Luboš,

You bring up a good point. Our implementation of TF-IDF is using the
term's probability in the document, rather than its frequency. Using the
probability discounts the impact of different sized documents where the
frequencies for a single term may differ significantly.
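
To make the effect of the division concrete, here is a minimal sketch comparing the raw count with the length-normalized value for the same term in a short and a long document. The class and method names are hypothetical and are not part of SSpace; the counts are made-up illustration values.

```java
// Illustrative sketch only: TfNormalizationDemo is a hypothetical helper,
// not a class from the SSpace library.
final class TfNormalizationDemo {

    /** Raw term frequency: the number of occurrences, unchanged. */
    static double rawTf(double count) {
        return count;
    }

    /** Length-normalized term frequency: occurrences divided by the document's total term count. */
    static double normalizedTf(double count, double docLength) {
        return count / docLength;
    }

    public static void main(String[] args) {
        // Made-up counts: the term fills the same share of both documents.
        double shortCount = 5, shortLength = 100;
        double longCount = 50, longLength = 1000;

        // Raw counts differ by a factor of ten (5.0 vs 50.0).
        System.out.println("raw:        " + rawTf(shortCount) + " vs " + rawTf(longCount));

        // Normalized values are the same for both documents (0.05 vs 0.05).
        System.out.println("normalized: " + normalizedTf(shortCount, shortLength)
                + " vs " + normalizedTf(longCount, longLength));
    }
}
```

With raw counts the long document would receive ten times the weight for this term, while the normalized variant treats the two documents alike.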

There are several ways to adjust the TF value (the Wikipedia page
mentions a few others, as well), but I don't think our docs mention
anywhere which one we're using. It would be pretty helpful to be able to
adjust the TF transform as well. I don't think this is too much work, so
if you want, I'm happy to try extending the code with a few options.

Thanks,
David

lubomirkrcmar · 2013-05-23T08:15:12Z

Hi David,

I understand, thanks. I am doing some experiments in Information Retrieval and yes, I would welcome more variants of the TF-IDF transform. It would also be nice to update the comment on the currently implemented variant to state that tf is normalized.

Manning, in Introduction to Information Retrieval (2008), describes "normalization" weighting in addition to "term frequency" and "document frequency" weighting: 1/u (pivoted normalization) corresponds to what you use in your TF-IDF implementation in SSpace, I believe.

Currently, in your project (listing term frequency weighting, document frequency weighting, and normalization, in that order), the NoTransform class corresponds to -tf, no, noNorm-, TfIdfTransform to -tf, idf, pivotNorm-, and LogEntropyTransform to -tf, entropy, noNorm-. There are also some other classes that I do not know well yet, such as PointwiseMutualnformationTransform and LogLikelihoodTransform.

I believe TfIdfNoNormTransform and LogIDfNoNormTransform might be interesting candidates for implementation. Square-root weighting of the term frequency might also be interesting (the Lucene library uses sqrt).
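
As a rough sketch of how such variants could be selected, the following standalone Java class switches between raw, length-normalized, logarithmic, and square-root term frequency weighting and combines each with the idf formula quoted earlier in this thread. Everything here is an assumption for illustration: TfIdfVariants, TfWeighting, and the method signatures are hypothetical and do not exist in SSpace, and the counts are passed as doubles so that the idf division is not truncated.

```java
// Illustrative sketch only: TfIdfVariants and TfWeighting are hypothetical
// names, not classes from the SSpace library.
final class TfIdfVariants {

    /** Term frequency weighting options discussed in this thread. */
    enum TfWeighting { RAW, LENGTH_NORMALIZED, LOG, SQRT }

    /** Weights a term's occurrence count within one document. */
    static double tf(TfWeighting weighting, double count, double docLength) {
        switch (weighting) {
            case RAW:
                return count;                                // plain count, no normalization
            case LENGTH_NORMALIZED:
                return count / docLength;                    // count divided by total terms in the document
            case LOG:
                return count > 0 ? 1 + Math.log(count) : 0;  // sublinear scaling, 0 for absent terms
            case SQRT:
                return Math.sqrt(count);                     // square-root damping
            default:
                throw new IllegalArgumentException("Unknown weighting: " + weighting);
        }
    }

    /** Same idf formula as the snippet quoted in this thread, in double arithmetic. */
    static double idf(double totalDocs, double docsWithTerm) {
        return Math.log(totalDocs / (docsWithTerm + 1));
    }

    /** Combined weight: chosen tf variant times idf. */
    static double tfIdf(TfWeighting weighting, double count, double docLength,
                        double totalDocs, double docsWithTerm) {
        return tf(weighting, count, docLength) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // Made-up counts: a term seen 4 times in a 200-term document,
        // present in 9 of 1000 documents.
        for (TfWeighting weighting : TfWeighting.values()) {
            System.out.println(weighting + ": " + tfIdf(weighting, 4, 200, 1000, 9));
        }
    }
}
```

Keeping the tf choice separate from the idf computation means a variant such as "tf, idf, noNorm" only needs a different enum value rather than a new copy of the whole transform.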

Thanks for SSpace!
Luboš
