TFIDF calculated not correctly? #41

Open
lubomirkrcmar opened this issue May 20, 2013 · 2 comments

@lubomirkrcmar

Hi SSpace team,

I believe TF-IDF is not calculated correctly. Am I right?

Why is the value in the following code divided by docTermCount[column]?
I think it should be just tf = value, since
tf stands for the term frequency in a given document, not for the probability of the term in that document (which is what the division produces). At least, the Wikipedia article referred to in the code says so.

Class TfIdfTransform in edu.ucla.sspace.matrix:

    public double transform(int row, int column, double value) {
        double tf = value / docTermCount[column];
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
    }

The same applies in the following method:

    public double transform(int row, DoubleVector column) {
        ...
    }
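
For reference, the first quoted method can be read as the formula below. This is only a reading based on the field names, which the thread does not confirm: it assumes docTermCount[column] is the total number of term occurrences in document d, termDocCount[row] is the number of documents containing term t (written n_t), and totalDocCount is the number of documents N. The symbol f_{t,d} stands for `value`, and the logarithm is natural because `Math.log` is.

```latex
\mathrm{tfidf}(t, d) \;=\; \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \cdot \ln\!\left(\frac{N}{n_t + 1}\right)
```

Under the same assumptions, the variant asked about here would simply drop the denominator of the first factor, so that tf = f_{t,d}.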

Cheers,
Luboš

@davidjurgens
Collaborator

Hi Luboš,

You bring up a good point. Our implementation of TF-IDF uses the
term's probability in the document, rather than its raw frequency. Using the
probability discounts the impact of differently sized documents, where the
frequencies for a single term may differ significantly.
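
A minimal sketch of that effect, using made-up counts rather than anything from S-Space: the class and method names below are hypothetical, and the two documents are invented so that the term makes up the same share of each.

```java
public class TfVariantsDemo {

    // Raw term frequency: the number of times the term occurs in the document.
    static double rawTf(double count) {
        return count;
    }

    // Length-normalized term frequency: the count divided by the total
    // number of term occurrences in the document.
    static double normalizedTf(double count, double docLength) {
        return count / docLength;
    }

    public static void main(String[] args) {
        // Hypothetical documents: the term is 5% of each one.
        double shortCount = 5, shortLength = 100;
        double longCount = 50, longLength = 1000;

        // Raw counts differ by a factor of ten purely because of document size.
        System.out.println("raw:        " + rawTf(shortCount)
                + " vs " + rawTf(longCount));

        // Normalized values agree, since the term's share is the same.
        System.out.println("normalized: " + normalizedTf(shortCount, shortLength)
                + " vs " + normalizedTf(longCount, longLength));
    }
}
```

With raw counts the longer document would receive a ten times larger TF-IDF weight for this term; with the normalized form both documents receive the same weight.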

There are several ways to adjust the TF value (the Wikipedia page
mentions a few others, as well), but I don't think our docs mention
anywhere which one we're using. It would be pretty helpful to be able to
adjust the TF transform as well. I don't think this is too much work, so
if you want, I'm happy to try extending the code with a few options.

Thanks,
David


@lubomirkrcmar
Author

Hi David,

I understand, thanks. I am running some experiments in information retrieval, and yes, I would welcome more variants of the TF-IDF transform. It would also be nice to update the comment on the currently implemented variant to say that tf is normalized.

Besides "term frequency" and "document frequency" weighting, Manning in Introduction to Information Retrieval (2008) also writes about "normalization" weighting: 1/u (pivoted normalization) corresponds to what you use in the TF-IDF implementation in S-Space, I believe.

Currently, described as a triple (term frequency weighting, document frequency weighting, normalization), the NoTransform class in your project corresponds to (tf, none, no normalization), TfIdfTransform to (tf, idf, pivoted normalization), and LogEntropyTransform to (tf, entropy, no normalization). There are also some other classes that I do not know well yet, such as PointwiseMutualInformationTransform and LogLikelihoodTransform.

I believe TfIdfNoNormTransform and LogIDfNoNormTransform might be interesting candidates for implementation. Square-root weighting of the term frequency might also be interesting (the Lucene library uses sqrt).
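
The variants discussed in this thread could be sketched roughly as follows. Everything here is hypothetical: the enum, its constant names, and the method signatures are invented for illustration and are not part of S-Space. The LOG option assumes the common sublinear form 1 + ln(count), which is only one possible reading of "LogIDf". The idf term copies the formula quoted at the top of the issue, with counts held as doubles so the division is never an integer division.

```java
public class TfWeightingSketch {

    /** Hypothetical term-frequency options; none of these names exist in S-Space. */
    enum TfWeighting {
        RAW, LENGTH_NORMALIZED, SQRT, LOG;

        // count: occurrences of the term in the document
        // docLength: total term occurrences in the document
        double apply(double count, double docLength) {
            switch (this) {
                case RAW:
                    return count;
                case LENGTH_NORMALIZED:
                    return count / docLength;
                case SQRT:
                    return Math.sqrt(count);
                case LOG:
                    // Assumed sublinear scaling; zero counts stay zero.
                    return count > 0 ? 1 + Math.log(count) : 0;
                default:
                    throw new AssertionError(this);
            }
        }
    }

    // Same idf as in the quoted transform: ln(N / (n_t + 1)).
    static double tfIdf(TfWeighting weighting, double count, double docLength,
                        double totalDocs, double docsWithTerm) {
        double tf = weighting.apply(count, docLength);
        double idf = Math.log(totalDocs / (docsWithTerm + 1));
        return tf * idf;
    }

    public static void main(String[] args) {
        // Made-up numbers: 4 occurrences in a 100-term document,
        // 1000 documents in total, 9 of them containing the term.
        for (TfWeighting w : TfWeighting.values()) {
            System.out.println(w + ": " + tfIdf(w, 4, 100, 1000, 9));
        }
    }
}
```

Keeping the tf choice as a parameter, rather than one class per combination, is just one possible design; separate transform classes as proposed above would work equally well.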

Thanks for SSpace!
Luboš
