TFIDF calculated not correctly? #41
Comments
Hi Luboš,
You bring up a good point. Our implementation of TF-IDF is using the [...] There are several ways to adjust the TF value (the Wikipedia page [...]
Thanks,
On Mon, May 20, 2013 at 3:13 PM, lubomirkrcmar [email protected]:
Hi David,
I understand, thanks. I am running some experiments in Information Retrieval and yes, I would welcome more variants of the TfIdf transform. It would also be nice to update the comment on the currently implemented variant so that it states that tf is normalized.
Manning, in Introduction to Information Retrieval (2008), describes "normalization weighting" in addition to "term frequency" and "document frequency" weighting. I believe 1/u (pivoted normalization) corresponds to what you use in your TfIdf implementation in SSpace.
Written as (term frequency weighting, document frequency weighting, normalization weighting), the classes in your project currently correspond to the following: NoTransform is (tf, no, noNorm), TfIdfTransform is (tf, idf, pivotNorm), and LogEntropyTransform is (tf, entropy, noNorm). There are also some other classes that I do not know well yet, such as PointwiseMutualInformationTransform and LogLikelihoodTransform.
I believe TfIdfNoNormTransform and LogIDfNoNormTransform might be interesting candidates for implementation. Square-root weighting of term frequency may also be interesting (the Lucene library uses sqrt).
Thanks for SSpace!
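To make the (term frequency, document frequency, normalization) triple above concrete, here is a minimal sketch of how the three parts could be combined. None of this is SSpace code: the class name WeightingSketch, the constants, and the method signatures are hypothetical, the idf formula is copied from the transform method quoted in this issue, and reading "LogIDf" as 1 + log(tf) times idf is only an assumption about what that name would mean.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch, not part of SSpace.
class WeightingSketch {
    // Term frequency weightings applied to the raw count.
    static final DoubleUnaryOperator RAW_TF = tf -> tf;
    static final DoubleUnaryOperator SQRT_TF = Math::sqrt;
    static final DoubleUnaryOperator LOG_TF = tf -> tf > 0 ? 1 + Math.log(tf) : 0;

    // Document frequency weighting, same formula as in the quoted method.
    static double idf(double totalDocCount, double termDocCount) {
        return Math.log(totalDocCount / (termDocCount + 1));
    }

    // weight = tfWeighting(count) * dfWeight / norm
    // Pass dfWeight = 1 for "no" document frequency weighting
    // and norm = 1 for "noNorm".
    static double weight(DoubleUnaryOperator tfWeighting, double count,
                         double dfWeight, double norm) {
        return tfWeighting.applyAsDouble(count) * dfWeight / norm;
    }

    public static void main(String[] args) {
        double idf = idf(10, 4);  // 10 documents, 4 contain the term
        // (tf, idf, length norm): what the current TfIdfTransform appears to do
        System.out.println(weight(RAW_TF, 3, idf, 100));
        // (tf, idf, noNorm): the proposed TfIdfNoNormTransform
        System.out.println(weight(RAW_TF, 3, idf, 1));
        // (sqrt tf, idf, noNorm): square-root term frequency
        System.out.println(weight(SQRT_TF, 3, idf, 1));
    }
}
```

With this layout, each proposed variant is just a different choice of the three arguments, so separate classes would only differ in which combination they fix.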
Hi SSpace team,
I believe TF-IDF is not calculated correctly. Am I right?
Why is the value in the following code divided by docTermCount[column]?
I think it should be just tf = value, since tf stands for the frequency of a term in a given document, not for the relative frequency (probability) of the term in that document, which is what the division produces. At least, the Wikipedia page referred to in the code says so.
Class TfIdfTransform in edu.ucla.sspace.matrix:
    public double transform(int row, int column, double value) {
        double tf = value / docTermCount[column];
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
    }
The same in the following method:
    public double transform(int row, DoubleVector column) {
        ...
    }
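To illustrate the difference I mean, here is a small standalone sketch that computes both variants for one term. It is not SSpace code: the class TfIdfSketch and its method names are made up for this example, and all arguments are doubles, since the types of the count fields are not shown in the snippet above.

```java
// Hypothetical sketch, not part of SSpace.
class TfIdfSketch {
    // idf as in the quoted method: log(N / (df + 1)).
    static double idf(double totalDocCount, double termDocCount) {
        return Math.log(totalDocCount / (termDocCount + 1));
    }

    // Variant in the current code: tf is the count divided by the
    // number of terms in the document.
    static double normalizedTfIdf(double count, double docTermCount,
                                  double totalDocCount, double termDocCount) {
        return (count / docTermCount) * idf(totalDocCount, termDocCount);
    }

    // Variant I expected: tf is the raw count.
    static double rawTfIdf(double count, double totalDocCount,
                           double termDocCount) {
        return count * idf(totalDocCount, termDocCount);
    }

    public static void main(String[] args) {
        // The term occurs 3 times in a 100-term document;
        // 10 documents in total, 4 of which contain the term.
        System.out.println(normalizedTfIdf(3, 100, 10, 4));
        System.out.println(rawTfIdf(3, 10, 4));
    }
}
```

The two results differ exactly by the factor docTermCount, so the normalized variant gives lower weights to terms in longer documents.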
Cheers,
Luboš