
Distributional semantics using contexts rather than documents #50

Open
jiangfeng1124 opened this issue Apr 2, 2014 · 5 comments

@jiangfeng1124

Dear developers,

I found that VsmMain computes a word-document matrix, i.e., the co-occurrences of words and documents. Could I instead generate distributional representations from contexts within a fixed-size window (say, 10 words), and use PMI rather than tf-idf as the entries of the word-context matrix?

Thanks,
Jiang

@davidjurgens
Collaborator

Hi Jiang,

You'll want to use the GwsMain class, which uses the GenericWordSpace class, instead of the VsmMain class. I'm not sure if we have out-of-the-box support for PMI, though.

Thanks,
David
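A minimal post-processing sketch for the PMI weighting discussed above, assuming the word-context co-occurrence counts have already been read into a plain counts[i][j] array; the method and variable names are illustrative only and are not part of the S-Space API.

// Hypothetical helper: turn raw word-context co-occurrence counts into PMI.
// counts[i][j] = number of times word i occurred with context feature j
// within the chosen window; cells with zero counts are left at 0.
static double[][] toPmi(double[][] counts) {
    int rows = counts.length, cols = counts[0].length;
    double total = 0;
    double[] rowSums = new double[rows];
    double[] colSums = new double[cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            rowSums[i] += counts[i][j];
            colSums[j] += counts[i][j];
            total += counts[i][j];
        }
    }
    double[][] pmi = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (counts[i][j] == 0)
                continue;
            double pxy = counts[i][j] / total;       // joint probability
            double px = rowSums[i] / total;          // word marginal
            double py = colSums[j] / total;          // context marginal
            pmi[i][j] = Math.log(pxy / (px * py));   // PMI = log p(x,y) / (p(x) p(y))
        }
    }
    return pmi;
}

Many applications also clip negative values to zero (positive PMI), which tends to work better for sparse word-context matrices.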


@jiangfeng1124
Author

GwsMain looks good, thank you.
However, I found a problem and I am not sure whether it is a bug:
This is what I get when running GwsMain:

Command:

java edu.ucla.sspace.mains.GwsMain -d data/wiki.sample data/output-sample/ -t 6 -o sparse_text -F include=data/wiki_vocab_sample.lst;exclude=data/english-stop-words-large.txt

What I get:

...
|0,994.0,1,2457.0,2,796.0,3,19110.0,4,1510.0,5,1990.0,6,1256.0,7,18830.0,...
...

It seems that a representation is generated for an empty word (note the empty token before the '|' delimiter above). Could you help check this?

Thanks,
Jiang

@davidjurgens
Collaborator

Hi Jiang,

Yes, this looks like a bug. The boolean logic that filters this case was missing parentheses, so a token that the internal filtering should have discarded escaped into the output; that is the empty-word vector you found. I've fixed the issue in the latest commit and pushed it to the trunk. Thanks for reporting it!

Thanks,
David
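The failure mode described above is the usual operator-precedence pitfall: && binds tighter than ||. The snippet below is only a hypothetical illustration; the names and conditions are made up and do not reflect the actual S-Space code.

// Buggy: the empty-token check only guards the second alternative, so if
// the first condition accepts the token, an empty token still gets a
// vector written to the output.
if (isIncluded(token) || !isExcluded(token) && !token.isEmpty())
    writeVector(token);

// Intended grouping: the empty-token check applies to both alternatives.
if ((isIncluded(token) || !isExcluded(token)) && !token.isEmpty())
    writeVector(token);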


@jiangfeng1124
Author

Hi David,

I would like to ask a little more. I realized that the GwsMain class outputs raw counts in the context vectors. Could the results be further processed with LSA or another dimensionality-reduction algorithm, so that I can get a low-dimensional representation?

Thanks,
Jiang

@davidjurgens
Copy link
Collaborator

Hi Jiang,

So you want to take the output of GwsMain and then use it as input to LSA? It might be easier to just run GwsMain and then LSAMain on the same dataset, though Gws is a term-by-term algorithm, so it would interpret the context differently than LSA does.

If all you want to do is run SVD on the GWS data, that's currently not supported, but I could probably add it fairly quickly. :)

Thanks,
David
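Until such support exists, one workaround is to load the exported vectors and run a truncated SVD outside of S-Space. Below is a minimal sketch using Apache Commons Math; it assumes the sparse_text output has already been parsed into a dense counts[][] array (the parsing step and the names here are not part of S-Space).

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// counts: one row per word, one column per context feature.
// k: the number of dimensions to keep (must not exceed min(rows, cols)).
static RealMatrix reduce(double[][] counts, int k) {
    RealMatrix m = new Array2DRowRealMatrix(counts, false);
    SingularValueDecomposition svd = new SingularValueDecomposition(m);
    // Keep the first k left singular vectors scaled by their singular
    // values; each row is then a k-dimensional word representation.
    RealMatrix uk = svd.getU().getSubMatrix(
        0, m.getRowDimension() - 1, 0, k - 1);
    RealMatrix sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
    return uk.multiply(sk);
}

Note that this is a dense, in-memory SVD, so it is only practical for modest vocabulary sizes; a large word-context matrix would need a sparse SVD implementation instead.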

