
Distributional semantics using contexts rather than documents #50

Open
jiangfeng1124 opened this issue Apr 2, 2014 · 5 comments

@jiangfeng1124

Dear developers,

I found that VsmMain computes a word-document matrix, i.e., the co-occurrences of words and documents. Could I instead generate distributional representations from contexts within a fixed-size window (say, 10 words), and use PMI rather than tf-idf as the entries of the word-context matrix?

Thanks,
Jiang

@davidjurgens
Collaborator

Hi Jiang,

You'll want to use the GwsMain class, which uses the GenericWordSpace class, instead of the VsmMain class. I'm not sure if we have out-of-the-box support for PMI, though.

Thanks,
David
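A minimal post-processing sketch for the PMI weighting discussed above, assuming the word-context co-occurrence counts have already been read into a plain counts[i][j] array; the method and variable names are illustrative only and are not part of the S-Space API.

// Hypothetical helper: turn raw word-context co-occurrence counts into PMI.
// counts[i][j] = number of times word i occurred with context feature j
// within the chosen window; cells with zero counts are left at 0.
static double[][] toPmi(double[][] counts) {
    int rows = counts.length, cols = counts[0].length;
    double total = 0;
    double[] rowSums = new double[rows];
    double[] colSums = new double[cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            rowSums[i] += counts[i][j];
            colSums[j] += counts[i][j];
            total += counts[i][j];
        }
    }
    double[][] pmi = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (counts[i][j] == 0)
                continue;
            double pxy = counts[i][j] / total;       // joint probability
            double px = rowSums[i] / total;          // word marginal
            double py = colSums[j] / total;          // context marginal
            pmi[i][j] = Math.log(pxy / (px * py));   // PMI = log p(x,y) / (p(x) p(y))
        }
    }
    return pmi;
}

Many applications also clip negative values to zero (positive PMI), which tends to work better for sparse word-context matrices.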


@jiangfeng1124
Author

GwsMain looks good, thank you.
However, I found a problem and I am not sure whether it is a bug:
This is what I get when running GwsMain:

Command:

java edu.ucla.sspace.mains.GwsMain -d data/wiki.sample data/output-sample/ -t 6 -o sparse_text -F include=data/wiki_vocab_sample.lst;exclude=data/english-stop-words-large.txt

What I get:

...
|0,994.0,1,2457.0,2,796.0,3,19110.0,4,1510.0,5,1990.0,6,1256.0,7,18830.0,...
...

It seems that a representation is generated for an empty word (note the empty token before the '|' delimiter above). Could you help check this?

Thanks,
Jiang

@davidjurgens
Collaborator

Hi Jiang,

Yes, this looks like a bug. The boolean logic that filters this case was missing parentheses, so a token that the internal filtering should have discarded escaped into the output; that is the empty-word vector you found. I've fixed the issue in the latest commit and pushed it to the trunk. Thanks for reporting it!

Thanks,
David
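The failure mode described above is the usual operator-precedence pitfall: && binds tighter than ||. The snippet below is only a hypothetical illustration; the names and conditions are made up and do not reflect the actual S-Space code.

// Buggy: the empty-token check only guards the second alternative, so if
// the first condition accepts the token, an empty token still gets a
// vector written to the output.
if (isIncluded(token) || !isExcluded(token) && !token.isEmpty())
    writeVector(token);

// Intended grouping: the empty-token check applies to both alternatives.
if ((isIncluded(token) || !isExcluded(token)) && !token.isEmpty())
    writeVector(token);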


@jiangfeng1124
Author

Hi David,

I would like to ask a little more. I realized that the GwsMain class outputs raw counts in the context vectors. Could the results be further processed with LSA or another dimensionality-reduction algorithm, so that I can get a low-dimensional representation?

Thanks,
Jiang

@davidjurgens
Copy link
Collaborator

Hi Jiang,

So you want to take the output of GwsMain and then use it as input to LSA? It might be easier to just run GwsMain and then LSAMain on the same dataset, though Gws is a term-by-term algorithm, so it would interpret the context differently than LSA does.

If all you want to do is run SVD on the GWS data, that's currently not supported, but I could probably add it fairly quickly. :)

Thanks,
David
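Until such support exists, one workaround is to load the exported vectors and run a truncated SVD outside of S-Space. Below is a minimal sketch using Apache Commons Math; it assumes the sparse_text output has already been parsed into a dense counts[][] array (the parsing step and the names here are not part of S-Space).

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// counts: one row per word, one column per context feature.
// k: the number of dimensions to keep (must not exceed min(rows, cols)).
static RealMatrix reduce(double[][] counts, int k) {
    RealMatrix m = new Array2DRowRealMatrix(counts, false);
    SingularValueDecomposition svd = new SingularValueDecomposition(m);
    // Keep the first k left singular vectors scaled by their singular
    // values; each row is then a k-dimensional word representation.
    RealMatrix uk = svd.getU().getSubMatrix(
        0, m.getRowDimension() - 1, 0, k - 1);
    RealMatrix sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
    return uk.multiply(sk);
}

Note that this is a dense, in-memory SVD, so it is only practical for modest vocabulary sizes; a large word-context matrix would need a sparse SVD implementation instead.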

