How to use other data? How to create vocab.txt file? #2

Open
wrapperband opened this issue Apr 24, 2018 · 9 comments

Comments

@wrapperband

How to use other data? How to create vocab.txt file?

The program crashed / stalled my PC after about 8 hours creating the training. However, it was using the CPU, so I tried to create a smaller data set.

I assumed that https://github.com/lancopku/DPGAN/blob/master/review_generation_dataset/generate_review.py is what formats the data.

I've been trying to read this program. I was / am hoping it formats the data in some way, but there aren't any comments for a "non coder" to follow. I assumed I had to change the path? I'm on Linux.

generate_review.py
L52 : file_path = "F:\dataset\yelp_dataset\sorted_data"

@johndpope

johndpope commented Jun 4, 2018

pretty sure this is just a word2vec model - see here for training

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/

https://github.com/dav/word2vec

@akhileshkumargangwar

Hi, I am still unable to understand how vocab.txt is created and why many words are assigned the same integer value.

@jklj077

jklj077 commented Aug 20, 2018

@wrapperband

Yes, changing the path should work. The path should point to a directory that contains all the review files, which should be JSON files.
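As a rough sanity check before running generate_review.py on Linux, the directory can be inspected first. The sketch below is not part of the DPGAN code; the helper name `list_review_files`, the example path, and the assumption that the review files carry a `.json` extension are all illustrative.

```python
from pathlib import Path


def list_review_files(data_dir):
    """Return the sorted JSON files found directly inside data_dir.

    Assumption: review files end in ".json" (not confirmed by the repo).
    """
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".json"
    )


# Hypothetical Linux location; replace it with wherever the review files live.
# file_path = "/home/me/dataset/yelp_dataset/sorted_data"
# print(len(list_review_files(file_path)), "review files found")
```

If this reports zero files, the path in generate_review.py most likely still points at the wrong directory.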

The script for generating vocab.txt is not released, but the format is quite simple. vocab.txt contains the word list used for indexing. It is not an embedding file. Each line of vocab.txt contains (1) the lowercased word and (2) its frequency in the training text, i.e., how many times it appears there. The words are ranked by frequency, so the common words are at the front and the rare words are at the back.
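Since the original script is not released, the following is only a minimal sketch of how a file in the described format could be produced. Whitespace tokenization, a single space between word and count, alphabetical tie-breaking, and the names `build_vocab`, `write_vocab`, and `train.txt` are all assumptions; any special tokens the model may expect are not handled here.

```python
from collections import Counter


def build_vocab(lines):
    """Count lowercased, whitespace-separated tokens.

    Returns (word, count) pairs with the most frequent word first.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Descending frequency; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def write_vocab(pairs, path):
    """Write one "word count" pair per line (separator is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs:
            f.write(f"{word} {count}\n")


# Hypothetical usage, assuming the training text is in train.txt:
# with open("train.txt", encoding="utf-8") as f:
#     write_vocab(build_vocab(f), "vocab.txt")
```

It is worth comparing the output against the vocab.txt shipped with the repository to confirm the separator and tokenization before training on it.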

Best regards

@akhileshkumargangwar

Thank You.

@johndpope

This unrelated code may be worth cherry-picking from; see the Python code at https://github.com/johndpope/vocab-mashup. It's pretty impressive how it smashes text together. It can help augment training sets.

@akhileshkumargangwar

Thanks I will check

@akhileshkumargangwar

Hi,
This DP-GAN code is showing lots of errors. The files in discriminator_test/negative/*.txt do not contain generated reviews; the reviews are empty. I want to learn the flow of a GAN by debugging, but fixing the errors is taking a lot of time. Is there any updated code? I also tried SeqGAN, but it uses synthetic data. There are some errors I am unable to fix, so please help me.
Thanks

@jingjingxupku
Collaborator

jingjingxupku commented Aug 23, 2018

I do not see this problem on my local datasets. I guess it is mainly caused by the small training data: I had only released a small subset of the dataset to illustrate the data format. Since the default number of generator training epochs is set to 1, the generator learns nothing on this small dataset. I have therefore increased the number of training epochs, which fixed the problem. I have updated the code, so please download it again. I have also released the whole dataset on Google Drive; you can download it via the link in readme.md.
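To confirm after a re-run that the generated reviews are no longer empty, something like the following could be used. This is just a diagnostic sketch, not part of the DPGAN code; the helper name `find_empty_files` is hypothetical, and the directory is taken from the report above.

```python
from pathlib import Path


def find_empty_files(directory, pattern="*.txt"):
    """Return matching files whose content is empty or whitespace only."""
    empty = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            empty.append(path)
    return empty


# Directory name taken from the report above; adjust to the local checkout.
# for p in find_empty_files("discriminator_test/negative"):
#     print("empty:", p)
```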

@akhileshkumargangwar

Thanks a lot.
