How to use other data? How to create vocab.txt file? #2
Comments
Pretty sure this is just a word2vec model; see here for training.
Hi, I am still unable to understand how vocab.txt is created and why many words are assigned the same integer value.
Yes, changing the path should work. The path should point to a directory that contains all the review files, which should be JSON files. The script for generating vocab.txt has not been released, but the format is quite simple. vocab.txt contains the word list used for indexing; it is not an embedding file. Each line of vocab.txt contains (1) the lowercased word and (2) its frequency, i.e., how many times it appears in the training text. The words are ranked by frequency, so common words come first and rare words come last. Best regards
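Since the original vocab script is not released, here is a minimal sketch of how a file in the described format could be produced. Several details are assumptions rather than the project's actual behaviour: the JSON field name `text`, plain whitespace tokenization, and a single space between the word and its frequency. Because the second column is a count and not an ID, many rare words naturally end up with the same number (for example 1).

```python
import json
from collections import Counter
from pathlib import Path

# Assumption: each review JSON file is an object holding the review under "text".
TEXT_FIELD = "text"


def iter_review_texts(review_dir):
    """Yield the review text from every .json file in review_dir."""
    for path in sorted(Path(review_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        if isinstance(record, dict) and record.get(TEXT_FIELD):
            yield record[TEXT_FIELD]


def count_words(texts):
    """Count lowercased, whitespace-separated tokens (tokenizer is an assumption)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def write_vocab(counts, out_path):
    """Write 'word frequency' lines, most frequent first; ties sorted alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(out_path, "w", encoding="utf-8") as f:
        for word, freq in ranked:
            # Assumption: a single space separates the two columns.
            f.write(f"{word} {freq}\n")
    return ranked
```

A possible call would be `write_vocab(count_words(iter_review_texts("path/to/reviews")), "vocab.txt")`, where the directory name is a placeholder. If the released vocab.txt uses a different separator or tokenizer, those two spots are the ones to adjust.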
Thank you.
This unrelated code may be worth cherry-picking from; see the Python code at https://github.com/johndpope/vocab-mashup. It's pretty impressive at smashing text together, which can help augment training sets.
Thanks, I will check.
Hi,
I do not see this problem on my local datasets. I guess it is mainly due to the small amount of training data. I had released only a small subset of the dataset to illustrate the data format expected by the current code. Since the default number of training epochs for the generator is set to 1, the generator learns nothing on this small dataset. I therefore increased the number of training epochs, which fixed the problem. I have updated the code, so please download it again. I have also released the whole dataset on Google Drive; you can download it via readme.md.
Thanks a lot.
How to use other data? How to create vocab.txt file?
The program crashed / stalled my PC after about 8 hours creating the training. However, it was using the CPU, so I tried to create a smaller data set.
I assumed https://github.com/lancopku/DPGAN/blob/master/review_generation_dataset/generate_review.py is what formats the data.
I've been trying to read this program, hoping it formats the data in some way, but there aren't any comments for a "non coder" to follow. I assumed I had to change the path? I'm on Linux.
generate_review.py
L52 : file_path = "F:\dataset\yelp_dataset\sorted_data"
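For Linux, the Windows-style value on that line would need to become a POSIX path to the directory holding the review JSON files. The sketch below is only an illustration: the directory under the home folder is a made-up location, and `list_review_files` is a hypothetical helper for checking the path, not part of generate_review.py.

```python
from pathlib import Path

# Hypothetical location; point this at the directory that holds the review files.
file_path = Path.home() / "dataset" / "yelp_dataset" / "sorted_data"


def list_review_files(directory):
    """Return the .json files in directory, failing early if the path is wrong."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Review directory not found: {directory}")
    return sorted(p for p in directory.iterdir() if p.suffix == ".json")
```

Calling `list_review_files(file_path)` before a long run is a cheap way to confirm the path is valid and to see how many files will be processed, which also helps when trimming the data down to a smaller set.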