This repo contains a simple script for generating machine-learning instruction datasets with open-source Hugging Face models. It requires either a Hugging Face PRO subscription or a free-tier Groq Cloud account; the Groq free tier allows roughly 14k requests per day (30 per minute).
pip install -r requirements.txt
- Set up your Hugging Face account.
- Subscribe to Hugging Face PRO and get the API key.
- Copy config.ini-example to config.ini, then edit config.ini and add your API key.
- To use a different model, edit config.ini and change the model name. You can get the endpoint from the Hugging Face model hub by clicking "Deploy model" and copying the "Inference Endpoint (Serverless)" URL.
- Set up your groq.com cloud account.
- Copy config.ini-example to config.ini, then edit config.ini and add your Groq API key.
- To use a different model, edit config.ini and change the model name. You can find the available models in the Groq Cloud documentation (https://console.groq.com/docs/models).
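As a rough illustration of the config.ini step above, the sketch below reads an API key and model name with Python's standard configparser module. The section name [api] and the keys api_key, model and endpoint are hypothetical placeholders; check config.ini-example for the names this repo actually uses.

```python
import configparser

# Hypothetical config layout; the real section and key names are defined
# in config.ini-example and may differ.
EXAMPLE_CONFIG = """
[api]
api_key = YOUR_API_KEY
model = your-model-name
endpoint = https://example.invalid/v1
"""


def load_settings(text):
    """Parse INI text and return the (assumed) API settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["api"]
    return {
        "api_key": section["api_key"],
        "model": section["model"],
        # fallback= returns None when the key is absent instead of raising.
        "endpoint": section.get("endpoint", fallback=None),
    }


settings = load_settings(EXAMPLE_CONFIG)
print(settings["model"])
```

To read from disk instead of a string, `parser.read("config.ini")` can replace the `read_string` call.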
- From the root of your project, run python src/main.py.
- Your data must contain a column named input with the prompt input.
- The generation process is parallel but may take some time depending on data size, job count, and timeout.
- Each response is immediately included in the output file, so you don't have to wait for the process to complete.
- Please give the repo a star if you like it. Thanks!
- Enjoy!
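The run notes above (an input column, parallel jobs, and responses written as soon as they arrive) can be sketched roughly as follows. This is not the code in src/main.py: the CSV input and output format, the fake_generate stand-in for the model call, and the jobs parameter are all assumptions made for illustration, and timeouts are not handled here.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_generate(prompt):
    """Hypothetical stand-in for the real model call; just echoes the prompt."""
    return "response to: " + prompt


def run(rows, out_file, generate=fake_generate, jobs=4):
    """Generate a response per row and write each result as soon as it finishes."""
    if rows and "input" not in rows[0]:
        raise ValueError("data must contain a column named 'input'")
    writer = csv.writer(out_file)
    writer.writerow(["input", "output"])
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Map each future back to the prompt that produced it.
        futures = {pool.submit(generate, row["input"]): row["input"] for row in rows}
        for future in as_completed(futures):
            # Rows are written from the main thread only, so no lock is needed.
            writer.writerow([futures[future], future.result()])
            out_file.flush()  # make the row visible before the run ends


# Assumed CSV input with the required "input" column.
data = list(csv.DictReader(io.StringIO("input\nWhat is 2+2?\nName a color.\n")))
buffer = io.StringIO()
run(data, buffer)
print(buffer.getvalue())
```

Because results are written in completion order, the output rows may not match the input order; keeping the prompt next to each response makes them easy to match up afterwards.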