finben-refactor is an adaptation of the lm-evaluation-harness framework tailored to evaluating language models on financial tasks. It focuses on models available through commercial APIs and on Hugging Face (HF) models, providing a streamlined workflow for financial-domain evaluations.
To evaluate a language model using finben-refactor, follow these steps:
- **Installation**: Clone the repository and install the necessary dependencies:

  ```bash
  git clone https://github.com/theFinAI/finben-refactor.git
  cd finben-refactor
  pip install -e .
  ```
- **Model Selection**: finben-refactor supports models accessible via commercial APIs and Hugging Face. Specify the model type and parameters using the `--model` and `--model_args` flags. For example, to evaluate a Hugging Face model:

  ```bash
  lm_eval --model hf \
    --model_args pretrained=your-model-name \
    --tasks your_task \
    --device cuda:0 \
    --batch_size 8 \
    --hf_hub_log_args "hub_results_org=your_org,results_repo_name=your_repo,push_results_to_hub=True,public_repo=True"
  ```
- **Task Selection**: Choose the financial task(s) you wish to evaluate. A list of supported tasks can be viewed with:

  ```bash
  lm_eval --tasks list
  ```
- **Running Evaluation**: Execute the evaluation by specifying the model and task(s) (a programmatic alternative is sketched after these steps):

  ```bash
  lm_eval --model hf \
    --model_args pretrained=your-model-name \
    --tasks your_task \
    --device cuda:0 \
    --batch_size 8 \
    --hf_hub_log_args "hub_results_org=your_org,results_repo_name=your_repo,push_results_to_hub=True,public_repo=True"
  ```
For detailed information on additional parameters and advanced configurations, refer to the lm-evaluation-harness documentation.
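Because finben-refactor builds on lm-evaluation-harness, evaluations can also be driven from Python through the upstream `simple_evaluate` entry point. The sketch below assumes this fork keeps that API unchanged; the model name and task name are placeholders.

```python
# Minimal sketch: programmatic evaluation via the upstream
# lm-evaluation-harness API (assumed unchanged in finben-refactor).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name",  # placeholder model
    tasks=["your_task"],                      # placeholder task name
    device="cuda:0",
    batch_size=8,
)

# Per-task metrics (e.g., acc, acc_norm, exact_match), keyed by task name.
print(results["results"])
```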
Adding a new financial task to finben-refactor involves creating a YAML configuration file that defines the task's parameters and behavior. The configuration tells the framework how to process the dataset and score the model's outputs.
- **Define the Task Configuration**: Create a YAML file (e.g., `your_task.yaml`) with the necessary fields. The Jinja templates (`{{query}}`, `{{support}}`, `{{question}}`) and the `gold` and `choices` references must match column names in your dataset. Below is an example configuration for a multiple-choice task:

  ```yaml
  dataset_name: default
  dataset_path: your-dataset-path
  doc_to_target: gold
  doc_to_text: '{{query}}'
  output_type: multiple_choice
  doc_to_choice: choices
  fewshot_split: train
  should_decontaminate: true
  doc_to_decontamination_query: "{{support}} {{question}}"
  metric_list:
    - metric: acc
      aggregation: mean
      higher_is_better: true
    - metric: acc_norm
      aggregation: mean
      higher_is_better: true
  metadata:
    version: '1.0'
  num_fewshot: 0
  task: YourTaskName
  test_split: test
  training_split: train
  ```
- **Implement Both Logits and Generation Tasks**: Ensure that both a multiple-choice (logits-based) and a generative variant of the task are defined. An example for a generation-based task (see the sketch after these steps for how the `score-first` filter post-processes raw generations):

  ```yaml
  dataset_name: default
  dataset_path: your-dataset-path
  output_type: generate_until
  doc_to_target: '{{answer}}'
  doc_to_text: '{{query}}'
  fewshot_split: train
  should_decontaminate: true
  doc_to_decontamination_query: "{{query}}"
  generation_kwargs:
    until:
      - "."
      - ","
    do_sample: false
    temperature: 0.0
    max_gen_toks: 50
  filter_list:
    - name: "score-first"
      filter:
        - function: "regex"
          regex_pattern: "(Φορολογία & Λογιστική|Επιχειρήσεις & Διοίκηση|Οικονομικά|Βιομηχανία|Τεχνολογία|Κυβέρνηση & Έλεγχοι)"
        - function: "take_first"
  metric_list:
    - metric: exact_match
      aggregation: mean
      higher_is_better: true
  metadata:
    version: '1.0'
  num_fewshot: 0
  task: YourGenTaskName
  test_split: test
  training_split: train
  ```
- **Add the Task to the Framework**: Place the task configuration file in `lm_eval/tasks/finben/`. The harness registers the task under the name given in the YAML `task:` field, so after adding the file it should appear in the output of `lm_eval --tasks list`.
- **Test Your Task**: Run the evaluation pipeline to verify that your task is properly configured:

  ```bash
  lm_eval --model hf \
    --model_args pretrained=your-model-name \
    --tasks YourTaskName \
    --device cuda:0 \
    --batch_size 8 \
    --hf_hub_log_args "hub_results_org=your_org,results_repo_name=your_repo,push_results_to_hub=True,public_repo=True"
  ```
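For intuition, the `score-first` filter chain in the generation config applies the regular expression to each raw generation and keeps only the first match, so the extracted category label is what `exact_match` compares against the target. A rough, standalone Python sketch of that behavior (not the harness's actual implementation; its no-match handling may differ):

```python
import re

# Same pattern as regex_pattern in the generation YAML (Greek category labels).
LABEL_PATTERN = re.compile(
    r"(Φορολογία & Λογιστική|Επιχειρήσεις & Διοίκηση|Οικονομικά|"
    r"Βιομηχανία|Τεχνολογία|Κυβέρνηση & Έλεγχοι)"
)

def score_first(generation: str) -> str:
    """Approximate the 'regex' + 'take_first' filter chain: return the first
    label found in the generation, or the raw text if nothing matches."""
    matches = LABEL_PATTERN.findall(generation)
    return matches[0] if matches else generation

# The extracted label is then scored with exact_match against doc_to_target.
print(score_first("Η κατηγορία είναι: Οικονομικά."))  # -> "Οικονομικά"
```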
To report evaluation results to the leaderboard, use the `aggregate.py` script included in this repository. It processes the evaluation results and updates the leaderboard.
- **Add a New Model Configuration**: Update the `MODEL_DICT` in `aggregate.py` with your model details:

  ```python
  MODEL_DICT["your-model-name"] = {
      "Architecture": "Transformer",
      "Hub License": "your-license",
      "#Params (B)": 7,
      "Available on the hub": True,
  }
  ```
- **Add a New Task Mapping**: Update the `METRIC_DICT` in `aggregate.py` to define task-specific parameters (see the sketch after these steps for how a random baseline is commonly used):

  ```python
  METRIC_DICT["YourTask"] = {"task_name": "YourTaskName", "random_baseline": 0.2}
  ```
- **Run the Aggregation Script**: Use `aggregate.py` to collect and process the evaluation results:

  ```bash
  python aggregate.py
  ```
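The exact aggregation logic is defined in `aggregate.py`. For illustration only, a common leaderboard convention, assumed here rather than confirmed for this repository, is to rescale each raw score against the task's random baseline so that chance performance maps to 0 and a perfect score to 1:

```python
# Illustrative only: one common way a random_baseline is used when
# aggregating leaderboard scores; aggregate.py may differ in detail.
METRIC_DICT = {
    "YourTask": {"task_name": "YourTaskName", "random_baseline": 0.2},
}

def normalize(raw_score: float, random_baseline: float) -> float:
    """Map chance-level performance to 0.0 and a perfect score to 1.0."""
    return max(0.0, (raw_score - random_baseline) / (1.0 - random_baseline))

raw_acc = 0.60  # hypothetical accuracy from an evaluation run
baseline = METRIC_DICT["YourTask"]["random_baseline"]
print(normalize(raw_acc, baseline))  # -> 0.5
```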
By following these steps, you can ensure that your evaluation results are properly processed and reflected on the leaderboard.