diff --git a/README.md b/README.md
index f7892fca..9216ba75 100644
--- a/README.md
+++ b/README.md
@@ -69,27 +69,27 @@ results = evaluate(args)

 1. setup a separate server with [GenAIComps](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/lm-eval)

-```
-# build cpu docker
-docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
+ ```
+ # build cpu docker
+ docker build -f Dockerfile.cpu -t opea/lm-eval:latest .

-# start the server
-docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
-```
+ # start the server
+ docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
+ ```

 2. evaluate the model

-- set `base_url`, `tokenizer` and `--model genai-hf`
+ - set `base_url`, `tokenizer` and `--model genai-hf`

-```
-cd evals/evaluation/lm_evaluation_harness/examples
+ ```
+ cd evals/evaluation/lm_evaluation_harness/examples

-python main.py \
- --model genai-hf \
- --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
- --tasks "lambada_openai" \
- --batch_size 2
-```
+ python main.py \
+ --model genai-hf \
+ --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
+ --tasks "lambada_openai" \
+ --batch_size 2
+ ```

 ### bigcode-evaluation-harness
 For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
@@ -104,6 +104,7 @@ python main.py \
 --batch_size 10 \
 --allow_code_execution
 ```
+
 #### function call usage
 ```python
 from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
diff --git a/evals/benchmark/stresscli/README.md b/evals/benchmark/stresscli/README.md
index a2f1bfa1..2cc7fc98 100644
--- a/evals/benchmark/stresscli/README.md
+++ b/evals/benchmark/stresscli/README.md
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 ### Usage

 ```
-$ ./stresscli.py --help
+./stresscli.py --help
 Usage: stresscli.py [OPTIONS] COMMAND [ARGS]...

 StressCLI - A command line tool for stress testing OPEA workloads.
@@ -60,7 +60,7 @@ Commands:

 More detail options:
 ```
-$ ./stresscli.py load-test --help
+./stresscli.py load-test --help
 Usage: stresscli.py load-test [OPTIONS]

 Do load test
@@ -74,12 +74,12 @@ Options:

 You can generate the report for test cases by:
 ```
-$ ./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
+./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
 ```

 More detail options:
 ```
-$ ./stresscli.py report --help
+./stresscli.py report --help
 Usage: stresscli.py report [OPTIONS]

 Print the test report
@@ -101,7 +101,7 @@ You can dump the current testing profile by
 ```
 More detail options:
 ```
-$ ./stresscli.py dump --help
+./stresscli.py dump --help
 Usage: stresscli.py dump [OPTIONS]

 Dump the test spec
@@ -115,12 +115,12 @@ Options:

 You can validate if the current K8s and workloads deployment comply with the test spec by:
 ```
-$ ./stresscli.py validate --file testspec.yaml
+./stresscli.py validate --file testspec.yaml
 ```

 More detail options:
 ```
-$ ./stresscli.py validate --help
+./stresscli.py validate --help
 Usage: stresscli.py validate [OPTIONS]

 Validate against the test spec
diff --git a/evals/metrics/bleu/README.md b/evals/metrics/bleu/README.md
index d92598f6..cd6985f0 100644
--- a/evals/metrics/bleu/README.md
+++ b/evals/metrics/bleu/README.md
@@ -1,28 +1,5 @@
----
-title: BLEU
-emoji: 🤗
-colorFrom: blue
-colorTo: red
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
-tags:
-- evaluate
-- metric
-description: >-
- BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
- Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
- – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
-
- Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
- Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
- Neither intelligibility nor grammatical correctness are not taken into account.
----
 # Metric Card for BLEU

-
 ## Metric Description
 BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
 ```

 ### Inputs
+
 - **predictions** (`list` of `str`s): Translations to score.
 - **references** (`list` of `list`s of `str`s): references for each translation.
-- ** tokenizer** : approach used for standardizing `predictions` and `references`.
+- **tokenizer** : approach used for standardizing `predictions` and `references`.
 The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
 This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).

 The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index).
+
 - **max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`.
 - **smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`.

 ### Output Values
+
 - **bleu** (`float`): bleu score
 - **precisions** (`list` of `float`s): geometric mean of n-gram precisions,
 - **brevity_penalty** (`float`): brevity penalty,