doc: fix headings and indents #109

Merged 1 commit on Sep 6, 2024
31 changes: 16 additions & 15 deletions README.md
@@ -69,27 +69,27 @@ results = evaluate(args)

1. setup a separate server with [GenAIComps](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/lm-eval)

-```
-# build cpu docker
-docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
+```
+# build cpu docker
+docker build -f Dockerfile.cpu -t opea/lm-eval:latest .

-# start the server
-docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
-```
+# start the server
+docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
+```

2. evaluate the model

-- set `base_url`, `tokenizer` and `--model genai-hf`
+- set `base_url`, `tokenizer` and `--model genai-hf`

-```
-cd evals/evaluation/lm_evaluation_harness/examples
+```
+cd evals/evaluation/lm_evaluation_harness/examples

-python main.py \
---model genai-hf \
---model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
---tasks "lambada_openai" \
---batch_size 2
-```
+python main.py \
+--model genai-hf \
+--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
+--tasks "lambada_openai" \
+--batch_size 2
+```

### bigcode-evaluation-harness
For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
@@ -104,6 +104,7 @@ python main.py \
--batch_size 10 \
--allow_code_execution
```

#### function call usage
```python
from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
14 changes: 7 additions & 7 deletions evals/benchmark/stresscli/README.md
@@ -35,7 +35,7 @@ pip install -r requirements.txt
### Usage

```
-$ ./stresscli.py --help
+./stresscli.py --help
Usage: stresscli.py [OPTIONS] COMMAND [ARGS]...

StressCLI - A command line tool for stress testing OPEA workloads.
@@ -60,7 +60,7 @@ Commands:

More detail options:
```
-$ ./stresscli.py load-test --help
+./stresscli.py load-test --help
Usage: stresscli.py load-test [OPTIONS]

Do load test
@@ -74,12 +74,12 @@ Options:

You can generate the report for test cases by:
```
-$ ./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
+./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
```

More detail options:
```
-$ ./stresscli.py report --help
+./stresscli.py report --help
Usage: stresscli.py report [OPTIONS]

Print the test report
@@ -101,7 +101,7 @@ You can dump the current testing profile by
```
More detail options:
```
-$ ./stresscli.py dump --help
+./stresscli.py dump --help
Usage: stresscli.py dump [OPTIONS]

Dump the test spec
@@ -115,12 +115,12 @@ Options:

You can validate if the current K8s and workloads deployment comply with the test spec by:
```
-$ ./stresscli.py validate --file testspec.yaml
+./stresscli.py validate --file testspec.yaml
```

More detail options:
```
-$ ./stresscli.py validate --help
+./stresscli.py validate --help
Usage: stresscli.py validate [OPTIONS]

Validate against the test spec
28 changes: 4 additions & 24 deletions evals/metrics/bleu/README.md
@@ -1,28 +1,5 @@
----
-title: BLEU
-emoji: 🤗
-colorFrom: blue
-colorTo: red
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
-tags:
-- evaluate
-- metric
-description: >-
-BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
-Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
-– this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
-
-Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
-Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
-Neither intelligibility nor grammatical correctness are not taken into account.
----
-
# Metric Card for BLEU


## Metric Description
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
```

### Inputs

- **predictions** (`list` of `str`s): Translations to score.
- **references** (`list` of `list`s of `str`s): references for each translation.
-- ** tokenizer** : approach used for standardizing `predictions` and `references`.
+- **tokenizer** : approach used for standardizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).

The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index).

- **max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`.
- **smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`.
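
The inputs above can be illustrated with a small self-contained sketch. This is not the implementation behind this metric: `simple_bleu` and `_ngram_counts` are hypothetical helpers written for illustration, and the sketch assumes a common corpus-level formulation (clipped n-gram counts, the shortest reference length for the brevity penalty, and add-one smoothing when `smooth=True`). It only shows how `tokenizer`, `max_order` and `smooth` could influence the result.

```python
import math
from collections import Counter


def _ngram_counts(tokens, n):
    """Count every n-gram of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(predictions, references, tokenizer=str.split, max_order=4, smooth=False):
    """Illustrative corpus-level BLEU (hypothetical helper, not the metric's own code)."""
    matches = [0] * max_order  # clipped n-gram matches per order
    totals = [0] * max_order   # n-grams present in the predictions per order
    pred_len = ref_len = 0

    for pred, refs in zip(predictions, references):
        pred_tokens = tokenizer(pred)
        ref_tokens = [tokenizer(r) for r in refs]
        pred_len += len(pred_tokens)
        # Assumption: the shortest reference defines the reference length.
        ref_len += min(len(r) for r in ref_tokens)

        for n in range(1, max_order + 1):
            pred_counts = _ngram_counts(pred_tokens, n)
            # Per n-gram, keep the maximum count seen in any reference.
            max_ref_counts = Counter()
            for r in ref_tokens:
                max_ref_counts |= _ngram_counts(r, n)
            clipped = pred_counts & max_ref_counts
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(pred_counts.values())

    if smooth:
        # Assumption: add-one smoothing on numerator and denominator.
        precisions = [(m + 1) / (t + 1) for m, t in zip(matches, totals)]
    else:
        precisions = [m / t if t > 0 else 0.0 for m, t in zip(matches, totals)]

    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0

    if pred_len == 0:
        brevity_penalty = 0.0
    elif pred_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / pred_len)

    return {
        "bleu": geo_mean * brevity_penalty,
        "precisions": precisions,
        "brevity_penalty": brevity_penalty,
    }


# Any callable that maps a string to a list of tokens can act as the tokenizer.
result = simple_bleu(
    predictions=["The cat sat on the mat"],
    references=[["the cat sat on the mat"]],
    tokenizer=lambda s: s.lower().split(),
)
print(result["bleu"], result["brevity_penalty"])
```

In this sketch, a prediction shorter than `max_order` tokens scores zero unless `smooth=True` or a smaller `max_order` is passed, because at least one n-gram order then has no matches.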

### Output Values

- **bleu** (`float`): bleu score
- **precisions** (`list` of `float`s): geometric mean of n-gram precisions,
- **brevity_penalty** (`float`): brevity penalty,