doc: fix headings and indents (#109)
* fix heading levels
* remove $ on command examples
* fix markdown coding errors: indenting and spaces in emphasis

Signed-off-by: David B. Kinder <[email protected]>
dbkinder authored Sep 6, 2024
1 parent 626d269 commit 65a0a5b
Showing 3 changed files with 27 additions and 46 deletions.
31 changes: 16 additions & 15 deletions README.md
@@ -69,27 +69,27 @@ results = evaluate(args)

1. setup a separate server with [GenAIComps](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/lm-eval)

```
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
```
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
```
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
```

2. evaluate the model

- set `base_url`, `tokenizer` and `--model genai-hf`
- set `base_url`, `tokenizer` and `--model genai-hf`

```
cd evals/evaluation/lm_evaluation_harness/examples
```
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
```
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
```
### bigcode-evaluation-harness
For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
@@ -104,6 +104,7 @@ python main.py \
--batch_size 10 \
--allow_code_execution
```

#### function call usage
```python
from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
14 changes: 7 additions & 7 deletions evals/benchmark/stresscli/README.md
@@ -35,7 +35,7 @@ pip install -r requirements.txt
### Usage

```
$ ./stresscli.py --help
./stresscli.py --help
Usage: stresscli.py [OPTIONS] COMMAND [ARGS]...
StressCLI - A command line tool for stress testing OPEA workloads.
@@ -60,7 +60,7 @@ Commands:

More detail options:
```
$ ./stresscli.py load-test --help
./stresscli.py load-test --help
Usage: stresscli.py load-test [OPTIONS]
Do load test
@@ -74,12 +74,12 @@ Options:

You can generate the report for test cases by:
```
$ ./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
```

More detail options:
```
$ ./stresscli.py report --help
./stresscli.py report --help
Usage: stresscli.py report [OPTIONS]
Print the test report
@@ -101,7 +101,7 @@ You can dump the current testing profile by
```
More detail options:
```
$ ./stresscli.py dump --help
./stresscli.py dump --help
Usage: stresscli.py dump [OPTIONS]
Dump the test spec
@@ -115,12 +115,12 @@ Options:

You can validate if the current K8s and workloads deployment comply with the test spec by:
```
$ ./stresscli.py validate --file testspec.yaml
./stresscli.py validate --file testspec.yaml
```

More detail options:
```
$ ./stresscli.py validate --help
./stresscli.py validate --help
Usage: stresscli.py validate [OPTIONS]
Validate against the test spec
28 changes: 4 additions & 24 deletions evals/metrics/bleu/README.md
@@ -1,28 +1,5 @@
---
title: BLEU
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
– this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
Neither intelligibility nor grammatical correctness are not taken into account.
---

# Metric Card for BLEU


## Metric Description
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
```

### Inputs

- **predictions** (`list` of `str`s): Translations to score.
- **references** (`list` of `list`s of `str`s): references for each translation.
- ** tokenizer** : approach used for standardizing `predictions` and `references`.
- **tokenizer** : approach used for standardizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).

The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index).

- **max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`.
- **smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`.
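
As a rough illustration of how the inputs above fit together, here is a minimal pure-Python sketch of corpus-level BLEU. It is not the implementation behind this metric. The names `bleu_sketch`, `whitespace_tokenize`, and `ngram_counts` are hypothetical, the whitespace tokenizer is a simplified stand-in for `tokenizer_13a`, and both the add-one smoothing and the closest-reference-length rule for the brevity penalty are assumptions.

```python
import math
from collections import Counter


def whitespace_tokenize(text):
    """Hypothetical stand-in tokenizer: split on whitespace only."""
    return text.split()


def ngram_counts(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def bleu_sketch(predictions, references, tokenizer=whitespace_tokenize,
                max_order=4, smooth=False):
    """Illustrative corpus-level BLEU; each prediction needs at least one reference."""
    matches = [0] * max_order    # clipped n-gram matches, per order
    possible = [0] * max_order   # n-grams present in the predictions, per order
    pred_len = 0
    ref_len = 0

    for pred, refs in zip(predictions, references):
        pred_tokens = tokenizer(pred)
        ref_tokens = [tokenizer(r) for r in refs]
        pred_len += len(pred_tokens)
        # Assumption: use the reference length closest to the prediction
        # length, with ties going to the shorter reference.
        ref_len += min((abs(len(r) - len(pred_tokens)), len(r))
                       for r in ref_tokens)[1]

        # Clip each n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for r in ref_tokens:
            max_ref_counts |= ngram_counts(r, max_order)
        clipped = ngram_counts(pred_tokens, max_order) & max_ref_counts

        for ngram, count in clipped.items():
            matches[len(ngram) - 1] += count
        for order in range(1, max_order + 1):
            possible[order - 1] += max(len(pred_tokens) - order + 1, 0)

    if smooth:
        # Assumption: add-one smoothing on every n-gram order.
        precisions = [(m + 1.0) / (p + 1.0) for m, p in zip(matches, possible)]
    else:
        precisions = [m / p if p > 0 else 0.0 for m, p in zip(matches, possible)]

    # Geometric mean of the precisions; zero if any order has no match.
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0

    # Penalize predictions that are shorter than the references.
    if pred_len == 0:
        brevity_penalty = 0.0
    elif pred_len >= ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - ref_len / pred_len)

    return {
        "bleu": geo_mean * brevity_penalty,
        "precisions": precisions,
        "brevity_penalty": brevity_penalty,
    }


predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"], ["foo bar foobar"]]
print(bleu_sketch(predictions, references)["bleu"])  # 1.0
```

Any callable that maps a string to a list of tokens can be passed as `tokenizer`, for example `tokenizer=str.split`. Lowering `max_order` is useful for very short sentences, where no 4-gram exists and the unsmoothed score would otherwise drop to zero.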

### Output Values

- **bleu** (`float`): bleu score
- **precisions** (`list` of `float`s): geometric mean of n-gram precisions,
- **brevity_penalty** (`float`): brevity penalty,