
Implemented BLEU score, wrote unit tests and documentation for it. #1006

Open
wants to merge 1 commit into base: main
Conversation

kadamrahul18

BLEU Metric Implementation:

  1. Added a new BLEU class under sdks/python/src/opik/evaluation/metrics/heuristics/bleu.py.
  2. Implemented the BLEU algorithm to calculate scores based on n-gram precision between the generated text and a reference text.
  3. Included methods for handling both single sentences and corpus-level scoring.
  4. Implemented smoothing techniques (methods 0, 1, 2, 3 from the Chen & Cherry paper) to address zero n-gram matches.
  5. Added configuration options for n-gram order, smoothing method, and weights (a usage sketch follows this list).
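
To make the intended API concrete, here is a rough usage sketch. The parameter and method names below (n_grams, smoothing_method, weights, score) are illustrative guesses rather than the exact signatures in this PR; see bleu.py for the real interface.

from opik.evaluation.metrics.heuristics.bleu import BLEU

# Hypothetical configuration; names are illustrative, not the PR's exact API.
bleu = BLEU(
    n_grams=4,                          # maximum n-gram order (assumed parameter name)
    smoothing_method=1,                 # Chen & Cherry smoothing method (assumed)
    weights=[0.25, 0.25, 0.25, 0.25],   # uniform n-gram weights (assumed)
)

# Sentence-level scoring: candidate output against a single reference.
result = bleu.score(output="the cat sat on the mat",
                    reference="the cat is on the mat")
print(result)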

Unit Tests:

  1. Added comprehensive unit tests in sdks/python/tests/unit/evaluation/metrics/test_heuristics.py to validate the BLEU metric's behavior in various scenarios (a representative test sketch follows this list):
  • Exact match, partial match, and no match cases.
  • Empty candidate and reference strings.
  • Different smoothing methods.
  • Corpus-level scoring.
  • Edge cases and error handling.
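
As an illustration, an exact-match test could look roughly like this; the test name, the assumed score() method, and the result's value attribute are illustrative, and the actual assertions live in test_heuristics.py.

import pytest

from opik.evaluation.metrics.heuristics.bleu import BLEU

def test_bleu_exact_match():
    # Hypothetical sketch: identical candidate and reference should score 1.0.
    bleu = BLEU()
    result = bleu.score(output="the cat sat on the mat",
                        reference="the cat sat on the mat")
    assert result.value == pytest.approx(1.0)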

Integration with Evaluation Framework:

  1. Added the BLEU class to the __all__ list in sdks/python/src/opik/evaluation/metrics/heuristics/__init__.py to make it discoverable by the evaluate function, as sketched below.
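
The export would look roughly like this (existing entries elided; shown only to illustrate the change):

# sdks/python/src/opik/evaluation/metrics/heuristics/__init__.py (sketch)
from .bleu import BLEU

__all__ = [
    # ... existing heuristic metrics ...
    "BLEU",
]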

Documentation:

  1. Added a new documentation page for the BLEU metric (bleu.md) in the evaluation/metrics section of the documentation, detailing its purpose and usage.

Testing:

  1. Thorough unit tests have been included to cover different aspects of the BLEU metric implementation, including edge cases and different smoothing methods.
  2. All tests in the Python SDK, including the new tests for the BLEU metric, pass successfully when running pytest tests/ from the sdks/python directory.
  3. pre-commit run --all-files has been executed successfully from the sdks/python directory, ensuring code style and formatting consistency.

Request for Review:

Please review the following aspects of this pull request:

  1. Correctness of the BLEU metric implementation, including n-gram precision, brevity penalty, and smoothing (the standard definition is reproduced after this list for reference).
  2. Clarity and completeness of the unit tests.
  3. Thoroughness of the documentation.
  4. Adherence to Opik's coding standards and best practices.
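
For reference, the standard BLEU definition from Papineni et al. (2002) that the implementation is expected to match:

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where p_n is the modified n-gram precision, w_n the n-gram weights (uniform by default, w_n = 1/N), c the candidate length, and r the effective reference length.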

Any feedback or suggestions for improvement are greatly appreciated.

@kadamrahul18 kadamrahul18 requested review from a team as code owners January 9, 2025 04:11
@alexkuzmik
Collaborator

Hi @kadamrahul18!
I can see that the code is based on the nltk library implementation, which is one of the most popular libraries when it comes to BLEU score calculation.
I would prefer to just use NLTK directly. This is likely not the last heuristic metric we will add, and I don't think populating the code base with non-trivial mathematical calculations is the right approach when stable, specialized tools for that already exist.
What I suggest is something like this:

try:
    import nltk  # we won't add nltk as a package dependency, but we can add it to a separate requirements file for unit tests
except ImportError:
    nltk = None

...

class BLEU:
    def __init__(self, ...):
        if nltk is None:
            raise ImportError(
                "`nltk` library is required for BLEU score calculation, "
                "please install it via `pip install nltk`"
            )

Under the hood, the metric implementation can use nltk.translate.bleu_score.sentence_bleu or nltk.translate.bleu_score.corpus_bleu.
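
A minimal sketch of what that wrapping could look like, assuming whitespace tokenization and method1 smoothing purely for illustration:

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def bleu_sentence_score(output: str, reference: str) -> float:
    # nltk expects tokenized input: a hypothesis token list and a list of
    # tokenized references. str.split is used here only for brevity.
    hypothesis = output.split()
    references = [reference.split()]
    smoothing = SmoothingFunction().method1  # Chen & Cherry smoothing method 1
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)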

That way we'll be able to have a stable implementation and avoid a big chunk of mathematical code (which is almost always hard to read and easy to break :) )
