TODO collections #50

Open
1 of 21 tasks
luarss opened this issue Aug 10, 2024 · 0 comments

Comments

@luarss
Collaborator

luarss commented Aug 10, 2024

  • CI Status badge
  • Use structured output library? e.g. Instructor
  • https://www.anthropic.com/news/contextual-retrieval
  • Keep prompts small: each prompt should do one thing, and do it well. Instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so that each prompt can be evaluated separately.
  • RAG evaluation: MRR (mean reciprocal rank), NDCG (normalized discounted cumulative gain).
  • RAG information density: if two documents are equally relevant, we should prefer one that is more concise and has fewer erroneous details.
  • Multistep workflow: Include reflection/CoT prompting (small tasks)
  • Increase output diversity by means other than raising the temperature. E.g. when the user asks for a solution to problem XX, keep a list of recent responses and tell the LLM, "do not suggest any responses from the following:"
  • Prompt caching: e.g. common functions. Use features like autocomplete/spelling correction/suggested queries to normalize user input and increase the cache hit rate.
  • Simple assertion-based unit tests.
  • Intent Classification: https://rasa.com/docs/rasa/next/llms/llm-intent/
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
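
For the "RAG evaluation: MRR, NDCG" item above, a minimal sketch of both metrics in plain Python. The function names and the input shape (relevance labels listed in ranked order) are assumptions for illustration, not part of this project. The ideal DCG here is computed from the retrieved list only, which is a simplification when relevant documents were not retrieved at all.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(relevance_lists: Sequence[Sequence[int]]) -> float:
    """MRR over queries. Each inner list holds relevance labels in ranked order."""
    if not relevance_lists:
        return 0.0
    total = 0.0
    for rels in relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                # Only the first relevant hit contributes for each query.
                total += 1.0 / rank
                break
    return total / len(relevance_lists)


def dcg_at_k(rels: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1) with 1-based rank i."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))


def ndcg_at_k(rels: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

Both functions take only the relevance labels, so they can be run against the output of any retriever once each retrieved chunk has been labeled as relevant or not.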

Evals

  • https://docs.confident-ai.com/docs/synthesizer-introduction#save-your-synthetic-dataset
  • Pairwise evaluation. How is this different from normal (pointwise) scoring? Several responses (e.g. from different LLMs) may receive the same score on G-Eval; pairwise evaluation forces a winner among them. E.g. prompt
  • Needle-in-a-haystack (NIAH) evals
  • If evals are reference-free, you can use them as a guardrail (i.e. do not show the output if it scores too low). E.g. summarization evals, where all you need is the input prompt (no summarization "reference" is required).
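
The reference-free guardrail idea above could be sketched roughly as follows. Everything here is hypothetical: `guarded_output`, the `0.5` threshold, and `toy_scorer` (a crude word-overlap score standing in for a real reference-free metric) are illustrative names and values, not an existing API.

```python
from typing import Callable, Optional


def guarded_output(
    source: str,
    candidate: str,
    scorer: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff; tune on real data
) -> Optional[str]:
    """Return the candidate only if its reference-free score clears the threshold."""
    score = scorer(source, candidate)
    return candidate if score >= threshold else None


def toy_scorer(source: str, summary: str) -> float:
    """Stand-in metric: fraction of summary words that also appear in the source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 0.0
    return sum(word in source_words for word in summary_words) / len(summary_words)
```

Because the scorer needs only the input and the candidate output, the same function used offline for evals can run at request time; a `None` result signals that the caller should fall back or withhold the answer.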

Guardrails

  • Use Gemini guardrails to identify harmful/offensive output and PII.
  • factual inconsistency guardrail link

Production

  • Development-prod skew: measure skew between LLM input/output pairs in development and in production, e.g. length of inputs/outputs, specific formatting requirements. For advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before).
  • Hold-out datasets for evals must be reflective of real user interactions.
  • Always log outputs. Store this in a separate DB.
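
As a starting point for the development-prod skew item above, a minimal sketch that compares mean text length between a development sample and a production sample. The function names and the `0.25` tolerance are assumptions for illustration; character length is only one of the signals mentioned, and embedding-based drift detection is not covered here.

```python
from statistics import mean
from typing import Sequence


def length_skew(dev_texts: Sequence[str], prod_texts: Sequence[str]) -> float:
    """Relative difference in mean character length, measured against dev."""
    dev_mean = mean(len(text) for text in dev_texts)
    prod_mean = mean(len(text) for text in prod_texts)
    if dev_mean == 0:
        raise ValueError("development sample contains only empty texts")
    return abs(prod_mean - dev_mean) / dev_mean


def has_length_skew(
    dev_texts: Sequence[str],
    prod_texts: Sequence[str],
    tolerance: float = 0.25,  # assumed threshold; calibrate on logged data
) -> bool:
    """Flag skew when production lengths deviate from dev by more than the tolerance."""
    return length_skew(dev_texts, prod_texts) > tolerance
```

The same check can be applied separately to inputs and to outputs, using the logged production pairs as `prod_texts`.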

Data flywheel

References

luarss changed the title from "CI Status badge" to "TODO collections" on Sep 10, 2024
luarss pinned this issue on Oct 1, 2024