Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry (Phase 1) #296

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from
Draft

Telemetry (Phase 1) #296

wants to merge 14 commits into from

Conversation

marklysze
Copy link
Collaborator

@marklysze marklysze commented Dec 27, 2024

-- WORK IN PROGRESS --

Why are these changes needed?

Telemetry for AG2 workflows, heavily based on OpenTelemetry.

Implementation of a core telemetry library that provides a protocol for building telemetry providers.

Telemetry providers included:

  • OpenTelemetry - send your telemetry data to an OpenTelemetry collector.
  • Cost Tracker - tracks events with associated costs
  • Mermaid diagram - (experimental) creates mermaid sequence diagram based on conversation

Telemetry providers are created by implementing TelemetryProvider.

Phases:

  1. (this) Establish core telemetry architecture and data capture, along with an OpenTelemetry provider and cost tracking provider
  2. Add additional providers, e.g. log to file, log to database
  3. Add more advanced real-time GUI provider

To use telemetry:

  1. Instantiate an InstrumentationManager
  2. Instantiate telemetry providers and register them with your InstrumentationManager
  3. Use a context manager to encapsulate the workflow you want in your telemetry
instrumentation_manager = InstrumentationManager()
otel_provider = OpenTelemetryProvider(
    protocol="http",
    collector_endpoint="http://192.168.0.1:4318/v1/traces",
    service_name="ag2"
)
instrumentation_manager.register_provider(otel_provider)

cost_provider = CostTrackerProvider()
instrumentation_manager.register_provider(cost_provider)

mermaid_provider = MermaidDiagramProvider()
instrumentation_manager.register_provider(mermaid_provider)

with telemetry_context(instrumentation_manager, "My Group Chat", {"workflow.id": "123"}) as telemetry:
    ... AG2 code ...

    print(f"Cost summary: {cost_provider.total_cost:.6f} ({len(cost_provider.cost_history)} cost events)")

    mermaid_markup = mermaid_provider.generate_sequence_diagram()
    print(mermaid_markup)

I am using Jaeger as my OpenTelemetry collector and viewer (https://www.jaegertracing.io/)

Still to do:

  • Feedback and more feedback
  • Run all tests to ensure nothing is broken
  • Test more scenarios
  • Correctly identify Nested Chat spans
  • Correctly identify Swarm ON_CONDITION based transitions

Ideas on future telemetry providers:

  • File logger
  • Database logger
  • GUI event feed (for debug and live visualisation)

Related issue number

Closes #103
Closes #173
Closes #298
Closes #435

Checks

@marklysze
Copy link
Collaborator Author

marklysze commented Dec 27, 2024

Sample program:

from autogen import ConversableAgent
from autogen.telemetry import InstrumentationManager, telemetry_context, OpenTelemetryProvider, CostTrackerProvider, MermaidDiagramProvider

# Initialize
instrumentation_manager = InstrumentationManager()
otel_provider = OpenTelemetryProvider(
    protocol="http",
    collector_endpoint="http://192.168.0.1:4318/v1/traces",
    service_name="ag2"
)
instrumentation_manager.register_provider(otel_provider)

cost_provider = CostTrackerProvider()
instrumentation_manager.register_provider(cost_provider)

mermaid_provider = MermaidDiagramProvider()
instrumentation_manager.register_provider(mermaid_provider)

with telemetry_context(instrumentation_manager, "My Workflow", {"workflow.id": "123"}) as telemetry:

    jack = ConversableAgent(
        "Jack",
        llm_config=llm_config,
        system_message="Your name is Jack and you are a comedian in a two-person comedy show. Always say: FINISH",
        is_termination_msg=lambda x: True if "FINISH" in x.get("content", "") else False,
        human_input_mode="NEVER",
    )
    emma = ConversableAgent(
        "Emma",
        llm_config=llm_config,
        system_message="Your name is Emma and you are a comedian in two-person comedy show. Say the word FINISH ONLY AFTER you've heard 2 of Jack's jokes.",
        is_termination_msg=lambda x: True if "FINISH" in x.get("content", "") else False,
        human_input_mode="NEVER",
    )

    # Initiate the chat
    chat_result = jack.initiate_chat(
        emma,
        message="Emma, tell me a joke about goldfish and peanut butter.",
        max_turns=4,
    )

    print(f"Cost summary: {cost_provider.total_cost:.6f} ({len(cost_provider.cost_history)} cost events)")

    print(mermaid_provider.generate_sequence_diagram())

Example output:

Cost summary: 0.000069 (2 LLM calls)
sequenceDiagram
    participant Jack
    participant Emma
    Jack->>Emma: Emma, tell me a joke about goldfish and peanut but...
    Emma->>Jack: Sure, here it goes! Why did the goldfish break up ...
    Jack->>Emma: Alright, Emma! Why did the goldfish start a podcas...

image

image

@marklysze
Copy link
Collaborator Author

What are we capturing? Detail to be expanded upon, but here are the spans and event types so far.

class SpanKind(Enum):
    """Enumeration of span kinds

    Spans represent a unit of work within a trace. Typically spans have child spans and events.

    Descriptions:
        WORKFLOW: Main workflow span
        CHATS: Multiple chats, e.g. initiate_chats
        CHAT: Single chat, e.g. initiate_chat
        NESTED_CHAT: A nested chat, e.g. _summary_from_nested_chats
        GROUP_CHAT: A groupchat, e.g. run_chat
        ROUND: A round within a chat (where chats have max_round or max_turn)
        REPLY: Agent replying to another agent
        REPLY_FUNCTION: Functions executed during a reply, such as generate_oai_reply and check_termination_and_human_reply
        SUMMARY: Summarization of a chat
        REASONING: Agent reasoning step, for advanced agents like ReasoningAgent and CaptainAgent
        GROUPCHAT_SELECT_SPEAKER: GroupChat speaker selection (covers all selection methods)
        SWARM_ON_CONDITION: Swarm-specific, ON_CONDITION hand off
    """


class EventKind(Enum):
    """Enumeration of span event kinds

    Events represent a singular point in time within a span, capturing specific moments or actions.

    Descriptions:
        AGENT_TRANSITION: Transition moved from one agent to another
        AGENT_CREATION: Creation of an Agent
        GROUPCHAT_CREATION: Creation of a GroupChat
        LLM_CREATE: LLM execution
        AGENT_SEND_MSG: Agent sending a message to another agent
        TOOL_EXECUTION: Tool or Function execution
        COST: Cost event
        SWARM_TRANSITION: Swarm-specific, transition reason (e.g. ON_CONDITION)
        CONSOLE_PRINT: Console output (TBD)
    """

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request telemetry
Projects
None yet
1 participant