[draft] Room IO for agent v1.0 #1404
base: dev-1.0
Conversation
class AudioContent(BaseModel):
    type: Literal["audio_content"] = Field(default="audio_content")
    frame: list[rtc.AudioFrame]
    transcript: Optional[str] = None

    # temporary fix for pydantic before rtc.AudioFrame is supported
    model_config = ConfigDict(arbitrary_types_allowed=True)
sounds good, will fix asap
temp_tasks = []
tasks = []
for task in aw:
    if not isinstance(task, asyncio.Task):
        task = asyncio.create_task(task)
        temp_tasks.append(task)
    tasks.append(task)

await asyncio.wait(
    [*tasks, self._interrupt_fut], return_when=asyncio.FIRST_COMPLETED  # was: [*aw, ...]
)
for task in temp_tasks:
    if not task.done():
        task.cancel()
that's pretty annoying, will check if I can handle that differently (I'm on python 3.9 so it worked)
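For context, this is the asyncio behavior change the comment refers to: passing bare coroutines to asyncio.wait() was deprecated since Python 3.8 and removed in 3.11, so they have to be wrapped in tasks first. A minimal, self-contained sketch (not from the PR) showing the same wrap-then-cancel pattern:

import asyncio


async def work(delay: float) -> str:
    await asyncio.sleep(delay)
    return "done"


async def main() -> None:
    # On Python 3.11+, asyncio.wait() rejects bare coroutines,
    # so wrap them in tasks first (which is what the snippet above does).
    tasks = [asyncio.create_task(work(0.1)), asyncio.create_task(work(1.0))]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel whatever didn't finish, mirroring the temp_tasks cleanup above.
    for task in pending:
        task.cancel()


asyncio.run(main())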
from .io import AudioSink, AudioStream, VideoStream


class RoomAudioSink(AudioSink):
I think we should create another class that does both at the same time (like ChatCLI).
We can either create another class or merge both. Not sure about the naming, but I was thinking of RoomIO.
Did you mean setting the input and output of the agent when calling RoomIO(agent)? I think this actually makes it less flexible if the user wants to use different input and output sources: for example, for an avatar the input is RoomInput but the output could be DataStreamAudioSink. wdyt?
I was thinking that if you don't want a Room as the input, users could simply set only agent.output.audio and leave agent.input.audio empty (but they would create only one class, which is RoomIO).
Maybe this is handled directly inside a run method (like the ChatCLI).
Something like
await room_io.run(input=False)
I still think RoomInput and RoomOutput could be two classes, for these reasons:
- they both have audio and video sinks; with a single RoomIO there would be four options: room_io.run(input_audio=True, input_video=False, output_audio=True, output_video=False)
- the input and output can be different, as in the avatar use case: the agent uses RoomInput and a data stream output, while the avatar worker uses a data stream input and RoomOutput
- we can make the room input and output the defaults for the agent, so if the user wants to change one of them it's just agent.output.audio = xxx (see the sketch below)
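A rough sketch of the wiring described above, assuming RoomInput/RoomOutput expose an .audio attribute and the agent exposes input.audio/output.audio slots as in the examples later in this thread; DataStreamAudioSink is only assumed here for the avatar case, this is not the final API:

# Default: both sides bound to the room (hypothetical wiring).
room_input = RoomInput(ctx.room)
room_output = RoomOutput(ctx.room)
agent.input.audio = room_input.audio
agent.output.audio = room_output.audio

# Avatar case: keep the room input, but redirect the agent's audio output to a
# data-stream sink consumed by the avatar worker (DataStreamAudioSink is assumed).
agent.output.audio = DataStreamAudioSink(ctx.room)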
ah yep, that makes sense!
One question regarding the output: do we want a RoomOutput class including both audio and video sinks, optionally with AV sync enabled? For now I think just a RoomAudioSink is okay, as most agent use cases just publish audio to the room. @davidzhao @theomonnom
I think we can just do a single audio sink for now.. in spirit of unblocking current functionality
Agree with dz, though I think it would be cleaner to have only two classes: RoomInput and RoomOutput.
It may be harder for users to understand if they have to initialize the RoomAudioSink themselves.
I can make a RoomOutput where RoomOutput().audio is an AudioSink.
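A minimal sketch of what that wrapper could look like, reusing the RoomAudioSink constructor from the example further down; the start() method and property shape are assumptions, not the final API:

from livekit import rtc

# RoomAudioSink and AudioSink come from the PR's io module (see imports above).

class RoomOutput:
    """Groups the room-facing sinks so users only deal with one output class (sketch)."""

    def __init__(self, room: rtc.Room, sample_rate: int = 24000, num_channels: int = 1) -> None:
        self._audio = RoomAudioSink(room, sample_rate=sample_rate, num_channels=num_channels)

    async def start(self) -> None:
        # assumed: RoomAudioSink exposes an async start() that publishes the track
        await self._audio.start()

    @property
    def audio(self) -> AudioSink:
        # exposed as an AudioSink so it can be assigned to agent.output.audio
        return self._audio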
)

self._participant_ready = asyncio.Event()
self._room.on("participant_connected", self._on_participant_connected)
Let's handle room reconnection. If you get a "reconnected" event, you have to republish the local track.
We will also have to re-find the track we want to subscribe to.
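A rough sketch of what that could look like on the output side, using the room's "reconnected" event; the attribute names are hypothetical and the actual PR may structure this differently. The input side would similarly re-resolve the participant/track it was subscribed to in the same handler.

import asyncio
from typing import Optional

from livekit import rtc


class _ReconnectAwareAudioSink:
    """Sketch only: republish the local track whenever the room reconnects."""

    def __init__(self, room: rtc.Room) -> None:
        self._room = room
        self._track: Optional[rtc.LocalAudioTrack] = None
        self._publication: Optional[rtc.LocalTrackPublication] = None
        self._publish_options = rtc.TrackPublishOptions(
            source=rtc.TrackSource.SOURCE_MICROPHONE
        )
        # after a full reconnection the previously published track is gone,
        # so it has to be published again
        self._room.on("reconnected", self._on_reconnected)

    def _on_reconnected(self) -> None:
        if self._track is not None:
            asyncio.create_task(self._republish())

    async def _republish(self) -> None:
        self._publication = await self._room.local_participant.publish_track(
            self._track, self._publish_options
        )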
def __init__(
    self,
    room: rtc.Room,
    participant_identity: Optional[str] = None,
Let's add TrackPublishOptions
This looks good. Though in my opinion this should just be the default mode, it's still nice to show how to use custom IO in examples.
    rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
)

# is this necessary?
this is necessary to ensure the agent isn't speaking into the void..
though it's not necessary to block start.. but it needs to happen before the first frame is captured
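One way to do that without blocking start(), assuming the sink keeps an asyncio.Event (similar to the _participant_ready event in the input snippet above) that gets set once there is someone in the room to hear the agent; the override itself and the base-class call are just a sketch:

async def capture_frame(self, frame: rtc.AudioFrame) -> None:
    # don't block start(); only the first captured frame waits until
    # a remote participant is actually present
    await self._participant_ready.wait()
    await super().capture_frame(frame)  # assumed AudioSink bookkeeping
    await self._audio_source.capture_frame(frame)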
if self._pushed_duration is not None:
    # Notify that playback finished
    self.on_playback_finished(
should this wait for audio_source.wait_for_playout before notifying?
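If so, it could look roughly like this in the flush/end-of-segment path; rtc.AudioSource.wait_for_playout() is a real SDK call, while the surrounding attribute names come from the snippet above and the on_playback_finished signature is an assumption:

async def _wait_and_notify(self, interrupted: bool = False) -> None:
    if self._pushed_duration is None:
        return
    if not interrupted:
        # wait until the buffered audio has actually been played out
        # by the room before reporting playback as finished
        await self._audio_source.wait_for_playout()
    self.on_playback_finished(
        playback_position=self._pushed_duration, interrupted=interrupted
    )
    self._pushed_duration = None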
return self._participant


@property
def audio(self) -> AudioStream | None:
neat.. lets us swap sources without disrupting downstream
"""Whether to subscribe to audio""" | ||
video_enabled: bool = False | ||
"""Whether to subscribe to video""" | ||
audio_sample_rate: int = 16000 |
Ideally, we should automatically detect the sample_rate.
I think the right thing to do would be to add a sample_rate field to the AudioSink base class,
and the PipelineAgent should resample before using the sink.
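A rough sketch of that proposal: the sink advertises its sample_rate and the pipeline resamples before capturing. rtc.AudioResampler is the SDK resampler; the sink interface assumed here (a sample_rate attribute and an async capture_frame) is a guess at the new AudioSink base class, not its actual definition:

from typing import AsyncIterator, Optional

from livekit import rtc


async def forward_to_sink(frames: AsyncIterator[rtc.AudioFrame], sink) -> None:
    """Push frames into the sink, resampling to sink.sample_rate when needed (sketch)."""
    resampler: Optional[rtc.AudioResampler] = None
    async for frame in frames:
        sink_rate: Optional[int] = getattr(sink, "sample_rate", None)
        if sink_rate is not None and sink_rate != frame.sample_rate and resampler is None:
            resampler = rtc.AudioResampler(
                input_rate=frame.sample_rate,
                output_rate=sink_rate,
                num_channels=frame.num_channels,
            )
        if resampler is None:
            await sink.capture_frame(frame)
        else:
            # the resampler buffers internally and returns zero or more frames per push
            for resampled in resampler.push(frame):
                await sink.capture_frame(resampled)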
One question regarding this: for room input we are using rtc.AudioStream.from_participant. Are the audio sample rate and num_channels actually for a resampler, i.e. is the output audio frame resampled to the values in the option?
For the output AudioSink I agree that the sample rate can be a field in the base class for auto resampling.
The rtc.AudioStream isn't resampling; when setting the sample_rate + num_channels parameters, it expects to receive frames matching those options when calling capture_frame.
How can we know the sample rate and num_channels of the frames read from a remote track? Or is it always the same value, like 16000 and 1?
I think my question is whether the sample rate and number of channels provided to rtc.AudioStream.from_participant should match the track's sample rate, or whether the audio frames will be automatically resampled to these values before being put into the stream.
@classmethod
def from_participant(
cls,
*,
participant: Participant,
track_source: TrackSource.ValueType,
loop: Optional[asyncio.AbstractEventLoop] = None,
capacity: int = 0,
sample_rate: int = 48000,
num_channels: int = 1,
) -> AudioStream:
Oh sorry, I got confused, so the rtc.AudioStream does indeed resample (the default is always 48000 & 1 channel).
But the rtc.AudioSource doesn't (and this is where we should introduce a new sample_rate field on the baseclass)
You can directly use the frames coming from the rtc.AudioStream and inject them into the PipelineAgent. We already automatically resample in the STT for this use case (though it isn't the case for the realtime API, which currently requires 24000; I will create a ticket for it).
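For reference, that direct path looks something like this. The from_participant signature is the rtc API shown above (the stream resamples incoming audio to the requested rate, 48000/mono by default), and iterating the stream yields events with a .frame attribute; the participant variable and how the agent consumes the frames are simplified assumptions here:

audio_stream = rtc.AudioStream.from_participant(
    participant=participant,
    track_source=rtc.TrackSource.SOURCE_MICROPHONE,
    sample_rate=48000,  # the stream resamples incoming audio to this rate
    num_channels=1,
)


async def agent_audio_frames():
    # yields rtc.AudioFrame objects straight from the room; the STT resamples as needed
    async for event in audio_stream:
        yield event.frame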
examples/roomio_worker.py
audio_output = RoomAudioSink(ctx.room, sample_rate=24000, num_channels=1)

agent.input.audio = room_input.audio
agent.output.audio = audio_output
Maybe something like this to be consistent:
room_output = RoomOutput(ctx.room, sample_rate=24000, num_channels=1)
agent.input.audio = room_input.audio
agent.output.audio = room_output.audio
Yes, this is what I am thinking
return

# TODO: handle reconnected, do we need to cancel and re-publish? seems no
self._publication = await self._room.local_participant.publish_track(
After reconnection, we indeed need to republish the track