[TTS Avatar][Live][Python] Update sample code to add option for client to communicate with server through WebSocket, and do STT on server side (#2597)

* [TalkingAvatar] Add sample code for TTS talking avatar real-time API

* sample codes for batch avatar synthesis

* Address repository check failure

* update

* [Avatar] Update real time avatar sample code to support multi-lingual

* [avatar] update real time avatar chat sample to receive GPT response in streaming mode

* [Live Avatar] update chat sample to make some refinements

* [TTS Avatar] Update real-time sample to support (1) non-continuous recognition mode, (2) a button to stop speaking, (3) typing a query without speech

* [TTS Avatar] Update real time avatar sample to support auto-reconnect

* Don't reset message history when re-connecting

* [talking avatar] update real time sample to support using a cached local video for idle status, to help save customer cost

* Update chat.html and README.md

* Update batch avatar sample to use mp4 as the default format, to avoid the slow playback speed shown by default with VP9

* A minor refinement

* Some refinement

* Some bug fixes

* Refine the response receiving logic for AOAI streaming mode, to make it more robust

* [Talking Avatar] update real-time sample code to log result id (turn id) for ease of debugging

* [Talking Avatar] Update avatar live chat sample, to upgrade AOAI API version from 2023-03-15-preview to 2023-12-01-preview

* [Talking Avatar][Live Chat] Update AOAI API to the long-term support version 2023-06-01-preview

* [Talking Avatar] Add real time avatar sample code for server/client hybrid web app, with server code written in python

* Some refinements

* Add README.md

* Fix repo check failure for files that are neither marked as binary nor text, by extending .gitattributes

* [Python][TTS Avatar] Add chat sample

* [Python][TTS Avatar] Add chat sample - continue

* Support management of multiple clients

* Update README.md

* [Python][TTS Avatar] Support customized ICE server

* [Talking Avatar][Python] Support stop speaking

* Tolerate the Speech SDK not supporting sending messages via the connection

* [Python][TTS Avatar] Send local SDP as post body instead of header, to avoid header size over limit

* [python][avatar] update requirements.txt to add the missing dependencies

* [python][avatar] update real-time sample to make auto-reconnection smoother

* [Python][Avatar] Fix some small bugs

* [python][avatar] Support AAD authorization on private endpoint

* [Java][Android][Avatar] Add Android sample code for real time avatar

* Code refinement

* More refinement

* More refinement

* Update README.md

* [Java][Android][Avatar] Remove AddStream method, which is not available with Unified Plan SDP semantics, and use AddTrack per suggestion

* [Python][Avatar][Live] Get speaking status from WebRTC event, and remove the checkSpeakingStatus API from backend code

* [Java][Android][Live Avatar] Update the sample to demonstrate switching the audio output device to the loudspeaker

* [Python][Avatar][Live] Switch from REST API to SDK for calling AOAI

* [Python][Avatar][Live] Trigger barge-in at the first recognizing event, which is earlier

* [Python][Avatar][Live] Enable continuous conversation by default

* [Python][Avatar][Live] Disable multi-lingual by default for better latency

* [Python][Avatar][Live] Configure shorter segmentation silence timeout for quicker SR

* [Live Avatar][Python, CSharp] Add logging for latency

* [TTS Avatar][Live][Python, CSharp, JS] Fix a bug to correctly clean up audio player

* [TTS Avatar][Live][JavaScript] Output display text at a slower rate, to follow the avatar's speaking progress

* Make the display text / speech alignment toggleable on/off

* [TTS Avatar][Live][CSharp] Output display text at a slower rate, to follow the avatar's speaking progress

* Create an auto-deploy file

* Unlink the containerApp yinhew-avatar-app from this repo

* Delete unnecessary file

* [talking avatar][python] Update real time sample to add an option to connect to the server through WebSocket, and do STT on the server side

---------

Co-authored-by: Yulin Li <[email protected]>
yinhew and Yulin Li authored Oct 11, 2024
1 parent 0e8f2b8 commit cbe01b6
Showing 8 changed files with 402 additions and 26 deletions.
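The headline change gives the web client an option to stream its microphone audio to the server over a WebSocket and run speech-to-text (STT) on the server side instead of in the browser. The Python server files are not shown in the excerpt below, so as rough orientation only, here is a minimal sketch of that pattern. It assumes Flask with flask-sock and the azure-cognitiveservices-speech package; the /ws/audio route and the raw-PCM framing are hypothetical, not the sample's actual protocol.

```python
# Minimal sketch: receive client audio over a WebSocket and run STT server-side.
# Assumes Flask + flask-sock and azure-cognitiveservices-speech; the route name
# and framing are hypothetical, not the sample's exact protocol.
import os

import azure.cognitiveservices.speech as speechsdk
from flask import Flask
from flask_sock import Sock

app = Flask(__name__)
sock = Sock(app)

@sock.route('/ws/audio')
def audio_socket(ws):
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ['SPEECH_KEY'], region=os.environ['SPEECH_REGION'])
    # A push stream lets us feed audio bytes as they arrive over the WebSocket.
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)

    # Send each final recognition result back to the client on the same socket.
    recognizer.recognized.connect(lambda evt: ws.send(evt.result.text))
    recognizer.start_continuous_recognition()
    try:
        while True:
            chunk = ws.receive()      # blocks until the next frame arrives
            if isinstance(chunk, str):
                continue              # ignore text frames in this sketch
            push_stream.write(chunk)
    except Exception:
        pass                          # client closed the WebSocket
    finally:
        push_stream.close()
        recognizer.stop_continuous_recognition()
```

The client side would open a WebSocket to this route and send 16 kHz, 16-bit mono PCM chunks, which is the push stream's default input format.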
51 changes: 39 additions & 12 deletions samples/csharp/web/avatar/Controllers/AvatarController.cs
@@ -153,9 +153,17 @@ public async Task<IActionResult> ConnectAvatar()
speechConfig.EndpointId = customVoiceEndpointId;
}

var speechSynthesizer = new SpeechSynthesizer(speechConfig);
var speechSynthesizer = new SpeechSynthesizer(speechConfig, null);
clientContext.SpeechSynthesizer = speechSynthesizer;

if (ClientSettings.EnableAudioAudit)
{
speechSynthesizer.Synthesizing += (o, e) =>
{
Console.WriteLine($"Audio chunk received: {e.Result.AudioData.Length} bytes.");
};
}

if (string.IsNullOrEmpty(GlobalVariables.IceToken))
{
return BadRequest("IceToken is missing or invalid.");
@@ -168,7 +176,7 @@ public async Task<IActionResult> ConnectAvatar()
{
iceTokenObj = new Dictionary<string, object>
{
{ "Urls", string.IsNullOrEmpty(_clientSettings.IceServerUrlRemote) ? [_clientSettings.IceServerUrl] : new[] { _clientSettings.IceServerUrlRemote } },
{ "Urls", string.IsNullOrEmpty(_clientSettings.IceServerUrlRemote) ? new JArray(_clientSettings.IceServerUrl) : new JArray(_clientSettings.IceServerUrlRemote) },
{ "Username", _clientSettings.IceServerUsername },
{ "Password", _clientSettings.IceServerPassword }
};
@@ -189,7 +197,7 @@ public async Task<IActionResult> ConnectAvatar()
var videoCrop = Request.Headers["VideoCrop"].FirstOrDefault() ?? "false";

// Configure avatar settings
var urlsArray = iceTokenObj?.TryGetValue("Urls", out var value) == true ? value as string[] : null;
var urlsArray = iceTokenObj?.TryGetValue("Urls", out var value) == true ? value as JArray : null;

var firstUrl = urlsArray?.FirstOrDefault()?.ToString();

@@ -213,7 +221,8 @@ public async Task<IActionResult> ConnectAvatar()
username = iceTokenObj!["Username"],
credential = iceTokenObj["Password"]
}
}
},
auditAudio = ClientSettings.EnableAudioAudit
}
},
format = new
@@ -255,7 +264,7 @@ public async Task<IActionResult> ConnectAvatar()
connection.SetMessageProperty("speech.config", "context", JsonConvert.SerializeObject(avatarConfig));

var speechSynthesisResult = speechSynthesizer.SpeakTextAsync("").Result;
Console.WriteLine($"Result ID: {speechSynthesisResult.ResultId}");
Console.WriteLine($"Result ID: {speechSynthesisResult.ResultId}");
if (speechSynthesisResult.Reason == ResultReason.Canceled)
{
var cancellationDetails = SpeechSynthesisCancellationDetails.FromResult(speechSynthesisResult);
@@ -456,7 +465,7 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
// We return some quick reply here before the chat API returns to mitigate.
if (ClientSettings.EnableQuickReply)
{
await SpeakWithQueue(ClientSettings.QuickReplies[new Random().Next(ClientSettings.QuickReplies.Count)], 2000, clientId);
await SpeakWithQueue(ClientSettings.QuickReplies[new Random().Next(ClientSettings.QuickReplies.Count)], 2000, clientId, httpResponse);
}

// Process the responseContent as needed
@@ -507,9 +516,13 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
responseToken = ClientSettings.OydDocRegex.Replace(responseToken, string.Empty);
}

await httpResponse.WriteAsync(responseToken).ConfigureAwait(false);
if (!ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
await httpResponse.WriteAsync(responseToken).ConfigureAwait(false);
}

assistantReply.Append(responseToken);
spokenSentence.Append(responseToken); // build up the spoken sentence
if (responseToken == "\n" || responseToken == "\n\n")
{
if (isFirstSentence)
@@ -520,13 +533,12 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
isFirstSentence = false;
}

await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
spokenSentence.Clear();
}
else
{
responseToken = responseToken.Replace("\n", string.Empty);
spokenSentence.Append(responseToken); // build up the spoken sentence
if (responseToken.Length == 1 || responseToken.Length == 2)
{
foreach (var punctuation in ClientSettings.SentenceLevelPunctuations)
@@ -541,7 +553,7 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
isFirstSentence = false;
}

await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
spokenSentence.Clear();
break;
}
@@ -553,11 +565,21 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse

if (spokenSentence.Length > 0)
{
await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
}

var assistantMessage = new AssistantChatMessage(assistantReply.ToString());
messages.Add(assistantMessage);

if (ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
while (clientContext.SpokenTextQueue.Count > 0)
{
await Task.Delay(200);
}

await Task.Delay(200);
}
}

public void InitializeChatContext(string systemPrompt, Guid clientId)
@@ -572,7 +594,7 @@ public void InitializeChatContext(string systemPrompt, Guid clientId)
}

// Speak the given text. If there is already a speaking in progress, add the text to the queue. For chat scenario.
public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId)
public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId, HttpResponse httpResponse)
{
var clientContext = _clientService.GetClientContext(clientId);

Expand All @@ -595,6 +617,11 @@ public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId)
while (spokenTextQueue.Count > 0)
{
var currentText = spokenTextQueue.Dequeue();
if (ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
httpResponse.WriteAsync(currentText);
}

await SpeakText(currentText, ttsVoice!, personalVoiceSpeakerProfileId!, endingSilenceMs, clientId);
clientContext.LastSpeakTime = DateTime.UtcNow;
}
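The AvatarController.cs changes above thread the HTTP response into SpeakWithQueue: when ClientSettings.EnableDisplayTextAlignmentWithSpeech is on, each sentence is written to the response only when it is dequeued for synthesis, so the displayed text keeps pace with the avatar's speech, and HandleUserQuery waits for the queue to drain before completing. A minimal Python sketch of the same queue pattern, with emit_display and speak_text as illustrative stubs rather than the sample's API:

```python
# Sketch (not the sample's code) of the speak queue with display-text
# alignment: each sentence becomes visible only when it is dequeued for TTS.
import queue
import threading

ENABLE_DISPLAY_TEXT_ALIGNMENT_WITH_SPEECH = True

class ClientContext:
    def __init__(self):
        self.spoken_text_queue = queue.Queue()
        self.speaking = False
        self.lock = threading.Lock()

def speak_with_queue(ctx, text, emit_display, speak_text):
    """Queue `text`; start a drain worker if none is running.
    emit_display(text) stands in for writing to the HTTP response,
    speak_text(text) for the blocking TTS call."""
    with ctx.lock:
        ctx.spoken_text_queue.put(text)
        if ctx.speaking:
            return            # an existing worker will pick it up
        ctx.speaking = True

    def worker():
        while True:
            try:
                sentence = ctx.spoken_text_queue.get_nowait()
            except queue.Empty:
                with ctx.lock:    # re-check under the lock before exiting
                    if ctx.spoken_text_queue.empty():
                        ctx.speaking = False
                        return
                continue
            if ENABLE_DISPLAY_TEXT_ALIGNMENT_WITH_SPEECH:
                emit_display(sentence)   # show text only as it starts speaking
            speak_text(sentence)         # blocks until synthesis completes

    threading.Thread(target=worker, daemon=True).start()
```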
4 changes: 4 additions & 0 deletions samples/csharp/web/avatar/Models/ClientSettings.cs
@@ -19,6 +19,10 @@ public class ClientSettings

public static readonly bool EnableQuickReply = false;

public static readonly bool EnableDisplayTextAlignmentWithSpeech = false;

public static readonly bool EnableAudioAudit = false;

public string? SpeechRegion { get; set; }

public string? SpeechKey { get; set; }
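The new EnableAudioAudit flag pairs with the controller change that creates the synthesizer with a null AudioConfig and subscribes to its Synthesizing event to log the size of each audio chunk. The Python Speech SDK exposes the same event; a minimal sketch, assuming the azure-cognitiveservices-speech package and SPEECH_KEY/SPEECH_REGION environment variables:

```python
# Sketch of the audio-audit hook, mirroring the C# EnableAudioAudit change.
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ['SPEECH_KEY'], region=os.environ['SPEECH_REGION'])
# audio_config=None keeps audio off the local speaker, as in the C# change
# that passes null to the SpeechSynthesizer constructor.
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=None)

# Log each audio chunk as it is produced.
synthesizer.synthesizing.connect(
    lambda evt: print(f"Audio chunk received: {len(evt.result.audio_data)} bytes."))

result = synthesizer.speak_text_async("Hello from the avatar sample.").get()
print(f"Result ID: {result.result_id}")
```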
2 changes: 1 addition & 1 deletion samples/csharp/web/avatar/Views/Home/chat.cshtml
@@ -36,7 +36,7 @@
<label style="font-size: medium;" for="sttLocale">STT Locale(s):</label>
<input id="sttLocales" type="text" size="64" style="font-size: medium;" value="en-US"></input><br />
<label style="font-size: medium;" for="ttsVoice">TTS Voice:</label>
<input id="ttsVoice" type="text" size="32" style="font-size: medium;" value="en-US-JennyNeural"></input><br />
<input id="ttsVoice" type="text" size="32" style="font-size: medium;" value="en-US-AvaNeural"></input><br />
<label style="font-size: medium;" for="customVoiceEndpointId">Custom Voice Deployment ID (Endpoint ID):</label>
<input id="customVoiceEndpointId" type="text" size="32" style="font-size: medium;" value=""></input><br />
<label style="font-size: medium;" for="personalVoiceSpeakerProfileID">Personal Voice Speaker Profile ID:</label>
24 changes: 17 additions & 7 deletions samples/js/browser/avatar/js/chat.js
@@ -9,6 +9,7 @@ var messages = []
var messageInitiated = false
var dataSources = []
var sentenceLevelPunctuations = [ '.', '?', '!', ':', ';', '。', '?', '!', ':', ';' ]
var enableDisplayTextAlignmentWithSpeech = true
var enableQuickReply = false
var quickReplies = [ 'Let me take a look.', 'Let me check.', 'One moment, please.' ]
var byodDocRegex = new RegExp(/\[doc(\d+)\]/g)
@@ -322,6 +323,12 @@ function speakNext(text, endingSilenceMs = 0) {
ssml = `<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'><voice name='${ttsVoice}'><mstts:ttsembedding speakerProfileId='${personalVoiceSpeakerProfileID}'><mstts:leadingsilence-exact value='0'/>${htmlEncode(text)}<break time='${endingSilenceMs}ms' /></mstts:ttsembedding></voice></speak>`
}

if (enableDisplayTextAlignmentWithSpeech) {
let chatHistoryTextArea = document.getElementById('chatHistory')
chatHistoryTextArea.innerHTML += text.replace(/\n/g, '<br/>')
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
}

lastSpeakTime = new Date()
isSpeaking = true
document.getElementById('stopSpeaking').disabled = false
@@ -506,17 +513,18 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
// console.log(`Current token: ${responseToken}`)

if (responseToken === '\n' || responseToken === '\n\n') {
speak(spokenSentence.trim())
spokenSentence += responseToken
speak(spokenSentence)
spokenSentence = ''
} else {
responseToken = responseToken.replace(/\n/g, '')
spokenSentence += responseToken // build up the spoken sentence

responseToken = responseToken.replace(/\n/g, '')
if (responseToken.length === 1 || responseToken.length === 2) {
for (let i = 0; i < sentenceLevelPunctuations.length; ++i) {
let sentenceLevelPunctuation = sentenceLevelPunctuations[i]
if (responseToken.startsWith(sentenceLevelPunctuation)) {
speak(spokenSentence.trim())
speak(spokenSentence)
spokenSentence = ''
break
}
@@ -531,9 +539,11 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
}
})

chatHistoryTextArea.innerHTML += `${displaySentence}`
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
displaySentence = ''
if (!enableDisplayTextAlignmentWithSpeech) {
chatHistoryTextArea.innerHTML += displaySentence.replace(/\n/g, '<br/>')
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
displaySentence = ''
}

// Continue reading the next chunk
return read()
@@ -545,7 +555,7 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
})
.then(() => {
if (spokenSentence !== '') {
speak(spokenSentence.trim())
speak(spokenSentence)
spokenSentence = ''
}

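The C# and JavaScript hunks above make the same correction to the streaming tokenizer: each token is appended to the sentence buffer before the newline/punctuation check, and the .Trim()/.trim() calls are removed, so the text handed to the speak queue matches the displayed text exactly. A compact Python sketch of that splitting logic; the token iterator stands in for the streamed chat response:

```python
# Sketch of the sentence-splitting logic for streamed chat responses: tokens
# are buffered into a sentence and flushed on newlines or sentence-level
# punctuation; sentences are not trimmed, so spoken and displayed text match.
SENTENCE_LEVEL_PUNCTUATIONS = ['.', '?', '!', ':', ';', '。', '?', '!', ':', ';']

def split_into_spoken_sentences(tokens):
    """Yield sentences to speak from a stream of response tokens."""
    spoken_sentence = ''
    for token in tokens:
        if token in ('\n', '\n\n'):
            spoken_sentence += token          # keep the newline in the buffer
            yield spoken_sentence
            spoken_sentence = ''
        else:
            spoken_sentence += token          # build up the spoken sentence
            stripped = token.replace('\n', '')
            if 1 <= len(stripped) <= 2 and any(
                    stripped.startswith(p) for p in SENTENCE_LEVEL_PUNCTUATIONS):
                yield spoken_sentence
                spoken_sentence = ''
    if spoken_sentence:
        yield spoken_sentence                 # flush any trailing partial text

# Usage sketch:
#   for sentence in split_into_spoken_sentences(stream):
#       speak(sentence)
```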
