[TTS Avatar][Live][Python] Update sample code to add option for client to communicate with server through WebSocket, and do STT on server side (#2597)

* [TalkingAvatar] Add sample code for TTS talking avatar real-time API

* sample codes for batch avatar synthesis

* Address repository check failure

* update

* [Avatar] Update real time avatar sample code to support multi-lingual

* [avatar] update real time avatar chat sample to receive GPT response in streaming mode

* [Live Avatar] update chat sample to make some refinements

* [TTS Avatar] Update real-time sample to support (1) non-continuous recognition mode, (2) a button to stop speaking, (3) typing a query without speech

* [TTS Avatar] Update real time avatar sample to support auto-reconnect

* Don't reset message history when re-connecting

* [talking avatar] update real time sample to support using a cached local video for idle status, to help save customer cost

* Update chat.html and README.md

* Update batch avatar sample to use mp4 as the default format, to avoid the slow playback speed shown by default with VP9

* A minor refinement

* Some refinement

* Some bug fixes

* Refine the response receiving logic for AOAI streaming mode, to make it more robust

* [Talking Avatar] update real-time sample code to log result id (turn id) for ease of debugging

* [Talking Avatar] Update avatar live chat sample, to upgrade AOAI API version from 2023-03-15-preview to 2023-12-01-preview

* [Talking Avatar][Live Chat] Update AOAI API to the long-term support version 2023-06-01-preview

* [Talking Avatar] Add real time avatar sample code for server/client hybrid web app, with server code written in python

* Some refinements

* Add README.md

* Fix repo check failure for files that are neither marked as binary nor text, by extending .gitattributes

* [Python][TTS Avatar] Add chat sample

* [Python][TTS Avatar] Add chat sample - continue

* Support management of multiple clients

* Update README.md

* [Python][TTS Avatar] Support customized ICE server

* [Talking Avatar][Python] Support stop speaking

* Tolerate the Speech SDK not supporting sending messages via the connection

* [Python][TTS Avatar] Send local SDP as post body instead of header, to avoid header size over limit

* [python][avatar] update requirements.txt to add the missing dependencies

* [python][avatar] update real-time sample to make auto-reconnection smoother

* [Python][Avatar] Fix some small bugs

* [python][avatar] Support AAD authorization on private endpoint

* [Java][Android][Avatar] Add Android sample code for real time avatar

* Code refinement

* More refinement

* More refinement

* Update README.md

* [Java][Android][Avatar] Remove AddStream method, which is not available with Unified Plan SDP semantics, and use AddTrack per suggestion

* [Python][Avatar][Live] Get speaking status from WebRTC event, and remove the checkSpeakingStatus API from backend code

* [Java][Android][Live Avatar] Update the sample to demonstrate switching the audio output device to the loudspeaker

* [Python][Avatar][Live] Switch from REST API to SDK for calling AOAI

* [Python][Avatar][Live] Trigger barge-in at the first recognizing event, which is earlier

* [Python][Avatar][Live] Enable continuous conversation by default

* [Python][Avatar][Live] Disable multi-lingual by default for better latency

* [Python][Avatar][Live] Configure shorter segmentation silence timeout for quicker SR

* [Live Avatar][Python, CSharp] Add logging for latency

* [TTS Avatar][Live][Python, CSharp, JS] Fix a bug to correctly clean up audio player

* [TTS Avatar][Live][JavaScript] Output display text at a slower rate, to follow the avatar's speaking progress

* Make the display text / speech alignment toggleable on/off

* [TTS Avatar][Live][CSharp] Output display text at a slower rate, to follow the avatar's speaking progress

* Create an auto-deploy file

* Unlink the containerApp yinhew-avatar-app from this repo

* Delete unnecessary file

* [talking avatar][python] Update real time sample to add an option to connect to the server through WebSocket, and do STT on the server side

---------

Co-authored-by: Yulin Li <[email protected]>
yinhew and Yulin Li authored Oct 11, 2024
1 parent 0e8f2b8 commit cbe01b6
Showing 8 changed files with 402 additions and 26 deletions.
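The headline change gives the web client an option to stream its microphone audio to the server over a WebSocket and run speech-to-text (STT) on the server side instead of in the browser. The Python server files are not shown in the excerpt below, so as rough orientation only, here is a minimal sketch of that pattern. It assumes Flask with flask-sock and the azure-cognitiveservices-speech package; the /ws/audio route and the raw-PCM framing are hypothetical, not the sample's actual protocol.

```python
# Minimal sketch: receive client audio over a WebSocket and run STT server-side.
# Assumes Flask + flask-sock and azure-cognitiveservices-speech; the route name
# and framing are hypothetical, not the sample's exact protocol.
import os

import azure.cognitiveservices.speech as speechsdk
from flask import Flask
from flask_sock import Sock

app = Flask(__name__)
sock = Sock(app)

@sock.route('/ws/audio')
def audio_socket(ws):
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ['SPEECH_KEY'], region=os.environ['SPEECH_REGION'])
    # A push stream lets us feed audio bytes as they arrive over the WebSocket.
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)

    # Send each final recognition result back to the client on the same socket.
    recognizer.recognized.connect(lambda evt: ws.send(evt.result.text))
    recognizer.start_continuous_recognition()
    try:
        while True:
            chunk = ws.receive()      # blocks until the next frame arrives
            if isinstance(chunk, str):
                continue              # ignore text frames in this sketch
            push_stream.write(chunk)
    except Exception:
        pass                          # client closed the WebSocket
    finally:
        push_stream.close()
        recognizer.stop_continuous_recognition()
```

The client side would open a WebSocket to this route and send 16 kHz, 16-bit mono PCM chunks, which is the push stream's default input format.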
51 changes: 39 additions & 12 deletions samples/csharp/web/avatar/Controllers/AvatarController.cs
@@ -153,9 +153,17 @@ public async Task<IActionResult> ConnectAvatar()
speechConfig.EndpointId = customVoiceEndpointId;
}

var speechSynthesizer = new SpeechSynthesizer(speechConfig);
var speechSynthesizer = new SpeechSynthesizer(speechConfig, null);
clientContext.SpeechSynthesizer = speechSynthesizer;

if (ClientSettings.EnableAudioAudit)
{
speechSynthesizer.Synthesizing += (o, e) =>
{
Console.WriteLine($"Audio chunk received: {e.Result.AudioData.Length} bytes.");
};
}

if (string.IsNullOrEmpty(GlobalVariables.IceToken))
{
return BadRequest("IceToken is missing or invalid.");
@@ -168,7 +176,7 @@ public async Task<IActionResult> ConnectAvatar()
{
iceTokenObj = new Dictionary<string, object>
{
{ "Urls", string.IsNullOrEmpty(_clientSettings.IceServerUrlRemote) ? [_clientSettings.IceServerUrl] : new[] { _clientSettings.IceServerUrlRemote } },
{ "Urls", string.IsNullOrEmpty(_clientSettings.IceServerUrlRemote) ? new JArray(_clientSettings.IceServerUrl) : new JArray(_clientSettings.IceServerUrlRemote) },
{ "Username", _clientSettings.IceServerUsername },
{ "Password", _clientSettings.IceServerPassword }
};
@@ -189,7 +197,7 @@ public async Task<IActionResult> ConnectAvatar()
var videoCrop = Request.Headers["VideoCrop"].FirstOrDefault() ?? "false";

// Configure avatar settings
var urlsArray = iceTokenObj?.TryGetValue("Urls", out var value) == true ? value as string[] : null;
var urlsArray = iceTokenObj?.TryGetValue("Urls", out var value) == true ? value as JArray : null;

var firstUrl = urlsArray?.FirstOrDefault()?.ToString();

@@ -213,7 +221,8 @@ public async Task<IActionResult> ConnectAvatar()
username = iceTokenObj!["Username"],
credential = iceTokenObj["Password"]
}
}
},
auditAudio = ClientSettings.EnableAudioAudit
}
},
format = new
@@ -255,7 +264,7 @@ public async Task<IActionResult> ConnectAvatar()
connection.SetMessageProperty("speech.config", "context", JsonConvert.SerializeObject(avatarConfig));

var speechSynthesisResult = speechSynthesizer.SpeakTextAsync("").Result;
Console.WriteLine($"Result ID: {speechSynthesisResult.ResultId}");
Console.WriteLine($"Result ID: {speechSynthesisResult.ResultId}");
if (speechSynthesisResult.Reason == ResultReason.Canceled)
{
var cancellationDetails = SpeechSynthesisCancellationDetails.FromResult(speechSynthesisResult);
@@ -456,7 +465,7 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
// We return some quick reply here before the chat API returns to mitigate.
if (ClientSettings.EnableQuickReply)
{
await SpeakWithQueue(ClientSettings.QuickReplies[new Random().Next(ClientSettings.QuickReplies.Count)], 2000, clientId);
await SpeakWithQueue(ClientSettings.QuickReplies[new Random().Next(ClientSettings.QuickReplies.Count)], 2000, clientId, httpResponse);
}

// Process the responseContent as needed
@@ -507,9 +516,13 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
responseToken = ClientSettings.OydDocRegex.Replace(responseToken, string.Empty);
}

await httpResponse.WriteAsync(responseToken).ConfigureAwait(false);
if (!ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
await httpResponse.WriteAsync(responseToken).ConfigureAwait(false);
}

assistantReply.Append(responseToken);
spokenSentence.Append(responseToken); // build up the spoken sentence
if (responseToken == "\n" || responseToken == "\n\n")
{
if (isFirstSentence)
@@ -520,13 +533,12 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
isFirstSentence = false;
}

await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
spokenSentence.Clear();
}
else
{
responseToken = responseToken.Replace("\n", string.Empty);
spokenSentence.Append(responseToken); // build up the spoken sentence
if (responseToken.Length == 1 || responseToken.Length == 2)
{
foreach (var punctuation in ClientSettings.SentenceLevelPunctuations)
@@ -541,7 +553,7 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse
isFirstSentence = false;
}

await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
spokenSentence.Clear();
break;
}
@@ -553,11 +565,21 @@ public async Task HandleUserQuery(string userQuery, Guid clientId, HttpResponse

if (spokenSentence.Length > 0)
{
await SpeakWithQueue(spokenSentence.ToString().Trim(), 0, clientId);
await SpeakWithQueue(spokenSentence.ToString(), 0, clientId, httpResponse);
}

var assistantMessage = new AssistantChatMessage(assistantReply.ToString());
messages.Add(assistantMessage);

if (ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
while (clientContext.SpokenTextQueue.Count > 0)
{
await Task.Delay(200);
}

await Task.Delay(200);
}
}

public void InitializeChatContext(string systemPrompt, Guid clientId)
@@ -572,7 +594,7 @@ public void InitializeChatContext(string systemPrompt, Guid clientId)
}

// Speak the given text. If there is already a speaking in progress, add the text to the queue. For chat scenario.
public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId)
public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId, HttpResponse httpResponse)
{
var clientContext = _clientService.GetClientContext(clientId);

Expand All @@ -595,6 +617,11 @@ public Task SpeakWithQueue(string text, int endingSilenceMs, Guid clientId)
while (spokenTextQueue.Count > 0)
{
var currentText = spokenTextQueue.Dequeue();
if (ClientSettings.EnableDisplayTextAlignmentWithSpeech)
{
httpResponse.WriteAsync(currentText);
}

await SpeakText(currentText, ttsVoice!, personalVoiceSpeakerProfileId!, endingSilenceMs, clientId);
clientContext.LastSpeakTime = DateTime.UtcNow;
}
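The AvatarController.cs changes above thread the HTTP response into SpeakWithQueue: when ClientSettings.EnableDisplayTextAlignmentWithSpeech is on, each sentence is written to the response only when it is dequeued for synthesis, so the displayed text keeps pace with the avatar's speech, and HandleUserQuery waits for the queue to drain before completing. A minimal Python sketch of the same queue pattern, with emit_display and speak_text as illustrative stubs rather than the sample's API:

```python
# Sketch (not the sample's code) of the speak queue with display-text
# alignment: each sentence becomes visible only when it is dequeued for TTS.
import queue
import threading

ENABLE_DISPLAY_TEXT_ALIGNMENT_WITH_SPEECH = True

class ClientContext:
    def __init__(self):
        self.spoken_text_queue = queue.Queue()
        self.speaking = False
        self.lock = threading.Lock()

def speak_with_queue(ctx, text, emit_display, speak_text):
    """Queue `text`; start a drain worker if none is running.
    emit_display(text) stands in for writing to the HTTP response,
    speak_text(text) for the blocking TTS call."""
    with ctx.lock:
        ctx.spoken_text_queue.put(text)
        if ctx.speaking:
            return            # an existing worker will pick it up
        ctx.speaking = True

    def worker():
        while True:
            try:
                sentence = ctx.spoken_text_queue.get_nowait()
            except queue.Empty:
                with ctx.lock:    # re-check under the lock before exiting
                    if ctx.spoken_text_queue.empty():
                        ctx.speaking = False
                        return
                continue
            if ENABLE_DISPLAY_TEXT_ALIGNMENT_WITH_SPEECH:
                emit_display(sentence)   # show text only as it starts speaking
            speak_text(sentence)         # blocks until synthesis completes

    threading.Thread(target=worker, daemon=True).start()
```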
4 changes: 4 additions & 0 deletions samples/csharp/web/avatar/Models/ClientSettings.cs
@@ -19,6 +19,10 @@ public class ClientSettings

public static readonly bool EnableQuickReply = false;

public static readonly bool EnableDisplayTextAlignmentWithSpeech = false;

public static readonly bool EnableAudioAudit = false;

public string? SpeechRegion { get; set; }

public string? SpeechKey { get; set; }
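The new EnableAudioAudit flag pairs with the controller change that creates the synthesizer with a null AudioConfig and subscribes to its Synthesizing event to log the size of each audio chunk. The Python Speech SDK exposes the same event; a minimal sketch, assuming the azure-cognitiveservices-speech package and SPEECH_KEY/SPEECH_REGION environment variables:

```python
# Sketch of the audio-audit hook, mirroring the C# EnableAudioAudit change.
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ['SPEECH_KEY'], region=os.environ['SPEECH_REGION'])
# audio_config=None keeps audio off the local speaker, as in the C# change
# that passes null to the SpeechSynthesizer constructor.
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=None)

# Log each audio chunk as it is produced.
synthesizer.synthesizing.connect(
    lambda evt: print(f"Audio chunk received: {len(evt.result.audio_data)} bytes."))

result = synthesizer.speak_text_async("Hello from the avatar sample.").get()
print(f"Result ID: {result.result_id}")
```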
2 changes: 1 addition & 1 deletion samples/csharp/web/avatar/Views/Home/chat.cshtml
@@ -36,7 +36,7 @@
<label style="font-size: medium;" for="sttLocale">STT Locale(s):</label>
<input id="sttLocales" type="text" size="64" style="font-size: medium;" value="en-US"></input><br />
<label style="font-size: medium;" for="ttsVoice">TTS Voice:</label>
<input id="ttsVoice" type="text" size="32" style="font-size: medium;" value="en-US-JennyNeural"></input><br />
<input id="ttsVoice" type="text" size="32" style="font-size: medium;" value="en-US-AvaNeural"></input><br />
<label style="font-size: medium;" for="customVoiceEndpointId">Custom Voice Deployment ID (Endpoint ID):</label>
<input id="customVoiceEndpointId" type="text" size="32" style="font-size: medium;" value=""></input><br />
<label style="font-size: medium;" for="personalVoiceSpeakerProfileID">Personal Voice Speaker Profile ID:</label>
24 changes: 17 additions & 7 deletions samples/js/browser/avatar/js/chat.js
@@ -9,6 +9,7 @@ var messages = []
var messageInitiated = false
var dataSources = []
var sentenceLevelPunctuations = [ '.', '?', '!', ':', ';', '。', '?', '!', ':', ';' ]
var enableDisplayTextAlignmentWithSpeech = true
var enableQuickReply = false
var quickReplies = [ 'Let me take a look.', 'Let me check.', 'One moment, please.' ]
var byodDocRegex = new RegExp(/\[doc(\d+)\]/g)
@@ -322,6 +323,12 @@ function speakNext(text, endingSilenceMs = 0) {
ssml = `<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'><voice name='${ttsVoice}'><mstts:ttsembedding speakerProfileId='${personalVoiceSpeakerProfileID}'><mstts:leadingsilence-exact value='0'/>${htmlEncode(text)}<break time='${endingSilenceMs}ms' /></mstts:ttsembedding></voice></speak>`
}

if (enableDisplayTextAlignmentWithSpeech) {
let chatHistoryTextArea = document.getElementById('chatHistory')
chatHistoryTextArea.innerHTML += text.replace(/\n/g, '<br/>')
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
}

lastSpeakTime = new Date()
isSpeaking = true
document.getElementById('stopSpeaking').disabled = false
@@ -506,17 +513,18 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
// console.log(`Current token: ${responseToken}`)

if (responseToken === '\n' || responseToken === '\n\n') {
speak(spokenSentence.trim())
spokenSentence += responseToken
speak(spokenSentence)
spokenSentence = ''
} else {
responseToken = responseToken.replace(/\n/g, '')
spokenSentence += responseToken // build up the spoken sentence

responseToken = responseToken.replace(/\n/g, '')
if (responseToken.length === 1 || responseToken.length === 2) {
for (let i = 0; i < sentenceLevelPunctuations.length; ++i) {
let sentenceLevelPunctuation = sentenceLevelPunctuations[i]
if (responseToken.startsWith(sentenceLevelPunctuation)) {
speak(spokenSentence.trim())
speak(spokenSentence)
spokenSentence = ''
break
}
@@ -531,9 +539,11 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
}
})

chatHistoryTextArea.innerHTML += `${displaySentence}`
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
displaySentence = ''
if (!enableDisplayTextAlignmentWithSpeech) {
chatHistoryTextArea.innerHTML += displaySentence.replace(/\n/g, '<br/>')
chatHistoryTextArea.scrollTop = chatHistoryTextArea.scrollHeight
displaySentence = ''
}

// Continue reading the next chunk
return read()
@@ -545,7 +555,7 @@ function handleUserQuery(userQuery, userQueryHTML, imgUrlPath) {
})
.then(() => {
if (spokenSentence !== '') {
speak(spokenSentence.trim())
speak(spokenSentence)
spokenSentence = ''
}

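The C# and JavaScript hunks above make the same correction to the streaming tokenizer: each token is appended to the sentence buffer before the newline/punctuation check, and the .Trim()/.trim() calls are removed, so the text handed to the speak queue matches the displayed text exactly. A compact Python sketch of that splitting logic; the token iterator stands in for the streamed chat response:

```python
# Sketch of the sentence-splitting logic for streamed chat responses: tokens
# are buffered into a sentence and flushed on newlines or sentence-level
# punctuation; sentences are not trimmed, so spoken and displayed text match.
SENTENCE_LEVEL_PUNCTUATIONS = ['.', '?', '!', ':', ';', '。', '?', '!', ':', ';']

def split_into_spoken_sentences(tokens):
    """Yield sentences to speak from a stream of response tokens."""
    spoken_sentence = ''
    for token in tokens:
        if token in ('\n', '\n\n'):
            spoken_sentence += token          # keep the newline in the buffer
            yield spoken_sentence
            spoken_sentence = ''
        else:
            spoken_sentence += token          # build up the spoken sentence
            stripped = token.replace('\n', '')
            if 1 <= len(stripped) <= 2 and any(
                    stripped.startswith(p) for p in SENTENCE_LEVEL_PUNCTUATIONS):
                yield spoken_sentence
                spoken_sentence = ''
    if spoken_sentence:
        yield spoken_sentence                 # flush any trailing partial text

# Usage sketch:
#   for sentence in split_into_spoken_sentences(stream):
#       speak(sentence)
```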
