Azure AI Speech Service Implementation Guide

🧩 Problem

You need to implement an application that can recognize spoken commands and respond with synthesized speech output.

💡 Solution with Azure

Use the Azure AI Speech service:

  • Speech-to-Text API for recognizing speech
  • Text-to-Speech API for generating speech output

๐Ÿ› ๏ธ Components required

  • Azure AI Speech resource (created via the Azure Portal)
  • Visual Studio Code
  • Language SDK:
    • Microsoft.CognitiveServices.Speech NuGet package (C#)
    • azure-cognitiveservices-speech pip package (Python)
  • (Optional) Audio input/output libraries:
    • System.Media.SoundPlayer (C#)
    • playsound (Python)
  • Microphone (or an audio file as an alternative)

GitHub repository: https://github.com/MicrosoftLearning/mslearn-ai-language

๐Ÿ—๏ธ Architecture / Development

1. 🔧 Provision Azure AI Speech resource

Azure Portal → Azure AI Services → Create Speech Service

Required settings: Subscription, Resource group, Region, Name, Pricing tier

After deployment: get the key and region from the resource's Keys and Endpoint page

2. 🧑‍💻 Set up development environment

  • Clone repo: mslearn-ai-language
  • Open in VS Code → trust the project if prompted
  • Use Labfiles/07-speech → choose CSharp or Python → speaking-clock folder

3. 📦 Install SDK

C#

dotnet add package Microsoft.CognitiveServices.Speech --version 1.30.0

Python

pip install azure-cognitiveservices-speech==1.30.0

4. โš™๏ธ Configure app

Edit the configuration file:

  • C#: appsettings.json
  • Python: .env

Add your resource key and region (a sketch of what this might look like for Python follows).
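
A minimal sketch of the Python side, assuming the key and region live in a .env file with variables named SPEECH_KEY and SPEECH_REGION and that python-dotenv is installed (the variable names and the use of python-dotenv are assumptions, not necessarily the lab's exact setup):

Python:

# Hedged sketch: load the key and region from a .env file
# Assumed .env contents:
#   SPEECH_KEY=<your-resource-key>
#   SPEECH_REGION=<your-resource-region>
import os
from dotenv import load_dotenv

load_dotenv()
ai_key = os.getenv('SPEECH_KEY')
ai_region = os.getenv('SPEECH_REGION')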

5. 🧠 Initialize SDK

C#:

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
SpeechConfig speechConfig = SpeechConfig.FromSubscription(aiSvcKey, aiSvcRegion);
speechConfig.SpeechSynthesisVoiceName = "en-US-AriaNeural";

Python:

import azure.cognitiveservices.speech as speech_sdk
speech_config = speech_sdk.SpeechConfig(ai_key, ai_region)
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"

6. 🎤 Recognize speech input

Microphone

C#:

using AudioConfig audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using SpeechRecognizer speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);

Python:

audio_config = speech_sdk.AudioConfig(use_default_microphone=True)
speech_recognizer = speech_sdk.SpeechRecognizer(speech_config, audio_config)

Or from an audio file

Install a playback library:

C#:

dotnet add package System.Windows.Extensions --version 4.6.0

Python:

pip install playsound==1.2.2

Add code:

C#:

// Requires 'using System.Media;' at the top of the file
string audioFile = "time.wav";
SoundPlayer wavPlayer = new SoundPlayer(audioFile);
wavPlayer.Play();
using AudioConfig audioConfig = AudioConfig.FromWavFileInput(audioFile);

Python:

import os
from playsound import playsound

audioFile = os.getcwd() + '\\time.wav'  # Windows-style path
playsound(audioFile)
audio_config = speech_sdk.AudioConfig(filename=audioFile)

7. 📜 Transcribe speech

C#:

SpeechRecognitionResult speech = await speechRecognizer.RecognizeOnceAsync();
if (speech.Reason == ResultReason.RecognizedSpeech) {
    command = speech.Text;
    Console.WriteLine(command);
}

Python:

speech = speech_recognizer.recognize_once_async().get()
if speech.reason == speech_sdk.ResultReason.RecognizedSpeech:
    command = speech.text
    print(command)
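
The snippets above handle only the RecognizedSpeech case. A minimal sketch of handling the other result reasons (NoMatch and Canceled), using the Speech SDK's standard enums and properties, shown here for Python:

Python:

# Hedged sketch: handle unsuccessful recognition results
if speech.reason == speech_sdk.ResultReason.NoMatch:
    # Audio was processed, but no recognizable speech was found
    print(speech.no_match_details)
elif speech.reason == speech_sdk.ResultReason.Canceled:
    # Recognition was canceled, for example due to an invalid key or region
    cancellation = speech.cancellation_details
    print(cancellation.reason)
    if cancellation.reason == speech_sdk.CancellationReason.Error:
        print(cancellation.error_details)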

8. 🗣️ Synthesize speech

C#:

using SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig);
SpeechSynthesisResult speak = await speechSynthesizer.SpeakTextAsync(responseText);

Python:

speech_synthesizer = speech_sdk.SpeechSynthesizer(speech_config)
speak = speech_synthesizer.speak_text_async(response_text).get()
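
It is worth verifying that synthesis succeeded before moving on; a minimal check in Python:

Python:

# Hedged sketch: confirm the synthesis result
if speak.reason != speech_sdk.ResultReason.SynthesizingAudioCompleted:
    print(speak.reason)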

9. 🔊 Use alternative voice

C#:

speechConfig.SpeechSynthesisVoiceName = "en-GB-LibbyNeural";

Python:

speech_config.speech_synthesis_voice_name = 'en-GB-LibbyNeural'

Set the voice name before creating the SpeechSynthesizer; the synthesizer picks up the voice from the configuration when it is constructed.

10. ๐Ÿ“ Use SSML

C#:

string responseSsml = $@"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-GB-LibbyNeural'>
    {responseText}
    <break strength='weak'/>
    Time to end this lab!
  </voice>
</speak>";
SpeechSynthesisResult speak = await speechSynthesizer.SpeakSsmlAsync(responseSsml);

Python:

responseSsml = " \
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'> \
  <voice name='en-GB-LibbyNeural'> \
    {} \
    <break strength='weak'/> \
    Time to end this lab! \
  </voice> \
</speak>".format(response_text)
speak = speech_synthesizer.speak_ssml_async(responseSsml).get()
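
Putting the steps together, a minimal end-to-end sketch of the speaking-clock flow in Python. The .env variable names follow the assumption from step 4, and the prompt and response text are illustrative rather than the lab's exact wording:

Python:

import os
from datetime import datetime
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speech_sdk

# Load the key and region (variable names are an assumption, see step 4)
load_dotenv()
speech_config = speech_sdk.SpeechConfig(os.getenv('SPEECH_KEY'), os.getenv('SPEECH_REGION'))
speech_config.speech_synthesis_voice_name = 'en-GB-LibbyNeural'

# Listen for a spoken command on the default microphone
audio_config = speech_sdk.AudioConfig(use_default_microphone=True)
recognizer = speech_sdk.SpeechRecognizer(speech_config, audio_config)
print('Ask me what time it is...')
result = recognizer.recognize_once_async().get()

if result.reason == speech_sdk.ResultReason.RecognizedSpeech and 'time' in result.text.lower():
    # Respond with the current time using synthesized speech
    now = datetime.now()
    response_text = 'The time is {}:{:02d}.'.format(now.hour, now.minute)
    synthesizer = speech_sdk.SpeechSynthesizer(speech_config)
    speak = synthesizer.speak_text_async(response_text).get()
    if speak.reason != speech_sdk.ResultReason.SynthesizingAudioCompleted:
        print(speak.reason)
else:
    print('No time request recognized:', result.reason)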

🧠 Best practice / Considerations

  • Ensure the correct region and key are set in the configuration
  • Use neural voices for more natural synthesis
  • Use SSML to control prosody, pauses, and emphasis
  • Provide a fallback for a missing microphone by using an audio file
  • Handle recognition result errors (NoMatch, Canceled)
  • Use await (C#) or .get() (Python) for asynchronous SDK calls

โ“ Exam-style questions

Q1. How do you configure speech recognition using the default microphone in Python?

  🅐 speech_sdk.SpeechRecognizer(audio_input="mic")
  🅑 speech_sdk.AudioConfig(use_default_microphone=True) ✅
  🅒 AudioConfig.FromWavFileInput("mic.wav")
  🅓 SpeechSynthesizer(speech_config, audio=True)

Q2. What must you do to synthesize speech with a specific (non-default) voice in Azure AI Speech?

  🅐 Change your subscription tier
  🅑 Use SSML
  🅒 Set SpeechSynthesisVoiceName in the config ✅
  🅓 Install a separate neural voice package

Q3. Where do you find the key and region for your Azure AI Speech resource?

  🅐 Azure AI Studio
  🅑 Speech Studio → Deployment tab
  🅒 Keys and Endpoint page in the Azure Portal ✅
  🅓 Billing page in Azure