Azure Multimodal Chat Application Guide

Problem

You need to build a chat application that accepts both text and audio inputs, and uses multimodal generative AI to understand and respond.

Solution with Azure

Use a multimodal model in Azure AI Foundry, such as:

  • Microsoft Phi-4-multimodal-instruct
  • OpenAI gpt-4o
  • OpenAI gpt-4o-mini

These models accept combined text and audio input and generate a response based on both.

Required Components

  • Azure AI Foundry (portal access)
  • A deployed multimodal model
  • Chat Playground (for testing)
  • Python or .NET SDK (for app development)
  • Correctly formatted multi-part messages (JSON structure)

Architecture / Development

Deploy a Multimodal Model

  1. Go to the Azure AI Foundry portal.
  2. Select a model such as gpt-4o or phi-4-multimodal-instruct.
  3. Deploy the model.
  4. Test it in the Chat Playground with audio + text prompts:
     • Upload an audio file.
     • Combine it with text to form a prompt.

Structure of an Audio-Based Prompt (JSON)

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Transcribe this audio:"
        },
        {
          "type": "audio_url",
          "audio_url": {
            "url": "https://..."
          }
        }
      ]
    }
  ]
}
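
The same structure can be assembled programmatically before it is sent. The following is a minimal sketch in Python, using only the standard library; `build_audio_prompt` is a hypothetical helper name and the example URL is a placeholder, not part of the original guide.

```python
import json


def build_audio_prompt(text: str, audio_url: str,
                       system_prompt: str = "You are a helpful assistant.") -> dict:
    """Return a chat payload whose user message has a text part and an audio part."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                # Multi-part content: one text item followed by one audio_url item.
                "content": [
                    {"type": "text", "text": text},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            },
        ]
    }


if __name__ == "__main__":
    # Placeholder URL, for illustration only.
    payload = build_audio_prompt("Transcribe this audio:",
                                 "https://example.com/clip.mp3")
    print(json.dumps(payload, indent=2))
```

Keeping the payload as a plain dictionary makes it easy to inspect or log before it is serialized to JSON.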

Alternatively, use base64-encoded binary audio data:

{
  "type": "audio_url",
  "audio_url": {
    "url": "data:audio/mp3;base64,<binary_audio_data>"
  }
}
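
Producing that data URL from local audio takes only a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the default `audio/mp3` media type simply mirrors the example above rather than a confirmed service requirement.

```python
import base64


def audio_bytes_to_data_url(audio: bytes, mime_type: str = "audio/mp3") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def audio_file_to_data_url(path: str, mime_type: str = "audio/mp3") -> str:
    """Read a local audio file and return it as a base64 data URL."""
    with open(path, "rb") as audio_file:
        return audio_bytes_to_data_url(audio_file.read(), mime_type)


def audio_part(data_url: str) -> dict:
    """Wrap a URL (hosted or data:) in the audio_url content item shown above."""
    return {"type": "audio_url", "audio_url": {"url": data_url}}
```

For example, `audio_part(audio_file_to_data_url("clip.mp3"))` would yield a content item that can sit next to a text item in the user message; `clip.mp3` is a placeholder file name.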

Develop Audio-Enabled Chat App

  • Use the Python or .NET SDK for:
      • Azure AI Model Inference
      • OpenAI API
  • Your client application should:
      • Connect to the model endpoint
      • Submit multi-part prompts (text + audio)
      • Receive and process the model's response
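
The three client responsibilities (connect, submit, process) can be sketched with the Python standard library alone. This is an illustration, not the SDK's API: `build_chat_request`, `extract_reply`, and `send_prompt` are hypothetical helpers, the `api-key` header and the `choices[0].message.content` response shape are assumptions modeled on OpenAI-style chat completions endpoints, and the endpoint URL and key must come from your own deployment.

```python
import json
import urllib.request


def build_chat_request(endpoint_url: str, api_key: str,
                       payload: dict) -> urllib.request.Request:
    """Prepare a POST request carrying the multi-part prompt as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        # Assumption: key-based auth via an "api-key" header.
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style chat response (assumed shape)."""
    return response_json["choices"][0]["message"]["content"]


def send_prompt(endpoint_url: str, api_key: str, payload: dict) -> str:
    """Submit the prompt to the endpoint and return the model's reply text."""
    request = build_chat_request(endpoint_url, api_key, payload)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

In a real application, the Python or .NET SDK would normally replace the hand-built request; the sketch only makes the request and response flow explicit.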

Prompt Submission Options

  • Text + audio URL (hosted audio file)
  • Text + base64 binary audio (inline submission)

Best Practices / Considerations

  • Ensure audio files are in a supported format (e.g., MP3)
  • If using base64, avoid large files (size limits may apply)
  • If using remote files, secure the URLs and make sure access permissions (and CORS, where relevant) are handled
  • Test prompts in the Chat Playground before coding
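
The format and size checks above can be automated before an audio file is encoded inline. This is a rough sketch: the set of accepted extensions and the 5 MB ceiling are hypothetical values chosen for illustration, not documented service limits, and `check_inline_audio` is an invented helper name.

```python
import os
from typing import List

# Assumption: only MP3 is listed because it is the format the guide mentions.
SUPPORTED_EXTENSIONS = {".mp3"}
# Hypothetical ceiling for inline (base64) audio; check your service's real limits.
MAX_INLINE_BYTES = 5 * 1024 * 1024


def check_inline_audio(filename: str, size_bytes: int,
                       max_bytes: int = MAX_INLINE_BYTES) -> List[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, above the {max_bytes}-byte limit")
    return problems
```

A caller could pass `os.path.getsize(path)` as `size_bytes` and fall back to a hosted URL when the list is not empty.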

Practice Exam Questions

  1. Q: What is the correct JSON structure for submitting a multimodal audio prompt?
     A: A messages array in which the user message's content array includes both a text object and an audio_url object.

  2. Q: Which models in Azure AI Foundry support audio-based prompts?
     A: Microsoft Phi-4-multimodal-instruct, OpenAI gpt-4o, and OpenAI gpt-4o-mini.

  3. Q: How can you submit local audio data directly in a prompt?
     A: By encoding it in base64 and passing it as a data: URL in the audio_url object.

  4. Q: Which tool can be used to test audio prompts before writing application code?
     A: The Chat Playground in the Azure AI Foundry portal.