
Service: Azure AI Foundry – Multimodal Generative AI

Problem

You need to create a chat-based application that can understand and respond to prompts containing both text and images (vision-enabled interaction).

Solution with Azure

Use Azure AI Foundry to deploy and interact with multimodal generative AI models, such as:

  • Microsoft Phi-4-multimodal-instruct
  • OpenAI GPT-4o
  • OpenAI GPT-4o-mini

These models can process text + image inputs and return appropriate, context-aware responses.

Required components

  • Azure AI Foundry portal for model deployment and testing
  • Multimodal model (e.g., GPT-4o)
  • Chat playground to test image+text prompts
  • API endpoint for submitting multi-part messages
  • SDK for Python or .NET (Azure AI Model Inference or OpenAI API)

๐Ÿ—๏ธ Architettura / Sviluppo

  • Deploy a multimodal model from the Azure AI Foundry portal
  • Test prompts using the built-in chat playground by uploading an image and adding a text prompt
  • Develop a client application that:
      • Connects to the model endpoint
      • Sends prompts as multi-part messages
      • Receives and processes the response
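The three client steps above can be sketched with only the Python standard library. This is a minimal illustration, not the official SDK flow: the function names are hypothetical, the `api-key` header and the `choices[0].message.content` response shape are assumed to follow the chat-completions convention, and the endpoint URL and key must be copied from your own deployment.

```python
import json
import urllib.request


def build_chat_request(endpoint_url, api_key, messages):
    """Step 1 and 2: target the model endpoint and wrap the messages in a POST."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,  # assumption: full chat-completions URL of your deployment
        data=body,
        headers={
            "Content-Type": "application/json",
            "api-key": api_key,  # assumption: key-based authentication
        },
        method="POST",
    )


def extract_reply(response_json):
    """Step 3: pull the assistant text out of a chat-completions style response."""
    return response_json["choices"][0]["message"]["content"]


def send_chat(endpoint_url, api_key, messages):
    """Send the request and return the model's reply text."""
    request = build_chat_request(endpoint_url, api_key, messages)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

In a production app, an SDK (Azure AI Model Inference or OpenAI) would normally replace the hand-built request; the sketch only makes the request and response shapes visible.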

Multi-part JSON prompt format:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this picture:"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://....."
          }
        }
      ]
    }
  ]
}
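As an illustration, the same structure can be assembled in Python and serialized with `json`. The helper name `build_multimodal_messages` and the `https://example.com/photo.jpg` URL are placeholders invented for this sketch; substitute a reachable image URL of your own.

```python
import json


def build_multimodal_messages(prompt_text, image_url,
                              system_prompt="You are a helpful assistant."):
    """Return a system message plus one user message with a text and an image part."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            # The user content is a list of typed parts rather than a plain string.
            "content": [
                {"type": "text", "text": prompt_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]


# Placeholder URL for illustration only.
payload = {
    "messages": build_multimodal_messages(
        "Describe this picture:", "https://example.com/photo.jpg"
    )
}
print(json.dumps(payload, indent=2))
```

Note that the system message keeps a plain string as its content, while the user message switches to a list so that text and image parts can travel together.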

To send a local image instead, base64-encode its bytes and embed them in a data URL:

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,<base64_encoded_image_data>"
  }
}
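A minimal Python sketch of that encoding step follows. Both helper names are hypothetical, and the fallback to `image/jpeg` when the file type cannot be guessed is an assumption made for this example.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(image_bytes, mime_type="image/jpeg"):
    """Base64-encode raw image bytes and wrap them in a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_content_part(path):
    """Read a local image file and return an image_url content part for a prompt."""
    mime_type, _ = mimetypes.guess_type(str(path))
    # Assumption: fall back to JPEG when the extension is not recognized.
    data_url = bytes_to_data_url(Path(path).read_bytes(), mime_type or "image/jpeg")
    return {"type": "image_url", "image_url": {"url": data_url}}
```

The returned dictionary can be placed in the user message's content list in place of an item that points to a remote URL.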

Best practices / Considerations

  • Use the multimodal prompt format (text + image) for richer, context-aware interaction
  • Choose the right model (GPT-4o, Phi-4, etc.) based on performance and availability
  • For local images, use base64 encoding in data URL format
  • Use SDKs to simplify interaction with REST APIs in production apps
  • Always validate model output in the chat playground before integrating

Practice exam questions

Q1. Which Azure service allows you to create chat-based apps using text and image input?
A. Azure AI Foundry (with a multimodal model)

Q2. What is the correct JSON structure for sending a prompt with both text and image?
A. A multi-part message containing both type: "text" and type: "image_url" items

Q3. You need to develop a vision-enabled chatbot in .NET. Which resource do you use?
A. The Azure AI Model Inference or OpenAI .NET SDK

Q4. Which models support vision-based chat in Azure AI Foundry?
A. Microsoft Phi-4-multimodal-instruct, OpenAI GPT-4o, and GPT-4o-mini

Q5. What's a required step to submit a local image file in a prompt?
A. Encode the image as base64 and include it in a data URL