Vision Language Models

Sunset Notice: All vision language models on this page are being discontinued. These models will be removed in a future update. Please plan your migration accordingly.

Analyze images and documents with vision-language models (VLMs). These multimodal models can describe image content, extract text, answer questions about visuals, and reason over what they see, all through the same chat completions API used for text models.

Overview

Vision language models combine image understanding with natural language processing to enable:

Image Understanding: Describe and analyze image content
Document Analysis: Extract text from documents, receipts, and forms (OCR)
Visual Q&A: Answer questions about images
Image-based Reasoning: Perform complex analysis and comparisons

Endpoint

VLMs use the same chat completions endpoint as text models:

POST https://api.hyperbolic.xyz/v1/chat/completions

Basic Example

Python

import base64
import requests
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    """Encode an image file to base64 string."""
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Encode your image
base64_image = encode_image("path/to/your/image.jpg")

url = "https://api.hyperbolic.xyz/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
}
data = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                }
            ]
        }
    ],
    "max_tokens": 512,
    "temperature": 0.1
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # Stop on HTTP error responses instead of hitting a KeyError below
print(response.json()["choices"][0]["message"]["content"])

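cURL

# First, encode your image to base64. The data URL below declares image/png, so encode a PNG file:
# macOS: base64 -i image.png -o image_base64.txt
# Linux: base64 -w 0 image.png > image_base64.txt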

curl -X POST "https://api.hyperbolic.xyz/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64,YOUR_BASE64_STRING"}
          }
        ]
      }
    ],
    "max_tokens": 512,
    "temperature": 0.1
  }'

Image Input Format

Encoding Images

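Images must be base64-encoded before sending to the API. Here's a helper function that also downscales images larger than 2048x2048 pixels (the maximum resolution listed under Limitations) while preserving the aspect ratio: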

import base64
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    """Encode an image file to base64 string."""
    with Image.open(image_path) as img:
        # Resize if larger than max resolution
        max_size = (2048, 2048)
        img.thumbnail(max_size, Image.Resampling.LANCZOS)
        
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")
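
The helper above re-encodes every image as PNG. As an alternative, here is a minimal sketch that sends the file bytes unchanged and picks the data URL MIME type from the file extension. `to_data_url` and `SUPPORTED_MIME_TYPES` are hypothetical names, and the sketch assumes the API accepts `data:image/jpeg` URLs for JPG files, which this page implies but does not show. It does not resize, so images above the maximum resolution still need the Pillow helper.

```python
import base64
import mimetypes
from pathlib import Path

# Formats listed under Limitations on this page (JPG and PNG).
SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}


def to_data_url(image_path):
    """Return a data URL for an image file without re-encoding it.

    Hypothetical helper: the MIME type is guessed from the file
    extension, not from the file contents.
    """
    mime_type, _ = mimetypes.guess_type(str(image_path))
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image type for {image_path!r}: {mime_type}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` content object.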

Message Format

When sending images, the content field becomes an array of content objects:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,{base64_string}"}
        }
      ]
    }
  ]
}
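
The same structure can be built in Python before it is serialized. This is a small sketch, not part of any SDK: `build_image_message` is a hypothetical helper name, and the base64 string in the example is a placeholder rather than a real image.

```python
import json


def build_image_message(prompt, base64_image, mime_type="image/png"):
    """Build one user message holding a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{base64_image}"},
            },
        ],
    }


# Placeholder base64 string, used only to show the resulting JSON shape.
message = build_image_message("Describe this image in detail.", "BASE64_PLACEHOLDER")
print(json.dumps({"messages": [message]}, indent=2))
```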

Limitations

Supported formats: JPG, PNG
Maximum resolution: 2048x2048 pixels
Images per request: 1
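
The one-image limit can be checked on the client before a request is sent. The sketch below assumes the limit is counted across every message in the request, which this page does not spell out; `count_images` and `check_image_limit` are hypothetical names.

```python
MAX_IMAGES_PER_REQUEST = 1  # "Images per request: 1" from the list above


def count_images(messages):
    """Count image_url parts across all messages in a request."""
    total = 0
    for message in messages:
        content = message.get("content")
        # Text-only messages carry a plain string, not a list of parts.
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


def check_image_limit(messages, max_images=MAX_IMAGES_PER_REQUEST):
    """Raise ValueError if the request carries more images than allowed."""
    found = count_images(messages)
    if found > max_images:
        raise ValueError(f"Request has {found} images; the limit is {max_images}")
```

Calling `check_image_limit(messages)` just before posting turns an oversized request into a local error rather than a failed API call.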

Multi-turn Conversations

You can ask follow-up questions about an image by maintaining conversation history:

Python

import base64
import requests
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

base64_image = encode_image("receipt.jpg")

url = "https://api.hyperbolic.xyz/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
}

# First turn: send the image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What items are on this receipt?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"}
            }
        ]
    }
]

response = requests.post(url, headers=headers, json={
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": messages,
    "max_tokens": 512
})

assistant_response = response.json()["choices"][0]["message"]["content"]
print("First response:", assistant_response)

# Second turn: follow-up question. The image stays in the history (first message), so it is not attached again.
messages.append({"role": "assistant", "content": assistant_response})
messages.append({"role": "user", "content": "What is the total amount?"})

response = requests.post(url, headers=headers, json={
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": messages,
    "max_tokens": 256
})

print("Follow-up response:", response.json()["choices"][0]["message"]["content"])

cURL

# Multi-turn conversation with a follow-up question; the earlier turns are sent again as history
curl -X POST "https://api.hyperbolic.xyz/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What items are on this receipt?"},
          {
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64,YOUR_BASE64_STRING"}
          }
        ]
      },
      {
        "role": "assistant",
        "content": "The receipt shows: 1. Coffee - $4.50, 2. Sandwich - $8.99..."
      },
      {
        "role": "user",
        "content": "What is the total amount?"
      }
    ],
    "max_tokens": 256
  }'
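
The history bookkeeping shown above can be wrapped in a small class. This is a sketch under stated assumptions: `ImageChat` and its `send` parameter are hypothetical, and `send` stands for any callable that takes the request payload and returns the parsed JSON response, so the class itself makes no network calls.

```python
class ImageChat:
    """Hypothetical wrapper that keeps the message history between turns."""

    def __init__(self, model, send):
        self.model = model
        self.send = send  # callable: payload dict -> parsed response dict
        self.messages = []

    def ask(self, text, base64_image=None, max_tokens=512):
        """Add a user turn, send the full history, and record the reply."""
        if base64_image is None:
            content = text  # text-only follow-up
        else:
            content = [
                {"type": "text", "text": text},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
            ]
        self.messages.append({"role": "user", "content": content})
        response = self.send({
            "model": self.model,
            "messages": list(self.messages),  # copy, so later turns do not alter it
            "max_tokens": max_tokens,
        })
        reply = response["choices"][0]["message"]["content"]
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the `url` and `headers` from the Python example, `send` could be `lambda payload: requests.post(url, headers=headers, json=payload).json()`. Attach the image on the first `ask` call only; later calls pass just the question.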

Available Models

| Model | Model ID | Best For | Price |
|---|---|---|---|
| NVIDIA Nemotron Nano 12B v2 VL ⚠️ Sunset | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | Document intelligence | $0.20/M tokens |
| Pixtral 12B ⚠️ Sunset | `mistralai/Pixtral-12B-2409` | Budget-friendly, general use | $0.10/M tokens |
| Qwen2.5-VL-7B-Instruct ⚠️ Sunset | `Qwen/Qwen2.5-VL-7B-Instruct` | Balanced cost/performance | $0.20/M tokens |
| Qwen2.5-VL-72B-Instruct ⚠️ Sunset | `Qwen/Qwen2.5-VL-72B-Instruct` | Best quality, complex analysis | $0.60/M tokens |

Model Recommendations

Choosing the right model:

Best quality: Qwen2.5-VL-72B-Instruct for complex analysis and detailed understanding
Best value: Pixtral 12B at $0.10/M tokens for general image tasks
Document analysis: NVIDIA Nemotron Nano for OCR, forms, and document intelligence
Balanced: Qwen2.5-VL-7B-Instruct for good performance at moderate cost
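
To compare models on cost, the per-million-token prices can be turned into a rough estimate. The sketch below assumes one flat rate for all tokens in a request, which the page does not confirm for input versus output or for image content. `estimate_cost` is a hypothetical helper and the workload figures are invented for illustration.

```python
# Prices in USD per million tokens, copied from the Available Models table.
PRICE_PER_MILLION_TOKENS = {
    "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16": 0.20,
    "mistralai/Pixtral-12B-2409": 0.10,
    "Qwen/Qwen2.5-VL-7B-Instruct": 0.20,
    "Qwen/Qwen2.5-VL-72B-Instruct": 0.60,
}


def estimate_cost(model_id, total_tokens):
    """Estimate the USD cost of `total_tokens` tokens on `model_id`."""
    return PRICE_PER_MILLION_TOKENS[model_id] * total_tokens / 1_000_000


# Invented workload: 1,000 requests at roughly 1,500 tokens each.
for model_id in PRICE_PER_MILLION_TOKENS:
    print(f"{model_id}: ${estimate_cost(model_id, 1_000 * 1_500):.2f}")
```

For that workload, Qwen2.5-VL-72B-Instruct comes to about $0.90 and Pixtral 12B to about $0.15.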

Use Cases

Document Analysis

Extract text and structured data from documents, receipts, and forms:

data = {
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text from this receipt and format it as a list with item names and prices."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 1024
}

Image Captioning

Generate detailed descriptions of images:

data = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail, including colors, objects, and any text visible."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 512
}

Visual Q&A

Ask specific questions about image content:

data = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "How many people are in this photo? What are they doing?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 256
}
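
The three payloads above differ only in model, prompt, and `max_tokens`, so they can be generated from one table of presets. This is a sketch: `PRESETS` and `build_payload` are hypothetical names, and the values are copied from the snippets above.

```python
# Model, prompt, and max_tokens for each use case, copied from the snippets above.
PRESETS = {
    "document_analysis": {
        "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
        "prompt": "Extract all text from this receipt and format it as a list with item names and prices.",
        "max_tokens": 1024,
    },
    "image_captioning": {
        "model": "Qwen/Qwen2.5-VL-72B-Instruct",
        "prompt": "Describe this image in detail, including colors, objects, and any text visible.",
        "max_tokens": 512,
    },
    "visual_qa": {
        "model": "mistralai/Pixtral-12B-2409",
        "prompt": "How many people are in this photo? What are they doing?",
        "max_tokens": 256,
    },
}


def build_payload(task, base64_image, prompt=None):
    """Build a chat completions payload for one of the preset use cases."""
    preset = PRESETS[task]
    return {
        "model": preset["model"],
        "messages": [{
            "role": "user",
            "content": [
                # A caller-supplied prompt overrides the preset prompt.
                {"type": "text", "text": prompt or preset["prompt"]},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
            ],
        }],
        "max_tokens": preset["max_tokens"],
    }
```

The result is passed as `json=` to `requests.post`, exactly like `data` in the Basic Example.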

Next Steps

Text APIs

Text generation with large language models

Image APIs

Generate images from text prompts

Audio APIs

Text-to-speech and audio generation
