Meta: Llama 3.2 11B Vision Instruct
Llama 3.2 11B Vision is an 11-billion-parameter multimodal model designed for tasks that combine visual and textual data. It excels at image captioning and visual question answering, bridging language generation and visual reasoning. Pre-trained on a large dataset of image-text pairs, it handles complex image-analysis tasks that demand high accuracy. Its integration of visual understanding with language processing makes it well suited to industries that need visual-linguistic AI, such as content creation, AI-driven customer service, and research. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md) for details. Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
**Parameters:** 11B
**Context Window:** 131K tokens
**License:** Proprietary
**Released:** Sep 25, 2024
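For local experimentation, the sketch below loads the instruction-tuned checkpoint through Hugging Face transformers (the `MllamaForConditionalGeneration` class, available from transformers 4.45 onward). It assumes you have accepted Meta's license and been granted access to the gated `meta-llama/Llama-3.2-11B-Vision-Instruct` repository; the sample image URL is just an example input.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the instruction-tuned vision model and its processor.
# Requires an authenticated Hugging Face account with access to the gated repo.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch a sample image (any RGB image works).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt that interleaves the image with text.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image, prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```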
💰 Pricing
**Input:** $0.05 per 1M tokens
**Output:** $0.05 per 1M tokens
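As a quick sanity check on costs, the snippet below computes a per-request price at these rates; the token counts are hypothetical.

```python
# Per-token prices derived from the listed rates ($0.05 per 1M tokens).
INPUT_PRICE_PER_TOKEN = 0.05 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 0.05 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in dollars of a single request."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# Example: a 2,000-token prompt with a 300-token reply costs
# (2000 + 300) * $0.00000005 = $0.000115.
print(f"${request_cost(2_000, 300):.6f}")
```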
API Available
This model is accessible via API for integration into your applications.
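A minimal sketch of calling the model through an OpenAI-compatible chat-completions endpoint follows. The base URL, API key environment variable, model slug, and image URL are all placeholders; substitute the values from whichever provider hosts the model.

```python
import os
from openai import OpenAI

# Placeholder endpoint and credentials: replace with your provider's values.
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key=os.environ["API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",  # placeholder slug
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ]},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```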
⭐ Related Models
Llama 4 Maverick
Meta
Meta's flagship open-weight MoE model with 400B total parameters and 17B active. Strong multilingual and coding performance.
Llama 4 Scout
Meta
Efficient MoE model with 109B total parameters. Fits on a single H100 GPU while delivering strong performance.
Llama 3.1 405B
Meta
The largest dense open-weight model. Competitive with GPT-4 class models across benchmarks.