Pixtral 12B (2409)

Pixtral 12B is a state-of-the-art multimodal AI model developed by Mistral AI. It combines strong visual understanding capabilities with excellent text processing, making it a versatile tool for various multimodal tasks. Key features include:

Natively multimodal architecture, trained on interleaved image and text data
400M-parameter vision encoder and 12B-parameter multimodal decoder based on Mistral Nemo
Support for variable image sizes and multiple images within a 128k-token context window
Top-tier performance on multimodal benchmarks like MMMU (52.5%), outperforming many larger models
Maintained excellence in text-only tasks, unlike some other multimodal models

Pixtral excels in tasks such as chart understanding, document question-answering, and multimodal reasoning. It is particularly strong at instruction following in both multimodal and text-only scenarios. The model processes images at their native resolution and aspect ratio, so the number of tokens spent on an image depends on its size. This lets users trade image detail against context-window usage.
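To make the link between image size and token usage concrete, here is a rough sketch of how an image's token cost could be estimated. It assumes the scheme described in Mistral AI's Pixtral announcement: one token per 16x16 pixel patch, plus one separator token closing each patch row (a row-break token after every row except the last, and an end-of-image token after the last). The function name image_token_count and the PATCH_SIZE constant are illustrative, not part of any Mistral library, and any resizing applied before encoding is ignored here.

```python
import math

# Assumption: the vision encoder splits images into 16x16 pixel patches.
PATCH_SIZE = 16


def image_token_count(width: int, height: int, patch_size: int = PATCH_SIZE) -> int:
    """Estimate the tokens one image consumes in the context window.

    Hypothetical helper: counts one token per patch, plus one separator
    token per patch row (row break or end-of-image).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cols = math.ceil(width / patch_size)   # patches per row
    rows = math.ceil(height / patch_size)  # number of patch rows
    return rows * (cols + 1)


if __name__ == "__main__":
    for w, h in [(512, 512), (1024, 768)]:
        print(f"{w}x{h}: ~{image_token_count(w, h)} tokens")
```

Under these assumptions a 512x512 image costs about 1,056 tokens and a 1024x768 image about 3,120, so downscaling an image before sending it is one way to fit more images into the 128k-token context window.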