-
1
Qwen3
Alibaba
Unleashing groundbreaking AI with unparalleled global language support.
Qwen3, the latest large language model from the Qwen family, introduces a new level of flexibility and power for developers and researchers. With models ranging from the high-performance Qwen3-235B-A22B to the smaller Qwen3-4B, Qwen3 is engineered to excel across a variety of tasks, including coding, math, and natural language processing. The unique hybrid thinking modes allow users to switch between deep reasoning for complex tasks and fast, efficient responses for simpler ones. Additionally, Qwen3 supports 119 languages, making it ideal for global applications. The model has been trained on an unprecedented 36 trillion tokens and leverages cutting-edge reinforcement learning techniques to continually improve its capabilities. Available on multiple platforms, including Hugging Face and ModelScope, Qwen3 is an essential tool for those seeking advanced AI-powered solutions for their projects.
-
2
Mistral Medium 3
Mistral AI
Revolutionary AI: Unmatched performance, unbeatable affordability, seamless deployment.
Mistral Medium 3 is a breakthrough in AI technology, offering the perfect balance of cutting-edge performance and significantly reduced costs. This model introduces a new era of enterprise AI, with a focus on simplifying deployments while still providing exceptional performance. Its ability to deliver high-level results at just a fraction of the cost of its competitors makes it a game-changer in industries that rely on complex AI tasks. Mistral Medium 3 is particularly strong in professional use cases like coding, where it competes closely with larger models that are typically more expensive and slower. The model supports hybrid and on-premises deployments, offering enterprise users full control over customization and integration into their systems. Businesses can leverage Mistral Medium 3 for both large-scale deployments and fine-tuned, domain-specific training, allowing for enhanced efficiency in industries such as healthcare, financial services, and energy. The addition of continuous learning and the ability to integrate with enterprise knowledge bases makes it a flexible, future-proof solution. Customers in beta are already using Mistral Medium 3 to enrich customer service, personalize business processes, and analyze complex datasets, demonstrating its real-world value. Available through various cloud platforms like Amazon Sagemaker, IBM WatsonX, and Google Cloud Vertex, Mistral Medium 3 is now ready to be deployed for custom use cases across a range of industries.
-
3
NuExtract
NuExtract
Effortlessly extract structured data from any document format.
NuExtract is a sophisticated tool designed to extract structured information from a wide array of document formats, including text files, scanned images, PDFs, PowerPoint presentations, and spreadsheets, while effectively managing multiple languages and mixed-language content. It produces output in JSON format according to user-defined templates, featuring validation and null value handling to minimize errors. Users can begin extraction tasks by creating a template, either by specifying desired fields or by importing existing formats; they can further improve accuracy by providing example documents alongside expected results in the example set. The NuExtract Platform offers an intuitive interface for creating templates, testing extractions in a controlled environment, curating teaching examples, and fine-tuning parameters like model temperature and document rasterization DPI. Once validation is complete, projects can be executed through a RESTful API endpoint, allowing for real-time document processing. This seamless integration empowers users to effectively manage their data extraction processes, significantly boosting both efficiency and precision in their operations. Furthermore, the ability to adjust parameters and test in a sandbox environment grants users greater control over the extraction process, ensuring optimal results tailored to their specific needs.
-
4
GLM-4.5-Air
Z.ai
Your all-in-one AI solution for presentations, writing, coding!
Z.ai is a flexible and complementary AI assistant that merges the realms of presentations, writing, and coding into a fluid conversational experience. Utilizing cutting-edge language models, it empowers users to design intricate slide decks with AI-generated visuals, generate high-caliber text for diverse applications like emails, reports, and blogs, and even tackle complex coding challenges through writing or debugging code. Beyond just content creation, Z.ai shines in thorough research and information gathering, enabling users to extract data, summarize extensive documents, and overcome writer's block, while its coding assistant can elucidate code snippets, enhance functions, or create scripts from scratch. The intuitive chat interface requires no extensive training; users simply articulate their needs—whether for a strategic presentation, marketing materials, or a script for data analysis—and receive prompt, relevant responses. Additionally, Z.ai supports multiple languages, including Chinese, and boasts an impressive native function invocation along with a support for a substantial 128K token context, making it adept at facilitating everything from brainstorming ideas to automating repetitive writing and coding tasks. This makes it an essential resource for professionals in a wide array of disciplines. Ultimately, Z.ai's all-encompassing approach ensures that users can handle complicated projects with both comfort and effectiveness.
-
5
ByteDance Seed
ByteDance
Revolutionizing code generation with unmatched speed and accuracy.
Seed Diffusion Preview represents a cutting-edge language model tailored for code generation that utilizes discrete-state diffusion, enabling it to generate code in a non-linear fashion, which significantly accelerates inference times without sacrificing quality. This pioneering methodology follows a two-phase training procedure that consists of mask-based corruption coupled with edit-based enhancement, allowing a typical dense Transformer to strike an optimal balance between efficiency and accuracy while steering clear of shortcuts such as carry-over unmasking, thereby ensuring rigorous density estimation. Remarkably, the model achieves an impressive inference rate of 2,146 tokens per second on H20 GPUs, outperforming existing diffusion benchmarks while either matching or exceeding accuracy on recognized code evaluation metrics, including various editing tasks. This exceptional performance not only establishes a new standard for the trade-off between speed and quality in code generation but also highlights the practical effectiveness of discrete diffusion techniques in real-world coding environments. Furthermore, its achievements pave the way for improved productivity in coding tasks across diverse platforms, potentially transforming how developers approach code generation and refinement.
-
6
GPT-5 mini
OpenAI
Streamlined AI for fast, precise, and cost-effective tasks.
GPT-5 mini is a faster, more affordable variant of OpenAI’s advanced GPT-5 language model, specifically tailored for well-defined and precise tasks that benefit from high reasoning ability. It accepts both text and image inputs (image input only), and generates high-quality text outputs, supported by a large 400,000-token context window and a maximum of 128,000 tokens in output, enabling complex multi-step reasoning and detailed responses. The model excels in providing rapid response times, making it ideal for use cases where speed and efficiency are critical, such as chatbots, customer service, or real-time analytics. GPT-5 mini’s pricing structure significantly reduces costs, with input tokens priced at $0.25 per million and output tokens at $2 per million, offering a more economical option compared to the flagship GPT-5. While it supports advanced features like streaming, function calling, structured output generation, and fine-tuning, it does not currently support audio input or image generation capabilities. GPT-5 mini integrates seamlessly with multiple API endpoints including chat completions, responses, embeddings, and batch processing, providing versatility for a wide array of applications. Rate limits are tier-based, scaling from 500 requests per minute up to 30,000 per minute for higher tiers, accommodating small to large scale deployments. The model also supports snapshots to lock in performance and behavior, ensuring consistency across applications. GPT-5 mini is ideal for developers and businesses seeking a cost-effective solution with high reasoning power and fast throughput. It balances cutting-edge AI capabilities with efficiency, making it a practical choice for applications demanding speed, precision, and scalability.
-
7
GPT-5 nano
OpenAI
Lightning-fast, budget-friendly AI for text and images!
GPT-5 nano is OpenAI’s fastest and most cost-efficient version of the GPT-5 model, engineered to handle high-speed text and image input processing for tasks such as summarization, classification, and content generation. It features an extensive 400,000-token context window and can output up to 128,000 tokens, allowing for complex, multi-step language understanding despite its focus on speed. With ultra-low pricing—$0.05 per million input tokens and $0.40 per million output tokens—GPT-5 nano makes advanced AI accessible to budget-conscious users and developers working at scale. The model supports a variety of advanced API features, including streaming output, function calling for interactive applications, structured outputs for precise control, and fine-tuning for customization. While it lacks support for audio input and web search, GPT-5 nano supports image input, code interpretation, and file search, broadening its utility. Developers benefit from tiered rate limits that scale from 500 to 30,000 requests per minute and up to 180 million tokens per minute, supporting everything from small projects to enterprise workloads. The model also offers snapshots to lock performance and behavior, ensuring consistent results over time. GPT-5 nano strikes a practical balance between speed, cost, and capability, making it ideal for fast, efficient AI implementations where rapid turnaround and budget are critical. It fits well for applications requiring real-time summarization, classification, chatbots, or lightweight natural language processing tasks. Overall, GPT-5 nano expands the accessibility of OpenAI’s powerful AI technology to a broader user base.
-
8
DeepSeek V3.1
DeepSeek
Revolutionizing AI with unmatched power and flexibility.
DeepSeek V3.1 emerges as a groundbreaking open-weight large language model, featuring an astounding 685-billion parameters and an extensive 128,000-token context window that enables it to process lengthy documents similar to 400-page novels in a single run. This model encompasses integrated capabilities for conversation, reasoning, and code generation within a unified hybrid framework that effectively blends these varied functionalities. Additionally, V3.1 supports multiple tensor formats, allowing developers to optimize performance across different hardware configurations. Initial benchmark tests indicate impressive outcomes, with a notable score of 71.6% on the Aider coding benchmark, placing it on par with or even outperforming competitors like Claude Opus 4, all while maintaining a significantly lower cost. Launched under an open-source license on Hugging Face with minimal promotion, DeepSeek V3.1 aims to transform the availability of advanced AI solutions, potentially challenging the traditional landscape dominated by proprietary models. The model's innovative features and affordability are likely to attract a diverse array of developers eager to implement state-of-the-art AI technologies in their applications, thus fostering a new wave of creativity and efficiency in the tech industry.
-
9
Hermes 4
Nous Research
Experience dynamic, human-like interactions with innovative reasoning power.
Hermes 4 marks a significant leap forward in Nous Research's lineup of neutrally aligned, steerable foundational models, showcasing advanced hybrid reasoners capable of seamlessly shifting between creative, expressive outputs and succinct, efficient answers tailored to user needs. This model is designed to emphasize user and system commands above any corporate ethical considerations, resulting in a more conversational and engaging interaction style that avoids sounding overly authoritative or ingratiating, while also promoting opportunities for imaginative roleplay. By incorporating a specific tag in prompts, users can unlock a higher level of reasoning that is resource-intensive, enabling them to tackle complex problems without sacrificing efficiency for simpler inquiries. With a training dataset that is 50 times larger than that of Hermes 3, much of which has been synthetically generated through Atropos, Hermes 4 shows significant performance improvements. This evolution not only enhances accuracy but also expands the scope of applications for which the model can be utilized effectively. Furthermore, the increased capabilities of Hermes 4 pave the way for innovative uses across various domains, demonstrating a strong commitment to advancing user experiences.
-
10
K2 Think
Institute of Foundation Models
Revolutionary reasoning model: compact, powerful, and open-source.
K2 Think is an innovative open-source advanced reasoning model that has emerged from a collaborative effort between the Institute of Foundation Models at MBZUAI and G42. Despite having a relatively modest size of 32 billion parameters, K2 Think delivers performance that competes with top-tier models that possess much larger parameter counts. Its primary strength is in mathematical reasoning, where it has achieved excellent rankings on distinguished benchmarks, including AIME ’24/’25, HMMT ’25, and OMNI-Math-HARD. This model is part of a broader initiative aimed at developing open models in the UAE, which also encompasses Jais (for Arabic), NANDA (for Hindi), and SHERKALA (for Kazakh). It builds on the foundational work laid by the K2-65B, a fully reproducible open-source foundation model that was introduced in 2024. K2 Think is designed to be open, efficient, and versatile, featuring a web app interface that encourages user interaction and exploration. Its cutting-edge approach to parameter positioning signifies a notable leap forward in creating compact architectures for high-level AI reasoning. Furthermore, its development underscores a commitment to improving access to advanced AI technologies across multiple languages and sectors, ultimately fostering greater inclusivity in the field.
-
11
DeepSeek has introduced DeepSeek-V3.1-Terminus, an enhanced version of the V3.1 architecture that incorporates user feedback to improve output reliability, uniformity, and overall performance of the agent. This upgrade notably reduces the frequency of mixed Chinese and English text as well as unintended anomalies, resulting in a more polished and cohesive language generation experience. Furthermore, the update overhauls both the code agent and search agent subsystems, yielding better and more consistent performance across a range of benchmarks. DeepSeek-V3.1-Terminus is released as an open-source model, with its weights made available on Hugging Face, thereby facilitating easier access for the community to utilize its functionalities. The model's architecture stays consistent with that of DeepSeek-V3, ensuring compatibility with existing deployment strategies, while updated inference demonstrations are provided for users to investigate its capabilities. Impressively, the model functions at a massive scale of 685 billion parameters and accommodates various tensor formats, such as FP8, BF16, and F32, which enhances its adaptability in diverse environments. This versatility empowers developers to select the most appropriate format tailored to their specific requirements and resource limitations, thereby optimizing performance in their respective applications.
-
12
Qwen3-Max
Alibaba
Unleash limitless potential with advanced multi-modal reasoning capabilities.
Qwen3-Max is Alibaba's state-of-the-art large language model, boasting an impressive trillion parameters designed to enhance performance in tasks that demand agency, coding, reasoning, and the management of long contexts. As a progression of the Qwen3 series, this model utilizes improved architecture, training techniques, and inference methods; it features both thinker and non-thinker modes, introduces a distinctive “thinking budget” approach, and offers the flexibility to switch modes according to the complexity of the tasks. With its capability to process extremely long inputs and manage hundreds of thousands of tokens, it also enables the invocation of tools and showcases remarkable outcomes across various benchmarks, including evaluations related to coding, multi-step reasoning, and agent assessments like Tau2-Bench. Although the initial iteration primarily focuses on following instructions within a non-thinking framework, Alibaba plans to roll out reasoning features that will empower autonomous agent functionalities in the near future. Furthermore, with its robust multilingual support and comprehensive training on trillions of tokens, Qwen3-Max is available through API interfaces that integrate well with OpenAI-style functionalities, guaranteeing extensive applicability across a range of applications. This extensive and innovative framework positions Qwen3-Max as a significant competitor in the field of advanced artificial intelligence language models, making it a pivotal tool for developers and researchers alike.
-
13
GLM-4.6
Zhipu AI
Empower your projects with enhanced reasoning and coding capabilities.
GLM-4.6 builds on the groundwork established by its predecessor, offering improved reasoning, coding, and agent functionalities that lead to significant improvements in inferential precision, better tool application during reasoning exercises, and a smoother incorporation into agent architectures. In extensive benchmark assessments evaluating reasoning, coding, and agent performance, GLM-4.6 outperforms GLM-4.5 and holds its own against competitive models such as DeepSeek-V3.2-Exp and Claude Sonnet 4, though it still trails Claude Sonnet 4.5 regarding coding proficiency. Additionally, when evaluated through practical testing using a comprehensive “CC-Bench” suite, which encompasses tasks related to front-end development, tool creation, data analysis, and algorithmic challenges, GLM-4.6 shows superior performance compared to GLM-4.5, achieving a nearly equal standing with Claude Sonnet 4, winning around 48.6% of direct matchups while exhibiting an approximate 15% boost in token efficiency. This newest iteration is available via the Z.ai API, allowing developers to utilize it either as a backend for an LLM or as the fundamental component in an agent within the platform's API ecosystem. Moreover, the enhancements in GLM-4.6 promise to significantly elevate productivity across diverse application areas, making it a compelling choice for developers eager to adopt the latest advancements in AI technology. Consequently, the model's versatility and performance improvements position it as a key player in the ongoing evolution of AI-driven solutions.
-
14
DeepSeek-V3.2-Exp
DeepSeek
Experience lightning-fast efficiency with cutting-edge AI technology!
We are excited to present DeepSeek-V3.2-Exp, our latest experimental model that evolves from V3.1-Terminus, incorporating the cutting-edge DeepSeek Sparse Attention (DSA) technology designed to significantly improve both training and inference speeds for longer contexts. This innovative DSA framework enables accurate sparse attention while preserving the quality of outputs, resulting in enhanced performance for long-context tasks alongside reduced computational costs. Benchmark evaluations demonstrate that V3.2-Exp delivers performance on par with V3.1-Terminus, all while benefiting from these efficiency gains. The model is fully functional across various platforms, including app, web, and API. In addition, to promote wider accessibility, we have reduced DeepSeek API pricing by more than 50% starting now. During this transition phase, users will have access to V3.1-Terminus through a temporary API endpoint until October 15, 2025. DeepSeek invites feedback on DSA from users via our dedicated feedback portal, encouraging community engagement. To further support this initiative, DeepSeek-V3.2-Exp is now available as open-source, with model weights and key technologies—including essential GPU kernels in TileLang and CUDA—published on Hugging Face, and we are eager to observe how the community will leverage this significant technological advancement. As we unveil this new chapter, we anticipate fruitful interactions and innovative applications arising from the collective contributions of our user base.
-
15
Gemini Enterprise
Google
Unlock productivity with AI automation and seamless integration.
Gemini Enterprise app is a powerful enterprise-grade AI platform that enables organizations to deploy, manage, and scale AI agents across their entire workforce. It integrates seamlessly with popular productivity tools and data sources, allowing users to access and analyze business data through a single interface. The platform supports advanced automation by enabling agents to execute complex, multi-step workflows across multiple applications. It includes prebuilt agents like NotebookLM Enterprise, as well as tools for building custom and third-party agents using a no-code approach. Gemini Enterprise app provides robust security, governance, and compliance features, including data access controls, encryption, and regulatory support. It offers centralized visibility into all agents, workflows, and permissions, ensuring efficient management at scale. The platform is designed to enhance productivity across departments by automating repetitive tasks and accelerating content creation. It also helps break down data silos by connecting multiple data sources into one system. With scalable pricing options and enterprise-grade infrastructure, it supports both small teams and large organizations. Overall, Gemini Enterprise app delivers a unified, secure, and scalable solution for AI-driven business transformation.
-
16
Claude Haiku 4.5
Anthropic
Elevate efficiency with cutting-edge performance at reduced costs!
Anthropic has launched Claude Haiku 4.5, a new small language model that seeks to deliver near-frontier capabilities while significantly lowering costs. This model shares the coding and reasoning strengths of the mid-tier Sonnet 4 but operates at about one-third of the cost and boasts over twice the processing speed. Benchmarks provided by Anthropic indicate that Haiku 4.5 either matches or exceeds the performance of Sonnet 4 in vital areas such as code generation and complex “computer use” workflows. It is particularly fine-tuned for use cases that demand real-time, low-latency performance, making it a perfect fit for applications such as chatbots, customer service, and collaborative programming. Users can access Haiku 4.5 via the Claude API under the label “claude-haiku-4-5,” aiming for large-scale deployments where cost efficiency, quick responses, and sophisticated intelligence are critical. Now available on Claude Code and a variety of applications, this model enhances user productivity while still delivering high-caliber performance. Furthermore, its introduction signifies a major advancement in offering businesses affordable yet effective AI solutions, thereby reshaping the landscape of accessible technology. This evolution in AI capabilities reflects the ongoing commitment to providing innovative tools that meet the diverse needs of users in various sectors.
-
17
MiniMax M2
MiniMax
Revolutionize coding workflows with unbeatable performance and cost.
MiniMax M2 represents a revolutionary open-source foundational model specifically designed for agent-driven applications and coding endeavors, striking a remarkable balance between efficiency, speed, and cost-effectiveness. It excels within comprehensive development ecosystems, skillfully handling programming assignments, utilizing various tools, and executing complex multi-step operations, all while seamlessly integrating with Python and delivering impressive inference speeds estimated at around 100 tokens per second, coupled with competitive API pricing at roughly 8% of comparable proprietary models. Additionally, the model features a "Lightning Mode" for rapid and efficient agent actions and a "Pro Mode" tailored for in-depth full-stack development, report generation, and management of web-based tools; its completely open-source weights facilitate local deployment through vLLM or SGLang. What sets MiniMax M2 apart is its readiness for production environments, enabling agents to independently carry out tasks such as data analysis, software development, tool integration, and executing complex multi-step logic in real-world organizational settings. Furthermore, with its cutting-edge capabilities, this model is positioned to transform how developers tackle intricate programming challenges and enhances productivity across various domains.
-
18
Olmo 3
Ai2
Unlock limitless potential with groundbreaking open-model technology.
Olmo 3 constitutes an extensive series of open models that include versions with 7 billion and 32 billion parameters, delivering outstanding performance in areas such as base functionality, reasoning, instruction, and reinforcement learning, all while ensuring transparency throughout the development process, including access to raw training datasets, intermediate checkpoints, training scripts, extended context support (with a remarkable window of 65,536 tokens), and provenance tools. The backbone of these models is derived from the Dolma 3 dataset, which encompasses about 9 trillion tokens and employs a thoughtful mixture of web content, scientific research, programming code, and comprehensive documents; this meticulous strategy of pre-training, mid-training, and long-context usage results in base models that receive further refinement through supervised fine-tuning, preference optimization, and reinforcement learning with accountable rewards, leading to the emergence of the Think and Instruct versions. Importantly, the 32 billion Think model has earned recognition as the most formidable fully open reasoning model available thus far, showcasing a performance level that closely competes with that of proprietary models in disciplines such as mathematics, programming, and complex reasoning tasks, highlighting a considerable leap forward in the realm of open model innovation. This breakthrough not only emphasizes the capabilities of open-source models but also suggests a promising future where they can effectively rival conventional closed systems across a range of sophisticated applications, potentially reshaping the landscape of artificial intelligence.
-
19
DeepSeek-V3.2
DeepSeek
Revolutionize reasoning with advanced, efficient, next-gen AI.
DeepSeek-V3.2 represents one of the most advanced open-source LLMs available, delivering exceptional reasoning accuracy, long-context performance, and agent-oriented design. The model introduces DeepSeek Sparse Attention (DSA), a breakthrough attention mechanism that maintains high-quality output while significantly lowering compute requirements—particularly valuable for long-input workloads. DeepSeek-V3.2 was trained with a large-scale reinforcement learning framework capable of scaling post-training compute to the level required to rival frontier proprietary systems. Its Speciale variant surpasses GPT-5 on reasoning benchmarks and achieves performance comparable to Gemini-3.0-Pro, including gold-medal scores in the IMO and IOI 2025 competitions. The model also features a fully redesigned agentic training pipeline that synthesizes tool-use tasks and multi-step reasoning data at scale. A new chat template architecture introduces explicit thinking blocks, robust tool-interaction formatting, and a specialized developer role designed exclusively for search-powered agents. To support developers, the repository includes encoding utilities that translate OpenAI-style prompts into DeepSeek-formatted input strings and parse model output safely. DeepSeek-V3.2 supports inference using safetensors and fp8/bf16 precision, with recommendations for ideal sampling settings when deployed locally. The model is released under the MIT license, ensuring maximal openness for commercial and research applications. Together, these innovations make DeepSeek-V3.2 a powerful choice for building next-generation reasoning applications, agentic systems, research assistants, and AI infrastructures.
-
20
DeepSeek-V3.2-Speciale represents the pinnacle of DeepSeek’s open-source reasoning models, engineered to deliver elite performance on complex analytical tasks. It introduces DeepSeek Sparse Attention (DSA), a highly efficient long-context attention design that reduces the computational burden while maintaining deep comprehension and logical consistency. The model is trained with an expanded reinforcement learning framework capable of leveraging massive post-training compute, enabling performance not only comparable to GPT-5 but demonstrably surpassing it in internal tests. Its reasoning capabilities have been validated through gold-winning solutions across major global competitions, including IMO 2025 and IOI 2025, with official submissions released for transparency and peer assessment. DeepSeek-V3.2-Speciale is intentionally designed without tool-calling features, focusing every parameter on pure reasoning, multi-step logic, and structured problem solving. It introduces a reworked chat template featuring explicit thought-delimited sections and a structured message format optimized for agentic-style reasoning workflows. The repository includes Python-based utilities for encoding and parsing messages, illustrating how to format prompts correctly for the model. Supporting multiple tensor types (BF16, FP32, FP8_E4M3), it is built for both research experimentation and high-performance local deployment. Users are encouraged to use temperature = 1.0 and top_p = 0.95 for best results when running the model locally. With its open MIT license and transparent development process, DeepSeek-V3.2-Speciale stands as a breakthrough option for anyone requiring industry-leading reasoning capacity in an open LLM.
-
21
Ministral 3
Mistral AI
"Unleash advanced AI efficiency for every device."
Mistral 3 marks the latest development in the realm of open-weight AI models created by Mistral AI, featuring a wide array of options ranging from small, edge-optimized variants to a prominent large-scale multimodal model. Among this selection are three streamlined “Ministral 3” models, equipped with 3 billion, 8 billion, and 14 billion parameters, specifically designed for use on resource-constrained devices like laptops, drones, and various edge devices. In addition, the powerful “Mistral Large 3” serves as a sparse mixture-of-experts model, featuring an impressive total of 675 billion parameters, with 41 billion actively utilized. These models are adept at managing multimodal and multilingual tasks, excelling in areas such as text analysis and image understanding, and have demonstrated remarkable capabilities in responding to general inquiries, handling multilingual conversations, and processing multimodal inputs. Moreover, both the base and instruction-tuned variants are offered under the Apache 2.0 license, which promotes significant customization and integration into a range of enterprise and open-source projects. This approach not only enhances flexibility in usage but also sparks innovation and fosters collaboration among developers and organizations, ultimately driving advancements in AI technology.
-
22
GLM-4.6V
Zhipu AI
Empowering seamless vision-language interactions with advanced reasoning capabilities.
The GLM-4.6V is a sophisticated, open-source multimodal vision-language model that is part of the Z.ai (GLM-V) series, specifically designed for tasks that involve reasoning, perception, and actionable outcomes. It comes in two distinct configurations: a full-featured version boasting 106 billion parameters, ideal for cloud-based systems or high-performance computing setups, and a more efficient “Flash” version with 9 billion parameters, optimized for local use or scenarios that demand minimal latency. With an impressive native context window capable of handling up to 128,000 tokens during its training, GLM-4.6V excels in managing large documents and various multimodal data inputs. A key highlight of this model is its integrated Function Calling feature, which allows it to directly accept different types of visual media, including images, screenshots, and documents, without the need for manual text conversion. This capability not only streamlines the reasoning process regarding visual content but also empowers the model to make tool calls, effectively bridging visual perception with practical applications. The adaptability of GLM-4.6V paves the way for numerous applications, such as generating combined image-and-text content that enhances document understanding with text summarization or crafting responses that incorporate image annotations, significantly improving user engagement and output quality. Moreover, its architecture encourages exploration into innovative uses across diverse fields, making it a valuable asset in the realm of AI.
-
23
GLM-4.1V
Zhipu AI
"Unleashing powerful multimodal reasoning for diverse applications."
GLM-4.1V represents a cutting-edge vision-language model that provides a powerful and efficient multimodal ability for interpreting and reasoning through different types of media, such as images, text, and documents. The 9-billion-parameter variant, referred to as GLM-4.1V-9B-Thinking, is built on the GLM-4-9B foundation and has been refined using a distinctive training method called Reinforcement Learning with Curriculum Sampling (RLCS). With a context window that accommodates 64k tokens, this model can handle high-resolution inputs, supporting images with a resolution of up to 4K and any aspect ratio, enabling it to perform complex tasks like optical character recognition, image captioning, chart and document parsing, video analysis, scene understanding, and GUI-agent workflows, which include interpreting screenshots and identifying UI components. In benchmark evaluations at the 10 B-parameter scale, GLM-4.1V-9B-Thinking achieved remarkable results, securing the top performance in 23 of the 28 tasks assessed. These advancements mark a significant progression in the fusion of visual and textual information, establishing a new benchmark for multimodal models across a variety of applications, and indicating the potential for future innovations in this field. This model not only enhances existing workflows but also opens up new possibilities for applications in diverse domains.
-
24
GLM-4.5V-Flash
Zhipu AI
Efficient, versatile vision-language model for real-world tasks.
GLM-4.5V-Flash is an open-source vision-language model designed to seamlessly integrate powerful multimodal capabilities into a streamlined and deployable format. This versatile model supports a variety of input types including images, videos, documents, and graphical user interfaces, enabling it to perform numerous functions such as scene comprehension, chart and document analysis, screen reading, and image evaluation. Unlike larger models, GLM-4.5V-Flash boasts a smaller size yet retains crucial features typical of visual language models, including visual reasoning, video analysis, GUI task management, and intricate document parsing. Its application within "GUI agent" frameworks allows the model to analyze screenshots or desktop captures, recognize icons or UI elements, and facilitate both automated desktop and web activities. Although it may not reach the performance levels of the most extensive models, GLM-4.5V-Flash offers remarkable adaptability for real-world multimodal tasks where efficiency, lower resource demands, and broad modality support are vital. Ultimately, its innovative design empowers users to leverage sophisticated capabilities while ensuring optimal speed and easy access for various applications. This combination makes it an appealing choice for developers seeking to implement multimodal solutions without the overhead of larger systems.
-
25
GLM-4.5V
Zhipu AI
Revolutionizing multimodal intelligence with unparalleled performance and versatility.
The GLM-4.5V model emerges as a significant advancement over its predecessor, the GLM-4.5-Air, featuring a sophisticated Mixture-of-Experts (MoE) architecture that includes an impressive total of 106 billion parameters, with 12 billion allocated specifically for activation purposes. This model is distinguished by its superior performance among open-source vision-language models (VLMs) of similar scale, excelling in 42 public benchmarks across a wide range of applications, including images, videos, documents, and GUI interactions. It offers a comprehensive suite of multimodal capabilities, tackling image reasoning tasks like scene understanding, spatial recognition, and multi-image analysis, while also addressing video comprehension challenges such as segmentation and event recognition. In addition, it demonstrates remarkable proficiency in deciphering intricate charts and lengthy documents, which supports GUI-agent workflows through functionalities like screen reading and desktop automation, along with providing precise visual grounding by identifying objects and creating bounding boxes. The introduction of a unique "Thinking Mode" switch further enhances the user experience, enabling users to choose between quick responses or more deliberate reasoning tailored to specific situations. This innovative addition not only underscores the versatility of GLM-4.5V but also highlights its adaptability to meet diverse user requirements, making it a powerful tool in the realm of multimodal AI solutions. Furthermore, the model’s ability to seamlessly integrate into various applications signifies its potential for widespread adoption in both research and practical environments.