GPT-OSS 120B (Infercom)
A high-velocity reasoning engine that bridges the gap between frontier intelligence and open-weight accessibility, optimized for the next generation of autonomous agentic workflows.

About the Model
GPT-OSS 120B is built on a massive Mixture-of-Experts (MoE) architecture containing 117 billion total parameters. To ensure lightning-fast performance, it uses a sparse activation strategy where only 5.1 billion parameters are active for any given token. The "Infercom" variant is specifically tuned for inference engines like vLLM and NVIDIA NIM, utilizing MXFP4 quantization to maintain high intelligence while fitting on a single 80GB GPU (like the H100 or A100).
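The following is a minimal sketch of querying a self-hosted deployment. It assumes the weights are served by vLLM under the Hugging Face model id openai/gpt-oss-120b and exposed as an OpenAI-compatible endpoint on localhost:8000; the base URL, API key placeholder, and prompt are assumptions to adapt to your own stack.

```python
# Minimal sketch: querying a locally served GPT-OSS 120B instance through an
# OpenAI-compatible endpoint (e.g. a running vLLM server). Endpoint details
# below are assumptions, not part of the model itself.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server (assumption)
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of sparse MoE activation."},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```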
Key Capabilities
Adjustable Reasoning Effort:
Native support for the reasoning_effort parameter, allowing users to toggle between Low (fast/cheap), Medium (balanced), and High (deep analytical thinking); a request-level sketch follows this list.
Full Chain-of-Thought (CoT):
Unlike closed-source models, GPT-OSS provides full transparency into its internal reasoning steps, which is critical for debugging complex agentic workflows.
Structured Outputs:
Optimized for JSON mode and function calling, achieving near-perfect reliability for API-driven agents; a tool-calling sketch also follows this list.
High-Speed Throughput:
Capable of exceeding 500 tokens/sec on optimized inference stacks, making it one of the fastest models in its weight class.
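As referenced above, reasoning depth is a per-request knob and the chain of thought can be inspected alongside the final answer. The sketch below reuses the local endpoint from the previous example; passing reasoning_effort via extra_body and reading the trace from a reasoning_content field are conventions of some serving stacks (notably vLLM's reasoning parser), so treat both as assumptions to verify against your deployment.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "How many primes are smaller than 100?"}],
    # Toggle depth per request: "low" (fast/cheap), "medium" (balanced), "high" (deep).
    extra_body={"reasoning_effort": "high"},
)

message = response.choices[0].message
# Some serving stacks expose the raw chain of thought on a separate field.
reasoning = getattr(message, "reasoning_content", None)
if reasoning:
    print("--- chain of thought ---")
    print(reasoning)
print("--- final answer ---")
print(message.content)
```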
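For the structured-output and function-calling capability, here is a hedged sketch of a single tool call against the same endpoint; the get_weather tool is hypothetical and stands in for whatever functions your agent actually exposes.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema, used only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Do I need an umbrella in Oslo today?"}],
    tools=tools,
    tool_choice="auto",
)

# The model answers with a structured tool call; the agent loop is responsible
# for executing it and returning the result as a "tool" message on the next turn.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```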
Applications & Use Cases
Agentic Workflows:
Ideally suited as the "brain" for autonomous agents that require real-time web browsing, Python code execution, and multi-step tool use.
STEM & Technical Research:
Exceptional performance in mathematics (AIME 2025: 97.9% with tools) and graduate-level science reasoning (GPQA Diamond: 80.9%).
Privacy-Sensitive Production:
A favorite for legal, financial, and healthcare sectors that require frontier-level reasoning on-premises to ensure data sovereignty.
Developer Tooling:
Perfect for repository-scale code analysis and high-volume synthetic data generation.
Model Specifications
| Category | Specification | Value |
|---|---|---|
| General | Model Provider | OpenAI |
| General | Main Use Cases | Agentic workflows, STEM & technical research, privacy-sensitive production, developer tooling |
| Intelligence | Reasoning Effort | Adaptive (Low, Medium, High) |
| Intelligence | GPQA Diamond | 80.9% |
| Memory | Max Context | 131K tokens |
| Speed | Latency (TTFT) | 0.37 s |
| Speed | Throughput | 313–544 tokens/sec |



