At Blue Oak Interactive, we build digital products for clients who care about security, performance, and control. That philosophy extends to how we use AI internally.
We use API providers and frontier models for everyday tasks where convenience makes sense. But when we are dealing with sensitive data, proprietary workflows, client code, or internal intellectual property, everything runs on hardware we own, in a room we control, behind our own firewall. Zero data ever leaves our network.
Here is how we built it.
Data Sovereignty Without Compromise
The obvious choice for most teams is to route everything through an API provider. It is convenient. But convenience comes with trade-offs: your prompts, your codebases, and your client data all leave your control, and can end up as someone else’s training signal. For a shop that handles sensitive client work daily, that is not acceptable.
Our goal was straightforward: run state-of-the-art open-weight models locally, serve them through a unified OpenAI-compatible /v1 endpoint, and make the whole thing fast enough that we never miss the cloud APIs. We wanted to power our own agentic coding tools with models running on our own metal.
The Hardware: Two GPUs, One Mission
Our inference cluster runs two NVIDIA GPUs in parallel, each optimized for different workloads:
- NVIDIA GeForce RTX 3090 (24 GB VRAM): Our speed-first node. Compute is extremely expensive right now, but luckily a consumer 3090 is a solid fit for local inference and can be found second-hand. It punches well above its weight, running dense models with flash attention and quantized KV caches to maximize throughput.
- NVIDIA GB10 (128 GB unified memory): Our capacity node. With over five times the memory of the 3090, this ARM-based accelerator runs multiple models concurrently, including multimodal models that accept images, audio, and video as input alongside text.
Together they give us the flexibility to run different model families simultaneously while maintaining low-latency responses for interactive work.
The Model: Qwen 3.6 Dense at the Core
After benchmarking across coding, reasoning, and agentic tasks, we settled on Qwen 3.6 (27B Dense) as our primary inference model. It runs quantized to Q4_K_XL precision (16.4 GB), which preserves nearly all of the full-precision model’s capability while fitting comfortably in GPU memory with room for a 262K token context window.
Qwen 3.6 was chosen for its strengths exactly where we need them: code generation, tool use, function calling, and multi-step reasoning. These are the capabilities that matter most when AI is assisting developers rather than just chatting.
The model runs identically on both GPU nodes, giving us consistent output quality regardless of which backend serves a given request.
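For reference, here is roughly what that setup looks like when wired up with the llama-cpp-python bindings. It is a minimal sketch, assuming a GGUF build of the model: the file path, context size, and cache settings are placeholders, not our exact serving configuration.

```python
# Illustrative sketch: load a quantized GGUF model with flash attention
# and a quantized KV cache via llama-cpp-python. Paths and settings are
# placeholders, not our production config.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="/models/qwen-27b-q4_k_xl.gguf",  # placeholder path
    n_gpu_layers=-1,                  # offload every layer to the GPU
    n_ctx=262144,                     # large context window, memory permitting
    flash_attn=True,                  # flash attention for faster prefill and decoding
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the KV cache to save VRAM
    type_v=llama_cpp.GGML_TYPE_Q8_0,
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this module's public API."}]
)
print(reply["choices"][0]["message"]["content"])
```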
Load Balancing by Performance, Not Luck
Having two inference nodes is only useful if you can route intelligently between them. We built a custom load balancer that monitors each node’s real-time performance and routes requests to whichever GPU can respond fastest.
The strategy is simple but effective: when one node is handling a long context or complex reasoning task, new requests automatically flow to the other. When both are idle, requests distribute evenly. If one node becomes unavailable for maintenance or a model swap, all traffic fails over seamlessly without dropped connections.
For developers, this means no degraded performance during peak usage: the system adapts in real time.
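To make the routing concrete, here is a simplified sketch of latency-aware selection with failover. The node URLs and health endpoint are placeholders, and a production balancer tracks in-flight load and streaming latency continuously rather than probing once per request.

```python
# Simplified sketch of latency-aware routing with failover.
# Node URLs and the /health endpoint are placeholders.
import time
import requests

NODES = ["http://gpu-3090.internal:8080", "http://gpu-gb10.internal:8080"]

def probe(node: str) -> float:
    """Return the node's current health-check latency, or infinity if it is down."""
    try:
        start = time.monotonic()
        requests.get(f"{node}/health", timeout=1.0).raise_for_status()
        return time.monotonic() - start
    except requests.RequestException:
        return float("inf")  # unavailable nodes are never selected

def pick_node() -> str:
    """Route to whichever node is currently responding fastest."""
    latency, best = min((probe(n), n) for n in NODES)
    if latency == float("inf"):
        raise RuntimeError("no healthy inference nodes available")
    return best

def complete(payload: dict) -> dict:
    """Forward an OpenAI-style /v1 chat completion to the chosen node."""
    resp = requests.post(f"{pick_node()}/v1/chat/completions", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()
```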
A Model Dashboard: Swap Models Without Downtime
One of the most valuable parts of our stack is the model management dashboard we built on top of everything. It gives us a live view of which models are loaded and running, GPU memory utilization per node, available models ready to deploy, and one-click model swaps.
When we want to test a new model, say a MoE variant for better throughput or a multimodal model for vision tasks, we swap it through the dashboard. The system handles the transition: the old model unloads, the new one loads, and routing updates automatically. Developers can continue working throughout the process; the load balancer routes around the node being updated.
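In rough terms, the swap flow reduces to drain, unload, load, re-enable. The admin endpoints in this sketch are hypothetical stand-ins for whatever a model manager exposes, not our dashboard’s actual API:

```python
# Rough illustration of a zero-downtime model swap on a single node.
# The /admin/* endpoints are hypothetical placeholders.
import time
import requests

def swap_model(node: str, old_model: str, new_model: str) -> None:
    requests.post(f"{node}/admin/drain", timeout=10)     # balancer stops sending new requests
    while requests.get(f"{node}/admin/inflight", timeout=10).json()["count"] > 0:
        time.sleep(1)                                    # let in-progress generations finish
    requests.post(f"{node}/admin/unload", json={"model": old_model}, timeout=60)
    requests.post(f"{node}/admin/load", json={"model": new_model}, timeout=600)
    requests.post(f"{node}/admin/enable", timeout=10)    # balancer resumes routing here
```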
This flexibility means we are constantly evaluating new models as they are released. We have already updated our underlying models several times since implementing this stack. The industry moves fast enough that what was state-of-the-art last quarter can be superseded in weeks. We have benchmarked Qwen 3.5, Qwen 3.6 MoE variants, GLM-4.7-Flash, NVIDIA Nemotron models, and Google’s Gemma 4, all running locally and testable in production conditions before committing.
Powering Our Development Tools
The unified endpoint our stack provides is a standard OpenAI-compatible /v1 interface. That means it drops into virtually any modern AI tool that supports custom endpoints. Currently it powers:
- Pi: Our primary AI coding agent, handling multi-file edits, refactoring, and architecture decisions.
- Hermes Agent: An always-on research agent accessible over secure channels like Signal messenger, available for queries without switching contexts.
- Claude Code: Anthropic’s agentic CLI tool, pointed at our local endpoint instead of their cloud API via its ANTHROPIC_BASE_URL override, keeping all prompts on-prem.
- Custom automation: Internal tooling for code review, documentation generation, and client project analysis.
Most modern AI tools support pointing at a custom /v1 OpenAI-compatible endpoint. Whether it is Claude Code with an environment variable override or Hermes with a config file change, the pattern is the same: swap the endpoint, keep your data local.
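As an example, the standard openai Python client only needs a different base URL; the hostname and model name below are placeholders for your own deployment:

```python
# Point the standard OpenAI Python client at a local OpenAI-compatible
# /v1 endpoint. Hostname and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.internal:8080/v1",  # your local endpoint
    api_key="not-needed-on-prem",                  # most local servers ignore the key
)

response = client.chat.completions.create(
    model="qwen-27b",  # whatever model the local server has loaded
    messages=[{"role": "user", "content": "Review this function for edge cases."}],
)
print(response.choices[0].message.content)
```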
The Results
- Zero data egress: No prompts, no code, no client information ever leaves our infrastructure.
- Sub-second first-token latency: Our quantized models with flash attention respond faster than most cloud APIs for context windows under 32K tokens.
- 262K token context: Full codebases fit in context. We regularly pass entire Drupal sites or WordPress themes to our agents for analysis and modification.
- Always available: No API rate limits, no outages, no usage caps. Our models run 24/7 on dedicated hardware.
- Constantly evolving: New models are tested and deployed as they are released, keeping us at the cutting edge without vendor lock-in.
Beyond On-Prem: Secure Cloud Inference with AWS Bedrock
On-prem is our default because it gives us maximum control. But local hardware is not always practical: some organizations cannot justify the capital expenditure, or need to scale inference across distributed teams.
For those cases, we deploy frontier models through AWS Bedrock, which provides secure, isolated inference without your data being used for training. The same OpenAI-compatible endpoint pattern applies: your tools point at a Bedrock-backed proxy, and your prompts never leave AWS’s infrastructure. It does not offer the same degree of control as on-prem metal, but it is a significant step up from sending sensitive data to public APIs.
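Whether your tools go through an OpenAI-compatible proxy or call Bedrock directly, the inference call itself stays inside your AWS account. Here is a minimal sketch using boto3’s bedrock-runtime client, with a placeholder region and model ID:

```python
# Minimal sketch of calling a Bedrock-hosted model with boto3.
# Region and modelId are placeholders; requests stay inside your AWS account.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="example-model-id",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize this architecture doc."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```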
We design the approach to fit the constraint: on-prem when you want full ownership, Bedrock when you need cloud scale with security guarantees, or a hybrid of both.
Why This Matters for Our Clients
Running AI this way is not just about our internal workflow. It is about proving what is possible. When we consult with clients on their own digital strategy, we speak from experience about:
- Security-first architecture: How to keep sensitive data under your control while still leveraging modern AI capabilities.
- Performance optimization: The engineering that goes into making open-weight models competitive with proprietary alternatives.
- Operational excellence: Monitoring, failover, and maintenance patterns that keep critical infrastructure reliable.
We do not just recommend these practices. We run them in production every day.
You do not need a team of ML engineers or a six-figure cloud bill to use AI effectively. You do need someone who understands the trade-offs between speed, security, and cost, and can build the infrastructure that works for your specific situation. Whether you need an on-prem inference cluster for sensitive workflows, a secure Bedrock deployment for cloud-based teams, or just honest advice about whether AI actually solves your problem, get in touch.