
The GPU infrastructure conversation in AI is dominated by the hyperscalers. Training frontier models. Building 100,000 GPU clusters. Spending billions on compute. This is interesting but irrelevant to 99% of teams building AI applications.
Most teams are not training foundation models. They are running inference on existing models, fine-tuning small models on domain-specific data, processing vision workloads, running embedding pipelines, and deploying AI-powered features that need GPUs for a few hours a day, not 24/7.
The GPU infrastructure needs of these teams are fundamentally different from the needs of a foundation model lab, and the solutions should be different too. Renting dedicated A100 clusters for a workload that peaks at 4 hours per day is like renting a warehouse to store a bicycle.
This article covers practical GPU infrastructure for the rest of us: teams that need GPUs for real work but do not need to build a supercomputer.
Before choosing infrastructure, characterize your workload along four dimensions:
Compute intensity. How much GPU compute does each job require? Running inference with a 7B parameter model is very different from fine-tuning a 70B model. Inference on small models can run on consumer-grade GPUs. Fine-tuning large models needs high-memory GPUs (A100 80GB, H100).
Duration. How long does each job run? Batch inference jobs might run for minutes. Fine-tuning jobs might run for hours. Long-running model serving might run continuously. Duration determines whether on-demand or reserved instances make economic sense.
Concurrency. How many jobs run simultaneously? A single data scientist running experiments has different needs than a production pipeline processing 10,000 inference requests per minute. Concurrency determines whether you need a cluster or a single GPU.
Burstiness. Does your workload have peaks and valleys? If you process embeddings in nightly batches, you need GPUs for 2 hours a day. If you serve real-time inference, you need GPUs during business hours. If your workload is truly constant, reserved instances make sense. If it is bursty, on-demand or serverless compute saves money.
Map your workload to these dimensions before making infrastructure decisions. The mismatch between workload characteristics and infrastructure choices is the primary source of wasted GPU spend.
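As a rough illustration (not a rule from any provider), here is one way to encode the four dimensions into a first-pass infrastructure suggestion. The thresholds are assumptions you should tune against your own cost data.
# Sketch: map workload characteristics to a first-pass infrastructure choice.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    gpu_memory_gb: int     # compute intensity proxy
    hours_per_day: float   # duration
    concurrent_jobs: int   # concurrency
    bursty: bool           # burstiness

def first_pass_choice(w: WorkloadProfile) -> str:
    if not w.bursty and w.hours_per_day >= 20:
        return "reserved instances / savings plan for the baseline"
    if w.bursty and w.hours_per_day <= 4:
        return "serverless GPU or spot with scale-to-zero"
    return "on-demand or spot with aggressive auto-scaling"

print(first_pass_choice(WorkloadProfile(gpu_memory_gb=12, hours_per_day=2,
                                        concurrent_jobs=1, bursty=True)))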
The major cloud providers offer GPU instances with various configurations. This is the default choice for most teams, and it is reasonable but often over-provisioned.
When to use: You need reliable, managed infrastructure with integration into your existing cloud ecosystem. You want standard tooling (Docker, Kubernetes) and do not want to manage hardware.
What to watch for:
Availability. GPU instances, especially high-end ones (A100, H100), have limited availability. In high-demand periods, you may not be able to launch the instance type you need. If your workload is time-sensitive, availability is a risk factor.
Cost. Cloud GPU instances are expensive by the hour. An A100 instance on AWS (p4d.24xlarge) costs roughly $30-40 per hour depending on region and pricing model. If you run this 24/7, you are spending $20,000-30,000 per month. For workloads that do not need 24/7 compute, this is waste.
Right-sizing. Teams often provision more GPU memory and compute than they need because GPU instance types are coarse-grained. If your model needs 12GB of GPU memory, you might end up renting a 40GB or 80GB GPU because that is the available instance type.
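To put rough numbers on the cost points above (illustrative figures in the range cited, not a quote from any provider):
# Back-of-the-envelope: always-on A100 instance vs. a 4-hour-a-day workload.
hourly_rate = 32.0          # assumed on-demand rate, in the $30-40/hr range
hours_per_month = 730

always_on = hourly_rate * hours_per_month      # ~$23,360/month
four_hours = hourly_rate * 4 * 30              # ~$3,840/month
print(f"24/7: ${always_on:,.0f}/mo   4h/day: ${four_hours:,.0f}/mo")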
Cost optimization strategies:
Spot/preemptible instances for fault-tolerant workloads (batch inference, fine-tuning with checkpointing). Savings of 60-90% over on-demand pricing. The catch is that instances can be reclaimed with short notice, so your workload needs to handle interruptions gracefully. This is where the checkpoint/resume pattern pays for itself directly (a minimal sketch follows this list).
Reserved instances or savings plans for baseline workloads that run continuously. Savings of 30-60% over on-demand for 1-3 year commitments. Only commit to what you consistently use – use on-demand or spot for peaks.
Aggressive auto-scaling that spins down GPU instances when not in use. If your inference service has zero traffic at night, scale to zero GPUs. The cold start latency when scaling back up (1-5 minutes for GPU instances) is acceptable for many workloads.
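The checkpoint/resume pattern itself is not exotic. A minimal sketch, assuming PyTorch and a checkpoint path that survives the instance (shared volume or object store; the path and interval here are placeholders):
# Save and restore training state so a spot reclaim costs at most N steps of work.
import os
import torch

CKPT_PATH = "/mnt/shared/finetune-checkpoint.pt"   # must outlive the instance

def save_checkpoint(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer) -> int:
    if not os.path.exists(CKPT_PATH):
        return 0                                   # fresh run
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]                           # resume from here

# In the training loop: resume at load_checkpoint(...), then call
# save_checkpoint(...) every few hundred steps.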
Serverless GPU platforms abstract away instance management entirely. You deploy a function or container, and the platform provisions GPU compute on demand.
When to use: Bursty workloads where you need GPUs for minutes or hours, not days. Batch processing jobs. Development and experimentation where you want to pay per second, not per hour.
Advantages: Per-second billing, automatic scaling down to zero when there is no work, and no instances to provision or manage.
What to watch for: Cold-start latency when the platform provisions GPU capacity on demand, and the fact that for steady, high-volume serving a well-utilized dedicated instance can work out cheaper than per-second pricing.
Modal has become our go-to for batch processing and experimentation at CONFLICT. The developer experience is excellent – define a function with GPU requirements, deploy it, and it scales automatically. For production inference serving, we evaluate on a case-by-case basis whether serverless or managed instances make more economic sense.
# Modal example: GPU batch processing
import modal

app = modal.App("embedding-pipeline")

# The container image needs the embedding library installed.
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(image=image, gpu="A10G", timeout=3600)
def process_batch(documents: list[str]) -> list[list[float]]:
    # Import inside the function so it resolves in the remote container.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(documents, batch_size=64)
    return embeddings.tolist()
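Calling it is just as lightweight. The entrypoint below is a usage sketch, not part of the original pipeline; the two documents and the single chunk are placeholders for mapping over a real corpus.
# Hypothetical usage: `modal run <file>.py` fans chunks out to GPU containers.
@app.local_entrypoint()
def main():
    docs = ["GPU infrastructure for inference", "Spot instances and checkpointing"]
    results = list(process_batch.map([docs]))   # one chunk here; pass many in practice
    print(len(results[0]), "embeddings computed")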
Ray is an open-source framework for distributed computing that handles GPU workload distribution, scheduling, and fault tolerance. It runs on your own infrastructure (cloud instances, on-premises) or through managed services (Anyscale).
When to use: You have multiple GPU workloads that need to share a pool of GPU resources efficiently. You want fine-grained control over scheduling and resource allocation. You need distributed training or inference across multiple GPUs.
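As a rough sketch of the scheduling model (not from this article; the model and shards are illustrative), each task declares its GPU requirement and Ray queues work until a GPU in the pool is free:
# Minimal Ray sketch: tasks declare GPU needs; the scheduler packs them onto the pool.
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote(num_gpus=1)
def embed_shard(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts, batch_size=64).tolist()

shards = [["doc one", "doc two"], ["doc three", "doc four"]]
futures = [embed_shard.remote(shard) for shard in shards]  # queued until GPUs free up
results = ray.get(futures)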
Advantages: A shared pool of GPUs across workloads instead of per-job instances, fine-grained scheduling and resource allocation, and built-in fault tolerance for distributed jobs.
What to watch for: Unless you use a managed service such as Anyscale, you are operating the cluster yourself, and that overhead only pays off once you have enough concurrent GPU workloads to share the pool.
Managed ML platforms (SageMaker, Vertex AI, Azure ML) provide end-to-end infrastructure for model training, tuning, and deployment.
When to use: You want managed infrastructure with built-in model management, experiment tracking, and deployment pipelines. Your team is focused on model development rather than infrastructure.
Advantages: Built-in model management, experiment tracking, and deployment pipelines, integrated so the team spends its time on models rather than on wiring infrastructure together.
What to watch for: Managed platforms typically price GPU capacity at a premium over the equivalent raw instances, and the platform does not right-size or scale down for you; the utilization discipline described below still applies.
GPU cost management is not a one-time optimization. It is an ongoing discipline that requires instrumentation, monitoring, and regular review.
Instrument GPU utilization. Track actual GPU utilization (compute and memory) for every workload. Most teams are surprised to find that their GPU utilization averages 20-40%. A GPU that is 30% utilized is 70% wasted.
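One way to sample utilization, assuming NVIDIA GPUs and the nvidia-ml-py (pynvml) bindings; nvidia-smi exposes the same counters from the shell:
# Sample GPU compute and memory utilization once a minute.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"gpu={util.gpu}%  memory={mem.used / mem.total:.0%}")
    time.sleep(60)
pynvml.nvmlShutdown()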
Track cost per unit of work. Measure cost per inference, cost per fine-tuning run, cost per embedding batch. These metrics let you compare infrastructure options on a like-for-like basis and identify workloads where optimization has the most impact.
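The calculation is simple; the discipline is in actually recording it. A sketch with made-up numbers:
# The same GPU hour costs the same at 30% or 90% load, so low throughput
# shows up directly in cost per inference.
def cost_per_unit(hourly_rate: float, units_per_hour: float) -> float:
    return hourly_rate / units_per_hour

print(cost_per_unit(4.0, 20_000))   # well-utilized: $0.0002 per inference
print(cost_per_unit(4.0, 6_000))    # under-utilized: ~$0.00067 per inference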
Implement auto-scaling with scale-to-zero. For inference serving, scale GPU instances based on actual request volume. For batch workloads, provision GPUs only when jobs are queued. The ability to scale to zero when there is no work eliminates idle cost entirely.
Right-size your GPU instances. If your model fits in 8GB of GPU memory, do not rent a 40GB GPU. Use profiling to understand your actual GPU memory and compute requirements, and select the smallest instance type that meets them.
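A rough way to measure what you actually need, assuming PyTorch; the model and batch size below stand in for your real workload:
# Run a representative batch and read back peak GPU memory before choosing hardware.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).to(device)

torch.cuda.reset_peak_memory_stats(device)
with torch.no_grad():
    model(torch.randn(64, 4096, device=device))
peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
print(f"peak GPU memory: {peak_gb:.2f} GB")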
Use spot instances for everything that can tolerate interruption. Fine-tuning with checkpointing, batch inference, embedding generation, data preprocessing – all of these can run on spot instances with significant cost savings. The checkpoint/resume pattern is not just a reliability feature; it is a cost optimization enabler.
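On AWS, for example, a spot reclaim is announced roughly two minutes in advance through the instance metadata service. A sketch of watching for it (IMDSv1 shown for brevity; IMDSv2 needs a session token, and the checkpoint hook is whatever your training loop already uses):
# Poll the spot interruption notice; a 404 means no interruption is scheduled.
import requests

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        return requests.get(NOTICE_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

# Inside the training loop, roughly every few seconds:
#     if interruption_pending():
#         save_checkpoint(...)   # then exit and let the job resume elsewhere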
Review monthly. GPU costs can change significantly as your workload evolves, as cloud providers update pricing, and as new instance types become available. Monthly cost reviews prevent drift from your budget.
For a typical team running inference serving, periodic fine-tuning, and batch processing, we recommend:
Inference serving: Managed endpoint (SageMaker, Vertex AI) or serverless GPU (Modal, Replicate) with auto-scaling. Choose based on your latency requirements and cost sensitivity. For latency-sensitive applications, a managed endpoint with warm instances. For latency-tolerant applications, serverless with cold-start tolerance.
Fine-tuning: On-demand or spot cloud GPU instances provisioned for the duration of the training job. Use checkpointing so spot interruptions do not lose progress. Tear down instances immediately after training completes.
Batch processing: Serverless GPU (Modal) or spot instances with auto-scaling. Scale up for the batch window, process, scale down to zero. Pay only for the processing time.
Development and experimentation: Serverless GPU platforms for quick iteration. Developers should be able to run GPU experiments without provisioning infrastructure or waiting for instance availability.
             +----------------------+
             |    Request Router    |
             +----------------------+
                        |
           +------------+--------------+
           |                           |
+----------------------+   +----------------------+
| Real-time Inference  |   | Batch Processing     |
| (Managed Endpoint)   |   | (Serverless GPU)     |
| Auto-scaling         |   | Scale-to-zero        |
+----------------------+   +----------------------+

+----------------------+   +----------------------+
| Fine-tuning Jobs     |   | Dev/Experiment       |
| (Spot Instances)     |   | (Serverless GPU)     |
| With Checkpointing   |   | Per-second billing   |
+----------------------+   +----------------------+
Over-provisioning from the start. Start with the smallest GPU that handles your workload. Scale up when you have data showing you need more, not when you anticipate you might.
Running GPUs 24/7 for bursty workloads. If your GPU utilization is under 50%, you are likely over-provisioned. Implement auto-scaling before optimizing anything else.
Ignoring spot instances. The savings are too significant to ignore. Any workload that can checkpoint and resume is a candidate for spot instances. This includes most fine-tuning, batch processing, and embedding generation workloads.
Choosing infrastructure before understanding the workload. Characterize your workload first. Duration, concurrency, burstiness, and compute requirements should drive your infrastructure choice, not the other way around.
Not tracking GPU cost per unit of work. Without this metric, you cannot compare options or identify optimization opportunities. Instrument cost tracking from day one.
You do not need OpenAI’s infrastructure to build useful AI applications. You need the right GPU infrastructure for your specific workload, managed with the same discipline you apply to any other operational cost. Start small, measure everything, and scale based on data, not assumptions.