
The GPU infrastructure conversation in AI is dominated by the hyperscalers. Training frontier models. Building 100,000 GPU clusters. Spending billions on compute. This is interesting but irrelevant to 99% of teams building AI applications.

Most teams are not training foundation models. They are running inference on existing models, fine-tuning small models on domain-specific data, processing vision workloads, running embedding pipelines, and deploying AI-powered features that need GPUs for a few hours a day, not 24/7.

The GPU infrastructure needs of these teams are fundamentally different from the needs of a foundation model lab, and the solutions should be different too. Renting dedicated A100 clusters for a workload that peaks at 4 hours per day is like renting a warehouse to store a bicycle.

This article covers practical GPU infrastructure for the rest of us: teams that need GPUs for real work but do not need to build a supercomputer.

Understanding Your GPU Workload

Before choosing infrastructure, characterize your workload along four dimensions:

Compute intensity. How much GPU compute does each job require? Running inference with a 7B parameter model is very different from fine-tuning a 70B model. Inference on small models can run on consumer-grade GPUs. Fine-tuning large models needs high-memory GPUs (A100 80GB, H100).

Duration. How long does each job run? Batch inference jobs might run for minutes. Fine-tuning jobs might run for hours. Long-running model serving might run continuously. Duration determines whether on-demand or reserved instances make economic sense.

Concurrency. How many jobs run simultaneously? A single data scientist running experiments has different needs than a production pipeline processing 10,000 inference requests per minute. Concurrency determines whether you need a cluster or a single GPU.

Burstiness. Does your workload have peaks and valleys? If you process embeddings in nightly batches, you need GPUs for 2 hours a day. If you serve real-time inference, you need GPUs during business hours. If your workload is truly constant, reserved instances make sense. If it is bursty, on-demand or serverless compute saves money.

Map your workload to these dimensions before making infrastructure decisions. The mismatch between workload characteristics and infrastructure choices is the primary source of wasted GPU spend.
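
To make the mapping concrete, here is a small sketch that captures the four dimensions in a profile and turns them into a starting recommendation. The WorkloadProfile fields and the thresholds are illustrative assumptions, not figures from any provider; treat it as a thinking aid, not a decision engine.

# Sketch: mapping workload dimensions to a starting infrastructure choice
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    gpu_memory_gb: int         # compute intensity proxy: memory the model needs
    job_duration_hours: float  # typical duration of a single job
    concurrent_jobs: int       # how many jobs run at once
    busy_hours_per_day: float  # burstiness: hours per day the GPUs do real work

def rough_recommendation(w: WorkloadProfile) -> str:
    # Illustrative thresholds only; validate against your own cost data.
    if w.busy_hours_per_day >= 20:
        return "reserved or savings-plan instances for the constant baseline"
    if w.job_duration_hours < 1 and w.busy_hours_per_day < 6:
        return "serverless GPU with per-second billing"
    return "on-demand or spot instances, torn down when the job finishes"

print(rough_recommendation(WorkloadProfile(12, 0.5, 2, 2.0)))
# -> serverless GPU with per-second billing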

The Infrastructure Options

Cloud GPU Instances (AWS, GCP, Azure)

The major cloud providers offer GPU instances with various configurations. This is the default choice for most teams, and it is reasonable but often over-provisioned.

When to use: You need reliable, managed infrastructure with integration into your existing cloud ecosystem. You want standard tooling (Docker, Kubernetes) and do not want to manage hardware.

What to watch for:

Availability. GPU instances, especially high-end ones (A100, H100), have limited availability. In high-demand periods, you may not be able to launch the instance type you need. If your workload is time-sensitive, availability is a risk factor.

Cost. Cloud GPU instances are expensive by the hour. An A100 instance on AWS (p4d.24xlarge) costs roughly $30-40 per hour depending on region and pricing model. If you run this 24/7, you are spending $20,000-30,000 per month. For workloads that do not need 24/7 compute, this is waste.

Right-sizing. Teams often provision more GPU memory and compute than they need because GPU instance types are coarse-grained. If your model needs 12GB of GPU memory, you might end up renting a 40GB or 80GB GPU because that is the available instance type.

Cost optimization strategies:

  • Spot/preemptible instances for fault-tolerant workloads (batch inference, fine-tuning with checkpointing). Savings of 60-90% over on-demand pricing. The catch is that instances can be reclaimed with short notice, so your workload needs to handle interruptions gracefully. This is where the checkpoint/resume pattern pays for itself directly (see the sketch after this list).

  • Reserved instances or savings plans for baseline workloads that run continuously. Savings of 30-60% over on-demand for 1-3 year commitments. Only commit to what you consistently use – use on-demand or spot for peaks.

  • Aggressive auto-scaling that spins down GPU instances when not in use. If your inference service has zero traffic at night, scale to zero GPUs. The cold start latency when scaling back up (1-5 minutes for GPU instances) is acceptable for many workloads.
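
The checkpoint/resume pattern is what makes the spot-instance item above safe. Below is a minimal sketch using PyTorch; the model, optimizer, and data loader are placeholders, and the local checkpoint.pt path stands in for whatever durable storage (S3, EBS) you would use in practice.

# Sketch: checkpoint/resume training loop for spot instances (PyTorch)
import os
import torch

CKPT_PATH = "checkpoint.pt"  # assumption: point this at durable storage in real use

def train(model, optimizer, data_loader, epochs: int):
    start_epoch = 0
    if os.path.exists(CKPT_PATH):
        # Resume after a spot interruption instead of starting over.
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, epochs):
        for batch in data_loader:
            optimizer.zero_grad()
            loss = model(batch).mean()  # placeholder loss
            loss.backward()
            optimizer.step()

        # Persist progress every epoch so an interruption costs at most one epoch.
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            CKPT_PATH,
        )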

Serverless GPU (Modal, Replicate, Banana, RunPod)

Serverless GPU platforms abstract away instance management entirely. You deploy a function or container, and the platform provisions GPU compute on demand.

When to use: Bursty workloads where you need GPUs for minutes or hours, not days. Batch processing jobs. Development and experimentation where you want to pay per second, not per hour.

Advantages:

  • Pay-per-second billing means you only pay for actual compute time, not idle time.
  • No instance management; auto-scaling is handled by the platform.
  • Quick iteration: deploy code, run it, get results, pay only for the execution time.
  • Many platforms support container-based deployment, so you can bring your existing Docker images.

What to watch for:

  • Cold start latency. Loading a model into GPU memory takes time (10 seconds to several minutes depending on model size). For real-time inference, cold starts are a problem. Warm instance pools mitigate this but add cost.
  • Vendor lock-in. Each serverless GPU platform has its own deployment model and SDK. Switching platforms requires code changes.
  • Cost at scale. Serverless pricing per GPU-second is higher than reserved instance pricing. At high utilization rates, serverless costs more than dedicated instances. The breakeven depends on your utilization pattern.
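
The breakeven is simple arithmetic once you know your utilization. The rates in the sketch below are assumed placeholders, not quotes from any provider; plug in your own numbers.

# Sketch: serverless vs. dedicated breakeven at a given utilization
HOURS_PER_MONTH = 730

dedicated_rate = 4.00     # $/hour for a reserved/dedicated GPU (assumed)
serverless_rate = 0.0020  # $/GPU-second on a serverless platform (assumed)

def monthly_cost(busy_hours_per_month: float) -> tuple[float, float]:
    dedicated = dedicated_rate * HOURS_PER_MONTH                 # paid busy or idle
    serverless = serverless_rate * busy_hours_per_month * 3600   # paid only when busy
    return dedicated, serverless

for busy in (50, 200, 500):
    d, s = monthly_cost(busy)
    print(f"{busy:>3} busy hrs/month: dedicated ${d:,.0f} vs serverless ${s:,.0f}")
# With these assumed rates, serverless wins below roughly 400 busy hours per month.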

Modal has become our go-to for batch processing and experimentation at CONFLICT. The developer experience is excellent – define a function with GPU requirements, deploy it, and it scales automatically. For production inference serving, we evaluate on a case-by-case basis whether serverless or managed instances make more economic sense.

# Modal example: GPU batch processing
import modal

# The remote container needs sentence-transformers installed in its image.
image = modal.Image.debian_slim().pip_install("sentence-transformers")

app = modal.App("embedding-pipeline")

@app.function(gpu="A10G", image=image, timeout=3600)
def process_batch(documents: list[str]) -> list[list[float]]:
    # Import inside the function so the dependency only needs to exist remotely.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(documents, batch_size=64)
    return embeddings.tolist()
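
To run it, call process_batch.remote(batch) from another Modal function or a local entrypoint, or fan a large corpus out across containers with process_batch.map(batches); each container is billed only for the seconds it runs.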

Ray and Distributed Compute

Ray is an open-source framework for distributed computing that handles GPU workload distribution, scheduling, and fault tolerance. It runs on your own infrastructure (cloud instances, on-premises) or through managed services (Anyscale).

When to use: You have multiple GPU workloads that need to share a pool of GPU resources efficiently. You want fine-grained control over scheduling and resource allocation. You need distributed training or inference across multiple GPUs.

Advantages:

  • Resource pooling: multiple workloads share GPU resources without each needing dedicated instances.
  • Built-in fault tolerance and checkpointing.
  • Support for heterogeneous GPU types in the same cluster.
  • Good integration with PyTorch, TensorFlow, and Hugging Face.

What to watch for:

  • Operational complexity. Running a Ray cluster requires managing nodes, networking, storage, and monitoring. This is infrastructure engineering work.
  • Learning curve. Ray’s programming model is powerful but different from standard Python. Your team needs to invest in learning it.
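
To make the resource-pooling idea concrete, here is a minimal Ray sketch: two tasks declare different GPU requirements and Ray schedules them onto whatever GPUs the cluster has, queueing the rest. The fractional GPU values and workloads are illustrative, and the model-loading details are elided.

# Sketch: sharing a GPU pool across tasks with Ray
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote(num_gpus=0.5)  # two of these can share one physical GPU
def embed_chunk(texts: list[str]) -> int:
    # Load a small model and embed the chunk (details elided).
    return len(texts)

@ray.remote(num_gpus=1)    # heavier task gets a whole GPU from the same pool
def fine_tune_step(shard_id: int) -> int:
    return shard_id

# Work that exceeds the available GPUs is queued and runs as capacity frees up.
futures = [embed_chunk.remote(["doc"] * 100) for _ in range(8)]
futures.append(fine_tune_step.remote(0))
print(ray.get(futures))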

Amazon SageMaker and Managed ML Platforms

Managed ML platforms (SageMaker, Vertex AI, Azure ML) provide end-to-end infrastructure for model training, tuning, and deployment.

When to use: You want managed infrastructure with built-in model management, experiment tracking, and deployment pipelines. Your team is focused on model development rather than infrastructure.

Advantages:

  • Managed training jobs: specify your code and data, the platform handles instance provisioning, training, and teardown.
  • Built-in model hosting with auto-scaling endpoints.
  • Integration with the cloud provider’s ecosystem (storage, monitoring, IAM).

What to watch for:

  • Cost. Managed platforms charge a premium over raw compute instances. SageMaker inference endpoints, for example, cost more than equivalent EC2 instances.
  • Flexibility constraints. Managed platforms impose constraints on how you deploy and serve models. Custom serving logic or non-standard model formats may not fit well.
  • Lock-in. Deep integration with a specific cloud provider’s ML platform makes migration expensive.
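
As one illustration of the managed-training model, the SageMaker Python SDK sketch below submits a training job on spot capacity with checkpointing. The role ARN, S3 paths, and version strings are placeholder assumptions; verify them against current SageMaker documentation before use.

# Sketch: a managed SageMaker training job on spot capacity (SageMaker Python SDK)
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role ARN
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",                               # example versions
    py_version="py310",
    use_spot_instances=True,                               # managed spot training
    max_run=3600,                                          # cap on training time (seconds)
    max_wait=7200,                                         # cap on training + spot wait time
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",  # placeholder bucket
)

# SageMaker provisions the instance, runs train.py, and tears everything down.
estimator.fit({"training": "s3://example-bucket/data/"})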

Cost Management: The Practical Discipline

GPU cost management is not a one-time optimization. It is an ongoing discipline that requires instrumentation, monitoring, and regular review.

Instrument GPU utilization. Track actual GPU utilization (compute and memory) for every workload. Most teams are surprised to find that their GPU utilization averages 20-40%. A GPU that is 30% utilized is 70% wasted.
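
A lightweight way to get these numbers is NVIDIA's NVML bindings (installed as nvidia-ml-py, imported as pynvml). The sketch below just prints samples; in practice you would export them to your metrics system. The sampling interval is arbitrary.

# Sketch: sampling GPU compute and memory utilization with NVML
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(5):  # a few samples for illustration
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"compute {util.gpu}% | memory {mem.used / mem.total:.0%} of {mem.total >> 30} GiB")
    time.sleep(10)

pynvml.nvmlShutdown()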

Track cost per unit of work. Measure cost per inference, cost per fine-tuning run, cost per embedding batch. These metrics let you compare infrastructure options on a like-for-like basis and identify workloads where optimization has the most impact.
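
The metric itself is simple division; the hourly rate and throughput below are assumed values for illustration.

# Sketch: cost per 1,000 inferences from hourly rate and measured throughput
hourly_rate = 1.20          # $/hour for the GPU serving the model (assumed)
throughput_per_second = 35  # measured requests/second at steady state (assumed)

requests_per_hour = throughput_per_second * 3600
cost_per_1k = hourly_rate / requests_per_hour * 1000
print(f"${cost_per_1k:.4f} per 1,000 inferences")  # ~$0.0095 with these numbers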

Implement auto-scaling with scale-to-zero. For inference serving, scale GPU instances based on actual request volume. For batch workloads, provision GPUs only when jobs are queued. The ability to scale to zero when there is no work eliminates idle cost entirely.

Right-size your GPU instances. If your model fits in 8GB of GPU memory, do not rent a 40GB GPU. Use profiling to understand your actual GPU memory and compute requirements, and select the smallest instance type that meets them.
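
For the memory side of right-sizing, PyTorch can report the peak GPU memory your workload actually touches. The model and batch below are placeholders; substitute your own inference or training step.

# Sketch: measuring peak GPU memory to right-size the instance (PyTorch)
import torch
from torchvision.models import resnet50

device = torch.device("cuda")
torch.cuda.reset_peak_memory_stats(device)

model = resnet50().to(device)                         # placeholder model
batch = torch.randn(32, 3, 224, 224, device=device)   # placeholder batch
with torch.no_grad():
    model(batch)

peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
print(f"peak GPU memory: {peak_gib:.1f} GiB")
# If this reports ~2 GiB, a 16 GB GPU is plenty; an 80 GB A100 would be waste.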

Use spot instances for everything that can tolerate interruption. Fine-tuning with checkpointing, batch inference, embedding generation, data preprocessing – all of these can run on spot instances with significant cost savings. The checkpoint/resume pattern is not just a reliability feature; it is a cost optimization enabler.

Review monthly. GPU costs can change significantly as your workload evolves, as cloud providers update pricing, and as new instance types become available. Monthly cost reviews prevent drift from your budget.

A Practical Architecture

For a typical team running inference serving, periodic fine-tuning, and batch processing, we recommend:

Inference serving: Managed endpoint (SageMaker, Vertex AI) or serverless GPU (Modal, Replicate) with auto-scaling. Choose based on your latency requirements and cost sensitivity. For latency-sensitive applications, a managed endpoint with warm instances. For latency-tolerant applications, serverless with cold-start tolerance.

Fine-tuning: On-demand or spot cloud GPU instances provisioned for the duration of the training job. Use checkpointing so spot interruptions do not lose progress. Tear down instances immediately after training completes.

Batch processing: Serverless GPU (Modal) or spot instances with auto-scaling. Scale up for the batch window, process, scale down to zero. Pay only for the processing time.

Development and experimentation: Serverless GPU platforms for quick iteration. Developers should be able to run GPU experiments without provisioning infrastructure or waiting for instance availability.

                    +--------------------+
                    |   Request Router   |
                    +--------------------+
                              |
              +---------------+---------+
              |                         |
    +--------------------+   +--------------------+
    | Real-time Inference|   | Batch Processing   |
    | (Managed Endpoint) |   | (Serverless GPU)   |
    | Auto-scaling       |   | Scale-to-zero      |
    +--------------------+   +--------------------+

    +--------------------+   +--------------------+
    | Fine-tuning Jobs   |   | Dev/Experiment     |
    | (Spot Instances)   |   | (Serverless GPU)   |
    | With Checkpointing |   | Per-second billing |
    +--------------------+   +--------------------+

Common Mistakes

Over-provisioning from the start. Start with the smallest GPU that handles your workload. Scale up when you have data showing you need more, not when you anticipate you might.

Running GPUs 24/7 for bursty workloads. If your GPU utilization is under 50%, you are likely over-provisioned. Implement auto-scaling before optimizing anything else.

Ignoring spot instances. The savings are too significant to ignore. Any workload that can checkpoint and resume is a candidate for spot instances. This includes most fine-tuning, batch processing, and embedding generation workloads.

Choosing infrastructure before understanding the workload. Characterize your workload first. Duration, concurrency, burstiness, and compute requirements should drive your infrastructure choice, not the other way around.

Not tracking GPU cost per unit of work. Without this metric, you cannot compare options or identify optimization opportunities. Instrument cost tracking from day one.

You do not need OpenAI’s infrastructure to build useful AI applications. You need the right GPU infrastructure for your specific workload, managed with the same discipline you apply to any other operational cost. Start small, measure everything, and scale based on data, not assumptions.