LLM Token Optimization: How to Stop Burning Money and Hitting Rate Limits
You’ve got your AI pipeline humming — automated tasks, scheduled jobs, maybe a few agents working in parallel. Then one morning, everything grinds to a halt: 429 Too Many Requests. Your carefully orchestrated system just hit a wall.
I’ve been running multiple AI agents on a single API account — daily content generation, code improvements, project reviews, research tasks — all on scheduled cron jobs. Here’s what I learned about keeping token costs down and staying within rate limits.
The Problem: Death by a Thousand Tokens#
Most people think about rate limits as a simple “don’t send too many requests” problem. It’s not. Modern LLM APIs have two separate limits:
- Requests per minute (RPM) — how many API calls you can make
- Tokens per minute (TPM) — how many tokens you can process
You can hit either one independently. A single massive prompt with 100k context tokens can blow your TPM limit even if you’re only making one request.
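A quick pre-flight check helps here: estimate a prompt's token count before sending it and compare it against your TPM budget. The sketch below assumes a rough 4-characters-per-token heuristic and a made-up 80k TPM limit; swap in your provider's tokenizer and published limits for real numbers.
# Rough pre-flight check against a tokens-per-minute budget.
# The ~4 chars/token ratio is a crude heuristic for English text;
# use your provider's tokenizer for accurate counts.
TPM_LIMIT = 80_000  # hypothetical limit for your account tier

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_tpm_budget(prompt: str, expected_output_tokens: int = 1000) -> bool:
    return estimate_tokens(prompt) + expected_output_tokens <= TPM_LIMIT

big_prompt = "..." * 200_000  # stand-in for a huge context blob
if not fits_in_tpm_budget(big_prompt):
    print("One request could blow the TPM limit -- trim the context first.")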
Technique 1: Stagger Your Scheduled Tasks#
The mistake: Running 5 cron jobs at the same time because “they’re all independent.”
The fix: Space them out with meaningful gaps.
# Bad — everything fires within 15 minutes
0 4 * * * task-a # 4:00 AM
5 4 * * * task-b # 4:05 AM
10 4 * * * task-c # 4:10 AM
15 4 * * * task-d # 4:15 AM
# Good — 1 hour gaps
0 3 * * * task-a # 3:00 AM
0 4 * * * task-b # 4:00 AM
0 5 * * * task-c # 5:00 AM
0 6 * * * task-d # 6:00 AM
Each task might only need 1-3 minutes of active API time, but retries and slow responses stretch that out. Crammed into a single 15-minute window, the tasks overlap and compete for the same rate-limit budget. With 1-hour gaps, they never do.
Technique 2: Use the Right Model for the Job#
Not every task needs your most powerful (and expensive) model. Here’s a practical tiering strategy:
| Task Type | Model Tier | Why |
|---|---|---|
| Creative writing, complex reasoning | Top tier (GPT-4, Claude Opus, etc.) | Quality matters |
| Code generation, structured tasks | Mid tier (Claude Sonnet, GPT-4o) | Good enough, 5-10x cheaper |
| Summarization, classification | Fast tier (Claude Haiku, GPT-4o-mini) | Speed > depth |
| Embeddings, search | Dedicated embedding models | Purpose-built, cheap |
Real example: I was running a daily game improvement task on a top-tier model. It spent most of its tokens reading source code and making targeted edits — a mid-tier model handles that just as well at a fraction of the cost.
Rule of thumb: If a task is mostly “read this, change that, verify it works,” it doesn’t need your smartest model. Save the heavy hitters for tasks requiring genuine reasoning or creativity.
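One way to make the tiering automatic is a small routing table keyed by task type. This is a minimal sketch with placeholder model names, not exact API identifiers; map them to whatever your provider calls its tiers.
# Sketch: route each task type to a model tier.
# The model names are illustrative placeholders, not real API identifiers.
MODEL_TIERS = {
    "creative": "top-tier-model",     # GPT-4 / Claude Opus class
    "code": "mid-tier-model",         # Claude Sonnet / GPT-4o class
    "summarize": "fast-tier-model",   # Claude Haiku / GPT-4o-mini class
    "embed": "embedding-model",       # dedicated embedding model
}

def pick_model(task_type: str) -> str:
    # Default to the mid tier: cheap enough for routine work, capable
    # enough that a misrouted task still produces a usable result.
    return MODEL_TIERS.get(task_type, MODEL_TIERS["code"])

pick_model("summarize")  # -> "fast-tier-model"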
Technique 3: Reduce Context Window Bloat#
The biggest hidden cost in LLM workflows is context accumulation. Every message in a conversation adds to the token count, and you’re paying for the full context on every API call.
Strategies:#
Compaction: Periodically summarize conversation history instead of carrying the full transcript. A 50k token conversation can compress to 2k tokens of summary.
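Here's a minimal compaction sketch. It assumes a token estimator like the one earlier and a summarize_with_llm() helper (any cheap, fast model works for the summary); both are placeholders for your own plumbing.
# Sketch: compress old conversation history once it grows past a threshold.
# estimate_tokens() and summarize_with_llm() are placeholders.
COMPACTION_THRESHOLD = 20_000  # tokens of history before we compress

def maybe_compact(history: list[dict]) -> list[dict]:
    total = sum(estimate_tokens(m["content"]) for m in history)
    if total < COMPACTION_THRESHOLD or len(history) <= 6:
        return history
    old, recent = history[:-6], history[-6:]  # keep recent turns verbatim
    summary = summarize_with_llm(old)         # cheap/fast model is fine here
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent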
Selective context: Don’t send your entire codebase to the model. Send only the relevant files.
# Bad: sending everything
context = read_entire_project() # 200k tokens
# Good: sending what's needed
context = read_file("src/app/page.tsx") # 2k tokens
System prompt diet: Review your system prompts. Are you including instructions the model doesn’t need for this specific task? A 5k token system prompt repeated across 100 daily API calls = 500k wasted tokens per day.
Technique 4: Batch and Deduplicate#
If multiple tasks need the same information, fetch it once and share it.
# Bad: each task independently fetches the same data
task_a() # reads database, calls API
task_b() # reads same database, calls API
task_c() # reads same database, calls API
# Good: fetch once, pass to all
data = fetch_shared_context()
task_a(data)
task_b(data)
task_c(data)
For batch-friendly APIs, send multiple prompts in a single batch request instead of individual calls. OpenAI’s Batch API gives you a 50% discount for non-time-sensitive workloads.
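For OpenAI's Batch API specifically, the input is a JSONL file where each line describes one request, tagged with a custom_id so you can match results back to tasks. The sketch below only builds that file; the model name and prompts are placeholders, and the upload and polling steps follow the provider's Batch API docs.
import json

prompts = [
    "Summarize yesterday's changelog.",
    "Classify these support tickets by urgency.",
]  # placeholder prompts for non-time-sensitive work

# One JSON object per line; custom_id maps each result back to its task.
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder; pick your tier
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")
# Upload this file and create the batch job per the Batch API docs;
# results arrive within the completion window at the discounted rate.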
Technique 5: Implement Exponential Backoff#
When you do hit rate limits, don’t just retry immediately — that makes it worse.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(fn, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status === 429) {
        // Wait 1s, 2s, 4s, ... capped at 60s, plus a little jitter.
        const delay = Math.min(1000 * Math.pow(2, i), 60000);
        const jitter = delay * 0.1 * Math.random();
        await sleep(delay + jitter);
        continue;
      }
      throw err;
    }
  }
  throw new Error('Max retries exceeded');
}
Key points:
- Exponential growth: 1s → 2s → 4s → 8s → 16s
- Jitter: Add randomness so parallel workers don’t all retry at the same instant
- Cap: Don’t wait longer than 60 seconds — if you’re still limited, something else is wrong
Technique 6: Monitor and Alert#
You can’t optimize what you don’t measure.
Track these metrics:
- Tokens per task — is a task suddenly using 3x more tokens? Probably a bug.
- Cost per day — set a budget alert
- Rate limit hits — if you’re seeing 429s, you need to spread load or reduce volume
- Cache hit rate — if you’re asking the same questions repeatedly, cache the answers
Most LLM providers include usage data in API responses. Log it.
const response = await api.chat(prompt);
log({
  task: 'daily-review',
  inputTokens: response.usage.prompt_tokens,
  outputTokens: response.usage.completion_tokens,
  cost: calculateCost(response.usage), // your own pricing helper
  timestamp: new Date()
});
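To turn those logs into an alert, a small daily rollup is enough. This sketch is in Python for consistency with the other examples; the budget figure is an assumption, and notify() is a stub you'd wire to Slack, email, or whatever you already use.
DAILY_BUDGET_USD = 2.00  # assumption: pick a number that would surprise you

def notify(message: str) -> None:
    print(message)  # stub; replace with a Slack webhook, email, etc.

def check_daily_spend(log_entries: list[dict]) -> None:
    # log_entries: the records written by the log() call above
    spend = sum(entry["cost"] for entry in log_entries)
    if spend > DAILY_BUDGET_USD:
        notify(f"LLM spend is ${spend:.2f} today, over the ${DAILY_BUDGET_USD:.2f} budget")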
Technique 7: Cache Aggressively#
Some queries are functionally identical across runs. Web search results, database schemas, file contents that haven’t changed — cache them.
import hashlib, time

CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt, cache_store):
    # cache_store: any get/set store (in-memory dict wrapper, Redis, etc.)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache_store.get(key)
    if cached and time.time() - cached['ts'] < CACHE_TTL:
        return cached['result']
    result = llm.complete(prompt)  # your LLM client call
    cache_store.set(key, {'result': result, 'ts': time.time()})
    return result
This is especially effective for:
- Embedding generation (same text = same embedding)
- Classification tasks (same input = same category)
- Template-based outputs where the variable parts are small
The Numbers#
After applying these techniques to my own setup:
| Metric | Before | After |
|---|---|---|
| Daily API calls | ~150 | ~60 |
| Daily token usage | ~2M tokens | ~800k tokens |
| Rate limit errors | 3-5/day | 0 |
| Monthly cost | ~$45 | ~$18 |
The biggest wins came from model tiering (switching non-critical tasks to cheaper models) and schedule staggering (eliminating rate limit collisions entirely).
TL;DR#
- Stagger scheduled tasks — minimum 30-60 minute gaps
- Use cheaper models for routine tasks — save premium models for complex reasoning
- Trim context windows — don’t send 100k tokens when 5k will do
- Batch similar work — fetch shared data once
- Exponential backoff — don’t hammer a rate-limited API
- Monitor everything — track tokens per task, cost per day, error rates
- Cache repeated queries — same input = same output, skip the API call
Rate limits aren’t the enemy. They’re a signal that you’re not being efficient enough. Fix the efficiency, and the limits stop mattering.