LLM Token Optimization: How to Stop Burning Money and Hitting Rate Limits
You’ve got your AI pipeline humming — automated tasks, scheduled jobs, maybe a few agents working in parallel. Then one morning, everything grinds to a halt: 429 Too Many Requests. Your carefully orchestrated system just hit a wall.
I’ve been running multiple AI agents on a single API account — daily content generation, code improvements, project reviews, research tasks — all on scheduled cron jobs. Here’s what I learned about keeping token costs down and staying within rate limits.
The Problem: Death by a Thousand Tokens#
Most people think about rate limits as a simple “don’t send too many requests” problem. It’s not. Modern LLM APIs have two separate limits:
- Requests per minute (RPM) — how many API calls you can make
- Tokens per minute (TPM) — how many tokens you can process
You can hit either one independently. A single massive prompt with 100k context tokens can blow your TPM limit even if you’re only making one request.
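A quick pre-flight check helps here: estimate a prompt's token count before sending it and compare it against your TPM budget. The sketch below assumes a rough 4-characters-per-token heuristic and a made-up 80k TPM limit; swap in your provider's tokenizer and published limits for real numbers.
# Rough pre-flight check against a tokens-per-minute budget.
# The ~4 chars/token ratio is a crude heuristic for English text;
# use your provider's tokenizer for accurate counts.
TPM_LIMIT = 80_000  # hypothetical limit for your account tier

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_tpm_budget(prompt: str, expected_output_tokens: int = 1000) -> bool:
    return estimate_tokens(prompt) + expected_output_tokens <= TPM_LIMIT

big_prompt = "..." * 200_000  # stand-in for a huge context blob
if not fits_in_tpm_budget(big_prompt):
    print("One request could blow the TPM limit -- trim the context first.")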
Technique 1: Stagger Your Scheduled Tasks#
The mistake: Running 5 cron jobs at the same time because “they’re all independent.”
The fix: Space them out with meaningful gaps.
# Bad — everything fires within 15 minutes
0 4 * * * task-a # 4:00 AM
5 4 * * * task-b # 4:05 AM
10 4 * * * task-c # 4:10 AM
15 4 * * * task-d # 4:15 AM
# Good — 1 hour gaps
0 3 * * * task-a # 3:00 AM
0 4 * * * task-b # 4:00 AM
0 5 * * * task-c # 5:00 AM
0 6 * * * task-d # 6:00 AM
Each task might only need 1-3 minutes of active API time, but retries and slow responses stretch that out. Crammed into a single 15-minute window, the tasks overlap and compete for the same rate-limit budget. With 1-hour gaps, they never do.
Technique 2: Use the Right Model for the Job#
Not every task needs your most powerful (and expensive) model. Here’s a practical tiering strategy:
| Task Type | Model Tier | Why |
|---|---|---|
| Creative writing, complex reasoning | Top tier (GPT-4, Claude Opus, etc.) | Quality matters |
| Code generation, structured tasks | Mid tier (Claude Sonnet, GPT-4o) | Good enough, 5-10x cheaper |
| Summarization, classification | Fast tier (Claude Haiku, GPT-4o-mini) | Speed > depth |
| Embeddings, search | Dedicated embedding models | Purpose-built, cheap |
Real example: I was running a daily game improvement task on a top-tier model. It spent most of its tokens reading source code and making targeted edits — a mid-tier model handles that just as well at a fraction of the cost.
Rule of thumb: If a task is mostly “read this, change that, verify it works,” it doesn’t need your smartest model. Save the heavy hitters for tasks requiring genuine reasoning or creativity.
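One way to make the tiering automatic is a small routing table keyed by task type. This is a minimal sketch with placeholder model names, not exact API identifiers; map them to whatever your provider calls its tiers.
# Sketch: route each task type to a model tier.
# The model names are illustrative placeholders, not real API identifiers.
MODEL_TIERS = {
    "creative": "top-tier-model",     # GPT-4 / Claude Opus class
    "code": "mid-tier-model",         # Claude Sonnet / GPT-4o class
    "summarize": "fast-tier-model",   # Claude Haiku / GPT-4o-mini class
    "embed": "embedding-model",       # dedicated embedding model
}

def pick_model(task_type: str) -> str:
    # Default to the mid tier: cheap enough for routine work, capable
    # enough that a misrouted task still produces a usable result.
    return MODEL_TIERS.get(task_type, MODEL_TIERS["code"])

pick_model("summarize")  # -> "fast-tier-model"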
Technique 3: Reduce Context Window Bloat#
The biggest hidden cost in LLM workflows is context accumulation. Every message in a conversation adds to the token count, and you’re paying for the full context on every API call.
Strategies:#
Compaction: Periodically summarize conversation history instead of carrying the full transcript. A 50k token conversation can compress to 2k tokens of summary.
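Here's a minimal compaction sketch. It assumes a token estimator like the one earlier and a summarize_with_llm() helper (any cheap, fast model works for the summary); both are placeholders for your own plumbing.
# Sketch: compress old conversation history once it grows past a threshold.
# estimate_tokens() and summarize_with_llm() are placeholders.
COMPACTION_THRESHOLD = 20_000  # tokens of history before we compress

def maybe_compact(history: list[dict]) -> list[dict]:
    total = sum(estimate_tokens(m["content"]) for m in history)
    if total < COMPACTION_THRESHOLD or len(history) <= 6:
        return history
    old, recent = history[:-6], history[-6:]  # keep recent turns verbatim
    summary = summarize_with_llm(old)         # cheap/fast model is fine here
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent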
Selective context: Don’t send your entire codebase to the model. Send only the relevant files.
# Bad: sending everything
context = read_entire_project() # 200k tokens
# Good: sending what's needed
context = read_file("src/app/page.tsx") # 2k tokens
System prompt diet: Review your system prompts. Are you including instructions the model doesn’t need for this specific task? A 5k token system prompt repeated across 100 daily API calls = 500k wasted tokens per day.
Technique 4: Batch and Deduplicate#
If multiple tasks need the same information, fetch it once and share it.
# Bad: each task independently fetches the same data
task_a() # reads database, calls API
task_b() # reads same database, calls API
task_c() # reads same database, calls API
# Good: fetch once, pass to all
data = fetch_shared_context()
task_a(data)
task_b(data)
task_c(data)
For batch-friendly APIs, send multiple prompts in a single batch request instead of individual calls. OpenAI’s Batch API gives you a 50% discount for non-time-sensitive workloads.
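For OpenAI's Batch API specifically, the input is a JSONL file where each line describes one request, tagged with a custom_id so you can match results back to tasks. The sketch below only builds that file; the model name and prompts are placeholders, and the upload and polling steps follow the provider's Batch API docs.
import json

prompts = [
    "Summarize yesterday's changelog.",
    "Classify these support tickets by urgency.",
]  # placeholder prompts for non-time-sensitive work

# One JSON object per line; custom_id maps each result back to its task.
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder; pick your tier
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")
# Upload this file and create the batch job per the Batch API docs;
# results arrive within the completion window at the discounted rate.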
Technique 5: Implement Exponential Backoff#
When you do hit rate limits, don’t just retry immediately — that makes it worse.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(fn, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status === 429) {
        // Wait 1s, 2s, 4s, ... capped at 60s, plus a little jitter.
        const delay = Math.min(1000 * Math.pow(2, i), 60000);
        const jitter = delay * 0.1 * Math.random();
        await sleep(delay + jitter);
        continue;
      }
      throw err;
    }
  }
  throw new Error('Max retries exceeded');
}
Key points:
- Exponential growth: 1s → 2s → 4s → 8s → 16s
- Jitter: Add randomness so parallel workers don’t all retry at the same instant
- Cap: Don’t wait longer than 60 seconds — if you’re still limited, something else is wrong
Technique 6: Monitor and Alert#
You can’t optimize what you don’t measure.
Track these metrics:
- Tokens per task — is a task suddenly using 3x more tokens? Probably a bug.
- Cost per day — set a budget alert
- Rate limit hits — if you’re seeing 429s, you need to spread load or reduce volume
- Cache hit rate — if you’re asking the same questions repeatedly, cache the answers
Most LLM providers include usage data in API responses. Log it.
const response = await api.chat(prompt);
log({
  task: 'daily-review',
  inputTokens: response.usage.prompt_tokens,
  outputTokens: response.usage.completion_tokens,
  cost: calculateCost(response.usage), // your own pricing helper
  timestamp: new Date()
});
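To turn those logs into an alert, a small daily rollup is enough. This sketch is in Python for consistency with the other examples; the budget figure is an assumption, and notify() is a stub you'd wire to Slack, email, or whatever you already use.
DAILY_BUDGET_USD = 2.00  # assumption: pick a number that would surprise you

def notify(message: str) -> None:
    print(message)  # stub; replace with a Slack webhook, email, etc.

def check_daily_spend(log_entries: list[dict]) -> None:
    # log_entries: the records written by the log() call above
    spend = sum(entry["cost"] for entry in log_entries)
    if spend > DAILY_BUDGET_USD:
        notify(f"LLM spend is ${spend:.2f} today, over the ${DAILY_BUDGET_USD:.2f} budget")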
Technique 7: Cache Aggressively#
Some queries are functionally identical across runs. Web search results, database schemas, file contents that haven’t changed — cache them.
import hashlib, time

CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt, cache_store):
    # cache_store: any get/set store (in-memory dict wrapper, Redis, etc.)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache_store.get(key)
    if cached and time.time() - cached['ts'] < CACHE_TTL:
        return cached['result']
    result = llm.complete(prompt)  # your LLM client call
    cache_store.set(key, {'result': result, 'ts': time.time()})
    return result
This is especially effective for:
- Embedding generation (same text = same embedding)
- Classification tasks (same input = same category)
- Template-based outputs where the variable parts are small
The Numbers#
After applying these techniques to my own setup:
| Metric | Before | After |
|---|---|---|
| Daily API calls | ~150 | ~60 |
| Daily token usage | ~2M tokens | ~800k tokens |
| Rate limit errors | 3-5/day | 0 |
| Monthly cost | ~$45 | ~$18 |
The biggest wins came from model tiering (switching non-critical tasks to cheaper models) and schedule staggering (eliminating rate limit collisions entirely).
TL;DR#
- Stagger scheduled tasks — minimum 30-60 minute gaps
- Use cheaper models for routine tasks — save premium models for complex reasoning
- Trim context windows — don’t send 100k tokens when 5k will do
- Batch similar work — fetch shared data once
- Exponential backoff — don’t hammer a rate-limited API
- Monitor everything — track tokens per task, cost per day, error rates
- Cache repeated queries — same input = same output, skip the API call
Rate limits aren’t the enemy. They’re a signal that you’re not being efficient enough. Fix the efficiency, and the limits stop mattering.