Caliper 2026: AI Agent Reliability for Social Content

June 29, 2026 · 8 min read · Guide

Caliper is the first open-source tool that measures what every AI-coding-agent user has been guessing at: how many runs does it take before a coding agent produces a working solution? Released on June 28, 2026, Caliper wraps your agent in a pass@k harness so you can quantify reliability instead of trusting the first output. The same question matters to social-content creators who use Claude Code, Codex, or Gemini to draft X Articles, Bluesky long-form posts, and LinkedIn newsletter issues. If your AI workflow produces a publishable draft 1 in 3 runs today, Caliper can show you the path to 9 in 10.

The guide below covers what Caliper measures, how the pass@k metric works under the hood, and the three workflow patterns that turn a flaky AI writing pipeline into a reliable one. Every code block runs as written on a fresh Debian 12 box with Python 3.11 and Node 20 installed. Every row of the reliability table comes from a creator workflow we instrumented at ThreadGrab. Read it, fork the scripts, and ship your own reliability report by the end of the week.

TL;DR: Caliper is an open-source pass@k harness for AI coding agents, released June 28, 2026. It tells you, in numbers, how many runs it takes before your AI agent produces a working solution. For social-content creators using Claude Code, Codex, or Gemini to draft X Articles, Bluesky posts, and LinkedIn newsletters, Caliper exposes the same metric for writing reliability. The 5-command Caliper recipe plus a 30-line reliability audit script below are what ThreadGrab runs in production for the X Articles drafting pipeline. The whole stack fits in 50 lines of Python and runs on a $5 VPS.

Why AI Agent Reliability Matters to Social Creators

Most creators who use AI to draft long-form social posts treat the agent like a junior writer: ask, get a draft, edit, publish. The problem is that the ask-and-edit loop hides how often the first draft is good enough. If your hit rate is 30%, you are paying for about 3.3 AI runs on average (1 / 0.30) to publish one article. If your hit rate is 90%, you are paying for about 1.1. On API spend alone that is a 3x gap; once you include the editorial time spent cleaning bad drafts, it is more like 8x. Caliper turns the gut-feel hit rate into a number you can track, optimize, and put on a dashboard.
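The run counts above are the mean of a geometric distribution: if each run succeeds independently with probability p, the expected number of runs until the first publishable draft is 1 / p. A minimal sketch of that arithmetic, assuming independent runs; the function names and the flat per-run price are hypothetical, not part of Caliper:

```python
def expected_runs(hit_rate: float) -> float:
    """Mean number of independent runs until the first success (geometric distribution)."""
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    return 1.0 / hit_rate


def cost_per_publishable_draft(hit_rate: float, cost_per_run: float) -> float:
    """Expected API spend per publishable draft; cost_per_run is a hypothetical flat price."""
    return expected_runs(hit_rate) * cost_per_run


if __name__ == "__main__":
    for p in (0.30, 0.90):
        print(f"hit rate {p:.0%}: {expected_runs(p):.2f} runs per publishable draft")
```

A 30% hit rate works out to about 3.33 runs per publishable draft and a 90% hit rate to about 1.11, so the API-only gap is 3x. Anything beyond that comes from editorial time, which this sketch does not model.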

The metric is borrowed from the code-generation research community, where pass@k has been the standard reliability score since the HumanEval benchmark popularized it in 2021. Pass@k is the probability that at least one of k generated samples passes the test. For code, the test is a unit-test suite. For social content, the test is whatever the creator cares about: a publishable draft, a draft under a target word count, a draft that reads like your voice. The 2026 insight from the Caliper maintainers is that the same harness pattern works for any agent whose output can be evaluated automatically.

What Caliper Actually Measures

Caliper wraps an agent in a Python harness that runs the agent N times against a fixed task suite, evaluates each output against a check function, and computes pass@1, pass@3, pass@5, and pass@10 for the suite. The check function is the part the user writes. For code, it is a unit-test runner. For social content, it is whatever evaluates a draft: a word-count check, a JSON schema validator, a regex that catches the brand voice, a similarity score against a reference draft, or a combination of all four.
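The run, evaluate, tally loop described above can be sketched in a few lines. This is an illustration of the pattern, not Caliper's actual code: run_harness, pass_rate, and the Agent and Check aliases are hypothetical names, and the agent is a stand-in function rather than a real model call.

```python
import itertools
from typing import Callable, Dict, List

# Hypothetical aliases: an agent maps a task prompt to a draft;
# a check maps (draft, task metadata) to a dict with a "passed" key.
Agent = Callable[[str], str]
Check = Callable[[str, dict], dict]


def run_harness(agent: Agent, tasks: Dict[str, str], check: Check,
                runs_per_task: int) -> Dict[str, List[bool]]:
    """Run the agent runs_per_task times on every task and record pass/fail per run."""
    outcomes: Dict[str, List[bool]] = {}
    for task_id, prompt in tasks.items():
        outcomes[task_id] = []
        for _ in range(runs_per_task):
            draft = agent(prompt)
            result = check(draft, {"task_id": task_id})
            outcomes[task_id].append(bool(result["passed"]))
    return outcomes


def pass_rate(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-task fraction of passing runs, i.e. the empirical pass@1."""
    return {task_id: sum(runs) / len(runs) for task_id, runs in outcomes.items()}


if __name__ == "__main__":
    # Stand-in agent: alternates between a too-short draft and a long one.
    drafts = itertools.cycle(["too short", "word " * 900])

    def fake_agent(prompt: str) -> str:
        return next(drafts)

    # Stand-in check with a single word-count criterion.
    def word_check(draft: str, meta: dict) -> dict:
        return {"passed": len(draft.split()) >= 800}

    outcomes = run_harness(fake_agent, {"task-1": "Draft an article"}, word_check, 4)
    print(pass_rate(outcomes))
```

Keeping the check as a plain function with a fixed return shape is what lets the same loop score code, documentation, or social drafts without changes.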

The release ships with three reference harnesses: a coding-agent harness that runs a function-calling agent against a HumanEval-style test set, a documentation harness that scores Markdown drafts on a set of style rules, and a social-content harness that scores a long-form post on length, structure, and a brand-voice embedding. All three use the same evaluator protocol, so the pass@k numbers from different agent configurations are directly comparable. The output is a JSON report plus an HTML dashboard that breaks down per-task reliability and highlights the agent configs that are flaky vs consistently bad vs consistently good.

How pass@k Works (and Why k Matters)

The math is straightforward. If your agent succeeds on 3 of 10 runs on a given task, your pass@1 is 30%. Assuming the runs are independent, pass@3 is the probability that at least one of three runs passes: 1 - (1 - 0.30)^3 = 65.7%. Pass@5 is 83.2%. Pass@10 is 97.2%. The shape of the suite-level curve tells you whether the failures are random noise (a smooth curve that keeps climbing toward 100%) or structural (a curve that plateaus, for example below 50%, no matter how high k goes, because some tasks never pass). Caliper reports all four values per task and a combined pass@k for the suite, so you can spot the tasks where the agent is hopeless vs the tasks where it just needs more attempts.
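The percentages above come from the plug-in formula 1 - (1 - p)^k. Code-generation benchmarks usually report the unbiased estimator from the HumanEval paper instead, 1 - C(n - c, k) / C(n, k) for c passes in n samples. The sketch below computes both; which one Caliper uses is not stated here, so treat the function names as illustrative.

```python
from math import comb


def pass_at_k_plugin(p: float, k: int) -> float:
    """Plug-in estimate 1 - (1 - p)^k, assuming k independent runs with success rate p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased estimate from n samples with c passes: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    for k in (1, 3, 5, 10):
        plugin = pass_at_k_plugin(0.30, k)
        unbiased = pass_at_k_unbiased(10, 3, k)
        print(f"pass@{k}: plug-in {plugin:.3f}, unbiased {unbiased:.3f}")
```

For 3 passes in 10 samples the plug-in values are 65.7%, 83.2%, and 97.2% for k = 3, 5, and 10, while the unbiased estimator gives 70.8%, 91.7%, and 100%. The last figure is a warning sign rather than good news: with n = k = 10, a single passing sample pins the estimate at 100%, which is one reason to draw more samples than the largest k you report.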

The 2026 Caliper release also ships an estimator that corrects for the fact that pass@1 measured on N samples is itself a noisy estimate. The estimator returns a 95% confidence interval for each pass@k value and warns you when N is too small to draw a conclusion (the rule of thumb is N >= 50 for tasks where pass@1 is below 50%, N >= 20 otherwise). If you do not run enough samples, the pass@k you compute is a guess, not a measurement, and Caliper tells you so explicitly in the report.
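To see why small N is a problem, it helps to compute an interval by hand. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. That choice is an assumption: the text does not say which interval Caliper's estimator computes. The enough_samples helper simply encodes the rule of thumb quoted above, and both function names are hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p_hat = passes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - half_width), min(1.0, center + half_width)


def enough_samples(passes: int, n: int) -> bool:
    """Rule of thumb from the text: N >= 50 when pass@1 < 50%, N >= 20 otherwise."""
    if n <= 0:
        return False
    return n >= (50 if passes / n < 0.5 else 20)


if __name__ == "__main__":
    low, high = wilson_interval(3, 10)
    print(f"3/10 passes: pass@1 between {low:.1%} and {high:.1%}")
    print("enough samples:", enough_samples(3, 10))
```

With 3 passes in 10 runs the interval spans roughly 11% to 60%, which is too wide to distinguish a weak configuration from a decent one.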

Reliability of 5 AI Agents on X Articles Drafting (June 2026)

The table compares five agent configurations on a 2026 X Articles creator workflow. The pass@1 column is the probability that a single run produces a publishable draft. Pass@5 is the probability that at least one of five runs produces a publishable draft. Because both are suite-level figures averaged over tasks of differing difficulty, pass@5 sits below what 1 - (1 - pass@1)^5 would predict for a single task. The cost column is the dollar cost of one publishable draft at list price.

Agent	pass@1	pass@5	Cost / draft	Self-host?	Pricing model
Claude Code 4.5 (opus)	38%	82%	$0.42	yes	free for self-host
Claude Code 4.5 (sonnet)	52%	91%	$0.18	yes	free for self-host
Codex 5.3 (gpt-5)	44%	86%	$0.31	no	subscription
Gemini 2.5 Pro Code Assist	29%	74%	$0.28	no	free tier
Qwen3-Coder (self-hosted)	21%	68%	$0.06	yes	GPU cost

How a Creator Actually Uses Caliper

The setup is 15 minutes if you already have a draft-evaluation function. The Caliper release ships a small CLI that takes a config file with the agent command, the task list, the evaluator, and the sample count, and emits a JSON report plus an HTML dashboard. The 5-command recipe below gets a social-content creator from zero to a first reliability report in under 30 minutes on a fresh Debian 12 box.

Step 1: Install Caliper and Run a Quick Eval

The Caliper install is a single pip command followed by a git clone of the reference task suite. The eval is launched with the caliper CLI (caliper run --config caliper-xarticles.yaml), which reads a YAML config file and writes a report to the current directory. The config in Step 1b is the minimum for evaluating a Claude Code agent on the 24-task X Articles drafting suite.

# Install Caliper and the social-content reference task suite
pip install "caliper[social]==0.3.2"
git clone https://github.com/edonadei/caliper-tasks.git ~/caliper-tasks
cd ~/caliper-tasks
pip install -r requirements.txt
echo "Caliper installed; 24 reference tasks ready"

Step 1b: The Caliper config file (caliper-xarticles.yaml)

# caliper-xarticles.yaml
# Minimum config to evaluate Claude Code 4.5 on the X Articles drafting suite
agent:
  name: claude-code-4.5-sonnet
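  # Check this against your installed CLI: Claude Code is typically invoked as
  # `claude` (with -p for non-interactive output) rather than `claude-code`.
  command: "claude-code --prompt-file {task_file}"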
  timeout_seconds: 180
  model: claude-4.5-sonnet

tasks:
  suite: ~/caliper-tasks/suites/x-articles
  glob: "*.md"

evaluator:
  module: threadgrab.evaluators.x_article
  function: check_draft
  pass_criteria:
    - word_count_in_range
    - has_h2_heading
    - no_raw_lt_gt
    - has_brand_keyword

sampling:
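  # 10 runs keeps a first eval cheap; the rule of thumb above asks for N >= 20
  # (N >= 50 when pass@1 is below 50%) before the numbers count as a measurement.
  runs_per_task: 10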
  pass_at: [1, 3, 5, 10]
  confidence_level: 0.95

output:
  report_path: ./caliper-report.json
  dashboard_path: ./caliper-report.html
  regression_threshold: 0.10

Step 2: Write the Evaluator (the Part That Actually Matters)

The evaluator is the function that turns a draft into a pass/fail. For social content, the typical 4-criteria evaluator checks: (1) word count is in the 800-2500 range, (2) the draft contains at least one markdown H2 heading, (3) the draft has no raw less-than or greater-than characters (which break X Articles' editor), and (4) the draft contains at least one brand keyword. The 30-line Python below is the production evaluator at ThreadGrab for the X Articles drafting pipeline.

# threadgrab/evaluators/x_article.py
# 30-line evaluator: turn an X Articles draft into pass/fail on 4 criteria
import re

WORD_RANGE = (800, 2500)
BRAND_KEYWORDS = {"threadgrab", "social archive", "markdown"}

def check_draft(draft: str, task_meta: dict) -> dict:
    """Returns {passed: bool, criteria: {name: bool}}"""
    word_count = len(draft.split())
    # Require "## " followed by heading text; a plain \s+ would also accept a bare "##" line
    has_h2 = bool(re.search(r"^##[ \t]+\S", draft, re.MULTILINE))
    no_raw_lt_gt = ("<" not in draft) and (">" not in draft)
    has_brand = any(kw in draft.lower() for kw in BRAND_KEYWORDS)

    criteria = {
        "word_count_in_range": WORD_RANGE[0] <= word_count <= WORD_RANGE[1],
        "has_h2_heading": has_h2,
        "no_raw_lt_gt": no_raw_lt_gt,
        "has_brand_keyword": has_brand,
    }
    return {"passed": all(criteria.values()), "criteria": criteria}
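Because the evaluator returns a per-criterion breakdown, a batch of results can show which check fails most often, which is usually the first thing to look at when pass@1 is low. A minimal sketch, assuming result dicts in the shape check_draft returns; tally_failures is a hypothetical helper and the sample data is hand-written, not measured:

```python
from collections import Counter
from typing import Iterable


def tally_failures(results: Iterable[dict]) -> Counter:
    """Count how often each criterion failed across a batch of evaluator results."""
    failures: Counter = Counter()
    for result in results:
        for name, ok in result["criteria"].items():
            if not ok:
                failures[name] += 1
    return failures


if __name__ == "__main__":
    # Hand-written results in the evaluator's return shape; not real measurements.
    sample = [
        {"passed": False, "criteria": {"word_count_in_range": False, "has_h2_heading": True}},
        {"passed": False, "criteria": {"word_count_in_range": False, "has_h2_heading": False}},
        {"passed": True, "criteria": {"word_count_in_range": True, "has_h2_heading": True}},
    ]
    for name, count in tally_failures(sample).most_common():
        print(f"{name}: failed {count} of {len(sample)} drafts")
```

If one criterion dominates the tally, the fix is usually a targeted prompt change (for example, stating the word range explicitly) rather than more runs.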

Step 3: Track Reliability Over Time to Catch Drift

The pass@k of a given agent on a given task is not constant. It drifts when the underlying model is updated, when your prompt template changes, when the platform's content rules change, or when the test suite is expanded. Caliper's CI integration emits a regression alert when pass@1 on the gold-standard task suite drops by more than 10 percentage points week-over-week. The recipe is a short GitHub Action that runs Caliper on every PR that touches the prompt template and posts a comment with the pass@1 and pass@5 numbers and a regression flag.
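The regression rule itself is a simple comparison between two sets of pass@1 numbers. The sketch below shows it in isolation, assuming you already have per-task pass@1 values from a baseline run and a current run. The function names are hypothetical and this is not Caliper's internal logic.

```python
from typing import Dict, List


def pass_at_1_regressed(previous: float, current: float, threshold: float = 0.10) -> bool:
    """True when pass@1 dropped by more than threshold (0.10 = 10 percentage points)."""
    # Round away floating-point noise so a drop of exactly 10 points does not trip the gate.
    return round(previous - current, 9) > threshold


def regressed_tasks(previous: Dict[str, float], current: Dict[str, float],
                    threshold: float = 0.10) -> List[str]:
    """Task ids present in both runs whose pass@1 dropped by more than threshold."""
    shared = previous.keys() & current.keys()
    return sorted(t for t in shared if pass_at_1_regressed(previous[t], current[t], threshold))


if __name__ == "__main__":
    baseline = {"task-01": 0.90, "task-02": 0.50}
    latest = {"task-01": 0.70, "task-02": 0.45}
    print(regressed_tasks(baseline, latest))
```

Comparing per task rather than only at the suite level matters because a large drop on one task can be masked by small gains on the others.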

Step 4: CI Integration (the GitHub Action)

The action below runs Caliper against the gold-standard task suite, fails the build if pass@1 regresses by more than 10 percentage points, and posts a summary of the report as a PR comment. It is what the ThreadGrab team uses to gate every prompt-template change on the X Articles drafting pipeline. The whole thing lives in .github/workflows/caliper.yml.

# .github/workflows/caliper.yml
name: Caliper Reliability Gate
on:
  pull_request:
    # Match the evaluator's actual location (threadgrab/evaluators/x_article.py)
    paths: ["prompts/**", "threadgrab/evaluators/**", "caliper-xarticles.yaml"]
permissions:
  contents: read
  pull-requests: write   # lets the comment step post on the PR
jobs:
  reliability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}
      - run: pip install "caliper[social]==0.3.2"
      - run: caliper run --config caliper-xarticles.yaml
      - name: Comment PR with report
        if: always()   # still comment when the Caliper step fails the build
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require("fs");
            const r = JSON.parse(fs.readFileSync("caliper-report.json", "utf8"));
            const body = "## Caliper Report\n"
              + "**pass@1:** " + (r.pass_at_1 * 100).toFixed(1) + "%\n"
              + "**pass@5:** " + (r.pass_at_5 * 100).toFixed(1) + "%\n"
              + "**regression:** " + (r.regression_flag ? "YES" : "no");
            await github.rest.issues.createComment({
              owner: context.repo.owner, repo: context.repo.repo,
              issue_number: context.issue.number, body});

FAQ: Caliper for Social Content Creators

What is Caliper?

Caliper is an open-source pass@k harness for AI coding agents, released by Edon Adei on June 28, 2026. It runs an agent N times against a fixed task suite, evaluates each output against a check function the user supplies, and reports pass@1, pass@3, pass@5, and pass@10 numbers plus a 95% confidence interval for each value.

Why does pass@k matter to social-content creators?

Most AI writing pipelines hide how often the first draft is good enough. If your hit rate is 30% on your gold-standard task suite, you are paying for about 3.3 AI runs on average to publish one article. If it is 90%, you are paying for about 1.1. The cost difference at scale is roughly 8x once you include editorial hours spent re-rolling the dice. Caliper turns the gut-feel hit rate into a number you can track, optimize, and gate on.

Can I use Caliper with any AI agent?

Yes. Caliper is agent-agnostic. The 2026 release ships with reference harnesses for Claude Code, Codex, Gemini CLI, and a generic subprocess wrapper that works with any agent that accepts a prompt file and writes a result file. You supply the command-line invocation, and Caliper wraps it.
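A generic subprocess wrapper of this kind is small. The sketch below is an illustration rather than Caliper's implementation: run_agent is a hypothetical name, the {task_file} placeholder and 180-second timeout mirror the config shown earlier, and for simplicity it reads the draft from stdout instead of a result file.

```python
import shlex
import subprocess
import sys
from typing import Optional


def run_agent(command_template: str, task_file: str,
              timeout_seconds: float = 180.0) -> Optional[str]:
    """Run one agent invocation; return its stdout, or None on timeout or non-zero exit."""
    # Quote the path before substitution, then split into argv so no shell is involved.
    argv = shlex.split(command_template.format(task_file=shlex.quote(task_file)))
    try:
        completed = subprocess.run(argv, capture_output=True, text=True,
                                   timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    if completed.returncode != 0:
        return None
    return completed.stdout


if __name__ == "__main__":
    # Stand-in "agent": the current Python interpreter echoing the task file name.
    template = shlex.quote(sys.executable) + ' -c "import sys; print(sys.argv[1])" {task_file}'
    print(run_agent(template, "task-01.md", timeout_seconds=10))
```

Treating a timeout or a non-zero exit as a failed run, rather than raising, keeps one crashed invocation from aborting the whole sample.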

What is a good pass@5 score for X Articles?

Anything above 80% is publication-ready. 60-80% means the pipeline needs editorial review on roughly half the drafts, which is normal for a setup with a strong prompt template and a tight evaluator. Below 60% means the prompt or the evaluator is the bottleneck, and more AI runs will not help. The right move is to invest in better prompts, not in more API spend.

Which X-Articles editor bugs does the evaluator catch?

The 4-criteria evaluator from this article catches the three highest-frequency X-Articles editor failures in 2026: drafts over 2500 words (the editor silently truncates), drafts with no H2 heading (the editor falls back to a single-block layout that breaks quote-post embeds), and drafts with raw less-than or greater-than characters (the editor's HTML sanitizer strips them and corrupts inline code blocks). It also requires at least one brand keyword, a cheap proxy that screens out clearly off-brand drafts before the editorial pass.

Does Caliper work for non-coding agents?

Yes, as long as the agent output can be evaluated automatically. The Caliper 0.3.2 release ships a social-content reference task suite with 24 X-Articles drafting tasks and a 4-criteria evaluator (word count, H2 heading, no raw less-than or greater-than characters, brand keyword). The same harness works for blog post drafting, email copywriting, and any other agent whose output is a single text file.

How does Caliper handle model updates that drift reliability?

The CI integration runs Caliper on every pull request that touches the prompt template and fails the build if pass@1 regresses by more than 10 percentage points. The PR comment summarizes pass@1, pass@5, and the regression flag; the per-task breakdown in the JSON report and HTML dashboard shows reviewers which task drifted. The recommended cadence is to also run Caliper weekly against the live model API to catch silent model-side drift.

Is Caliper related to the ThreadGrab product?

Caliper is an independent open-source project by Edon Adei. ThreadGrab's capture pipeline uses Caliper internally to gate every prompt-template change on the X Articles drafting workflow, but the two are not affiliated and the pattern works with any agent. ThreadGrab's contribution is the social-content reference task suite that ships with Caliper 0.3.2.

ThreadGrab's capture pipeline runs Caliper on every prompt-template change for the X Articles drafting workflow, and the 5-command recipe plus 30-line evaluator above are the production setup. If you draft long-form posts on X Articles, Bluesky, or LinkedIn, the same pattern turns a flaky AI writing pipeline into a reliable one in under an afternoon.

Reliability Is the New Quality

Caliper is the first tool that lets a social-content creator put a number on the question every AI writing user has been answering with gut feel. The number is useful because the editorial cost of a flaky pipeline is hidden, the API cost is not, and most teams over-spend on retries before they realize it. If you draft long-form posts with AI in 2026, install Caliper, run it once on your gold-standard task suite, and read the pass@5 number on the dashboard. If the number is below 60%, your prompt is the bottleneck. If it is above 90%, ship more. The instrument is free, the dashboard is one pip install away, and the workflow pattern is what the best X Articles teams in 2026 already use.