Why a pilot exists at all

Annotation projects fail more often during scoping than during production. The schema looked tight on paper, then annotators started working and the ambiguities surfaced. The accuracy target sounded achievable, then the test data showed it was harder than expected. The output format that the buyer agreed to turned out to lose information their downstream system needed.

A pilot is the first 500-or-so samples done in production conditions, designed to surface these issues before you've committed to scale. It's not a sales motion. It's a sample of the same work, run by the same team, against the same QA process you'd see in production.

This article walks through what a pilot deliverable contains, so you know what to expect before you commit to one.

The five things in every pilot deliverable

1. The labeled data

500 annotated samples in your specified output format (GeoJSON, COCO, KITTI, Mapillary, custom). Every feature carries the attributes the schema defines, the geometry in your specified coordinate reference frame, and the standard metadata (created by, created at, QA reviewed by, QA timestamp).

You can load this directly into QGIS, ArcGIS, your training pipeline, or your downstream consumer system. It's not a preview or a sample — it's production-quality output on a representative slice.

2. The held-out test score

Before the pilot starts, we set aside 50-100 samples that the annotators don't see. You (or your domain expert) label that held-out set independently. After the pilot, we score our labels against yours.

The score is per-class F1, with the confusion matrix included. If we hit 98% on signs, 95% on lane markings, and 91% on a tricky class like "construction zone temporary signage," you see all three numbers. We don't average them.

This is the measurement that anchors the production conversation. If a class scores below your target, we identify why and propose either a schema refinement or a different annotation approach.

3. The inter-annotator agreement report

We have multiple annotators on every pilot. The IAA report measures how much they agreed with each other on the same samples — Cohen's kappa for categorical classifications, IoU for geometry, weighted F1 for mixed schemas.

High IAA (>0.85 kappa) means the schema is clear and reproducible. Low IAA (<0.7) means we have a schema ambiguity to fix before scale. The number itself matters less than the pattern — if IAA is high on most classes but low on one specific class, we know exactly where the schema gap is.

4. The edge-case log

Every ambiguous case the annotators encountered, with our interpretation and a written rule. This is usually 15-40 entries on a typical pilot. Examples:

  • "Sign mounted on a temporary work-zone post: we annotate as the regulatory sign type (R1-1) with attribute mounting=temporary. Alternative: separate temporary-sign class. Recommend the current approach for downstream consistency."
  • "Lane line that fades to indeterminate over a 10-meter stretch: we annotate the visible segments separately and add a partial=true attribute. Alternative: single line with attribute partial_length=10m. Recommend current."
  • "Utility pole with multiple attached transformers: we annotate one feature for the pole with an attribute count for attached equipment. Alternative: separate feature per piece of equipment. Recommend current for inventory use, alternative for inspection use."

Each entry is a decision you can confirm, override, or refine before production. The log gets formalized into the production schema document.

5. The production scope

After the pilot, we propose a production scope: total samples to label, per-feature pricing, weekly delivery cadence, QA structure, accuracy targets per class, edge-case rules from the pilot. Fixed-fee where possible, per-feature where the volume is variable.

The scope is informed by what we actually saw during the pilot, not what we assumed at the start. Throughput estimates are based on the throughput we measured on the pilot, not industry averages. Pricing is based on the schema complexity we measured, not a generic per-feature rate.

What the pilot doesn't include

Three things, intentionally.

Not the full edge-case coverage. 500 samples surface the common edge cases, not all of them. Production typically reveals 5-10 additional edge cases over the first 5,000 samples. We commit to keeping the edge-case log updated during production, not to having seen everything in the pilot.

Not a free service. Pilots are fixed-fee, typically $1,500-$4,000 depending on schema complexity and turnaround. The pilot fee covers our labor and is non-refundable. If you decide not to proceed to production after the pilot, you keep the labeled data and the reports.

Not a long sales process. From the moment you send sample data and a schema description, we're typically labeling within 24 hours and delivering within 2-4 business days. There's no MSA negotiation for the pilot — we sign a mutual NDA and start.

How to evaluate a pilot deliverable

Three questions to ask when you receive it.

Does the data load cleanly into your downstream system? If you have to convert formats, fix coordinate systems, or strip out unexpected fields, that's a problem to surface now, not after 100,000 samples.

Do the held-out F1 scores match your accuracy requirement? If you need 98% on signs and we hit 96%, we need to understand why and adjust before production. Maybe the schema needs refinement. Maybe the source imagery has issues. Maybe 96% is actually the achievable ceiling on this data and you should know that before committing.

Does the edge-case log surface things that would have been silent disagreements in production? This is the highest-value part of the deliverable. If our log forces conversations about schema rules you hadn't thought of, the pilot has done its job.


Want to scope a pilot? Send representative sample data (50-500 frames is plenty), a paragraph describing what you want labeled, and the output format your downstream system expects. We'll return a pilot scope and fixed price within 48 hours.