How we measure accuracy
Every accuracy number we publish — on this site, in the App Store listing, in our blog posts — comes from one private benchmark. Here is exactly what's in it, how we run it, and what the numbers do and don't mean.
The benchmark setup
- Reference instrument: a calibrated laboratory scale with at least 0.1 g resolution (model rotated periodically; the current unit is a dual-platform 0.01 g jewelry scale verified against an OIML Class M1 calibration weight set).
- Phone: iPhone 15 Pro and iPhone 16 Pro, rear camera, tested separately. Results below are averaged across both.
- Vision model: GPT-5.1 with high-detail image input via Supabase Edge Function. Each app mode (General, Gold, Kitchen, Blind Box) routes to a mode-specific prompt chain.
- Lighting: diffuse window light during daytime, plus a controlled run under bright LED ceiling light. Results below are from window light unless noted.
- Background: plain white paper or plain medium-grey ceramic. No patterns.
- Reference object: US quarter (24.26 mm diameter, 5.67 g) included in frame for every shot.
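The quarter is in frame because a known-size object gives the model an absolute scale to anchor its size estimate. A minimal sketch of that idea, not the app's actual pipeline (the function name and pixel value below are hypothetical):

```python
QUARTER_DIAMETER_MM = 24.26  # US quarter, the benchmark's reference object


def mm_per_pixel(quarter_diameter_px: float) -> float:
    """Derive the image scale from the quarter's apparent size in pixels."""
    return QUARTER_DIAMETER_MM / quarter_diameter_px


# If the quarter spans 242.6 px in the photo, each pixel covers 0.1 mm,
# and every other dimension in the frame can be converted the same way.
scale = mm_per_pixel(242.6)
```

This is also why "No reference object" is the single biggest accuracy lever listed later: without the quarter, the model has no absolute scale to work from.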
The item categories
The benchmark contains 200 items across four categories, weighted to match real app usage rather than even distribution:
- Jewelry & precious metals (60 items) — gold rings (10k, 14k, 18k), silver rings, gold chains (hollow, semi-hollow, solid in 2-6 mm widths), pendants, bullion coins (American Eagle, Krugerrand, Maple Leaf, sovereigns, Cumhuriyet), gemstone-set pieces.
- Kitchen ingredients (50 items) — bulk staples (rice, flour, sugar in measured volumes), spices, herbs, single-serving food (chicken breast, salmon fillet, eggs, fruits, vegetables), portioned dishes.
- Packaged & shipping items (50 items) — poly-mailer-shipped soft goods, small electronics, bubble-mailer items, vinyl records, paperback and hardcover books, small ceramics, vintage clothing samples.
- Collectibles & blind boxes (40 items) — Pop Mart sealed boxes (10 series), Sonny Angel boxes, Smiski, Pokemon TCG packs, sealed coin tubes, sealed sneaker boxes, assorted mystery packs.
How we run a test
- Calibrate the reference scale with the OIML weight set.
- Place item on the test surface with the reference quarter in frame.
- Capture three photos at slightly different angles (top-down, 45 degrees from the left, 45 degrees from the right).
- Run each photo through the appropriate app mode.
- Log the median of the three estimates against the reference scale reading.
- Record absolute error and percentage error per item.
- Repeat the entire batch quarterly to track drift.
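The per-item scoring in the steps above is simple to reproduce. A sketch (variable and function names are ours, for illustration):

```python
from statistics import median


def score_item(estimates_g: list[float], reference_g: float) -> tuple[float, float]:
    """Score one benchmark item.

    Takes the three per-angle weight estimates and the reference scale
    reading; returns (absolute error in grams, percentage error).
    """
    est = median(estimates_g)  # median of the three angle shots
    abs_err = abs(est - reference_g)
    pct_err = 100.0 * abs_err / reference_g
    return abs_err, pct_err


# Example: three angle shots of an item the reference scale reads at 5.67 g.
abs_err, pct_err = score_item([5.4, 5.8, 6.1], reference_g=5.67)
```

Taking the median of three angles, rather than a single shot, damps the occasional outlier estimate from an unflattering angle.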
Current results
Latest benchmark run: April 2026.
| Category | Median error | 80th percentile error | Best mode for category |
|---|---|---|---|
| Jewelry & precious metals | 4.8% | 9.2% | Gold |
| Kitchen ingredients | 7.1% | 14.5% | Kitchen |
| Packaged & shipping | 5.6% | 11.8% | General |
| Collectibles & blind boxes | 6.4% | 12.1% | Blind Box |
| Overall (200 items) | 6.0% | 11.7% | — |
What the numbers mean
The headline "6% median error" means half of all estimates land within 6% of the true weight under benchmark conditions. The 80th-percentile figure of 11.7% means 80% of estimates land within 11.7% of the true weight. These are usable accuracy numbers for shopping, shipping, jewelry pricing decisions, and recipe sanity checks. They are not lab-grade accuracy.
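For readers who want to reproduce the aggregation: the table's numbers are plain order statistics over the per-item percentage errors. A sketch with made-up data (the real run uses the 200 benchmark items):

```python
from statistics import median, quantiles

# Hypothetical per-item percentage errors, for illustration only.
pct_errors = [2.1, 4.8, 5.9, 6.2, 7.4, 9.0, 11.5, 14.2]

# Half of the items land at or below this error.
median_err = median(pct_errors)

# quantiles(n=5) cuts the data at the 20/40/60/80% marks;
# the last cut point is the 80th percentile.
p80_err = quantiles(pct_errors, n=5)[-1]
```

Any tool that computes percentiles (a spreadsheet, NumPy, R) gives the same numbers, modulo the interpolation method used between sorted data points.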
What degrades these numbers
Real-world conditions differ from benchmark conditions. The following factors push median error from 6% toward 15-25%:
- Patterned background — granite countertops and patterned tablecloths confuse segmentation. Adds 5-15% error.
- Bad light — incandescent-only lighting, hard shadows, blown highlights. Adds 5-10% error.
- No reference object — the single biggest accuracy lever. Without a known-size object in frame, error can double.
- Wrong mode — running a jewelry photo in Kitchen mode (or vice versa) produces 30%+ errors.
- Mixed-item plates — multiple overlapping items in one photo degrade segmentation. Photograph items separately.
The full failure-mode reference is in Photo Weighing: 7 Mistakes That Wreck Your Estimate.
What we don't measure
- Bodyweight (no app on a phone measures human bodyweight; we don't claim to).
- Lab-grade chemistry (sub-1% precision is outside the camera method's reach in 2026).
- Items the camera can't see (hidden internal structure, dense packing of multi-item bundles).
- Items too large to fit in the photo frame together with a reference object.
Update cadence
We re-run the full 200-item benchmark every quarter. When the underlying vision model is updated by OpenAI or we change a prompt chain, we re-run the affected categories. Numbers on this page reflect the most recent run; the date at the top of "Current results" shows when.
Why publish this
Most phone-scale apps quote accuracy numbers without explaining how they were measured. Some of those numbers are real and some are marketing — you can't tell which from the App Store listing. By publishing the full benchmark setup, we make our accuracy claim verifiable. Anyone running the same setup against any phone-scale app should get reproducible numbers, and the differences between apps become measurable instead of debated.
For an honest comparison of how phone scale apps work in the category, see Phone Scale Apps in 2026: 9 Things the Marketing Won't Tell You.