The biggest moat of an AI model is its dataset. But bigger is not automatically better: the data must be kept separate by cell type, labeled consistently, and traceable across versions. We have been accumulating PV defect data since 2018, and every pitfall we hit through 2024 traces back to one situation: two labelers look at the same EL image and disagree. This article describes how MVCreate's annotation and versioning practice evolved as the library grew from 50K images across 3 cell types to 2M images across 9 cell types.
Originally we organized the library by defect type: microcracks together, broken fingers together, all cell types blended. This worked through 2019 but broke down in 2021, because the same defect looks different on each cell type:
| Defect | PERC | TOPCon | HJT | IBC |
|---|---|---|---|---|
| Microcrack | Dark line, 50–200 μm wide | 30–150 μm | 20–100 μm | Often shorter, branched |
| Broken finger | Visible busbar break | Busbar + finger together | Dark patches on transparent TCO | No busbars |
| Black spot | 200 μm – 2 mm | 100–500 μm | Invisible in EL (PL only) | Different morphology entirely |
Key insight: "the same defect" looks like different objects across cell types. Mixing PERC microcracks with IBC microcracks makes the model learn an "averaged microcrack" — accurate for neither.
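We restructured in 2021: each defect type is sub-divided by cell type. 29 defect types × 9 cell types gives 261 possible combinations, of which roughly 180 are valid sub-libraries (not every defect appears on every cell type). Training selects per-type subsets: a shared representation, plus fine-tuning per cell type.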
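A minimal sketch of this partitioning, assuming each sample carries a defect name and a cell type. The `Sample` record, the function names, and the example values are hypothetical illustrations, not MVCreate's actual schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    image_id: str
    defect: str     # e.g. "microcrack" (hypothetical naming)
    cell_type: str  # e.g. "PERC"


def partition(samples):
    """Group samples into sub-libraries keyed by (defect, cell_type)."""
    libs = defaultdict(list)
    for s in samples:
        libs[(s.defect, s.cell_type)].append(s)
    return dict(libs)


def select_for_cell_type(libs, cell_type):
    """Pick the per-type training subset: every defect, one cell type."""
    return {defect: items
            for (defect, ctype), items in libs.items()
            if ctype == cell_type}


samples = [
    Sample("a1", "microcrack", "PERC"),
    Sample("a2", "microcrack", "IBC"),
    Sample("a3", "black_spot", "PERC"),
]
libs = partition(samples)
perc_subset = select_for_cell_type(libs, "PERC")
```

Keying on the pair rather than on the defect alone is what keeps a PERC microcrack and an IBC microcrack from ever landing in the same training bucket.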
The second killer is annotation drift — the same labeler's judgments slowly shift over time:
- New-hire phase (first 3 months): conservative, with "suspicious" samples labeled as "non-defect";
- Mature phase (3–12 months): aggressive, with the highest detection rate;
- Fatigue phase (12 months+): boundaries soften, and "edge brightness" gets mislabeled as "black edge".
After 3 years, old vs new labels disagree at the boundaries, and models "fit old data well, lose accuracy on new data."
Our solution is three-layer verification plus calibrator rotation:
1. Each image is initially labeled by one labeler, and we record the labeler ID, timestamp, label, and self-reported confidence.
2. 5% of each day's labels are randomly sampled and reviewed by senior verifiers, who issue a pass, reject, or relabel decision. Verifiers rotate every 6 months to prevent verifier drift.
3. We maintain a 5,000-image gold-standard set (jointly labeled by 5 industry experts, with contested samples removed). Every month all labelers are tested against the gold set; labelers who score below 90% pause for a week of retraining.
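The sampling and gold-set checks can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the 90% threshold is read here as a plain agreement rate with the gold labels, which the article does not spell out.

```python
import random

REVIEW_RATE = 0.05     # share of daily labels sent to senior verifiers
GOLD_PASS_RATE = 0.90  # labelers below this pause for retraining


def sample_for_review(daily_label_ids, rate=REVIEW_RATE, seed=None):
    """Randomly pick a fraction of the day's labels for senior review."""
    ids = list(daily_label_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))
    return random.Random(seed).sample(ids, k)


def gold_accuracy(labeler_answers, gold_labels):
    """Fraction of answered gold images where the labeler matches the gold label."""
    scored = [img for img in gold_labels if img in labeler_answers]
    if not scored:
        raise ValueError("labeler answered no gold-set images")
    hits = sum(labeler_answers[img] == gold_labels[img] for img in scored)
    return hits / len(scored)


def needs_retraining(accuracy, threshold=GOLD_PASS_RATE):
    """True when the labeler falls below the gold-set threshold."""
    return accuracy < threshold


daily = [f"img_{i}" for i in range(1000)]
review_batch = sample_for_review(daily, seed=0)

gold = {"g1": "microcrack", "g2": "none", "g3": "black_spot", "g4": "none"}
answers = {"g1": "microcrack", "g2": "none", "g3": "none", "g4": "none"}
accuracy = gold_accuracy(answers, gold)
```

With 1,000 labels in a day the review batch holds 50 images, and the example labeler matches 3 of 4 gold labels, which falls below the threshold.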
Annotation consistency (Cohen's kappa) rose from 0.71 (2019) to 0.91 (2024).
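For reference, Cohen's kappa for two raters is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two label sequences; how the figures above were aggregated across many labelers is not described here, so treat the pairing as an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal frequencies per label.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


rater_a = ["crack"] * 4 + ["ok"] * 6
rater_b = ["crack"] * 3 + ["ok"] * 7
kappa = cohens_kappa(rater_a, rater_b)  # about 0.78
```

In this example the raters agree on 9 of 10 images (p_o = 0.9), but chance alone would produce p_e = 0.54, so kappa lands well below the raw agreement rate. That gap is why kappa is a stricter consistency measure than percent agreement.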
The third challenge is versioning. Our earliest scheme tagged datasets by month (dataset_2020_03 ...) and broke quickly. One monthly update introduced 200 mislabels, but it had already been used to train 3 deployed models. Rolling back required retraining all three.
In 2022 we built git-like version control:
- Each new batch is a "commit" that records the committer, timestamp, affected types, source (production / customer / synthetic), and verification status.
- The main branch accepts only fully verified data; experimental branches accept under-verified data. Models train against main at a specific commit hash.
- A bad commit can be rolled back; downstream models then retrain against the rolled-back version.
The system is built on DVC, S3, and a custom annotation-quality dashboard. Every deployed model carries its training-data commit hash in its metadata, so any issue can be traced to a specific data version.
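To make the commit record concrete, here is a plain-Python sketch. It does not use DVC's API, and the `DataCommit` class, its field names, and the example values are hypothetical. The hash here covers only the metadata record; in a real setup the data files themselves would also need to be content-addressed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCommit:
    parent: Optional[str]            # hash of the previous commit, None for the first
    committer: str
    timestamp: str                   # ISO 8601 string
    affected_types: Tuple[str, ...]  # e.g. ("microcrack/PERC",)
    source: str                      # "production", "customer", or "synthetic"
    verified: bool                   # True once verification is complete

    def commit_hash(self) -> str:
        """Deterministic SHA-256 over the sorted JSON form of the record."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def can_merge_to_main(commit: DataCommit) -> bool:
    """main accepts only fully verified data."""
    return commit.verified


def model_metadata(model_name: str, commit: DataCommit) -> dict:
    """Stamp a model with the data commit it was trained against."""
    if not can_merge_to_main(commit):
        raise ValueError("models train only against verified commits on main")
    return {"model": model_name, "data_commit": commit.commit_hash()}


c1 = DataCommit(None, "labeler_017", "2022-05-01T08:00:00Z",
                ("microcrack/PERC",), "production", True)
c2 = DataCommit(c1.commit_hash(), "labeler_021", "2022-05-02T08:00:00Z",
                ("black_spot/TOPCon",), "synthetic", False)
meta = model_metadata("el-detector-demo", c1)
```

Because each commit stores its parent's hash, rolling back means pointing main at an earlier hash, and any model whose `data_commit` descends from the bad commit is a retraining candidate.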
Customers ask: "Can you synthesize training data?" Yes, with discipline.
Synthetic data (GAN/Diffusion-generated EL images) helps:
- Long-tail classes can be padded quickly;
- Data privacy concerns are minimal;
- It is cheap.
But it carries a leakage risk — if both train and test contain images from the same generator, the model overfits the generator's quirks and real-line performance degrades.
Our rules:
1. Real-first: synthetic ≤ 30% of any single class's training set;
2. Pure-real test sets: all benchmark test sets are 100% real production data;
3. Multi-generator mixing: synthetic data must come from ≥ 3 different GAN/Diffusion models;
4. Physical consistency check: every synthetic image passes a "physical plausibility" filter (intensity distribution, contrast, defect-edge sharpness); failures are discarded.
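Rules 1 to 3 are mechanical enough to check automatically. The sketch below assumes each item records its defect class and, for synthetic images, the generator that produced it. The names are hypothetical, and rule 3 is applied to the training set as a whole, which is one possible reading. Rule 4 depends on image statistics and is not covered.

```python
from collections import Counter
from typing import NamedTuple, Optional

MAX_SYNTH_SHARE = 0.30  # rule 1: synthetic share per class
MIN_GENERATORS = 3      # rule 3: distinct generators behind the synthetic data


class Item(NamedTuple):
    defect_class: str
    generator: Optional[str]  # None means a real production image


def check_training_set(items):
    """Return a list of human-readable violations of rules 1 and 3."""
    violations = []
    totals = Counter(i.defect_class for i in items)
    synth = Counter(i.defect_class for i in items if i.generator is not None)
    for cls, total in totals.items():
        share = synth[cls] / total
        if share > MAX_SYNTH_SHARE:
            violations.append(f"{cls}: synthetic share {share:.0%} exceeds limit")
    generators = {i.generator for i in items if i.generator is not None}
    if generators and len(generators) < MIN_GENERATORS:
        violations.append(f"only {len(generators)} generator(s) used")
    return violations


def is_pure_real(test_items):
    """Rule 2: benchmark test sets contain no synthetic images."""
    return all(i.generator is None for i in test_items)


good = [Item("microcrack", None)] * 7 + [
    Item("microcrack", "gan_a"),
    Item("microcrack", "gan_b"),
    Item("microcrack", "diffusion_c"),
]
bad = [Item("black_spot", None)] * 5 + [Item("black_spot", "gan_a")] * 5
good_report = check_training_set(good)
bad_report = check_training_set(bad)
```

The `good` set sits exactly at the 30% limit with three generators and passes; the `bad` set is 50% synthetic from a single generator and fails both checks.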
Each customer line's data has unique value — process signature, defect distribution, scenario context — that lab-synthesized data can't replicate.
Our customer-feedback mechanism:
1. Raw images stay on the customer line and are never uploaded;
2. Local annotation: customer engineers label their own results;
3. Aggregated "label + feature vector" upload: only labels and feature vectors leave the site;
4. Federated update: the central model updates from the aggregated vectors, and the new model is redistributed.
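How the central model consumes the uploaded vectors is not detailed above. As one simple possibility, and purely as an assumption, each site could reduce its (label, feature vector) pairs to per-label sums and counts, and the central side could merge them into count-weighted class prototypes. All names below are hypothetical.

```python
from collections import defaultdict


def site_summary(labeled_features):
    """On site: reduce (label, vector) pairs to per-label (sum vector, count)."""
    sums = {}
    counts = defaultdict(int)
    for label, vec in labeled_features:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: (sums[label], counts[label]) for label in sums}


def merge_summaries(summaries):
    """Central: count-weighted mean feature vector per label across all sites."""
    total_sums = {}
    total_counts = defaultdict(int)
    for summary in summaries:
        for label, (vec_sum, n) in summary.items():
            if label not in total_sums:
                total_sums[label] = [0.0] * len(vec_sum)
            total_sums[label] = [a + b for a, b in zip(total_sums[label], vec_sum)]
            total_counts[label] += n
    return {label: [x / total_counts[label] for x in total_sums[label]]
            for label in total_sums}


# Raw images never appear here: each site only handles labels and vectors.
site_a = site_summary([("microcrack", [1.0, 0.0]), ("microcrack", [3.0, 0.0])])
site_b = site_summary([("microcrack", [2.0, 3.0])])
prototypes = merge_summaries([site_a, site_b])
```

Weighting by count means a line that contributes two labeled samples pulls the prototype twice as hard as a line that contributes one, which matches the result of pooling all vectors without actually pooling them.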
This preserves data privacy while letting MVCreate's central model learn from every line. In Q4 2024, 14 customers participated, averaging ~120K new valid labels per month.
PV inspection AI is still young. Maturity varies. A few suggestions:
- Start versioning early: if you wait until 1M images, it is too late;
- Labeler rotation + gold-set calibration is mandatory, so don't skimp;
- Cell-type partitioning is mandatory: never merge for convenience;
- Customer privacy + federated learning will be table stakes by 2027, so plan ahead.
For dataset management methodology exchange or federated-learning onboarding, contact MVCreate at +86 159-5048-9233.
Originally published by Vision Potential (Nanjing MVCreate Intelligent Technology Co., Ltd.). Reproductions must credit the source.