Avoiding Dataset Leakage: Annotation & Versioning Across MVCreate's Defect Library

An AI model's biggest moat is its dataset. But bigger is not automatically better: the data must be kept separate by cell type, labeled consistently, and traceable across versions. We have been accumulating PV defect data since 2018, and every pitfall we hit through 2024 traces back to one situation: "two labelers look at the same EL image and disagree." This article describes how MVCreate's annotation and versioning practice evolved as the library grew from 50K images across 3 cell types to 2M images across 9 cell types.

1. Why multi-cell-type libraries can't be merged

Originally we organized the library by defect type — microcracks together, broken fingers together, all cell types blended. This worked through 2019 but broke in 2021, because:

Defect	PERC	TOPCon	HJT	IBC
Microcrack	Dark line, 50–200 μm wide	30–150 μm	20–100 μm	Often shorter, branched
Broken finger	Visible busbar break	Busbar + finger together	Dark patches on transparent TCO	No busbars
Black spot	200 μm – 2 mm	100–500 μm	Invisible in EL (PL only)	Different morphology entirely

Key insight: "the same defect" looks like different objects across cell types. Mixing PERC microcracks with IBC microcracks makes the model learn an "averaged microcrack" — accurate for neither.

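We restructured in 2021 so that each defect type is subdivided by cell type. 29 defect types × 9 cell types gives 261 possible combinations, of which about 180 are valid sub-libraries (not every defect appears on, or is visible for, every cell type). Training then selects the subsets for one cell type at a time: per-type fine-tuning on top of a shared representation.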
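
The partitioning described above can be sketched roughly as follows. This is a minimal illustration, not MVCreate's actual storage layout: the `Sample` record and both helper functions are hypothetical names introduced here, and a real library would index files on disk rather than hold samples in memory.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    # Hypothetical record; field names are assumptions for illustration.
    image_id: str
    defect: str
    cell_type: str


def build_sublibraries(samples):
    """Group samples by (defect, cell_type) so cell types are never blended."""
    libs = defaultdict(list)
    for s in samples:
        libs[(s.defect, s.cell_type)].append(s)
    return dict(libs)


def select_for_cell_type(libs, cell_type):
    """Return the per-defect subsets used to fine-tune one cell type."""
    return {
        defect: items
        for (defect, ct), items in libs.items()
        if ct == cell_type
    }


samples = [
    Sample("a1", "microcrack", "PERC"),
    Sample("a2", "microcrack", "IBC"),
    Sample("a3", "black_spot", "PERC"),
]
libs = build_sublibraries(samples)
perc = select_for_cell_type(libs, "PERC")
```

Keying on the pair rather than on the defect alone means a PERC microcrack and an IBC microcrack can never land in the same training subset by accident.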

2. The "drift problem" of annotation consistency

The second killer is annotation drift — the same labeler's judgments slowly shift over time:

New-hire phase (first 3 months): conservative — "suspicious" labeled as "non-defect";
Mature phase (3–12 months): aggressive — highest detection;
Fatigue phase (12 months+): boundaries soften — "edge brightness" mislabeled as "black edge".

After 3 years, old vs new labels disagree at the boundaries, and models "fit old data well, lose accuracy on new data."

Our solution is three-layer verification plus calibrator rotation:

Layer 1: Labelers

Each image is first labeled by one labeler, and the record stores the labeler ID, timestamp, label, and the labeler's self-reported confidence.

Layer 2: Verifiers

Each day, 5% of that day's labels are randomly sampled and reviewed by senior verifiers, who mark each one pass, reject, or relabel. Verifiers rotate every 6 months to prevent verifier drift.

Layer 3: Gold-standard set

We maintain a 5,000-image gold-standard set, jointly labeled by 5 industry experts with contested samples removed. Every month all labelers are tested against the gold set; labelers who score below 90% pause labeling for a week of retraining.
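
As a rough sketch of Layers 2 and 3, the daily sample and the monthly gold-set gate might look like the code below. The 5% rate and the 90% threshold come from the text; the function names, the data shapes, and the choice to count unanswered gold images as misses are assumptions made for illustration.

```python
import random

SAMPLE_RATE = 0.05     # Layer 2: 5% of daily labels (from the text)
PASS_THRESHOLD = 0.90  # Layer 3: gold-set pass mark (from the text)


def sample_for_review(label_ids, rate=SAMPLE_RATE, seed=None):
    """Pick a random subset of today's labels for senior verifiers."""
    ids = list(label_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))
    return random.Random(seed).sample(ids, k)


def gold_set_score(answers, gold):
    """Fraction of gold images on which the labeler matches the experts.

    Gold images the labeler did not answer count as misses (an assumption).
    """
    if not gold:
        raise ValueError("gold set is empty")
    hits = sum(answers.get(img) == label for img, label in gold.items())
    return hits / len(gold)


def needs_retraining(answers, gold, threshold=PASS_THRESHOLD):
    """True when the labeler falls below the gold-set pass mark."""
    return gold_set_score(answers, gold) < threshold


# Toy gold set of 10 images; the real one has 5,000.
gold = {f"g{i}": "microcrack" if i % 2 else "none" for i in range(10)}
careful = dict(gold, g0="microcrack")                    # 9 of 10 correct
drifting = dict(gold, g0="microcrack", g2="microcrack")  # 8 of 10 correct
```

Here `needs_retraining(careful, gold)` is false because 9 of 10 meets the threshold exactly, while `needs_retraining(drifting, gold)` is true.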

Annotation consistency (Cohen's kappa) rose from 0.71 (2019) to 0.91 (2024).
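
For readers unfamiliar with the metric: Cohen's kappa compares the observed agreement between two labelers, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label lists. It is a plain-Python illustration with made-up labels, not the tooling used to produce the figures above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who rated the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each labeler's marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical class; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


labeler_a = ["crack", "crack", "ok", "ok"]
labeler_b = ["crack", "ok", "ok", "ok"]
kappa = cohens_kappa(labeler_a, labeler_b)  # p_o = 0.75, p_e = 0.5 -> 0.5
```

In this toy case the labelers agree on 75% of images, but half of that agreement would be expected by chance, which is why kappa (0.5) is a stricter figure than raw agreement.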

3. Git-like version control

The third challenge is versioning. Our earliest scheme tagged datasets by month (dataset_2020_03 ...), and it broke quickly. One monthly update introduced 200 mislabels and had already been used to train 3 deployed models before the problem was found. Rolling back meant retraining all three.

In 2022 we built git-like version control:

3.1 The "commit" concept

Each new batch is a "commit" recording committer, timestamp, affected types, source (production / customer / synthetic), verification status.
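
A commit record of this kind could be modeled as in the sketch below. The fields mirror the list above; the `parent` and `image_ids` fields, the class name, and the choice of SHA-256 over a canonical JSON encoding are assumptions added so the example can produce a hash, not a description of MVCreate's internal format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

ALLOWED_SOURCES = {"production", "customer", "synthetic"}


@dataclass(frozen=True)
class DataCommit:
    parent: Optional[str]   # hash of the previous commit (assumed field)
    committer: str
    timestamp: str          # ISO 8601 string
    affected_types: tuple   # e.g. ("microcrack/PERC",)
    source: str             # production / customer / synthetic
    verified: bool          # verification status
    image_ids: tuple        # batch contents (assumed field)

    def __post_init__(self):
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")

    def commit_hash(self) -> str:
        """Hash a canonical JSON encoding so equal commits hash equally."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


first = DataCommit(
    parent=None,
    committer="labeler-017",
    timestamp="2022-05-01T08:00:00Z",
    affected_types=("microcrack/PERC",),
    source="production",
    verified=True,
    image_ids=("img-0001", "img-0002"),
)
second = DataCommit(
    parent=first.commit_hash(),
    committer="labeler-017",
    timestamp="2022-05-02T08:00:00Z",
    affected_types=("black_spot/TOPCon",),
    source="customer",
    verified=False,
    image_ids=("img-0003",),
)
```

Including the parent hash in the hashed payload chains the commits together, so changing an old batch changes every later hash.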

3.2 Branches

main accepts only fully verified data; experimental branches accept under-verified data. Models train against main at a specific commit hash.

3.3 Rollback

A bad commit can be rolled back; downstream models retrain against the rolled-back version.

3.4 Tooling

Built on DVC + S3 + a custom annotation-quality dashboard. Every deployed model carries the training-data commit hash in metadata — any issue traces to a specific data version.
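
Because each model records its training-data commit, finding the models affected by a bad commit reduces to an ancestry check. The sketch below assumes a simple linear history stored as a child-to-parent mapping; the commit IDs, model names, and function names are hypothetical, and it does not use DVC's own commands.

```python
def ancestors(commit, parents):
    """Yield a commit and all of its ancestors by following parent links."""
    seen = set()
    while commit is not None and commit not in seen:
        seen.add(commit)
        yield commit
        commit = parents.get(commit)


def models_to_retrain(bad_commit, parents, model_commits):
    """Models whose training data includes the bad commit."""
    return sorted(
        model
        for model, commit in model_commits.items()
        if bad_commit in ancestors(commit, parents)
    )


# Hypothetical history: c1 <- c2 <- c3 <- c4
parents = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c3"}

# Hypothetical model metadata: model name -> training-data commit.
model_commits = {"perc_v1": "c1", "topcon_v2": "c3", "hjt_v1": "c4"}

affected = models_to_retrain("c2", parents, model_commits)
# -> ["hjt_v1", "topcon_v2"]; perc_v1 was trained before c2 existed.
```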

4. Avoiding leakage with synthetic data

Customers ask: "can you synthesize training data?" — yes, with discipline.

Synthetic data (GAN- or diffusion-generated EL images) helps in three ways:
- Long-tail classes can be padded quickly;
- Data privacy concerns are minimal;
- It is cheap.

But it carries a leakage risk — if both train and test contain images from the same generator, the model overfits the generator's quirks and real-line performance degrades.

Our rules:
1. Real-first: synthetic ≤ 30% of any single class's training set;
2. Pure-real test sets: all benchmark test sets are 100% real production data;
3. Multi-generator mixing: synthetic data must come from ≥ 3 different GAN/Diffusion models;
4. Physical consistency check: every synthetic image passes a "physical plausibility" filter (intensity distribution, contrast, defect-edge sharpness); failures are discarded.
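
Rules 1 to 3 lend themselves to an automated check before training starts. The sketch below is one possible form, with assumptions called out: the record keys (`cls`, `synthetic`, `generator`) are invented for illustration, and rule 3 is applied across the whole training set, although the text does not say whether it applies per class. Rule 4 needs image analysis and is not covered here.

```python
from collections import Counter

MAX_SYNTH_FRACTION = 0.30  # rule 1 (from the text)
MIN_GENERATORS = 3         # rule 3 (from the text)


def check_split(train, test):
    """Return a list of rule violations; an empty list means the split passes.

    Each record is a dict with keys 'cls', 'synthetic' (bool) and
    'generator' (str, or None for real images). Key names are assumptions.
    """
    violations = []

    # Rule 2: benchmark test sets must be 100% real.
    if any(r["synthetic"] for r in test):
        violations.append("test set contains synthetic images")

    # Rule 1: synthetic share per class must not exceed 30%.
    totals = Counter(r["cls"] for r in train)
    synth = Counter(r["cls"] for r in train if r["synthetic"])
    for cls, n in totals.items():
        if synth[cls] / n > MAX_SYNTH_FRACTION:
            violations.append(f"class {cls!r} exceeds synthetic limit")

    # Rule 3: synthetic images must come from at least 3 generators.
    generators = {r["generator"] for r in train if r["synthetic"]}
    if generators and len(generators) < MIN_GENERATORS:
        violations.append("fewer than 3 generators used")

    return violations


def real(cls):
    return {"cls": cls, "synthetic": False, "generator": None}


def fake(cls, generator):
    return {"cls": cls, "synthetic": True, "generator": generator}


good_train = [real("microcrack")] * 7 + [
    fake("microcrack", g) for g in ("gan_a", "gan_b", "diff_c")
]
bad_train = [real("microcrack")] * 6 + [fake("microcrack", "gan_a")] * 4
real_test = [real("microcrack")] * 5
```

With this data, `check_split(good_train, real_test)` returns an empty list, while `bad_train` trips both the 30% limit and the generator-count rule.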

5. How customer feedback enters the model

Each customer line's data has unique value — process signature, defect distribution, scenario context — that lab-synthesized data can't replicate.

Our customer-feedback mechanism:

Raw images stay on customer line — never uploaded;
Local annotation — customer engineers label their own results;
Aggregated "label + feature vector" upload — only feature vectors leave site;
Federated update — central model updates from aggregated vectors; new model is redistributed.
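
The article does not specify how the central model consumes the uploaded vectors, so the following is only one illustrative possibility: each site reduces its labeled feature vectors to per-label sums and counts, and the center merges them into count-weighted mean vectors per label. All names and the prototype-averaging approach itself are assumptions, not MVCreate's actual update rule.

```python
def summarize_site(labeled_features):
    """On-site step: reduce (label, vector) pairs to per-label sums and counts.

    Only this summary would leave the site; raw images never do.
    """
    summary = {}
    for label, vec in labeled_features:
        entry = summary.setdefault(
            label, {"sum": [0.0] * len(vec), "count": 0}
        )
        entry["sum"] = [s + v for s, v in zip(entry["sum"], vec)]
        entry["count"] += 1
    return summary


def aggregate(site_summaries):
    """Central step: merge summaries into a count-weighted mean per label."""
    merged = {}
    for summary in site_summaries:
        for label, entry in summary.items():
            m = merged.setdefault(
                label, {"sum": [0.0] * len(entry["sum"]), "count": 0}
            )
            m["sum"] = [a + b for a, b in zip(m["sum"], entry["sum"])]
            m["count"] += entry["count"]
    return {
        label: [s / m["count"] for s in m["sum"]]
        for label, m in merged.items()
    }


site_a = summarize_site([
    ("microcrack", [1.0, 0.0]),
    ("microcrack", [3.0, 2.0]),
])
site_b = summarize_site([("microcrack", [2.0, 4.0])])
prototypes = aggregate([site_a, site_b])  # {"microcrack": [2.0, 2.0]}
```

Weighting by count means a line that contributes more labels moves the central estimate more, without any single image or per-image vector being identifiable in the upload.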

This preserves data privacy while letting MVCreate's central model learn from every line. In Q4 2024, 14 customers participated, averaging ~120K new valid labels per month.

6. Recommendations for the industry

PV inspection AI is still young. Maturity varies. A few suggestions:

Start versioning early — wait until 1M images and it's too late;
Labeler rotation + gold-set calibration is mandatory — don't skimp;
Cell-type partitioning is mandatory — never merge for convenience;
Customer privacy + federated learning will be table-stakes by 2027 — plan ahead.

For dataset management methodology exchange or federated-learning onboarding, contact MVCreate at +86 159-5048-9233.

Originally published by Vision Potential (Nanjing MVCreate Intelligent Technology Co., Ltd.). Reproductions must credit the source.
