Data Stage #
Types of Datasets #
Small Data vs. Big Data: the practical threshold is “can a small team examine every single example in a reasonable time?”
Small Data
- Every record can be manually examined
- Humans can label data
- Clean (true and consistent) labels are critical
- If the labels are inconsistent, it is often more worthwhile to fix them than to collect new data or overcomplicate the model
Big Data
- Manual labeling is nearly impossible
- The focus is on data processes (how data is collected, stored, and processed)
- But rare events in Big Data pose much the same problems as Small Data: accurate labeling is critical
Unstructured Data
- Humans can label more data
- Data Augmentation is more likely to be helpful (see the sketch after these lists)
Structured Data
- May be more difficult to obtain more data
- Human labeling may not be possible
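For unstructured data such as images, augmentation is often cheap. Below is a minimal sketch using NumPy (the `augment_image` helper is hypothetical, invented for illustration): a random horizontal flip plus mild pixel noise stretches a small set of photos into more training examples.

```python
import numpy as np

def augment_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly perturbed copy of an image (H x W x C, floats in [0, 1])."""
    out = img.copy()
    if rng.random() < 0.5:                        # random horizontal flip
        out = out[:, ::-1, :]
    noise = rng.normal(0.0, 0.02, out.shape)      # mild Gaussian pixel noise
    return np.clip(out + noise, 0.0, 1.0)

# Usage: stretch 100 inspection photos into a few hundred training examples.
rng = np.random.default_rng(seed=0)
photo = rng.random((64, 64, 3))                   # stand-in for a real photo
augmented = [augment_image(photo, rng) for _ in range(4)]
```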
Examples #
| | Unstructured | Structured |
|---|---|---|
| Small Data | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 examples |
| Big Data | Speech recognition from 50 million examples | Online shopping recommendations for 1 million users |
Label Consistency #
Techniques:
- Standardizing: reach an agreement between ML Engineers, Subject Matter Experts, and labelers. Make the labeling convention as well-defined as possible.
- Merging Classes: sometimes it’s more productive to treat some classes as one (e.g. “small scratch” and “deep scratch” -> “scratch”)
- Unknown/Borderline Class: introduce a separate class for ambiguous examples
To verify label consistency, the same data can be re-labeled by a few labelers or even by the same labeler after some time.
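To put a number on consistency, raw agreement and Cohen’s kappa between two labeling passes are a common choice. A minimal self-contained sketch (the example labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labeling passes over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both passes picked classes independently at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# The same 6 parts labeled twice; one disagreement on a borderline scratch.
pass_1 = ["scratch", "ok", "scratch", "ok", "dent", "scratch"]
pass_2 = ["scratch", "ok", "ok",      "ok", "dent", "scratch"]
print(cohens_kappa(pass_1, pass_2))   # ~0.74 -> labels are mostly consistent
```

Kappa close to 1 means the labeling convention is well-defined; values near 0 suggest standardizing or merging classes first.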
Ground Truth Definition #
It is critical to understand how the Ground Truth is defined and how objective it is.
When the GT is externally defined, HLP (Human Level Performance) gives an estimate of the irreducible error. But if the GT is defined by a human’s decision, the algorithm’s accuracy is essentially a measure of “how close it is to that human’s opinion”.
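A tiny numeric illustration of the difference (all numbers are invented):

```python
# Case 1: externally defined GT (e.g., a biopsy result) vs. an expert's labels.
ground_truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # external, objective GT
human_labels = [1, 0, 0, 1, 0, 1, 0, 1, 1, 1]   # expert's opinion on same cases

hlp = sum(h == g for h, g in zip(human_labels, ground_truth)) / len(ground_truth)
print(f"HLP estimate: {hlp:.0%}")   # 80% -> a rough hint at the irreducible error

# Case 2: if the expert's labels ARE the GT, a model matching them 100% of the
# time is only "as good as that expert", not objectively correct.
```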
Obtain New Data #
The earlier you start iterating through the model training cycle, the better.
Brainstorm list of data sources:
| Source | Amount | Cost | Time |
|---|---|---|---|
| Owned | 100 units | $0 | 0d |
| Crowdsourced | 1000 units | $10000 | 14d |
| Pay for labels | 100 units | $6000 | 7d |
| Purchase data | 1000 units | $10000 | 1d |
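A quick way to compare these options is cost per unit and units per day; a sketch using the numbers from the table above:

```python
# (source, units, cost in $, time in days); instant sources get infinite rate.
sources = [
    ("Owned",          100,     0,  0),
    ("Crowdsourced",  1000, 10000, 14),
    ("Pay for labels", 100,  6000,  7),
    ("Purchase data", 1000, 10000,  1),
]

for name, units, cost, days in sources:
    per_unit = cost / units
    rate = units / days if days else float("inf")
    print(f"{name:<15} ${per_unit:>4.0f}/unit {rate:>7.1f} units/day")
```

Purchasing data and crowdsourcing cost the same per unit here, but differ 14x in throughput, which matters if the goal is a fast training iteration.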
Don’t increase the dataset size by more than 10x at a time; otherwise, it can be really hard to predict how the change will affect the model.
Data Provenance and Data Lineage #
Data Provenance: where the data comes from.
Data Lineage: how to reproduce the same data through a sequence of steps.
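A minimal sketch of recording lineage for one pipeline step (the `lineage.jsonl` log file and its field names are made up for illustration):

```python
import hashlib, json, time
from pathlib import Path

def record_step(step, input_path, output_path, params):
    """Append one processing step to a lineage log so the data can be reproduced."""
    entry = {
        "step": step,
        "input": input_path,
        "input_sha256": hashlib.sha256(Path(input_path).read_bytes()).hexdigest(),
        "output": output_path,
        "params": params,          # everything needed to rerun this exact step
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open("lineage.jsonl", "a") as log:   # hypothetical append-only log
        log.write(json.dumps(entry) + "\n")
    return entry

# Usage: record_step("dedupe", "raw.csv", "clean.csv", {"key": "user_id"})
```

Replaying the log top to bottom reproduces the dataset; the hashes catch cases where an input silently changed.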