Data Stage #
Types of Datasets #
Small Data vs. Big Data: the practical threshold is “can a small team examine every single example in a reasonable time?”
Small Data
- Every record can be manually examined
- Humans can label data
- Clean (true and consistent) labels are critical
- If the labels are inconsistent, it is often more worthwhile to fix them than to collect new data or overcomplicate the model
Big Data
- Manual labeling is nearly impossible
- The focus is on data processes (how data is collected, stored, and processed)
- But rare events in Big Data pose much the same problems as Small Data: accurate labeling is critical
Unstructured Data
- Humans can label more data
- Data Augmentation is more likely to be helpful (see the sketch after these lists)
Structured Data
- May be more difficult to obtain more data
- Human labeling may not be possible
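For unstructured data such as images, augmentation is often cheap. Below is a minimal sketch using NumPy (the `augment_image` helper is hypothetical, invented for illustration): a random horizontal flip plus mild pixel noise stretches a small set of photos into more training examples.

```python
import numpy as np

def augment_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly perturbed copy of an image (H x W x C, floats in [0, 1])."""
    out = img.copy()
    if rng.random() < 0.5:                        # random horizontal flip
        out = out[:, ::-1, :]
    noise = rng.normal(0.0, 0.02, out.shape)      # mild Gaussian pixel noise
    return np.clip(out + noise, 0.0, 1.0)

# Usage: stretch 100 inspection photos into a few hundred training examples.
rng = np.random.default_rng(seed=0)
photo = rng.random((64, 64, 3))                   # stand-in for a real photo
augmented = [augment_image(photo, rng) for _ in range(4)]
```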
Examples #
| | Unstructured | Structured |
|---|---|---|
| Small Data | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 examples |
| Big Data | Speech recognition from 50 million examples | Online shopping recommendations for 1 million users |
Label Consistency #
Techniques:
- Standardizing: reach an agreement between ML Engineers, Subject Matter Experts, and labelers. Make the labeling convention as well-defined as possible.
- Merging Classes: sometimes it’s more productive to treat some classes as one (e.g. “small scratch” and “deep scratch” -> “scratch”)
- Unknown/Borderline Class: introduce a separate class for ambiguous examples
To verify label consistency, the same data can be re-labeled by a few labelers or even by the same labeler after some time.
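To put a number on consistency, raw agreement and Cohen’s kappa between two labeling passes are a common choice. A minimal self-contained sketch (the example labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labeling passes over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both passes picked classes independently at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# The same 6 parts labeled twice; one disagreement on a borderline scratch.
pass_1 = ["scratch", "ok", "scratch", "ok", "dent", "scratch"]
pass_2 = ["scratch", "ok", "ok",      "ok", "dent", "scratch"]
print(cohens_kappa(pass_1, pass_2))   # ~0.74 -> labels are mostly consistent
```

Kappa close to 1 means the labeling convention is well-defined; values near 0 suggest standardizing or merging classes first.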
Ground Truth Definition #
It is critical to understand how the Ground Truth is defined and how objective it is.
When the GT is externally defined, HLP (Human Level Performance) gives an estimate of the irreducible error. But if the GT is defined by a human’s decision, the algorithm’s accuracy is essentially a measure of “how close it is to that human’s opinion”.
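A tiny numeric illustration of the difference (all numbers are invented):

```python
# Case 1: externally defined GT (e.g., a biopsy result) vs. an expert's labels.
ground_truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # external, objective GT
human_labels = [1, 0, 0, 1, 0, 1, 0, 1, 1, 1]   # expert's opinion on same cases

hlp = sum(h == g for h, g in zip(human_labels, ground_truth)) / len(ground_truth)
print(f"HLP estimate: {hlp:.0%}")   # 80% -> a rough hint at the irreducible error

# Case 2: if the expert's labels ARE the GT, a model matching them 100% of the
# time is only "as good as that expert", not objectively correct.
```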
Obtain New Data #
The earlier you start iterating through the model training cycle, the better.
Brainstorm list of data sources:
| Source | Amount | Cost | Time |
|---|---|---|---|
| Owned | 100 units | $0 | 0d |
| Crowdsourced | 1000 units | $10000 | 14d |
| Pay for labels | 100 units | $6000 | 7d |
| Purchase data | 1000 units | $10000 | 1d |
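A quick way to compare these options is cost per unit and units per day; a sketch using the numbers from the table above:

```python
# (source, units, cost in $, time in days); instant sources get infinite rate.
sources = [
    ("Owned",          100,     0,  0),
    ("Crowdsourced",  1000, 10000, 14),
    ("Pay for labels", 100,  6000,  7),
    ("Purchase data", 1000, 10000,  1),
]

for name, units, cost, days in sources:
    per_unit = cost / units
    rate = units / days if days else float("inf")
    print(f"{name:<15} ${per_unit:>4.0f}/unit {rate:>7.1f} units/day")
```

Purchasing data and crowdsourcing cost the same per unit here, but differ 14x in throughput, which matters if the goal is a fast training iteration.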
Don’t increase the dataset size by more than 10x at a time; otherwise, it can be really hard to predict how the change will affect the model.
Data Provenance and Data Lineage #
Data Provenance: where the data comes from.
Data Lineage: how to reproduce the same data through a sequence of steps.
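A minimal sketch of recording lineage for one pipeline step (the `lineage.jsonl` log file and its field names are made up for illustration):

```python
import hashlib, json, time
from pathlib import Path

def record_step(step, input_path, output_path, params):
    """Append one processing step to a lineage log so the data can be reproduced."""
    entry = {
        "step": step,
        "input": input_path,
        "input_sha256": hashlib.sha256(Path(input_path).read_bytes()).hexdigest(),
        "output": output_path,
        "params": params,          # everything needed to rerun this exact step
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open("lineage.jsonl", "a") as log:   # hypothetical append-only log
        log.write(json.dumps(entry) + "\n")
    return entry

# Usage: record_step("dedupe", "raw.csv", "clean.csv", {"key": "user_id"})
```

Replaying the log top to bottom reproduces the dataset; the hashes catch cases where an input silently changed.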