Modeling #
AI system = Code + Data

Approaches to modeling:
- Model-centric: improve the algorithm
- Data-centric: improve the data quality
Training Cycle #
```mermaid
flowchart LR
    MHD(Model \n + Hyperparameters \n + Data)
    TR(Training)
    EA(Error Analysis)
    MHD --> TR
    TR --> EA
    EA --> MHD
```
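A minimal sketch of this loop using scikit-learn on synthetic stand-in data; the hyperparameter tweak at the end is a placeholder for whatever the error analysis actually suggests:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the sketch.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

hyperparams = {"C": 1.0}
for iteration in range(3):                                             # one trip around the diagram
    model = LogisticRegression(**hyperparams).fit(X_train, y_train)    # Training
    preds = model.predict(X_dev)
    print(f"iter {iteration}: dev accuracy = {accuracy_score(y_dev, preds):.3f}")
    misclassified = X_dev[preds != y_dev]                              # Error Analysis: inspect these
    hyperparams["C"] *= 0.5                                            # adjust model/hyperparameters/data
```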
Why Low Average Error isn’t Good Enough #
A typical ML model has to clear three bars:
- Perform well on the training set (average training error)
- Perform well on the dev/test sets
- Perform well on business metrics/project goals

In most cases, doing well on the dev/test sets (#2) is not the same as meeting the business metrics (#3).
Priorities #
Model performance is usually measured as an average error over all possible requests. But in the real world, requests have different priorities: some are “disproportionally important”. From a business perspective, the performance requirements on such requests are higher.
Example:
In a web search engine there are queries like “Apple pie recipe” or “Latest movies”. These are informational and transactional queries. Most likely there is no single “best apple pie recipe”, so ranking errors are forgivable.
In contrast, a user who queries “YouTube” or “Reddit” expects the correct result at Top-1. Such queries are navigational and carry a much higher error cost.
Performance on Key Slices #
Make sure to treat all slices of the dataset fairly. The model must not discriminate by gender, location, ethnicity or other protected attributes.
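Average metrics can hide a slice that is served badly, so it is worth computing the metric per slice. A minimal sketch, assuming a pandas evaluation frame with hypothetical region/label/pred columns:

```python
import pandas as pd

# Hypothetical evaluation frame: one row per dev example, with the model's
# prediction, the true label, and a slice attribute (here: region).
df = pd.DataFrame({
    "region": ["US", "US", "EU", "EU", "APAC", "APAC"],
    "label":  [1, 0, 1, 1, 0, 1],
    "pred":   [1, 0, 0, 1, 0, 0],
})
df["correct"] = df["pred"] == df["label"]

print("overall accuracy:", df["correct"].mean())   # the average looks fine...
print(df.groupby("region")["correct"].mean())      # ...but some slices may lag badly
```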
Rare Classes #
Skewed data distribution:
- 99% negative
- 1% positive
If some class is rare, the model can simply ignore it without a major effect on average accuracy.
To detect this on a skewed dataset, use a Confusion Matrix (precision and recall) instead of Accuracy.
\(Precision = \cfrac{TP}{TP + FP} \qquad Recall = \cfrac{TP}{TP + FN}\)

Combining Precision and Recall gives the F1 score:

\(F_1 = \cfrac{2}{\cfrac{1}{Precision} + \cfrac{1}{Recall}}\)
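A short sketch of why accuracy misleads here, using scikit-learn metrics on synthetic 99/1 labels and a “model” that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Synthetic skewed labels: ~99% negative, ~1% positive.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)
y_pred = np.zeros_like(y_true)          # a model that just ignores the rare class

print("accuracy:", accuracy_score(y_true, y_pred))      # ~0.99, looks great
print(confusion_matrix(y_true, y_pred))                 # ...but TP = 0
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall:", recall_score(y_true, y_pred))          # 0.0 exposes the problem
print("F1:", f1_score(y_true, y_pred, zero_division=0)) # 0.0 as well
```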
Establishing a Baseline Level of Performance #

A baseline helps indicate what might be possible. In some cases it also gives a sense of the irreducible error (Bayes error). A baseline can be established per category.
Ways to establish a baseline:
- Human Level Performance (HLP)
- Search for results of similar projects
- Quick-and-dirty implementation
- Performance of an older system
Error Analysis #
By manually inspecting misclassified examples you may come up with additional attributes/tags worth adding to records in the dataset. For example, suppose you find that some audio clips have car noise in the background. It may be worth adding this column to the dataset and checking (see the sketch after this list):
- What fraction of errors has that tag?
- Of all data with that tag what fraction is misclassified?
- What fraction of all the data has that tag?
- How much room for improvement is there on data with that tag?
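A minimal sketch of computing the first three fractions, assuming a hypothetical audit frame with a car_noise tag and an is_error flag per example:

```python
import pandas as pd

# Hypothetical audit frame: one row per dev example, with the manually
# assigned car_noise tag and an is_error flag from the current model.
df = pd.DataFrame({
    "car_noise": [True, True, False, False, True, False],
    "is_error":  [True, False, False, True, True, False],
})

tagged, errors = df["car_noise"], df["is_error"]
print("fraction of errors with the tag:", (tagged & errors).sum() / errors.sum())
print("error rate within the tag:     ", df.loc[tagged, "is_error"].mean())
print("fraction of all data tagged:   ", tagged.mean())
```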
A baseline also helps indicate how much room for improvement there is in each category:
(Baseline accuracy - Current accuracy) * Category frequency
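A tiny worked example of this prioritization, with made-up per-tag numbers:

```python
# Made-up per-tag numbers; the score is
# (baseline accuracy - current accuracy) * category frequency.
categories = {
    # tag:          (baseline_acc, current_acc, frequency)
    "clean":        (0.95, 0.94, 0.60),
    "car_noise":    (0.93, 0.89, 0.04),
    "people_noise": (0.90, 0.87, 0.30),
}
for tag, (baseline, current, freq) in categories.items():
    print(tag, round((baseline - current) * freq, 4))
# clean 0.006, car_noise 0.0016, people_noise 0.009 -> work on people_noise first
```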
For categories you want to prioritize:
- Collect more data
- Use augmentation to get more data
- Improve data quality
Data Manipulations for Improving Performance #
Data Augmentation #
The goal is to create realistic examples that:
- the algorithm does poorly on
- humans (or other baseline) do well on
As a rule, Data Augmentation does not hurt performance if:
- the model is large (low bias)
- the mapping x -> y stays clear (e.g. an augmented “1” is still distinguishable from an “I”)
Data Augmentation is a good fit for unstructured data.
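A minimal augmentation sketch for audio, mixing background noise into a clip; the arrays here are random stand-ins for real recordings:

```python
import numpy as np

def augment_with_noise(clip: np.ndarray, noise: np.ndarray,
                       noise_level: float = 0.1) -> np.ndarray:
    """Mix background noise into an audio clip at a given relative level."""
    return clip + noise_level * noise[: len(clip)]

# Random stand-ins for 1-second clips sampled at 16 kHz.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)
car_noise = rng.standard_normal(16_000)

augmented = augment_with_noise(speech, car_noise, noise_level=0.2)
# `augmented` keeps the label of `speech`: a human would still transcribe
# it correctly, but the model now gets a harder, realistic example.
```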
Adding features #
For structured data it is often more efficient to add new features to existing records than to create completely new data records.
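A small sketch of this idea, deriving a hypothetical pct_veg_meals feature from columns that already exist instead of collecting new rows:

```python
import pandas as pd

# Hypothetical recommender table: instead of collecting new rows, derive
# a new column from the columns that are already there.
df = pd.DataFrame({
    "user_id":       [1, 2, 3],
    "meals_ordered": [40, 10, 25],
    "veg_meals":     [36, 1, 5],
})
df["pct_veg_meals"] = df["veg_meals"] / df["meals_ordered"]   # new feature
print(df)
```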
Experiment Tracking #
Things to track:
- Algorithm version
- Dataset used
- Hyperparameters
- Results
Desirable Features:
- Experiment reproducibility
- Experiment results with metrics analysis
- Perhaps: resource consumption, visualization, etc…
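A bare-bones tracker can be sketched in plain Python as an append-only JSON-lines log; the field names below are illustrative, and dedicated tools add the reproducibility and analysis features listed above:

```python
import json
import time
from pathlib import Path

def log_experiment(path: str, algorithm: str, dataset: str,
                   hyperparams: dict, results: dict) -> None:
    """Append one experiment record to a JSON-lines log."""
    record = {
        "timestamp": time.time(),
        "algorithm": algorithm,     # e.g. a git commit hash of the code
        "dataset": dataset,         # e.g. a dataset version or checksum
        "hyperparams": hyperparams,
        "results": results,
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")

log_experiment("experiments.jsonl",
               algorithm="model.py@a1b2c3d",
               dataset="clips-v2",
               hyperparams={"lr": 1e-3, "epochs": 10},
               results={"dev_f1": 0.87})
```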