Data #

Data Inspection #

  • Identify data sources
  • Check how they are refreshed
  • Consistency (formatting, data types)
  • Outliers and errors
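
A minimal pandas sketch of these checks; the columns and values are illustrative:

```python
import pandas as pd

# Illustrative data; in practice this comes from your identified sources.
df = pd.DataFrame({"amount": [10.5, 12.0, 11.2, 980.0, 9.8],
                   "date": ["2024-01-01", "2024-01-02", "bad",
                            "2024-01-04", "2024-01-05"]})

print(df.dtypes)                 # consistency: column data types
# Formatting errors: rows that fail to parse as dates.
print(pd.to_datetime(df["date"], errors="coerce").isna().sum())

# Outliers via the IQR rule.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)])
```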

Responsible Data #

  • Bias in data
  • User privacy (see the sketch after this list):
    • Aggregation: replace unique values with a summary
    • Redaction: remove some data to create a less complete picture
  • Compliance with GDPR and other regulations
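
A toy pandas sketch of aggregation and redaction; the columns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [101, 102, 103],
                   "zip": ["10001", "10002", "94103"],
                   "spend": [120.0, 80.0, 200.0]})

# Aggregation: replace per-user values with a group-level summary.
aggregated = df.groupby("zip", as_index=False)["spend"].mean()

# Redaction: drop identifiers and coarsen quasi-identifiers.
redacted = df.drop(columns=["user_id"]).assign(zip=df["zip"].str[:3] + "xx")
```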

Data Problems #

Data/Concept Drift #

Data changes:

  • Trend and seasonality
  • Distribution of features changes
  • Relative importance of features changes

World changes:

  • Style changes
  • Scope and processes change
  • Competitors change
  • Business expands to other geos

| Drift Pace | Slow | Faster | Really Fast |
| --- | --- | --- | --- |
| Ground truth change period | months, years | weeks | days, hours, minutes |
| Retraining driven by | model improvements; better data; changes in software/systems | declining model performance; model improvements; better data; changes in software/systems | declining model performance; model improvements; better data; changes in software/systems |
| Labeling | curated datasets; crowd-based | direct feedback; crowd-based | direct feedback; weak supervision |

Sudden Problems #

  • Data collection problems: bad sensor, bad log data, moved or disabled sensor
  • System problems: bad software update, loss of network connectivity, system down, bad credentials

Detecting Problems #

  • Label new training data to handle changing ground truth
  • Validate data (schema, distribution)
  • Monitor models

Data Labeling #

Process Feedback (direct labeling) #

  • Continuous dataset creation
  • Labels evolve quickly
  • Captures strong label signals
  • Not always possible
  • Requires individual design for each application
  • Tools: Logstash, fluentd, Google Cloud Logging, AWS ElasticSearch, Azure Monitor

Human Labeling #

  • slow
  • difficult for many datasets
  • expensive
  • small datasets

Semi-supervised Labeling #

  • can boost the accuracy
  • cheap and fast
  • small human-labeled data is mixed with large unlabeled data
  • relies on some uniformity and clustering within the feature space and performs label propagation
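
A minimal sketch with scikit-learn's LabelSpreading, which propagates a few human labels through a neighborhood graph in feature space (unlabeled points are marked with -1):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

# Keep only 10 human labels; mark the rest as unlabeled (-1).
y_partial = np.full_like(y, -1)
labeled_idx = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

# Propagate labels through the feature-space neighborhood graph.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print((model.transduction_ == y).mean())   # accuracy over all points
```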

Active Learning #

  • algorithms for intelligent sampling data
  • select the most informative points and label them only
  • useful with constrained budgets, when you can afford to label only a small amount of data
  • useful with imbalanced dataset: helps find rare classes

Techniques: margin sampling, cluster-based sampling, query-by-committee, region-based sampling, etc.
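
A minimal margin-sampling sketch for any classifier exposing predict_proba; the function name and query budget are illustrative:

```python
import numpy as np

def margin_sampling(model, X_pool, n_queries=10):
    """Select the unlabeled points whose top two class probabilities are
    closest together, i.e. where the model is least decided."""
    proba = model.predict_proba(X_pool)        # shape: (n_samples, n_classes)
    sorted_proba = np.sort(proba, axis=1)
    margins = sorted_proba[:, -1] - sorted_proba[:, -2]
    return np.argsort(margins)[:n_queries]     # smallest margins first
```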

Weak Supervision #

  • SME defines heuristics: labeling functions that generate noisy, rough labels
  • a generative model de-noises and weights the labels
  • the resulting labels are used to train the model

Weak supervision framework: Snorkel.
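
A minimal sketch of this workflow with Snorkel; the spam/ham task, the heuristics, and the toy data are illustrative:

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

# Toy data; in practice this is your unlabeled training set.
df_train = pd.DataFrame({"text": ["win cash now http://spam.example",
                                  "see you at lunch",
                                  "click http://deal.example today"]})

# SME heuristics: noisy, overlapping labeling functions.
@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

applier = PandasLFApplier([lf_contains_link, lf_short_message])
L_train = applier.apply(df_train)

# The generative model de-noises and weights the labeling functions.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=0)
probs = label_model.predict_proba(L_train)   # probabilistic labels for training
```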

Data Validation #

Typical workflow with TensorFlow Data Validation (TFDV):

  • Infer schema (columns, constraints, domains) from training dataset and calculate statistics for each feature
  • Check for anomalies (compare statistics) in evaluation dataset and adjust schema
  • Store the schema
  • Validate input serving data with the schema
  • Monitor serving data statistics and track data drift
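
A minimal sketch of this workflow with TFDV; the file paths are placeholders:

```python
import tensorflow_data_validation as tfdv

# 1. Infer a schema and statistics from the training data.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# 2. Compare evaluation data against the schema and inspect anomalies.
eval_stats = tfdv.generate_statistics_from_csv("eval.csv")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

# 3. Persist the schema for reuse when validating serving data.
tfdv.write_schema_text(schema, "schema.pbtxt")
```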

Data Preprocessing #

Data Preprocessing transforms raw data into a clean and training-ready dataset.

Preprocessing operations:

  • Data cleansing
  • Feature tuning (scaling, normalizing, …)
  • Representation transformation
  • Feature extraction
  • Feature construction


Data preprocessing applied during training must be applied in exactly the same way during serving; otherwise training/serving skew arises.

Text preprocessing techniques:

  • stemming
  • lemmatization
  • TF-IDF
  • n-grams
  • embedding lookup

Image preprocessing techniques:

  • clipping
  • resizing
  • cropping
  • blur
  • Canny filter
  • Sobel filter
  • photometric distortions

Scaling #

Rescale a numeric range to another range that works better for the ML model.

Example: greyscale pixel values in [0, 255] are usually rescaled to [-1, 1].

Normalization #

Transform any numeric range to [0, 1].

\(X_{norm} = \cfrac{X - X_{min}}{X_{max} - X_{min}}\)

Standardization #

Z-score: number of standard deviations away from the mean.

\(X_{std} = \cfrac{X - \mu}{\sigma}\)
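
A quick numpy illustration of all three transforms:

```python
import numpy as np

x = np.array([0.0, 64.0, 128.0, 255.0])           # greyscale pixel values

scaled = x / 127.5 - 1.0                          # [0, 255] -> [-1, 1]
normalized = (x - x.min()) / (x.max() - x.min())  # min-max -> [0, 1]
standardized = (x - x.mean()) / x.std()           # z-score
```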

Bucketizing / Binning #

Bucketizing (binning) is a grouping technique that creates categories from a numeric range.
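
A minimal pandas example; the bin edges and labels are illustrative:

```python
import pandas as pd

ages = pd.Series([4, 17, 35, 52, 79])
# Turn a numeric range into ordered categories.
buckets = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                 labels=["child", "young adult", "middle-aged", "senior"])
print(buckets.tolist())
```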

Feature crossing #

Feature crossing: combining multiple features into one. Example: day of week + hour of day = hour of the week.
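
A toy illustration of that cross:

```python
# Cross two features into one: day of week (0-6) x hour of day (0-23)
# yields one of 168 hour-of-week buckets.
day_of_week, hour_of_day = 3, 15                 # illustrative values
hour_of_week = day_of_week * 24 + hour_of_day
print(hour_of_week)                              # 87
```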

Other techniques #

Dimensionality reduction in embeddings:

  • Principal component analysis (PCA)
  • t-distributed stochastic neighbor embedding (t-SNE)
  • Uniform manifold approximation and projection (UMAP)
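
A minimal PCA sketch with scikit-learn; the embeddings are random stand-ins for real embedding vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 128)   # stand-in for real embeddings

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)
print(pca.explained_variance_ratio_)    # variance captured by each component
```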

Preprocessing Data at Scale #

  • Real-world models can be trained on terabytes of data.
  • Large-scale data processing frameworks should be used to handle such volumes.
  • Consistent transformation between training and serving is crucial.

When do you transform? #

|  | Pros | Cons |
| --- | --- | --- |
| Pre-process the whole training dataset | Run once; computed on the entire dataset | Transformations must be reproduced at serving; slower iterations |
| Within the model | Easy iterations; transform guarantees | Expensive transforms; long model latency; per-batch transformations introduce skew |
| Transform per batch | Scales better | Access to a single batch only, not the full dataset; normalization and other "wide" transformations need care: normalize by the average within a batch, or precompute the average and reuse it during batch processing |

Prefetch transformation #

To optimize narrow transformations, the "prefetch" transformation can be used. It fetches the next batch of data ahead of time, while the current batch is still being processed. This approach minimizes idle time.
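
A sketch with tf.data; the tensors and the map function are placeholders:

```python
import tensorflow as tf

# Illustrative placeholders: any (features, labels) tensors work here.
features = tf.random.uniform((1000, 10))
labels = tf.random.uniform((1000,), maxval=2, dtype=tf.int32)
preprocess = lambda x, y: (tf.cast(x, tf.float32), y)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           # Fetch the next batch while the current one is being consumed.
           .prefetch(tf.data.AUTOTUNE))
```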

Feature Selection #

Feature Space: an n-dimensional space that has each feature of a dataset as an axis.

Training is learning the decision boundary in the feature space: a line in 2D, a surface in 3D, etc.

Feature Space Coverage: the train/eval datasets should cover the same areas of the feature space that the serving data does:

  • same numerical ranges
  • same classes
  • similar characteristics for image data
  • similar vocabulary, syntax and semantics for NLP data

Feature Selection:

  • identify features that best represent the relationship
  • remove features that don’t influence the outcome
  • reduce the size of the feature space

Unsupervised

In unsupervised feature selection the target column is not considered.

It looks for correlation between features and removes features that highly correlate with other ones.

Supervised

The target column is considered.

It selects the features that contribute the most to the result.

Supervised methods:

  • Filter methods
  • Wrapper methods
  • Embedded methods

Filter methods #

  • Correlation (feature-to-feature and feature-to-label)
  • Univariate feature selection

To detect correlation, a correlation matrix can be used.

Correlation tests:

  • Pearson correlation: linear relationships
  • Kendall Tau Rank Correlation Coefficient: monotonic relationships and small sample sizes
  • Spearman's Rank Correlation Coefficient: monotonic relationships
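
All three tests are available directly in pandas; the frame below is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
df["label"] = df["f1"] * 2 + rng.normal(scale=0.1, size=100)

corr = df.corr(method="pearson")          # also: "kendall", "spearman"
print(corr["label"].sort_values(ascending=False))
```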

Other methods:

  • Mutual information
  • F-test
  • Chi-Squared test
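
A minimal univariate-selection sketch with scikit-learn's SelectKBest on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Keep the k features with the highest univariate test scores.
selector = SelectKBest(score_func=f_classif, k=5)   # or mutual_info_classif
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))           # indices of kept features
```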

Wrapper methods #

Forward selection

  • Start with one feature
  • Evaluate model performance when adding each of the remaining features, one at a time
  • Add the one that performs best
  • Repeat until there's no improvement

Backward elimination

  • Start with all features
  • Evaluate model performance when removing each of the included features, one at a time
  • Remove the one that performs worst
  • Repeat until there's no improvement

Recursive feature elimination

  • Select a model for evaluating feature importance (not all models can do that)
  • Select the desired number of features
  • Fit the model
  • Rank features by importance
  • Discard least important features
  • Repeat until the desired number of features remains
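
A minimal recursive-elimination sketch with scikit-learn's RFE on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# The wrapped estimator must expose feature importances or coefficients.
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
```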

Embedded methods #

Depends on the model you are using.

  • L1 regularization
  • Feature importance
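
A minimal L1 sketch with scikit-learn's Lasso on synthetic data; non-zero coefficients indicate the selected features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

# L1 regularization drives the weights of uninformative features to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(np.flatnonzero(lasso.coef_))   # indices of features with non-zero weight
```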