Five Most Significant Issues With Data Pipelines

  1. Not testing your pipelines. Write test cases that reach at least 90% coverage.
  2. Optimizing for the wrong metric. Are you optimizing for cost or for performance? If you optimize for performance, your primary measure cannot be “cost.” Four standard metrics for any data pipeline:
    1. Data quality: reduce data loss and increase accuracy and usability.
    2. Speed of the pipeline.
    3. Data recovery time and pipeline health, including the overhead of maintaining pipelines.
    4. Cost to process each pipeline batch.
  3. Not having the correct controls, e.g., safeguards against mistakenly dropping the entire lake, and a plan for managing failures in a run.
  4. Not processing data incrementally. Sometimes historical batches must be re-run, for example when a reattribution pipeline fixes past data for the current measurement period. Run those fixes in a parallel pipeline, validate the output, and then swap it into the production data.
  5. Not planning for future growth. Reading records you do not need hurts performance more and more as data volume grows.
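
Points 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the `Record` type, the date-based watermark, the `max_drop_ratio` threshold, and the in-memory `tables` dictionary are all hypothetical stand-ins for whatever storage and validation rules your pipeline really uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical record shape, used only for this sketch."""
    event_date: date
    user_id: str
    amount: float


def select_new_records(records, watermark):
    """Incremental step: keep only records newer than the last processed date."""
    return [r for r in records if r.event_date > watermark]


def validate(staged, production, max_drop_ratio=0.1):
    """Assumed checks: the staged batch is non-empty, has not lost more than
    max_drop_ratio of the production row count, and has no negative amounts."""
    if not staged:
        return False
    if production and len(staged) < len(production) * (1 - max_drop_ratio):
        return False
    return all(r.amount >= 0 for r in staged)


def swap_if_valid(tables, staged):
    """Control step: replace production only when the staged copy validates."""
    if not validate(staged, tables["production"]):
        return False
    tables["previous"] = tables["production"]  # keep the old data for rollback
    tables["production"] = staged
    return True


records = [
    Record(date(2024, 1, 1), "a", 10.0),
    Record(date(2024, 1, 2), "b", 5.0),
    Record(date(2024, 1, 3), "c", 7.5),
]

# Only records after the watermark are read, so old data is not reprocessed.
new_records = select_new_records(records, watermark=date(2024, 1, 1))

# The rebuilt data is staged next to production and swapped in after checks.
tables = {"production": records[:1], "previous": []}
swapped = swap_if_valid(tables, new_records)
```

The swap keeps the previous production data around, so a bad reattribution run can be rolled back instead of overwriting the only copy.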

The hardest thing about performance is knowing what you need to measure, and that must be tied back to your mission statement.

— Peter Drucker