Senior Staff Data Engineer at Catalyst Software
Christina is passionate about distributed computing, multi-cloud architecture, scalable data pipelines, and the latest and greatest in the open source community. You can find her at data streaming, data science, and Linux Foundation events. An intensely curious lifelong learner and effective team leader, she builds data lakes with medallion architecture that support advanced analytics and machine learning. Lately, she has taken a keen interest in MLOps.
Watch live: March 7, 2023 @ 1:30 – 2:00 pm ET
Automated, Scalable and Quality Machine Learning with Airflow, Kubernetes, and Great Expectations
Previously, NLP model training was a manual process made up of piecemeal jobs spread across multiple GCP projects, each with its own timing and scheduling. Airflow enables us to automate the entire process on a schedule or on demand with little to no human intervention. We can now break down a monolithic job into several dependent components. This prevents full job failure, lets us reprocess individual components independently, and lets us train models faster in parallel. We also adopted GKEStartPodOperator to isolate dependencies and spin up customizable resources as needed, and incorporated Great Expectations for data quality checks.
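To illustrate the idea of splitting a monolithic job into dependent components, here is a minimal sketch in plain Python. It is not Airflow code and not the pipeline from the talk: the task names, the `check_rows` quality gate, and the `run_pipeline` helper are all hypothetical, and the standard-library `graphlib` module stands in for a real scheduler. It shows how a failed quality check blocks only downstream tasks, and how a rerun can skip components that already succeeded.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "validate": {"extract"},
    "train_model_a": {"validate"},
    "train_model_b": {"validate"},
    "publish": {"train_model_a", "train_model_b"},
}


def check_rows(rows):
    """Stand-in data quality gate: every row needs non-empty text and a label."""
    return all(bool(r.get("text")) and r.get("label") is not None for r in rows)


def make_tasks(rows, log):
    """Build one callable per task; all but 'validate' just record that they ran."""
    def validate():
        if not check_rows(rows):
            raise ValueError("data quality check failed")

    tasks = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    tasks["validate"] = validate
    return tasks


def run_pipeline(graph, tasks, completed=frozenset()):
    """Run tasks in dependency order and return a status per task.

    Tasks named in `completed` are skipped (treated as done from an earlier
    run). A failed task is never marked done, so its dependents never become
    ready and end up reported as 'blocked'.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    status = {}
    while sorter.is_active():
        ready = sorter.get_ready()
        if not ready:
            break  # everything still pending sits behind a failed task
        for name in ready:
            if name in completed:
                status[name] = "skipped"
                sorter.done(name)
                continue
            try:
                tasks[name]()
            except Exception as exc:
                status[name] = f"failed: {exc}"
            else:
                status[name] = "success"
                sorter.done(name)
    for name in graph:
        status.setdefault(name, "blocked")
    return status
```

A first run with bad rows would mark `validate` as failed and leave the training and publish tasks blocked. After the data is fixed, calling `run_pipeline(PIPELINE, tasks, completed={"extract"})` reruns only the remaining components. In this sketch the two training tasks become ready in the same batch, which is the point at which a real scheduler could run them in parallel.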