AI Agents

Autonomous Data Ingestion Agent

Ford Motor Company — GCP · Vertex AI · Tekton · Airflow

95% reduction in onboarding time — weeks to minutes

95% · Reduction in onboarding time · Weeks → minutes per source

35+ · Sources onboarded to date · Avg. previously 2–4 months each

$1M+ · Engineering capacity recovered · Conservative estimate at blended eng. rate

Zero · Manual schema files authored · Fully automated for tables of any size

The Problem

Onboarding a new data source at Ford was a deeply manual, months-long process. A data engineer had to gather full schema details from source teams, manually author data dictionary JSON files for tables with hundreds — sometimes thousands — of columns, write ETL scripts from scratch, provision GCS buckets, create BigQuery tables via Terraform, set up Airflow DAGs, configure CI/CD pipelines in Tekton, and validate everything before a single byte of data moved.

A simple source took a minimum of one month. Complex sources with large schemas or ambiguous metadata regularly stretched to two to four months. With dozens of ingestion requests in the backlog at any time, engineering capacity was perpetually saturated on work that was largely repetitive and high-effort but low-differentiation.

The Solution

We built a fully autonomous, multi-agent system deployed on Google Cloud Run. An operator fills in a lightweight intake form — source name, file format (CSV, fixed-width, TXT), connection details, and target dataset — and the agent takes it from there.
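As an illustration of how small the intake payload can be, here is a minimal sketch of a request object with basic validation. The class name, field names, and the example values are assumptions made for this sketch; only the four kinds of input and the three file formats come from the description above.

```python
from dataclasses import dataclass

# File formats named in the intake form description.
SUPPORTED_FORMATS = {"csv", "fixed-width", "txt"}


@dataclass(frozen=True)
class IntakeRequest:
    """Hypothetical shape of the intake form; field names are assumptions."""

    source_name: str
    file_format: str
    connection_uri: str
    target_dataset: str

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the form is usable."""
        problems = []
        if not self.source_name.strip():
            problems.append("source_name is required")
        if self.file_format.lower() not in SUPPORTED_FORMATS:
            problems.append(f"unsupported file_format: {self.file_format!r}")
        if not self.connection_uri.strip():
            problems.append("connection_uri is required")
        if not self.target_dataset.strip():
            problems.append("target_dataset is required")
        return problems


# Example with made-up values.
request = IntakeRequest(
    source_name="dealer_sales",
    file_format="CSV",
    connection_uri="gs://example-bucket/dealer_sales/",
    target_dataset="analytics_raw",
)
print(request.validate())
```

Rejecting a malformed form before any subagent runs keeps bad input from surfacing later as a confusing failure in artifact generation.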

The system uses a hierarchical multi-agent architecture: an orchestrator agent receives the request and delegates to specialized subagents for schema inference, data dictionary generation, ETL script authoring, Terraform config generation, Airflow DAG creation, and infrastructure provisioning. Once all artifacts are generated and validated, the system raises a pull request for human review. After approval, it triggers the Tekton CI/CD pipeline, monitors build status, tracks the first BigQuery load, and sends a completion notification. The review is the only manual step.

System Flow

01 · Intake Form · Source details, format, connection

02 · Schema Inference · Auto-derive columns, types, nullability

03 · Artifact Generation · ETL scripts, Terraform, DAGs, data dictionary

04 · Human Review · PR raised; engineer approves or requests changes

05 · CI/CD Trigger · Tekton pipeline builds and deploys

06 · Validation & Notify · BigQuery load tracked, team notified

Tech Stack

GCP Cloud Run

Vertex AI

BigQuery

Airflow

Terraform

Docker

Python

Tekton CI/CD

GCS

Kafka / Pub/Sub

SQL

Key Engineering Challenges

Multi-agent orchestration at scale

The core architectural challenge was decomposing the ingestion workflow into discrete, independently reliable subagents. Each subagent needed to handle partial failures gracefully and feed structured outputs to the next stage without human intervention. We designed a DAG-style execution model where the orchestrator tracks state across agents and retries failed steps with context.
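The execution model described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production orchestrator: the `Step` and `Orchestrator` names are hypothetical, subagents are stood in for by plain functions, and "retry with context" is modeled as passing the errors from earlier attempts back into the next attempt.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable

# A step receives the shared state plus the errors from its earlier attempts.
StepFn = Callable[[dict, list], dict]


@dataclass
class Step:
    name: str
    run: StepFn
    depends_on: tuple = ()


@dataclass
class Orchestrator:
    steps: dict                      # step name -> Step
    max_attempts: int = 3
    state: dict = field(default_factory=dict)   # outputs keyed by step name
    status: dict = field(default_factory=dict)  # succeeded / failed / skipped

    def execute(self) -> dict:
        graph = {s.name: set(s.depends_on) for s in self.steps.values()}
        # static_order() yields every step after all of its dependencies.
        for name in TopologicalSorter(graph).static_order():
            step = self.steps[name]
            if any(self.status.get(d) != "succeeded" for d in step.depends_on):
                self.status[name] = "skipped"  # an upstream step did not succeed
                continue
            errors = []
            for _ in range(self.max_attempts):
                try:
                    self.state[name] = step.run(self.state, errors)
                    self.status[name] = "succeeded"
                    break
                except Exception as exc:  # broad on purpose for the sketch
                    errors.append(str(exc))
            else:
                self.status[name] = "failed"  # all attempts exhausted
        return self.status
```

Because each step reads only from the shared state and its own error history, a failure stays contained: downstream steps are marked as skipped rather than run against missing inputs.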

Tekton CI/CD integration

Integrating with Ford's internal Tekton pipelines required navigating complex org-level security policies, custom pipeline triggers, and non-standard build environments. The agent needed to not only trigger pipelines but monitor their multi-stage execution and surface meaningful status — not just pass/fail — back to the operator.
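To illustrate what "more than pass/fail" can mean, the sketch below rolls per-task results up into a stage-level summary. It assumes the caller has already fetched each task's condition list in the shape Tekton reports under `status.conditions` (a `Succeeded` condition whose `status` is `"True"`, `"False"`, or `"Unknown"`). How those conditions are fetched is out of scope, and the function names, task names, and messages are hypothetical.

```python
def condition_state(conditions: list) -> tuple:
    """Map a Tekton-style 'Succeeded' condition to (state, detail)."""
    for cond in conditions:
        if cond.get("type") == "Succeeded":
            state = {"True": "succeeded", "False": "failed"}.get(
                cond.get("status"), "running"  # "Unknown" means still in progress
            )
            return state, cond.get("message") or cond.get("reason", "")
    return "pending", ""  # no condition reported yet


def summarize_run(task_conditions: dict) -> dict:
    """Summarize task name -> conditions into overall and per-stage status."""
    stages = {name: condition_state(c) for name, c in task_conditions.items()}
    states = [state for state, _ in stages.values()]
    if "failed" in states:
        overall = "failed"
    elif states and all(s == "succeeded" for s in states):
        overall = "succeeded"
    else:
        overall = "in_progress"
    return {
        "overall": overall,
        "stages": {name: state for name, (state, _) in stages.items()},
        "failures": [f"{n}: {d}" for n, (s, d) in stages.items() if s == "failed"],
    }


# Example with made-up task names and messages.
summary = summarize_run({
    "build": [{"type": "Succeeded", "status": "True", "reason": "Succeeded"}],
    "deploy": [{"type": "Succeeded", "status": "False", "reason": "Failed",
                "message": "terraform apply exited with code 1"}],
    "smoke-test": [],
})
print(summary)
```

Reporting the failing stage and its message lets the operator see where a run broke without opening the pipeline logs.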

Schema inference for massive tables

Some source tables had thousands of columns with ambiguous or undocumented types. Manually authoring Terraform schema files for these was the single biggest time sink in the old process. We built a schema inference subagent that samples source data, resolves type conflicts, infers nullability constraints, and generates the full Terraform HCL and data dictionary JSON automatically.
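The sampling, conflict-resolution, and nullability steps can be sketched as below. This is a deliberately simplified illustration, not the subagent itself: it assumes sampled rows arrive as dictionaries of strings, recognizes only four types, and widens conflicting types (integer and float become FLOAT64, anything else becomes STRING). The output uses BigQuery's JSON schema field layout (`name`, `type`, `mode`). The function names and sample values are hypothetical.

```python
import json
from datetime import date


def classify(value: str) -> str:
    """Guess the narrowest type for one non-empty sampled value."""
    for caster, type_name in ((int, "INT64"), (float, "FLOAT64"),
                              (date.fromisoformat, "DATE")):
        try:
            caster(value)
            return type_name
        except ValueError:
            continue
    return "STRING"


def resolve(types: set) -> str:
    """Resolve conflicting observations by widening to a common type."""
    if len(types) == 1:
        return next(iter(types))
    if types and types <= {"INT64", "FLOAT64"}:
        return "FLOAT64"
    return "STRING"  # also the fallback when every sample was empty


def infer_schema(rows: list) -> list:
    """Build BigQuery-style schema fields from sampled rows of strings."""
    columns = {}
    for row in rows:
        for name in row:
            columns.setdefault(name, None)  # keep first-seen column order
    schema = []
    for name in columns:
        seen, nullable = set(), False
        for row in rows:
            raw = (row.get(name) or "").strip()
            if not raw:
                nullable = True  # a missing or empty value was observed
                continue
            seen.add(classify(raw))
        schema.append({
            "name": name,
            "type": resolve(seen),
            "mode": "NULLABLE" if nullable else "REQUIRED",
        })
    return schema


# Example with made-up sample rows.
sample = [
    {"vin": "1FTFW1E50", "odometer": "42000", "price": "31999.5", "sold_on": "2024-01-05"},
    {"vin": "1FA6P8TH3", "odometer": "18.5", "price": "", "sold_on": "2024-02-11"},
]
print(json.dumps(infer_schema(sample), indent=2))
```

A column marked REQUIRED here only means no empty value appeared in the sample, so a real system would need a larger sample or a conservative default before enforcing that constraint.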