The pattern
Most organisations treat SAP as a black box. They extract what they can, throw it into Databricks, and hope the data is good enough for analytics or AI. It almost never is.
Where the cracks appear
- Extracted tables that are heavily normalised and full of SAP-specific codes.
- Missing master data or mishandled slowly changing dimensions that break downstream models.
- No clear lineage from SAP ECC or S/4HANA to the lakehouse.
- Duplicate or conflicting records across modules (FI, SD, MM, etc.).
- Change data capture that was never designed for real-time or near-real-time use cases.
What actually works
- Domain-driven extraction patterns. Start by pulling only the modules and tables that matter for the use case (finance for cost AI, SD/MM for supply-chain prediction, etc.).
- Semantic layer in the curated zone. Build reusable, business-friendly views that translate SAP codes into meaningful concepts and handle slowly changing dimensions properly.
- Model-ready data products. Create purpose-built datasets with feature stores in mind: clean, timestamped, and governed, instead of dumping raw SAP tables.
- Hybrid integration strategy. Use SAP Datasphere or direct JDBC where appropriate, but route high-volume, high-velocity data through Delta Live Tables with proper schema evolution handling.
- Governance at the source. Early data contracts and quality checks inside the pipeline prevent garbage from reaching AI workloads.
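To make the semantic-layer and governance points concrete, here is a minimal sketch in plain Python rather than a production pipeline. It checks each extracted row against a simple data contract, translates a company code into a business-friendly name, and resolves a slowly changing dimension as of the posting date. BUKRS, MATNR, BUDAT and DMBTR are standard SAP field names. The lookup table, the dimension history, the sample rows and every function name are made up for illustration, and in practice the same logic would more likely live in pipeline expectations and curated views.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical lookup: SAP company code (BUKRS) -> business-friendly name.
COMPANY_NAMES = {"1000": "Germany", "2000": "United Kingdom"}


@dataclass(frozen=True)
class MaterialGroupVersion:
    """One SCD Type 2 version of a material's group assignment."""
    material: str
    group: str
    valid_from: date  # inclusive
    valid_to: date    # exclusive


# Hypothetical history: material M-01 changed group on 2024-07-01.
MATERIAL_HISTORY = [
    MaterialGroupVersion("M-01", "Raw materials", date(2020, 1, 1), date(2024, 7, 1)),
    MaterialGroupVersion("M-01", "Packaging", date(2024, 7, 1), date(9999, 12, 31)),
]

# The "data contract" in this sketch: fields every row must carry.
REQUIRED_FIELDS = ("BUKRS", "MATNR", "BUDAT", "DMBTR")


def contract_violations(row: dict) -> list[str]:
    """Return a list of contract problems; an empty list means the row passes."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
    if not problems:
        if row["BUKRS"] not in COMPANY_NAMES:
            problems.append(f"unknown company code {row['BUKRS']}")
        if not isinstance(row["BUDAT"], date):
            problems.append("BUDAT is not a date")
    return problems


def group_as_of(material: str, on: date) -> Optional[str]:
    """Pick the dimension version that was valid on the given date."""
    for version in MATERIAL_HISTORY:
        if version.material == material and version.valid_from <= on < version.valid_to:
            return version.group
    return None


def curate(rows: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split rows into curated, business-friendly records and rejected ones."""
    curated, rejected = [], []
    for row in rows:
        problems = contract_violations(row)
        if problems:
            rejected.append((row, problems))
            continue
        curated.append({
            "company": COMPANY_NAMES[row["BUKRS"]],
            "material": row["MATNR"],
            "material_group": group_as_of(row["MATNR"], row["BUDAT"]),
            "posting_date": row["BUDAT"],
            "amount_local": row["DMBTR"],
        })
    return curated, rejected


# Made-up sample rows: two valid postings either side of the group change,
# and one with a company code the contract does not recognise.
SAMPLE_ROWS = [
    {"BUKRS": "1000", "MATNR": "M-01", "BUDAT": date(2024, 3, 15), "DMBTR": 120.0},
    {"BUKRS": "1000", "MATNR": "M-01", "BUDAT": date(2024, 9, 1), "DMBTR": 80.0},
    {"BUKRS": "9999", "MATNR": "M-01", "BUDAT": date(2024, 9, 1), "DMBTR": 5.0},
]
```

Calling `curate(SAMPLE_ROWS)` keeps the two valid postings, each tagged with the material group that applied on its posting date, and sets the third aside with its reason attached. Rejecting rows with an explicit reason, instead of silently dropping them, is what lets the quality check double as lineage evidence.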
The takeaway
SAP modernisation for AI is not about “lifting and shifting” everything. It is about surgically building high-signal, governed data products that serve both analytics and AI from day one. The organisations that get this right see their Databricks AI use cases move from pilot to production months faster.