Imagine you are a data engineer at a mid-size enterprise running SAP S/4HANA Private Cloud Edition. Your company has just subscribed to SAP Business Data Cloud. The pitch was compelling: bring your SAP data into Databricks with zero copy, no ETL pipelines, no data duplication. You have an existing Datasphere implementation where your team already built a clean, business-ready General Ledger Account master data VIEW. It joins the SKA1 chart-of-accounts table and the SKAT text table, filters to English language, and projects the analytical columns your finance team needs. It took weeks to get right. You want to publish that as a Customer-Managed Data Product and share it to your Enterprise Databricks workspace.
You follow the documented steps. You create a Datasphere HDLF (HANA Data Lake Files) space. You try to point the CMDP directly at your VIEW. And then you discover something nobody told you upfront: you cannot directly share a Datasphere HANA relational VIEW as a Delta-shareable data product. The VIEW must first be materialized into a physical HANA local table. That HANA local table must then be exported to an HDLF Delta Lake file via a Transformation Flow. Only from that HDLF Delta Lake file can the CMDP be published and shared to Databricks via Delta Sharing.
Two physical copies. You thought there would be zero.
Your first reaction is frustration. Your second is a question: did SAP misrepresent the zero-copy architecture? The answer is no, but the explanation requires a level of precision that the marketing material does not always provide. This post provides that precision.
Part 1: The Foundations: What Delta Sharing Actually Guarantees
Before diagnosing the architecture, you need to understand exactly what Delta Sharing is and what its zero-copy promise covers, because this is where the misalignment between expectation and reality originates.
Delta Sharing is an open protocol (originally developed by Databricks, now vendor-neutral) for governed, read-access sharing of data stored in Delta Lake format from a provider to a consumer, without the consumer needing to import or duplicate that data. Delta Lake format means Parquet columnar data files plus a transaction log (the _delta_log folder of JSON files recording every change operation), all stored in a cloud object store (Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage, depending on your hyperscaler).
The zero-copy guarantee that Delta Sharing makes is specifically and exclusively at the consumer boundary. When Databricks reads a Delta-shared table, the sequence at the protocol level works as follows. Databricks authenticates to the SAP BDC Delta Sharing server using mutual TLS (mTLS, meaning both sides exchange certificates to prove identity) and OpenID Connect (OIDC) for token-based authorization. The server issues time-limited, pre-signed URLs pointing directly to the Parquet files in the HDLF object store. Databricks reads those files directly from HDLF over HTTPS. No data enters Databricks storage. The URLs expire. The next query gets fresh URLs. No copy is ever created on the Databricks side.
This is real. It works exactly as described. Zero copy at the consumer boundary is a genuine architectural guarantee that SAP BDC delivers today.
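To make that concrete, here is the consumer-side read expressed with the open-source delta-sharing Python client. The profile path and the share, schema, and table names are illustrative; in a BDC-connected Enterprise Databricks workspace the shared tables surface through Unity Catalog, so in practice you would query them with plain SQL rather than the client library.

```python
import delta_sharing

# A Delta Sharing profile file holds the server endpoint and the
# credentials issued by the provider (names here are placeholders).
profile = "/dbfs/FileStore/bdc_gl_share.share"

client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())  # discover what the provider has exposed

# Load one shared table. Under the hood the client calls the Delta
# Sharing server, receives time-limited pre-signed URLs, and reads the
# Parquet files straight from the provider's object store. Nothing is
# persisted on the consumer side.
table_url = profile + "#gl_share.finance.gl_account_master"
df = delta_sharing.load_as_pandas(table_url)
```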
What Delta Sharing has never guaranteed (not in SAP BDC, not in Databricks, not on any platform from any vendor) is the elimination of producer-side preparation steps. Delta Sharing can only share data that already exists in Delta Lake format in an object store. If your data lives in a relational table, a view, a CSV, or any other format, it must be converted to Parquet and Delta Log before Delta Sharing can reference it. That conversion is not duplication in any meaningful business sense. It is a format transformation required by the protocol itself. This same requirement exists on AWS, Azure, GCP, Databricks native, Snowflake, and every other platform implementing Delta Sharing. SAP is not the exception here. SAP is the standard.
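The conversion itself is a small step in code terms. A minimal sketch of what "existing in Delta Lake format" means, assuming a Delta-enabled Spark session and placeholder paths:

```python
# Any tabular dataset, whatever its origin (relational extract, CSV, ...).
df = spark.read.csv("/staging/gl_accounts.csv", header=True)

# A single write produces the Parquet data files plus the _delta_log
# transaction log: the physical layout Delta Sharing serves from.
df.write.format("delta").mode("overwrite").save("s3://my-bucket/gl_account_master")
```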
The producer-side storage pattern (one copy, or two) depends entirely on a separate decision: where your merge and join logic lives. That is what the rest of this post addresses.
Part 2: The Two HDLF Object Stores in SAP BDC
Before explaining the two architecture options, you need a clear mental model of where data physically lives in SAP BDC. There are two distinct HDLF object store instances, and confusing them is a common source of misunderstanding.
The Foundation Services HDLF (FOS) is managed entirely by SAP. It stores SAP-Managed Data Products (the out-of-the-box data products for S/4HANA Finance, Procurement, Sales, HR, and so on). When you activate an Intelligent Application in the BDC Cockpit, SAP automatically populates the relevant data products in the FOS layer. Customers have no direct access to this layer. SAP handles extraction, replication, transformation, and publication invisibly.
The Datasphere HDLF Object Store is managed by the customer within their Datasphere tenant. This is where Customer-Managed Data Products live. Customers configure, populate, transform, and govern this space themselves. It is backed by the same hyperscaler object storage technology as the FOS layer. The difference is ownership, access, and responsibility.
Customer-Managed Data Products can only be published from the Datasphere HDLF object store. They cannot be published directly from the HANA relational space. This is the design constraint that determines everything about the architecture options below.
Part 3: The Real-World Use Case: GL Account Master Data from S/4HANA
The worked example throughout this post is General Ledger Account master data from SAP S/4HANA Private Cloud Edition (PCE), one of the most common CMDP scenarios in enterprise analytics implementations.
In S/4HANA, GL Account master data is stored across two key ABAP tables:
- SKA1: The chart of accounts segment, holding account number, account type, account group, P&L classification, and balance sheet flag. One record per account per chart of accounts.
- SKAT: The text segment, holding account short text and long text per language key. Multiple records per account.
A business-ready GL Account master data product for Databricks needs to join SKA1 and SKAT on account number and chart of accounts key, filter to English language, and project the meaningful analytical attributes into a flat, single-row-per-account dataset.
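In Spark SQL terms, the logical shape of that dataset looks roughly like the sketch below. The column names come from the standard SAP data dictionary, and 'E' is the internal language key for English; whether this logic runs as a CDS view or a Datasphere VIEW is exactly the decision the next two parts work through.

```python
gl_account_master = spark.sql("""
    SELECT
        a.KTOPL AS chart_of_accounts,
        a.SAKNR AS gl_account,
        a.KTOKS AS account_group,
        a.GVTYP AS pl_statement_type,
        a.XBILK AS balance_sheet_flag,
        t.TXT20 AS short_text,
        t.TXT50 AS long_text
    FROM ska1 a
    JOIN skat t
      ON  t.KTOPL = a.KTOPL
      AND t.SAKNR = a.SAKNR
    WHERE t.SPRAS = 'E'   -- English texts only
""")
```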
This join is the “merge” that drives the architecture decision. Where you perform it determines whether you end up with one physical copy or two on the SAP side.
Part 4: Option A: Upstream Merge (One Physical Copy)
The core principle: Push the join logic upstream into S/4HANA as a CDS view. By the time data leaves S/4HANA and arrives in Datasphere, it is already a single, merged, business-ready dataset. Datasphere's Replication Flow lands this dataset directly into an HDLF File Space as a Delta Lake local table. One physical copy. No HANA relational space involved at any point.
This option applies to both S/4HANA Private Cloud Edition (PCE) and Public Cloud Edition (CE).
On Private Cloud Edition, your ABAP team builds an extraction-enabled, CDC-capable custom CDS view (for example, ZI_GL_ACCOUNT_MASTER) that performs the SKA1-SKAT join directly in S/4HANA. The view carries the annotation @Analytics.dataExtraction.enabled: true, making it available in the CDS_EXTRACTION container in Datasphere's Replication Flow. Connectivity from Datasphere into the privately networked S/4HANA system runs via SAP Cloud Connector, a lightweight outbound-only tunnel that requires no inbound firewall rule changes.
On Public Cloud Edition, the same pattern applies. Custom CDS views are created via the Custom CDS View tile in Fiori Launchpad with a YY1_ prefix. No ABAP IDE or developer access is required. Connectivity is a direct HTTPS API call using the ABAP SQL Service communication arrangement, with no Cloud Connector needed.
The data flow for Option A:
- S/4HANA ABAP Layer: Custom CDS view ZI_GL_ACCOUNT_MASTER joins SKA1 and SKAT, filters to language EN, selects analytical columns. Extraction and CDC annotations are activated on the view.
- Replication Flow (Datasphere): Reads ZI_GL_ACCOUNT_MASTER from the CDS_EXTRACTION container. The target is an HDLF File Space (Storage Type: SAP HANA Data Lake Files). Data lands in the HDLF inbound buffer.
- Merge Task: Runs on a schedule to consolidate the inbound buffer into the HDLF local table (file), creating the Parquet files and Delta transaction log. This is the one and only physical copy on the SAP side.
- Transformation Flow (optional): If additional silver layer logic is needed (inactive account filtering, currency enrichment, language fallback), a Spark-based Transformation Flow applies it and writes to a Silver HDLF local table (a sketch of this step follows the list). Still effectively one net dataset if Bronze is transient.
- CMDP Publication: A Data Provider Profile with Formations visibility is created. A Customer-Managed Data Product is defined pointing to the HDLF local table as its Product Artifact. The ORD (Open Resource Discovery) metadata descriptor is published to the BDC catalog.
- Delta Sharing to Enterprise Databricks: BDC Connect (generally available since October 2025) establishes the two-sided trust between BDC and the customer's existing Databricks workspace. mTLS and OIDC handle authentication. The BDC Delta Sharing server issues signed URLs. Databricks reads Parquet files directly from HDLF over HTTPS. Zero copy at the consumer boundary.
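The optional silver-layer step from the flow above, sketched as the Spark logic a Transformation Flow would apply. Datasphere's editor is graphical or SQL rather than raw PySpark, and the hdlf:// paths and column names here are placeholders:

```python
from pyspark.sql import functions as F

bronze = spark.read.format("delta").load("hdlf://.../gl_account_master_bronze")

# Illustrative silver logic: drop accounts flagged for deletion and
# fall back to the short text where no long text exists.
silver = (bronze
    .filter(F.col("deletion_flag") != "X")
    .withColumn("long_text", F.coalesce(F.col("long_text"), F.col("short_text"))))

silver.write.format("delta").mode("overwrite").save("hdlf://.../gl_account_master_silver")
```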
When to choose Option A: New CMDP implementations starting from scratch where the ABAP team can build or extend a CDS view in S/4HANA. High-volume, high-frequency datasets where eliminating the HANA in-memory footprint has meaningful cost impact. Any scenario where you want the leanest possible producer-side architecture.
Part 5: Option B: Datasphere View Merge (Two Physical Copies)
The core principle: Build the join logic as a VIEW inside Datasphere's HANA relational space. This VIEW must be materialized into a physical HANA local table before a Transformation Flow can read it and produce the HDLF Delta Lake file. Two physical copies result on the SAP side: one in HANA relational storage, one in HDLF object storage.
This is the required path when merge logic already lives in a Datasphere HANA VIEW, as it does in most mature Datasphere deployments.
The technical reason this two-copy pattern exists is straightforward: a Datasphere HANA VIEW is a virtual object. It is a saved query definition. It contains no data. A Transformation Flow requires a physical source to read from. The VIEW must therefore be materialized into a physical HANA local table first. Additionally, Replication Flows in Datasphere are designed to read from external source systems (S/4HANA, other ABAP sources, cloud databases). They cannot use a Datasphere HANA VIEW or local table as their source. Directly writing a Datasphere VIEW result to HDLF without materialization is not a supported path today.
This behavior is not unique to SAP. In Oracle, SQL Server, Snowflake, and every relational database system on earth, a view is a virtual object. Materializing before exporting is the standard pattern everywhere.
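Reduced to two generic statements (runnable in any Delta-enabled Spark session, with placeholder names), the pattern is:

```python
# A view is a saved query definition. Creating it stores no data.
spark.sql("""
    CREATE OR REPLACE VIEW v_gl_account_master AS
    SELECT a.*, t.TXT50 AS long_text
    FROM ska1 a JOIN skat t
      ON t.KTOPL = a.KTOPL AND t.SAKNR = a.SAKNR
    WHERE t.SPRAS = 'E'
""")

# Materialization: run the query once and persist the result as a
# physical table. Only now is there something an export can read.
spark.sql("""
    CREATE OR REPLACE TABLE gl_account_master_tab AS
    SELECT * FROM v_gl_account_master
""")
```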
The data flow for Option B:
- S/4HANA ABAP Layer: SKA1 and SKAT are each exposed through their own extraction-enabled CDS views. Two Replication Flows extract the tables separately into Datasphere.
- Datasphere HANA Relational Space (Copy 1): SKA1 and SKAT land as separate HANA local tables. This is the first physical copy of the data on the SAP side.
- Datasphere HANA VIEW (virtual, no additional copy): A graphical or SQL VIEW in Datasphere joins the two local tables on account number and chart of accounts key, filters to language EN, and projects the analytical columns. This VIEW holds no data.
- Materialization to HANA Local Table: The VIEW output is persisted into a new physical HANA local table via a Data Flow or view persistency, producing the merged, flat, single-row-per-account dataset in HANA. Still part of Copy 1 (the HANA relational layer).
- Transformation Flow to HDLF (Copy 2): A Transformation Flow reads the materialized HANA local table and writes the result as a Delta Lake file (Parquet and transaction log) into the Datasphere HDLF File Space. This is the second physical copy on the SAP side (a sketch of this step follows the list).
- CMDP Publication: Same as Option A. Data Provider Profile, CMDP definition, ORD descriptor, BDC catalog publication.
- Delta Sharing to Enterprise Databricks: Identical to Option A from this point. BDC Connect, mTLS and OIDC, signed URLs, direct Parquet reads from HDLF. Zero copy at the consumer boundary.
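The Copy 1 to Copy 2 hop from the flow above, sketched in Spark terms. Datasphere's Transformation Flow performs this internally; the JDBC connection details and every name here are placeholders:

```python
# Copy 1: read the materialized table from the HANA relational layer.
hana_table = (spark.read.format("jdbc")
    .option("url", "jdbc:sap://<hana-host>:<port>")
    .option("dbtable", "MY_SPACE.GL_ACCOUNT_MASTER_TAB")
    .option("user", "<technical_user>")
    .option("password", "<secret>")
    .load())

# Copy 2: persist it as Parquet plus Delta log in the HDLF object
# store, the artifact the CMDP will point at.
hana_table.write.format("delta").mode("overwrite").save("hdlf://.../gl_account_master")
```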
When to choose Option B: When merge and join logic already exists as a Datasphere HANA VIEW and rebuilding it upstream is not justified by the business case. Low-to-medium volume reference data (GL Account master, Cost Center master, Profit Center master) where the HANA memory cost of the materialized table is negligible. When the Datasphere VIEW contains complex business logic (multi-table joins, CASE expressions, aggregations) that would require significant effort to replicate in a CDS view or Spark Transformation Flow.
Part 6: Option A vs. Option B (The Decision Table)
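Condensing the criteria from Parts 4 and 5:
- Physical copies on the SAP side: Option A keeps one (HDLF only); Option B keeps two (HANA relational plus HDLF).
- Location of merge logic: Option A puts it in a CDS view in S/4HANA; Option B keeps it in a VIEW in Datasphere's HANA relational space.
- HANA in-memory footprint: none in Option A; the materialized table in Option B.
- Best fit: Option A for new implementations and high-volume, high-frequency datasets where the leanest producer-side architecture matters; Option B for existing Datasphere VIEW logic, low-to-medium volume reference data, and complex business logic that would be costly to rebuild upstream.
- From CMDP publication onward (Data Provider Profile, ORD descriptor, Delta Sharing): identical in both options.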
Part 7: How Delta Sharing Actually Transits Data to Enterprise Databricks
For completeness, here is the exact connectivity and transit story for Enterprise Databricks (what SAP calls the “brownfield” scenario, meaning the customer's own independently licensed Databricks workspace, not the SAP Databricks bundled inside BDC). This is the part customers ask about most in architecture reviews.
Connecting BDC to Enterprise Databricks via BDC Connect:
BDC Connect became generally available for all three hyperscalers in October 2025. The connection establishment is a one-time, two-sided handshake:
- The Databricks workspace admin generates a connection identifier in Databricks (Data Ingestion > SAP Business Data Cloud tile) and shares it with the SAP BDC admin.
- The SAP BDC admin uses that identifier to create a Third Party Connection in BDC and generates an invitation link.
- The Databricks admin completes the connection using that invitation link. The confirmation message reads: “Connection to SAP Business Data Cloud was successful.”
- BDC is now registered as an authorized Delta Sharing provider in the Databricks Unity Catalog. The connection is bidirectional.
The read path, executed every time Databricks queries a shared table (a protocol-level sketch follows the list):
- Databricks authenticates to the SAP BDC Delta Sharing server using mTLS (mutual Transport Layer Security, where both sides present certificates) and OIDC (OpenID Connect for token-based identity verification).
- The Delta Sharing server validates entitlements against the governance policy, covering which data product, which recipient, and which columns or partitions are authorized.
- The server issues time-limited pre-signed URLs pointing directly to the specific Parquet files in the HDLF object store that constitute the requested table or delta increment.
- Databricks reads those Parquet files directly from HDLF (S3, ADLS Gen2, or GCS) over HTTPS. Data travels encrypted in transit. No SAP middleware sits in this read path.
- The signed URLs expire. No persistent copy of the data is created in Databricks storage.
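For the protocol-curious, the same exchange expressed against the open Delta Sharing REST API. The endpoint and names are placeholders, and a plain bearer token stands in for the mTLS-plus-OIDC handshake that BDC layers on top:

```python
import json
import requests

base = "https://<bdc-sharing-endpoint>/shares"
headers = {"Authorization": "Bearer <short-lived-token>"}

# Ask the sharing server which files make up one shared table.
resp = requests.post(
    f"{base}/gl_share/schemas/finance/tables/gl_account_master/query",
    headers=headers, json={})

# The response is newline-delimited JSON: a protocol line, a metaData
# line, then one "file" object per Parquet file, each carrying a
# time-limited pre-signed URL into the object store.
for line in resp.text.splitlines():
    obj = json.loads(line)
    if "file" in obj:
        parquet_bytes = requests.get(obj["file"]["url"]).content  # direct read, no middleware
```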
This is zero copy in operational terms. The Parquet files never leave the HDLF object store. Whether those Parquet files were produced via Option A (one copy on the SAP side) or Option B (two copies on the SAP side) is entirely irrelevant to this read path. The consumer experience is identical either way.
Part 8: The Reverse Flow: What Happens When Databricks Sends Enriched Data Back to BDC
A question that comes up in almost every customer architecture discussion: when Databricks runs ML models or analytics on top of the CMDP and sends results back to BDC, does it update the original CMDP? Or is it a new data product?
The answer is unambiguous. It is always a brand new, separate Derived Data Product. The original CMDP is never modified, overwritten, or versioned by the reverse flow.
The data scientist publishes the enriched dataset (say, a GL Account clustering result produced by an ML model) back to BDC using the BDC Connect SDK in a Databricks notebook. Three SDK calls are made in sequence: bdc_connect_client.create_or_update_share() registers the share with its ORD metadata, bdc_connect_client.create_or_update_share_csn() defines the data schema using the CSN (Core Schema Notation) format, and bdc_connect_client.publish_data_product() makes the data product discoverable in the BDC catalog. The new Derived Data Product appears in the BDC Cockpit Catalog and Marketplace as a distinct, independently governed entry, discoverable by SAP Analytics Cloud reports, Datasphere models, and other data products.
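In notebook form, the sequence looks roughly as follows. The three method names are the ones described above; every argument shown is illustrative, so treat the signatures as assumptions and check the BDC Connect SDK documentation:

```python
# Hypothetical arguments throughout; the call sequence is the point.

# 1. Register the share and its ORD metadata with BDC.
bdc_connect_client.create_or_update_share(
    share_name="gl_account_clusters",                 # illustrative
    ord_metadata={"title": "GL Account Clusters"})    # illustrative

# 2. Describe the enriched dataset's schema in CSN format.
bdc_connect_client.create_or_update_share_csn(
    share_name="gl_account_clusters",
    csn=gl_clusters_csn)                              # CSN document built earlier

# 3. Publish: the Derived Data Product becomes discoverable in the
#    BDC catalog. The original CMDP is untouched.
bdc_connect_client.publish_data_product(share_name="gl_account_clusters")
```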
The original GL Account master CMDP remains exactly as it was: unchanged, unaware, still serving as a source for other consumers. The Derived Data Product and the original CMDP coexist in the BDC catalog as independent assets with their own lineage, their own governance policies, and their own update schedules. This composability is the foundation of SAP's data product strategy and the direction that Data Product Studio (currently in preview, maturing through 2026) is designed to operationalize at low-code scale.
Closing: The Definitive Framing
The most precise way to describe the SAP BDC zero-copy architecture for Customer-Managed Data Products is this:
SAP BDC delivers genuine zero-copy data sharing to Enterprise Databricks at the consumer boundary via the Delta Sharing protocol. The producer-side pipeline (one physical copy in Option A, two in Option B) prepares the data in Delta Lake format as required by the Delta Sharing protocol. That preparation is not duplication. It is format conversion. It is architecturally equivalent to what every other vendor implementing Delta Sharing requires on every platform, without exception.
The choice between Option A and Option B is not a choice between correct and incorrect architecture. It is a design decision driven by where your merge logic lives and what trade-offs make sense for your specific data product, your S/4HANA deployment type, and your existing Datasphere investment. Both options are fully supported by SAP today. Both deliver genuine zero copy to Databricks. The difference is purely in the producer-side storage footprint and the location of your business logic.
If you are an architect, a CSM, or a customer who has been confused by this, I hope this post has made the picture clear. If your team is navigating the CMDP journey on SAP BDC right now, whether on Private Cloud or Public Cloud, whether starting fresh or working with an existing Datasphere implementation, feel free to reach out in the comments or directly.