Dataflows and Shared Datasets: Architecture

Introduction

As Power BI implementations mature, organizations face challenges with duplicated transformation logic, inconsistent metric definitions, refresh bottlenecks, and an inability to scale semantic models across the enterprise. Dataflows and shared datasets provide the architectural foundation to address these challenges through reusable, governed, and performant data preparation layers.

This guide covers enterprise-grade dataflow and shared dataset architecture, from layered design principles through refresh orchestration, incremental refresh configuration, computed entities, dependency management, cost optimization, migration patterns, and governance frameworks. You'll learn how to build scalable, maintainable BI solutions that eliminate duplication and ensure consistency across your organization.

Prerequisites

  • Power BI Premium or Premium Per User (PPU) for dataflows
  • Understanding of Power Query M language
  • Familiarity with data warehousing concepts (staging, dimensional modeling)
  • Power BI Pro or Premium for shared datasets
  • (Optional) PowerShell for automation and governance
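
The PowerShell snippets later in this guide assume an authenticated session and, in places, a raw bearer token in $accessToken. A minimal setup sketch, assuming the MicrosoftPowerBIMgmt module from the PowerShell Gallery:

# One-time setup for the PowerShell examples in this guide
Install-Module -Name MicrosoftPowerBIMgmt -Scope CurrentUser

# Sign in interactively; use -ServicePrincipal with a credential for unattended automation
Connect-PowerBIServiceAccount

# Capture a bearer token for the scripts that call the REST API via Invoke-RestMethod
# (Get-PowerBIAccessToken -AsString returns "Bearer <token>"; the prefix is stripped here)
$accessToken = (Get-PowerBIAccessToken -AsString) -replace '^Bearer ', ''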

Dataflows vs Shared Datasets vs Traditional Datasets

Architectural Comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Traditional Approach (Monolithic Datasets)                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SQL Server ──→ Dataset A (Sales) ──→ Report 1                  β”‚
β”‚         ↓                                                        β”‚
β”‚         └───→ Dataset B (Sales) ──→ Report 2                    β”‚
β”‚         ↓                                                        β”‚
β”‚         └───→ Dataset C (Sales) ──→ Report 3                    β”‚
β”‚                                                                  β”‚
β”‚ ❌ Problems:                                                    β”‚
β”‚    - Duplicated Power Query logic in each dataset              β”‚
β”‚    - Inconsistent transformations lead to different results    β”‚
β”‚    - 3x refresh load on source database                        β”‚
β”‚    - Difficult to maintain (change requires updating all)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Dataflows + Shared Datasets Architecture                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SQL Server ──→ Dataflow (Sales) ──→ Shared Dataset (Sales)     β”‚
β”‚                                           ↓                      β”‚
β”‚                                           β”œβ”€β”€β†’ Report 1         β”‚
β”‚                                           β”œβ”€β”€β†’ Report 2         β”‚
β”‚                                           └──→ Report 3         β”‚
β”‚                                                                  β”‚
β”‚ βœ… Benefits:                                                    β”‚
β”‚    - Centralized Power Query transformations                   β”‚
β”‚    - Single source of truth for business logic                 β”‚
β”‚    - 1x refresh load on source database                        β”‚
β”‚    - Update once, all reports inherit changes                  β”‚
β”‚    - Dataflow output cached for multiple dataset refreshes     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Feature Comparison

Feature             | Traditional Dataset           | Dataflow                         | Shared Dataset
--------------------|-------------------------------|----------------------------------|------------------------------------
Purpose             | Transform + Model + Visualize | Transform only                   | Model + Visualize (reusable)
Power Query         | ✅ Yes                        | ✅ Yes                           | ✅ Yes (from dataflow or direct)
Data Storage        | In-memory (VertiPaq)          | Azure Data Lake Storage Gen2     | In-memory (VertiPaq)
Reusability         | ❌ Per report                 | ✅ Multiple datasets can consume | ✅ Multiple reports can use
Incremental Refresh | ✅ Yes                        | ✅ Yes                           | ✅ Yes
Computed Entities   | N/A                           | ✅ Yes (referenced entities)     | ✅ Calc tables/columns
License Required    | Pro                           | Premium/PPU                      | Pro (to create), Free (to consume)
Refresh Cost        | Per dataset                   | Once (cached)                    | Per dataset, but data from dataflow
DirectQuery         | ✅ Yes                        | ❌ No (Import only)              | ✅ Yes
Certification       | ❌ No                         | ❌ No                            | ✅ Yes (endorsed dataset)

Layered Dataflow Architecture

Three-Tier Design Pattern

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 1: Ingestion / Bronze Dataflows                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Raw data extraction with minimal transformation        β”‚
β”‚                                                                  β”‚
β”‚ Source Systems ──→ DF_Ingestion_SalesOrders                    β”‚
β”‚                    - Extract raw tables                         β”‚
β”‚                    - Data type conversion only                  β”‚
β”‚                    - No business logic                          β”‚
β”‚                    - Store as-is from source                    β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Orders_Raw                                                  β”‚
β”‚   - Customers_Raw                                               β”‚
β”‚   - Products_Raw                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 2: Standardization / Silver Dataflows                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Conform dimensions, cleanse, standardize              β”‚
β”‚                                                                  β”‚
β”‚ Ingestion DFs ──→ DF_Standardized_Customers                    β”‚
β”‚                   - Merge/deduplicate                           β”‚
β”‚                   - Standardize naming (FirstName, LastName)    β”‚
β”‚                   - Data quality checks                         β”‚
β”‚                   - Conform dimensions (date, geography)        β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Dim_Customer (standardized)                                 β”‚
β”‚   - Dim_Product (standardized)                                  β”‚
β”‚   - Dim_Date (conformed calendar)                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 3: Business / Gold Dataflows                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Business metrics, calculated entities, aggregations    β”‚
β”‚                                                                  β”‚
β”‚ Standardized DFs ──→ DF_Business_SalesMetrics                  β”‚
β”‚                      - Business KPIs (Margin, COGS)             β”‚
β”‚                      - Aggregated summaries                     β”‚
β”‚                      - Enriched with business rules             β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Fact_Sales (with calculated columns)                        β”‚
β”‚   - Agg_SalesByCustomer (pre-aggregated)                        β”‚
β”‚   - Metrics_ProductPerformance                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 4: Shared Datasets                                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Semantic model with relationships, measures, RLS       β”‚
β”‚                                                                  β”‚
β”‚ Business DFs ──→ Dataset_Sales_Certified                        β”‚
β”‚                  - Import from dataflows                        β”‚
β”‚                  - Define relationships                         β”‚
β”‚                  - Create DAX measures                          β”‚
β”‚                  - Apply RLS                                    β”‚
β”‚                  - Certify for enterprise use                   β”‚
β”‚                  ↓                                              β”‚
β”‚                  β”œβ”€β†’ Report A (Sales Overview)                  β”‚
β”‚                  β”œβ”€β†’ Report B (Customer Analysis)               β”‚
β”‚                  └─→ Report C (Product Performance)             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation: Ingestion Layer

// DF_Ingestion_Orders
// Minimal transformation: extract and convert data types only

let
    Source = Sql.Database("sql-server", "SalesDB"),
    Orders_Raw = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // Type conversion only (no business logic)
    TypedColumns = Table.TransformColumnTypes(Orders_Raw, {
        {"OrderID", Int64.Type},
        {"OrderDate", type datetime},
        {"CustomerID", Int64.Type},
        {"Amount", type number},
        {"Status", type text}
    }),
    
    // Add metadata for lineage tracking
    AddedMetadata = Table.AddColumn(TypedColumns, "DataflowRefreshDate", 
        each DateTime.LocalNow(), type datetime)
in
    AddedMetadata

Implementation: Standardization Layer

// DF_Standardized_Customers
// References DF_Ingestion_Customers
// Cleanses, deduplicates, standardizes naming

let
    // Reference ingestion dataflow (linked entity)
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
    Customers_Raw = Dataflow{[entity="Customers_Raw",version=""]}[Data],
    
    // Cleanse: trim whitespace, proper case
    CleanedNames = Table.TransformColumns(Customers_Raw, {
        {"FirstName", Text.Trim, type text},
        {"LastName", Text.Trim, type text},
        {"Email", Text.Lower, type text}  // Standardize email to lowercase
    }),
    
    // Deduplicate based on Email (keep most recent)
    Sorted = Table.Sort(CleanedNames, {{"LastModifiedDate", Order.Descending}}),
    Deduplicated = Table.Distinct(Sorted, {"Email"}),
    
    // Standardize country names (USA, US β†’ United States)
    StandardizedCountry = Table.ReplaceValue(
        Deduplicated,
        each [Country],
        each if [Country] = "USA" or [Country] = "US" then "United States" else [Country],
        Replacer.ReplaceValue,
        {"Country"}
    ),
    
    // Add surrogate key (if source doesn't have one)
    AddedIndex = Table.AddIndexColumn(StandardizedCountry, "CustomerKey", 1, 1),
    
    // Select and reorder columns
    FinalColumns = Table.SelectColumns(AddedIndex, {
        "CustomerKey",
        "CustomerID", 
        "FirstName", 
        "LastName", 
        "Email", 
        "Country", 
        "Region"
    })
in
    FinalColumns

Implementation: Business Layer

// DF_Business_SalesMetrics
// References DF_Standardized_Orders and DF_Standardized_Customers
// Enriches with business calculations

let
    // Reference standardized dataflows
    Orders = PowerBI.Dataflows(...){[entity="Dim_Orders"]}[Data],
    Customers = PowerBI.Dataflows(...){[entity="Dim_Customer"]}[Data],
    
    // Join orders with customer tier (for discount calculation)
    MergedTier = Table.NestedJoin(
        Orders, {"CustomerID"},
        Customers, {"CustomerID"},
        "CustomerInfo", JoinKind.Inner
    ),
    
    ExpandedTier = Table.ExpandTableColumn(MergedTier, "CustomerInfo", 
        {"CustomerTier", "Region"}, {"CustomerTier", "Region"}),
    
    // Business Logic: Calculate discounted amount
    AddDiscountRate = Table.AddColumn(ExpandedTier, "DiscountRate", 
        each if [CustomerTier] = "Gold" then 0.15
             else if [CustomerTier] = "Silver" then 0.10
             else if [CustomerTier] = "Bronze" then 0.05
             else 0.00,
        type number),
    
    AddDiscountedAmount = Table.AddColumn(AddDiscountRate, "DiscountedAmount",
        each [Amount] * (1 - [DiscountRate]), type number),
    
    // Calculate margin (assumes a COGS column already merged in from the product entity; join not shown)
    AddMargin = Table.AddColumn(AddDiscountedAmount, "Margin",
        each [DiscountedAmount] - [COGS], type number),
    
    AddMarginPercent = Table.AddColumn(AddMargin, "MarginPercent",
        each if [DiscountedAmount] <> 0 
             then [Margin] / [DiscountedAmount] 
             else null, type number)
in
    AddMarginPercent

Computed Entities vs Linked Entities

Linked Entities (References)

// Linked Entity: References another dataflow entity without copying data
// Storage: Points to source entity (no data duplication)
// Refresh: Doesn't require re-querying source system

let
    // Links to DF_Ingestion_Orders
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="ingestion-dataflow-guid"]}[Data],
    Orders = Dataflow{[entity="Orders_Raw",version=""]}[Data]
in
    Orders  // This is a LINKED entity (no data copied)

Linked Entity Characteristics:

  • βœ… No data duplication (references source entity)
  • βœ… Faster refresh (no data transfer)
  • βœ… Lower storage costs
  • βœ… Always up-to-date with source entity
  • ❌ Requires source dataflow to be available
  • ❌ Adds dependency between dataflows

Computed Entities

// Computed Entity: Stores transformed data independently
// Storage: Materialized in Azure Data Lake (separate from source)
// Refresh: Queries source and stores result

let
    // References source entity
    Source = PowerBI.Dataflows(...){[entity="Orders_Raw"]}[Data],
    
    // Perform transformations (these materialize the computed entity)
    Filtered = Table.SelectRows(Source, each [OrderDate] >= #date(2024, 1, 1)),
    Aggregated = Table.Group(Filtered, {"CustomerID"}, {
        {"TotalAmount", each List.Sum([Amount]), type number},
        {"OrderCount", each Table.RowCount(_), Int64.Type}
    })
in
    Aggregated  // This is STORED as a computed entity

Computed Entity Characteristics:

  • βœ… Materialized (stored independently)
  • βœ… Decouples from source dataflow
  • βœ… Supports Enhanced Compute Engine for complex transformations
  • βœ… Can be incrementally refreshed
  • ❌ Duplicates data (higher storage cost)
  • ❌ Requires refresh to sync with source

When to Use Each:

Scenario                   | Use Linked Entity | Use Computed Entity
---------------------------|-------------------|--------------------
Simple pass-through        | ✅ Yes            | ❌ No
Complex aggregations/joins | ❌ No             | ✅ Yes
Minimize storage           | ✅ Yes            | ❌ No
Incremental refresh needed | ❌ No             | ✅ Yes
Decouple from source       | ❌ No             | ✅ Yes
Enhanced Compute Engine    | ❌ No             | ✅ Yes

Incremental Refresh Configuration

Dataflow Incremental Refresh Setup

// Step 1: Create RangeStart and RangeEnd parameters (case-sensitive!)
// In Power BI Service: Create these in dataflow settings

RangeStart = #datetime(2020, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]

// Step 2: Filter table using parameters
let
    Source = Sql.Database("server", "DB"),
    Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // Filter by date range (must use parameter names exactly as "RangeStart" and "RangeEnd")
    FilteredRows = Table.SelectRows(Orders, 
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    FilteredRows

Incremental Refresh Policy Configuration (Power BI Service):

Dataflow Settings β†’ Incremental Refresh:

β˜‘ Enable Incremental Refresh
  ☐ Refresh rows where OrderDate is in the last: 7 days
  ☐ Store rows where OrderDate is in the last: 2 years
  β˜‘ Detect data changes: Use OrderDate column
  β˜‘ Only refresh complete days (avoid partial day issues)

Partitions Created (example, as of 2025-08-18):
- Historical: 2023-08-18 to 2025-08-11 (the 2-year store window; static, not refreshed)
- Rolling: 2025-08-11 to 2025-08-18 (the 7-day refresh window; refreshed daily)

Dataset Incremental Refresh from Dataflow

// Dataset consuming dataflow with incremental refresh

// Step 1: Define RangeStart/RangeEnd parameters in dataset
RangeStart = #datetime(2024, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]

// Step 2: Connect to dataflow
let
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
    Orders = Dataflow{[entity="Fact_Sales",version=""]}[Data],
    
    // Apply filter using parameters (enables incremental refresh)
    FilteredRows = Table.SelectRows(Orders,
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    FilteredRows

Result: Dataset only refreshes recent data, while dataflow handles full history.

Incremental Refresh Best Practices

β˜‘ Design Considerations:
  ☐ Use consistent date column (OrderDate, TransactionDate)
  ☐ Ensure date column is indexed in source database
  ☐ Rolling window: 7-30 days (daily refresh needs)
  ☐ Historical window: 2-5 years (long-term storage)
  ☐ Test with "Refresh full data" option before enabling

β˜‘ Performance Optimization:
  ☐ Partition large tables (> 10M rows)
  ☐ Use "Detect data changes" to skip unchanged partitions
  ☐ Enable "Only refresh complete days" to avoid partial data
  ☐ Schedule refreshes during off-peak hours

β˜‘ Monitoring:
  ☐ Track refresh duration (should decrease with incremental)
  ☐ Monitor partition creation (should align with policy)
  ☐ Alert on full refresh (indicates policy issue)
  ☐ Validate historical data integrity regularly
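
For the monitoring items above, refresh history can be pulled and checked on a schedule. A minimal sketch against the dataset refresh-history REST endpoint (workspace and dataset GUIDs are placeholders):

# Pull recent refresh history for a dataset and flag failures or unusually long runs
$workspaceId = "workspace-guid"   # placeholder
$datasetId   = "dataset-guid"     # placeholder

$history = Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/datasets/$datasetId/refreshes?`$top=30" `
    -Method Get | ConvertFrom-Json

foreach ($refresh in $history.value) {
    $minutes = if ($refresh.endTime) {
        (New-TimeSpan -Start $refresh.startTime -End $refresh.endTime).TotalMinutes
    } else { $null }

    if ($refresh.status -eq "Failed") {
        Write-Host "❌ Failed refresh started $($refresh.startTime)" -ForegroundColor Red
    } elseif ($minutes -gt 60) {
        # A sudden jump in duration often means the incremental policy fell back to a full refresh
        Write-Host "⚠️ Refresh started $($refresh.startTime) took $([math]::Round($minutes, 1)) minutes" -ForegroundColor Yellow
    }
}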

Shared Dataset Architecture

Creating Enterprise Shared Datasets

Shared Dataset Design Pattern:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Shared Dataset: "Sales_Certified"                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Data Sources:                                          β”‚
β”‚   β”œβ”€ DF_Business_Sales (Fact_Sales entity)            β”‚
β”‚   β”œβ”€ DF_Standardized_Customers (Dim_Customer entity)  β”‚
β”‚   β”œβ”€ DF_Standardized_Products (Dim_Product entity)    β”‚
β”‚   └─ DF_Standardized_Date (Dim_Date entity)           β”‚
β”‚                                                        β”‚
β”‚ Relationships:                                         β”‚
β”‚   - Fact_Sales[CustomerKey] β†’ Dim_Customer[CustomerKey]β”‚
β”‚   - Fact_Sales[ProductKey] β†’ Dim_Product[ProductKey]  β”‚
β”‚   - Fact_Sales[OrderDateKey] β†’ Dim_Date[DateKey]      β”‚
β”‚                                                        β”‚
β”‚ DAX Measures (50+ enterprise KPIs):                   β”‚
β”‚   - Total Sales                                        β”‚
β”‚   - Total Margin                                       β”‚
β”‚   - YTD Sales                                          β”‚
β”‚   - Sales vs Target                                    β”‚
β”‚   - Customer LTV                                       β”‚
β”‚   - ... (centralized business logic)                  β”‚
β”‚                                                        β”‚
β”‚ Row-Level Security:                                   β”‚
β”‚   - Dynamic RLS by Region                             β”‚
β”‚   - Manager hierarchy security                        β”‚
β”‚                                                        β”‚
β”‚ Status: βœ… Certified Dataset                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓ Build permission (report creators); Viewer access (report consumers)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Thin Reports (Visual Layer Only)                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Report A: Sales Overview Dashboard                    β”‚
β”‚   - Uses measures from certified dataset              β”‚
β”‚   - No duplicate DAX logic                            β”‚
β”‚   - Inherits RLS automatically                        β”‚
β”‚                                                        β”‚
β”‚ Report B: Customer Analysis                           β”‚
β”‚   - Same measures, different visuals                  β”‚
β”‚   - Consistent calculations across org                β”‚
β”‚                                                        β”‚
β”‚ Report C: Product Performance                         β”‚
β”‚   - Uses same semantic model                          β”‚
β”‚   - Governed by central dataset team                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Dataset Certification Process

# Power BI REST API: Certify dataset

$headers = @{
    "Authorization" = "Bearer $accessToken"
    "Content-Type" = "application/json"
}

$certificationBody = @{
    "endorsementDetails" = @{
        "endorsement" = "Certified"  # Promoted, Certified
        "certifiedBy" = "Data Governance Team"
    }
} | ConvertTo-Json

$datasetId = "dataset-guid"

Invoke-RestMethod -Uri "https://api.powerbi.com/v1.0/myorg/datasets/$datasetId/endorsement" `
    -Method Post `
    -Headers $headers `
    -Body $certificationBody

Write-Host "βœ… Dataset certified successfully"

Certification Criteria Checklist:

β˜‘ Data Quality:
  ☐ Data validated against source systems
  ☐ Refresh schedule tested and stable
  ☐ Data quality checks implemented (row counts, null checks)
  ☐ Historical data integrity validated

β˜‘ Semantic Model:
  ☐ Relationships correctly defined (star schema preferred)
  ☐ DAX measures tested and optimized
  ☐ Measure naming convention followed (no spaces, PascalCase)
  ☐ Calculation groups used for time intelligence (if applicable)

β˜‘ Security:
  ☐ Row-level security implemented and tested
  ☐ Object-level security applied (if sensitive columns exist)
  ☐ Build permission granted to report creators only
  ☐ Viewer permission for end users

β˜‘ Documentation:
  ☐ Dataset description updated
  ☐ Measure descriptions added
  ☐ Source dataflow dependencies documented
  ☐ Refresh schedule and SLA documented

β˜‘ Governance:
  ☐ Dataset owner assigned
  ☐ Support contact documented
  ☐ Change management process defined
  ☐ Version control implemented (export .pbix to Git)

Refresh Orchestration

Dependency Management

Refresh Dependency Chain:

1. Ingestion Dataflows (Layer 1)
   DF_Ingestion_Orders      ── Refresh: Daily 2:00 AM ──┐
   DF_Ingestion_Customers   ── Refresh: Daily 2:00 AM ───
   DF_Ingestion_Products    ── Refresh: Daily 2:00 AM ───
                                                          ↓
2. Standardization Dataflows (Layer 2)                   β”‚
   DF_Standardized_Orders   ── Refresh: Daily 3:00 AM ─── (Wait for Layer 1)
   DF_Standardized_Customers── Refresh: Daily 3:00 AM ───
                                                          ↓
3. Business Dataflows (Layer 3)                          β”‚
   DF_Business_SalesMetrics ── Refresh: Daily 4:00 AM ─── (Wait for Layer 2)
                                                          ↓
4. Shared Datasets                                       β”‚
   Dataset_Sales_Certified  ── Refresh: Daily 5:00 AM β”€β”€β”˜ (Wait for Layer 3)
                                     ↓
5. Reports (auto-refresh when dataset refreshes)

PowerShell Orchestration Script

# Orchestrate dataflow and dataset refreshes with dependency checking

function Invoke-DataflowRefresh {
    param(
        [string]$WorkspaceId,
        [string]$DataflowId,
        [string]$DataflowName
    )
    
    Write-Host "πŸ”„ Starting refresh: $DataflowName" -ForegroundColor Cyan
    
    $headers = @{
        "Authorization" = "Bearer $accessToken"
    }
    
    # Trigger refresh
    $refreshResponse = Invoke-RestMethod `
        -Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes" `
        -Method Post `
        -Headers $headers
    
    # Poll refresh status
    do {
        Start-Sleep -Seconds 30
        $status = Invoke-RestMethod `
            -Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes?`$top=1" `
            -Method Get `
            -Headers $headers
        
        $currentStatus = $status.value[0].status
        Write-Host "  Status: $currentStatus"
        
    } while ($currentStatus -eq "InProgress")
    
    if ($currentStatus -eq "Completed") {
        Write-Host "βœ… Refresh completed: $DataflowName" -ForegroundColor Green
        return $true
    } else {
        Write-Host "❌ Refresh failed: $DataflowName - $($status.value[0].error.message)" -ForegroundColor Red
        return $false
    }
}

# Orchestration workflow
$workspaceId = "workspace-guid"

# Layer 1: Ingestion (each refresh is awaited sequentially here; use background jobs to parallelize)
$layer1Success = @(
    Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-orders-guid" -DataflowName "DF_Ingestion_Orders"
    Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-customers-guid" -DataflowName "DF_Ingestion_Customers"
) -notcontains $false

if (-not $layer1Success) {
    Write-Host "❌ Layer 1 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 2: Standardization (depends on Layer 1)
$layer2Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
    -DataflowId "standardized-orders-guid" `
    -DataflowName "DF_Standardized_Orders"

if (-not $layer2Success) {
    Write-Host "❌ Layer 2 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 3: Business (depends on Layer 2)
$layer3Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
    -DataflowId "business-sales-guid" `
    -DataflowName "DF_Business_SalesMetrics"

if (-not $layer3Success) {
    Write-Host "❌ Layer 3 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 4: Dataset refresh (depends on Layer 3)
$datasetId = "sales-certified-dataset-guid"
Write-Host "🔄 Refreshing dataset: Sales_Certified" -ForegroundColor Cyan
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/refreshes" -Method Post

Write-Host "βœ… Full refresh orchestration completed successfully" -ForegroundColor Green

Cost Optimization

Premium Capacity Cost Analysis

Dataflow Storage Costs (Premium):

Factors:
- Storage: Azure Data Lake Storage Gen2 (pay per GB)
- Compute: Premium capacity CU consumption
- Refresh frequency: More refreshes = higher CU usage

Cost Optimization Strategies:

1. Minimize Computed Entities
   ❌ 10 computed entities with duplicated aggregations
   βœ… 1 computed entity with comprehensive aggregations
   Savings: ~40% storage, ~30% refresh time

2. Incremental Refresh
   ❌ Full refresh of 100M row table daily
   βœ… Incremental refresh (7-day window)
   Savings: ~90% refresh time, ~60% CU usage

3. Linked Entities Where Possible
   ❌ Copy data across 5 dataflows (5x storage)
   βœ… Use linked entities (1x storage)
   Savings: ~80% storage cost

4. Archive Old Data
   ❌ Keep 10 years of detailed transactions
   βœ… Aggregate to monthly summaries after 2 years
   Savings: ~70% storage for historical data

5. Optimize Refresh Schedule
   ❌ Hourly refresh for rarely changing dimensions
   βœ… Daily refresh for dimensions, hourly for facts
   Savings: ~50% CU usage

PowerShell Cost Analysis Script

# Analyze dataflow storage and refresh costs

function Get-DataflowCostAnalysis {
    param([string]$WorkspaceId)
    
    $dataflows = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows" -Method Get | ConvertFrom-Json
    
    $analysis = @()
    
    foreach ($df in $dataflows.value) {
        $refreshes = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows/$($df.objectId)/refreshes?`$top=30" -Method Get | ConvertFrom-Json
        
        # Compute average refresh duration (minutes) from start/end times
        $durations = $refreshes.value | Where-Object { $_.endTime } | ForEach-Object {
            (New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes
        }
        $avgRefreshDuration = ($durations | Measure-Object -Average).Average
        $refreshesPerMonth = $refreshes.value.Count * (30 / 7)  # Rough extrapolation: assumes the sampled refreshes span about a week
        
        $analysis += [PSCustomObject]@{
            DataflowName = $df.name
            EntityCount = $df.entities.Count
            AvgRefreshMinutes = [math]::Round($avgRefreshDuration, 2)
            EstimatedMonthlyRefreshHours = [math]::Round(($refreshesPerMonth * $avgRefreshDuration) / 60, 2)
            RefreshFrequency = if ($refreshes.value.Count -gt 0) { "$($refreshes.value.Count) refreshes sampled" } else { "Unknown" }
            EstimatedMonthlyCost = [math]::Round(($refreshesPerMonth * $avgRefreshDuration / 60) * 0.20, 2)  # $0.20 per CU-hour estimate
        }
    }
    
    $analysis | Format-Table -AutoSize
    
    $totalMonthlyCost = ($analysis | Measure-Object -Property EstimatedMonthlyCost -Sum).Sum
    Write-Host "`nπŸ’° Estimated Monthly Cost: `$$totalMonthlyCost" -ForegroundColor Yellow
}

Get-DataflowCostAnalysis -WorkspaceId "workspace-guid"

Migration from Monolithic Datasets

Migration Strategy

Phase 1: Assessment (Week 1-2)
☐ Inventory all existing datasets
☐ Identify duplicated Power Query logic
☐ Document current refresh schedules
☐ Analyze dataset dependencies (which reports use which datasets)
☐ Prioritize datasets for migration (start with most duplicated)

Phase 2: Design Dataflow Architecture (Week 3-4)
☐ Design layered dataflow structure (Ingestion/Standardization/Business)
☐ Define naming conventions
☐ Plan incremental refresh policies
☐ Design shared datasets (one per subject area)
☐ Document dependencies and refresh order

Phase 3: Build Dataflows (Week 5-8)
☐ Create Ingestion dataflows (extract raw data)
☐ Create Standardization dataflows (cleanse, conform)
☐ Create Business dataflows (metrics, aggregations)
☐ Test refresh times and validate data accuracy
☐ Implement incremental refresh where applicable

Phase 4: Build Shared Datasets (Week 9-10)
☐ Create shared datasets consuming business dataflows
☐ Define relationships and DAX measures
☐ Implement RLS
☐ Test with sample reports
☐ Certify datasets

Phase 5: Migrate Reports (Week 11-14)
☐ Identify reports using old datasets
☐ Rebind reports to new shared datasets (see the rebind sketch after this phase list)
☐ Remove duplicate visuals/calculations
☐ Test report functionality
☐ Deploy to production workspace
☐ Deprecate old datasets (after validation period)

Phase 6: Governance & Monitoring (Ongoing)
☐ Monitor refresh success rates
☐ Track cost savings (refresh time, storage)
☐ Collect user feedback
☐ Iterate and optimize
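
The rebind step in Phase 5 can be scripted. A minimal sketch, assuming the report Rebind REST endpoint and placeholder GUIDs:

# Rebind a report from its old monolithic dataset to the new shared dataset
$workspaceId  = "workspace-guid"        # placeholder
$reportId     = "report-guid"           # placeholder
$newDatasetId = "shared-dataset-guid"   # placeholder

$rebindBody = @{ "datasetId" = $newDatasetId } | ConvertTo-Json

Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/reports/$reportId/Rebind" `
    -Method Post `
    -Body $rebindBody

Write-Host "✅ Report rebound to the shared dataset" -ForegroundColor Green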

Migration Script Example

// Old Dataset (Monolithic):
// All logic in single dataset

let
    Source = Sql.Database("server", "DB"),
    Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // All transformations in one place
    Filtered = Table.SelectRows(Orders, each [OrderDate] >= #date(2024, 1, 1)),
    Typed = Table.TransformColumnTypes(Filtered, {{"Amount", type number}}),
    AddedMargin = Table.AddColumn(Typed, "Margin", each [Amount] * 0.30),
    
    // ... 50 more transformation steps ...
    
    Final = AddedMargin
in
    Final

// New Approach (Dataflow + Shared Dataset):

// DATAFLOW: DF_Business_Sales
let
    // Reference standardized dataflow (reusable)
    Source = PowerBI.Dataflows(...){[entity="Fact_Sales"]}[Data]
    
    // Only business-specific logic here
    // (No extraction, cleansing - already done in lower layers)
in
    Source

// SHARED DATASET:
// Consumes DF_Business_Sales
// Adds DAX measures only
// Multiple reports connect to this dataset

Troubleshooting Guide

Issue 1: Dataflow Refresh Failures

Symptoms:

  • Dataflow shows "Refresh failed" status
  • Timeout errors
  • Data inconsistency

Diagnosis:

# Check recent refresh history
$workspaceId = "workspace-guid"
$dataflowId = "dataflow-guid"

$refreshHistory = Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/dataflows/$dataflowId/refreshes?`$top=10" `
    -Method Get | ConvertFrom-Json

$refreshHistory.value | ForEach-Object {
    [PSCustomObject]@{
        StartTime = $_.startTime
        EndTime = $_.endTime
        Status = $_.status
        Duration = if ($_.endTime) { 
            (New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes 
        } else { "In Progress" }
        ErrorMessage = $_.error.message
    }
} | Format-Table -AutoSize

Common Causes & Resolutions:

  1. Query timeout (> 2 hours)

    • Add incremental refresh
    • Optimize Power Query (enable query folding)
    • Partition large tables
  2. Memory limits exceeded

    • Reduce entity size (split into multiple entities)
    • Use computed entities instead of linked (materializes data)
    • Upgrade Premium capacity SKU
  3. Source system unavailable

    • Check gateway connectivity
    • Verify credentials haven't expired
    • Implement retry logic in orchestration
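
For the retry point above, a sketch that wraps the Invoke-DataflowRefresh function from the orchestration script earlier (attempt count and delay are illustrative):

# Retry wrapper: re-attempt a dataflow refresh a few times before giving up
function Invoke-DataflowRefreshWithRetry {
    param(
        [string]$WorkspaceId,
        [string]$DataflowId,
        [string]$DataflowName,
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 300
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $success = Invoke-DataflowRefresh -WorkspaceId $WorkspaceId `
            -DataflowId $DataflowId -DataflowName $DataflowName

        if ($success) { return $true }

        if ($attempt -lt $MaxAttempts) {
            Write-Host "⚠️ Attempt $attempt failed for $DataflowName. Retrying in $DelaySeconds seconds..." -ForegroundColor Yellow
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    return $false
}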

Issue 2: Duplicate Data in Reports

Symptoms:

  • Metrics don't match between reports
  • Same query returns different results
  • "Which number is correct?" from business users

Root Causes:

  • Multiple datasets with different transformation logic
  • Lack of certified shared dataset
  • Inconsistent DAX measure definitions

Resolution:

1. Audit Current State:
   ☐ List all datasets used across organization (see the inventory sketch after this list)
   ☐ Identify duplicate Power Query logic
   ☐ Document different calculation methods

2. Consolidate to Shared Dataset:
   ☐ Create single certified dataset per subject area
   ☐ Migrate all reports to use shared dataset
   ☐ Deprecate old datasets after validation

3. Governance Process:
   ☐ Require approval for new datasets
   ☐ Promote reuse of existing shared datasets
   ☐ Regular audits of dataset proliferation
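
The audit step in the resolution above can be scripted with the admin cmdlets, assuming tenant-admin (Power BI administrator) rights:

# Inventory datasets across the tenant to spot duplicated subject areas
$workspaces = Get-PowerBIWorkspace -Scope Organization -All

$inventory = foreach ($ws in $workspaces) {
    Get-PowerBIDataset -Scope Organization -WorkspaceId $ws.Id | ForEach-Object {
        [PSCustomObject]@{
            Workspace    = $ws.Name
            Dataset      = $_.Name
            ConfiguredBy = $_.ConfiguredBy
        }
    }
}

# Datasets sharing a name across workspaces are candidates for consolidation
$inventory | Group-Object Dataset |
    Where-Object { $_.Count -gt 1 } |
    Sort-Object Count -Descending |
    Format-Table Name, Count -AutoSize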

Issue 3: Slow Dataset Refresh

Symptoms:

  • Dataset refresh takes hours
  • Reports show stale data
  • Capacity overload

Diagnosis:

-- Check dataset refresh duration trend
-- (Assumes a custom DatasetRefreshLog table populated by your refresh-monitoring process)
SELECT 
    DatasetName,
    RefreshDate,
    DATEDIFF(MINUTE, StartTime, EndTime) AS DurationMinutes
FROM DatasetRefreshLog
WHERE DatasetName = 'Sales_Certified'
  AND RefreshDate >= DATEADD(DAY, -30, GETDATE())
ORDER BY RefreshDate DESC;

Optimization Steps:

  1. Enable incremental refresh on dataset
  2. Use dataflows for heavy transformations (pre-process data)
  3. Optimize DAX measures (avoid expensive iterators)
  4. Use aggregations for large fact tables
  5. Schedule refreshes during off-peak hours

Best Practices Checklist

β˜‘ Architecture:
  ☐ Use layered dataflow design (Ingestion/Standardization/Business)
  ☐ Minimize computed entities (prefer linked where possible)
  ☐ Create shared datasets for enterprise metrics
  ☐ Certify datasets used by multiple teams
  ☐ Document dependencies (dataflow β†’ dataset β†’ report lineage)

β˜‘ Performance:
  ☐ Implement incremental refresh for large tables (> 1M rows)
  ☐ Use Enhanced Compute Engine for computed entities
  ☐ Optimize Power Query (enable query folding)
  ☐ Schedule refreshes to respect dependencies
  ☐ Monitor refresh duration trends

β˜‘ Governance:
  ☐ Naming convention: DF_<Layer>_<Domain>_<Purpose>
  ☐ Assign dataflow owners
  ☐ Document refresh SLAs
  ☐ Version control dataflow definitions (export JSON)
  ☐ Regular access reviews (quarterly)
  ☐ Implement change management process
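
The version-control item above can be automated: the Get Dataflow REST endpoint returns the dataflow definition (model.json), which can be committed to Git. A minimal sketch with a placeholder workspace GUID:

# Export each dataflow's definition (model.json) for source control
$workspaceId = "workspace-guid"   # placeholder
New-Item -ItemType Directory -Path ".\dataflow-definitions" -Force | Out-Null

$dataflows = Invoke-PowerBIRestMethod -Url "groups/$workspaceId/dataflows" -Method Get | ConvertFrom-Json

foreach ($df in $dataflows.value) {
    # Get Dataflow returns the definition JSON for a single dataflow
    Invoke-PowerBIRestMethod -Url "groups/$workspaceId/dataflows/$($df.objectId)" -Method Get |
        Out-File -FilePath ".\dataflow-definitions\$($df.name).json" -Encoding utf8
}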

β˜‘ Cost Optimization:
  ☐ Use incremental refresh (reduce CU usage)
  ☐ Linked entities over computed (reduce storage)
  ☐ Archive old data (aggregate to summaries)
  ☐ Optimize refresh schedules (avoid over-refreshing)
  ☐ Monitor Premium capacity utilization

β˜‘ Security:
  ☐ Apply RLS at shared dataset level
  ☐ Use service principals for gateway connections
  ☐ Implement least-privilege access
  ☐ Audit data access logs

Key Takeaways

  • Layered dataflow architecture (Ingestion/Standardization/Business) eliminates duplication and ensures consistency
  • Linked entities avoid data duplication, while computed entities enable complex transformations and incremental refresh
  • Incremental refresh is critical for large tables to reduce refresh time and capacity costs
  • Shared datasets provide single source of truth for enterprise metrics, with RLS and certification
  • Refresh orchestration must respect dependencies between dataflow layers and datasets
  • Cost optimization focuses on minimizing computed entities, enabling incremental refresh, and optimizing schedules
  • Migration from monolithic datasets requires phased approach: assess, design, build, migrate, monitor
  • Governance includes naming conventions, ownership, documentation, and change management
  • Performance monitoring tracks refresh duration, failure rates, and capacity utilization

Next Steps

  1. Assess current dataset landscape (identify duplication)
  2. Design layered dataflow architecture (Ingestion/Standardization/Business)
  3. Create first ingestion dataflow (start simple, single source system)
  4. Build standardization layer (cleanse and conform dimensions)
  5. Implement business layer (add calculated metrics)
  6. Create first shared dataset (import from business dataflow)
  7. Certify dataset and grant Build permissions
  8. Migrate first report to use shared dataset
  9. Implement refresh orchestration script
  10. Monitor refresh success rates and optimize
  11. Document architecture and governance process
  12. Scale to additional subject areas (Sales, Finance, Operations)

Reusable. Governed. Scalable.