Dataflows and Shared Datasets: Architecture

Introduction

As Power BI implementations mature, organizations face challenges with duplicated transformation logic, inconsistent metric definitions, refresh bottlenecks, and an inability to scale semantic models across the enterprise. Dataflows and shared datasets provide the architectural foundation to address these challenges through reusable, governed, and performant data preparation layers.

This guide covers enterprise-grade dataflow and shared dataset architecture, from layered design principles through refresh orchestration, incremental refresh configuration, computed entities, dependency management, cost optimization, migration patterns, and governance frameworks. You'll learn how to build scalable, maintainable BI solutions that eliminate duplication and ensure consistency across your organization.

Prerequisites

  • Power BI Premium or Premium Per User (PPU) for dataflows
  • Understanding of Power Query M language
  • Familiarity with data warehousing concepts (staging, dimensional modeling)
  • Power BI Pro or Premium for shared datasets
  • (Optional) PowerShell for automation and governance
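
The PowerShell snippets later in this guide assume an authenticated session and, in places, a raw bearer token in $accessToken. A minimal setup sketch, assuming the MicrosoftPowerBIMgmt module from the PowerShell Gallery:

# One-time setup for the PowerShell examples in this guide
Install-Module -Name MicrosoftPowerBIMgmt -Scope CurrentUser

# Sign in interactively; use -ServicePrincipal with a credential for unattended automation
Connect-PowerBIServiceAccount

# Capture a bearer token for the scripts that call the REST API via Invoke-RestMethod
# (Get-PowerBIAccessToken -AsString returns "Bearer <token>"; the prefix is stripped here)
$accessToken = (Get-PowerBIAccessToken -AsString) -replace '^Bearer ', ''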

Dataflows vs Shared Datasets vs Traditional Datasets

Architectural Comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Traditional Approach (Monolithic Datasets)                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SQL Server ──→ Dataset A (Sales) ──→ Report 1                  β”‚
β”‚         ↓                                                        β”‚
β”‚         └───→ Dataset B (Sales) ──→ Report 2                    β”‚
β”‚         ↓                                                        β”‚
β”‚         └───→ Dataset C (Sales) ──→ Report 3                    β”‚
β”‚                                                                  β”‚
β”‚ ❌ Problems:                                                    β”‚
β”‚    - Duplicated Power Query logic in each dataset              β”‚
β”‚    - Inconsistent transformations lead to different results    β”‚
β”‚    - 3x refresh load on source database                        β”‚
β”‚    - Difficult to maintain (change requires updating all)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Dataflows + Shared Datasets Architecture                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SQL Server ──→ Dataflow (Sales) ──→ Shared Dataset (Sales)     β”‚
β”‚                                           ↓                      β”‚
β”‚                                           β”œβ”€β”€β†’ Report 1         β”‚
β”‚                                           β”œβ”€β”€β†’ Report 2         β”‚
β”‚                                           └──→ Report 3         β”‚
β”‚                                                                  β”‚
β”‚ βœ… Benefits:                                                    β”‚
β”‚    - Centralized Power Query transformations                   β”‚
β”‚    - Single source of truth for business logic                 β”‚
β”‚    - 1x refresh load on source database                        β”‚
β”‚    - Update once, all reports inherit changes                  β”‚
β”‚    - Dataflow output cached for multiple dataset refreshes     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Feature Comparison

Feature             | Traditional Dataset           | Dataflow                         | Shared Dataset
--------------------|-------------------------------|----------------------------------|------------------------------------
Purpose             | Transform + Model + Visualize | Transform only                   | Model + Visualize (reusable)
Power Query         | ✅ Yes                        | ✅ Yes                           | ✅ Yes (from dataflow or direct)
Data Storage        | In-memory (VertiPaq)          | Azure Data Lake Storage Gen2     | In-memory (VertiPaq)
Reusability         | ❌ Per report                 | ✅ Multiple datasets can consume | ✅ Multiple reports can use
Incremental Refresh | ✅ Yes                        | ✅ Yes                           | ✅ Yes
Computed Entities   | N/A                           | ✅ Yes (referenced entities)     | ✅ Calc tables/columns
License Required    | Pro                           | Premium/PPU                      | Pro (to create), Free (to consume)
Refresh Cost        | Per dataset                   | Once (cached)                    | Per dataset, but data from dataflow
DirectQuery         | ✅ Yes                        | ❌ No (Import only)              | ✅ Yes
Certification       | ❌ No                         | ❌ No                            | ✅ Yes (endorsed dataset)

Layered Dataflow Architecture

Three-Tier Design Pattern

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 1: Ingestion / Bronze Dataflows                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Raw data extraction with minimal transformation        β”‚
β”‚                                                                  β”‚
β”‚ Source Systems ──→ DF_Ingestion_SalesOrders                    β”‚
β”‚                    - Extract raw tables                         β”‚
β”‚                    - Data type conversion only                  β”‚
β”‚                    - No business logic                          β”‚
β”‚                    - Store as-is from source                    β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Orders_Raw                                                  β”‚
β”‚   - Customers_Raw                                               β”‚
β”‚   - Products_Raw                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 2: Standardization / Silver Dataflows                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Conform dimensions, cleanse, standardize              β”‚
β”‚                                                                  β”‚
β”‚ Ingestion DFs ──→ DF_Standardized_Customers                    β”‚
β”‚                   - Merge/deduplicate                           β”‚
β”‚                   - Standardize naming (FirstName, LastName)    β”‚
β”‚                   - Data quality checks                         β”‚
β”‚                   - Conform dimensions (date, geography)        β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Dim_Customer (standardized)                                 β”‚
β”‚   - Dim_Product (standardized)                                  β”‚
β”‚   - Dim_Date (conformed calendar)                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 3: Business / Gold Dataflows                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Business metrics, calculated entities, aggregations    β”‚
β”‚                                                                  β”‚
β”‚ Standardized DFs ──→ DF_Business_SalesMetrics                  β”‚
β”‚                      - Business KPIs (Margin, COGS)             β”‚
β”‚                      - Aggregated summaries                     β”‚
β”‚                      - Enriched with business rules             β”‚
β”‚                                                                  β”‚
β”‚ Example Entities:                                               β”‚
β”‚   - Fact_Sales (with calculated columns)                        β”‚
β”‚   - Agg_SalesByCustomer (pre-aggregated)                        β”‚
β”‚   - Metrics_ProductPerformance                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LAYER 4: Shared Datasets                                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Purpose: Semantic model with relationships, measures, RLS       β”‚
β”‚                                                                  β”‚
β”‚ Business DFs ──→ Dataset_Sales_Certified                        β”‚
β”‚                  - Import from dataflows                        β”‚
β”‚                  - Define relationships                         β”‚
β”‚                  - Create DAX measures                          β”‚
β”‚                  - Apply RLS                                    β”‚
β”‚                  - Certify for enterprise use                   β”‚
β”‚                  ↓                                              β”‚
β”‚                  β”œβ”€β†’ Report A (Sales Overview)                  β”‚
β”‚                  β”œβ”€β†’ Report B (Customer Analysis)               β”‚
β”‚                  └─→ Report C (Product Performance)             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation: Ingestion Layer

// DF_Ingestion_Orders
// Minimal transformation: extract and convert data types only

let
    Source = Sql.Database("sql-server", "SalesDB"),
    Orders_Raw = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // Type conversion only (no business logic)
    TypedColumns = Table.TransformColumnTypes(Orders_Raw, {
        {"OrderID", Int64.Type},
        {"OrderDate", type datetime},
        {"CustomerID", Int64.Type},
        {"Amount", type number},
        {"Status", type text}
    }),
    
    // Add metadata for lineage tracking
    AddedMetadata = Table.AddColumn(TypedColumns, "DataflowRefreshDate", 
        each DateTime.LocalNow(), type datetime)
in
    AddedMetadata

Implementation: Standardization Layer

// DF_Standardized_Customers
// References DF_Ingestion_Customers
// Cleanses, deduplicates, standardizes naming

let
    // Reference ingestion dataflow (linked entity)
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
    Customers_Raw = Dataflow{[entity="Customers_Raw",version=""]}[Data],
    
    // Cleanse: trim whitespace, proper case
    CleanedNames = Table.TransformColumns(Customers_Raw, {
        {"FirstName", Text.Trim, type text},
        {"LastName", Text.Trim, type text},
        {"Email", Text.Lower, type text}  // Standardize email to lowercase
    }),
    
    // Deduplicate based on Email (keep most recent)
    Sorted = Table.Sort(CleanedNames, {{"LastModifiedDate", Order.Descending}}),
    Deduplicated = Table.Distinct(Sorted, {"Email"}),
    
    // Standardize country names (USA, US β†’ United States)
    StandardizedCountry = Table.ReplaceValue(
        Deduplicated,
        each [Country],
        each if [Country] = "USA" or [Country] = "US" then "United States" else [Country],
        Replacer.ReplaceValue,
        {"Country"}
    ),
    
    // Add surrogate key (if source doesn't have one)
    AddedIndex = Table.AddIndexColumn(StandardizedCountry, "CustomerKey", 1, 1),
    
    // Select and reorder columns
    FinalColumns = Table.SelectColumns(AddedIndex, {
        "CustomerKey",
        "CustomerID", 
        "FirstName", 
        "LastName", 
        "Email", 
        "Country", 
        "Region"
    })
in
    FinalColumns

Implementation: Business Layer

// DF_Business_SalesMetrics
// References DF_Standardized_Orders and DF_Standardized_Customers
// Enriches with business calculations

let
    // Reference standardized dataflows
    Orders = PowerBI.Dataflows(...){[entity="Dim_Orders"]}[Data],
    Customers = PowerBI.Dataflows(...){[entity="Dim_Customer"]}[Data],
    
    // Join orders with customer tier (for discount calculation)
    MergedTier = Table.NestedJoin(
        Orders, {"CustomerID"},
        Customers, {"CustomerID"},
        "CustomerInfo", JoinKind.Inner
    ),
    
    ExpandedTier = Table.ExpandTableColumn(MergedTier, "CustomerInfo", 
        {"CustomerTier", "Region"}, {"CustomerTier", "Region"}),
    
    // Business Logic: Calculate discounted amount
    AddDiscountRate = Table.AddColumn(ExpandedTier, "DiscountRate", 
        each if [CustomerTier] = "Gold" then 0.15
             else if [CustomerTier] = "Silver" then 0.10
             else if [CustomerTier] = "Bronze" then 0.05
             else 0.00,
        type number),
    
    AddDiscountedAmount = Table.AddColumn(AddDiscountRate, "DiscountedAmount",
        each [Amount] * (1 - [DiscountRate]), type number),
    
    // Calculate margin (assumes a COGS column already merged in from the product entity; join not shown)
    AddMargin = Table.AddColumn(AddDiscountedAmount, "Margin",
        each [DiscountedAmount] - [COGS], type number),
    
    AddMarginPercent = Table.AddColumn(AddMargin, "MarginPercent",
        each if [DiscountedAmount] <> 0 
             then [Margin] / [DiscountedAmount] 
             else null, type number)
in
    AddMarginPercent

Computed Entities vs Linked Entities

Linked Entities (References)

// Linked Entity: References another dataflow entity without copying data
// Storage: Points to source entity (no data duplication)
// Refresh: Doesn't require re-querying source system

let
    // Links to DF_Ingestion_Orders
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="ingestion-dataflow-guid"]}[Data],
    Orders = Dataflow{[entity="Orders_Raw",version=""]}[Data]
in
    Orders  // This is a LINKED entity (no data copied)

Linked Entity Characteristics:

  • βœ… No data duplication (references source entity)
  • βœ… Faster refresh (no data transfer)
  • βœ… Lower storage costs
  • βœ… Always up-to-date with source entity
  • ❌ Requires source dataflow to be available
  • ❌ Adds dependency between dataflows

Computed Entities

// Computed Entity: Stores transformed data independently
// Storage: Materialized in Azure Data Lake (separate from source)
// Refresh: Queries source and stores result

let
    // References source entity
    Source = PowerBI.Dataflows(...){[entity="Orders_Raw"]}[Data],
    
    // Perform transformations (these materialize the computed entity)
    Filtered = Table.SelectRows(Source, each [OrderDate] >= #date(2024, 1, 1)),
    Aggregated = Table.Group(Filtered, {"CustomerID"}, {
        {"TotalAmount", each List.Sum([Amount]), type number},
        {"OrderCount", each Table.RowCount(_), Int64.Type}
    })
in
    Aggregated  // This is STORED as a computed entity

Computed Entity Characteristics:

  • βœ… Materialized (stored independently)
  • βœ… Decouples from source dataflow
  • βœ… Supports Enhanced Compute Engine for complex transformations
  • βœ… Can be incrementally refreshed
  • ❌ Duplicates data (higher storage cost)
  • ❌ Requires refresh to sync with source

When to Use Each:

Scenario                   | Use Linked Entity | Use Computed Entity
---------------------------|-------------------|--------------------
Simple pass-through        | ✅ Yes            | ❌ No
Complex aggregations/joins | ❌ No             | ✅ Yes
Minimize storage           | ✅ Yes            | ❌ No
Incremental refresh needed | ❌ No             | ✅ Yes
Decouple from source       | ❌ No             | ✅ Yes
Enhanced Compute Engine    | ❌ No             | ✅ Yes

Incremental Refresh Configuration

Dataflow Incremental Refresh Setup

// Step 1: Create RangeStart and RangeEnd parameters (case-sensitive!)
// In Power BI Service: Create these in dataflow settings

RangeStart = #datetime(2020, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]

// Step 2: Filter table using parameters
let
    Source = Sql.Database("server", "DB"),
    Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // Filter by date range (must use parameter names exactly as "RangeStart" and "RangeEnd")
    FilteredRows = Table.SelectRows(Orders, 
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    FilteredRows

Incremental Refresh Policy Configuration (Power BI Service):

Dataflow Settings β†’ Incremental Refresh:

β˜‘ Enable Incremental Refresh
  ☐ Refresh rows where OrderDate is in the last: 7 days
  ☐ Store rows where OrderDate is in the last: 2 years
  β˜‘ Detect data changes: Use OrderDate column
  β˜‘ Only refresh complete days (avoid partial day issues)

Partitions Created (example, as of 2025-08-18):
- Historical: 2023-08-18 to 2025-08-11 (the 2-year store window; static, not refreshed)
- Rolling: 2025-08-11 to 2025-08-18 (the 7-day refresh window; refreshed daily)

Dataset Incremental Refresh from Dataflow

// Dataset consuming dataflow with incremental refresh

// Step 1: Define RangeStart/RangeEnd parameters in dataset
RangeStart = #datetime(2024, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]

// Step 2: Connect to dataflow
let
    Source = PowerBI.Dataflows(null),
    Workspace = Source{[workspaceId="workspace-guid"]}[Data],
    Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
    Orders = Dataflow{[entity="Fact_Sales",version=""]}[Data],
    
    // Apply filter using parameters (enables incremental refresh)
    FilteredRows = Table.SelectRows(Orders,
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    FilteredRows

Result: Dataset only refreshes recent data, while dataflow handles full history.

Incremental Refresh Best Practices

β˜‘ Design Considerations:
  ☐ Use consistent date column (OrderDate, TransactionDate)
  ☐ Ensure date column is indexed in source database
  ☐ Rolling window: 7-30 days (daily refresh needs)
  ☐ Historical window: 2-5 years (long-term storage)
  ☐ Test with "Refresh full data" option before enabling

β˜‘ Performance Optimization:
  ☐ Partition large tables (> 10M rows)
  ☐ Use "Detect data changes" to skip unchanged partitions
  ☐ Enable "Only refresh complete days" to avoid partial data
  ☐ Schedule refreshes during off-peak hours

β˜‘ Monitoring:
  ☐ Track refresh duration (should decrease with incremental)
  ☐ Monitor partition creation (should align with policy)
  ☐ Alert on full refresh (indicates policy issue)
  ☐ Validate historical data integrity regularly
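
For the monitoring items above, refresh history can be pulled and checked on a schedule. A minimal sketch against the dataset refresh-history REST endpoint (workspace and dataset GUIDs are placeholders):

# Pull recent refresh history for a dataset and flag failures or unusually long runs
$workspaceId = "workspace-guid"   # placeholder
$datasetId   = "dataset-guid"     # placeholder

$history = Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/datasets/$datasetId/refreshes?`$top=30" `
    -Method Get | ConvertFrom-Json

foreach ($refresh in $history.value) {
    $minutes = if ($refresh.endTime) {
        (New-TimeSpan -Start $refresh.startTime -End $refresh.endTime).TotalMinutes
    } else { $null }

    if ($refresh.status -eq "Failed") {
        Write-Host "❌ Failed refresh started $($refresh.startTime)" -ForegroundColor Red
    } elseif ($minutes -gt 60) {
        # A sudden jump in duration often means the incremental policy fell back to a full refresh
        Write-Host "⚠️ Refresh started $($refresh.startTime) took $([math]::Round($minutes, 1)) minutes" -ForegroundColor Yellow
    }
}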

Shared Dataset Architecture

Creating Enterprise Shared Datasets

Shared Dataset Design Pattern:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Shared Dataset: "Sales_Certified"                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Data Sources:                                          β”‚
β”‚   β”œβ”€ DF_Business_Sales (Fact_Sales entity)            β”‚
β”‚   β”œβ”€ DF_Standardized_Customers (Dim_Customer entity)  β”‚
β”‚   β”œβ”€ DF_Standardized_Products (Dim_Product entity)    β”‚
β”‚   └─ DF_Standardized_Date (Dim_Date entity)           β”‚
β”‚                                                        β”‚
β”‚ Relationships:                                         β”‚
β”‚   - Fact_Sales[CustomerKey] β†’ Dim_Customer[CustomerKey]β”‚
β”‚   - Fact_Sales[ProductKey] β†’ Dim_Product[ProductKey]  β”‚
β”‚   - Fact_Sales[OrderDateKey] β†’ Dim_Date[DateKey]      β”‚
β”‚                                                        β”‚
β”‚ DAX Measures (50+ enterprise KPIs):                   β”‚
β”‚   - Total Sales                                        β”‚
β”‚   - Total Margin                                       β”‚
β”‚   - YTD Sales                                          β”‚
β”‚   - Sales vs Target                                    β”‚
β”‚   - Customer LTV                                       β”‚
β”‚   - ... (centralized business logic)                  β”‚
β”‚                                                        β”‚
β”‚ Row-Level Security:                                   β”‚
β”‚   - Dynamic RLS by Region                             β”‚
β”‚   - Manager hierarchy security                        β”‚
β”‚                                                        β”‚
β”‚ Status: βœ… Certified Dataset                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓ Build permission (report creators); Viewer access (report consumers)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Thin Reports (Visual Layer Only)                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Report A: Sales Overview Dashboard                    β”‚
β”‚   - Uses measures from certified dataset              β”‚
β”‚   - No duplicate DAX logic                            β”‚
β”‚   - Inherits RLS automatically                        β”‚
β”‚                                                        β”‚
β”‚ Report B: Customer Analysis                           β”‚
β”‚   - Same measures, different visuals                  β”‚
β”‚   - Consistent calculations across org                β”‚
β”‚                                                        β”‚
β”‚ Report C: Product Performance                         β”‚
β”‚   - Uses same semantic model                          β”‚
β”‚   - Governed by central dataset team                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Dataset Certification Process

# Power BI REST API: Certify dataset

$headers = @{
    "Authorization" = "Bearer $accessToken"
    "Content-Type" = "application/json"
}

$certificationBody = @{
    "endorsementDetails" = @{
        "endorsement" = "Certified"  # Promoted, Certified
        "certifiedBy" = "Data Governance Team"
    }
} | ConvertTo-Json

$datasetId = "dataset-guid"

Invoke-RestMethod -Uri "https://api.powerbi.com/v1.0/myorg/datasets/$datasetId/endorsement" `
    -Method Post `
    -Headers $headers `
    -Body $certificationBody

Write-Host "βœ… Dataset certified successfully"

Certification Criteria Checklist:

β˜‘ Data Quality:
  ☐ Data validated against source systems
  ☐ Refresh schedule tested and stable
  ☐ Data quality checks implemented (row counts, null checks)
  ☐ Historical data integrity validated

β˜‘ Semantic Model:
  ☐ Relationships correctly defined (star schema preferred)
  ☐ DAX measures tested and optimized
  ☐ Measure naming convention followed (no spaces, PascalCase)
  ☐ Calculation groups used for time intelligence (if applicable)

β˜‘ Security:
  ☐ Row-level security implemented and tested
  ☐ Object-level security applied (if sensitive columns exist)
  ☐ Build permission granted to report creators only
  ☐ Viewer permission for end users

β˜‘ Documentation:
  ☐ Dataset description updated
  ☐ Measure descriptions added
  ☐ Source dataflow dependencies documented
  ☐ Refresh schedule and SLA documented

β˜‘ Governance:
  ☐ Dataset owner assigned
  ☐ Support contact documented
  ☐ Change management process defined
  ☐ Version control implemented (export .pbix to Git)

Refresh Orchestration

Dependency Management

Refresh Dependency Chain:

1. Ingestion Dataflows (Layer 1)
   DF_Ingestion_Orders      ── Refresh: Daily 2:00 AM ──┐
   DF_Ingestion_Customers   ── Refresh: Daily 2:00 AM ───
   DF_Ingestion_Products    ── Refresh: Daily 2:00 AM ───
                                                          ↓
2. Standardization Dataflows (Layer 2)                   β”‚
   DF_Standardized_Orders   ── Refresh: Daily 3:00 AM ─── (Wait for Layer 1)
   DF_Standardized_Customers── Refresh: Daily 3:00 AM ───
                                                          ↓
3. Business Dataflows (Layer 3)                          β”‚
   DF_Business_SalesMetrics ── Refresh: Daily 4:00 AM ─── (Wait for Layer 2)
                                                          ↓
4. Shared Datasets                                       β”‚
   Dataset_Sales_Certified  ── Refresh: Daily 5:00 AM β”€β”€β”˜ (Wait for Layer 3)
                                     ↓
5. Reports (auto-refresh when dataset refreshes)

PowerShell Orchestration Script

# Orchestrate dataflow and dataset refreshes with dependency checking

function Invoke-DataflowRefresh {
    param(
        [string]$WorkspaceId,
        [string]$DataflowId,
        [string]$DataflowName
    )
    
    Write-Host "πŸ”„ Starting refresh: $DataflowName" -ForegroundColor Cyan
    
    $headers = @{
        "Authorization" = "Bearer $accessToken"
    }
    
    # Trigger refresh
    $refreshResponse = Invoke-RestMethod `
        -Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes" `
        -Method Post `
        -Headers $headers
    
    # Poll refresh status
    do {
        Start-Sleep -Seconds 30
        $status = Invoke-RestMethod `
            -Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes?`$top=1" `
            -Method Get `
            -Headers $headers
        
        $currentStatus = $status.value[0].status
        Write-Host "  Status: $currentStatus"
        
    } while ($currentStatus -eq "InProgress")
    
    if ($currentStatus -eq "Completed") {
        Write-Host "βœ… Refresh completed: $DataflowName" -ForegroundColor Green
        return $true
    } else {
        Write-Host "❌ Refresh failed: $DataflowName - $($status.value[0].error.message)" -ForegroundColor Red
        return $false
    }
}

# Orchestration workflow
$workspaceId = "workspace-guid"

# Layer 1: Ingestion (each refresh is awaited sequentially here; use background jobs to parallelize)
$layer1Success = @(
    Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-orders-guid" -DataflowName "DF_Ingestion_Orders"
    Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-customers-guid" -DataflowName "DF_Ingestion_Customers"
) -notcontains $false

if (-not $layer1Success) {
    Write-Host "❌ Layer 1 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 2: Standardization (depends on Layer 1)
$layer2Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
    -DataflowId "standardized-orders-guid" `
    -DataflowName "DF_Standardized_Orders"

if (-not $layer2Success) {
    Write-Host "❌ Layer 2 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 3: Business (depends on Layer 2)
$layer3Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
    -DataflowId "business-sales-guid" `
    -DataflowName "DF_Business_SalesMetrics"

if (-not $layer3Success) {
    Write-Host "❌ Layer 3 refresh failed. Aborting." -ForegroundColor Red
    exit 1
}

# Layer 4: Dataset refresh (depends on Layer 3)
$datasetId = "sales-certified-dataset-guid"
Write-Host "🔄 Refreshing dataset: Sales_Certified" -ForegroundColor Cyan
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/refreshes" -Method Post

Write-Host "βœ… Full refresh orchestration completed successfully" -ForegroundColor Green

Cost Optimization

Premium Capacity Cost Analysis

Dataflow Storage Costs (Premium):

Factors:
- Storage: Azure Data Lake Storage Gen2 (pay per GB)
- Compute: Premium capacity CU consumption
- Refresh frequency: More refreshes = higher CU usage

Cost Optimization Strategies:

1. Minimize Computed Entities
   ❌ 10 computed entities with duplicated aggregations
   βœ… 1 computed entity with comprehensive aggregations
   Savings: ~40% storage, ~30% refresh time

2. Incremental Refresh
   ❌ Full refresh of 100M row table daily
   βœ… Incremental refresh (7-day window)
   Savings: ~90% refresh time, ~60% CU usage

3. Linked Entities Where Possible
   ❌ Copy data across 5 dataflows (5x storage)
   βœ… Use linked entities (1x storage)
   Savings: ~80% storage cost

4. Archive Old Data
   ❌ Keep 10 years of detailed transactions
   βœ… Aggregate to monthly summaries after 2 years
   Savings: ~70% storage for historical data

5. Optimize Refresh Schedule
   ❌ Hourly refresh for rarely changing dimensions
   βœ… Daily refresh for dimensions, hourly for facts
   Savings: ~50% CU usage

PowerShell Cost Analysis Script

# Analyze dataflow storage and refresh costs

function Get-DataflowCostAnalysis {
    param([string]$WorkspaceId)
    
    $dataflows = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows" -Method Get | ConvertFrom-Json
    
    $analysis = @()
    
    foreach ($df in $dataflows.value) {
        $refreshes = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows/$($df.objectId)/refreshes?`$top=30" -Method Get | ConvertFrom-Json
        
        # Compute average refresh duration (minutes) from start/end times
        $durations = $refreshes.value | Where-Object { $_.endTime } | ForEach-Object {
            (New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes
        }
        $avgRefreshDuration = ($durations | Measure-Object -Average).Average
        $refreshesPerMonth = $refreshes.value.Count * (30 / 7)  # Rough extrapolation: assumes the sampled refreshes span about a week
        
        $analysis += [PSCustomObject]@{
            DataflowName = $df.name
            EntityCount = $df.entities.Count
            AvgRefreshMinutes = [math]::Round($avgRefreshDuration, 2)
            EstimatedMonthlyRefreshHours = [math]::Round(($refreshesPerMonth * $avgRefreshDuration) / 60, 2)
            RefreshFrequency = if ($refreshes.value.Count -gt 0) { "$($refreshes.value.Count) refreshes sampled" } else { "Unknown" }
            EstimatedMonthlyCost = [math]::Round(($refreshesPerMonth * $avgRefreshDuration / 60) * 0.20, 2)  # $0.20 per CU-hour estimate
        }
    }
    
    $analysis | Format-Table -AutoSize
    
    $totalMonthlyCost = ($analysis | Measure-Object -Property EstimatedMonthlyCost -Sum).Sum
    Write-Host "`nπŸ’° Estimated Monthly Cost: `$$totalMonthlyCost" -ForegroundColor Yellow
}

Get-DataflowCostAnalysis -WorkspaceId "workspace-guid"

Migration from Monolithic Datasets

Migration Strategy

Phase 1: Assessment (Week 1-2)
☐ Inventory all existing datasets
☐ Identify duplicated Power Query logic
☐ Document current refresh schedules
☐ Analyze dataset dependencies (which reports use which datasets)
☐ Prioritize datasets for migration (start with most duplicated)

Phase 2: Design Dataflow Architecture (Week 3-4)
☐ Design layered dataflow structure (Ingestion/Standardization/Business)
☐ Define naming conventions
☐ Plan incremental refresh policies
☐ Design shared datasets (one per subject area)
☐ Document dependencies and refresh order

Phase 3: Build Dataflows (Week 5-8)
☐ Create Ingestion dataflows (extract raw data)
☐ Create Standardization dataflows (cleanse, conform)
☐ Create Business dataflows (metrics, aggregations)
☐ Test refresh times and validate data accuracy
☐ Implement incremental refresh where applicable

Phase 4: Build Shared Datasets (Week 9-10)
☐ Create shared datasets consuming business dataflows
☐ Define relationships and DAX measures
☐ Implement RLS
☐ Test with sample reports
☐ Certify datasets

Phase 5: Migrate Reports (Week 11-14)
☐ Identify reports using old datasets
☐ Rebind reports to new shared datasets (see the rebind sketch after this phase list)
☐ Remove duplicate visuals/calculations
☐ Test report functionality
☐ Deploy to production workspace
☐ Deprecate old datasets (after validation period)

Phase 6: Governance & Monitoring (Ongoing)
☐ Monitor refresh success rates
☐ Track cost savings (refresh time, storage)
☐ Collect user feedback
☐ Iterate and optimize
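
The rebind step in Phase 5 can be scripted. A minimal sketch, assuming the report Rebind REST endpoint and placeholder GUIDs:

# Rebind a report from its old monolithic dataset to the new shared dataset
$workspaceId  = "workspace-guid"        # placeholder
$reportId     = "report-guid"           # placeholder
$newDatasetId = "shared-dataset-guid"   # placeholder

$rebindBody = @{ "datasetId" = $newDatasetId } | ConvertTo-Json

Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/reports/$reportId/Rebind" `
    -Method Post `
    -Body $rebindBody

Write-Host "✅ Report rebound to the shared dataset" -ForegroundColor Green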

Migration Script Example

// Old Dataset (Monolithic):
// All logic in single dataset

let
    Source = Sql.Database("server", "DB"),
    Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
    
    // All transformations in one place
    Filtered = Table.SelectRows(Orders, each [OrderDate] >= #date(2024, 1, 1)),
    Typed = Table.TransformColumnTypes(Filtered, {{"Amount", type number}}),
    AddedMargin = Table.AddColumn(Typed, "Margin", each [Amount] * 0.30),
    
    // ... 50 more transformation steps ...
    
    Final = AddedMargin
in
    Final

// New Approach (Dataflow + Shared Dataset):

// DATAFLOW: DF_Business_Sales
let
    // Reference standardized dataflow (reusable)
    Source = PowerBI.Dataflows(...){[entity="Fact_Sales"]}[Data]
    
    // Only business-specific logic here
    // (No extraction, cleansing - already done in lower layers)
in
    Source

// SHARED DATASET:
// Consumes DF_Business_Sales
// Adds DAX measures only
// Multiple reports connect to this dataset

Troubleshooting Guide

Issue 1: Dataflow Refresh Failures

Symptoms:

  • Dataflow shows "Refresh failed" status
  • Timeout errors
  • Data inconsistency

Diagnosis:

# Check recent refresh history
$workspaceId = "workspace-guid"
$dataflowId = "dataflow-guid"

$refreshHistory = Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/dataflows/$dataflowId/refreshes?`$top=10" `
    -Method Get | ConvertFrom-Json

$refreshHistory.value | ForEach-Object {
    [PSCustomObject]@{
        StartTime = $_.startTime
        EndTime = $_.endTime
        Status = $_.status
        Duration = if ($_.endTime) { 
            (New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes 
        } else { "In Progress" }
        ErrorMessage = $_.error.message
    }
} | Format-Table -AutoSize

Common Causes & Resolutions:

  1. Query timeout (> 2 hours)

    • Add incremental refresh
    • Optimize Power Query (enable query folding)
    • Partition large tables
  2. Memory limits exceeded

    • Reduce entity size (split into multiple entities)
    • Use computed entities instead of linked (materializes data)
    • Upgrade Premium capacity SKU
  3. Source system unavailable

    • Check gateway connectivity
    • Verify credentials haven't expired
    • Implement retry logic in orchestration
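
For the retry point above, a sketch that wraps the Invoke-DataflowRefresh function from the orchestration script earlier (attempt count and delay are illustrative):

# Retry wrapper: re-attempt a dataflow refresh a few times before giving up
function Invoke-DataflowRefreshWithRetry {
    param(
        [string]$WorkspaceId,
        [string]$DataflowId,
        [string]$DataflowName,
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 300
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $success = Invoke-DataflowRefresh -WorkspaceId $WorkspaceId `
            -DataflowId $DataflowId -DataflowName $DataflowName

        if ($success) { return $true }

        if ($attempt -lt $MaxAttempts) {
            Write-Host "⚠️ Attempt $attempt failed for $DataflowName. Retrying in $DelaySeconds seconds..." -ForegroundColor Yellow
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    return $false
}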

Issue 2: Duplicate Data in Reports

Symptoms:

  • Metrics don't match between reports
  • Same query returns different results
  • "Which number is correct?" from business users

Root Causes:

  • Multiple datasets with different transformation logic
  • Lack of certified shared dataset
  • Inconsistent DAX measure definitions

Resolution:

1. Audit Current State:
   ☐ List all datasets used across organization (see the inventory sketch after this list)
   ☐ Identify duplicate Power Query logic
   ☐ Document different calculation methods

2. Consolidate to Shared Dataset:
   ☐ Create single certified dataset per subject area
   ☐ Migrate all reports to use shared dataset
   ☐ Deprecate old datasets after validation

3. Governance Process:
   ☐ Require approval for new datasets
   ☐ Promote reuse of existing shared datasets
   ☐ Regular audits of dataset proliferation
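
The audit step in the resolution above can be scripted with the admin cmdlets, assuming tenant-admin (Power BI administrator) rights:

# Inventory datasets across the tenant to spot duplicated subject areas
$workspaces = Get-PowerBIWorkspace -Scope Organization -All

$inventory = foreach ($ws in $workspaces) {
    Get-PowerBIDataset -Scope Organization -WorkspaceId $ws.Id | ForEach-Object {
        [PSCustomObject]@{
            Workspace    = $ws.Name
            Dataset      = $_.Name
            ConfiguredBy = $_.ConfiguredBy
        }
    }
}

# Datasets sharing a name across workspaces are candidates for consolidation
$inventory | Group-Object Dataset |
    Where-Object { $_.Count -gt 1 } |
    Sort-Object Count -Descending |
    Format-Table Name, Count -AutoSize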

Issue 3: Slow Dataset Refresh

Symptoms:

  • Dataset refresh takes hours
  • Reports show stale data
  • Capacity overload

Diagnosis:

-- Check dataset refresh duration trend
-- (Assumes a custom DatasetRefreshLog table populated by your refresh-monitoring process)
SELECT 
    DatasetName,
    RefreshDate,
    DATEDIFF(MINUTE, StartTime, EndTime) AS DurationMinutes
FROM DatasetRefreshLog
WHERE DatasetName = 'Sales_Certified'
  AND RefreshDate >= DATEADD(DAY, -30, GETDATE())
ORDER BY RefreshDate DESC;

Optimization Steps:

  1. Enable incremental refresh on dataset
  2. Use dataflows for heavy transformations (pre-process data)
  3. Optimize DAX measures (avoid expensive iterators)
  4. Use aggregations for large fact tables
  5. Schedule refreshes during off-peak hours

Best Practices Checklist

β˜‘ Architecture:
  ☐ Use layered dataflow design (Ingestion/Standardization/Business)
  ☐ Minimize computed entities (prefer linked where possible)
  ☐ Create shared datasets for enterprise metrics
  ☐ Certify datasets used by multiple teams
  ☐ Document dependencies (dataflow β†’ dataset β†’ report lineage)

β˜‘ Performance:
  ☐ Implement incremental refresh for large tables (> 1M rows)
  ☐ Use Enhanced Compute Engine for computed entities
  ☐ Optimize Power Query (enable query folding)
  ☐ Schedule refreshes to respect dependencies
  ☐ Monitor refresh duration trends

β˜‘ Governance:
  ☐ Naming convention: DF_<Layer>_<Domain>_<Purpose>
  ☐ Assign dataflow owners
  ☐ Document refresh SLAs
  ☐ Version control dataflow definitions (export JSON)
  ☐ Regular access reviews (quarterly)
  ☐ Implement change management process
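
The version-control item above can be automated: the Get Dataflow REST endpoint returns the dataflow definition (model.json), which can be committed to Git. A minimal sketch with a placeholder workspace GUID:

# Export each dataflow's definition (model.json) for source control
$workspaceId = "workspace-guid"   # placeholder
New-Item -ItemType Directory -Path ".\dataflow-definitions" -Force | Out-Null

$dataflows = Invoke-PowerBIRestMethod -Url "groups/$workspaceId/dataflows" -Method Get | ConvertFrom-Json

foreach ($df in $dataflows.value) {
    # Get Dataflow returns the definition JSON for a single dataflow
    Invoke-PowerBIRestMethod -Url "groups/$workspaceId/dataflows/$($df.objectId)" -Method Get |
        Out-File -FilePath ".\dataflow-definitions\$($df.name).json" -Encoding utf8
}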

β˜‘ Cost Optimization:
  ☐ Use incremental refresh (reduce CU usage)
  ☐ Linked entities over computed (reduce storage)
  ☐ Archive old data (aggregate to summaries)
  ☐ Optimize refresh schedules (avoid over-refreshing)
  ☐ Monitor Premium capacity utilization

β˜‘ Security:
  ☐ Apply RLS at shared dataset level
  ☐ Use service principals for gateway connections
  ☐ Implement least-privilege access
  ☐ Audit data access logs

Key Takeaways

  • Layered dataflow architecture (Ingestion/Standardization/Business) eliminates duplication and ensures consistency
  • Linked entities avoid data duplication, while computed entities enable complex transformations and incremental refresh
  • Incremental refresh is critical for large tables to reduce refresh time and capacity costs
  • Shared datasets provide single source of truth for enterprise metrics, with RLS and certification
  • Refresh orchestration must respect dependencies between dataflow layers and datasets
  • Cost optimization focuses on minimizing computed entities, enabling incremental refresh, and optimizing schedules
  • Migration from monolithic datasets requires phased approach: assess, design, build, migrate, monitor
  • Governance includes naming conventions, ownership, documentation, and change management
  • Performance monitoring tracks refresh duration, failure rates, and capacity utilization

Next Steps

  1. Assess current dataset landscape (identify duplication)
  2. Design layered dataflow architecture (Ingestion/Standardization/Business)
  3. Create first ingestion dataflow (start simple, single source system)
  4. Build standardization layer (cleanse and conform dimensions)
  5. Implement business layer (add calculated metrics)
  6. Create first shared dataset (import from business dataflow)
  7. Certify dataset and grant Build permissions
  8. Migrate first report to use shared dataset
  9. Implement refresh orchestration script
  10. Monitor refresh success rates and optimize
  11. Document architecture and governance process
  12. Scale to additional subject areas (Sales, Finance, Operations)

Reusable. Governed. Scalable.