Dataflows and Shared Datasets: Architecture
Introduction
As Power BI implementations mature, organizations face duplicated transformation logic, inconsistent metric definitions, refresh bottlenecks, and difficulty scaling semantic models across the enterprise. Dataflows and shared datasets provide the architectural foundation to address these challenges through reusable, governed, and performant data preparation layers.
This guide covers enterprise-grade dataflow and shared dataset architecture: layered design principles, refresh orchestration, incremental refresh configuration, computed entities, dependency management, cost optimization, migration patterns, and governance. You'll learn how to build scalable, maintainable BI solutions that eliminate duplication and ensure consistency across your organization.
Prerequisites
- Power BI Premium or Premium Per User (PPU) for dataflows
- Understanding of Power Query M language
- Familiarity with data warehousing concepts (staging, dimensional modeling)
- Power BI Pro or Premium for shared datasets
- (Optional) PowerShell for automation and governance
Dataflows vs Shared Datasets vs Traditional Datasets
Architectural Comparison
Traditional Approach (Monolithic Datasets)
──────────────────────────────────────────
SQL Server ──► Dataset A (Sales) ──► Report 1
           ├─► Dataset B (Sales) ──► Report 2
           └─► Dataset C (Sales) ──► Report 3

Problems:
- Duplicated Power Query logic in each dataset
- Inconsistent transformations lead to different results
- 3x refresh load on the source database
- Difficult to maintain (a change requires updating every dataset)

Dataflows + Shared Datasets Architecture
────────────────────────────────────────
SQL Server ──► Dataflow (Sales) ──► Shared Dataset (Sales)
                                        ├─► Report 1
                                        ├─► Report 2
                                        └─► Report 3

Benefits:
- Centralized Power Query transformations
- Single source of truth for business logic
- 1x refresh load on the source database
- Update once, and all reports inherit the change
- Dataflow output is cached and reused across dataset refreshes
Feature Comparison
| Feature | Traditional Dataset | Dataflow | Shared Dataset |
|---|---|---|---|
| Purpose | Transform + Model + Visualize | Transform only | Model + Visualize (reusable) |
| Power Query | ✅ Yes | ✅ Yes | ✅ Yes (from dataflow or direct) |
| Data Storage | In-memory (VertiPaq) | Azure Data Lake Storage Gen2 | In-memory (VertiPaq) |
| Reusability | ❌ Per report | ✅ Multiple datasets can consume | ✅ Multiple reports can use |
| Incremental Refresh | ✅ Yes | ✅ Yes | ✅ Yes |
| Computed Entities | N/A | ✅ Yes (referenced entities) | ✅ Calculated tables/columns |
| License Required | Pro | Premium/PPU | Pro (to create), Free (to consume) |
| Refresh Cost | Per dataset | Once (cached) | Per dataset, but data comes from the dataflow |
| DirectQuery | ✅ Yes | ❌ No (Import only) | ✅ Yes |
| Certification | ❌ No | ❌ No | ✅ Yes (endorsed dataset) |
Layered Dataflow Architecture
Three-Tier Design Pattern
LAYER 1: Ingestion / Bronze Dataflows
─────────────────────────────────────
Purpose: raw data extraction with minimal transformation

Source systems ──► DF_Ingestion_SalesOrders
- Extract raw tables
- Data type conversion only
- No business logic
- Store as-is from source

Example entities: Orders_Raw, Customers_Raw, Products_Raw

LAYER 2: Standardization / Silver Dataflows
───────────────────────────────────────────
Purpose: conform dimensions, cleanse, standardize

Ingestion dataflows ──► DF_Standardized_Customers
- Merge and deduplicate
- Standardize naming (FirstName, LastName)
- Data quality checks
- Conform dimensions (date, geography)

Example entities: Dim_Customer (standardized), Dim_Product (standardized), Dim_Date (conformed calendar)

LAYER 3: Business / Gold Dataflows
──────────────────────────────────
Purpose: business metrics, calculated entities, aggregations

Standardized dataflows ──► DF_Business_SalesMetrics
- Business KPIs (Margin, COGS)
- Aggregated summaries
- Enriched with business rules

Example entities: Fact_Sales (with calculated columns), Agg_SalesByCustomer (pre-aggregated), Metrics_ProductPerformance

LAYER 4: Shared Datasets
────────────────────────
Purpose: semantic model with relationships, measures, and RLS

Business dataflows ──► Dataset_Sales_Certified
- Import from dataflows
- Define relationships
- Create DAX measures
- Apply RLS
- Certify for enterprise use

Dataset_Sales_Certified serves:
  ├─► Report A (Sales Overview)
  ├─► Report B (Customer Analysis)
  └─► Report C (Product Performance)
Implementation: Ingestion Layer
// DF_Ingestion_Orders
// Minimal transformation: extract and convert data types only
let
Source = Sql.Database("sql-server", "SalesDB"),
Orders_Raw = Source{[Schema="dbo",Item="Orders"]}[Data],
// Type conversion only (no business logic)
TypedColumns = Table.TransformColumnTypes(Orders_Raw, {
{"OrderID", Int64.Type},
{"OrderDate", type datetime},
{"CustomerID", Int64.Type},
{"Amount", type number},
{"Status", type text}
}),
// Add metadata for lineage tracking
AddedMetadata = Table.AddColumn(TypedColumns, "DataflowRefreshDate",
each DateTime.LocalNow(), type datetime)
in
AddedMetadata
Implementation: Standardization Layer
// DF_Standardized_Customers
// References DF_Ingestion_Customers
// Cleanses, deduplicates, standardizes naming
let
// Reference ingestion dataflow (linked entity)
Source = PowerBI.Dataflows(null),
Workspace = Source{[workspaceId="workspace-guid"]}[Data],
Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
Customers_Raw = Dataflow{[entity="Customers_Raw",version=""]}[Data],
// Cleanse: trim whitespace, proper case
CleanedNames = Table.TransformColumns(Customers_Raw, {
{"FirstName", Text.Trim, type text},
{"LastName", Text.Trim, type text},
{"Email", Text.Lower, type text} // Standardize email to lowercase
}),
// Deduplicate based on Email (keep most recent)
Sorted = Table.Sort(CleanedNames, {{"LastModifiedDate", Order.Descending}}),
Deduplicated = Table.Distinct(Sorted, {"Email"}),
// Standardize country names (USA, US → United States)
StandardizedCountry = Table.ReplaceValue(
Deduplicated,
each [Country],
each if [Country] = "USA" or [Country] = "US" then "United States" else [Country],
Replacer.ReplaceValue,
{"Country"}
),
// Add surrogate key (if source doesn't have one)
AddedIndex = Table.AddIndexColumn(StandardizedCountry, "CustomerKey", 1, 1),
// Select and reorder columns
FinalColumns = Table.SelectColumns(AddedIndex, {
"CustomerKey",
"CustomerID",
"FirstName",
"LastName",
"Email",
"Country",
"Region"
})
in
FinalColumns
Implementation: Business Layer
// DF_Business_SalesMetrics
// References DF_Standardized_Orders and DF_Standardized_Customers
// Enriches with business calculations
let
// Reference standardized dataflows
Orders = PowerBI.Dataflows(...){[entity="Dim_Orders"]}[Data],
Customers = PowerBI.Dataflows(...){[entity="Dim_Customer"]}[Data],
// Join orders with customer tier (for discount calculation)
MergedTier = Table.NestedJoin(
Orders, {"CustomerID"},
Customers, {"CustomerID"},
"CustomerInfo", JoinKind.Inner
),
ExpandedTier = Table.ExpandTableColumn(MergedTier, "CustomerInfo",
{"CustomerTier", "Region"}, {"CustomerTier", "Region"}),
// Business Logic: Calculate discounted amount
AddDiscountRate = Table.AddColumn(ExpandedTier, "DiscountRate",
each if [CustomerTier] = "Gold" then 0.15
else if [CustomerTier] = "Silver" then 0.10
else if [CustomerTier] = "Bronze" then 0.05
else 0.00,
type number),
AddDiscountedAmount = Table.AddColumn(AddDiscountRate, "DiscountedAmount",
each [Amount] * (1 - [DiscountRate]), type number),
// Calculate margin (requires COGS from product table)
AddMargin = Table.AddColumn(AddDiscountedAmount, "Margin",
each [DiscountedAmount] - [COGS], type number),
AddMarginPercent = Table.AddColumn(AddMargin, "MarginPercent",
each if [DiscountedAmount] <> 0
then [Margin] / [DiscountedAmount]
else null, type number)
in
AddMarginPercent
Computed Entities vs Linked Entities
Linked Entities (References)
// Linked Entity: References another dataflow entity without copying data
// Storage: Points to source entity (no data duplication)
// Refresh: Doesn't require re-querying source system
let
// Links to DF_Ingestion_Orders
Source = PowerBI.Dataflows(null),
Workspace = Source{[workspaceId="workspace-guid"]}[Data],
Dataflow = Workspace{[dataflowId="ingestion-dataflow-guid"]}[Data],
Orders = Dataflow{[entity="Orders_Raw",version=""]}[Data]
in
Orders // This is a LINKED entity (no data copied)
Linked Entity Characteristics:
- ✅ No data duplication (references the source entity)
- ✅ Faster refresh (no data transfer)
- ✅ Lower storage costs
- ✅ Always up to date with the source entity
- ❌ Requires the source dataflow to be available
- ❌ Adds a dependency between dataflows
Computed Entities
// Computed Entity: Stores transformed data independently
// Storage: Materialized in Azure Data Lake (separate from source)
// Refresh: Queries source and stores result
let
// References source entity
Source = PowerBI.Dataflows(...){[entity="Orders_Raw"]}[Data],
// Perform transformations (these materialize the computed entity)
Filtered = Table.SelectRows(Source, each [OrderDate] >= #date(2024, 1, 1)),
Aggregated = Table.Group(Filtered, {"CustomerID"}, {
{"TotalAmount", each List.Sum([Amount]), type number},
{"OrderCount", each Table.RowCount(_), Int64.Type}
})
in
Aggregated // This is STORED as a computed entity
Computed Entity Characteristics:
- ✅ Materialized (stored independently)
- ✅ Decoupled from the source dataflow
- ✅ Supports the Enhanced Compute Engine for complex transformations
- ✅ Can be incrementally refreshed
- ❌ Duplicates data (higher storage cost)
- ❌ Requires a refresh to stay in sync with the source
When to Use Each:
| Scenario | Use Linked Entity | Use Computed Entity |
|---|---|---|
| Simple pass-through | ✅ Yes | ❌ No |
| Complex aggregations/joins | ❌ No | ✅ Yes |
| Minimize storage | ✅ Yes | ❌ No |
| Incremental refresh needed | ❌ No | ✅ Yes |
| Decouple from source | ❌ No | ✅ Yes |
| Enhanced Compute Engine | ❌ No | ✅ Yes |
Incremental Refresh Configuration
Dataflow Incremental Refresh Setup
// Step 1: Create RangeStart and RangeEnd parameters (case-sensitive!)
// In Power BI Service: Create these in dataflow settings
RangeStart = #datetime(2020, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]
// Step 2: Filter table using parameters
let
Source = Sql.Database("server", "DB"),
Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
// Filter by date range (must use parameter names exactly as "RangeStart" and "RangeEnd")
FilteredRows = Table.SelectRows(Orders,
each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
FilteredRows
Incremental Refresh Policy Configuration (Power BI Service):
Dataflow Settings → Incremental Refresh:
- Enable incremental refresh
- Refresh rows where OrderDate is in the last: 7 days
- Store rows where OrderDate is in the last: 2 years
- Detect data changes: use the OrderDate column
- Only refresh complete days (avoids partial-day issues)
Partitions created (example, policy evaluated on 2025-08-18):
- Historical: 2023-08-18 to 2025-08-11 (static, not refreshed)
- Rolling: 2025-08-11 to 2025-08-18 (refreshed daily)
Dataset Incremental Refresh from Dataflow
// Dataset consuming dataflow with incremental refresh
// Step 1: Define RangeStart/RangeEnd parameters in dataset
RangeStart = #datetime(2024, 1, 1, 0, 0, 0) meta [IsParameterQuery=true, Type="DateTime"]
RangeEnd = #datetime(2025, 12, 31, 23, 59, 59) meta [IsParameterQuery=true, Type="DateTime"]
// Step 2: Connect to dataflow
let
Source = PowerBI.Dataflows(null),
Workspace = Source{[workspaceId="workspace-guid"]}[Data],
Dataflow = Workspace{[dataflowId="dataflow-guid"]}[Data],
Orders = Dataflow{[entity="Fact_Sales",version=""]}[Data],
// Apply filter using parameters (enables incremental refresh)
FilteredRows = Table.SelectRows(Orders,
each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
FilteredRows
Result: Dataset only refreshes recent data, while dataflow handles full history.
Incremental Refresh Best Practices
Design Considerations:
- Use a consistent date column (OrderDate, TransactionDate)
- Ensure the date column is indexed in the source database
- Rolling window: 7-30 days (for daily refresh needs)
- Historical window: 2-5 years (long-term storage)
- Test with a full data refresh before enabling the policy (see the sketch below this list)
Performance Optimization:
- Partition large tables (> 10M rows)
- Use "Detect data changes" to skip unchanged partitions
- Enable "Only refresh complete days" to avoid partial data
- Schedule refreshes during off-peak hours
Monitoring:
- Track refresh duration (it should decrease with incremental refresh)
- Monitor partition creation (it should align with the policy)
- Alert on unexpected full refreshes (they indicate a policy issue)
- Validate historical data integrity regularly
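The "test with a full data refresh first" step can be scripted against a dataset. The sketch below is a minimal example rather than the only approach: it assumes the enhanced refresh REST API (body properties type, commitMode, applyRefreshPolicy) and placeholder workspace/dataset GUIDs. Setting applyRefreshPolicy to false forces a one-off full load that ignores the incremental policy; the second call runs a normal policy-driven refresh for comparison.
# Hedged sketch: exercise a dataset's incremental refresh policy via the enhanced refresh API.
# Assumes MicrosoftPowerBIMgmt is installed and you are signed in (Connect-PowerBIServiceAccount).
$workspaceId = "workspace-guid"   # placeholder
$datasetId   = "dataset-guid"     # placeholder
# 1) One-off full load that ignores the incremental refresh policy (use before enabling the policy)
$fullBody = @{
    type               = "Full"
    commitMode         = "transactional"
    applyRefreshPolicy = $false
} | ConvertTo-Json
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/refreshes" `
    -Method Post -Body $fullBody
# 2) Normal policy-driven refresh (only the rolling window is reprocessed)
$policyBody = @{
    type       = "Full"
    commitMode = "transactional"
} | ConvertTo-Json
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/refreshes" `
    -Method Post -Body $policyBody
# Compare the two durations in refresh history; the policy-driven run should be much shorter.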
Shared Dataset Architecture
Creating Enterprise Shared Datasets
Shared Dataset Design Pattern:
Shared Dataset: "Sales_Certified"
─────────────────────────────────
Data sources:
- DF_Business_Sales (Fact_Sales entity)
- DF_Standardized_Customers (Dim_Customer entity)
- DF_Standardized_Products (Dim_Product entity)
- DF_Standardized_Date (Dim_Date entity)

Relationships:
- Fact_Sales[CustomerKey] → Dim_Customer[CustomerKey]
- Fact_Sales[ProductKey] → Dim_Product[ProductKey]
- Fact_Sales[OrderDateKey] → Dim_Date[DateKey]

DAX measures (50+ enterprise KPIs):
- Total Sales
- Total Margin
- YTD Sales
- Sales vs Target
- Customer LTV
- ... (centralized business logic)

Row-level security:
- Dynamic RLS by Region
- Manager hierarchy security

Status: Certified dataset

        │ Build permission (Viewer role)
        ▼

Thin Reports (Visual Layer Only)
────────────────────────────────
Report A: Sales Overview Dashboard
- Uses measures from the certified dataset
- No duplicate DAX logic
- Inherits RLS automatically

Report B: Customer Analysis
- Same measures, different visuals
- Consistent calculations across the organization

Report C: Product Performance
- Uses the same semantic model
- Governed by the central dataset team
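Report authors need Build permission on the certified dataset before they can create thin reports against it. The following is a minimal sketch, assuming the dataset users REST endpoint with identifier/principalType/datasetUserAccessRight body fields and that ReadExplore corresponds to Read + Build; the GUIDs and UPN are placeholders.
# Hedged sketch: grant Build (Read + Explore) permission on a shared dataset to a report author.
# Assumes an authenticated session (Connect-PowerBIServiceAccount) and placeholder IDs.
$workspaceId  = "workspace-guid"             # placeholder
$datasetId    = "dataset-guid"               # placeholder
$reportAuthor = "report.author@contoso.com"  # placeholder UPN
$body = @{
    identifier             = $reportAuthor
    principalType          = "User"          # User | Group | App
    datasetUserAccessRight = "ReadExplore"   # Read + Build
} | ConvertTo-Json
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/users" `
    -Method Post -Body $body
Write-Host "Granted Build permission on dataset $datasetId to $reportAuthor"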
Dataset Certification Process
# Power BI REST API: Certify dataset
$headers = @{
"Authorization" = "Bearer $accessToken"
"Content-Type" = "application/json"
}
$certificationBody = @{
"endorsementDetails" = @{
"endorsement" = "Certified" # Promoted, Certified
"certifiedBy" = "Data Governance Team"
}
} | ConvertTo-Json
$datasetId = "dataset-guid"
Invoke-RestMethod -Uri "https://api.powerbi.com/v1.0/myorg/datasets/$datasetId/endorsement" `
-Method Post `
-Headers $headers `
-Body $certificationBody
Write-Host "Dataset certified successfully"
Certification Criteria Checklist:
Data Quality:
- Data validated against source systems (see the validation sketch after this checklist)
- Refresh schedule tested and stable
- Data quality checks implemented (row counts, null checks)
- Historical data integrity validated
Semantic Model:
- Relationships correctly defined (star schema preferred)
- DAX measures tested and optimized
- Measure naming convention followed (no spaces, PascalCase)
- Calculation groups used for time intelligence (if applicable)
Security:
- Row-level security implemented and tested
- Object-level security applied (if sensitive columns exist)
- Build permission granted to report creators only
- Viewer permission for end users
Documentation:
- Dataset description updated
- Measure descriptions added
- Source dataflow dependencies documented
- Refresh schedule and SLA documented
Governance:
- Dataset owner assigned
- Support contact documented
- Change management process defined
- Version control implemented (export .pbix to Git)
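The row-count and null checks above can be automated against the published dataset. One possible sketch, assuming the executeQueries REST endpoint, the SqlServer module's Invoke-Sqlcmd for the source-side count, and placeholder names (dataset-guid, Fact_Sales, the sql-server/SalesDB source):
# Hedged sketch: compare a source row count with the published dataset via a DAX query.
# Assumes Connect-PowerBIServiceAccount has run and the SqlServer module is installed.
$datasetId = "dataset-guid"   # placeholder
# 1) Row count in the dataset (DAX via executeQueries)
$daxBody = @{
    queries = @(@{ query = 'EVALUATE ROW("RowCount", COUNTROWS(Fact_Sales))' })
} | ConvertTo-Json -Depth 5
$daxResult = Invoke-PowerBIRestMethod -Url "datasets/$datasetId/executeQueries" `
    -Method Post -Body $daxBody | ConvertFrom-Json
$datasetCount = $daxResult.results[0].tables[0].rows[0].'[RowCount]'
# 2) Row count in the source system
# (scope the WHERE clause to the same date window the dataset stores when incremental refresh is enabled)
$sourceCount = (Invoke-Sqlcmd -ServerInstance "sql-server" -Database "SalesDB" `
    -Query "SELECT COUNT(*) AS RowCount FROM dbo.Orders").RowCount
# 3) Compare and report
if ($datasetCount -eq $sourceCount) {
    Write-Host "Row counts match: $datasetCount"
} else {
    Write-Host "Row count mismatch: dataset=$datasetCount source=$sourceCount" -ForegroundColor Red
}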
Refresh Orchestration
Dependency Management
Refresh Dependency Chain:
1. Ingestion dataflows (Layer 1), refreshed daily at 2:00 AM
   - DF_Ingestion_Orders
   - DF_Ingestion_Customers
   - DF_Ingestion_Products
        ▼
2. Standardization dataflows (Layer 2), refreshed daily at 3:00 AM (waits for Layer 1)
   - DF_Standardized_Orders
   - DF_Standardized_Customers
        ▼
3. Business dataflows (Layer 3), refreshed daily at 4:00 AM (waits for Layer 2)
   - DF_Business_SalesMetrics
        ▼
4. Shared datasets, refreshed daily at 5:00 AM (waits for Layer 3)
   - Dataset_Sales_Certified
        ▼
5. Reports (updated automatically when the dataset refreshes)
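Before scripting the chain, it is worth confirming which datasets actually depend on which dataflows. A sketch of one approach, assuming the workspace-level upstreamDataflows endpoint and its datasetObjectId/dataflowObjectId fields; the workspace GUID is a placeholder.
# Hedged sketch: print dataflow -> dataset dependencies for a workspace.
# Assumes an authenticated MicrosoftPowerBIMgmt session.
$workspaceId = "workspace-guid"   # placeholder
# Dataset -> upstream dataflow links in the workspace
$links = Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/upstreamDataflows" `
    -Method Get | ConvertFrom-Json
# Resolve friendly names for datasets and dataflows
$datasets  = (Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets"  -Method Get | ConvertFrom-Json).value
$dataflows = (Invoke-PowerBIRestMethod -Url "groups/$workspaceId/dataflows" -Method Get | ConvertFrom-Json).value
$links.value | ForEach-Object {
    $dfId = $_.dataflowObjectId
    $dsId = $_.datasetObjectId
    [PSCustomObject]@{
        Dataflow = ($dataflows | Where-Object { $_.objectId -eq $dfId }).name
        Dataset  = ($datasets  | Where-Object { $_.id -eq $dsId }).name
    }
} | Sort-Object Dataflow | Format-Table -AutoSize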
PowerShell Orchestration Script
# Orchestrate dataflow and dataset refreshes with dependency checking
function Invoke-DataflowRefresh {
param(
[string]$WorkspaceId,
[string]$DataflowId,
[string]$DataflowName
)
Write-Host "Starting refresh: $DataflowName" -ForegroundColor Cyan
$headers = @{
"Authorization" = "Bearer $accessToken"
}
# Trigger refresh
$refreshResponse = Invoke-RestMethod `
-Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes" `
-Method Post `
-Headers $headers
# Poll refresh status
do {
Start-Sleep -Seconds 30
$status = Invoke-RestMethod `
-Uri "https://api.powerbi.com/v1.0/myorg/groups/$WorkspaceId/dataflows/$DataflowId/refreshes?`$top=1" `
-Method Get `
-Headers $headers
$currentStatus = $status.value[0].status
Write-Host " Status: $currentStatus"
} while ($currentStatus -eq "InProgress")
if ($currentStatus -eq "Completed") {
Write-Host "Refresh completed: $DataflowName" -ForegroundColor Green
return $true
} else {
Write-Host "Refresh failed: $DataflowName - $($status.value[0].error.message)" -ForegroundColor Red
return $false
}
}
# Orchestration workflow
$workspaceId = "workspace-guid"
# Layer 1: Ingestion (runs sequentially here; use background jobs to parallelize)
$layer1Success = @(
Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-orders-guid" -DataflowName "DF_Ingestion_Orders"
Invoke-DataflowRefresh -WorkspaceId $workspaceId -DataflowId "ingestion-customers-guid" -DataflowName "DF_Ingestion_Customers"
) -notcontains $false
if (-not $layer1Success) {
Write-Host "Layer 1 refresh failed. Aborting." -ForegroundColor Red
exit 1
}
# Layer 2: Standardization (depends on Layer 1)
$layer2Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
-DataflowId "standardized-orders-guid" `
-DataflowName "DF_Standardized_Orders"
if (-not $layer2Success) {
Write-Host "Layer 2 refresh failed. Aborting." -ForegroundColor Red
exit 1
}
# Layer 3: Business (depends on Layer 2)
$layer3Success = Invoke-DataflowRefresh -WorkspaceId $workspaceId `
-DataflowId "business-sales-guid" `
-DataflowName "DF_Business_SalesMetrics"
if (-not $layer3Success) {
Write-Host "Layer 3 refresh failed. Aborting." -ForegroundColor Red
exit 1
}
# Layer 4: Dataset refresh (depends on Layer 3)
$datasetId = "dataset-guid"
Write-Host "Refreshing dataset: Sales_Certified" -ForegroundColor Cyan
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$datasetId/refreshes" -Method Post
Write-Host "Full refresh orchestration completed successfully" -ForegroundColor Green
Cost Optimization
Premium Capacity Cost Analysis
Dataflow Storage Costs (Premium):
Factors:
- Storage: Azure Data Lake Storage Gen2 (pay per GB)
- Compute: Premium capacity CU consumption
- Refresh frequency: More refreshes = higher CU usage
Cost Optimization Strategies:
1. Minimize computed entities
   ❌ 10 computed entities with duplicated aggregations
   ✅ 1 computed entity with comprehensive aggregations
   Savings: ~40% storage, ~30% refresh time
2. Incremental refresh
   ❌ Full refresh of a 100M-row table daily
   ✅ Incremental refresh (7-day window)
   Savings: ~90% refresh time, ~60% CU usage
3. Linked entities where possible
   ❌ Copying data across 5 dataflows (5x storage)
   ✅ Linked entities (1x storage)
   Savings: ~80% storage cost
4. Archive old data
   ❌ Keeping 10 years of detailed transactions
   ✅ Aggregating to monthly summaries after 2 years
   Savings: ~70% storage for historical data
5. Optimize the refresh schedule
   ❌ Hourly refresh for rarely changing dimensions
   ✅ Daily refresh for dimensions, hourly for facts
   Savings: ~50% CU usage
PowerShell Cost Analysis Script
# Analyze dataflow storage and refresh costs
function Get-DataflowCostAnalysis {
param([string]$WorkspaceId)
$dataflows = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows" -Method Get | ConvertFrom-Json
$analysis = @()
foreach ($df in $dataflows.value) {
$refreshes = Invoke-PowerBIRestMethod -Url "groups/$WorkspaceId/dataflows/$($df.objectId)/refreshes?`$top=30" -Method Get | ConvertFrom-Json
$avgRefreshDuration = ($refreshes.value | Measure-Object -Property durationInMs -Average).Average / 1000 / 60 # minutes
$refreshesPerMonth = $refreshes.value.Count * (30 / 7) # Extrapolate to monthly
$analysis += [PSCustomObject]@{
DataflowName = $df.name
EntityCount = $df.entities.Count
AvgRefreshMinutes = [math]::Round($avgRefreshDuration, 2)
EstimatedMonthlyRefreshHours = [math]::Round(($refreshesPerMonth * $avgRefreshDuration) / 60, 2)
RefreshFrequency = if ($refreshes.value.Count -gt 0) { "$($refreshes.value.Count) in last 7 days" } else { "Unknown" }
EstimatedMonthlyCost = [math]::Round(($refreshesPerMonth * $avgRefreshDuration / 60) * 0.20, 2) # $0.20 per CU-hour estimate
}
}
$analysis | Format-Table -AutoSize
$totalMonthlyCost = ($analysis | Measure-Object -Property EstimatedMonthlyCost -Sum).Sum
Write-Host "`nEstimated Monthly Cost: `$$totalMonthlyCost" -ForegroundColor Yellow
}
Get-DataflowCostAnalysis -WorkspaceId "workspace-guid"
Migration from Monolithic Datasets
Migration Strategy
Phase 1: Assessment (Weeks 1-2)
- Inventory all existing datasets
- Identify duplicated Power Query logic
- Document current refresh schedules
- Analyze dataset dependencies (which reports use which datasets)
- Prioritize datasets for migration (start with the most duplicated)
Phase 2: Design the Dataflow Architecture (Weeks 3-4)
- Design the layered dataflow structure (Ingestion/Standardization/Business)
- Define naming conventions
- Plan incremental refresh policies
- Design shared datasets (one per subject area)
- Document dependencies and refresh order
Phase 3: Build Dataflows (Weeks 5-8)
- Create ingestion dataflows (extract raw data)
- Create standardization dataflows (cleanse, conform)
- Create business dataflows (metrics, aggregations)
- Test refresh times and validate data accuracy
- Implement incremental refresh where applicable
Phase 4: Build Shared Datasets (Weeks 9-10)
- Create shared datasets consuming the business dataflows
- Define relationships and DAX measures
- Implement RLS
- Test with sample reports
- Certify datasets
Phase 5: Migrate Reports (Weeks 11-14)
- Identify reports using the old datasets
- Rebind reports to the new shared datasets (see the rebind sketch after this plan)
- Remove duplicate visuals/calculations
- Test report functionality
- Deploy to the production workspace
- Deprecate old datasets (after a validation period)
Phase 6: Governance & Monitoring (Ongoing)
- Monitor refresh success rates
- Track cost savings (refresh time, storage)
- Collect user feedback
- Iterate and optimize
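The rebind step in Phase 5 can be automated with the report Rebind REST API. A minimal sketch, assuming placeholder GUIDs for the workspace, the old monolithic dataset, and the new certified dataset:
# Hedged sketch: rebind every report that uses the old monolithic dataset to the new shared dataset.
# Assumes an authenticated MicrosoftPowerBIMgmt session and that the new dataset exposes the
# fields and measures the reports expect.
$workspaceId  = "workspace-guid"    # placeholder
$oldDatasetId = "old-dataset-guid"  # placeholder
$newDatasetId = "new-dataset-guid"  # placeholder (certified shared dataset)
# Find reports in the workspace still bound to the old dataset
$reports   = (Invoke-PowerBIRestMethod -Url "groups/$workspaceId/reports" -Method Get | ConvertFrom-Json).value
$toMigrate = $reports | Where-Object { $_.datasetId -eq $oldDatasetId }
foreach ($report in $toMigrate) {
    $body = @{ datasetId = $newDatasetId } | ConvertTo-Json
    Invoke-PowerBIRestMethod -Url "groups/$workspaceId/reports/$($report.id)/Rebind" `
        -Method Post -Body $body
    Write-Host "Rebound report '$($report.name)' to the shared dataset"
}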
Migration Script Example
// Old Dataset (Monolithic):
// All logic in single dataset
let
Source = Sql.Database("server", "DB"),
Orders = Source{[Schema="dbo",Item="Orders"]}[Data],
// All transformations in one place
Filtered = Table.SelectRows(Orders, each [OrderDate] >= #date(2024, 1, 1)),
Typed = Table.TransformColumnTypes(Filtered, {{"Amount", type number}}),
AddedMargin = Table.AddColumn(Typed, "Margin", each [Amount] * 0.30),
// ... 50 more transformation steps ...
Final = AddedMargin
in
Final
// New Approach (Dataflow + Shared Dataset):
// DATAFLOW: DF_Business_Sales
let
// Reference standardized dataflow (reusable)
Source = PowerBI.Dataflows(...){[entity="Fact_Sales"]}[Data]
// Only business-specific logic here
// (No extraction, cleansing - already done in lower layers)
in
Source
// SHARED DATASET:
// Consumes DF_Business_Sales
// Adds DAX measures only
// Multiple reports connect to this dataset
Troubleshooting Guide
Issue 1: Dataflow Refresh Failures
Symptoms:
- Dataflow shows "Refresh failed" status
- Timeout errors
- Data inconsistency
Diagnosis:
# Check recent refresh history
$workspaceId = "workspace-guid"
$dataflowId = "dataflow-guid"
$refreshHistory = Invoke-PowerBIRestMethod `
-Url "groups/$workspaceId/dataflows/$dataflowId/refreshes?`$top=10" `
-Method Get | ConvertFrom-Json
$refreshHistory.value | ForEach-Object {
[PSCustomObject]@{
StartTime = $_.startTime
EndTime = $_.endTime
Status = $_.status
Duration = if ($_.endTime) {
(New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes
} else { "In Progress" }
ErrorMessage = $_.error.message
}
} | Format-Table -AutoSize
Common Causes & Resolutions:
Query timeout (> 2 hours)
- Add incremental refresh
- Optimize Power Query (enable query folding)
- Partition large tables
Memory limits exceeded
- Reduce entity size (split into multiple entities)
- Use computed entities instead of linked (materializes data)
- Upgrade Premium capacity SKU
Source system unavailable
- Check gateway connectivity
- Verify credentials haven't expired
- Implement retry logic in orchestration (see the sketch below)
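A sketch of that retry idea, wrapping the Invoke-DataflowRefresh function from the orchestration script above in a generic helper with exponential backoff (the helper name, attempt count, and delays are illustrative):
# Hedged sketch: retry a refresh a few times before giving up (e.g., transient gateway/source outages).
function Invoke-WithRetry {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 3,
        [int]$InitialDelaySeconds = 60
    )
    $delay = $InitialDelaySeconds
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $result = & $Action
        if ($result) { return $true }   # refresh reported success
        if ($attempt -lt $MaxAttempts) {
            Write-Host "Attempt $attempt failed; retrying in $delay seconds..."
            Start-Sleep -Seconds $delay
            $delay = $delay * 2         # exponential backoff
        }
    }
    return $false
}
# Usage with the orchestration function defined earlier
$ok = Invoke-WithRetry -Action {
    Invoke-DataflowRefresh -WorkspaceId $workspaceId `
        -DataflowId "ingestion-orders-guid" -DataflowName "DF_Ingestion_Orders"
}
if (-not $ok) { Write-Host "Refresh failed after retries; raise an alert." -ForegroundColor Red }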
Issue 2: Duplicate Data in Reports
Symptoms:
- Metrics don't match between reports
- Same query returns different results
- "Which number is correct?" from business users
Root Causes:
- Multiple datasets with different transformation logic
- Lack of certified shared dataset
- Inconsistent DAX measure definitions
Resolution:
1. Audit the current state (see the inventory sketch below):
   - List all datasets used across the organization
   - Identify duplicate Power Query logic
   - Document the different calculation methods
2. Consolidate to a shared dataset:
   - Create a single certified dataset per subject area
   - Migrate all reports to the shared dataset
   - Deprecate old datasets after validation
3. Governance process:
   - Require approval for new datasets
   - Promote reuse of existing shared datasets
   - Audit dataset proliferation regularly
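The audit in step 1 can be seeded from the admin APIs. A sketch, assuming Power BI administrator permissions and the admin groups endpoint with $expand=datasets; dataset names that recur across workspaces are a quick signal of possible duplication.
# Hedged sketch: inventory datasets tenant-wide and flag names that appear in multiple workspaces.
# Requires the MicrosoftPowerBIMgmt module and a Power BI admin account (Connect-PowerBIServiceAccount).
$response = Invoke-PowerBIRestMethod -Url 'admin/groups?$top=5000&$expand=datasets' -Method Get |
    ConvertFrom-Json
$inventory = foreach ($ws in $response.value) {
    foreach ($ds in $ws.datasets) {
        [PSCustomObject]@{
            Workspace = $ws.name
            Dataset   = $ds.name
        }
    }
}
# Dataset names that exist in more than one workspace are candidates for consolidation
$inventory |
    Group-Object Dataset |
    Where-Object { $_.Count -gt 1 } |
    Sort-Object Count -Descending |
    Select-Object @{ n = "DatasetName"; e = { $_.Name } }, Count |
    Format-Table -AutoSize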
Issue 3: Slow Dataset Refresh
Symptoms:
- Dataset refresh takes hours
- Reports show stale data
- Capacity overload
Diagnosis:
-- Check dataset refresh duration trend
SELECT
DatasetName,
RefreshDate,
DATEDIFF(MINUTE, StartTime, EndTime) AS DurationMinutes
FROM DatasetRefreshLog
WHERE DatasetName = 'Sales_Certified'
AND RefreshDate >= DATEADD(DAY, -30, GETDATE())
ORDER BY RefreshDate DESC;
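If no central refresh log table is available, the same trend can be pulled from the REST refresh history. A sketch, assuming placeholder GUIDs and the documented refresh-history fields (startTime, endTime, status):
# Hedged sketch: refresh duration trend and success rate for a dataset over recent refresh history.
# Assumes an authenticated MicrosoftPowerBIMgmt session.
$workspaceId = "workspace-guid"   # placeholder
$datasetId   = "dataset-guid"     # placeholder
$history = (Invoke-PowerBIRestMethod `
    -Url "groups/$workspaceId/datasets/$datasetId/refreshes?`$top=30" `
    -Method Get | ConvertFrom-Json).value
$history | ForEach-Object {
    [PSCustomObject]@{
        StartTime       = $_.startTime
        Status          = $_.status
        DurationMinutes = if ($_.endTime) {
            [math]::Round((New-TimeSpan -Start $_.startTime -End $_.endTime).TotalMinutes, 1)
        } else { $null }
    }
} | Sort-Object StartTime -Descending | Format-Table -AutoSize
# Success rate over the same window
$successRate = @($history | Where-Object status -eq "Completed").Count / [math]::Max($history.Count, 1)
Write-Host ("Success rate (last {0} refreshes): {1:P0}" -f $history.Count, $successRate)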
Optimization Steps:
- Enable incremental refresh on dataset
- Use dataflows for heavy transformations (pre-process data)
- Optimize DAX measures (avoid expensive iterators)
- Use aggregations for large fact tables
- Schedule refreshes during off-peak hours
Best Practices Checklist
Architecture:
- Use a layered dataflow design (Ingestion/Standardization/Business)
- Minimize computed entities (prefer linked entities where possible)
- Create shared datasets for enterprise metrics
- Certify datasets used by multiple teams
- Document dependencies (dataflow → dataset → report lineage)
Performance:
- Implement incremental refresh for large tables (> 1M rows)
- Use the Enhanced Compute Engine for computed entities
- Optimize Power Query (enable query folding)
- Schedule refreshes to respect dependencies
- Monitor refresh duration trends
Governance:
- Naming convention: DF_<Layer>_<Domain>_<Purpose>
- Assign dataflow owners
- Document refresh SLAs
- Version-control dataflow definitions (export JSON)
- Review access regularly (quarterly)
- Implement a change management process
Cost Optimization:
- Use incremental refresh (reduces CU usage)
- Prefer linked entities over computed entities (reduces storage)
- Archive old data (aggregate to summaries)
- Optimize refresh schedules (avoid over-refreshing)
- Monitor Premium capacity utilization
Security:
- Apply RLS at the shared dataset level
- Use service principals for gateway connections and automation (see the sketch after this checklist)
- Implement least-privilege access
- Audit data access logs
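For unattended orchestration and governance scripts, authenticate with a service principal rather than a personal account. A minimal sketch, assuming an Azure AD app registration that has been granted access to the relevant workspaces; the tenant and app IDs are placeholders, and the secret should come from a key vault in practice.
# Hedged sketch: authenticate automation scripts with a service principal.
# Assumes an Azure AD app registration allowed to use Power BI APIs and added to the workspaces.
$tenantId     = "tenant-guid"            # placeholder
$appId        = "app-registration-guid"  # placeholder
$clientSecret = Read-Host -Prompt "Client secret" -AsSecureString  # prefer Azure Key Vault in practice
$credential = New-Object System.Management.Automation.PSCredential($appId, $clientSecret)
Connect-PowerBIServiceAccount -ServicePrincipal -Credential $credential -TenantId $tenantId
# Subsequent Invoke-PowerBIRestMethod calls (refresh orchestration, audits) now run as the service principal.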
Key Takeaways
- Layered dataflow architecture (Ingestion/Standardization/Business) eliminates duplication and ensures consistency
- Linked entities avoid data duplication, while computed entities enable complex transformations and incremental refresh
- Incremental refresh is critical for large tables to reduce refresh time and capacity costs
- Shared datasets provide a single source of truth for enterprise metrics, with RLS and certification
- Refresh orchestration must respect dependencies between dataflow layers and datasets
- Cost optimization focuses on minimizing computed entities, enabling incremental refresh, and optimizing schedules
- Migration from monolithic datasets requires phased approach: assess, design, build, migrate, monitor
- Governance includes naming conventions, ownership, documentation, and change management
- Performance monitoring tracks refresh duration, failure rates, and capacity utilization
Next Steps
- Assess current dataset landscape (identify duplication)
- Design layered dataflow architecture (Ingestion/Standardization/Business)
- Create first ingestion dataflow (start simple, single source system)
- Build standardization layer (cleanse and conform dimensions)
- Implement business layer (add calculated metrics)
- Create first shared dataset (import from business dataflow)
- Certify dataset and grant Build permissions
- Migrate first report to use shared dataset
- Implement refresh orchestration script
- Monitor refresh success rates and optimize
- Document architecture and governance process
- Scale to additional subject areas (Sales, Finance, Operations)
Additional Resources
- Power BI Dataflows Introduction
- Dataflow Best Practices
- Incremental Refresh for Dataflows
- Computed Entities vs Linked Entities
- Enhanced Compute Engine
- Shared Datasets Overview
- Dataset Certification
- Dataflow REST API
Reusable. Governed. Scalable.