In the previous post I showed how to replace Power BI reports with a DuckDB-powered static web app. At the end of that post I said there was a companion piece: the streaming side of the same argument. This is it.

There are actually two things happening in this post, and it is worth being upfront about both. The first is a proof of concept: smart cameras on buses detect fatigue events and fire a small JSON payload to an Azure Function. Within a second the event appears on a dashboard. I built this quickly to prove the Azure streaming pipeline actually works end-to-end. Small payload, few fields, manually triggered by a Python script, done in a weekend. Job done.

The second thing is the actual production use case: streaming Auckland Transport GTFS real-time data (bus positions, delays, alerts) to an ops dashboard. AT does not expose a webhook or subscription API, but we already poll their GTFS-RT feed every minute for our data lake. So the change was minimal: after each poll, also POST the result to the same Azure Function streaming endpoint. That one extra line of code is where all the real complexity lives. The AT feed is 1.6–1.7 megabytes of JSON every 60 seconds, split across hundreds of trips and vehicles. That is where every hidden issue surfaced. The camera POC hid all of it: small payload, simple structure, no HMAC issues, no 429s. The gotchas in this post are almost entirely AT-specific, and I want to be clear about that.

You still don't need Power BI

The Streaming Argument

Power BI does have a streaming mode. There are three variants: pure Streaming datasets (temporary cache, dashboard tiles only), Push datasets (permanent storage, can build reports), and PubNub datasets (external stream, no storage). I want to be accurate here rather than just rag on it, so the table below is based on what Microsoft actually documents.[1]

⚠️ Power BI real-time streaming is being retired. Microsoft announced that creation of new real-time semantic models including Push datasets, Streaming datasets, and streaming tiles was blocked from October 31, 2024. Existing models will be fully retired by October 2027. Microsoft's own recommendation is to migrate to Real-Time Intelligence in Microsoft Fabric instead. This makes the comparison below less relevant over time, but it is worth knowing if you are evaluating Power BI streaming right now.

Capability | Power BI Streaming | This stack
Live dashboard tiles | Works on Pro with Push or Streaming datasets. New streaming models can no longer be created as of Oct 2024. | Works on Azure free tier. No retirement risk.
Historical reports on streaming data | Pure Streaming: dashboard tiles only, no report-building. Push datasets support reports but are retiring in 2027. | Queue → Blob → query with DuckDB, unlimited, your storage, no platform constraints.
Payload size per message | Streaming visuals support packets of 15KB. Beyond that, streaming visuals fail (push API continues but tiles stop rendering).[7] | Web PubSub frame limit is 1MB per message. AT data chunked to stay under it.
Ingest frequency | Power BI can be called roughly once per second via the streaming API.[7] | No platform-imposed frequency limit on the ingest function.
UI customisation | Dashboard tiles only. No filtering, no custom layout, no arbitrary JS. | Full HTML/CSS/JS. Every pixel yours.
Data ownership | Data lives inside Power BI's platform. Retiring in 2027. | Your blob storage, queryable with anything, not going anywhere.

To be fair: if your team was already using Push datasets and just needed basic live tiles, it worked fine while it lasted. The argument now is moot for new projects because Microsoft has closed the door on new streaming models. If you are evaluating this today, the choice is either Microsoft Fabric Real-Time Intelligence (the replacement, with its own cost and complexity), or building something like this stack.

What it actually costs

The Cost Comparison

Microsoft retired the old Power BI Premium P-SKUs in July 2024. The equivalent is now Microsoft Fabric F64 at around $5,003/month.[2] Here is how that stacks up against this architecture.

Component | Power BI (Fabric F64 capacity) | This stack
Dashboard hosting | ~$5,003/month (Fabric F64) | $0 (SWA free tier)
Real-time WebSocket / push | Included in capacity | $0 on Free_F1 (20 connections, 20K messages/day). Caveat: AT-scale data needs Standard S1. See azure.microsoft.com/pricing/details/web-pubsub for current S1 unit pricing.[6]
AT API polling (every minute) | Included / requires dataflow setup | $0 (43,200 executions/month, well inside the 1M free grant)[3]
Event persistence (queue + blob) | Included in capacity | ~$0.02/month at low volume

🧮 The AT polling maths: 1 execution per minute × 60 × 24 × 30 = 43,200 executions per month. Azure Functions gives you the first 1,000,000 free every month.[3] The poller costs $0 at this frequency. You would need to poll 23× more often before you exit the free tier. The caveat: if your organisation already pays for Fabric F64 for other workloads, the marginal cost of adding this dashboard is zero too. The argument is for teams who would buy capacity specifically for the streaming use case.
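
The polling arithmetic is easy to re-run for other intervals. A minimal sketch, assuming a 30-day month and counting only the execution grant (the Consumption plan also meters execution time, which this ignores). The function names are made up for illustration.

```python
FREE_EXECUTIONS_PER_MONTH = 1_000_000  # free monthly grant quoted in the post


def monthly_executions(polls_per_minute: float, days: int = 30) -> int:
    """Executions per month for a poller firing at a fixed rate."""
    return int(polls_per_minute * 60 * 24 * days)


def headroom_factor(polls_per_minute: float) -> float:
    """How many times more often you could poll before leaving the free grant."""
    return FREE_EXECUTIONS_PER_MONTH / monthly_executions(polls_per_minute)


print(monthly_executions(1))         # 43200
print(round(headroom_factor(1), 1))  # 23.1
```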

Full architecture

How Everything Connects

Every event enters through a single Azure Function. From there it splits into two parallel paths: a real-time path that gets the event to a connected browser in under a second, and a persistence path that queues it for batch write to blob storage. Neither path can block the other. The blob storage side feeds the data warehouse, which can then serve Power BI if your organisation uses it, or the DuckDB-powered SWA reporting platform described in the previous post. The two architectures are designed to share the same storage layer.

Event flow · IoT camera → Azure Functions → browser

Sources
📡 IoT Smart Camera: fatigue / speed events
🔄 AT API Poller: runs every 60 seconds
↓ POST /api/ingest

AZURE FUNCTIONS · ingest()
① HMAC verify → ② Parse JSON → ③ enrich_event() (EROAD vehicle + driver) → ④ Fan out

⚡ Real-time path
Azure Web PubSub (group: ops-room · WebSocket)
↓ sub-second broadcast
Azure Static Web App (OpsCentre live dashboard)

🗄️ Persistence path
Storage Queue (smartcamera-iot-raw)
↓ CRON every 5 min
Batch Processor (batch_queue_processor())
↓ upload_blob()
📦 Blob Storage (partitioned JSON): raw/date=20250815/device=387/{eventId}.json
↓ serves
📊 Data Warehouse → Reporting (Power BI · or our own DuckDB SWA, previous post)

How to build it

Step by Step

These steps walk through building the camera POC pipeline, the version that proved the concept. It is genuinely buildable in a weekend. The AT-specific additions and the gotchas that only showed up with real production data are covered after. Web PubSub's Free_F1 tier (20 connections, 20,000 messages/day) is enough for the POC. For the AT production workload, it was not. More on that later.[6]

🧒 ELI5: what is a WebSocket relay? Imagine you want to push breaking news to twenty friends across town. You could ring each one individually, but that means keeping twenty phone lines open and calling them one by one. Instead, you hire a message runner. You tell the runner the news once, and he sprints to everyone at the same time. Azure Web PubSub is the message runner.

Your Azure Function is you. The browsers are your friends. Web PubSub sits in the middle holding all the open connections. When an event arrives, the Function calls Web PubSub once with send_to_group("ops-room"), and Web PubSub immediately delivers it to every connected browser. The Function does not know or care how many browsers are watching. That is Web PubSub's job.

A WebSocket is just a phone line that stays open rather than hanging up after each call. Normal web requests ask a question, get an answer, and disconnect. A WebSocket stays open so messages can arrive at any time without the browser having to ask. Web PubSub manages all those open lines so your Function App does not have to.

Step 01

Provision the four Azure services

Create these in the same resource group. That way you can manage billing and tear them all down together if needed.

Azure Functions App (Consumption plan, Python runtime). This is where your ingest and negotiate functions live.

Azure Web PubSub (Free_F1 tier to start, hub name: ops). This is the WebSocket relay. Copy the connection string from the portal after creation. If you are streaming AT-scale data (over 1.5MB per poll), start on Standard S1 instead. More on that later.

Azure Storage Account (Standard LRS). Create a queue named smartcamera-iot-raw and a blob container named smartcamera-raw inside it.

Azure Static Web App (Free tier). Connect it to your GitHub repo. Every push to main auto-deploys.

Authentication note: The Free SWA tier serves your dashboard publicly to anyone with the URL. For an internal ops dashboard that is probably not what you want. The Standard tier at $9/app/month adds custom authentication providers. Wire in Entra ID (Active Directory) and only your org's users can open it. Check the SWA pricing page for current rates. $9/month to lock down an ops dashboard is an easy call.

Set these App Settings on your Function App (Settings → Environment variables in the portal): WEBPUBSUB_CONNECTION_STRING → your Web PubSub connection string, WEBPUBSUB_HUB → ops, WEBPUBSUB_GROUP → ops-room, RAW_STORAGE_CONN_STR → your storage account connection string.

Step 02

Write the two Azure Function endpoints: ingest and negotiate

Both of these live in the same Azure Function App, under the same api/ folder. They share the same environment variables and Web PubSub connection string. Think of the Function App as the container: ingest is the endpoint that receives events and broadcasts them, negotiate is the endpoint the browser calls to get a WebSocket URL. You need both.

The ingest function receives incoming events and fans them out to two destinations simultaneously: the Storage Queue for persistence, and Web PubSub for real-time delivery. The critical rule is that a queue failure must never block the live broadcast. Wrap the queue write in a try/except and continue regardless.

api/ingest/__init__.py · fan-out to queue and Web PubSub
# Path A: Storage Queue (persistence, non-blocking)
try:
  if _RAW_Q:
    _RAW_Q.send_message(json.dumps({
      "event": outbound, "receivedAt": int(time.time())
    }))
except Exception as e:
  logging.warning("queue enqueue failed: %s", e)
  # Don't raise — queue failure must never block the WebSocket path

# Path B: Web PubSub — this is what pushes to the browser instantly
SERVICE.send_to_group(GROUP, outbound)

return func.HttpResponse(status_code=202)

SERVICE.send_to_group("ops-room", outbound) tells Web PubSub to relay the message to every browser in that group. The Function does not track who is connected or how many. That is entirely Web PubSub's job.

For the AT production pipeline (/api/ingestat), the function receives a raw 1.6MB JSON body containing the entire GTFS-RT feed from Auckland Transport. Rather than trying to push that as one giant message, the function parses the body and splits it into up to three separate message types: trip_updates (delay predictions per trip), vehicles (live GPS positions), and alerts (service alerts and cancellations). Each type is then further chunked at 400 records per message, with a 0.5-second pause between chunks. Web PubSub has a 1MB per-frame limit, so sending the raw 1.6MB blob would be immediately rejected. Splitting and chunking keeps every individual message well within that limit.
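
The chunking step can be sketched roughly as below. This is a minimal illustration, not the production function: the helper name chunk_messages is made up, and the envelope fields (type, chunk_index, chunk_total, data) follow the message structure shown later in this post. Emitting a single empty chunk for an empty list is my assumption, so that the browser still sees a chunk 0 and clears stale data.

```python
import math
from typing import Any, Dict, Iterator, List

CHUNK_SIZE = 400  # records per message, as described in the post


def chunk_messages(msg_type: str, records: List[Any],
                   chunk_size: int = CHUNK_SIZE) -> Iterator[Dict[str, Any]]:
    """Yield one outbound message per chunk of `records`."""
    # Assumption: an empty feed still produces one (empty) chunk.
    chunk_total = max(1, math.ceil(len(records) / chunk_size))
    for chunk_index in range(chunk_total):
        start = chunk_index * chunk_size
        yield {
            "type": msg_type,
            "chunk_index": chunk_index,   # zero-based
            "chunk_total": chunk_total,
            "data": records[start:start + chunk_size],
        }


# Usage with placeholder records: 950 items become chunks of 400, 400, 150.
msgs = list(chunk_messages("vehicles", list(range(950))))
print([len(m["data"]) for m in msgs])  # [400, 400, 150]
```

In the function itself, each yielded message would be passed to send_to_group with a time.sleep(0.5) between sends.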

The negotiate function is what the browser calls once on page load to get a short-lived, signed WebSocket URL. The Web PubSub connection string must never reach the browser. Anyone who saw it could broadcast to your ops room. The negotiate function holds it server-side and uses it to mint a 60-minute token. The browser gets back a wss:// URL with the token embedded in the query string, opens a WebSocket directly to Web PubSub using that URL, and the function is done.

api/negotiate/__init__.py · token vending machine
import json
import os

import azure.functions as func
from azure.messaging.webpubsubservice import WebPubSubServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
  service = WebPubSubServiceClient.from_connection_string(
    os.environ["WEBPUBSUB_CONNECTION_STRING"], hub="ops")

  # Mint a signed URL scoped to ops-room group, expires in 60 min
  token = service.get_client_access_token(
    groups=["ops-room"],
    roles=["webpubsub.joinLeaveGroup"],
    minutes_to_expire=60)

  # Connection string stays server-side. Browser only gets the signed URL.
  return func.HttpResponse(
    json.dumps({"url": token["url"]}),
    mimetype="application/json")

Set this function's auth level to anonymous in function.json so the browser can call it without an API key. The security is in the signed token it returns, not in locking who can call the endpoint.

Step 03

Wire up the browser

The browser calls /api/negotiate to get the signed URL, then opens a WebSocket directly to Web PubSub and listens. The protocol string "json.webpubsub.azure.v1" is required. Leave it out and Web PubSub will reject the connection. The auto-reconnect on onclose is also important: if the connection drops, the browser retries silently after 2 seconds. Ops managers should never have to refresh.

js/app.js · connect()
async function connect() {
  const r = await fetch("/api/negotiate", { cache: "no-store" });
  const { url } = await r.json();

  // "json.webpubsub.azure.v1" is required — omitting it causes a rejected connection
  const ws = new WebSocket(url, "json.webpubsub.azure.v1");
  ws.onmessage = (evt) => {
    const msg = JSON.parse(evt.data);
    // type === "system" are Web PubSub heartbeats — ignore them
    if (msg.type === 'message') handleIncoming(msg.data);
  };
  ws.onclose = () => setTimeout(connect, 2000);
}
connect();

A note on Azure SignalR Service vs Web PubSub

I initially thought SignalR was what I needed for this. It is the more well-known Azure real-time service, it has more tutorials, and it came up first in most searches. I was wrong, and it cost me a few hours.

Azure SignalR Service was built specifically for ASP.NET applications. It sits in front of a running hub server, which is a persistent backend process that manages connections and routes messages. The SignalR client library is what your browser talks to, and the hub server is what your backend code runs on. This model works well if you are building a .NET application with a traditional server process, because the hub server pattern is familiar from ASP.NET SignalR.

Azure Web PubSub is serverless. There is no hub server. Your Azure Function calls the Web PubSub REST API to broadcast, and Web PubSub pushes to all connected clients. No persistent process to run, no hub server to deploy or scale. For a Python Azure Functions backend talking to a static web app, Web PubSub is the right fit. SignalR would have required me to run an additional server process, which defeats the entire point of using Azure Functions on the Consumption plan.

If you are building something in .NET with a long-running ASP.NET host, SignalR is worth looking at. For this stack (Python Functions, static SWA, serverless everything), Web PubSub is the better choice.

Step 05

Test it with a Python script

Before wiring up a real camera, test the whole pipeline with a plain POST. Use an interesting location so the mini-map renders nicely in the event card.

test_ingest.py
import requests

url = "https://your-functions-app.azurewebsites.net/api/ingest"

payload = {
  "eventId": "f2b5db2c-9b2b-4b2f-b1c9-2b1f4a0b1e77",
  "deviceId": "387",
  "eventType": "Fatigue",
  "score": 88,
  "location": { "lat": -44.98192, "lon": 168.817132 },
  "notes": "Yawning detected, eyes closed >2s"
}

r = requests.post(url, json=payload)
print(r.status_code)  # 202 = accepted, event appears on dashboard within 1s

Step 06

Add the batch processor for persistence

Azure Storage Queue does have a native QueueTrigger integration that automatically deletes messages on successful processing.[4] In principle that is the cleaner approach. I ended up not using it because messages were reappearing in the queue after about 10 minutes even after apparent successful processing. This is a documented issue with visibility timeout behaviour in Azure Storage Queue.[5] Rather than debug it, I built a timer-triggered batch processor that manually reads, writes to blob, then explicitly deletes each message with its pop receipt.

api/batch_processor/function.json · CRON every 5 minutes
{
  "bindings": [{
    "type": "timerTrigger",
    "direction": "in",
    "name": "mytimer",
    "schedule": "0 */5 * * * *"
  }]
}

The blob path pattern (raw/date={date}/device={id}/{eventId}.json) gives date and device partitioning out of the box. From blob storage you can query with DuckDB as described in the previous post, or load into a data warehouse for longer-term analysis. The two posts connect directly.
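
A minimal sketch of the read, write, delete loop described above. It assumes the queue message body is the {"event": ..., "receivedAt": ...} envelope written by the ingest function, that the event carries deviceId and eventId, and that the date partition is taken in UTC. The function names are hypothetical. queue_client and container_client stand in for a QueueClient (azure-storage-queue) and a ContainerClient (azure-storage-blob).

```python
import json
from datetime import datetime, timezone


def blob_path_for(event: dict, received_at: int) -> str:
    """Build raw/date=YYYYMMDD/device=<id>/<eventId>.json for one event."""
    # Assumption: partition by the UTC date of receipt.
    day = datetime.fromtimestamp(received_at, tz=timezone.utc).strftime("%Y%m%d")
    return f"raw/date={day}/device={event['deviceId']}/{event['eventId']}.json"


def drain_queue(queue_client, container_client) -> int:
    """Read a batch, write each message to blob, then delete it explicitly."""
    written = 0
    # Hide messages for 5 minutes while we work on them.
    for msg in queue_client.receive_messages(messages_per_page=32,
                                             visibility_timeout=300):
        body = json.loads(msg.content)
        path = blob_path_for(body["event"], body["receivedAt"])
        container_client.upload_blob(name=path, data=json.dumps(body["event"]),
                                     overwrite=True)
        # Delete only after the blob write succeeded, using the pop receipt.
        queue_client.delete_message(msg.id, msg.pop_receipt)
        written += 1
    return written
```

Because the delete happens after the upload, a crash mid-batch leaves the message in the queue to be retried, and overwrite=True makes the retry harmless.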

⚠️ Deployment gotcha: VS Code reports success when nothing changed. If you are pushing code changes via the Azure Functions extension in VS Code, be aware it sometimes says the deployment succeeded when the running app has not actually updated. No error, no warning, just a quiet lie.

When that happens, skip VS Code and use the CLI: install the Azure CLI with winget install -e --id Microsoft.AzureCLI, log in with az login, then run func azure functionapp publish <appname> --python --build remote --force. Note that the func command comes from Azure Functions Core Tools, which is a separate install from the Azure CLI. The --build remote flag compiles dependencies in Azure's own Linux environment to avoid architecture mismatches. The --force flag pushes even when Azure thinks nothing changed.

Honestly: set up CI/CD from the start and you avoid this entirely. When your GitHub repo auto-deploys to Azure on every push, you never touch the manual path. I have not done this for the Function App yet and I have paid for it in confusion. Do not be like me. (I will write up a proper how-to on setting up CI/CD for Azure Functions soon, once I stop being lazy and actually do it properly myself.)

Step 07 · Gotcha

HMAC signing: sign the raw bytes, not the dict

🧒 ELI5: what is HMAC and why does this endpoint use it?

HMAC stands for Hash-based Message Authentication Code. The ELI5 version: imagine you and a friend agree on a secret word before you go into separate rooms. Later your friend passes a note under the door. But how do you know it is really from your friend and not someone else who snuck into the building? The answer: your friend includes a special code at the bottom of the note, computed by scrambling the note's contents together with the secret word using a one-way mathematical function. You do the same calculation on your side. If the codes match, it is genuinely from your friend. No one else could have produced that code without knowing the secret word.

That is exactly what the IOT_HMAC_SECRET environment variable does here. The ingest endpoint is publicly accessible (it has to be, for the camera to POST to it). Without HMAC, anyone who knew the URL could send fake events. With HMAC, every request must include a signature computed from the shared secret and the exact request body. The function verifies the signature before processing anything. The secret never travels over the network. Only the signature does, and the signature is useless without the original secret.

If HMAC verification is enabled, every POST must include a valid signature header. The signature covers both the timestamp and the exact request body. The mistake I kept making was signing a Python dict instead of the serialised string, so the signature I calculated locally never matched what the function saw on the other end.

The fix: serialise to a string first, sign that string, send that same string as the body. Print the timestamp and signature in your test script so you can cross-check against the function logs.

test_ingest_hmac.py · correct signing pattern
import requests, json, time, hmac as hmac_lib, hashlib

url = "https://your-functions-app.azurewebsites.net/api/ingest"
iot_secret = "your-shared-secret"

payload_dict = {"eventId": "abc-123", "deviceId": "387", "score": 88}

# Serialise FIRST: sign and send the exact same bytes
payload_body = json.dumps(payload_dict)

# Signature format: sha256=HMAC(secret, timestamp + "." + body)
ts = str(int(time.time()))
msg = ts + "." + payload_body
sig = "sha256=" + hmac_lib.new(
  iot_secret.encode('utf-8'),
  msg.encode('utf-8'),
  hashlib.sha256
).hexdigest()

# Print for cross-checking against Postman or function logs
print(f"ts: {ts}")
print(f"sig: {sig}")

headers = {
  'Content-Type': 'application/json',
  'x-iot-timestamp': ts,
  'x-iot-signature': sig
}

# data=payload_body (string), not json=payload_dict (dict) — critical difference
r = requests.post(url, headers=headers, data=payload_body)
print(f"Status: {r.status_code} · {r.text}")

Note: the default timestamp skew tolerance is 300 seconds. If your machine's clock is more than 5 minutes off Azure's clock, you will get a 401 even with a correct signature. Also watch the difference between data=payload_body and json=payload_dict in the requests call. Using json= makes requests re-serialise the dict itself, so the bytes on the wire may not match the string you signed, and the signature check fails.
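
For reference, the receiving side of that scheme could look roughly like this. It is a sketch under the assumptions stated in this post (signature format sha256=HMAC(secret, timestamp + "." + body), 300-second skew), not the actual ingest code. The name verify_signature is hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # skew tolerance mentioned in the post


def verify_signature(secret: str, timestamp: str, signature: str,
                     raw_body: bytes, now: Optional[float] = None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > MAX_SKEW_SECONDS:
            return False  # too old or too far in the future
    except ValueError:
        return False      # timestamp header was not an integer
    # Sign the raw bytes exactly as received; never re-serialise parsed JSON.
    msg = timestamp.encode("utf-8") + b"." + raw_body
    expected = "sha256=" + hmac.new(secret.encode("utf-8"), msg,
                                    hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8"))
```

Inside an Azure Function, raw_body would come from req.get_body(), and a False result would map to a 401 response.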

Step 08 · AT specific

Static GTFS lookups: scheduled GitHub Actions rebuild

This one is neat, and honestly something I had never done before and never thought was possible. A GitHub Action that triggers on a schedule, rebuilds the app with fresh data, and deploys it automatically without any manual intervention. The idea that the GTFS files just stay current on their own, without me touching anything, still feels a bit like magic.

The AT dashboard needs static GTFS lookup files (routes, stops, trips, shapes, timetables) bundled with the app so the browser can enrich raw stream data client-side without hitting a database on every event. These are pre-built JSON files generated from the AT GTFS static feed at build time and served from the SWA's public folder.

The problem: the AT static feed changes periodically. New routes, updated timetables, schedule changes. If you only build on push, the lookup files go stale. Loading the full GTFS feed fresh on every dashboard open is not viable. The raw files are large.

The solution is a scheduled trigger in your existing GitHub Actions workflow. Add a schedule block alongside your existing push trigger. The rest of the workflow is unchanged.

.github/workflows/azure-static-web-apps.yml · add the schedule trigger
name: Azure Static Web Apps CI/CD
on:
  push:
    branches: [ main ]
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches: [ main ]
  schedule:
    # Rebuild daily at 3:00 AM NZST (15:00 UTC previous day)
    # Picks up AT GTFS static feed changes automatically. Zero new infrastructure
    - cron: '0 15 * * *'

GTFS static data rarely changes more than weekly, so daily is plenty. The build step that generates the JSON lookup files runs identically whether triggered by a code push or by the schedule. No extra services, no cron job to manage, no external dependency. Just the workflow you already have with one extra block.
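
To make the build step concrete, here is a hedged sketch of turning one GTFS static file into a JSON lookup. The column names (route_id, route_short_name, route_long_name) come from the GTFS specification. The function name, the output shape, and the sample row are made up for illustration and are not the dashboard's actual build script.

```python
import csv
import io
import json


def build_route_lookup(routes_txt: str) -> dict:
    """Map route_id -> display names from the contents of a GTFS routes.txt."""
    reader = csv.DictReader(io.StringIO(routes_txt))
    return {
        row["route_id"]: {
            "short_name": row.get("route_short_name", ""),
            "long_name": row.get("route_long_name", ""),
        }
        for row in reader
    }


# Made-up sample row, standing in for the real feed.
sample = "route_id,route_short_name,route_long_name\nR1,70,Example Road To City\n"
lookup = build_route_lookup(sample)
print(json.dumps(lookup, indent=2))
```

The resulting dict would be written to a JSON file in the SWA's public folder, so the browser can resolve a route_id from the stream with a plain object lookup.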

How the browser side actually works

The AT Pipeline: ingestat, negotiate, and useStreamingData

This is where the actual production work lives. The camera POC uses a single ingest endpoint and handles one small event at a time. Easy. The AT pipeline handles a 1.6MB JSON feed every 60 seconds, splits it into three data types, chunks it for reliable WebSocket delivery, and feeds a React app that does all the enrichment client-side. Different scale, different problems, different endpoint.

Two ingest endpoints: ingest vs ingestat

/api/ingest is the camera POC endpoint. It validates an HMAC signature, enriches the event with vehicle and driver data from the EROAD API, enqueues the raw event for blob storage, and broadcasts a single enriched JSON message to Web PubSub. One small event in, one message out. This was easy to get working precisely because the payload is tiny and the cadence is low.

/api/ingestat is the AT production endpoint. Our existing AT data lake poller calls the GTFS-Realtime API every 60 seconds and now also POSTs the full feed to this endpoint as a one-line addition to the existing pipeline. The function parses the feed and splits it into up to three message types: trip_updates, vehicles, and alerts. Large feeds get chunked at 400 records per chunk with a 0.5-second pause between chunks. This endpoint does no EROAD enrichment. AT enrichment happens client-side against the pre-built static GTFS JSON files instead.

📦 Why enrich client-side for AT data? The real-time feed is deliberately a dumb pipe: raw GTFS-RT data with IDs only, no names. Enrichment (route names, stop names, scheduled times, route shapes) happens in the browser by joining the stream against pre-built static GTFS JSON files bundled with the SWA. This keeps the ingest function fast and avoids coupling the real-time path to a database lookup on every poll cycle.

How the data travels end to end

Before getting into message formats, here is the full sequence from AT API call to pixels updating in the browser. This covers every hop including the negotiate step that trips people up.

Full data flow · AT GET → POST → Web PubSub → browser → rendered dashboard

A · Browser opens dashboard (once on load)

1. Browser → /api/negotiate (HTTP GET)
Browser needs a signed WebSocket URL. It can't connect to Web PubSub directly without one.

2. /api/negotiate → Web PubSub SDK → get_client_access_token()
Function uses the connection string (server-side only) to mint a signed URL that expires in 60 min.

3. Browser receives { url: "wss://…?access_token=…" }
Connection string never leaves the server. Only the signed URL does.

4. Browser → new WebSocket(url, "json.webpubsub.azure.v1")
WebSocket connection opens directly to Web PubSub. Browser joins the "ops-room" group. Now listening for events.

B · Every 60 seconds (AT poller loop)

5. AT Poller → GET AT GTFS-RT API
Our existing data lake poller fetches the full Auckland Transport real-time feed (~1.6MB JSON).

6. AT Poller → POST /api/ingestat (one extra line in existing pipeline)
The same feed that goes to the data lake also hits the streaming endpoint. No new schedule, no new trigger.

7. /api/ingestat parses and splits the 1.6MB feed into 3 message types
trip_updates · vehicles · alerts · chunked at 400 records each · 0.5s pause between chunks · stays under Web PubSub 1MB frame limit

8. send_to_group("ops-room") → Web PubSub pushes to all connected browsers
Web PubSub delivers each chunk to every browser in ops-room simultaneously. The Function is already done by this point.

C · Browser processes the data (useStreamingData hook)

9. Chunks accumulate into tripUpdatesRef and vehiclesRef Maps (in memory)
chunk_index 0 clears the Map (full replacement). Each subsequent chunk adds to it. When the final chunk lands, a 300ms debounced rebuild fires.

10. rebuild() merges Maps → joins with static GTFS files → computes KPIs → setData()
Route names, stop names, shapes, and timetables are joined client-side from static JSON bundled with the SWA. No database call. Dashboard re-renders.

Token expires after 60 min → ws.onclose fires → browser calls /api/negotiate again → new WebSocket opened → seamless for ops team

What arrives over the WebSocket

Every ~60 seconds the browser receives up to three types of WebSocket message (each type may arrive as several chunks), every one wrapped in a Web PubSub envelope. The inner payload looks like this:

WebSocket message structure · Web PubSub envelope unwrapped
// Web PubSub wraps everything in an envelope first:
{ type: "message", from: "group", dataType: "json", data: { ... } }

// The inner payload (envelope.data) for a chunked trip_updates message:
{
  "type": "trip_updates",
  "chunk_index": 0,  // zero-based, so this is chunk 1 of N
  "chunk_total": 2,  // 2 chunks total for this message type
  "data": [ ... up to 400 trip update records ... ]
}

// When chunk_index === chunk_total - 1, this is the final chunk.
// That is when you trigger the dashboard rebuild.

How useStreamingData processes it

The useStreamingData React hook manages the entire WebSocket lifecycle. It uses two useRef Maps as the live in-memory store: tripUpdatesRef keyed by trip ID, and vehiclesRef keyed by entity ID. When chunk 0 arrives for a message type, the corresponding Map is cleared (full replacement, not incremental merge). Subsequent chunks are accumulated. When the final chunk lands, a 300ms debounced rebuild fires.
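
The accumulation rule is small enough to sketch on its own. The real hook is JavaScript. This is a Python illustration of the same logic, with a hypothetical class name and a caller-supplied key function. The 300ms debounce is left out, so the caller decides what to do when add() reports the final chunk.

```python
from typing import Any, Callable, Dict


class ChunkAccumulator:
    """In-memory store for one message type, rebuilt once per poll cycle."""

    def __init__(self, key_of: Callable[[dict], Any]):
        self.key_of = key_of
        self.store: Dict[Any, dict] = {}

    def add(self, message: dict) -> bool:
        """Apply one chunk. Returns True when the final chunk has landed."""
        if message["chunk_index"] == 0:
            self.store.clear()  # full replacement, not incremental merge
        for record in message["data"]:
            self.store[self.key_of(record)] = record
        return message["chunk_index"] == message["chunk_total"] - 1


# Usage with made-up records keyed by a hypothetical "id" field.
acc = ChunkAccumulator(key_of=lambda r: r["id"])
acc.add({"chunk_index": 0, "chunk_total": 2, "data": [{"id": "a"}]})
done = acc.add({"chunk_index": 1, "chunk_total": 2, "data": [{"id": "b"}]})
print(done, sorted(acc.store))  # True ['a', 'b']
```

Clearing on chunk 0 means vehicles that dropped out of the feed disappear on the next cycle without any explicit expiry logic.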

The rebuild() function merges the two Maps into a unified trips array, filtering out invalid positions (must be inside New Zealand's bounding box), cancelled trips, and trips without vehicle assignments. It then computes KPIs, fleet health metrics, and actionable insights before calling setData() to update React state. The 300ms debounce prevents multiple rapid rebuilds when trip_updates and vehicles messages arrive close together within the same poll cycle.
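
The merge-and-filter part of rebuild() can be pictured roughly as follows, again in Python rather than the hook's JavaScript. Everything specific here is an assumption for illustration: the field names (trip_id, cancelled, delay, lat, lon), the function names, and the bounding box values, which are a rough box around New Zealand rather than the dashboard's real limits.

```python
from typing import Dict, List

# Rough bounding box for New Zealand. Illustrative values only.
NZ_LAT = (-48.0, -34.0)
NZ_LON = (166.0, 179.0)


def in_nz(lat: float, lon: float) -> bool:
    """True if the position falls inside the rough NZ box."""
    return NZ_LAT[0] <= lat <= NZ_LAT[1] and NZ_LON[0] <= lon <= NZ_LON[1]


def merge_trips(trip_updates: Dict[str, dict],
                vehicles: Dict[str, dict]) -> List[dict]:
    """Join trip updates to vehicles and drop rows the dashboard cannot use."""
    # vehicles is keyed by entity ID, so index it by trip first.
    vehicle_by_trip = {v["trip_id"]: v for v in vehicles.values() if v.get("trip_id")}
    trips = []
    for trip_id, update in trip_updates.items():
        vehicle = vehicle_by_trip.get(trip_id)
        if vehicle is None:
            continue  # no vehicle assigned to this trip
        if update.get("cancelled"):
            continue  # cancelled trips are excluded
        if not in_nz(vehicle["lat"], vehicle["lon"]):
            continue  # invalid GPS position
        trips.append({"trip_id": trip_id, "delay": update.get("delay"),
                      "lat": vehicle["lat"], "lon": vehicle["lon"]})
    return trips
```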

One thing the hook does that is not immediately obvious: it handles type === 'system' and type === 'ack' messages silently. Web PubSub sends these as housekeeping, and if you do not explicitly skip them, you will hit parse errors or unexpected behaviour in your message handler.

What it looks like

The AT Analytics Dashboard

Here is the actual dashboard in action. This is a screen recording of the live AT OpsCentre streaming real-time GTFS data from Auckland Transport. Every 60 seconds the feed refreshes: trip updates, vehicle positions, and service alerts all flowing through the pipeline described above.

at-opscentre.kepakisan.co.nz
[Screen recording: the AT OpsCentre live dashboard streaming real-time GTFS data from Auckland Transport, showing KPI cards, route performance heatmap, and live trip updates]

Making it actually useful

Click a Trip, See the Real Arrival Times

The dashboard looked impressive from day one. Live KPIs, heatmaps, streaming data. But the ops team asked the question that matters: "So a bus is 4 minutes late at stop 12. What does that actually mean for the remaining 23 stops on the route?" That is the question GTFS delay data alone cannot answer. A bus that is 4 minutes late at one stop might be 4 minutes late at the next stop, or 12 minutes late, depending on traffic between those stops right now. The scheduled timetable does not know about a crash on the motorway. Google Maps does.

So I added a click handler. When an ops manager clicks on any active trip in the dashboard, the app fires a request to the Google Maps Routes API with the bus's current GPS position as the origin and each remaining stop on the route as a waypoint. The API returns traffic-aware predicted travel times for each leg. The dashboard then computes a predicted arrival time for every remaining stop by chaining those leg durations together, and displays the cumulative delay or earliness against the scheduled timetable.

Why this matters operationally

Without this, the ops team sees a single delay number per trip: "4 minutes late." That number comes from the GTFS-RT feed and reflects the delay at the last reported stop. It says nothing about what happens next. A 4-minute delay on a route that runs through clear suburban streets might stay at 4 minutes. The same delay on a route about to hit the Harbour Bridge at 5:15pm could compound to 15 minutes by the terminus. The GTFS feed has no way to know the difference. Google's traffic layer does.

The cumulative view changes the conversation. Instead of "bus 2847 is 4 minutes late," the ops team sees "bus 2847 is 4 minutes late now, predicted 7 minutes late at stop 18, and 11 minutes late at the terminus based on current traffic." That is actionable. They can decide whether to notify passengers, adjust connections, or deploy a relief bus before the delay compounds rather than after.

How it works

Implementation

Google Maps Routes API integration

When the user clicks a trip row, the app already has the bus's current GPS position from the vehiclesRef Map and the full list of remaining stops from the static GTFS data bundled with the SWA. The click handler builds a Routes API request with the current position as origin, the final stop as destination, and all intermediate stops as waypoints.

Simplified flow · click handler → Routes API → cumulative delay
// 1. User clicks a trip row in the dashboard
const vehicle = vehiclesRef.current.get(tripId);
const remainingStops = getRemainingStops(tripId, vehicle.stopSequence);
const lastStop = remainingStops[remainingStops.length - 1];

// 2. Build the Routes API request
// Origin: bus's current GPS position
// Destination: final stop on the route
// Intermediates: every remaining stop before the final one
// Assumes each stop.position is { latitude, longitude }, the shape the API expects
const routeRequest = {
  origin: { location: { latLng: { latitude: vehicle.lat, longitude: vehicle.lon } } },
  destination: { location: { latLng: lastStop.position } },
  intermediates: remainingStops.slice(0, -1).map(s => ({ location: { latLng: s.position } })),
  travelMode: "DRIVE",
  routingPreference: "TRAFFIC_AWARE"
};

// 3. Response gives one leg per remaining stop, each with a traffic-aware
//    duration string such as "165s" (response = parsed JSON body of the call)
const legs = response.routes[0].legs;

// 4. Chain leg durations to compute predicted arrival at each stop
// 5. Compare predicted arrival vs scheduled arrival → cumulative delay
let predictedArrival = Date.now();  // epoch milliseconds
legs.forEach((leg, i) => {
  predictedArrival += parseFloat(leg.duration) * 1000;   // seconds → milliseconds
  const scheduled = remainingStops[i].scheduledArrival;  // assumed epoch milliseconds
  const delta = (predictedArrival - scheduled) / 60000;  // minutes
  // delta > 0 = late, delta < 0 = early
});

The key detail is routingPreference: "TRAFFIC_AWARE". Without it, the API returns static duration estimates that are no better than the scheduled timetable. With it, Google factors in live traffic conditions, road closures, and congestion patterns. That is the entire value of this feature: the difference between "how long should this take" and "how long will this actually take right now."
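To make the request shape concrete, here is a hedged sketch of the HTTP call itself. The endpoint, the `X-Goog-Api-Key` and `X-Goog-FieldMask` headers, and the `location.latLng.latitude/longitude` nesting follow the Routes API `computeRoutes` REST method. The function names, the `{ lat, lon }` shape for stops, and the way the API key is passed in are assumptions for illustration, not the OpsCentre implementation.

```javascript
const ROUTES_URL = "https://routes.googleapis.com/directions/v2:computeRoutes";

// Convert a { lat, lon } pair into the waypoint shape the Routes API expects.
const toWaypoint = ({ lat, lon }) => ({
  location: { latLng: { latitude: lat, longitude: lon } },
});

// Build the fetch() arguments for a traffic-aware route through the remaining stops.
function buildRouteRequest(vehicle, remainingStops, apiKey) {
  const lastStop = remainingStops[remainingStops.length - 1];
  const body = {
    origin: toWaypoint(vehicle),
    destination: toWaypoint(lastStop),
    intermediates: remainingStops.slice(0, -1).map(toWaypoint),
    travelMode: "DRIVE",
    routingPreference: "TRAFFIC_AWARE",
  };
  return {
    url: ROUTES_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Goog-Api-Key": apiKey,
        // Only ask for what the dashboard needs: one duration per leg.
        "X-Goog-FieldMask": "routes.legs.duration",
      },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the legs of the first route.
async function fetchLegs(vehicle, remainingStops, apiKey) {
  const { url, init } = buildRouteRequest(vehicle, remainingStops, apiKey);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Routes API returned ${res.status}`);
  const data = await res.json();
  return data.routes[0].legs;
}
```

Splitting the request builder from the network call keeps the part that is easy to get wrong (the body shape and the field mask) testable without spending a billable request.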

💰

Cost note: The Routes API is not free. Google charges per request based on the number of waypoints. For an internal ops dashboard where a manager clicks on a handful of trips per shift, the cost is negligible. You are not calling it on every poll cycle, only on user interaction. If you were calling it for every active trip on every 60-second refresh, the bill would add up fast. The click-to-predict pattern keeps it cheap by design: the expensive API call only fires when a human decides they need the information.

🎯

Why this is the feature that made the dashboard useful: Everything else in OpsCentre, the KPIs, the heatmap, the streaming pipeline, is infrastructure. It looks good and it is technically interesting. But the ops team does not make decisions based on a heatmap. They make decisions based on "will this bus make its connection at Britomart in 20 minutes?" The Google Maps integration is the only feature that directly answers an operational question. The rest is context. This is the action.

When being cheap backfired

The 429 Problem: Stop Being Cheap, Maha

Before I get into the story, here is the thing I should have read properly from the start. The Free_F1 Web PubSub tier has hard limits that make it unsuitable for anything AT-scale:

| Limit | Free_F1 | Standard S1 (1 unit) |
| --- | --- | --- |
| Concurrent connections | 20 | 1,000 |
| Outbound messages per day | 20,000 | 1,000,000 (included free per unit)[6] |
| REST API throttle | 1,000 requests/sec/unit (shared)[8] | 1,000 requests/sec/unit (dedicated) |
| Max WebSocket frame size | 1 MB per frame[8] | 1 MB per frame[8] |
| Monthly cost | $0 | Check azure.microsoft.com/pricing/details/web-pubsub (varies by region) |
⚠️

If your payload is over 1.5MB, start on Standard S1. The AT GTFS-RT feed runs 1.6-1.7MB per 60-second poll. Each chunk the Function pushes to Web PubSub counts toward the daily message quota, calculated in 2KB increments. A single AT poll cycle can consume several hundred of your 20,000 free daily messages. Multiply by 1,440 polls per day and the free tier is exhausted before breakfast. Do not make the same mistake I made. If you know your data is large, start on Standard.

At some point the AT data started growing, with the full GTFS-RT feed hitting around 1.6 to 1.7 megabytes per poll cycle. And then the 429s started.

This problem never showed up with the camera POC. A single fatigue event is maybe 300 bytes. Manually triggering it a few times per minute sits well within the free tier's 20,000 daily message quota. The camera POC was a genuinely misleading baseline. Everything worked fine, I declared victory, and moved on to AT. AT data is a different beast entirely.

If you are not familiar, HTTP 429 means "too many requests": the service is throttling you. In this case, the Azure Function was hitting the Web PubSub API to publish the chunked messages, and Web PubSub was rejecting them. I was still on the Free_F1 tier.

Here is what the free tier actually allows, and why AT data blows through it so fast:

| Limit | Free_F1 | AT data reality |
| --- | --- | --- |
| Concurrent connections | 20 | Fine for an internal ops room |
| Outbound messages per day | 20,000 | AT poll every 60s = 1,440 polls/day. At 1.7MB per poll chunked into 400-record messages, each poll sends several hundred 2KB billing units. Burns through 20K quickly |
| Message size billing unit | Every 2KB of outbound traffic = 1 message | 1.7MB payload = 850 billing messages per poll cycle, before even counting connected browsers |
| REST API throttle | 1,000 req/sec per unit[8] | Fine for this use case |

The maths is unforgiving. 850 billing messages per poll, 1,440 polls per day = 1.2 million billing messages per day needed. The free tier gives 20,000. No amount of clever chunking was going to fix that ratio.
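The same arithmetic as a small script, assuming decimal units (1.7MB = 1,700KB) and a single connected browser. The function and constant names are illustrative, and the `connectedClients` multiplier reflects the "before even counting connected browsers" caveat in the table.

```javascript
// Web PubSub bills outbound traffic in 2 KB units: every 2 KB sent counts as one message.
const BILLING_UNIT_KB = 2;
const FREE_DAILY_MESSAGES = 20000;

function billingMessagesPerPoll(payloadKb) {
  return Math.ceil(payloadKb / BILLING_UNIT_KB);
}

function dailyBillingMessages(payloadKb, pollsPerDay, connectedClients = 1) {
  return billingMessagesPerPoll(payloadKb) * pollsPerDay * connectedClients;
}

const pollsPerDay = 24 * 60;                  // one poll every 60 seconds
const perPoll = billingMessagesPerPoll(1700); // 1.7 MB AT payload
const perDay = dailyBillingMessages(1700, pollsPerDay);
const pollsUntilExhausted = Math.floor(FREE_DAILY_MESSAGES / perPoll);

console.log(perPoll);             // 850
console.log(perDay);              // 1224000
console.log(pollsUntilExhausted); // 23
```

With one poll per minute, 23 polls means the free daily quota is gone in under half an hour.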

I spent a lot of time trying to be clever about it. Reducing chunk size. Adding backoff delays. Compressing the payload. None of it fixed the underlying issue. The real problem was that the free tier was simply not built for what I was asking it to do. And I knew it. I just did not want to admit it.
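For reference, a minimal sketch of the kind of backoff wrapper that paragraph describes. The `send` callback, the retry limit, and the delay values are assumptions for illustration, not the code from this project. It retries on HTTP 429 with exponentially growing delays, which smooths out short bursts but cannot create quota the tier does not have.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `send` is any async function that publishes one chunk and resolves to an HTTP status code.
// `wait` is injectable so the retry logic can be exercised without real delays.
async function publishWithBackoff(send, chunk, maxRetries = 4, wait = sleep) {
  for (let attempt = 0; ; attempt++) {
    const status = await send(chunk);
    if (status !== 429) return status;        // success, or an error that is not throttling
    if (attempt >= maxRetries) return status; // still throttled after all retries: give up
    await wait(backoffDelayMs(attempt));
  }
}
```

With the defaults, a chunk that keeps getting throttled is attempted five times over roughly 7.5 seconds (500, 1,000, 2,000, then 4,000 ms between attempts) before the 429 is returned to the caller.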

Eventually my boss said: "Stop being cheap, Maha, I can pay for it."

So I upgraded to Standard S1. One click in the portal. No code change. No redeployment. The 429s stopped immediately. The whole thing just worked.

💸

Lesson: Spending hours debugging a capacity problem is not being frugal. It is wasting the most expensive resource you have, which is your own time. Upgrade early.

The Standard S1 tier gives you 1,000 concurrent connections and 1 million outbound messages per unit per day.[6] That 1 million is an included allowance rather than a hard cap, which matters because the maths above puts AT data at roughly 1.2 million billing messages per day. One unit is more than enough for an internal ops dashboard. Check azure.microsoft.com/pricing/details/web-pubsub for current pricing as it varies by region. Even at Standard S1 rates, the full cost of this streaming stack is a small fraction of what Fabric F64 capacity costs (~$5,003/month[2]) for the equivalent Power BI streaming feature set. The upgrade pays for itself in the first hour you stop losing hair over 429s.

To put numbers on it: the free tier's 20,000 daily messages spread across 1,440 daily polls works out to about 14 billing messages per poll. At 2KB per billing unit, that is roughly 28KB of payload per poll cycle. Anything bigger and you will hit the free tier wall before the day is out. AT data at 1.7MB is roughly 60 times over that threshold. The free tier is for the camera POC. Standard S1 is for AT production data.

A tool I probably should have considered

Would Kafka Have Been Better? Honest Answer

I want to be upfront about something: I am comfortable in Microsoft's ecosystem. Azure, Functions, Web PubSub, SWA, Blob Storage. I know where things are, I know what the error messages mean, and I know who to blame when something breaks. That familiarity absolutely influenced what I chose to build with. That is a bias, and it is worth naming.

Kafka gets mentioned in almost every conversation about streaming, so let me give an honest comparison rather than dismiss it.

What Kafka actually is

Apache Kafka is a distributed, log-based streaming platform. The key word is log: events are written to a durable, append-only, partitioned log on disk. Consumers read from that log at their own pace and can replay from any point in time. It is built for high-throughput, high-durability scenarios: high-volume data pipelines, event sourcing, complex stream processing, exactly-once delivery semantics. Originally developed at LinkedIn for internal data pipelines and then open-sourced, it is now one of the most widely deployed pieces of infrastructure in enterprise data engineering.

Where Kafka is better than what I built

Replay. If a downstream consumer crashes and comes back up, it can replay every event from the point it left off. My setup has no replay. If the browser refreshes or the connection drops, you start from zero (or from the last blob batch if you implement backfill).

Multiple independent consumers. Kafka lets multiple downstream systems read the same event stream independently, in parallel, at their own pace. In my setup, the Web PubSub broadcast goes to connected browsers and that is it. If I wanted a second consumer (say, a fraud detection service also watching fatigue events) I would have to add a second fan-out path explicitly.

Durable retention. Kafka retains events for a configurable period (days, weeks, forever). My queue drains every 5 minutes and then events exist only in blob storage as partitioned JSON files.

Complex stream processing. Kafka has native stream processing through Kafka Streams and integrates with Apache Flink and ksqlDB for joins, windowed aggregations, and real-time transformations. The multi-cloud portability argument is also strong: Kafka producers and consumers use the same client libraries regardless of whether workloads run on AWS, GCP, Azure, or on-premises. My current stack is Azure-only by design.

Where Kafka is worse (or just different) for this use case

Kafka does not push to browsers. This is the important one. Kafka is a backend-to-backend system. It does not have a WebSocket push layer. Even if I used Kafka for everything else, I would still need Web PubSub (or a Socket.io server, or something equivalent) to get events into a browser in real time. The two tools solve different parts of the same problem.

Operational overhead. Kafka offers more flexibility and control but at the cost of significantly higher operational complexity. Running self-managed Kafka means brokers, ZooKeeper (or KRaft in newer versions), partition replication, monitoring, and alerting. Managed options like Confluent Cloud or Azure Event Hubs with the Kafka protocol reduce that overhead but add cost and another vendor relationship.

Overkill for the polling cadence. The AT pipeline polls every 60 seconds. That is not a high-frequency stream. Kafka's advantages really show at high throughput: thousands of events per second, large numbers of partitions, complex processing topologies. For 1 event per minute feeding a single dashboard, the infrastructure complexity is hard to justify.

| | Apache Kafka | Azure Web PubSub (this stack) |
| --- | --- | --- |
| Primary purpose | Durable, replayable event log between backend services | Real-time WebSocket delivery to browser clients |
| Browser push | Not natively. Needs a separate WebSocket layer | Core feature, built-in |
| Event replay | Yes: configurable retention, replay from any offset | No: messages not stored (hence the blob Storage Queue workaround) |
| Multiple consumers | Yes: consumer groups read independently | One group broadcast. Multiple browser sessions get the same message |
| Operational complexity | High (self-managed) or moderate (managed, with added cost) | Minimal: fully managed, no cluster to operate |
| Cloud portability | Runs anywhere, same client libraries across clouds | Azure only |
| Right for this build? | Would solve the backend layer well, but still needs a WebSocket layer | Fits the use case: low frequency, single dashboard consumer |
🎯

The honest verdict: For this specific use case (one data source, polling every 60 seconds, one dashboard), Web PubSub was the right call and Kafka would have been overkill. But if the Ritchies data platform grew to dozens of real-time sources, complex cross-feed processing, and multiple downstream consumers beyond a dashboard, and needed to replay historical events, then Kafka (or Azure Event Hubs, which speaks the Kafka protocol) for the backbone would be the right architecture. Web PubSub would still be there at the edge doing the browser delivery. They are not competing tools; they solve different layers of the same problem.

If I do another project in this space, I would look at Kafka more seriously. My Microsoft comfort zone got me to a working solution fast, and that mattered. But I should not mistake familiarity for optimality, and I am aware that I sometimes do.

Bringing it together

The Bigger Argument

The previous post showed that you do not need Power BI for reporting. This post shows you do not need it for streaming either. Historical batch queries with DuckDB and Parquet files, and live event streaming with Azure Web PubSub and a WebSocket-connected React app. Both on the same free hosting model, both keeping your data in your own storage, neither requiring a per-user licence or dedicated capacity.

Worth being explicit about where the blob storage lands in all of this: it is not a dead end. The partitioned JSON files the batch processor writes feed the data warehouse, which in turn serves two reporting consumers. If your org already has Power BI Pro licences for other work, Power BI can query the warehouse directly, which is completely reasonable. If you want a zero-licence option, the DuckDB static web app from the previous post queries the same blob storage directly from the browser. Both work. The streaming layer and the reporting layer are independent. The queue is the bridge.

What surprised me building the AT side was how much logic ended up in the browser. The useStreamingData hook does chunked message reassembly, in-memory Map merges, debounced rebuilds, KPI computation, fleet health scoring, and insight generation. All client-side, all without a database. The ingest function stays fast because it is a dumb relay. Enrichment happens at the edges. That pattern (keep the hot path simple, push complexity to where it belongs) is probably the most transferable thing from this whole build.
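A hedged sketch of the first two of those steps, chunk reassembly and the in-memory Map merge. The message envelope (`batchId`, `chunkIndex`, `totalChunks`, `records`) and the record key `vehicleId` are assumptions made for illustration. The real useStreamingData hook may use different field names, and the debouncing, KPI, and scoring steps are left out here.

```javascript
// Collects the chunks of one poll cycle and releases the records once all have arrived.
class ChunkAssembler {
  constructor() {
    this.pending = new Map(); // batchId -> array of record lists, indexed by chunkIndex
  }

  // Returns the full record list when the batch is complete, otherwise null.
  add({ batchId, chunkIndex, totalChunks, records }) {
    const chunks = this.pending.get(batchId) ?? new Array(totalChunks).fill(null);
    chunks[chunkIndex] = records;
    if (chunks.some((c) => c === null)) {
      this.pending.set(batchId, chunks); // still waiting for at least one chunk
      return null;
    }
    this.pending.delete(batchId);
    return chunks.flat();
  }
}

// Merge a completed batch into the live state, keyed by vehicle: newest fields win.
function mergeVehicles(state, records) {
  for (const record of records) {
    state.set(record.vehicleId, { ...state.get(record.vehicleId), ...record });
  }
  return state;
}

// Usage: chunks may arrive out of order over the WebSocket.
const assembler = new ChunkAssembler();
const vehicles = new Map();
const first = assembler.add({
  batchId: "poll-1", chunkIndex: 1, totalChunks: 2,
  records: [{ vehicleId: "2847", delay: 240 }],
});
const second = assembler.add({
  batchId: "poll-1", chunkIndex: 0, totalChunks: 2,
  records: [{ vehicleId: "1203", delay: 0 }],
});
if (second) mergeVehicles(vehicles, second);
console.log(first, vehicles.size); // null 2
```

Holding a batch back until every chunk has arrived means the dashboard never renders a half-updated fleet, at the cost of dropping a poll cycle if one chunk goes missing.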

Build the live path first. The persistence path can always be added without breaking anything. The queue is the bridge between the two, and at this scale it costs almost nothing to run.

The next piece to add is history backfill on connect: when an ops manager opens the dashboard, load the last N events from blob storage so they are not starting from zero. The blob storage the CRON processor writes to is already queryable with DuckDB, so this circles back to the previous post in a very literal way.

References
Sources cited in this post
  1. Microsoft Learn: Real-time streaming in Power BI. Covers the three dataset types (Streaming, Push, PubNub), their limitations, and ingest rate caps.
  2. Microsoft Learn: Streaming dataflows (preview). Confirms Premium capacity or PPU required, minimum A3. P-SKU retirement July 2024; F64 (~$5,003/month) is the equivalent replacement.
  3. Microsoft Azure: Azure Functions pricing. Consumption plan: first 1 million executions free per month, plus 400,000 GB-seconds compute. Beyond that, $0.20 per million executions.
  4. Microsoft Learn: Azure Queue storage trigger for Azure Functions. Documents the QueueTrigger peek-lock model, automatic deletion on success, and retry/poison queue behaviour.
  5. Microsoft Q&A: Azure Storage Queue: processed message comes back to the queue after 10 minutes. Community thread documenting the visibility timeout reappearance issue behind the CRON workaround.
  6. Microsoft Learn: Billing model of Azure Web PubSub service. Free tier (Free_F1): 20 concurrent connections, 20,000 outbound messages/day. Standard: 1,000 connections/unit, 1M messages/unit/day included free per unit.
  7. Microsoft Learn: Power BI streaming: considerations and limitations. Push datasets: 1 req/sec, 16MB max. Pure Streaming: 5 req/sec, 15KB max, 1-hour cache only.
  8. Microsoft Learn: Scale an instance of Azure Web PubSub Service. Covers upgrading from Free to Standard S1 and scaling out units. Scale settings apply without code changes or redeployment.
  9. Confluent: Apache Kafka vs. Google Cloud Pub/Sub. Useful framing on the architectural difference between Kafka (durable partitioned log, pull-based, multi-consumer) vs managed pub/sub services (fan-out delivery, fire-and-forget). The same distinction applies when comparing Kafka against Azure Web PubSub.