Streaming TB-Scale Synthetic Datasets Without Disk Hell
Customers hit OOM kills and disk-full errors generating terabyte datasets. We rebuilt the output pipeline around Azure Blob, per-file SAS URLs, BYO storage, and rate-controlled NDJSON streaming — so a 1 TB job now ships end-to-end with zero buffering.
A customer generated an 840 GB banking AML dataset last month. The job ran for 6 hours and produced exactly what they asked for. Then the download timed out. Then the download timed out again. Then they tried the per-file API and hit a 2 GB zip cap. Then they opened a support ticket.
The root cause, embarrassingly, was our API pod: a container with a 512 Mi memory limit and no ephemeral volume, trying to build a zip archive in RAM before serving it. That design works fine for 10 GB fintech demo jobs. It falls over on anything serious.
v2.3 replaces the entire output pipeline. The new architecture ships TB-scale datasets end-to-end with no buffering anywhere — and introduces two new delivery modes for customers whose workflows don't involve downloading archives at all.
Managed Azure Blob (Default)
Every job's output now lands in Azure Blob storage under a per-user per-job prefix. The API pod still keeps a local 100 GB working directory for DataSynth to stage files into, but uploads happen per-file as soon as generation completes. Local files are deleted immediately after successful upload. The client never pulls bytes through the API pod.
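The worker-side loop is deliberately boring. A sketch (Python with `azure-storage-blob`; the function name and paths are illustrative, not our actual internals):

```python
from pathlib import Path

from azure.storage.blob import ContainerClient

def ship_completed_file(container: ContainerClient, user_id: str,
                        job_id: str, staged: Path) -> None:
    """Upload one finished file to its per-user per-job prefix, then free disk."""
    blob_name = f"{user_id}/{job_id}/{staged.name}"
    with staged.open("rb") as f:
        # The SDK streams from the open file handle in chunks rather than
        # loading the file whole, so a 200 GB Parquet file never sits in RAM.
        container.upload_blob(name=blob_name, data=f, overwrite=True)
    staged.unlink()  # delete locally only after the upload succeeds
```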
Downloads are a manifest of per-file SAS URLs:
```
curl -H "Authorization: Bearer $VYNFI_API_KEY" \
  https://api.vynfi.com/v1/jobs/$JOB_ID/download
```

```json
{
  "type": "managed_blob",
  "ttl_seconds": 3600,
  "files": [
    {
      "path": "journal_entries.parquet",
      "size": 12847293018,
      "url": "https://stvynfiproddata.blob.core.windows.net/...?sig=..."
    },
    ...
  ]
}
```

SAS URLs are minted via user-delegation, scoped to a single blob, read-only, and TTL-limited to one hour. Clients pull directly from Azure Blob with no API proxying. A 200 GB Parquet file downloads at whatever rate your network allows — no OOM, no 2 GB cap, no proxy timeouts.
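If you'd rather script it than click through, the whole flow is two requests: fetch the manifest, then pull each blob straight from Azure. A minimal sketch in Python with `requests` (only the endpoint and manifest fields shown above are from the API; the rest is illustrative):

```python
import os

import requests

API = "https://api.vynfi.com/v1"
headers = {"Authorization": f"Bearer {os.environ['VYNFI_API_KEY']}"}
job_id = os.environ["JOB_ID"]

manifest = requests.get(f"{API}/jobs/{job_id}/download", headers=headers).json()

for entry in manifest["files"]:
    # Each URL is a single-blob, read-only SAS: no auth header needed,
    # and the bytes come from Azure Blob, never the API pod.
    with requests.get(entry["url"], stream=True) as r:
        r.raise_for_status()
        with open(entry["path"], "wb") as out:
            for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
                out.write(chunk)
```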
BYO Storage (Team+)
Several of our enterprise customers run airgapped or heavily regulated environments where 'your synthetic data is briefly at rest in a VynFi storage account' is itself a compliance problem. v2.3 lets them submit a container SAS URL as part of the job:
```
POST /v1/jobs
{
  "config": { ... },
  "output_destination": {
    "kind": "byo_azure_sas",
    "container_sas_url": "https://customer.blob.core.windows.net/mycont?sv=2023-11-03&sp=racwdl&se=2026-04-14T12:00:00Z&sig=..."
  }
}
```

The worker validates the SAS (HTTPS-only, Azure host, ≥ 2h expiry, `cw` permissions at minimum), uploads generated files directly into the customer's container, and deletes local copies immediately. Zero bytes of customer data ever land on VynFi-managed storage.
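For reference, those checks are roughly this shape. A sketch (Python; the `.blob.core.windows.net` suffix test is our assumption of how "Azure host" is enforced, and this is not the literal worker code):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

def validate_container_sas(sas_url: str) -> None:
    """Reject a BYO SAS the worker could not safely upload through."""
    parts = urlsplit(sas_url)
    q = parse_qs(parts.query)

    if parts.scheme != "https":
        raise ValueError("SAS must be HTTPS-only")
    # Assumption: public-cloud blob endpoint suffix stands in for "Azure host".
    if not parts.hostname or not parts.hostname.endswith(".blob.core.windows.net"):
        raise ValueError("SAS must point at an Azure Blob host")

    # 'se' is the SAS expiry; require at least 2 hours of runway for the job.
    expiry = datetime.fromisoformat(q["se"][0].replace("Z", "+00:00"))
    if expiry - datetime.now(timezone.utc) < timedelta(hours=2):
        raise ValueError("SAS must remain valid for at least 2 more hours")

    # 'sp' is the permission string; create + write is the upload minimum.
    if not {"c", "w"} <= set(q.get("sp", [""])[0]):
        raise ValueError("SAS needs at least create (c) and write (w) permissions")
```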
NDJSON Live Streaming (Scale+)
For live-ingestion workflows — feeding Kafka, Spark, ClickHouse, or an LLM training loop — even 'download from blob' adds unwanted latency. DataSynth 2.3 added a native NDJSON streaming endpoint, and we now proxy it with token-bucket rate-limiting and periodic progress events:
```
curl -N -H "Authorization: Bearer $VYNFI_API_KEY" \
  "https://api.vynfi.com/v1/jobs/$JOB_ID/stream/ndjson?rate=500&burst=100&progress_interval=1000"
```

```
{"type":"journal_entries","subtype":"JournalEntry","data":{...}}
{"type":"journal_entries","subtype":"JournalEntry","data":{...}}
...
{"type":"_progress","lines_emitted":1000}
{"type":"journal_entries","subtype":"JournalEntry","data":{...}}
...
```

Each line is a self-describing envelope. Rate-limiting caps at 10,000 lines/sec; burst allows short spikes; progress events let you write reliable resumable consumers. No buffering, no disk — your consumer backpressures the producer through HTTP body flow control.
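A resumable consumer is mostly line handling plus checkpointing on `_progress`. A sketch (Python; `save_checkpoint` and `handle_record` are hypothetical stand-ins for your own sink):

```python
import json
import os
from pathlib import Path

import requests

API = "https://api.vynfi.com/v1"
CHECKPOINT = Path("stream.checkpoint")  # hypothetical resume-point file

def save_checkpoint(lines_emitted: int) -> None:
    # Hypothetical: persist the last _progress count so a restarted
    # consumer knows how many lines it has already processed.
    CHECKPOINT.write_text(str(lines_emitted))

def handle_record(subtype: str, data: dict) -> None:
    ...  # hypothetical sink: Kafka produce, ClickHouse insert, etc.

headers = {"Authorization": f"Bearer {os.environ['VYNFI_API_KEY']}"}
params = {"rate": 500, "burst": 100, "progress_interval": 1000}
url = f"{API}/jobs/{os.environ['JOB_ID']}/stream/ndjson"

with requests.get(url, headers=headers, params=params, stream=True) as r:
    r.raise_for_status()
    # Reading this body slowly is the backpressure: TCP flow control
    # stalls the producer whenever the consumer falls behind.
    for line in r.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        if event["type"] == "_progress":
            save_checkpoint(event["lines_emitted"])
        else:
            handle_record(event["subtype"], event["data"])
```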
Size Estimation + Tier Quotas
To prevent runaway jobs, `/v1/configs/estimate-size` returns a calibrated byte estimate before you submit. Each tier has a per-job quota (Free 10 GB, Developer 100 GB, Team 1 TB, Scale 10 TB, Enterprise unlimited). The dashboard wizard calls this endpoint on every config change and warns you as you approach quota.
```
curl -X POST https://api.vynfi.com/v1/configs/estimate-size \
  -H "Authorization: Bearer $VYNFI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": { "sector": "banking_aml", "rows": 10000000, "companies": 50 }}'
```

```json
{
  "estimatedBytes": 487923847293,
  "estimatedFiles": 2000,
  "tierQuotaBytes": 1099511627776,
  "exceedsQuota": false,
  "warning": null,
  "breakdown": [
    { "domain": "core_journal_entries", "bytes": 120000000000 },
    { "domain": "banking_transactions", "bytes": 180000000000 },
    ...
  ]
}
```
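A scripted preflight looks the same as the wizard's: estimate, check the quota flag, then submit. A sketch (hypothetical client code, not an official SDK; we assume the created job object carries an `id` field):

```python
import os

import requests

API = "https://api.vynfi.com/v1"
headers = {"Authorization": f"Bearer {os.environ['VYNFI_API_KEY']}"}

config = {"sector": "banking_aml", "rows": 10_000_000, "companies": 50}

# Estimate before submitting, exactly as the dashboard wizard does.
est = requests.post(f"{API}/configs/estimate-size",
                    headers=headers, json={"config": config}).json()
print(f"~{est['estimatedBytes'] / 1e9:,.0f} GB in ~{est['estimatedFiles']} files")

if est["exceedsQuota"]:
    raise SystemExit("config exceeds your tier's per-job quota; trim it or upgrade")

job = requests.post(f"{API}/jobs", headers=headers, json={"config": config}).json()
print("submitted job", job["id"])  # assumes the job object exposes an "id"
```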
What Changed Under the Hood
- Pod memory: 512 Mi → 4 Gi limit. Pod ephemeral storage: 0 → 10 Gi request, 100 Gi limit.
- PVC: 32 Gi → 100 Gi (workspace for DataSynth staging).
- Azure Blob account: already existed in our Terraform, but was never wired. It is now.
- Workload identity: granted Storage Blob Data Contributor + Storage Blob Delegator on the storage account.
- Upload path: block-blob streaming in 8 MiB chunks via Put Block + Put Block List, up to 200 GB per blob (see the sketch after this list).
- Download path: user-delegation SAS with 1 h TTL, minted via HMAC-SHA256 off a 6-day delegation key.
- Deprecated: the `/v1/jobs/{id}/download.zip` path for outputs > 500 MB. Use the manifest + per-file SAS instead.
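The upload-path item maps onto the Azure SDK roughly as below (a sketch with `azure-storage-blob`; the worker performs the same Put Block / Put Block List operations, though not via this exact code):

```python
import uuid

from azure.storage.blob import BlobBlock, BlobClient

CHUNK = 8 * 1024 * 1024  # 8 MiB per block, matching the upload path above

def stream_upload(blob: BlobClient, local_path: str) -> None:
    """Upload a file of arbitrary size as staged blocks plus one commit."""
    block_ids: list[str] = []
    with open(local_path, "rb") as f:
        while chunk := f.read(CHUNK):
            block_id = uuid.uuid4().hex  # equal-length IDs, as block blobs require
            blob.stage_block(block_id=block_id, data=chunk)  # Put Block
            block_ids.append(block_id)
    # Nothing is visible until the commit; a crashed upload leaves no partial blob.
    blob.commit_block_list([BlobBlock(block_id=b) for b in block_ids])  # Put Block List
```

At 8 MiB per block, Azure's 50,000-block ceiling per blob sits comfortably above the 200 GB cap we enforce.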
If you were running into disk or memory limits on large jobs, you can re-run them now. If you want NDJSON streaming into your own pipeline, upgrade to Scale. If you want to keep every synthetic byte off VynFi infrastructure, upgrade to Team and submit your own container SAS URL.