Core Architecture

Four microservices, one shared catalog, and a pluggable storage layer. Here is how Data Island Core is built.

System Overview

Core is composed of four microservices that share a Redis catalog and a pluggable object storage backend.

Architecture diagram (summary):

  • Clients: REST / SDK, Browser, Power BI / Excel, Claude Desktop
  • Services: API Server (:8051, FastAPI, 128+ endpoints), Web UI (:8050, Jinja2, HTMX), OData (:8052, OData v4.0), MCP (:8099, 24 AI tools)
  • Shared core library: SQL Parser, Engine Select, RBAC Filter, Catalog Manager, View Chain
  • Redis: Catalog, Sessions, Audit, Pub/Sub, Locks, Registry
  • Query engines: DuckDB Lite, DuckDB Pro, Spark SQL via Thrift
  • Object storage: Apache Parquet files on S3, Azure, GCS, MinIO, or Local

All services share the same core library, so security, query, and catalog logic is never duplicated.

Microservices

Each service is independently deployable, horizontally scalable, and communicates through the shared Redis catalog.

API Server

:8051

The primary programmatic interface. Handles all CRUD operations on tables, data ingestion, SQL queries, data quality, sharing, and administration.

  • FastAPI with async request handling
  • Auto-generated OpenAPI 3.0 documentation
  • JWT token validation via Gatekeeper
  • Batch and streaming data ingestion
  • SQL query routing to DuckDB or Spark

Web UI

:8050

Browser-based management interface for exploring tables, running queries, managing permissions, and monitoring system health.

  • Jinja2 server-side rendering with HTMX interactivity
  • Schema browser and data preview
  • SQL editor with syntax highlighting
  • User and permission management
  • Real-time audit log viewer

OData Server

:8052

OData v4.0 compliant endpoint for direct integration with Power BI, Excel, Tableau, and other BI tools. No drivers or plugins required.

  • Full OData v4.0 protocol support
  • $filter, $select, $orderby, $top, $skip
  • Automatic schema discovery for BI tools
  • Token-based authentication
  • Row-level security applied transparently
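As a rough illustration of the query options listed above, the following sketch assembles an OData request URL in Python. The host, the `Orders` entity set, and the field names are hypothetical; only the port and the `$`-prefixed options come from the text, and authentication is omitted.

```python
from urllib.parse import quote, urlencode


def build_odata_url(base_url, entity_set, *, filter_expr=None, select=None,
                    orderby=None, top=None, skip=None):
    """Assemble an OData v4 query URL from the standard system query options."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr
    if select:
        params["$select"] = ",".join(select)
    if orderby:
        params["$orderby"] = orderby
    if top is not None:
        params["$top"] = str(top)
    if skip is not None:
        params["$skip"] = str(skip)
    # Keep "$" and "," literal; percent-encode spaces and other reserved characters.
    query = urlencode(params, safe="$,", quote_via=quote)
    url = f"{base_url.rstrip('/')}/{quote(entity_set)}"
    return f"{url}?{query}" if query else url


url = build_odata_url(
    "http://localhost:8052",   # hypothetical host; the port is from the text
    "Orders",                  # hypothetical entity set
    filter_expr="Amount gt 100",
    select=["Id", "Amount"],
    orderby="Amount desc",
    top=10,
    skip=20,
)
print(url)
```

A BI tool normally builds these URLs itself; the sketch is only meant to show how the five options combine into a single request.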

MCP Server

:8099

Model Context Protocol server that enables AI assistants like Claude Desktop to query tables, explore schemas, and execute SQL through natural conversation.

  • Full MCP protocol implementation
  • Table listing and schema inspection tools
  • SQL query execution with result formatting
  • Data quality report retrieval
  • Inherits user permissions from calling context

Data Flow

Follow the path of data through Core, from ingestion to query results.

Write Path

  • 1. Client Request: JWT + payload
  • 2. Auth Check: JWKS validation + RBAC
  • 3. Schema Validation: Type + constraint checks
  • 4. Parquet Write: Compress + store
  • 5. Catalog Update: Version N+1 in Redis
  • 6. Audit Log: SHA-256 hash chain
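The final write-path step mentions a SHA-256 hash chain. The sketch below shows one common way such a chain can work: each entry's hash covers the previous entry's hash, so altering any earlier entry invalidates everything after it. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration, not Core's actual audit format.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def _entry_hash(prev_hash, event):
    """Hash the previous hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain, event):
    """Append an audit event linked to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"action": "write", "table": "orders", "version": 1})
append_entry(audit_log, {"action": "write", "table": "orders", "version": 2})
print(verify_chain(audit_log))          # chain is intact
audit_log[0]["event"]["version"] = 99   # tamper with the first entry
print(verify_chain(audit_log))          # tampering is detected
```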

Read Path

  • 1. Query Arrives: SQL / REST / OData
  • 2. Auth + RBAC: Token + permissions
  • 3. View Chain: Dedup + Tombstone + RBAC
  • 4. Engine Select: DuckDB or Spark
  • 5. Execute Query: Predicate pushdown
  • 6. Return Results: JSON / OData / Parquet

The View Chain

Every read query passes through a chain of composable views that transform raw storage into clean, authorized results.

Base Data → Dedup View → Tombstone View → RBAC View → User Query

Base Data

Raw Parquet files in object storage. Each file represents a version of the table as written by the client. Immutable and append-only.

Dedup View

Removes duplicate rows based on configured key columns. Uses a last-write-wins strategy: the most recent version of each key is retained.

Tombstone View

Filters out soft-deleted records. Tombstone markers written via the API are applied here, removing rows from query results while preserving audit history.

RBAC View

Applies row-level filtering (SQL WHERE clauses) and column-level restrictions based on the caller's roles and permissions. Invisible to the query layer.

User Query

The final SQL or filter expression submitted by the user. Executed against the already-filtered, deduplicated, and authorized view of the data.
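To make the composition concrete, here is a minimal in-memory sketch of the chain, with plain Python functions standing in for SQL views. The `_version` column, the shape of the tombstone set, and the function names are assumptions; the text describes the real stages as views over Parquet files, not Python lists.

```python
def dedup_view(rows, key_columns, version_column="_version"):
    """Last-write-wins: keep the highest-version row for each key."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in latest or row[version_column] >= latest[key][version_column]:
            latest[key] = row
    return list(latest.values())


def tombstone_view(rows, key_columns, deleted_keys):
    """Hide rows whose key has a tombstone marker."""
    return [r for r in rows
            if tuple(r[c] for c in key_columns) not in deleted_keys]


def rbac_view(rows, row_predicate, allowed_columns):
    """Apply the caller's row filter, then project only permitted columns."""
    return [{c: r[c] for c in allowed_columns if c in r}
            for r in rows if row_predicate(r)]


base_data = [
    {"id": 1, "region": "EU", "amount": 10, "_version": 1},
    {"id": 2, "region": "US", "amount": 20, "_version": 1},
    {"id": 1, "region": "EU", "amount": 15, "_version": 2},  # newer write for id=1
    {"id": 3, "region": "EU", "amount": 30, "_version": 2},
]

visible = dedup_view(base_data, ["id"])
visible = tombstone_view(visible, ["id"], deleted_keys={(3,)})
visible = rbac_view(visible, lambda r: r["region"] == "EU", ["id", "amount"])

# The user query runs last, against rows that are already filtered.
result = [r for r in visible if r["amount"] > 12]
print(result)  # [{'id': 1, 'amount': 15}]
```

Because each stage only consumes the output of the previous one, the user query never sees duplicates, deleted rows, or columns outside the caller's permissions.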

Engine Selection

Core automatically selects the optimal SQL engine based on data size and freshness. No manual tuning required.

AUTO Mode Decision Matrix

When the engine is set to AUTO (the default), Core applies a two-dimensional decision matrix. The first axis is total data size across all referenced Parquet files. The second axis is data freshness — how recently the table was last written to. This prevents cache thrashing during active ingestion.

Data size              Typical use            FRESH (<5 min)              STABLE (≥5 min)
Small (<100 MB)        Dashboards, ad-hoc     LITE (cheap anyway)         LITE (cheap anyway)
Medium (100 MB–10 GB)  Production reporting   LITE (cache would churn)    PRO (cache pays off)
Large (≥10 GB)         Distributed workloads  SPARK (too big for DuckDB)  SPARK (too big for DuckDB)

Spark requires a Thrift Server. When unavailable, large queries fall back to DuckDB Pro. Freshness threshold and size boundaries are configurable via environment variables.

Why Freshness Matters

  • Fresh data (age <5 min): Routed to Lite even at medium sizes, because cached views in Pro would be invalidated before they pay off — avoiding cache thrash during active ingestion.
  • Stable data (age ≥5 min): Routed to Pro for medium datasets, because the persistent connection and cached views will be reused across multiple queries, amortizing setup cost.
  • Unknown freshness: Data is assumed stable so that Pro gets a chance to cache. This default favors query performance.
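The decision matrix and the freshness rules above can be sketched as a small function. The thresholds match the documented defaults, but the function name, parameter names, and the choice of binary megabytes are assumptions; the text only says the boundaries are configurable through environment variables.

```python
from enum import Enum


class Engine(Enum):
    LITE = "duckdb_lite"
    PRO = "duckdb_pro"
    SPARK = "spark_sql"


MB = 1024 ** 2  # assumption: binary units; the source does not specify
GB = 1024 ** 3


def select_engine(total_bytes, age_seconds, *, spark_available=True,
                  small_limit=100 * MB, large_limit=10 * GB,
                  fresh_seconds=300):
    """Pick an engine from total data size and seconds since the last write."""
    # Unknown freshness (None) is treated as stable, as described in the text.
    is_fresh = age_seconds is not None and age_seconds < fresh_seconds

    if total_bytes >= large_limit:
        # Without a Thrift Server, large queries fall back to DuckDB Pro.
        return Engine.SPARK if spark_available else Engine.PRO
    if total_bytes < small_limit:
        return Engine.LITE  # cheap either way
    # Medium data: avoid cache churn while the table is being written to.
    return Engine.LITE if is_fresh else Engine.PRO


print(select_engine(50 * MB, 10))                          # Engine.LITE
print(select_engine(2 * GB, 60))                           # Engine.LITE
print(select_engine(2 * GB, 3600))                         # Engine.PRO
print(select_engine(2 * GB, None))                         # Engine.PRO
print(select_engine(50 * GB, 60))                          # Engine.SPARK
print(select_engine(50 * GB, 60, spark_available=False))   # Engine.PRO
```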

Engine Characteristics

  • DuckDB Lite: Transient connection, VIEWs created and dropped per query. Zero state between requests. Sub-second for small datasets. Single persistent connection with HTTP metadata cache.
  • DuckDB Pro: Module-level singleton connection that survives across requests. Persistent view cache with version-aware invalidation. Reference counting prevents in-flight queries from breaking during version transitions.
  • Spark SQL: Connects to a Spark Thrift Server via HiveServer2. Batches large file lists into temporary views, unions them, and executes with configurable timeout (default 300s).
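The reference counting mentioned for DuckDB Pro can be pictured with a small sketch: a view for an old table version is dropped only once no in-flight query still holds it. The class and method names are hypothetical, and the `dropped` list stands in for the actual DROP VIEW statement; this illustrates the general technique, not Core's implementation.

```python
import threading


class VersionedViewCache:
    """Tracks in-flight queries per (table, version) so stale views retire safely."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}   # table -> current version
        self._refs = {}      # (table, version) -> number of in-flight queries
        self.dropped = []    # illustration only: views that were retired

    def publish(self, table, version):
        """Register a new version; drop the old view if nothing is using it."""
        with self._lock:
            old = self._current.get(table)
            self._current[table] = version
            self._refs.setdefault((table, version), 0)
            if old is not None and old != version and self._refs.get((table, old), 0) == 0:
                self._drop(table, old)

    def acquire(self, table):
        """Pin the current version for the duration of a query."""
        with self._lock:
            version = self._current[table]
            self._refs[(table, version)] += 1
            return version

    def release(self, table, version):
        """Unpin a version; drop it if it is stale and no longer referenced."""
        with self._lock:
            self._refs[(table, version)] -= 1
            is_stale = self._current.get(table) != version
            if is_stale and self._refs[(table, version)] == 0:
                self._drop(table, version)

    def _drop(self, table, version):
        # Caller must hold the lock.
        self._refs.pop((table, version), None)
        self.dropped.append((table, version))


cache = VersionedViewCache()
cache.publish("orders", 1)
pinned = cache.acquire("orders")   # a query starts on version 1
cache.publish("orders", 2)         # a write lands while the query is running
print(cache.dropped)               # [] because version 1 is still in use
cache.release("orders", pinned)    # the query finishes
print(cache.dropped)               # [('orders', 1)]
```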

Storage Architecture

Five pluggable storage backends behind a single interface. Switch providers with a configuration change — not a migration project.

The Storage Factory

The get_storage() factory function resolves and instantiates the correct backend at startup based on the STORAGE_TYPE environment variable. Cloud SDKs are lazily imported — you only install the packages you need. All backends implement the same StorageInterface abstract base class, so the rest of the system is completely backend-agnostic.

StorageInterface: read · write · list · delete · exists · presign

  • Local Disk: Atomic writes with fsync. No dependencies required.
  • AWS S3: boto3, virtual-hosted and path-style addressing, presigned URLs.
  • MinIO: minio SDK, auto-creates the bucket.
  • Azure Blob: Connection string, SAS, or managed identity.
  • Google Cloud: Service account, ADC, or workload identity.

Example: STORAGE_TYPE=S3 → get_storage() → S3Storage

All data is stored as standard Apache Parquet, an open format with zero vendor lock-in.

Backend Selection Priority

  • 1. Explicit argument: get_storage(kind="S3") — used in tests and migration scripts.
  • 2. Environment variable: STORAGE_TYPE=MINIO — the standard production path.
  • 3. Package defaults: Configured at the package level for embedded SDK use.
  • 4. Fallback: LOCAL — always available, zero dependencies.
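The priority order above, combined with lazy SDK loading, might look roughly like the sketch below. The helper names, the `AZURE` and `GCS` kind strings, and the module mapping are assumptions for illustration; only `STORAGE_TYPE`, the `kind` argument, and the `LOCAL` fallback appear in the text.

```python
import importlib
import os

# Hypothetical mapping: backend kind -> (importable module, pip package name).
SDK_MODULES = {
    "LOCAL": None,
    "S3": ("boto3", "boto3"),
    "MINIO": ("minio", "minio"),
    "AZURE": ("azure.storage.blob", "azure-storage-blob"),
    "GCS": ("google.cloud.storage", "google-cloud-storage"),
}


def resolve_storage_kind(kind=None, env=None, package_default=None):
    """Apply the priority: explicit argument, env var, package default, LOCAL."""
    env = os.environ if env is None else env
    for candidate in (kind, env.get("STORAGE_TYPE"), package_default, "LOCAL"):
        if candidate:
            return candidate.strip().upper()


def load_sdk(kind):
    """Import the SDK for the selected backend only, with an install hint on failure."""
    if kind not in SDK_MODULES:
        raise ValueError(f"Unknown storage kind: {kind!r}")
    spec = SDK_MODULES[kind]
    if spec is None:
        return None  # local disk needs no third-party package
    module_name, package_name = spec
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Storage backend {kind} requires '{package_name}'. "
            f"Install it with: pip install {package_name}"
        ) from exc


print(resolve_storage_kind(kind="s3", env={"STORAGE_TYPE": "MINIO"}))  # S3
print(resolve_storage_kind(env={"STORAGE_TYPE": "MINIO"}))             # MINIO
print(resolve_storage_kind(env={}))                                    # LOCAL
```

Passing `env` explicitly keeps the resolution logic easy to test without touching the process environment.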

Key Design Decisions

  • Base Prefix: A single bucket can host multiple SuperTable deployments via SUPERTABLE_PREFIX. All paths are transparently prefixed.
  • Atomic Writes (Local): Uses tempfile + fsync + os.replace() for crash-safe JSON writes. Directory is fsynced on POSIX for durability.
  • Lazy Imports: Cloud SDKs are imported via importlib.import_module() only when selected. Missing SDK triggers a user-friendly install hint.
  • DuckDB Integration: Each backend exposes connection properties (endpoint, credentials, URL style) that configure_httpfs_and_s3() uses to set up DuckDB's httpfs extension for direct Parquet reads.
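The atomic-write pattern named above (tempfile, fsync, `os.replace()`, then a directory fsync on POSIX) can be sketched as follows. The function name and the example payload are hypothetical; the standard-library calls are real.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # force file contents to disk
        os.replace(tmp_path, path)   # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # clean up the partial temp file
        raise
    if os.name == "posix":
        # Persist the directory entry so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "catalog.json")   # hypothetical file name
    atomic_write_json(target, {"version": 1})
    atomic_write_json(target, {"version": 2})
    with open(target, encoding="utf-8") as fh:
        print(json.load(fh))   # {'version': 2}
```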
