Core Architecture

System Overview

Core is composed of four microservices that share a Redis catalog and a pluggable object storage backend.

Microservices

Each service is independently deployable, horizontally scalable, and communicates through the shared Redis catalog.

API Server

Port: 8051

The primary programmatic interface. Handles all CRUD operations on tables, data ingestion, SQL queries, data quality, sharing, and administration.

FastAPI with async request handling
Auto-generated OpenAPI 3.0 documentation
JWT token validation via Gatekeeper
Batch and streaming data ingestion
SQL query routing to DuckDB or Spark

Web UI

Port: 8050

Browser-based management interface for exploring tables, running queries, managing permissions, and monitoring system health.

Jinja2 server-side rendering with HTMX interactivity
Schema browser and data preview
SQL editor with syntax highlighting
User and permission management
Real-time audit log viewer

OData Server

Port: 8052

OData v4.0 compliant endpoint for direct integration with Power BI, Excel, Tableau, and other BI tools. No drivers or plugins required.

Full OData v4.0 protocol support
$filter, $select, $orderby, $top, $skip
Automatic schema discovery for BI tools
Token-based authentication
Row-level security applied transparently
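
A client can combine the query options listed above in a single request URL. The sketch below builds such a URL with Python's standard library. Only port 8052 and the option names come from the text; the host, the `/odata` path, the `orders` entity set, and the column names are hypothetical.

```python
from urllib.parse import quote, urlencode


def build_odata_url(base_url, entity_set, *, filter_=None, select=None,
                    orderby=None, top=None, skip=None):
    """Assemble an OData v4 query URL from the standard system query options."""
    options = {}
    if filter_:
        options["$filter"] = filter_
    if select:
        options["$select"] = ",".join(select)
    if orderby:
        options["$orderby"] = orderby
    if top is not None:
        options["$top"] = str(top)
    if skip is not None:
        options["$skip"] = str(skip)
    # Keep "$" and "," readable; spaces are percent-encoded as %20.
    query = urlencode(options, quote_via=quote, safe="$,")
    url = f"{base_url.rstrip('/')}/{entity_set}"
    return f"{url}?{query}" if query else url


url = build_odata_url(
    "http://localhost:8052/odata",  # hypothetical host and path
    "orders",                       # hypothetical entity set
    filter_="amount gt 100",
    select=["id", "amount"],
    orderby="amount desc",
    top=10,
    skip=20,
)
```

A BI tool would normally generate these URLs itself; the helper is only meant to show how the options compose.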

MCP Server

Port: 8099

Model Context Protocol server that enables AI assistants like Claude Desktop to query tables, explore schemas, and execute SQL through natural conversation.

Full MCP protocol implementation
Table listing and schema inspection tools
SQL query execution with result formatting
Data quality report retrieval
Inherits user permissions from calling context

Data Flow

Follow the path of data through Core, from ingestion to query results.

Write Path

1. Client Request: JWT + payload
2. Auth Check: JWKS validation + RBAC
3. Schema Validation: Type + constraint checks
4. Parquet Write: Compress + store
5. Catalog Update: Version N+1 in Redis
6. Audit Log: SHA-256 hash chain
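
The final write-path step mentions a SHA-256 hash chain. The following is a minimal sketch of why such a chain makes tampering detectable, assuming each entry hashes the previous entry's hash together with a canonical JSON encoding of the event. The record layout, the genesis value, and the function names are illustrative and are not taken from Core.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash, event):
    """Hash the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event, linking it to the hash of the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": entry_hash(prev_hash, event)})


def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        expected = entry_hash(prev_hash, entry["event"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"action": "write", "table": "orders", "version": 1})
append_entry(log, {"action": "write", "table": "orders", "version": 2})
```

Because each hash depends on all earlier entries, changing one historical event invalidates every hash after it.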

Read Path

1. Query Arrives: SQL / REST / OData
2. Auth + RBAC: Token + permissions
3. View Chain: Dedup + Tombstone + RBAC
4. Engine Select: DuckDB or Spark
5. Execute Query: Predicate pushdown
6. Return Results: JSON / OData / Parquet

The View Chain

Every read query passes through a chain of composable views that transform raw storage into clean, authorized results.

Base Data → Dedup View → Tombstone View → RBAC View → User Query

Base Data

Raw Parquet files in object storage. Each file represents a version of the table as written by the client. Immutable and append-only.

Dedup View

Removes duplicate rows based on configured key columns. Uses a last-write-wins strategy: the most recent version of each key is retained.

Tombstone View

Filters out soft-deleted records. Tombstone markers written via the API are applied here, removing rows from query results while preserving audit history.

RBAC View

Applies row-level filtering (SQL WHERE clauses) and column-level restrictions based on the caller's roles and permissions. Invisible to the query layer.

User Query

The final SQL or filter expression submitted by the user. Executed against the already-filtered, deduplicated, and authorized view of the data.
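
The layering described above can be sketched with ordinary SQL views. The example uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra packages. The table layout, the `tombstones` table, the `version` column, and the `region = 'EU'` row filter are all hypothetical; Core's actual view definitions are not shown in this document.

```python
import sqlite3


def run_view_chain():
    """Build base -> dedup -> tombstone -> RBAC views and run a user query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE base_data (id INTEGER, region TEXT, amount REAL, version INTEGER);
        CREATE TABLE tombstones (id INTEGER);

        INSERT INTO base_data VALUES
            (1, 'EU', 10.0, 1),
            (1, 'EU', 12.0, 2),   -- newer version of key 1
            (2, 'US', 20.0, 1),
            (3, 'EU', 30.0, 1);
        INSERT INTO tombstones VALUES (3);  -- key 3 is soft-deleted

        -- Last-write-wins: keep only the highest version per key.
        CREATE VIEW dedup_view AS
            SELECT b.* FROM base_data AS b
            WHERE b.version = (SELECT MAX(version) FROM base_data WHERE id = b.id);

        -- Hide soft-deleted keys; the base rows stay in storage.
        CREATE VIEW tombstone_view AS
            SELECT * FROM dedup_view
            WHERE id NOT IN (SELECT id FROM tombstones);

        -- Row-level filter (WHERE) plus column-level restriction (no version column).
        CREATE VIEW rbac_view AS
            SELECT id, region, amount FROM tombstone_view
            WHERE region = 'EU';
    """)
    # The user query only ever sees the outermost view.
    rows = conn.execute("SELECT id, amount FROM rbac_view ORDER BY id").fetchall()
    conn.close()
    return rows


rows = run_view_chain()
```

Here the user query returns a single row for key 1 with the newer amount: key 2 is removed by the row filter and key 3 by the tombstone.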

Engine Selection

Core automatically selects the optimal SQL engine based on data size and freshness. No manual tuning required.

AUTO Mode Decision Matrix

When the engine is set to AUTO (the default), Core applies a two-dimensional decision matrix. The first axis is the total data size across all referenced Parquet files. The second axis is data freshness, meaning how recently the table was last written to. Taking freshness into account prevents cache thrashing during active ingestion.

Why Freshness Matters

Fresh data (age <5 min): Routed to DuckDB Lite even at medium sizes, because cached views in DuckDB Pro would be invalidated before they pay off.
Stable data (age ≥5 min): Routed to DuckDB Pro for medium datasets, because the persistent connection and cached views are reused across multiple queries, amortizing the setup cost.
Unknown freshness: Data is assumed stable so that DuckDB Pro gets a chance to cache. This default favors performance.
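
The freshness rules above can be written as a small routing function. This is a sketch only: the 5-minute threshold comes from the text, but the size boundaries between small, medium, and large, the routing of small data to Lite and large data to Spark, and all names are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Engine(Enum):
    DUCKDB_LITE = "duckdb_lite"
    DUCKDB_PRO = "duckdb_pro"
    SPARK = "spark"


FRESHNESS_THRESHOLD_S = 5 * 60        # from the text: 5 minutes
SMALL_MAX_BYTES = 100 * 1024 ** 2     # hypothetical small/medium boundary
MEDIUM_MAX_BYTES = 10 * 1024 ** 3     # hypothetical medium/large boundary


def select_engine(total_bytes: int, age_seconds: Optional[float]) -> Engine:
    """Pick an engine from total data size and time since the last write."""
    if total_bytes > MEDIUM_MAX_BYTES:
        return Engine.SPARK           # assumed: large data goes to Spark
    if total_bytes <= SMALL_MAX_BYTES:
        return Engine.DUCKDB_LITE     # assumed: small data goes to Lite
    # Medium data: freshness decides. Unknown age (None) counts as stable.
    is_fresh = age_seconds is not None and age_seconds < FRESHNESS_THRESHOLD_S
    return Engine.DUCKDB_LITE if is_fresh else Engine.DUCKDB_PRO
```

For a medium table, `select_engine(size, 60)` picks Lite, while `select_engine(size, 600)` and `select_engine(size, None)` both pick Pro.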

Engine Characteristics

DuckDB Lite: Transient connection, with VIEWs created and dropped per query. Zero state between requests. Sub-second for small datasets.
DuckDB Pro: Single persistent connection with an HTTP metadata cache, held as a module-level singleton that survives across requests. Persistent view cache with version-aware invalidation. Reference counting prevents in-flight queries from breaking during version transitions.
Spark SQL: Connects to a Spark Thrift Server via HiveServer2. Batches large file lists into temporary views, unions them, and executes with a configurable timeout (default 300s).
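
The reference counting mentioned for DuckDB Pro can be illustrated as follows. This is a hypothetical sketch, not Core's implementation: the class, its methods, and the `dropped` list are invented for the example. The idea is that views for an old version are dropped only once no in-flight query still holds a reference to them.

```python
import threading
from contextlib import contextmanager


class VersionedViewCache:
    """Keeps views for a table version alive while queries still use them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None   # latest published catalog version
        self._refs = {}        # version -> number of in-flight queries
        self.dropped = []      # versions whose views were dropped (for illustration)

    def _drop_if_idle(self, version):
        # Caller must hold the lock. Only superseded, unused versions are dropped.
        if version != self._current and self._refs.get(version) == 0:
            del self._refs[version]
            self.dropped.append(version)

    def publish(self, version):
        """Switch new queries to `version`; retire the old one when idle."""
        with self._lock:
            old, self._current = self._current, version
            self._refs.setdefault(version, 0)
            if old is not None and old != version:
                self._drop_if_idle(old)

    @contextmanager
    def query(self):
        """Pin the current version for the duration of one query."""
        with self._lock:
            if self._current is None:
                raise RuntimeError("no version published yet")
            version = self._current
            self._refs[version] += 1
        try:
            yield version
        finally:
            with self._lock:
                self._refs[version] -= 1
                self._drop_if_idle(version)
```

A query that starts on version 1 keeps that version's views alive even if version 2 is published while it runs; they are dropped when the query finishes.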

Storage Architecture

Five pluggable storage backends sit behind a single interface. Switching providers is a configuration change, not a migration project.

The Storage Factory

The get_storage() factory function resolves and instantiates the correct backend at startup, in production typically from the STORAGE_TYPE environment variable. Cloud SDKs are lazily imported, so you only install the packages you need. All backends implement the same StorageInterface abstract base class, so the rest of the system is completely backend-agnostic.

Backend Selection Priority

1. Explicit argument: get_storage(kind="S3") — used in tests and migration scripts.
2. Environment variable: STORAGE_TYPE=MINIO — the standard production path.
3. Package defaults: Configured at the package level for embedded SDK use.
4. Fallback: LOCAL — always available, zero dependencies.
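
The priority order above, together with lazy SDK imports, might look roughly like this. The helper names, the `package_default` parameter, and the mapping from backend kind to SDK module are assumptions for illustration; only get_storage(), STORAGE_TYPE, and the S3, MINIO, and LOCAL kinds appear in the text.

```python
import importlib
import os

# Hypothetical mapping from backend kind to the SDK module it would need.
SDK_MODULES = {"S3": "boto3", "MINIO": "minio", "LOCAL": None}


def resolve_storage_kind(kind=None, env=None, package_default=None):
    """Apply the priority: explicit argument, environment, package default, LOCAL."""
    env = os.environ if env is None else env
    for candidate in (kind, env.get("STORAGE_TYPE"), package_default):
        if candidate:
            return candidate.upper()
    return "LOCAL"


def load_sdk(kind):
    """Import the backend's SDK only when that backend is actually selected."""
    if kind not in SDK_MODULES:
        raise ValueError(f"Unknown storage backend: {kind}")
    module_name = SDK_MODULES[kind]
    if module_name is None:
        return None  # LOCAL needs no third-party package
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Storage backend {kind} needs the '{module_name}' package; "
            f"try: pip install {module_name}"
        ) from exc
```

Passing `env` explicitly keeps the resolution logic easy to test without touching the real process environment.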

Key Design Decisions

Base Prefix: A single bucket can host multiple SuperTable deployments via SUPERTABLE_PREFIX. All paths are transparently prefixed.
Atomic Writes (Local): Uses tempfile + fsync + os.replace() for crash-safe JSON writes. Directory is fsynced on POSIX for durability.
Lazy Imports: Cloud SDKs are imported via importlib.import_module() only when selected. Missing SDK triggers a user-friendly install hint.
DuckDB Integration: Each backend exposes connection properties (endpoint, credentials, URL style) that configure_httpfs_and_s3() uses to set up DuckDB's httpfs extension for direct Parquet reads.
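
The atomic-write pattern named above (tempfile, fsync, os.replace, then a directory fsync on POSIX) can be sketched with the standard library alone. The function name and the `.tmp` suffix are illustrative, and Core's actual implementation may differ in its details.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # force file contents to disk
        os.replace(tmp_path, path)   # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # do not leave a stray temp file behind
        raise
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

If the process crashes before os.replace, the target file is untouched; if it crashes afterwards, the target already holds the complete new contents.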
