Core Architecture
Four microservices, one shared catalog, and a pluggable storage layer. Here is how Data Island Core is built.
System Overview
Core is composed of four microservices that share a Redis catalog and a pluggable object storage backend.
Microservices
Each service is independently deployable, horizontally scalable, and communicates through the shared Redis catalog.
API Server
:8051
The primary programmatic interface. Handles all CRUD operations on tables, data ingestion, SQL queries, data quality, sharing, and administration.
- FastAPI with async request handling
- Auto-generated OpenAPI 3.0 documentation
- JWT token validation via Gatekeeper
- Batch and streaming data ingestion
- SQL query routing to DuckDB or Spark
Web UI
:8050
Browser-based management interface for exploring tables, running queries, managing permissions, and monitoring system health.
- Jinja2 server-side rendering with HTMX interactivity
- Schema browser and data preview
- SQL editor with syntax highlighting
- User and permission management
- Real-time audit log viewer
OData Server
:8052
OData v4.0 compliant endpoint for direct integration with Power BI, Excel, Tableau, and other BI tools. No drivers or plugins required.
- Full OData v4.0 protocol support
- $filter, $select, $orderby, $top, $skip
- Automatic schema discovery for BI tools
- Token-based authentication
- Row-level security applied transparently
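As an illustration of how a client might combine these query options, the sketch below composes an OData request URL. The port comes from this page; the `orders` entity set, the URL layout, and the `build_odata_url` helper are hypothetical and not part of Core's documented API.

```python
from urllib.parse import quote, urlencode


def build_odata_url(base_url, entity_set, *, filter_=None, select=None,
                    orderby=None, top=None, skip=None):
    """Compose an OData v4 query URL from the standard system query options."""
    options = {}
    if filter_:
        options["$filter"] = filter_
    if select:
        options["$select"] = ",".join(select)
    if orderby:
        options["$orderby"] = orderby
    if top is not None:
        options["$top"] = str(top)
    if skip is not None:
        options["$skip"] = str(skip)
    # Keep "$" and "," readable; encode spaces as %20 rather than "+".
    query = urlencode(options, quote_via=quote, safe="$,")
    return f"{base_url}/{entity_set}" + (f"?{query}" if query else "")


# Hypothetical entity set name; 8052 is the OData Server port listed above.
url = build_odata_url(
    "http://localhost:8052", "orders",
    filter_="amount gt 100", select=["id", "amount"], top=10,
)
print(url)
```

A real request would also carry the authentication token, typically in an `Authorization` header.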
MCP Server
:8099
Model Context Protocol server that enables AI assistants like Claude Desktop to query tables, explore schemas, and execute SQL through natural conversation.
- Full MCP protocol implementation
- Table listing and schema inspection tools
- SQL query execution with result formatting
- Data quality report retrieval
- Inherits user permissions from calling context
Data Flow
Follow the path of data through Core, from ingestion to query results.
Write Path
Read Path
The View Chain
Every read query passes through a chain of composable views that transform raw storage into clean, authorized results.
Base Data
Raw Parquet files in object storage. Each file represents a version of the table as written by the client. Immutable and append-only.
Dedup View
Removes duplicate rows based on configured key columns. Uses a last-write-wins strategy: the most recent version of each key is retained.
Tombstone View
Filters out soft-deleted records. Tombstone markers written via the API are applied here, removing rows from query results while preserving audit history.
RBAC View
Applies row-level filtering (SQL WHERE clauses) and column-level restrictions based on the caller's roles and permissions. Invisible to the query layer.
User Query
The final SQL or filter expression submitted by the user. Executed against the already-filtered, deduplicated, and authorized view of the data.
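The chain can be pictured as a series of functions applied in order. The sketch below models it in plain Python over lists of dicts rather than SQL views. The `_version` and `_deleted` column names, and the choice to model a tombstone as a newer row flagged as deleted, are assumptions made for illustration only.

```python
def dedup_view(rows, key_columns, version_column="_version"):
    """Last-write-wins: keep the highest-version row for each key."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in latest or row[version_column] >= latest[key][version_column]:
            latest[key] = row
    return list(latest.values())


def tombstone_view(rows, deleted_column="_deleted"):
    """Hide rows whose surviving version is a tombstone marker."""
    return [r for r in rows if not r.get(deleted_column, False)]


def rbac_view(rows, row_filter, allowed_columns):
    """Apply the caller's row filter, then project to the permitted columns."""
    return [
        {c: r[c] for c in allowed_columns if c in r}
        for r in rows
        if row_filter(r)
    ]


def read(base_rows, key_columns, row_filter, allowed_columns, user_query):
    """Run the user's query against the deduplicated, authorized view."""
    rows = dedup_view(base_rows, key_columns)
    rows = tombstone_view(rows)
    rows = rbac_view(rows, row_filter, allowed_columns)
    return user_query(rows)


base = [
    {"id": 1, "region": "EU", "amount": 10, "_version": 1},
    {"id": 1, "region": "EU", "amount": 15, "_version": 2},  # newer write wins
    {"id": 2, "region": "US", "amount": 20, "_version": 1},  # hidden by RBAC
    {"id": 3, "region": "EU", "amount": 30, "_version": 1},
    {"id": 3, "region": "EU", "amount": 30, "_version": 2, "_deleted": True},
]
result = read(
    base,
    key_columns=["id"],
    row_filter=lambda r: r["region"] == "EU",
    allowed_columns=["id", "amount"],
    user_query=lambda rows: [r for r in rows if r["amount"] > 5],
)
print(result)  # [{'id': 1, 'amount': 15}]
```

Because each stage only consumes the output of the previous one, the user query never sees duplicates, deleted rows, or columns outside the caller's permissions.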
Engine Selection
Core automatically selects the optimal SQL engine based on data size and freshness. No manual tuning required.
AUTO Mode Decision Matrix
When the engine is set to AUTO (the default), Core applies a two-dimensional decision matrix. The first axis is total data size across all referenced Parquet files. The second axis is data freshness: how recently the table was last written to. Taking freshness into account prevents cache thrashing during active ingestion.
Why Freshness Matters
- Fresh data (age <5 min): Routed to Lite even at medium sizes, because cached views in Pro would be invalidated by the next write before they pay off.
- Stable data (age ≥5 min): Routed to Pro for medium datasets, because the persistent connection and cached views will be reused across multiple queries, amortizing setup cost.
- Unknown freshness: Data is assumed stable so that Pro gets a chance to cache. This is a conservative default that favors performance.
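The freshness rules above can be sketched as a small routing function. Only the 5-minute freshness threshold comes from this page. The size boundaries, the assumption that small data always goes to Lite and large data to Spark, and the names `choose_engine`, `SMALL_MAX_BYTES`, and `MEDIUM_MAX_BYTES` are hypothetical placeholders.

```python
from enum import Enum
from typing import Optional


class Engine(Enum):
    LITE = "duckdb_lite"
    PRO = "duckdb_pro"
    SPARK = "spark_sql"


FRESHNESS_THRESHOLD_S = 5 * 60         # from the text: fresh means age < 5 min
SMALL_MAX_BYTES = 100 * 1024**2        # hypothetical boundary, not Core's value
MEDIUM_MAX_BYTES = 10 * 1024**3        # hypothetical boundary, not Core's value


def choose_engine(total_bytes: int, age_seconds: Optional[float]) -> Engine:
    """Pick an engine from total data size and time since the last write."""
    if total_bytes > MEDIUM_MAX_BYTES:
        return Engine.SPARK            # assumed: large data goes to Spark
    if total_bytes <= SMALL_MAX_BYTES:
        return Engine.LITE             # assumed: small data goes to Lite
    # Medium data: freshness decides. Unknown age (None) counts as stable.
    is_fresh = age_seconds is not None and age_seconds < FRESHNESS_THRESHOLD_S
    return Engine.LITE if is_fresh else Engine.PRO


one_gib = 1024**3
print(choose_engine(one_gib, 60))      # Engine.LITE
print(choose_engine(one_gib, 600))     # Engine.PRO
print(choose_engine(one_gib, None))    # Engine.PRO
```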
Engine Characteristics
- DuckDB Lite: Transient connection, VIEWs created and dropped per query. Zero state between requests. Sub-second for small datasets.
- DuckDB Pro: Single persistent connection with HTTP metadata cache, held as a module-level singleton that survives across requests. Persistent view cache with version-aware invalidation. Reference counting prevents in-flight queries from breaking during version transitions.
- Spark SQL: Connects to a Spark Thrift Server via HiveServer2. Batches large file lists into temporary views, unions them, and executes with configurable timeout (default 300s).
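To make the reference-counting idea behind DuckDB Pro concrete, here is a minimal sketch of a version-aware view cache. It is not Core's implementation: the `ViewCache` class, its callbacks, and the view naming scheme are hypothetical, and it assumes table versions only move forward.

```python
import threading
from contextlib import contextmanager


class ViewCache:
    """Keeps one cached view per table; retires old versions once unused."""

    def __init__(self, create_view, drop_view):
        self._create = create_view     # callback: (table, version) -> None
        self._drop = drop_view         # callback: (table, version) -> None
        self._lock = threading.Lock()
        self._current = {}             # table -> version currently cached
        self._refs = {}                # (table, version) -> in-flight queries
        self._retired = set()          # superseded versions still in use

    @contextmanager
    def acquire(self, table, version):
        key = (table, version)
        with self._lock:
            old = self._current.get(table)
            if old != version:         # new version: build it, retire the old
                self._create(table, version)
                self._current[table] = version
                if old is not None:
                    self._retire((table, old))
            self._refs[key] = self._refs.get(key, 0) + 1
        try:
            yield f"{table}_v{version}"
        finally:
            with self._lock:
                self._refs[key] -= 1
                if self._refs[key] == 0 and key in self._retired:
                    self._drop_now(key)

    def _retire(self, key):
        # Caller holds the lock. Drop immediately only if nobody is reading.
        if self._refs.get(key, 0) == 0:
            self._drop_now(key)
        else:
            self._retired.add(key)

    def _drop_now(self, key):
        self._retired.discard(key)
        self._refs.pop(key, None)
        self._drop(*key)


events = []
cache = ViewCache(
    create_view=lambda t, v: events.append(("create", t, v)),
    drop_view=lambda t, v: events.append(("drop", t, v)),
)
with cache.acquire("orders", 1):
    with cache.acquire("orders", 2):   # a write landed mid-query
        still_alive = ("drop", "orders", 1) not in events
print(events)
```

The old view is dropped only after the query that was reading it finishes, which is the property the reference count is there to guarantee.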
Storage Architecture
Five pluggable storage backends behind a single interface. Switch providers with a configuration change — not a migration project.
The Storage Factory
The get_storage() factory function resolves and instantiates the correct backend at startup based on the STORAGE_TYPE environment variable.
Cloud SDKs are lazily imported — you only install the packages you need. All backends implement the same StorageInterface abstract base class, so the rest of the system is completely backend-agnostic.
Backend Selection Priority
1. Explicit argument: get_storage(kind="S3"), used in tests and migration scripts.
2. Environment variable: STORAGE_TYPE=MINIO, the standard production path.
3. Package defaults: Configured at the package level for embedded SDK use.
4. Fallback: LOCAL, which is always available and has zero dependencies.
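The priority order can be expressed as a first-match lookup. This is a sketch only: `resolve_storage_kind`, `PACKAGE_DEFAULTS`, and the `storage_type` key are hypothetical names, while `STORAGE_TYPE` and the `LOCAL` fallback come from the list above.

```python
import os
from typing import Mapping, Optional

# Hypothetical package-level defaults, e.g. set by an application embedding the SDK.
PACKAGE_DEFAULTS: dict = {}


def resolve_storage_kind(
    kind: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    package_defaults: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first configured backend kind, in priority order."""
    env = os.environ if env is None else env
    defaults = PACKAGE_DEFAULTS if package_defaults is None else package_defaults
    candidates = (
        kind,                          # 1. explicit argument
        env.get("STORAGE_TYPE"),       # 2. environment variable
        defaults.get("storage_type"),  # 3. package defaults
        "LOCAL",                       # 4. fallback, always present
    )
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate.strip().upper()
    return "LOCAL"  # unreachable, kept for type checkers


print(resolve_storage_kind("S3", {"STORAGE_TYPE": "MINIO"}, {}))  # S3
print(resolve_storage_kind(None, {"STORAGE_TYPE": "MINIO"}, {}))  # MINIO
print(resolve_storage_kind(None, {}, {}))                         # LOCAL
```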
Key Design Decisions
- Base Prefix: A single bucket can host multiple SuperTable deployments via SUPERTABLE_PREFIX. All paths are transparently prefixed.
- Atomic Writes (Local): Uses tempfile + fsync + os.replace() for crash-safe JSON writes. The directory is fsynced on POSIX for durability.
- Lazy Imports: Cloud SDKs are imported via importlib.import_module() only when selected. A missing SDK triggers a user-friendly install hint.
- DuckDB Integration: Each backend exposes connection properties (endpoint, credentials, URL style) that configure_httpfs_and_s3() uses to set up DuckDB's httpfs extension for direct Parquet reads.
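The atomic-write decision is the easiest of these to show in code. Below is a minimal sketch of the tempfile, fsync, and os.replace() sequence using only the standard library. The `atomic_write_json` name and the temp-file naming are hypothetical; Core's local backend may differ in detail.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so the rename stays
    # on one filesystem, which is what makes os.replace() atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())     # force file contents to disk
        os.replace(tmp_path, path)     # atomic swap into place
    except BaseException:
        try:
            os.unlink(tmp_path)        # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Persist the rename itself by syncing the directory entry.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

If the process crashes before os.replace() runs, the target file is untouched and only a stray temp file remains.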