Core Features

A deep dive into every capability of Data Island Core.

Versioned Storage

Every write operation in Data Island Core creates a new, immutable version of the table. Data is never overwritten in place. This append-only architecture provides a complete audit trail, safe rollbacks, and point-in-time queries with zero configuration.

Core Concepts

Append-only writes

Immutable snapshots

Time-travel queries

Automatic deduplication

Soft-delete tombstones

Version metadata

How It Works

Append-Only: Each write appends a new Parquet file to the table's storage path. Previous versions remain intact and accessible.
Immutable Snapshots: Every version is a complete, self-contained snapshot referenced by a monotonic version number. The catalog records row count, byte size, schema hash, and timestamp.
Time-Travel Queries: Query any historical version using ?version=N or ?as_of=2024-01-15T10:00:00Z. Core reconstructs the exact table state at that version or timestamp.
Deduplication: The Dedup View automatically removes duplicate rows based on configurable key columns. It is applied transparently in the view chain before query execution.
Tombstones: Soft-delete records by writing tombstone markers. The Tombstone View filters these out during reads while preserving the original data for audit purposes.
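The read path described above can be sketched in a few lines of Python. This is a minimal illustration, not Core's implementation: the `_deleted` marker field, the "latest row wins" rule, the function names, and the order in which the two views run are all assumptions made for the example.

```python
def dedup_view(rows, key_columns):
    """Keep one row per key. Assumption: the most recently written row wins."""
    latest = {}
    for row in rows:  # rows are expected in write order, oldest first
        key = tuple(row[column] for column in key_columns)
        latest[key] = row
    return list(latest.values())


def tombstone_view(rows):
    """Hide rows flagged with the hypothetical `_deleted` marker."""
    return [row for row in rows if not row.get("_deleted", False)]


def read_table(rows, key_columns):
    """Apply the view chain at read time. Stored rows are never modified."""
    return tombstone_view(dedup_view(rows, key_columns))


stored = [
    {"order_id": 1, "status": "new"},
    {"order_id": 2, "status": "new"},
    {"order_id": 1, "status": "shipped"},  # newer row for order 1
    {"order_id": 2, "_deleted": True},     # tombstone for order 2
]
visible = read_table(stored, ["order_id"])
```

Here `visible` contains only the shipped row for order 1, while `stored` still holds all four records, which is what keeps the deleted order available for audit.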

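# Write data: automatically creates version N+1
# (double quotes around the header let the shell expand $TOKEN)
curl -X POST /api/v1/tables/orders/data \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d @orders.json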

# Query a specific historical version (URL quoted so the shell leaves the ? alone)
curl '/api/v1/tables/orders/data?version=42'

# Query as of a specific timestamp
curl '/api/v1/tables/orders/data?as_of=2024-06-15T00:00:00Z'
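To make the two query forms concrete, here is a hedged sketch of how a catalog lookup could map ?version=N or ?as_of=... to a single version number. The `VersionRecord` fields mirror the metadata listed earlier (row count, byte size, schema hash, timestamp); the class, the function names, and the "latest version at or before the timestamp" rule are assumptions, not Core's documented behavior.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VersionRecord:
    version: int
    row_count: int
    byte_size: int
    schema_hash: str
    timestamp: datetime


def parse_ts(text):
    # Older Python versions reject a trailing "Z", so normalize it first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def resolve_version(catalog, version=None, as_of=None):
    """Pick the version a read should use. `catalog` is sorted by version."""
    if version is not None:
        if any(record.version == version for record in catalog):
            return version
        raise LookupError(f"version {version} not found")
    if as_of is None:
        return catalog[-1].version  # no time travel requested: latest version
    cutoff = parse_ts(as_of)
    # Assumes timestamps increase together with version numbers.
    index = bisect_right([record.timestamp for record in catalog], cutoff)
    if index == 0:
        raise LookupError("no version exists at or before as_of")
    return catalog[index - 1].version
```

With versions committed on June 1, June 10, and June 20, an as_of value of 2024-06-15T00:00:00Z would resolve to the June 10 version.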

SQL Analytics

Run standard SQL queries against any table with automatic engine selection. Core picks the right engine based on data size, so you get sub-second responses on small datasets and distributed processing for large ones.

Engine Selection

Engine	Mode	Best For	Features
DuckDB Lite	In-process	Small datasets (< 100 MB)	Sub-second latency, zero configuration
DuckDB Pro	Persistent	Medium datasets (100 MB – 10 GB)	Result caching, persistent catalog, parallel scans
Spark SQL	Thrift Server	Large datasets (> 10 GB)	Distributed execution, predicate pushdown, shuffle optimization

How Auto-Selection Works

Core checks the total byte size of the target table version in the catalog.
If the size is below the DuckDB Lite threshold (100 MB), the query runs in-process, with no separate engine or server to call.
Medium datasets (100 MB – 10 GB) are routed to DuckDB Pro, which maintains a persistent on-disk cache for repeat queries.
Large datasets (> 10 GB) are dispatched to Spark SQL via the Thrift protocol. The Spark cluster handles distributed reads, predicate pushdown, and shuffle operations.
You can override the engine selection with the ?engine= query parameter if needed.
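The routing steps above amount to a threshold check on the cataloged byte size. The sketch below uses the 100 MB and 10 GB boundaries from the engine table; the decimal interpretation of MB and GB, the treatment of exactly 10 GB, and the engine identifiers other than `spark` are assumptions.

```python
LITE_MAX_BYTES = 100 * 1000**2  # 100 MB, assuming decimal units
PRO_MAX_BYTES = 10 * 1000**3    # 10 GB, assuming decimal units


def select_engine(table_bytes, override=None):
    """Return the engine for a query, honoring an explicit ?engine= override."""
    if override is not None:
        return override
    if table_bytes < LITE_MAX_BYTES:
        return "duckdb_lite"  # hypothetical identifier
    if table_bytes <= PRO_MAX_BYTES:
        return "duckdb_pro"   # hypothetical identifier
    return "spark"


small = select_engine(5 * 1000**2)                     # 5 MB table
forced = select_engine(5 * 1000**2, override="spark")  # same table, forced
```

In this sketch the override is checked first, so a forced engine is used regardless of table size.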

# SQL query via REST API (engine auto-selected)
curl -X POST /api/v1/sql \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"}'

# Force a specific engine
curl -X POST '/api/v1/sql?engine=spark' \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT * FROM large_events WHERE ts > NOW() - INTERVAL 7 DAY"}'

Data Quality

Built-in data quality engine with 16 validation checks across three categories, 5 anomaly detectors, scheduled execution, and per-table quality scores. Catch data issues before they propagate downstream.

Validation Checks (16 total)

Type & Format

Type validation (int, float, string, bool, date)
Regex pattern matching
Enum / allowed values
Min / max length
Date format validation
Numeric range checks

Completeness

Null / missing value ratio
Required field enforcement
Unique constraint validation
Primary key integrity
Cross-column dependency checks

Distribution

Statistical outlier detection
Value frequency analysis
Cardinality monitoring
Distribution drift detection
Freshness / staleness checks
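As an illustration of what two of these checks might compute, the sketch below implements a null ratio check and a regex pattern check over rows represented as dictionaries. The function names, the row representation, and the rule that missing values are skipped by the pattern check are assumptions for the example.

```python
import re


def null_ratio(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_null_ratio(rows, column, max_ratio):
    """Pass when the share of missing values stays within `max_ratio`."""
    return null_ratio(rows, column) <= max_ratio


def check_pattern(rows, column, pattern):
    """Pass when every non-null value fully matches the regex."""
    compiled = re.compile(pattern)
    return all(
        compiled.fullmatch(row[column]) is not None
        for row in rows
        if row.get(column) is not None
    )


rows = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": "not-an-email"},
    {"email": "b@example.com"},
]
ratio_ok = check_null_ratio(rows, "email", max_ratio=0.3)
format_ok = check_pattern(rows, "email", r"[^@\s]+@[^@\s]+\.[a-z]+")
```

With this sample, one of four values is missing, so the ratio check passes at a 0.3 limit, while the pattern check fails because of the malformed address.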

Anomaly Detectors

Row count anomalies

Schema drift detection

Null ratio spikes

Value range violations

Freshness alerts
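The source does not say which statistical method the detectors use. As one plausible approach, a row count anomaly can be flagged with a z-score against the row counts of recent versions. The threshold of 3 standard deviations and the function name are assumptions.

```python
from statistics import mean, pstdev


def row_count_anomaly(history, current, threshold=3.0):
    """Flag `current` when it is more than `threshold` standard deviations
    away from the mean of past row counts."""
    average = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly stable history: any change at all counts as anomalous.
        return current != average
    return abs(current - average) / spread > threshold


recent_counts = [1000, 1020, 980, 1010, 990]
normal_write = row_count_anomaly(recent_counts, 1005)
truncated_write = row_count_anomaly(recent_counts, 100)
```

A write of 1005 rows sits well inside the historical spread, while a write of 100 rows is flagged.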

Execution & Scoring

Scheduled Runs: Configure quality checks to run on a cron schedule or after every write operation.
Quality Scores: Each table gets a composite quality score (0–100) based on check pass rates, weighted by severity.
Historical Tracking: Quality scores are versioned alongside data, enabling quality trend analysis over time.
Alerts: Configurable thresholds trigger notifications when quality drops below acceptable levels.
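One simple way to read "pass rates, weighted by severity" is a weighted pass percentage. The sketch below is an assumption about the formula, not Core's documented scoring; the severity names and weights are invented for the example.

```python
SEVERITY_WEIGHTS = {"low": 1, "medium": 2, "high": 4}  # hypothetical weights


def quality_score(results):
    """Composite 0-100 score from (severity, passed) pairs."""
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in results)
    if total == 0:
        return 100.0  # no checks configured: nothing has failed
    passed = sum(SEVERITY_WEIGHTS[severity] for severity, ok in results if ok)
    return round(100 * passed / total, 1)


def should_alert(score, threshold):
    """Trigger a notification when the score drops below the threshold."""
    return score < threshold


results = [("high", True), ("medium", False), ("low", True), ("low", True)]
score = quality_score(results)
alert = should_alert(score, threshold=80)
```

In this example the weights sum to 8 and the passing checks contribute 6, giving a score of 75.0, which is below the threshold of 80 and would raise an alert.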

Table Mirroring

Automatically export tables to open table formats after every write. Integrate Data Island with the broader data ecosystem without manual ETL or synchronization.

Supported Formats

Format	Ecosystem	Trigger
Delta Lake	Databricks, Spark, Trino, Presto	After every write
Apache Iceberg	Snowflake, Trino, Flink, Spark	After every write
Raw Parquet	Any tool that reads Parquet	After every write

How Mirroring Works

Configure mirroring targets per table: choose which formats to export and where to write them.
After every successful write, Core automatically generates the corresponding Delta Lake transaction log, Iceberg metadata, or plain Parquet files.
External tools (Spark, Databricks, Snowflake) can read the mirrored data directly using their native connectors.
Mirroring is asynchronous and does not block the write path. Status is tracked in the catalog.
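The non-blocking behavior can be pictured as a job queue drained by a background worker. This is a minimal sketch using Python's standard library; the queue, the status dictionary standing in for the catalog, and the placeholder `export` function are all assumptions rather than Core's actual design.

```python
import queue
import threading

mirror_queue = queue.Queue()
mirror_status = {}  # (table, version, fmt) -> "pending" or "done"


def export(table, version, fmt):
    """Placeholder for the real exporter (Delta log, Iceberg metadata, ...)."""
    return f"{table}/v{version}.{fmt}"


def mirror_worker():
    """Drain the queue in the background, one export job at a time."""
    while True:
        job = mirror_queue.get()
        export(*job)
        mirror_status[job] = "done"
        mirror_queue.task_done()


def write(table, version, targets):
    """Enqueue one mirror job per target format and return immediately."""
    # Persisting the new version itself is omitted from this sketch.
    for fmt in targets:
        job = (table, version, fmt)
        mirror_status[job] = "pending"
        mirror_queue.put(job)
    return version


threading.Thread(target=mirror_worker, daemon=True).start()
write("orders", 43, ["delta", "iceberg"])
mirror_queue.join()  # only for the demo: wait until both exports finish
```

The call to `write` returns as soon as the jobs are queued; the `join` at the end exists only so the example can inspect the final status.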

Monitoring & Observability

Built-in monitoring for read/write operations, query latency, service health, and structured logging. Everything you need to operate Core in production.

Metrics

Read/write operation counts

Latency percentiles (p50, p95, p99)

Query engine utilization

Storage utilization per table

Active connections

Error rates by endpoint
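For readers unfamiliar with latency percentiles, the sketch below computes p50, p95, and p99 with the nearest-rank method. The source does not state which percentile definition Core uses, so the method and function name are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is greater than or
    equal to p percent of all samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 1 ms through 100 ms
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

For these 100 evenly spaced samples the three values are 50, 95, and 99 milliseconds.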

Service Registry

Health Endpoints: Each microservice exposes /health with detailed status including dependency checks (Redis, storage backend).
Service Discovery: Services register themselves in Redis on startup. The registry tracks port, version, capabilities, and last heartbeat.
Readiness & Liveness: Kubernetes-compatible probes for orchestrated deployments.
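The registry behavior can be sketched with a dictionary standing in for Redis. The entry fields follow the list above (port, version, capabilities, last heartbeat); the function names and the 30-second liveness window are assumptions.

```python
import time

registry = {}  # in-memory stand-in for the Redis-backed registry


def register(name, port, version, capabilities, now=None):
    """Record a service on startup, with its first heartbeat."""
    registry[name] = {
        "port": port,
        "version": version,
        "capabilities": list(capabilities),
        "last_heartbeat": time.time() if now is None else now,
    }


def heartbeat(name, now=None):
    """Refresh the heartbeat of an already registered service."""
    registry[name]["last_heartbeat"] = time.time() if now is None else now


def live_services(ttl_seconds=30.0, now=None):
    """Names of services whose last heartbeat is within the TTL window."""
    now = time.time() if now is None else now
    return sorted(
        name
        for name, entry in registry.items()
        if now - entry["last_heartbeat"] <= ttl_seconds
    )
```

The optional `now` argument exists only to make the sketch easy to test with fixed timestamps.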

Structured Logging

All log output is structured JSON with consistent fields: timestamp, level, service, trace_id, user_id, operation, duration_ms.
Correlation IDs propagate across service boundaries for end-to-end request tracing.
Log levels are configurable per service at runtime without restarts.
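A log line with these fields can be produced with Python's standard logging module, as in the sketch below. It illustrates the format only and is not Core's logging code; the `JsonFormatter` class and the extra `message` field are assumptions.

```python
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "user_id", "operation", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),  # not in the field list above
        }
        for field in CONTEXT_FIELDS:
            # Values passed through `extra=` become attributes on the record.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("sql-service"))
logger = logging.getLogger("sql-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "query finished",
    extra={"trace_id": "abc123", "operation": "sql.query", "duration_ms": 12},
)
```

In this sketch, calling `logger.setLevel(logging.DEBUG)` changes verbosity without a restart; how Core exposes that setting per service is not described in the source.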