Core Features

A deep dive into every capability of Data Island Core.

Versioned Storage

Every write operation in Data Island Core creates a new, immutable version of the table. Data is never overwritten in place. This append-only architecture provides a complete audit trail, safe rollbacks, and point-in-time queries with zero configuration.

Core Concepts

Append-only writes
Immutable snapshots
Time-travel queries
Automatic deduplication
Soft-delete tombstones
Version metadata

How It Works

  • Append-Only: Each write appends a new Parquet file to the table's storage path. Previous versions remain intact and accessible.
  • Immutable Snapshots: Every version is a complete, self-contained snapshot identified by a monotonically increasing version number. For each version, the catalog records row count, byte size, schema hash, and timestamp.
  • Time-Travel Queries: Query any historical version using ?version=N or ?as_of=2024-01-15T10:00:00Z. Core reconstructs the exact table state at that version or timestamp.
  • Deduplication: The Dedup View automatically removes duplicate rows based on configurable key columns. It is applied transparently in the view chain before query execution.
  • Tombstones: Soft-delete records by writing tombstone markers. The Tombstone View filters these out during reads while preserving the original data for audit purposes.
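
The ideas above can be sketched in a few lines of Python. This is a toy in-memory model, not Core's implementation: the `VersionedTable` class, the `_tombstone` marker field, and the "latest row per key wins" deduplication rule are all assumptions made for illustration. A version is modeled as the state of the table after the first N writes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    number: int          # monotonically increasing version number
    timestamp: datetime  # when the write was committed
    rows: tuple          # rows appended by this write (never modified)


class VersionedTable:
    """Toy in-memory stand-in for an append-only table (hypothetical)."""

    def __init__(self, key_columns):
        self.key_columns = key_columns
        self.versions = []  # append-only: entries are never changed or removed

    def write(self, rows, timestamp):
        version = Version(len(self.versions) + 1, timestamp, tuple(rows))
        self.versions.append(version)
        return version.number

    def delete(self, key, timestamp):
        # Soft delete: append a tombstone marker instead of removing data.
        return self.write([dict(key, _tombstone=True)], timestamp)

    def read(self, version=None, as_of=None):
        # Time travel: keep only the writes visible at the requested point.
        visible = [
            v for v in self.versions
            if (version is None or v.number <= version)
            and (as_of is None or v.timestamp <= as_of)
        ]
        # Dedup view (assumed rule): the latest row per key wins.
        latest = {}
        for v in visible:
            for row in v.rows:
                latest[tuple(row[c] for c in self.key_columns)] = row
        # Tombstone view: hide keys whose latest row is a tombstone.
        return [r for r in latest.values() if not r.get("_tombstone")]


def day(d):
    return datetime(2024, 1, d, tzinfo=timezone.utc)


table = VersionedTable(key_columns=["id"])
table.write([{"id": 1, "total": 10}, {"id": 2, "total": 20}], day(15))  # version 1
table.write([{"id": 1, "total": 15}], day(16))                          # version 2
table.delete({"id": 2}, day(17))                                        # version 3

print(table.read(version=1))      # both original rows
print(table.read(as_of=day(16)))  # id 1 updated, id 2 still present
print(table.read())               # id 2 hidden by its tombstone
```

All three writes remain in `table.versions`, so the deleted row stays available for audit even though the default read hides it.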
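
Version Operations

# Assumption: CORE_URL holds the base URL of your Core deployment and
# TOKEN a valid API token; neither name is defined by Core itself.

# Write data: automatically creates version N+1
curl -X POST "$CORE_URL/api/v1/tables/orders/data" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @orders.json

# Query a specific historical version
curl "$CORE_URL/api/v1/tables/orders/data?version=42"

# Query as of a specific timestamp
curl "$CORE_URL/api/v1/tables/orders/data?as_of=2024-06-15T00:00:00Z"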
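
When calling these endpoints from code, it helps to build the query string with a URL encoder so timestamps are escaped correctly. The following is a minimal sketch using only the Python standard library. `CORE_URL`, the helper name `table_read_request`, the use of a bearer token on reads, and the rule that `version` and `as_of` cannot be combined are assumptions, not documented behavior.

```python
from urllib.parse import urlencode
from urllib.request import Request

CORE_URL = "http://localhost:8000"  # assumption: replace with your Core endpoint


def table_read_request(table, token, version=None, as_of=None):
    """Build (but do not send) a GET request for a table read."""
    if version is not None and as_of is not None:
        # Assumption: the two selectors are alternatives, so reject both.
        raise ValueError("pass either version or as_of, not both")
    params = {}
    if version is not None:
        params["version"] = version
    if as_of is not None:
        params["as_of"] = as_of
    url = f"{CORE_URL}/api/v1/tables/{table}/data"
    if params:
        url += "?" + urlencode(params)  # percent-encodes ':' in timestamps
    return Request(url, headers={"Authorization": f"Bearer {token}"})


by_version = table_read_request("orders", "my-token", version=42)
by_time = table_read_request("orders", "my-token", as_of="2024-06-15T00:00:00Z")
print(by_version.full_url)
print(by_time.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request; the sketch stops short of that so it runs without a server.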

SQL Analytics

Run standard SQL queries against any table with automatic engine selection. Core picks the right engine based on data size, so you get sub-second responses on small datasets and distributed processing for large ones.

Engine Selection

| Engine | Mode | Best For | Features |
|---|---|---|---|
| DuckDB Lite | In-process | Small datasets (< 100 MB) | Sub-second latency, zero configuration |
| DuckDB Pro | Persistent | Medium datasets (100 MB – 10 GB) | Result caching, persistent catalog, parallel scans |
| Spark SQL | Thrift Server | Large datasets (> 10 GB) | Distributed execution, predicate pushdown, shuffle optimization |

How Auto-Selection Works

  • Core checks the total byte size of the target table version in the catalog.
  • If the size is below the DuckDB Lite threshold (100 MB), the query runs in-process with minimal overhead.
  • Medium datasets (100 MB – 10 GB) are routed to DuckDB Pro, which maintains a persistent on-disk cache for repeat queries.
  • Large datasets (> 10 GB) are dispatched to Spark SQL via the Thrift protocol. The Spark cluster handles distributed reads, predicate pushdown, and shuffle operations.
  • You can override the engine selection with the ?engine= query parameter if needed.
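
The routing rule described above can be sketched as a small function. The thresholds come from the engine table; everything else is an assumption: the function name `select_engine`, the `duckdb-lite` and `duckdb-pro` identifiers (only `spark` appears in this page's examples), binary megabytes, and which side of each boundary is inclusive.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Thresholds from the engine table. Whether Core uses binary or decimal
# units, and which side of each boundary is inclusive, is an assumption.
LITE_MAX_BYTES = 100 * MB
PRO_MAX_BYTES = 10 * GB

ENGINES = ("duckdb-lite", "duckdb-pro", "spark")  # partly hypothetical names


def select_engine(table_bytes, override=None):
    """Pick a query engine from the table version's byte size."""
    if override is not None:
        # Mirrors the ?engine= query parameter: an explicit choice wins.
        if override not in ENGINES:
            raise ValueError(f"unknown engine: {override}")
        return override
    if table_bytes < LITE_MAX_BYTES:
        return "duckdb-lite"
    if table_bytes <= PRO_MAX_BYTES:
        return "duckdb-pro"
    return "spark"


print(select_engine(5 * MB))                    # small table
print(select_engine(2 * GB))                    # medium table
print(select_engine(50 * GB))                   # large table
print(select_engine(5 * MB, override="spark"))  # forced engine
```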
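
SQL Queries

# SQL query via REST API: engine auto-selected
# Assumption: CORE_URL holds the base URL of your Core deployment.
curl -X POST "$CORE_URL/api/v1/sql" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"}'

# Force a specific engine
curl -X POST "$CORE_URL/api/v1/sql?engine=spark" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT * FROM large_events WHERE ts > NOW() - INTERVAL 7 DAY"}'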

Data Quality

Built-in data quality engine with 16 validation checks across three categories, 5 anomaly detectors, scheduled execution, and per-table quality scores. Catch data issues before they propagate downstream.

Validation Checks (16 total)

Type & Format
  • Type validation (int, float, string, bool, date)
  • Regex pattern matching
  • Enum / allowed values
  • Min / max length
  • Date format validation
  • Numeric range checks
Completeness
  • Null / missing value ratio
  • Required field enforcement
  • Unique constraint validation
  • Primary key integrity
  • Cross-column dependency checks
Distribution
  • Statistical outlier detection
  • Value frequency analysis
  • Cardinality monitoring
  • Distribution drift detection
  • Freshness / staleness checks
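
To make the categories concrete, here is a minimal sketch of a few of these checks over a list of row dictionaries. The function names, the sample rows, and the email pattern are illustrative assumptions; Core's own check definitions are not shown on this page.

```python
import re

rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 250.0},
    {"id": 2, "email": "not-an-email", "amount": 30.0},
    {"id": 4, "email": "d@example.com", "amount": 5.0},
]


def null_ratio(rows, column):
    """Completeness: fraction of rows where the column is missing or null."""
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows) if rows else 0.0


def matches_pattern(rows, column, pattern):
    """Type & Format: every non-null value fully matches the regex."""
    compiled = re.compile(pattern)
    return all(
        compiled.fullmatch(r[column])
        for r in rows if r.get(column) is not None
    )


def is_unique(rows, column):
    """Completeness: no non-null value appears more than once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def in_range(rows, column, low, high):
    """Type & Format: every non-null value lies in [low, high]."""
    return all(
        low <= r[column] <= high
        for r in rows if r.get(column) is not None
    )


def cardinality(rows, column):
    """Distribution: number of distinct non-null values."""
    return len({r[column] for r in rows if r.get(column) is not None})


print(null_ratio(rows, "email"))
print(matches_pattern(rows, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+"))
print(is_unique(rows, "id"))
print(in_range(rows, "amount", 0, 100))
print(cardinality(rows, "id"))
```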

Anomaly Detectors

Row count anomalies
Schema drift detection
Null ratio spikes
Value range violations
Freshness alerts
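
This page does not say how the detectors decide that something is anomalous. As one plausible approach, a row count detector could compare the newest version's row count against recent history with a z-score. The function name, the default threshold of 3 standard deviations, and the z-score technique itself are assumptions.

```python
from statistics import mean, stdev


def row_count_anomaly(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly stable history: any change counts as an anomaly.
        return current != mu
    return abs(current - mu) / sigma > threshold


recent_row_counts = [1000, 1020, 990, 1010, 1005]
print(row_count_anomaly(recent_row_counts, 1015))  # within normal variation
print(row_count_anomaly(recent_row_counts, 200))   # sudden drop
```

The same shape works for null ratio spikes by passing per-version null ratios instead of row counts.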

Execution & Scoring

  • Scheduled Runs: Configure quality checks to run on a cron schedule or after every write operation.
  • Quality Scores: Each table gets a composite quality score (0–100) based on check pass rates, weighted by severity.
  • Historical Tracking: Quality scores are versioned alongside data, enabling quality trend analysis over time.
  • Alerts: Configurable thresholds trigger notifications when quality drops below acceptable levels.
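
A composite score of this kind can be sketched as a severity-weighted average of check pass rates. The weights, the severity names, the default alert threshold of 80, and the choice to score a table with no checks as 100 are all assumptions; the exact formula Core uses is not given here.

```python
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 5}  # hypothetical weights


def quality_score(results):
    """results: list of (severity, pass_rate) pairs, pass_rate in [0, 1].
    Returns a score from 0 to 100, weighted by severity."""
    total_weight = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    if total_weight == 0:
        return 100.0  # assumption: no checks configured means nothing failed
    weighted = sum(SEVERITY_WEIGHTS[sev] * rate for sev, rate in results)
    return round(100 * weighted / total_weight, 1)


def needs_alert(score, threshold=80.0):
    """True when the score has dropped below the configured threshold."""
    return score < threshold


results = [("high", 1.0), ("medium", 0.5), ("low", 0.0)]
score = quality_score(results)
print(score, needs_alert(score))
```

With these weights, a failing high-severity check lowers the score five times as much as a failing low-severity one.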

Data Sharing

Share tables across organizations with fine-grained access controls. Zero-copy reads mean consumers see the latest data without duplication or ETL pipelines.

Share Types

Cross-organization shares
Zero-copy reads
Read-only access
Column-level filters
Row-level filters (SQL WHERE)
Bulk refresh

How Sharing Works

  • Create a Share: Define which tables to share, with optional column and row filters. The share gets a unique ID and access token.
  • Zero-Copy: Consumers query the shared table directly from the provider's storage. No data is copied or duplicated.
  • Column Filters: Restrict which columns are visible to the consumer. Sensitive columns are excluded at the share level.
  • Row Filters: Apply SQL WHERE clauses that filter rows before they reach the consumer. Useful for jurisdiction or tenant-based data partitioning.
  • Materialization: Optionally materialize shared data into the consumer's own storage for offline access or performance optimization.
  • Bulk Refresh: Refresh all shared tables in a single operation. Ensures consumers see consistent snapshots across related tables.
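
Column filters, row filters, and zero-copy reads behave much like a SQL view over the provider's table. The sketch below uses SQLite from the Python standard library purely as an analogy; it is not how Core implements shares, and the `create_share` helper, table, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, ssn TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Ana", "111", "DE"), (2, "Bo", "222", "US"), (3, "Cy", "333", "DE")],
)


def create_share(conn, share_name, table, columns, row_filter=None):
    """Expose `table` as a view limited to `columns` and `row_filter`.

    Identifiers and the filter are interpolated as-is, so they must come
    from a trusted share definition, never from the consumer.
    """
    sql = f"CREATE VIEW {share_name} AS SELECT {', '.join(columns)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    conn.execute(sql)


# Column filter hides `ssn` and `country`; row filter keeps German rows only.
create_share(conn, "share_de", "customers", ["id", "name"], "country = 'DE'")

cursor = conn.execute("SELECT * FROM share_de ORDER BY id")
shared_columns = [d[0] for d in cursor.description]
shared_rows = cursor.fetchall()

# The provider writes a new row; the consumer sees it without any copy step.
conn.execute("INSERT INTO customers VALUES (4, 'Di', '444', 'DE')")
rows_after_write = conn.execute("SELECT * FROM share_de ORDER BY id").fetchall()

print(shared_columns, shared_rows, rows_after_write)
```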

Table Mirroring

Automatically export tables to open table formats after every write. Integrate Data Island with the broader data ecosystem without manual ETL or synchronization.

Supported Formats

| Format | Ecosystem | Trigger |
|---|---|---|
| Delta Lake | Databricks, Spark, Trino, Presto | After every write |
| Apache Iceberg | Snowflake, Trino, Flink, Spark | After every write |
| Raw Parquet | Any tool that reads Parquet | After every write |

How Mirroring Works

  • Configure mirroring targets per table: choose which formats to export and where to write them.
  • After every successful write, Core automatically generates the corresponding Delta Lake transaction log, Iceberg metadata, or plain Parquet files.
  • External tools (Spark, Databricks, Snowflake) can read the mirrored data directly using their native connectors.
  • Mirroring is asynchronous and does not block the write path. Status is tracked in the catalog.
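
The non-blocking behavior described above can be illustrated with a work queue and a background worker. This is a sketch of the pattern only: the `catalog` dictionary, the status strings, and the empty `export` placeholder are assumptions, and no Delta Lake, Iceberg, or Parquet output is actually written.

```python
import queue
import threading

catalog = {}                  # (table, version, format) -> mirroring status
mirror_queue = queue.Queue()  # jobs waiting to be exported


def export(table, version, fmt):
    """Placeholder for writing Delta/Iceberg/Parquet output (does nothing)."""


def mirror_worker():
    while True:
        job = mirror_queue.get()
        try:
            if job is None:  # sentinel: stop the worker
                return
            export(*job)
            catalog[job] = "done"
        except Exception as exc:
            catalog[job] = f"failed: {exc}"
        finally:
            mirror_queue.task_done()


def write(table, version, formats):
    """Commit a write, then enqueue mirroring without waiting for it."""
    for fmt in formats:
        job = (table, version, fmt)
        catalog[job] = "pending"
        mirror_queue.put(job)
    return version  # returns immediately; exports run in the background


worker = threading.Thread(target=mirror_worker, daemon=True)
worker.start()

write("orders", 43, ["delta", "iceberg", "parquet"])
mirror_queue.join()     # demo only: wait so the final statuses can be printed
mirror_queue.put(None)  # shut the worker down
worker.join()
print(catalog)
```

Because `write` only enqueues jobs, a slow or failing export changes the status in the catalog but never delays the write itself.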

Monitoring & Observability

Built-in monitoring for read/write operations, query latency, service health, and structured logging. Everything you need to operate Core in production.

Metrics

Read/write operation counts
Latency percentiles (p50, p95, p99)
Query engine utilization
Storage utilization per table
Active connections
Error rates by endpoint

Service Registry

  • Health Endpoints: Each microservice exposes /health with detailed status including dependency checks (Redis, storage backend).
  • Service Discovery: Services register themselves in Redis on startup. The registry tracks port, version, capabilities, and last heartbeat.
  • Readiness & Liveness: Kubernetes-compatible probes for orchestrated deployments.
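
A registry that tracks heartbeats can be modeled in a few lines. The sketch below replaces Redis with an in-memory dictionary so it runs without external services. The class name, the 30-second staleness window, and the service names and ports are assumptions.

```python
import time


class ServiceRegistry:
    """In-memory stand-in for the Redis-backed registry (hypothetical)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumed staleness window
        self.clock = clock
        self.services = {}

    def register(self, name, port, version, capabilities):
        # Called by each service on startup.
        self.services[name] = {
            "port": port,
            "version": version,
            "capabilities": list(capabilities),
            "last_heartbeat": self.clock(),
        }

    def heartbeat(self, name):
        self.services[name]["last_heartbeat"] = self.clock()

    def healthy(self):
        """Names of services whose last heartbeat is within the window."""
        now = self.clock()
        return sorted(
            name for name, info in self.services.items()
            if now - info["last_heartbeat"] <= self.ttl_seconds
        )


# A fake clock keeps the demo deterministic.
now = [0.0]
registry = ServiceRegistry(ttl_seconds=30, clock=lambda: now[0])
registry.register("query", 8001, "1.0.0", ["sql"])
registry.register("storage", 8002, "1.0.0", ["read", "write"])

now[0] = 20.0
registry.heartbeat("query")  # only the query service checks in

now[0] = 40.0
healthy_services = registry.healthy()
print(healthy_services)      # storage has missed its window
```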

Structured Logging

  • All log output is structured JSON with consistent fields: timestamp, level, service, trace_id, user_id, operation, duration_ms.
  • Correlation IDs propagate across service boundaries for end-to-end request tracing.
  • Log levels are configurable per service at runtime without restarts.
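
The logging conventions above map naturally onto Python's standard `logging` module. The following sketch emits one JSON object per log line with the listed fields and then changes the log level at runtime. The `JsonFormatter` class, the logger name, and the extra `message` field are assumptions, not Core's actual logging code.

```python
import io
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("service", "trace_id", "user_id", "operation", "duration_ms")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),  # not in the field list above
        }
        # Values passed via `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("core.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "query finished",
    extra={
        "service": "query",
        "trace_id": "abc123",  # same ID would be forwarded to downstream calls
        "user_id": "u-1",
        "operation": "sql.execute",
        "duration_ms": 42,
    },
)

logger.setLevel(logging.WARNING)  # runtime level change, no restart
logger.info("this line is suppressed")

entries = [json.loads(line) for line in stream.getvalue().splitlines()]
print(entries)
```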

See Core in Action

Deploy locally with Docker Compose and start querying in minutes.