Core Features
A deep dive into every capability of Data Island Core.
Versioned Storage
Every write operation in Data Island Core creates a new, immutable version of the table. Data is never overwritten in place. This append-only architecture provides a complete audit trail, safe rollbacks, and point-in-time queries with zero configuration.
Core Concepts
How It Works
- Append-Only: Each write appends a new Parquet file to the table's storage path. Previous versions remain intact and accessible.
- Immutable Snapshots: Every version is a complete, self-contained snapshot referenced by a monotonic version number. The catalog records row count, byte size, schema hash, and timestamp.
- Time-Travel Queries: Query any historical version using ?version=N or ?as_of=2024-01-15T10:00:00Z. Reconstructs the exact state at that point.
- Deduplication: The Dedup View automatically removes duplicate rows based on configurable key columns. Applied transparently in the view chain before query execution.
- Tombstones: Soft-delete records by writing tombstone markers. The Tombstone View filters these out during reads while preserving the original data for audit purposes.
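The append-only and time-travel behaviour described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Core's implementation: the `VersionedTable` class, its method names, and the choice to hold snapshots in a list are all assumptions made for the example, and real storage would use Parquet files and the catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Version:
    number: int            # monotonically increasing version number
    created_at: datetime   # commit timestamp
    rows: tuple            # complete snapshot of the table at this version


class VersionedTable:
    """Hypothetical in-memory stand-in for an append-only table."""

    def __init__(self) -> None:
        self._versions: list = []

    def write(self, new_rows: list, at: datetime) -> int:
        # Build version N+1 from version N; earlier versions are never touched.
        previous = self._versions[-1].rows if self._versions else ()
        version = Version(len(self._versions) + 1, at, previous + tuple(new_rows))
        self._versions.append(version)
        return version.number

    def read(self, version: Optional[int] = None,
             as_of: Optional[datetime] = None) -> list:
        if version is not None:
            if not 1 <= version <= len(self._versions):
                raise KeyError(f"unknown version {version}")
            return list(self._versions[version - 1].rows)
        if as_of is not None:
            # Latest version committed at or before the requested timestamp.
            candidates = [v for v in self._versions if v.created_at <= as_of]
            if not candidates:
                raise KeyError(f"no version exists as of {as_of.isoformat()}")
            return list(candidates[-1].rows)
        return list(self._versions[-1].rows) if self._versions else []
```

Reading with `version=1` after a second write still returns the first snapshot, which is the property that makes rollbacks and audits safe.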
# Write data: automatically creates version N+1
# (double quotes let the shell expand $TOKEN)
curl -X POST /api/v1/tables/orders/data \
  -H "Authorization: Bearer $TOKEN" \
  -d @orders.json
# Query a specific historical version
curl '/api/v1/tables/orders/data?version=42'
# Query as of a specific timestamp
curl '/api/v1/tables/orders/data?as_of=2024-06-15T00:00:00Z'
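The Dedup View and Tombstone View described above can be pictured as two small functions chained in front of a read. The sketch below is an assumption-laden illustration: the `_deleted` marker column, the "last write wins" rule for duplicates, and the order of the two views are choices made for the example, not documented behaviour.

```python
TOMBSTONE_FIELD = "_deleted"  # hypothetical marker column for soft deletes


def dedup_view(rows: list, key_columns: list) -> list:
    """Keep one row per key; assumes the most recently written row wins."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        latest[key] = row
    return list(latest.values())


def tombstone_view(rows: list) -> list:
    """Hide rows marked as deleted; the stored rows themselves are untouched."""
    return [row for row in rows if not row.get(TOMBSTONE_FIELD, False)]


def read_through_views(rows: list, key_columns: list) -> list:
    # Deduplicate first so a tombstone written last hides its whole key.
    return tombstone_view(dedup_view(rows, key_columns))
```

Because both views only filter at read time, the underlying rows stay available for audit queries that bypass the view chain.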
SQL Analytics
Run standard SQL queries against any table with automatic engine selection. Core picks the right engine based on data size, so you get sub-second responses on small datasets and distributed processing for large ones.
Engine Selection
| Engine | Mode | Best For | Features |
|---|---|---|---|
| DuckDB Lite | In-process | Small datasets (< 100 MB) | Sub-second latency, zero configuration |
| DuckDB Pro | Persistent | Medium datasets (100 MB – 10 GB) | Result caching, persistent catalog, parallel scans |
| Spark SQL | Thrift Server | Large datasets (> 10 GB) | Distributed execution, predicate pushdown, shuffle optimization |
How Auto-Selection Works
- Core checks the total byte size of the target table version in the catalog.
- If below the DuckDB Lite threshold, the query runs in-process with zero overhead.
- Medium datasets are routed to DuckDB Pro, which maintains a persistent on-disk cache for repeat queries.
- Large datasets are dispatched to Spark SQL via the Thrift protocol. The Spark cluster handles distributed reads, predicate pushdown, and shuffle operations.
- You can override the engine selection with the ?engine= query parameter if needed.
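The selection steps above amount to a size-based dispatch with an optional override. The following sketch uses the thresholds from the engine table (100 MB and 10 GB); the engine identifiers other than `spark`, the binary interpretation of MB and GB, and the handling of the exact boundary values are assumptions.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary units; the source does not say
GB = 1024 ** 3

LITE_MAX_BYTES = 100 * MB  # DuckDB Lite: below 100 MB
PRO_MAX_BYTES = 10 * GB    # DuckDB Pro: 100 MB to 10 GB

# Only "spark" appears in the source; the other two names are hypothetical.
ENGINES = ("duckdb_lite", "duckdb_pro", "spark")


def select_engine(table_bytes: int, override: Optional[str] = None) -> str:
    """Pick an engine from the table version's byte size in the catalog."""
    if override is not None:
        if override not in ENGINES:
            raise ValueError(f"unknown engine: {override}")
        return override
    if table_bytes < LITE_MAX_BYTES:
        return "duckdb_lite"
    if table_bytes <= PRO_MAX_BYTES:
        return "duckdb_pro"
    return "spark"
```

For example, `select_engine(50 * MB)` picks the in-process engine, while `select_engine(50 * MB, override="spark")` mirrors a request made with `?engine=spark`.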
# SQL query via REST API (engine auto-selected)
curl -X POST /api/v1/sql \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"query": "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"}'
# Force a specific engine
curl -X POST '/api/v1/sql?engine=spark' \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"query": "SELECT * FROM large_events WHERE ts > NOW() - INTERVAL 7 DAY"}'
Data Quality
Built-in data quality engine with 16 validation checks across three categories, 5 anomaly detectors, scheduled execution, and per-table quality scores. Catch data issues before they propagate downstream.
Validation Checks (16 total)
Type & Format
- Type validation (int, float, string, bool, date)
- Regex pattern matching
- Enum / allowed values
- Min / max length
- Date format validation
- Numeric range checks
Completeness
- Null / missing value ratio
- Required field enforcement
- Unique constraint validation
- Primary key integrity
- Cross-column dependency checks
Distribution
- Statistical outlier detection
- Value frequency analysis
- Cardinality monitoring
- Distribution drift detection
- Freshness / staleness checks
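To make a few of the checks listed above concrete, here is a small sketch covering one check from each area: a regex pattern match, a numeric range check, a null ratio, and a uniqueness check. The function names and the row-as-dict representation are assumptions for illustration; they are not Core's check API.

```python
import re


def null_ratio(rows: list, column: str) -> float:
    """Share of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def matches_pattern(rows: list, column: str, pattern: str) -> bool:
    """True if every non-null value fully matches the regex."""
    compiled = re.compile(pattern)
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(compiled.fullmatch(str(value)) is not None for value in values)


def in_range(rows: list, column: str, low: float, high: float) -> bool:
    """True if every non-null value lies in the closed interval [low, high]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(low <= value <= high for value in values)


def is_unique(rows: list, column: str) -> bool:
    """True if no non-null value appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))
```

Each function returns a plain value, so results can be compared against a per-check threshold (for example, a maximum allowed null ratio).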
Anomaly Detectors
Execution & Scoring
- Scheduled Runs: Configure quality checks to run on a cron schedule or after every write operation.
- Quality Scores: Each table gets a composite quality score (0–100) based on check pass rates, weighted by severity.
- Historical Tracking: Quality scores are versioned alongside data, enabling quality trend analysis over time.
- Alerts: Configurable thresholds trigger notifications when quality drops below acceptable levels.
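One plausible reading of "check pass rates, weighted by severity" is a weighted average of pass rates scaled to 0–100. The sketch below implements that reading; the exact formula, the severity names, and the weights are assumptions, since the source does not specify them.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the real weighting is not documented here.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


@dataclass
class CheckResult:
    name: str
    severity: str  # one of SEVERITY_WEIGHTS
    passed: int    # rows (or runs) that passed the check
    total: int     # rows (or runs) evaluated


def quality_score(results: list) -> float:
    """Severity-weighted mean of pass rates, scaled to 0-100."""
    weighted_sum = 0.0
    weight_total = 0.0
    for result in results:
        if result.total == 0:
            continue  # nothing evaluated, so the check carries no signal
        weight = SEVERITY_WEIGHTS[result.severity]
        weighted_sum += weight * (result.passed / result.total)
        weight_total += weight
    if weight_total == 0:
        return 100.0  # assumption: no evaluated checks means no known issues
    return round(100.0 * weighted_sum / weight_total, 2)


def should_alert(score: float, threshold: float) -> bool:
    """True when the score has dropped below the configured threshold."""
    return score < threshold
```

With a high-severity check passing 90% of rows and a low-severity check passing 50%, the weighted pass rate is (3 × 0.9 + 1 × 0.5) / 4 = 0.8, so the score is 80.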
Table Mirroring
Automatically export tables to open table formats after every write. Integrate Data Island with the broader data ecosystem without manual ETL or synchronization.
Supported Formats
| Format | Ecosystem | Trigger |
|---|---|---|
| Delta Lake | Databricks, Spark, Trino, Presto | After every write |
| Apache Iceberg | Snowflake, Trino, Flink, Spark | After every write |
| Raw Parquet | Any tool that reads Parquet | After every write |
How Mirroring Works
- Configure mirroring targets per table: choose which formats to export and where to write them.
- After every successful write, Core automatically generates the corresponding Delta Lake transaction log, Iceberg metadata, or plain Parquet files.
- External tools (Spark, Databricks, Snowflake) can read the mirrored data directly using their native connectors.
- Mirroring is asynchronous and does not block the write path. Status is tracked in the catalog.
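The non-blocking behaviour described above can be sketched with a background worker and a job queue. Everything here is illustrative: the `MirrorWorker` class, the format identifiers, the status values, and the in-memory status dictionary (standing in for the catalog) are assumptions, and the exporters are caller-supplied callables rather than real Delta Lake or Iceberg writers.

```python
import queue
import threading


class MirrorWorker:
    """Hypothetical asynchronous mirroring loop with per-job status."""

    def __init__(self, exporters: dict) -> None:
        self._exporters = exporters  # format name -> callable(table, version)
        self._jobs = queue.Queue()
        self._lock = threading.Lock()
        self.status = {}  # (table, version, format) -> pending | done | failed
        threading.Thread(target=self._run, daemon=True).start()

    def on_write(self, table: str, version: int, formats: list) -> None:
        # Called after a successful write; only enqueues, so it returns at once.
        for fmt in formats:
            with self._lock:
                self.status[(table, version, fmt)] = "pending"
            self._jobs.put((table, version, fmt))

    def _run(self) -> None:
        while True:
            table, version, fmt = self._jobs.get()
            try:
                self._exporters[fmt](table, version)
                outcome = "done"
            except Exception:
                outcome = "failed"  # includes unknown formats (KeyError)
            with self._lock:
                self.status[(table, version, fmt)] = outcome
            self._jobs.task_done()

    def wait(self) -> None:
        """Block until all queued exports have finished (useful in tests)."""
        self._jobs.join()
```

A failed export only marks its own job as failed; the write that triggered it has already completed, which matches the idea that mirroring stays off the write path.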
Monitoring & Observability
Built-in monitoring for read/write operations, query latency, service health, and structured logging. Everything you need to operate Core in production.
Metrics
Service Registry
- Health Endpoints: Each microservice exposes /health with detailed status including dependency checks (Redis, storage backend).
- Service Discovery: Services register themselves in Redis on startup. The registry tracks port, version, capabilities, and last heartbeat.
- Readiness & Liveness: Kubernetes-compatible probes for orchestrated deployments.
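The registry and health behaviour above can be sketched with an in-memory dictionary standing in for Redis. The class and function names, the 30-second heartbeat TTL, and the shape of the health payload are all assumptions made for the example.

```python
import time
from dataclasses import dataclass

HEARTBEAT_TTL_SECONDS = 30.0  # hypothetical; the real timeout is not documented


@dataclass
class ServiceRecord:
    name: str
    port: int
    version: str
    capabilities: list
    last_heartbeat: float


class ServiceRegistry:
    """In-memory stand-in for the Redis-backed registry."""

    def __init__(self, ttl_seconds: float = HEARTBEAT_TTL_SECONDS,
                 clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._services = {}

    def register(self, name: str, port: int, version: str,
                 capabilities: list) -> None:
        self._services[name] = ServiceRecord(
            name, port, version, list(capabilities), self._clock())

    def heartbeat(self, name: str) -> None:
        self._services[name].last_heartbeat = self._clock()

    def live(self) -> list:
        """Names of services whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(name for name, record in self._services.items()
                      if now - record.last_heartbeat <= self._ttl)


def health_payload(dependencies: dict) -> dict:
    """Hypothetical /health body: overall status plus per-dependency checks."""
    healthy = all(dependencies.values())
    return {"status": "ok" if healthy else "degraded",
            "dependencies": dict(dependencies)}
```

Injecting the clock keeps the liveness logic testable without sleeping; a service that stops sending heartbeats simply drops out of `live()` once the TTL passes.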
Structured Logging
- All log output is structured JSON with consistent fields: timestamp, level, service, trace_id, user_id, operation, duration_ms.
- Correlation IDs propagate across service boundaries for end-to-end request tracing.
- Log levels are configurable per service at runtime without restarts.
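A structured log line with the fields listed above can be produced with Python's standard `logging` module and a custom formatter. This is a sketch, not Core's logger: the `JsonFormatter` class and the extra `message` field are assumptions, and request-scoped values are passed through the standard `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Request-scoped fields named in the source; absent values are emitted as null.
CONTEXT_FIELDS = ("trace_id", "user_id", "operation", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),  # assumption: not in the source list
        }
        for name in CONTEXT_FIELDS:
            payload[name] = getattr(record, name, None)
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.info("query finished", extra={"trace_id": "abc", "operation": "sql.query", "duration_ms": 12})` emits one JSON line, and `logger.setLevel(logging.DEBUG)` changes verbosity in the running process without a restart.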