Overview
Modern AI applications — especially agentic systems — require more than a single database. They need raw object storage for data and models, full-text search for keyword retrieval, vector databases for semantic similarity, RDF stores for ontological reasoning, and property graphs for relationship traversal. A lakehouse integrates all of these into a coherent data platform accessible to AI agents via unified APIs.
AppSofa Lab builds and operates a research-grade AI Data Lakehouse composed of five best-in-class open-source components: MinIO, Elasticsearch, Qdrant, Oxigraph, and Neo4j. Our research focuses on query routing, cross-store joins, consistency guarantees, and data placement strategies that minimize retrieval latency for agentic workloads.
Architecture
Each store is purpose-built for a data modality. A query router layer — implemented as an AI agent tool — selects the optimal store (or combination of stores) for each retrieval request, enabling hybrid queries that span multiple engines in a single agent step.
Ingest Layer
Data flows into MinIO as the canonical raw store. ETL pipelines extract structured, text, vector, and graph data and distribute it to the appropriate downstream stores.
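The fan-out step described above can be sketched as a small planning function. This is only an illustration: the mapping of representations to stores is inferred from the use cases on this page, and the names `TARGETS`, `DerivedRecord`, and `plan_fanout` are hypothetical. Structured data is left out because the page does not say which store receives it.

```python
from dataclasses import dataclass

# Assumed mapping from extracted representation to downstream store.
# Store names come from this page; the mapping itself is an assumption.
TARGETS = {
    "text": "elasticsearch",
    "vector": "qdrant",
    "graph": "neo4j",
    "rdf": "oxigraph",
}


@dataclass(frozen=True)
class DerivedRecord:
    source_key: str      # object key of the raw file in MinIO
    representation: str  # "text", "vector", "graph", or "rdf"
    store: str           # downstream store that receives the record


def plan_fanout(source_key: str, representations: list[str]) -> list[DerivedRecord]:
    """Return one derived record per representation, in input order."""
    plan = []
    for rep in representations:
        if rep not in TARGETS:
            raise ValueError(f"unknown representation: {rep}")
        plan.append(DerivedRecord(source_key, rep, TARGETS[rep]))
    return plan
```

Keeping the raw object key on every derived record lets any downstream hit be traced back to the canonical copy in MinIO.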
Query Router
An agent-facing router classifies each retrieval intent (keyword, semantic, graph traversal, SPARQL) and dispatches to the right store, merging results when multi-store retrieval is needed.
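A minimal sketch of the classify-then-dispatch flow follows. The keyword heuristic is a toy stand-in (a production router might use an LLM or a trained classifier), and `classify_intent`, `route`, and the retriever callables are hypothetical names, not the lab's actual interface.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]

SPARQL_KEYWORDS = ("select", "ask", "construct", "describe", "prefix")
GRAPH_HINTS = ("connected to", "related to", "path between")


def classify_intent(query: str) -> set[str]:
    """Toy heuristic (an assumption): map a query to one or more intents."""
    q = query.strip().lower()
    if q.startswith(SPARQL_KEYWORDS) and "{" in q:
        return {"sparql"}
    intents = set()
    if '"' in query:
        intents.add("keyword")   # a quoted phrase suggests exact matching
    if any(hint in q for hint in GRAPH_HINTS):
        intents.add("graph")
    if not intents:
        intents.add("semantic")  # fall back to similarity search
    return intents


def route(query: str, retrievers: dict[str, Retriever]) -> list[str]:
    """Dispatch to each matching retriever and merge results without duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for intent in sorted(classify_intent(query)):
        for doc_id in retrievers[intent](query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The merge here is simple order-preserving deduplication; a score-aware fusion would be needed when the stores return ranked results of differing quality.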
Consistency Model
The lakehouse is eventually consistent across stores. Synchronization events are propagated via a message bus, ensuring that updates to one store are reflected across all derived stores within a bounded window.
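One way to picture this model is an in-memory queue standing in for the message bus, with each derived store applying events idempotently by version. Everything below is an assumption made for illustration: the page does not name the bus, the event schema, or the conflict rule, and `SyncEvent`, `DerivedStore`, and `drain` are hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncEvent:
    key: str
    version: int  # assumed to increase monotonically per key at the source
    payload: str


class DerivedStore:
    """Stand-in for a downstream store that applies events idempotently."""

    def __init__(self) -> None:
        self.data: dict[str, tuple[int, str]] = {}

    def apply(self, event: SyncEvent) -> bool:
        current = self.data.get(event.key)
        if current is not None and current[0] >= event.version:
            return False  # stale or duplicate delivery, so ignore it
        self.data[event.key] = (event.version, event.payload)
        return True


def drain(bus: "queue.Queue[SyncEvent]", stores: list[DerivedStore]) -> int:
    """Deliver every queued event to every store; return the number processed."""
    processed = 0
    while True:
        try:
            event = bus.get_nowait()
        except queue.Empty:
            return processed
        for store in stores:
            store.apply(event)
        processed += 1
```

Because stale versions are ignored, out-of-order or repeated delivery still leaves all stores in the same final state once the queue is drained.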
Agent Tool Interface
Each store is exposed as a typed tool in the agent tool registry. Agents call retrieve_semantic(), retrieve_keyword(), traverse_graph(), or sparql_query() without needing to know which storage engine serves the request.
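A registry of this kind might look like the sketch below. The four function names come from the text above; their parameters, return shapes, and the `tool` decorator and `call_tool` helper are assumptions, and the bodies return placeholders instead of calling a real store.

```python
from typing import Any, Callable

TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name (hypothetical decorator)."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn


@tool
def retrieve_semantic(query: str, top_k: int = 5) -> list[dict]:
    # A real implementation would query the vector store here.
    return [{"id": f"vec-{i}", "query": query} for i in range(top_k)]


@tool
def retrieve_keyword(query: str, top_k: int = 5) -> list[dict]:
    # A real implementation would run a full-text search here.
    return [{"id": f"kw-{i}", "query": query} for i in range(top_k)]


@tool
def traverse_graph(start_node: str, max_hops: int = 2) -> list[str]:
    # A real implementation would traverse the property graph here.
    return [f"{start_node}-hop{h}" for h in range(1, max_hops + 1)]


@tool
def sparql_query(query: str) -> list[dict]:
    # A real implementation would send the query to the RDF store here.
    return [{"query": query, "bindings": []}]


def call_tool(name: str, **kwargs: Any) -> Any:
    """Look up a tool by name and invoke it with keyword arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**kwargs)
```

The agent only sees tool names and typed arguments, so a backend can be swapped without changing the calling code.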
Supported Ingest File Types
The lakehouse ingest pipeline accepts a broad range of file types, listed below, and routes each modality through the appropriate processing chain before distributing derived representations across the downstream stores.
| Category | File Types |
|---|---|
| Text & Documents | PDF, DOCX, PPTX, XLSX, TXT / MD, HTML, XML, EPUB, RTF |
| Images | PNG, JPEG, WEBP, TIFF, SVG, HEIC / HEIF, BMP, GIF, DICOM, RAW (ARW / CR3 / NEF), scanned PDF |
| Video | MP4, MOV, AVI, MKV, WEBM, FLV, WMV, TS, M2TS, animated GIF |
| Voice & Audio | MP3, AAC, WAV, AIFF, FLAC, OGG, M4A, WMA, OPUS, AMR |
| Structured & Tabular | JSON, JSONL, CSV, TSV, Parquet, ORC, Avro, Arrow / Feather, SQL dump, HDF5 |
| Code & Notebooks | Python, JS, TS, .ipynb (Jupyter), YAML, TOML, INI, Dockerfile, Makefile, SQL, shell scripts |
| Knowledge & Ontology | Turtle / RDF, OWL, N-Triples, N-Quads, JSON-LD, SPARQL results, GraphML, GEXF |
| Model Artifacts | SafeTensors, PT, GGUF / GGML, ONNX, TFLite, TF SavedModel, MLflow artifacts, ZIP, TAR, GZ |
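A first-pass classifier over these categories could key on the file name, as in the sketch below. It covers only a subset of the table, and the lookup tables and `categorize` function are hypothetical. Extension alone cannot separate every case: a scanned PDF is listed under Images but looks like any other PDF by name, and `.ts` could be TypeScript or a video transport stream, so content inspection would still be needed.

```python
from pathlib import PurePosixPath

# Partial, assumed mapping from extension to the categories in the table above.
CATEGORY_BY_EXTENSION = {
    ".pdf": "Text & Documents",
    ".docx": "Text & Documents",
    ".pptx": "Text & Documents",
    ".png": "Images",
    ".jpeg": "Images",
    ".mp4": "Video",
    ".mkv": "Video",
    ".wav": "Voice & Audio",
    ".flac": "Voice & Audio",
    ".csv": "Structured & Tabular",
    ".parquet": "Structured & Tabular",
    ".py": "Code & Notebooks",
    ".ipynb": "Code & Notebooks",
    ".ttl": "Knowledge & Ontology",
    ".owl": "Knowledge & Ontology",
    ".safetensors": "Model Artifacts",
    ".onnx": "Model Artifacts",
    ".gz": "Model Artifacts",
}

# Files identified by their full name instead of an extension.
CATEGORY_BY_NAME = {
    "dockerfile": "Code & Notebooks",
    "makefile": "Code & Notebooks",
}


def categorize(filename: str) -> str:
    """Return the ingest category for a file name, or "Unknown"."""
    path = PurePosixPath(filename)
    by_name = CATEGORY_BY_NAME.get(path.name.lower())
    if by_name is not None:
        return by_name
    return CATEGORY_BY_EXTENSION.get(path.suffix.lower(), "Unknown")
```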
Lakehouse Components
Use Cases
Multi-Modal RAG
Hybrid retrieval combining BM25 (Elasticsearch), semantic similarity (Qdrant), and graph context (Neo4j) for LLM-grounded generation, targeting higher precision than single-store RAG.
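The page does not say how results from the three stores are combined. One common option is reciprocal rank fusion (RRF), sketched here as an assumption and not as the lab's actual method. Each input is a ranked list of document IDs from one store; `reciprocal_rank_fusion` and the constant `k = 60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked results from the three stores.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "a", "d"]
graph_hits = ["b", "e"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against vector similarity scores, which live on different scales.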
Knowledge Graph Intelligence
Ontology-driven reasoning over organizational data using Oxigraph SPARQL, linked to Neo4j for entity relationship traversal and Elasticsearch for text evidence retrieval.
Fraud Network Analysis
Fraud-detection agents query Neo4j for transaction ring detection, Qdrant for behavioral similarity matching, and Elasticsearch for alert triage, all within a single agent step.
Model Artifact Management
MinIO stores model weights, training datasets, evaluation snapshots, and vector index backups — versioned and catalogued for reproducible research workflows.
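A catalogue with content-based versioning could be sketched as follows. This is an in-memory illustration under stated assumptions: the object key layout, the `ArtifactVersion` and `ArtifactCatalog` names, and the rule that unchanged content reuses the latest version are all hypothetical, and no MinIO calls are made.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactVersion:
    name: str
    version: int
    sha256: str
    object_key: str  # assumed layout: artifacts/<name>/<sha256>


class ArtifactCatalog:
    """In-memory catalogue; a real one would persist entries alongside the objects."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ArtifactVersion]] = {}

    def register(self, name: str, content: bytes) -> ArtifactVersion:
        digest = hashlib.sha256(content).hexdigest()
        history = self._versions.setdefault(name, [])
        if history and history[-1].sha256 == digest:
            return history[-1]  # identical to the latest version, so reuse it
        entry = ArtifactVersion(
            name=name,
            version=len(history) + 1,
            sha256=digest,
            object_key=f"artifacts/{name}/{digest}",
        )
        history.append(entry)
        return entry

    def latest(self, name: str) -> ArtifactVersion:
        return self._versions[name][-1]
```

Recording the hash with each version means a training run can pin the exact bytes it used, which is what makes a workflow reproducible.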
Collaborate
Building a polyglot AI data platform?
We design, deploy, and operate AI Data Lakehouses for enterprise and federal clients — from architecture to production-grade agent integration.
Get in Touch