How to give your AI the right documents in milliseconds — at enterprise scale.
TL;DR
- Most internal search was never designed for LLMs. It’s slow, fuzzy, and forces you into clumsy RAG pipelines with multiple chained queries.
- Modern engines like Elasticsearch, Weaviate, Qdrant or Azure AI Search already combine BM25 + embeddings — but usually with limited, UI-oriented faceting.
- MFO goes further with an AI-native faceted search engine:
- hybrid lexical + semantic retrieval,
- bitmap-based faceting on PostgreSQL (via pg_facets),
- LLM-optimized ranking focused on giving the model a small, highly relevant context window.
- Under the hood: Roaring Bitmaps on 32-bit IDs, up to ~4.3B documents per table, and millisecond-level latency on tens of millions of docs with dozens of facets.
- For dev / product / data teams, this becomes a shared memory layer for all agents and RAG workflows.
- If this resonates, you can simply register on the website to access the beta and test it on your own data.

1. Why information search must change
Situation. Inside most organizations, information multiplies faster than anyone can follow:
- emails, chats, tickets,
- meeting notes and call transcripts,
- legal & regulatory docs,
- product docs, runbooks, incident reports…
In parallel, teams are rolling out LLM-based copilots and agents everywhere. These systems rely heavily on RAG: “retrieve the right docs, feed them to the model, hope for a good answer”.
Complication. The reality is less pretty:
- internal search is often slow and keyword-only,
- each RAG call triggers multiple retrieval steps (BM25 → vector → filter → rerank),
- most pipelines drown the LLM in too many or wrong documents.
You end up with:
- latency,
- token and infra costs,
- approximate answers,
- hallucinations when the right document didn’t make it into context.
Question. How do you let an agent instantly grab only the few documents that matter, across millions, without hand-crafting ten queries and a custom post-processing pipeline?
Answer. MFO’s answer is an AI-native faceted search engine, designed from day one for LLMs and agents:
- hybrid BM25 + embeddings,
- multi-dimensional, bitmap-based faceting,
- ranking optimized for the LLM’s context window,
- built on PostgreSQL + pg_facets.
2. The basics: how “classic” search actually works
To keep this grounded, let’s rewind a bit.
2.1 Full-text search (BM25)
You know this from Google, Gmail, or SharePoint search: you type words; it returns documents containing those words.
Under the hood, many engines use BM25, a ranking function that scores documents based on:
- how often the terms appear,
- how rare they are in the corpus,
- document length, etc.
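For reference, the standard Okapi BM25 scoring function combines these signals as follows:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(q_i, D) is how often term q_i appears in document D, |D| is the document length, avgdl the average document length in the corpus, IDF(q_i) the inverse document frequency of the term, and k_1, b are tuning parameters (commonly k_1 ≈ 1.2–2.0, b ≈ 0.75).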
Good for: exact codes, error messages, IDs, legal references.
Limitation: doesn’t really understand paraphrases or “same idea, different wording”.
2.2 Categories vs facets
A simple but crucial distinction:
- Category = one folder
Example: Finance → Contracts → 2023.
A document “lives” in one place. Rigid, and quickly painful.
- Facet = one lens among many
The same document can be addressed as:
- Year (range) = 2019-2023
- Type = Contract
- Author = Peter
- Domain = Finance
- Jurisdiction = EU
In e-commerce, when you filter by size, color, price → those are facets.
➡ Facets are about multi-dimensional navigation. They let you cut through large datasets along many axes at once.
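As a tiny illustration (the document and values below are made up), the same document carries several independent facet dimensions instead of living in one folder, and can be filtered along any combination of them:

```python
# One document, many facets: each key is an independent lens on the same doc.
doc = {
    "id": 42,
    "title": "Supplier agreement",          # hypothetical document
    "facets": {
        "year": 2023,
        "type": "Contract",
        "author": "Peter",
        "domain": "Finance",
        "jurisdiction": "EU",
    },
}

def matches(doc: dict, **filters) -> bool:
    """True if the document satisfies every requested facet filter."""
    return all(doc["facets"].get(k) == v for k, v in filters.items())

print(matches(doc, type="Contract", jurisdiction="EU"))  # True
print(matches(doc, domain="Legal"))                      # False
```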
3. Faceted search before AI — and its limits
Search engines like Apache Solr or Elasticsearch have supported faceted navigation for years. Great for:
- product catalogs,
- document libraries,
- log exploration.
But for AI and RAG, they run into a few walls:
- Keyword-centric. Even with tweaks, the core is still lexical; semantic relations are bolted on later or via side systems.
- Facet count & complexity. Designed mainly for a handful of user-facing facets (brand, price, category…), not for dozens of dynamic, model-driven dimensions.
- Not LLM-aware. They optimize “top N results for a human”, not “best compact context window for an LLM”.
At the same time, newer systems like Weaviate, Qdrant, Milvus and Azure AI Search, as well as Elastic’s hybrid features, all offer hybrid BM25 + vector retrieval and some level of filtering.
They’re powerful, but mostly as general-purpose search backends: you still have to write a lot of glue code to get an LLM-ready, faceted memory layer.
4. Why “vanilla” RAG struggles in practice
On paper, RAG is elegant:
“Find relevant docs, feed them to the LLM, get a better answer.”
In practice, for dev / product / data teams, it often looks like this:
- Run a BM25 search.
- Run a vector search.
- Apply metadata filters.
- Run a reranker or LLM-based scoring.
- Trim to fit the context window.
- Realize you missed a critical doc → tweak and repeat.
Three big pain points:
- Too many steps. Each extra retrieval / rerank adds latency and API costs.
- Too much noise. You often end up sending a fuzzy mix of “maybe relevant” chunks.
- No structural understanding. An internal email, a board memo, a ticket, a law article → same pipeline, even though they should be treated differently.
For one chatbot experiment, that’s OK.
For agents in production, it becomes a bottleneck.
5. The core idea: AI-native faceted search
Here’s the value proposition, made explicit:
MFO provides an AI-native faceted search engine designed for agents and LLMs.
It:
- combines BM25 (lexical), embeddings (semantic) and multi-dimensional metadata in a single hybrid retrieval layer;
- uses pg_facets and Roaring Bitmaps for extremely fast faceted filtering and counting on PostgreSQL;
- optimizes ranking for LLM context quality and cost, not just click-through;
- keeps latency in the few-millisecond range on tens of millions of documents, thanks to compressed bitmaps and in-memory caches.
This engine becomes the shared memory for your agents and RAG workflows.
5.1 Under the hood: PostgreSQL + Roaring Bitmaps + pg_facets
Concretely, MFO’s engine relies on:
- PostgreSQL for storage, transactions, and SQL integration,
- the pg_facets extension (ported to Zig) for bitmap-based faceting,
- Roaring Bitmaps to represent sets of document IDs efficiently.
Some key properties:
- ID range: 32-bit unsigned → from 0 to 4,294,967,295, i.e. theoretically up to ~4.3 billion documents per table.
- Facets & values: each facet value gets its own bitmap. There is no hard limit on how many facets or values you can define; it’s a matter of memory and architecture.
- Chunking: documents can be split into chunks based on ID ranges (chunk_bits), e.g.:
- chunk_bits = 20 → ~1,048,576 docs per chunk,
- chunk_bits = 24 → ~16,777,216 docs per chunk.
This helps parallelize queries and keep bitmaps small and hot in cache.
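A minimal sketch of the chunk arithmetic above, assuming the chunk index is simply the high bits of the 32-bit document ID (an illustration of the numbers, not pg_facets’ internal code):

```python
# chunk_bits low bits address documents inside a chunk; the rest selects the chunk.
chunk_bits = 20                      # 2**20 = ~1,048,576 docs per chunk
docs_per_chunk = 1 << chunk_bits

def chunk_of(doc_id: int) -> int:
    """Return the chunk a document ID falls into."""
    return doc_id >> chunk_bits

print(docs_per_chunk)                # 1048576
print(chunk_of(5_000_000))           # 4 -> the fifth chunk (chunks are 0-based)
```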
Performance envelope (order of magnitude):
| Scenario | Documents | Facets | Values / facet | Typical latency* |
|---|---|---|---|---|
| Small | 100K | 10 | 100 | < 1 ms |
| Medium | 1M | 50 | 500 | 1–10 ms |
| Large | 10M | 100 | 1,000 | 10–50 ms |
| Very large | 100M | 200 | 2,000 | 50–200 ms |
*With warm caches and reasonably distributed data.
Bitmap operations (AND/OR/XOR + cardinality) are implemented in C with SIMD optimizations and minimal allocations, which is why you can compute multi-facet filters and counts in a few milliseconds even with many dimensions.
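As a rough illustration of why this is fast, here is what multi-facet filtering looks like with Roaring Bitmaps in Python, using the pyroaring library as a stand-in for the C implementation inside pg_facets (the facet values and document IDs are made up):

```python
from pyroaring import BitMap

# One bitmap of document IDs per facet value (normally built at index time).
facet_type_contract   = BitMap([10, 42, 99, 1_000_000])
facet_jurisdiction_eu = BitMap([42, 77, 99, 500_000])
facet_year_2023       = BitMap([42, 99, 123_456])

# A multi-facet filter is a bitmap intersection (AND); counts are cardinalities.
candidates = facet_type_contract & facet_jurisdiction_eu & facet_year_2023
print(sorted(candidates))   # [42, 99]
print(len(candidates))      # 2 -> drill-down count for this filter combination
```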
6. How MFO’s AI-native faceted search works
6.1 Pillar 1 — Hybrid search: lexical + semantic + metadata
Hybrid search (BM25 + vectors) is now considered a best practice across modern search engines:
- Elasticsearch, Weaviate, Azure AI Search, Milvus and others all provide hybrid BM25 + vector pipelines.
MFO builds on the same idea but ties it directly to faceted filtering and LLM needs:
- Faceted layer (metadata), applied first
Uses pg_facets to represent each facet value as a compressed bitmap:
- tenant, product, version,
- jurisdiction, department, topic,
- risk level, sentiment, source system, etc.
- Searches are then run in parallel:
- Lexical layer (BM25)
Uses PostgreSQL’s text search capabilities or external search engines. Good for:
- exact references,
- error messages,
- article numbers,
- IDs and codes.
- Semantic layer (embeddings)
Uses a vector index (pgvector, or an external DB like Qdrant / Weaviate) to capture paraphrases and conceptual similarity.
- Results are then reranked and recombined.
At query time, MFO can:
- Run lexical and semantic retrieval (like Weaviate’s hybrid mode or Azure’s hybrid queries).
- Intersect the candidate set with facet filters by bitmap operations.
- Produce a global score that blends lexical, semantic, and metadata signals.
Result: a compact candidate set that is already highly filtered and structured before any LLM sees it.
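A minimal sketch of the blending step; the weights, normalization, and per-document scores below are illustrative, not MFO’s actual scoring formula:

```python
def hybrid_score(lex: float, sem: float, w_lex: float = 0.4, w_sem: float = 0.6) -> float:
    """Blend a normalized lexical score and a semantic similarity score."""
    return w_lex * lex + w_sem * sem

# Hypothetical per-document scores, restricted to the facet-filtered candidates.
candidates = {42, 99}                     # from the bitmap intersection
lexical  = {42: 0.92, 99: 0.40, 77: 0.88} # BM25 scores normalized to [0, 1]
semantic = {42: 0.71, 99: 0.95, 77: 0.10}

ranked = sorted(candidates,
                key=lambda d: hybrid_score(lexical[d], semantic[d]),
                reverse=True)
print(ranked)  # [42, 99] -> 42's strong lexical match outweighs 99's semantic edge here
```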
6.2 Pillar 2 — AI-powered, dynamic facets
Classic faceting (Solr, Elasticsearch) is usually: “predefined filters on a side panel”. Good for catalogs, less so for agents.
MFO treats facets as first-class citizens for AI:
- pg_facets can compute facet counts and intersections across many dimensions in a single pass, thanks to Roaring Bitmaps.
- It’s efficient even when:
- you have tens or hundreds of facets,
- each facet has thousands of values,
- and you want “drill-down” (select one filter, recompute all counts).
On top of that, MFO adds an AI layer that:
- understands the intent of the query,
- knows the workflow (support, compliance, ops, sales),
- and can decide which facets are most informative for this context:
- in support: product, version, environment, error code,
- in compliance: regulation, jurisdiction, risk level, date,
- in sales: account, region, segment, deal stage.
Facets stop being static UI filters. They become dimensions the agent can reason over.
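A toy sketch of context-dependent facet selection; the workflow names and facet lists below are illustrative only, not MFO’s configuration:

```python
# Which facets are worth drilling into depends on the workflow the agent is running.
INFORMATIVE_FACETS = {
    "support":    ["product", "version", "environment", "error_code"],
    "compliance": ["regulation", "jurisdiction", "risk_level", "date"],
    "sales":      ["account", "region", "segment", "deal_stage"],
}

def facets_for(workflow: str) -> list[str]:
    """Return the facets most informative for this workflow (empty if unknown)."""
    return INFORMATIVE_FACETS.get(workflow, [])

print(facets_for("compliance"))
# ['regulation', 'jurisdiction', 'risk_level', 'date']
```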
6.3 Pillar 3 — LLM-optimized ranking
Most hybrid systems stop at “give me the top-k results”. For an LLM, that’s not always optimal.
MFO’s ranking objective is explicit:
Maximize the LLM’s chance of producing a correct, grounded answer, under a context window and cost constraint.
That means taking into account:
- Relevance (lexical + semantic).
- Diversity of sources and viewpoints (avoid 10 near-duplicates).
- Freshness and versioning (don’t mix outdated policies with current ones).
- Document quality / confidence scores.
- Token budget: 2K vs 8K vs 32K tokens isn’t the same game.
Instead of “k best docs”, MFO effectively solves a subset selection problem:
- Pick a set of chunks that:
- cover the sub-questions implied by the prompt,
- stay within a max token budget,
- keep redundancy low.
You can think of it as a knapsack / set-cover heuristic tuned for LLM context windows. In practice, LLMs already understand how facets are used and exploit them very efficiently.
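A minimal greedy sketch of budgeted context selection; the scoring, penalty, and chunk data are illustrative, not MFO’s actual ranking code:

```python
def select_context(chunks: list[dict], token_budget: int,
                   redundancy_penalty: float = 0.5) -> list[str]:
    """Greedily pick chunks maximizing relevance per token under a token budget,
    discounting chunks whose topic is already covered."""
    selected, seen_topics, used = [], set(), 0
    remaining = list(chunks)
    while remaining:
        def gain(c):
            relevance = c["relevance"] * (redundancy_penalty if c["topic"] in seen_topics else 1.0)
            return relevance / c["tokens"]
        best = max(remaining, key=gain)
        remaining.remove(best)
        if used + best["tokens"] > token_budget:
            continue                      # doesn't fit in what's left of the budget
        selected.append(best["id"])
        seen_topics.add(best["topic"])
        used += best["tokens"]
    return selected

chunks = [
    {"id": "a", "tokens": 900, "relevance": 0.9, "topic": "gdpr"},
    {"id": "b", "tokens": 300, "relevance": 0.7, "topic": "gdpr"},
    {"id": "c", "tokens": 500, "relevance": 0.6, "topic": "asia-ops"},
]
print(select_context(chunks, token_budget=1500))   # ['b', 'c']
```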
6.4 Pillar 4 — Real-time architecture for agents
Agents don’t behave like humans typing one search per minute. They:
- chain several searches in a single task,
- refine queries themselves,
- coordinate with other agents,
- run in parallel workflows.
Here, the internal properties of Roaring Bitmaps matter:
- bitmap operations (AND, OR, AND NOT) are fast and cache-friendly,
- many filters can be combined without re-scanning the full table,
- facet counts can be computed on the filtered set in place.
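Continuing the earlier pyroaring illustration (again a stand-in for the C implementation, with made-up IDs), combining filters and recomputing counts on the filtered set stays cheap:

```python
from pyroaring import BitMap

product_x         = BitMap(range(0, 2_000_000, 3))   # docs tagged product = X
severity_critical = BitMap(range(0, 2_000_000, 7))
resolved          = BitMap(range(0, 2_000_000, 2))

# "Critical incidents on product X that are NOT yet resolved": AND, then AND NOT.
filtered = (product_x & severity_critical) - resolved

# Drill-down: recompute facet counts on the filtered set in place.
by_region = {"eu": BitMap(range(0, 2_000_000, 5)),
             "us": BitMap(range(1, 2_000_000, 5))}
counts = {region: len(filtered & bm) for region, bm in by_region.items()}
print(len(filtered), counts)
```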
With reasonable hardware and warm caches, you’re looking at:
- sub-millisecond to low-millisecond response times on hundreds of thousands to a few million docs,
- tens of milliseconds on tens of millions of docs with many facets.
This keeps latencies acceptable even when agents fire multiple retrievals per “thought”.
7. Real-world use cases
A. Internal search over millions of documents
Emails
“Find all threads where a client raised a compliance risk between 2021 and 2024.”
The engine:
- filters by customer, date range, language,
- uses sentiment / topic facets to detect “risk” discussions,
- ranks the threads most likely to matter for the question.
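Conceptually, the agent’s retrieval request for that question might look like the following (a purely hypothetical payload shape with made-up values, not MFO’s actual API):

```python
# Hypothetical request shape for the "compliance risk" email search above.
query = {
    "text": "client raised a compliance risk",        # hybrid BM25 + embedding query
    "facets": {
        "source": "email",
        "customer": "ACME Corp",                      # made-up customer
        "date": {"from": "2021-01-01", "to": "2024-12-31"},
        "language": ["en", "fr"],
        "topic": "compliance",
        "sentiment": ["negative", "concerned"],
    },
    "ranking": {"token_budget": 8000, "diversity": True},
}
```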
Call transcripts
“Calls mentioning a critical incident resolved in less than 24 hours.”
- Facets: channel, severity, resolution time, product, account.
- Agents can quickly retrieve calls matching a very precise pattern.
Meeting notes
“Meetings where the finance team discussed the Q4 budget.”
Facets like team, participants, project, quarter, topic.
Regulatory documents
“Articles dealing with biometric data processing in EU law.”
Perfect match for heavily structured, multi-topic texts.
B. Embedded in existing channels
Email
“Can you send me the latest internal memo summarizing GDPR implications for our operations in Asia?”
The agent uses MFO’s engine, finds the right memo and attachments, and answers with grounded references.
Chat
A chat UI backed by MFO can justify its answers with:
- the exact fragments used,
- the facets applied (version, product, region…).
This makes answers easier to debug or audit.
Inside SaaS products
- HR: policies, contracts, role definitions.
- Legal: clauses, case law, negotiations.
- Support: incident history, runbooks, RCA reports.
Agents
An agent can:
- search with facets,
- reason over facet distributions (“most incidents in region X, product Y”),
- and emit actions (escalations, alerts, reports) on that basis.
8. Technical appendix — capacity & scaling
For technical teams, here’s the mental model.
Create as many Collections as needed; for each Collection:
- Max addressable documents per table: ~4.3B (32-bit IDs).
- Facets per table: no hard limit (each facet value is a bitmap).
- Values per facet: no hard limit in principle; in practice, you’ll usually be in the hundreds or thousands.
- Recommended practical range:
- up to 100M docs per index with excellent latency,
- beyond that: shard per business unit, region, or product line; or switch to 64-bit bitmaps if you really need billions of IDs in one logical space.
To go further:
- use chunking to split documents by ID ranges,
- place hot data (recent months, critical entities) on faster storage or dedicated nodes,
- keep vector retrieval & faceted filtering close (same DC / VPC).
9. Conclusion & next step
For 20 years, search engines have helped humans find documents.
With AI agents, the challenge is different:
we need to give models exactly the right set of documents to think and act reliably.
MFO’s AI-native faceted search per collection:
- respects the best practices of modern hybrid search (BM25 + vectors),
- adds a bitmap-based faceted core on PostgreSQL,
- and bakes in an LLM-aware ranking objective.
For dev, product and data teams, it’s less about “yet another search engine” and more about:
- a shared memory layer for all your agents,
- that stays fast at tens of millions of documents,
- that you can reason about, debug, and extend.
If you want to see how this behaves on your own data, the next step is simple: register on the website to access the beta and test your first RAG/agent workflow connected to a real AI-native faceted search engine.