Internal — Sales Training
DataBricks
How In-ATS Sourcing Works

Everything sales needs to know about candidate data ingestion, semantic search, costs, and timelines.

5 DataBrick Types
2 ATSs Live
~$188 / mo for 100K candidates
~6 hrs for a 100K Bullhorn ingest

Agenda

1. What is a DataBrick?   2. How it works   3. Live demo   4. Search deep dive   5. Ingestion detail   6. Timing   7. Costs   8. ATSs   9. What's next

March 12, 2026  ·  Internal Only  ·  Not for customer distribution
01
Concept

What is a DataBrick?

A DataBrick is a searchable data source. Instead of querying an ATS directly, recruiters get different "bricks" of candidate data that they can search across semantically.

🏢

ATS DataBrick

Candidates pulled directly from the customer's ATS (Bullhorn, Avionte, etc.)

source_type: ats
🎤

HeyMilo DataBrick

Past HeyMilo interview candidates — already have rich structured data from our interviews.

source_type: heymilo
🌐

PDL DataBrick

External candidates from PeopleDataLabs — enriched professional profiles.

source_type: pdl
📋

Candidate Directory

Workspace-level index of all candidates across sources for quick lookup.

source_type: candidate_directory
📄

Posting Directory

Index of job postings for matching and routing candidates to open reqs.

source_type: posting_directory

Why this matters for the customer

Recruiters don't need to manually search their ATS. They type a natural language query — "nurse with 5 years of experience in Houston" — and we search across ALL their data sources at once using AI-powered semantic search.

02
Overview

How It Works

Three steps — connect, ingest, search. That's it.

🔌
1. Connect
Customer connects their ATS — Bullhorn, Avionte, etc. Standard OAuth or API key, takes minutes.
📥
2. Ingest
We pull all their candidates, normalize the data, and build a searchable AI index. The initial load is a one-time process; later runs only pick up changes.
🔍
3. Search
Recruiters type natural language queries. AI finds and scores the best candidates across all their data — in seconds.

What the customer sees

  • Connect their ATS in settings (just like any other integration)
  • Wait for ingestion to finish (hours, not days for most clients)
  • Open sourcing, type what they're looking for in plain English
  • Get ranked candidates with AI scores and match explanations

What makes this different

  • Searches by meaning, not keywords — "nurse" finds "RN", "registered nurse", "LPN"
  • Searches across ALL data at once — work history, skills, education, certs
  • Multiple data sources in one search — ATS + HeyMilo interviews + external
  • AI evaluates candidates against the job — not just keyword matching

Talent rediscovery, not talent acquisition

Most staffing firms have 100K+ candidates sitting in their ATS that they never search effectively. We make that existing data work for them — no new candidates needed, just smarter access to what they already have.

03
Demo

Live Demo

End-to-end walkthrough — connect, ingest, search.

Connect ATS → Ingest candidates → Search → Evaluate → Results

① Connect

Customer connects their ATS via OAuth. DataBrick is provisioned automatically.

② Ingest

Candidates are pulled, transformed, and indexed. Progress is tracked in real-time.

③ Search

Recruiter types a natural language query. Top candidates returned with AI scores.

Demo Notes

Show the DataBricks management UI, real-time ingestion progress, search results with scored candidates, and the Sally sourcing flow for a complete picture.

04
Search

How Search Works

What happens every time a recruiter types a query — results in seconds.

💬
Natural Language Query
"nurse, 5 yrs, Houston"
🧩
Query Decomposition
LLM splits into sub-queries
skills: 0.4, exp: 0.3, loc: 0.3
Parallel Vector Search
5 workers, hybrid α=0.5
BM25 + semantic combined
📊
Rerank + Aggregate
Weighted scores across results
🏆
LLM Evaluates
GPT-4o-mini scores vs criteria
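
The scoring steps above can be sketched in a few lines. This is a simplified illustration, not the production ranking code: it assumes scores are already normalized to 0..1, that α weights the semantic side (as in Weaviate's hybrid search convention), and that a plain linear blend stands in for whatever fusion the vector store actually applies. The function names and the sample candidates are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.5  # from the slide: equal weight on keyword (BM25) and semantic scores


def hybrid_score(bm25: float, semantic: float, alpha: float = ALPHA) -> float:
    """Linear blend of a normalized BM25 score and a semantic similarity score."""
    return (1 - alpha) * bm25 + alpha * semantic


def aggregate(hits_by_sub_query: dict[str, dict[str, float]],
              weights: dict[str, float]) -> list[tuple[str, float]]:
    """Sum each candidate's sub-query scores, scaled by the sub-query weight."""
    totals: dict[str, float] = defaultdict(float)
    for sub_query, hits in hits_by_sub_query.items():
        for candidate_id, score in hits.items():
            totals[candidate_id] += weights[sub_query] * score
    # Highest combined score first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Weights as shown on the slide for "nurse, 5 yrs, Houston"
weights = {"skills": 0.4, "exp": 0.3, "loc": 0.3}
hits = {
    "skills": {"cand_a": 1.0, "cand_b": 0.5},
    "exp": {"cand_a": 0.5, "cand_b": 1.0},
    "loc": {"cand_a": 1.0},  # cand_b did not match the location sub-query
}
ranked = aggregate(hits, weights)
```

Only the top of this ranked list would then go to the LLM for evaluation, which is what keeps the per-candidate eval cost bounded.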

Multi-Vector Namespace Search

Each candidate has multiple vector "namespaces" — separate embeddings for different parts of their profile.

Namespace | What It Captures
personal_info | Name, title, summary
highlights | Key skills, achievements
profile_text | Full profile narrative
contact | Location, availability

Plus cross-reference collections for work history, education, skills, certs — each with their own vectors and back-references to the main candidate.

Why This Beats ATS Search

  1. Semantic understanding — "Java developer" matches "Senior Software Engineer (Java/Spring)" even without exact keywords.
  2. Cross-field matching — "5 years healthcare" searches work history, skills, AND certifications simultaneously.
  3. Multi-source aggregation — One search across ATS + HeyMilo interviews + external data, deduplicated and ranked.
  4. AI evaluation — LLM scores candidates against the job's actual requirements, not just keyword frequency.
05
Ingestion

Ingestion Pipeline — Detail

How candidate data goes from ATS to searchable vector index.

🏢
ATS API
Rate-limited
Bullhorn: 1K/min
Avionte: 10 RPS
🔄
Transform
Normalize to canonical schema
LLM-generated mappers
💾
Data Lake
MongoDB buffer
Upsert + dedup
🧠
Weaviate
Vectorize + HNSW index
~80 vectors/sec

Per Candidate: ~16 Vectors

Component | Vectors | Notes
Main namespace vectors | 4 | personal, contact, location, metadata
Work history | ~3 | 1 vector per entry
Education | ~2 | 1 vector per entry
Skills | ~5 | 1 vector per skill
Certifications | ~1 | 1 vector per cert
Tags | ~1 | 1 vector per tag group
Total | ~16 | Varies by data richness
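
To see where the ~16 comes from, the table can be turned into a small calculation. This is a rough sketch that assumes exactly the per-item rules listed above; the function name and arguments are illustrative, not part of any real API.

```python
MAIN_NAMESPACE_VECTORS = 4  # fixed per candidate, per the table above


def vectors_for_candidate(work_entries: int, education_entries: int,
                          skills: int, certifications: int,
                          tag_groups: int) -> int:
    """One vector per cross-reference item, plus the fixed main namespace vectors."""
    return (MAIN_NAMESPACE_VECTORS + work_entries + education_entries
            + skills + certifications + tag_groups)


# The "typical" profile behind the ~16 figure
typical = vectors_for_candidate(work_entries=3, education_entries=2,
                                skills=5, certifications=1, tag_groups=1)

# A sparse profile (one job, no education, two skills) produces far fewer vectors
sparse = vectors_for_candidate(work_entries=1, education_entries=0,
                               skills=2, certifications=0, tag_groups=0)
```

A customer with thin profiles will index faster than the averages suggest, and one with very rich profiles will index slower.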

Enrichment Calls Per ATS

ATS | API Calls / Candidate
Bullhorn | 1 per 50 (bulk API)
Avionte | 6 (1 basic + 5 enrichment)

Enrichment = skills, education, work history, certifications, tags

Avionte needs 6 separate API calls per candidate to get the full profile. Bullhorn returns the full profile for 50 candidates in a single bulk call.

Idempotent by design

Re-running ingestion reuses existing data and only computes deltas. No duplicates, no wasted API calls.
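
One common way to get this behavior is to fingerprint each record and skip anything whose fingerprint has not changed. The sketch below illustrates that idea only; it is not a description of the actual pipeline. The `ats_id` field and the in-memory dict standing in for the MongoDB buffer are assumptions made for the example.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a record's content (key order does not matter)."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_ingest(incoming: list[dict], stored_hashes: dict[str, str]) -> list[dict]:
    """Return only new or changed records and remember their fingerprints."""
    to_index = []
    for record in incoming:
        key = record["ats_id"]  # hypothetical unique ID from the ATS
        digest = fingerprint(record)
        if stored_hashes.get(key) != digest:
            stored_hashes[key] = digest
            to_index.append(record)
    return to_index


store: dict[str, str] = {}
batch = [{"ats_id": "1", "title": "RN"}, {"ats_id": "2", "title": "LPN"}]

first_run = plan_ingest(batch, store)    # everything is new
second_run = plan_ingest(batch, store)   # nothing changed, nothing to do
third_run = plan_ingest([{"ats_id": "2", "title": "RN"}], store)  # one delta
```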

06
Timing

Time to Ingest

How long it takes to go from zero to searchable, broken down by ATS.

total_time ≈ max(ats_fetch_time, weaviate_ingest_time) + pipeline overhead (budget up to 20%; the totals below add roughly 6 to 8%)

Bullhorn

1,000 candidates/min bulk API · Bottleneck = Weaviate

Candidates | ATS Fetch | Weaviate | Total
10K | 10 min | 33 min | ~35 min
100K | 1.7 hrs | 5.6 hrs | ~6 hrs
500K | 8.3 hrs | 27.8 hrs | ~30 hrs
1M | 16.7 hrs | 55.6 hrs | ~2.5 days

Avionte

10 RPS · 6 API calls/candidate · Bottleneck = ATS rate limit

Candidates | ATS Fetch | Weaviate | Total
10K | 1.7 hrs | 0.6 hrs | ~1.8 hrs
100K | 16.7 hrs | 5.6 hrs | ~18 hrs
500K | 83 hrs | 27.8 hrs | ~3.7 days
1M | 167 hrs | 55.6 hrs | ~7.4 days

Quick Formula for Sales

Bullhorn: total hours ≈ candidates ÷ 17,000   |   Avionte: total hours ≈ (candidates × 6) ÷ 36,000
Example: 200K Bullhorn candidates → 200,000 ÷ 17,000 ≈ 12 hours.   200K Avionte → (200,000 × 6) ÷ 36,000 ≈ 33 hours.

Why the difference?

Bullhorn's bulk API returns 50 candidates per call, so ATS fetch is fast and Weaviate vectorization is the bottleneck. Avionte requires 6 separate API calls per candidate at 10 requests/second, so the ATS rate limit dominates.
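
For anyone who wants to check the tables or size an unusual deal, the arithmetic can be sketched as follows. It uses only the rates quoted in this deck (1,000 candidates/min for Bullhorn, 10 RPS and 6 calls for Avionte, ~16 vectors per candidate, ~80 vectors/sec) and leaves pipeline overhead out, so results land slightly under the table totals. The function names are made up for the example.

```python
def rate_limited_fetch_hours(candidates: int, calls_per_candidate: int,
                             requests_per_sec: float) -> float:
    """Fetch time for an ATS limited by requests per second (e.g. Avionte)."""
    return candidates * calls_per_candidate / requests_per_sec / 3600


def bullhorn_fetch_hours(candidates: int, candidates_per_min: int = 1000) -> float:
    """Fetch time for Bullhorn's bulk API."""
    return candidates / candidates_per_min / 60


def weaviate_hours(candidates: int, vectors_per_candidate: int = 16,
                   vectors_per_sec: int = 80) -> float:
    """Vectorization and indexing time."""
    return candidates * vectors_per_candidate / vectors_per_sec / 3600


def bottleneck_hours(fetch_hours: float, index_hours: float) -> float:
    """The slower stage sets the pace; overhead is not included."""
    return max(fetch_hours, index_hours)


def quick_bullhorn_hours(candidates: int) -> float:
    return candidates / 17_000


def quick_avionte_hours(candidates: int) -> float:
    return candidates * 6 / 36_000


bullhorn_100k = bottleneck_hours(bullhorn_fetch_hours(100_000),
                                 weaviate_hours(100_000))
avionte_100k = bottleneck_hours(rate_limited_fetch_hours(100_000, 6, 10),
                                weaviate_hours(100_000))
```

For Bullhorn the Weaviate stage wins the `max`, and for Avionte the fetch stage does, which is the same bottleneck split described above.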

07
Cost

Hosting Cost — Our Side

What it costs us to run the infrastructure per customer.

Monthly Cost = S × s + C × n + I × N

S × s — Session Cost

S = ~$0.50 per sourcing session

LLM cost for query decomposition + evaluation criteria generation.

s = sessions/month (1 session = 1 job posting search)

C × n — Eval Cost

C = $0.0018 per candidate evaluated

GPT-4o-mini scores each candidate returned from vector search.

n = candidates scanned/month

I × N — Infrastructure

I = $0.94/1K (≤1M) or $0.77/1K (>1M)

Weaviate node, transformer pod, GKE, MongoDB — shared across clients.

N = total candidates in thousands

Example Scenarios

Scenario | Candidates | Recruiters | Sessions/mo | S × s | C × n | I × N | Total/mo
Small agency | 100K | 1 | 160 | $80 | $14 | $94 | $188
Mid staffing firm | 500K | 3 | 480 | $240 | $43 | $470 | $753
Large enterprise | 1M | 6 | 960 | $480 | $346 | $940 | $1,766
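
The formula can be written out as a short calculation, which reproduces the Small agency and Mid staffing firm rows. This is a sketch under two assumptions: 50 candidates evaluated per session (the figure used in the cheat sheet at the end of this deck), and the infrastructure rate applied flat to all candidates rather than tiered at the margin. The Large enterprise row lists a higher eval cost than the 50-per-session default produces, so treat `evals_per_session` as an input to adjust. Names are illustrative.

```python
SESSION_COST = 0.50   # S: LLM cost per sourcing session
EVAL_COST = 0.0018    # C: cost per candidate evaluated


def infra_rate_per_1k(candidates: int) -> float:
    """I: $0.94 per 1K candidates up to 1M, $0.77 per 1K above that."""
    return 0.94 if candidates <= 1_000_000 else 0.77


def monthly_cost(candidates: int, sessions: int,
                 evals_per_session: int = 50) -> float:
    """Monthly Cost = S × s + C × n + I × N."""
    session_part = SESSION_COST * sessions
    eval_part = EVAL_COST * sessions * evals_per_session
    infra_part = infra_rate_per_1k(candidates) * candidates / 1000
    return session_part + eval_part + infra_part


small_agency = monthly_cost(candidates=100_000, sessions=160)
mid_firm = monthly_cost(candidates=500_000, sessions=480)
```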

Infrastructure Base Cost = ~$600/mo fixed

Weaviate node $383 + transformer pod $49 + GKE $73 + MongoDB $75 + network $20. Shared across all clients — gets amortized as more clients onboard.

Multi-client = lower per-client cost

Infrastructure is shared. 3 clients at 800K total candidates = $752 infra total, way less than 3 separate deployments.

08
Integrations

Supported ATSs

What's live today and what's in the pipeline.

Bullhorn

LIVE
Rate limit: 1,000 candidates/min (bulk API)
API calls / candidate: 1 per 50 (bulk)
Batch size: 100 candidates per batch
Concurrency: 5 parallel workers
100K ingest time: ~6 hours
Bottleneck: Weaviate vectorization (ATS is fast)
Data fields: Full profile in single bulk response

Avionte

LIVE
Rate limit: 10 requests/second
API calls / candidate: 6 (1 basic + 5 enrichment)
Batch size: 100 candidates per batch
Concurrency: 5 parallel workers
100K ingest time: ~18 hours
Bottleneck: ATS rate limit (10 RPS × 6 calls)
Enrichment: Skills, education, work history, certs, tags

On the Radar

Ceipal

PLANNED

2 RPS, 6 calls/candidate. ~3.5 days for 100K. Slowest due to rate limits.

Ashby

PLANNED

Training sessions completed Jan 2026. Integration architecture defined.

Greenhouse

PLANNED

Training completed Jan 2026. Transformer service ready to generate mappers.

Adding a new ATS

Our transformer service uses LLMs to automatically generate data mappers for new ATSs. Once we have API access and a sample payload, a new ATS can be onboarded in days, not weeks.
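
For a sense of what a "data mapper" is, here is a hand-written example of the kind of function the transformer service would generate. Everything in it is hypothetical: the canonical fields, the payload keys (`firstName`, `skillList`, and so on), and the function name are invented for illustration and do not reflect any real ATS API or our actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalCandidate:
    """Hypothetical slice of a canonical schema shared by all sources."""
    source_type: str
    source_id: str
    full_name: str
    location: str = ""
    skills: list[str] = field(default_factory=list)


def map_example_ats(payload: dict) -> CanonicalCandidate:
    """Translate one made-up ATS payload into the canonical shape."""
    first = payload.get("firstName", "")
    last = payload.get("lastName", "")
    address = payload.get("address") or {}
    # Join whichever location parts are present
    location = ", ".join(
        part for part in (address.get("city"), address.get("state")) if part
    )
    # This imaginary ATS sends skills as one semicolon-separated string
    raw_skills = payload.get("skillList") or ""
    skills = [s.strip() for s in raw_skills.split(";") if s.strip()]
    return CanonicalCandidate(
        source_type="ats",
        source_id=str(payload["id"]),
        full_name=f"{first} {last}".strip(),
        location=location,
        skills=skills,
    )


candidate = map_example_ats({
    "id": 42,
    "firstName": "Ada",
    "lastName": "Lovelace",
    "address": {"city": "Houston", "state": "TX"},
    "skillList": "Nursing; ICU ;",
})
```

Each ATS names and nests these fields differently, which is why a sample payload is the main thing needed to onboard a new one.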

09
Roadmap

What's Next

Where DataBricks is going — this is the foundation for a much bigger play.

✅ ATS Ingestion (Bullhorn + Avionte)

Pull candidates from ATS, normalize, index into searchable vector store. SHIPPED

✅ Semantic Search + AI Evaluation

Multi-vector search with LLM-based candidate scoring against job criteria. SHIPPED

✅ DataBricks Management Dashboard

UI to view, manage, and reindex vector collections. SHIPPED

🔄 More ATSs (Ceipal, Ashby, Greenhouse)

Transformer service auto-generates mappers for new ATSs. Training completed for Ashby + Greenhouse.

📊 Deep Analytics on DataBricks

Run analytics across a DataBrick — talent market trends, skill gap analysis, compensation benchmarking. ClickHouse backend for structured queries.

🔄 Data Migration Between Sources

Move candidates from one data source to another. ATS → ATS migration, or consolidating after M&A.

🧬 Cross-Source Enrichment

Use one DataBrick to enrich another. Example: match ATS candidates against PDL data to fill in missing emails, phone numbers, social profiles.

🔗 Candidate Dedup + Merge Across Sources

Same candidate exists in Bullhorn AND Avionte? Merge profiles, pick best data from each, create a unified candidate record.

📁 File Upload DataBrick

Upload CSVs or resume files directly as a DataBrick. No ATS required — useful for clients migrating off legacy systems.

🤖 Autonomous Sourcing Worker

Hourly automated sourcing cycles — search internal data, score candidates, apply exclusion rules, queue for outreach. Fully hands-off for recruiters.

The Big Picture

DataBricks turns HeyMilo from an interview tool into a candidate data platform. Any data in, any data out, AI-powered search and enrichment across everything.

10
Reference

Quick Reference — Sales Cheat Sheet

Numbers to have at your fingertips.

Ingestion Time Calculator

ATS | 10K | 50K | 100K | 500K
Bullhorn | 35 min | 3 hrs | 6 hrs | 30 hrs
Avionte | 1.8 hrs | 9 hrs | 18 hrs | 3.7 days

Monthly Cost Estimate

Candidates | 1 Recruiter | 3 Recruiters | 6 Recruiters
50K | $127 | $367 | $727
100K | $188 | $428 | $788
500K | $564 | $804 | $1,164
1M | $1,034 | $1,274 | $1,766

Assumes 160 sessions/recruiter/month, 50 candidates evaluated per session, at 65% infra utilization.

Customer FAQ

"How long until my team can search?"

Depends on ATS and candidate count. Bullhorn: ~6 hrs for 100K. Avionte: ~18 hrs for 100K. Initial ingestion is one-time. After that, incremental updates keep data fresh.

"Is my data secure?"

Data stays within our GCP infrastructure (us-central1). Each workspace has isolated collections. No data is shared between customers. SOC 2 compliant.

"What if I have 2 ATSs?"

Each ATS gets its own DataBrick. Recruiters can search across both simultaneously — we handle dedup and cross-source ranking.

"How is this different from ATS search?"

ATS search is keyword-based. We use AI to understand what you mean, not just what you type. We also search across work history, skills, education, and certifications simultaneously.
