Build a Datalake from HubSpot
This tutorial takes your HubSpot CRM from a fresh RevOS project all the way to a queryable semantic model — connect the source, sync deals and companies into your warehouse, layer dbt transforms (Bronze → Silver → Gold), publish a cube, and answer a real revenue question without writing SQL.
Where From Zero to Semantic Model starts from a public sample dataset, this one starts from live data you own: the difference is the front half — connecting a Source and syncing it into your warehouse. The transform-and-query spine is the same.
Persona: RevOps / data engineer on a new RevOS org, with a HubSpot account and nothing else. Goal: A pipeline-health view per company — open / won / lost deals, open pipeline value, won value, win rate — and the ability to ask questions of it. Time: ~30–40 minutes (most of it waiting on the first sync).
AI coding agents
This tutorial is written for use with an AI coding agent and has been validated with Claude Code. It should also work with any agent that reads the project's AGENTS.md file — that file points the agent at the right skills for each step.
The skills these steps rely on (explore-lakehouse, create-dbt-transformations, create-cubes, query-semantic-model, and more) are installed automatically by revos init into .claude/skills/ — you don't install them separately. See AI Skills.
Step 1 — Install the CLI and log in
Install the RevOS CLI globally:
npm install -g @revos/cli
Verify it, then authenticate (this opens a browser for OAuth):
revos --version
revos auth login
revos org list
For more detail see Installation and Authentication.
Step 2 — Scaffold a project
Create a new RevOS project:
revos init hubspot-datalake
The CLI prompts you to select an organization, then generates the medallion project layout (dbt/models/{bronze,silver,gold}, cubes/, connections/, connectors/), a Dev Container, and the RevOS AI skills under .claude/skills/.
Open hubspot-datalake in VS Code and click Reopen in Container when prompted. All subsequent commands run inside the Dev Container. Finish setup once inside:
revos init # signs you in (if needed) and provisions the GCP service-account key
See Project Scaffolding for the full breakdown.
Step 3 — Connect HubSpot as a Source
A Source is the external system you pull from. HubSpot authentication (OAuth) happens server-side, so creating a Source opens the RevOS UI:
revos sources create
Pick HubSpot, complete the OAuth consent in the browser, select the tables to sync (for the pipeline-health view, at least deals and companies — add contacts, owners, products, and deal_pipelines for more context), and save. Saving does two things: RevOS stores the credentials server-side (they never touch your project tree) and provisions a Connection for the tables you picked, which immediately starts syncing into your BigQuery dataset.
This command opens the RevOS web app in your browser to add the Source and pick tables. Inside the Dev Container it can't launch a browser for you — it prints the app URL to open on your host machine, so finish the OAuth and table selection there.
Back in the terminal, confirm the Source exists and note its id:
revos sources list --json
You can inspect the streams a Source advertises at any time:
revos sources list-streams <source-id> --json | jq '.[] | {streamName, syncModes, defaultCursorField}'
HubSpot exposes CRM streams including contacts, companies, deals, deal_pipelines, owners, and products. You choose which to sync in the wizard above; this command is handy for reviewing each stream's cursor and supported sync modes.
Step 4 — Pull the Connection into your project
Selecting tables when you added the Source already created a Connection server-side — the sync pipeline that lands those streams into your org's BigQuery dataset — and the first sync is already running. You don't author it from scratch; you pull it into your project as code so it's versioned and tunable.
revos init pulls existing Connections (and Cubes) when it scaffolds the project, so if you created the Source before running init it's already there. If you added the Source afterwards, refresh your local state:
revos pull
This writes connections/hubspot.yaml, mirroring what you configured in the UI:
apiVersion: revos/v1
kind: Connection
metadata:
name: hubspot
id: <connection-id> # written back by pull
spec:
name: HubSpot
source:
id: <source-id>
schedule: { units: 24, timeUnit: hours }
status: active
prefix: hubspot_
streams:
- name: deals
syncMode: incremental_deduped_history
cursorField: [updatedAt]
primaryKey: [[id]]
- name: companies
syncMode: incremental_deduped_history
cursorField: [updatedAt]
primaryKey: [[id]]
# contacts, deal_pipelines, owners, … — whatever you selected
<connection-id> and <source-id> are placeholders — revos pull writes the real values into the file for you; you never fill them in by hand.
From here the Connection is yours to tune as code — change the schedule, add or drop streams, or mask fields — then reconcile:
revos diff
revos apply
Check the sync status and confirm the raw tables land:
revos connections list --json
revos status
The first sync of deals + companies is small and completes in a few minutes. Once it's done, the raw tables (hubspot_deals, hubspot_companies, …) exist in your BigQuery dataset.
Need to mask a field before it ever reaches the warehouse — hash an email, drop a notes column? Add a mappers: block to the stream. See Data Masking & Stream Mappers.
Step 5 — Explore the lakehouse
Before transforming, see what landed. Ask your AI coding agent:
What HubSpot data do we have in our lakehouse?
The explore-lakehouse skill inspects your dataset and returns a plain-English summary — table names, row counts, column types, and the fields worth cleaning. Note the type quirks it surfaces (HubSpot ships booleans and ids as strings); the next step handles them.
Step 6 — Build Bronze → Silver → Gold transformations
Now build the medallion layers. Ask your AI coding agent:
Create dbt transformations on the HubSpot data:
- Silver:
- stg_hubspot_deals — typed, deduped deals; cast amount to numeric, parse
timestamps, derive is_closed / is_won / is_lost boolean flags from the
deal stage, keep company association id as a typed key
- stg_hubspot_companies — cleaned company records (id, name, industry, domain)
- Gold:
- company_pipeline_summary (grain: company) — join deals to companies and
aggregate per company: open_deals, won_deals, lost_deals,
open_pipeline_value, won_value, win_rate (won / closed),
last_deal_updated_at
The create-dbt-transformations skill writes the Bronze sources into dbt/models/bronze/schema.yml and the SQL into dbt/models/silver/ and dbt/models/gold/. Bronze is source-only — Silver reads raw via {{ source('bronze', '<table>') }}, Gold reads Silver via {{ ref() }}.
Review the SQL, then build and test:
dbt run
dbt test
After a green run you have company_pipeline_summary — one row per company with the pipeline metrics.
The deal → company join is a type mismatch: a deal's company association (hs_primary_associated_company) lands as NUMERIC, while companies.id (and deals.id) are STRING. The join only works once Silver casts the key to a matching type — that's what the typing in the Silver models above handles. The closed/won/lost flags (archived, hs_is_closed, …) currently sync as real BOOL columns, so no cast is needed there — but connector field types can shift across HubSpot API versions, so if a sync lands a flag as a string, cast it in Silver too.
Step 7 — Create the semantic model
The semantic model turns the Gold table into queryable measures and dimensions. Ask your AI coding agent:
Create a semantic model over company_pipeline_summary. Dimensions: company
name, industry. Measures: total open pipeline value, total won value, overall
win rate, and counts of open / won / lost deals. Link the cube to the HubSpot
connection.
The create-cubes skill writes a cube YAML into cubes/. Reconcile it with RevOS and confirm:
revos apply
revos status
You can render the model too:
Visualize the semantic model.
See Cubes and Project Scaffolding for the reconciliation workflow.
Step 8 — Ask a business question and see a chart
The payoff: ask questions of HubSpot data without SQL. Ask your AI coding agent:
Which companies have open pipeline but have never won a deal? Show open
pipeline value, sorted descending.
The query-semantic-model skill discovers your measures with revos cubes meta, composes a Cube query, runs it with revos cubes query, and renders the result inline — as a table, then an ASCII bar chart:
company open_pipeline open_deals won_deals
───────────────── ─────────────── ─────────── ─────────
Northwind Traders $ 84,000 3 0
Globex Corp $ 47,500 2 0
Amounts show in your HubSpot account's default currency ($ here). These are your at-risk accounts — companies with money in flight and nothing closed yet, the ones a CSM should prioritize. Try other shapes too:
Total open pipeline value and overall win rate across all companies.
Top 5 companies by won value.
Any question answerable from the cube works, and the chart appears in the chat — no UI hop.
Step 9 — Wrap-up
You've turned a HubSpot account into a governed datalake and a queryable model:
| Layer | What you built |
|---|---|
| Source | HubSpot connected via OAuth, credentials held server-side |
| Connection | hubspot.yaml (created in the UI, pulled to code) syncing deals + companies into BigQuery |
| Bronze | Raw HubSpot tables declared as dbt sources |
| Silver | stg_hubspot_deals, stg_hubspot_companies — typed, deduped, boolean flags |
| Gold | company_pipeline_summary — per-company pipeline metrics |
| Semantic model | A cube over the Gold table with measures, dimensions, and a link to the connection |
| First chart | A real revenue question answered inline in chat |
Where to go next
- Segments — save "At-Risk Accounts" (
open_deals > 0 AND won_deals = 0) as a reusable segment - Tables — build a "Pipeline Health" table over those accounts
- Data Masking & Stream Mappers — mask PII at ingest before it reaches the warehouse
- Connectors — pull from a system that isn't in the built-in catalog
- From Zero to Semantic Model — the same spine on a public sample dataset