Build a Datalake from HubSpot

This tutorial takes your HubSpot CRM from a fresh RevOS project all the way to a queryable semantic model — connect the source, sync deals and companies into your warehouse, layer dbt transforms (Bronze → Silver → Gold), publish a cube, and answer a real revenue question without writing SQL.

Where From Zero to Semantic Model starts from a public sample dataset, this one starts from live data you own: the difference is the front half — connecting a Source and syncing it into your warehouse. The transform-and-query spine is the same.

Persona: RevOps / data engineer on a new RevOS org, with a HubSpot account and nothing else. Goal: A pipeline-health view per company — open / won / lost deals, open pipeline value, won value, win rate — and the ability to ask questions of it. Time: ~30–40 minutes (most of it waiting on the first sync).

AI coding agents

This tutorial is written for use with an AI coding agent and has been validated with Claude Code. It should also work with any agent that reads the project's AGENTS.md file — that file points the agent at the right skills for each step.

The skills these steps rely on (explore-lakehouse, create-dbt-transformations, create-cubes, query-semantic-model, and more) are installed automatically by revos init into .claude/skills/ — you don't install them separately. See AI Skills.

Step 1 — Install the CLI and log in

Install the RevOS CLI globally:

npm install -g @revos/cli

Verify it, then authenticate (this opens a browser for OAuth):

revos --version
revos auth login
revos org list

For more detail see Installation and Authentication.

Step 2 — Scaffold a project

Create a new RevOS project:

revos init hubspot-datalake

The CLI prompts you to select an organization, then generates the medallion project layout (dbt/models/{bronze,silver,gold}, cubes/, connections/, connectors/), a Dev Container, and the RevOS AI skills under .claude/skills/.

Open hubspot-datalake in VS Code and click Reopen in Container when prompted. All subsequent commands run inside the Dev Container. Finish setup once inside:

revos init          # signs you in (if needed) and provisions the GCP service-account key

See Project Scaffolding for the full breakdown.

Step 3 — Connect HubSpot as a Source

A Source is the external system you pull from. HubSpot authentication (OAuth) happens server-side, so creating a Source opens the RevOS UI:

revos sources create

Pick HubSpot, complete the OAuth consent in the browser, select the tables to sync (for the pipeline-health view, at least deals and companies — add contacts, owners, products, and deal_pipelines for more context), and save. Saving does two things: RevOS stores the credentials server-side (they never touch your project tree) and provisions a Connection for the tables you picked, which immediately starts syncing into your BigQuery dataset.

note

This command opens the RevOS web app in your browser to add the Source and pick tables. Inside the Dev Container it can't launch a browser for you — it prints the app URL to open on your host machine, so finish the OAuth and table selection there.

Back in the terminal, confirm the Source exists and note its id:

revos sources list --json

You can inspect the streams a Source advertises at any time:

revos sources list-streams <source-id> --json | jq '.[] | {streamName, syncModes, defaultCursorField}'

HubSpot exposes CRM streams including contacts, companies, deals, deal_pipelines, owners, and products. You choose which to sync in the wizard above; this command is handy for reviewing each stream's cursor and supported sync modes.

Step 4 — Pull the Connection into your project

Selecting tables when you added the Source already created a Connection server-side — the sync pipeline that lands those streams into your org's BigQuery dataset — and the first sync is already running. You don't author it from scratch; you pull it into your project as code so it's versioned and tunable.

revos init pulls existing Connections (and Cubes) when it scaffolds the project, so if you created the Source before running init it's already there. If you added the Source afterwards, refresh your local state:

revos pull

This writes connections/hubspot.yaml, mirroring what you configured in the UI:

apiVersion: revos/v1
kind: Connection
metadata:
  name: hubspot
  id: <connection-id> # written back by pull
spec:
  name: HubSpot
  source:
    id: <source-id>
  schedule: { units: 24, timeUnit: hours }
  status: active
  prefix: hubspot_
  streams:
    - name: deals
      syncMode: incremental_deduped_history
      cursorField: [updatedAt]
      primaryKey: [[id]]
    - name: companies
      syncMode: incremental_deduped_history
      cursorField: [updatedAt]
      primaryKey: [[id]]
    # contacts, deal_pipelines, owners, … — whatever you selected

<connection-id> and <source-id> are placeholders — revos pull writes the real values into the file for you; you never fill them in by hand.

From here the Connection is yours to tune as code — change the schedule, add or drop streams, or mask fields — then reconcile:

revos diff
revos apply

Check the sync status and confirm the raw tables land:

revos connections list --json
revos status

The first sync of deals + companies is small and completes in a few minutes. Once it's done, the raw tables (hubspot_deals, hubspot_companies, …) exist in your BigQuery dataset.

tip

Need to mask a field before it ever reaches the warehouse — hash an email, drop a notes column? Add a mappers: block to the stream. See Data Masking & Stream Mappers.

Step 5 — Explore the lakehouse

Before transforming, see what landed. Ask your AI coding agent:

What HubSpot data do we have in our lakehouse?

The explore-lakehouse skill inspects your dataset and returns a plain-English summary — table names, row counts, column types, and the fields worth cleaning. Note the type quirks it surfaces (HubSpot ships booleans and ids as strings); the next step handles them.

Step 6 — Build Bronze → Silver → Gold transformations

Now build the medallion layers. Ask your AI coding agent:

Create dbt transformations on the HubSpot data:
- Silver:
  - stg_hubspot_deals — typed, deduped deals; cast amount to numeric, parse
    timestamps, derive is_closed / is_won / is_lost boolean flags from the
    deal stage, keep company association id as a typed key
  - stg_hubspot_companies — cleaned company records (id, name, industry, domain)
- Gold:
  - company_pipeline_summary (grain: company) — join deals to companies and
    aggregate per company: open_deals, won_deals, lost_deals,
    open_pipeline_value, won_value, win_rate (won / closed),
    last_deal_updated_at

The create-dbt-transformations skill writes the Bronze sources into dbt/models/bronze/schema.yml and the SQL into dbt/models/silver/ and dbt/models/gold/. Bronze is source-only — Silver reads raw via {{ source('bronze', '<table>') }}, Gold reads Silver via {{ ref() }}.

Review the SQL, then build and test:

dbt run
dbt test

After a green run you have company_pipeline_summary — one row per company with the pipeline metrics.

note

The deal → company join is a type mismatch: a deal's company association (hs_primary_associated_company) lands as NUMERIC, while companies.id (and deals.id) are STRING. The join only works once Silver casts the key to a matching type — that's what the typing in the Silver models above handles. The closed/won/lost flags (archived, hs_is_closed, …) currently sync as real BOOL columns, so no cast is needed there — but connector field types can shift across HubSpot API versions, so if a sync lands a flag as a string, cast it in Silver too.

Step 7 — Create the semantic model

The semantic model turns the Gold table into queryable measures and dimensions. Ask your AI coding agent:

Create a semantic model over company_pipeline_summary. Dimensions: company
name, industry. Measures: total open pipeline value, total won value, overall
win rate, and counts of open / won / lost deals. Link the cube to the HubSpot
connection.

The create-cubes skill writes a cube YAML into cubes/. Reconcile it with RevOS and confirm:

revos apply
revos status

You can render the model too:

Visualize the semantic model.

See Cubes and Project Scaffolding for the reconciliation workflow.

Step 8 — Ask a business question and see a chart

The payoff: ask questions of HubSpot data without SQL. Ask your AI coding agent:

Which companies have open pipeline but have never won a deal? Show open
pipeline value, sorted descending.

The query-semantic-model skill discovers your measures with revos cubes meta, composes a Cube query, runs it with revos cubes query, and renders the result inline — as a table, then an ASCII bar chart:

company              open_pipeline   open_deals   won_deals
─────────────────  ───────────────  ───────────  ─────────
Northwind Traders        $ 84,000             3           0
Globex Corp              $ 47,500             2           0

Amounts show in your HubSpot account's default currency ($ here). These are your at-risk accounts — companies with money in flight and nothing closed yet, the ones a CSM should prioritize. Try other shapes too:

Total open pipeline value and overall win rate across all companies.

Top 5 companies by won value.

Any question answerable from the cube works, and the chart appears in the chat — no UI hop.

Step 9 — Wrap-up

You've turned a HubSpot account into a governed datalake and a queryable model:

Layer	What you built
Source	HubSpot connected via OAuth, credentials held server-side
Connection	`hubspot.yaml` (created in the UI, pulled to code) syncing deals + companies into BigQuery
Bronze	Raw HubSpot tables declared as dbt sources
Silver	`stg_hubspot_deals`, `stg_hubspot_companies` — typed, deduped, boolean flags
Gold	`company_pipeline_summary` — per-company pipeline metrics
Semantic model	A cube over the Gold table with measures, dimensions, and a link to the connection
First chart	A real revenue question answered inline in chat

Where to go next

Segments — save "At-Risk Accounts" (open_deals > 0 AND won_deals = 0) as a reusable segment
Tables — build a "Pipeline Health" table over those accounts
Data Masking & Stream Mappers — mask PII at ingest before it reaches the warehouse
Connectors — pull from a system that isn't in the built-in catalog
From Zero to Semantic Model — the same spine on a public sample dataset

AI coding agents​

Step 1 — Install the CLI and log in​

Step 2 — Scaffold a project​

Step 3 — Connect HubSpot as a Source​

Step 4 — Pull the Connection into your project​

Step 5 — Explore the lakehouse​

Step 6 — Build Bronze → Silver → Gold transformations​

Step 7 — Create the semantic model​

Step 8 — Ask a business question and see a chart​

Step 9 — Wrap-up​

Where to go next​