How I Cleaned 40,000 Messy HubSpot Records Using Data Hub and ChatGPT

Written by Oscar Gonzalez | Nov 12, 2025 6:12:32 PM

This all started as a weekend project.
I thought I’d spend a few hours cleaning up a handful of messy records in HubSpot.

Then I realized there weren’t a few. There were over 40,000 company records — and they were all inconsistent.

Different naming formats. Missing domains. Blank industries.
Some had no contacts. Others had duplicates across regions.
Every report and workflow downstream was breaking because of it.

That weekend project turned into a full-blown system rebuild. But what came out of it was a repeatable way to clean and maintain data from inside HubSpot — no spreadsheets, no exports, no manual cleanup days.

Here’s exactly how I did it, how long it took, and what I’d do differently next time.

Step 1: Understand the Mess

Estimated time: 1–2 hours

Before fixing anything, I needed to understand what was broken.

Bad data wasn’t just an inconvenience — it was making reports useless and creating confusion between teams. So I started by defining what a “healthy” company record should look like:

Has a valid website domain
Has a clean company name
Has at least one associated contact or deal
Has a filled-in industry and country
Shows recent activity or engagement

Those criteria became the backbone of the first workflow: Data Health Score.
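
As a rough illustration, the checklist above could be expressed as a small Python function. This is a sketch under stated assumptions, not the workflow itself: the field names (`domain`, `num_contacts`, `last_activity`, and so on) are hypothetical, "clean name" and "valid domain" are reduced to "not empty", and the 90-day window for "recent" is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone


def missing_criteria(company, now=None, recent_days=90):
    """Return the health criteria that a company record (a dict) fails.

    Field names and the 90-day default are assumptions for illustration.
    """
    now = now or datetime.now(timezone.utc)
    problems = []
    if not company.get("domain"):
        problems.append("no website domain")
    if not (company.get("name") or "").strip():
        problems.append("no company name")
    if company.get("num_contacts", 0) + company.get("num_deals", 0) == 0:
        problems.append("no associated contact or deal")
    if not company.get("industry") or not company.get("country"):
        problems.append("missing industry or country")
    last = company.get("last_activity")
    if last is None or now - last > timedelta(days=recent_days):
        problems.append("no recent activity")
    return problems
```

An empty list means the record meets all five criteria; anything else is a to-do list for that record.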

Step 2: Build the Workflows

Estimated time: 6–8 hours total

I built three main workflows inside HubSpot. Each handles a different part of the cleanup.

Workflow 1 – Data Health Score

Start every company at 50 points.
Then apply penalties or bonuses depending on data quality.

No website domain → -25
Consumer domain (like Gmail) → -20
Subdomain instead of apex → -15
No contacts or deals → -10
Missing industry → -10
Recent activity → +5
Multiple contacts or deals → +10

Scores are capped between 0 and 100. Each record ends up with a number that tells you how healthy it is and what needs attention.
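
For readers who want to see the arithmetic, here is a minimal Python sketch of the scoring rules listed above. A few things are assumptions rather than details from the workflow: the three domain penalties are treated as mutually exclusive, "multiple contacts or deals" is read as more than one linked record in total, and the consumer-domain list is a short illustrative sample.

```python
# Illustrative sample only; a real list of consumer domains would be longer.
CONSUMER_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}


def data_health_score(domain, is_subdomain, num_contacts, num_deals,
                      industry, recent_activity):
    """Apply the listed penalties and bonuses to a starting score of 50."""
    score = 50

    # Domain penalties (assumed mutually exclusive: only the first match applies).
    if not domain:
        score -= 25
    elif domain.lower() in CONSUMER_DOMAINS:
        score -= 20
    elif is_subdomain:
        score -= 15

    # Association penalty or bonus.
    linked = num_contacts + num_deals
    if linked == 0:
        score -= 10
    elif linked > 1:
        score += 10

    if not industry:
        score -= 10
    if recent_activity:
        score += 5

    # Clamp to the 0-100 range.
    return max(0, min(100, score))
```

With only the adjustments listed above, results land between 5 and 65, so the clamp acts as a safety net. Reaching the higher tiers used in the next workflow would take additional bonuses that aren't listed here.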

Workflow 2 – Recommended Action

Once the score is calculated, this workflow determines what to do next:

80 or above: Merge
60–79: Assign as Parent or Child (depending on subdomain or region)
Below 60: Flag for review

It also assigns a confidence level (High, Medium, Low) and writes a note to the “Scoring Rationale” field so we know why.
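
The threshold logic can be sketched in a few lines of Python. The cutoffs come from the list above; everything else is an assumption for illustration, including the rule that subdomain or regional records become the child, the function name, and the wording of the rationale. The confidence level is left out because the post doesn't say how it is derived.

```python
def recommend_action(score, is_subdomain=False, is_regional=False):
    """Map a health score to (action, rationale).

    The parent/child rule is a guess: subdomain or regional records are
    treated as children, everything else in the band as a parent.
    """
    if score >= 80:
        return "Merge", f"Score {score} is 80 or above"
    if score >= 60:
        if is_subdomain or is_regional:
            return "Assign as Child", f"Score {score} is 60-79, subdomain or regional record"
        return "Assign as Parent", f"Score {score} is 60-79, apex domain record"
    return "Flag for review", f"Score {score} is below 60"
```

Returning the rationale alongside the action mirrors the idea of writing to the “Scoring Rationale” field, so every decision carries its own explanation.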

Workflow 3 – Autofill Company Name

When a company name is missing, this workflow fills it in safely.
If there’s a domain, it uses that (e.g., nestle.com → Nestlé).
If not, it looks at the primary contact’s company field.

Each auto-filled record also gets Name Autofill Status set to Auto-filled, and the original name is stored in a custom property called Previous Company Name.

It’s clean, transparent, and reversible if needed.
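
A minimal Python sketch of that fallback order, under a few assumptions: the record is a plain dict with hypothetical keys, the domain is already an apex domain, and the name is simply the first label of the domain, capitalized. That means this version would produce "Nestle", not "Nestlé"; restoring accents or official spellings needs a lookup that isn't shown here.

```python
def autofill_company_name(company, primary_contact_company=None):
    """Return a copy of the record with the name filled in when it is missing.

    Existing names are never overwritten. Key names are hypothetical.
    """
    updated = dict(company)
    if (company.get("name") or "").strip():
        return updated  # Name already present: leave the record untouched.

    domain = (company.get("domain") or "").strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]

    if domain:
        # First label of the domain, e.g. "acme-corp.com" -> "Acme Corp".
        candidate = domain.split(".")[0].replace("-", " ").title()
    else:
        candidate = (primary_contact_company or "").strip()

    if candidate:
        # Keep the old value so the change can be reversed later.
        updated["previous_company_name"] = company.get("name") or ""
        updated["name"] = candidate
        updated["name_autofill_status"] = "Auto-filled"
    return updated
```

Working on a copy and storing the previous value is what keeps the change reversible.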

Step 3: Add Custom Properties

Estimated time: 1 hour

To make everything trackable and transparent, I added a few key custom properties:

Data Health Score
Data Health Confidence
Recommended Action
Scoring Rationale
Duplicate Flag
Name Autofill Status
Previous Company Name

These properties make it easy to filter, report, and explain what changed. They’re the backbone of all dashboards and views.
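
One way to keep these properties consistent is to describe them as data and validate values before writing them. The sketch below is illustrative only: the internal names are hypothetical, the type labels ("number", "enumeration", and so on) are plain tags for this example and not a claim about HubSpot's property schema, and the option lists are inferred from the workflows described earlier.

```python
# Internal names are hypothetical; labels come from the list above.
CUSTOM_PROPERTIES = [
    {"name": "data_health_score", "label": "Data Health Score", "type": "number"},
    {"name": "data_health_confidence", "label": "Data Health Confidence",
     "type": "enumeration", "options": ["High", "Medium", "Low"]},
    {"name": "recommended_action", "label": "Recommended Action",
     "type": "enumeration",
     "options": ["Merge", "Assign as Parent", "Assign as Child", "Flag for review"]},
    {"name": "scoring_rationale", "label": "Scoring Rationale", "type": "string"},
    {"name": "duplicate_flag", "label": "Duplicate Flag", "type": "bool"},
    {"name": "name_autofill_status", "label": "Name Autofill Status", "type": "string"},
    {"name": "previous_company_name", "label": "Previous Company Name", "type": "string"},
]


def invalid_values(values):
    """Return the names of properties whose value doesn't fit its declared type."""
    specs = {p["name"]: p for p in CUSTOM_PROPERTIES}
    bad = []
    for name, value in values.items():
        spec = specs.get(name)
        if spec is None:
            bad.append(name)  # Unknown property.
        elif spec["type"] == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                bad.append(name)
        elif spec["type"] == "bool":
            if not isinstance(value, bool):
                bad.append(name)
        elif spec["type"] == "enumeration":
            if value not in spec["options"]:
                bad.append(name)
        elif not isinstance(value, str):
            bad.append(name)
    return bad
```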

Step 4: Bring in ChatGPT

Estimated time: 2–3 hours (first version)

Here’s the truth: I’m not a developer.

I know how to map logic and build workflows, but I don’t write code from scratch. That’s where ChatGPT came in.

I used it to write and refine small snippets for tasks like checking whether a company’s website actually loads or verifying if two records share the same apex domain.
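
To give a sense of what such a snippet might look like, here is a minimal Python sketch of the apex-domain comparison. It is an illustration, not the code from this project. The function names are made up, and the handling of multi-part suffixes like co.uk relies on a tiny hard-coded sample; a production version would consult the full Public Suffix List.

```python
from urllib.parse import urlparse

# Tiny illustrative sample; the real Public Suffix List is far larger.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp", "com.br", "com.cn"}


def apex_domain(value):
    """Reduce a URL or hostname to its apex domain, e.g. us.brand.com -> brand.com."""
    value = value.strip().lower()
    if "://" not in value:
        value = "//" + value  # Lets urlparse treat a bare hostname as a host.
    host = urlparse(value).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return host
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def same_apex(a, b):
    """True when both values resolve to the same non-empty apex domain."""
    apex_a, apex_b = apex_domain(a), apex_domain(b)
    return bool(apex_a) and apex_a == apex_b
```

Note that brand.fr and brand.de count as different apex domains here, so regional sites would still need their own matching rule.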

The first versions weren’t perfect. I’d copy, paste, test, fix errors, and repeat until it worked.
After a few iterations, the workflow provided accurate results over 90 percent of the time.

ChatGPT didn’t just write the code; it explained it. That helped me understand the logic behind each snippet and gave me the confidence to tweak it later.

The goal wasn’t to automate coding. It was to build a system that could evolve.

Step 5: Use Enrichment Wisely

Estimated time: 30 minutes

HubSpot enrichment has changed a lot recently.
According to HubSpot’s official billing documentation, enrichment credits no longer apply to HubSpot’s own AI-powered data tools for most subscriptions.

In fact, HubSpot announced an upgrade to its record enrichment capabilities, confirming that standard record enrichment no longer consumes credits for Starter, Professional, or Enterprise tiers using Breeze Intelligence.

As PALO Creative’s breakdown explains, enrichment is now effectively unlimited when you’re using HubSpot’s native Data Hub features.

Credits still apply if you connect external data partners like Clearbit or ZoomInfo, but the built-in tools now handle enrichment automatically.

That means enrichment can run in the background without you worrying about limits. And when enrichment doesn’t fill in the blanks, the workflow and code combo picks up right where it leaves off.

Step 6: Test and Refine

Estimated time: 3–4 hours

Before rolling it out across 40,000 records, I tested the system on smaller batches.

I looked for edge cases like:

Companies using regional TLDs (.fr, .de, .cn)
Subdomain-based records (us.brand.com vs brand.com)
Contacts using personal email domains

Every test helped improve accuracy. I adjusted penalty weights, updated field logic, and refined the code through ChatGPT until everything worked smoothly.

Step 7: Review and Roll Out

Estimated time: 2–3 hours for pilot + 1–2 days full rollout

Once the pilot group looked good, I scaled it up.

The end result:

Standardized company names
Clean domain data
Correctly set parent–child relationships
A clear health score across every company

And the biggest win? We could finally trust our CRM again.
No more guessing which record to keep or merge. The workflows made those decisions visible and logical.

Step 8: Keep It Self-Healing

Ongoing

Data gets messy again the moment you stop watching it.
That’s why the system isn’t just a cleanup — it’s a maintenance loop.

New companies are automatically scored, enriched, and reviewed.
Low-confidence records are flagged.
And because everything is native to HubSpot, there’s no manual upkeep.

If you’re using HubSpot Data Hub and still managing data through exports or spreadsheets, this approach can change how you think about cleanup forever.

Final Thoughts

It took about a week of part-time work to build, test, and refine everything.
But the payoff was huge — cleaner data, faster workflows, and zero time spent on manual imports.

I didn’t write a single line of code on my own. I just asked ChatGPT to help me build, edit, and improve what I already knew I needed.

Now the workflow handles 90 percent of the cleanup automatically and gives clear feedback on the rest.

It’s not perfect. But it’s evolving.
And that’s exactly the point.
