Case Study
3x Outreach Engagement by Backtesting CRM Data



Background & Challenge
Company details have been redacted to protect our client's privacy.
An early-stage B2B SaaS startup had put thoughtful work into defining their ideal customer profile (ICP). Their scoring model was sophisticated — factoring in company size, support team headcount, funding stage, tech stack, and geography.
But despite the strong logic behind their targeting, outbound wasn’t working.
Engagement was low. Credit utilization for list-building via Clay was high. There was a clear misalignment between who they thought was their best-fit customer and who was actually responding.
The question became:
“Is our ICP scoring hypothesis actually predicting real-world results?”
Moving From Assumptions to Evidence: How We Backtested the ICP
Blossomer led a systematic backtesting process to answer two key questions:
1. Correlation-Level Insight: Do higher-scoring accounts and leads actually convert at higher rates?
2. Feature-Level Analysis: Which specific signals are the true drivers of engagement and conversion?
Our goal was to replace guesswork with data and determine whether the existing scoring model held up when tested against real outcomes.
Step 1: Gather Clean Historical Data
We merged data across HubSpot (Deals, Contacts, Companies) and Apollo (Sequences, Contact Engagement) to create a unified, account-level outcome table. Each row represented a company, enriched with:
- Company name
- Contact-level score (0–5)
- Account-level score (0–10)
- Tier bucket based on scoring
- Whether any email was opened (1/0)
- Whether any email was clicked (1/0)
- Whether any email received a reply (1/0)
- Whether a meeting was booked (1/0)
- Whether the deal reached key stages such as Qualified or Piloting (1/0)
- Contact title
Outcome Definition: We labeled a "conversion" as any deal reaching: Appointment Scheduled, Qualified to Buy, Decision Maker Bought-In, Piloting, Contract Sent, or Closed Won.
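To make the pipeline concrete, here is a minimal sketch of how such a merge can be assembled in pandas. The file names, column names, and join key below are illustrative assumptions, not the client's actual schema.

```python
import pandas as pd

# Illustrative inputs (assumed names): HubSpot company export, HubSpot deals,
# and Apollo contact-level engagement, all keyed by company domain.
companies = pd.read_csv("hubspot_companies.csv")    # company name, scores, tier, domain
deals = pd.read_csv("hubspot_deals.csv")            # deal_stage per company
engagement = pd.read_csv("apollo_engagement.csv")   # opened/clicked/replied/meeting per contact

# Deal stages counted as a "conversion"
CONVERTED_STAGES = {
    "Appointment Scheduled", "Qualified to Buy", "Decision Maker Bought-In",
    "Piloting", "Contract Sent", "Closed Won",
}

# Roll contact-level engagement up to the account level: any contact counts
account_engagement = engagement.groupby("company_domain", as_index=False).agg(
    opened_email=("opened", "max"),
    clicked_email=("clicked", "max"),
    replied_email=("replied", "max"),
    meeting_booked=("meeting_booked", "max"),
)

# An account converted if any of its deals reached a key stage
deals["converted"] = deals["deal_stage"].isin(CONVERTED_STAGES).astype(int)
account_outcomes = deals.groupby("company_domain", as_index=False)["converted"].max()

# One row per company: scores + engagement flags + conversion outcome
df = (
    companies
    .merge(account_engagement, on="company_domain", how="left")
    .merge(account_outcomes, on="company_domain", how="left")
    .fillna({"opened_email": 0, "clicked_email": 0, "replied_email": 0,
             "meeting_booked": 0, "converted": 0})
)
df.to_csv("scoring_backtest.csv", index=False)
```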
Step 2: Score vs. Outcome Correlation
We asked: Is there a meaningful relationship between the lead/account scores and actual outcomes like replies or meetings booked?
```python
from scipy.stats import pointbiserialr
import pandas as pd

# Load merged dataset
df = pd.read_csv("scoring_backtest.csv")

# Correlation between lead score and replies
correlation, p_value = pointbiserialr(df['lead_score'], df['replied_email'])
print(f"Lead Score ↔ Reply Correlation: {correlation:.2f} (p={p_value:.3f})")
```
🔍 What We Found:
- Correlation between lead score and meetings booked was weak (r < 0.2).
- Certain "Tier 1" accounts had no engagement at all.
- Some "Tier 2" or "Tier 3" accounts outperformed "Tier 1."
This suggested that the original scoring model was not predictive — and needed rethinking.
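The tier-level gaps were visible with a simple group-by on the merged table. This is a sketch; the tier and outcome column names below are assumptions consistent with the fields described in Step 1.

```python
# Average engagement and conversion rates per tier bucket; if the original
# scoring model were predictive, Tier 1 should lead on every column.
tier_summary = (
    df.groupby("tier")[["replied_email", "meeting_booked", "converted"]]
      .mean()
      .sort_index()
)
print(tier_summary.round(3))
```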
Step 3: Feature-Level Signal Strength Analysis
To uncover which specific signals actually drove engagement, we used logistic regression:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Candidate ICP signals and the outcome we want them to predict
features = ['employee_count', 'founding_year_recent', 'cx_hiring', 'recent_funding']
X = df[features]
y = df['meeting_booked']

# Hold out 20% of accounts to sanity-check the fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

# How well the signals predict booked meetings on held-out accounts
print(classification_report(y_test, model.predict(X_test)))
```
We confirmed that newer companies with recent CX hiring and active CX team growth were far more likely to engage.
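One caveat on reading the coefficients: they are only comparable across features when the inputs share a scale. A minimal variant that standardizes the features first, using the same assumed columns as above, looks like this:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize features so coefficient magnitudes are comparable across signals
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

coefs = pipeline.named_steps["logisticregression"].coef_[0]
for feature, coef in sorted(zip(features, coefs), key=lambda pair: -abs(pair[1])):
    print(f"{feature}: {coef:+.2f}")
```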
Feature Insights
| Signal | Predictive Strength |
|---|---|
| Founded After 2020 | Strong |
| Recent CX Executive Hire | Strong |
| CX Team Growth (15%+ MoM) | Strong |
| Funding Stage Alone | Weak |
| Employee Count Alone | Weak |
Step 4: Reweighting the Scoring Model
Based on the analysis, we:
- Increased weight on validated signals: founding year, CX hiring, CX team growth
- Decreased weight on weak predictors: funding stage alone, employee count alone
- Removed non-predictive criteria
The scoring model shifted from theory-based to evidence-backed.
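In practice, reweighting is just a change to the scoring function itself. The sketch below is hypothetical; the weights, thresholds, and field names are illustrative, not the client's production values.

```python
# Hypothetical reweighted account score (0-10): validated signals dominate,
# weak predictors survive only as a small tie-breaker.
def score_account(account: dict) -> float:
    score = 0.0
    if account.get("founded_year", 0) >= 2020:           # validated: recent founding
        score += 3.0
    if account.get("recent_cx_exec_hire", False):         # validated: recent CX exec hire
        score += 3.0
    if account.get("cx_team_growth_mom", 0.0) >= 0.15:    # validated: 15%+ MoM CX team growth
        score += 3.0
    if account.get("recent_funding", False):              # weak: kept as a light tie-breaker
        score += 0.5
    return min(score, 10.0)

print(score_account({"founded_year": 2022, "recent_cx_exec_hire": True,
                     "cx_team_growth_mom": 0.20}))  # -> 9.0
```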
🚀 Results: Focused Targeting, Real Outcomes
- Tripled engagement rates on outbound campaigns
- 80% reduction in Clay credit utilization (thanks to tighter, more precise lists)
- Sharper focus on high-fit conversations with decision-makers who were actually ready to engage
🎯 Key Takeaway
Even the best-designed ICP is still a hypothesis until you validate it. By systematically backtesting historical data, we helped this startup move from sophisticated guesswork to measurable, repeatable targeting — unlocking better conversations, higher engagement, and less wasted effort.
Start Building Your Outbound System
Book a free discovery call to understand how we work. You'll receive actionable strategies upfront, then decide if we're the right fit. We move fast, with no lengthy onboarding delays.
Blossomer, LLC © 2024, All Rights Reserved