Case Study

3x Outreach Engagement by Backtesting CRM Data

Background & Challenge

Company details have been redacted to respect our client's privacy.

An early-stage B2B SaaS startup had put thoughtful work into defining their ideal customer profile (ICP). Their scoring model was sophisticated — factoring in company size, support team headcount, funding stage, tech stack, and geography.

But despite the strong logic behind their targeting, outbound wasn’t working.

Engagement was low, and Clay credit consumption for list-building was high. There was a clear mismatch between who they believed their best-fit customer was and who was actually responding.

The question became:

“Is our ICP scoring hypothesis actually predicting real-world results?”


Moving From Assumptions to Evidence: How We Backtested the ICP

Blossomer led a systematic backtesting process to answer two key questions:

  1. Correlation-Level Insight: Do higher-scoring accounts and leads actually convert at higher rates?

  2. Feature-Level Analysis: Which specific signals are the true drivers of engagement and conversion?

Our goal was to replace guesswork with data and test whether the existing scoring model held up against real outcomes.


Step 1: Gather Clean Historical Data

We merged data across HubSpot (Deals, Contacts, Companies) and Apollo (Sequences, Contact Engagement) to create a unified, account-level outcome table. Each row represented a company, enriched with:

Field            Description
account_name     Company name
lead_score       Contact-level score (0–5)
account_score    Account-level score (0–10)
tier             Tier bucket based on scoring
opened_email     1 if any email was opened
clicked_email    1 if any email was clicked
replied_email    1 if any email received a reply
meeting_booked   1 if a meeting was booked
deal_conversion  1 if the deal reached a key stage (Qualified, Piloting, etc.)
job_title        Contact title

Outcome Definition: We labeled a "conversion" as any deal reaching: Appointment Scheduled, Qualified to Buy, Decision Maker Bought-In, Piloting, Contract Sent, or Closed Won.
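
For illustration, here is a minimal sketch of how an outcome table like this can be assembled with pandas. The file names, column names (domain, deal_stage, opened, etc.), and merge keys are assumptions made for the example, not the client's actual schema.

import pandas as pd

# Stages we count as a conversion (per the outcome definition above)
CONVERSION_STAGES = {
    "Appointment Scheduled", "Qualified to Buy", "Decision Maker Bought-In",
    "Piloting", "Contract Sent", "Closed Won",
}

# Hypothetical CRM exports: HubSpot deals/companies and Apollo engagement
deals = pd.read_csv("hubspot_deals.csv")           # domain, deal_stage
companies = pd.read_csv("hubspot_companies.csv")   # domain, account_name, account_score, tier
engagement = pd.read_csv("apollo_engagement.csv")  # domain, opened, clicked, replied, meeting

# Roll contact-level engagement up to one row per account
account_engagement = (
    engagement.groupby("domain")
    .agg(
        opened_email=("opened", "max"),
        clicked_email=("clicked", "max"),
        replied_email=("replied", "max"),
        meeting_booked=("meeting", "max"),
    )
    .reset_index()
)

# Label a conversion at the account level from the deal stage
deals["deal_conversion"] = deals["deal_stage"].isin(CONVERSION_STAGES).astype(int)
conversions = deals.groupby("domain")["deal_conversion"].max().reset_index()

# Join everything on the company domain; accounts with no activity get zeros
df = (
    companies.merge(account_engagement, on="domain", how="left")
    .merge(conversions, on="domain", how="left")
    .fillna({"opened_email": 0, "clicked_email": 0, "replied_email": 0,
             "meeting_booked": 0, "deal_conversion": 0})
)
df.to_csv("scoring_backtest.csv", index=False)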


Step 2: Score vs. Outcome Correlation

We asked: Is there a meaningful relationship between the lead/account scores and actual outcomes like replies or meetings booked?

from scipy.stats import pointbiserialr
import pandas as pd

# Load the merged account-level dataset built in Step 1
df = pd.read_csv("scoring_backtest.csv")

# Point-biserial correlation (Pearson r where one variable is binary)
# between lead score and whether any email received a reply
correlation, p_value = pointbiserialr(df['lead_score'], df['replied_email'])
print(f"Lead Score ↔ Reply Correlation: {correlation:.2f} (p={p_value:.3f})")


🔍 What We Found:

  • Correlation between lead score and meetings booked was weak (r < 0.2).

  • Certain "Tier 1" accounts had no engagement at all.

  • Some "Tier 2" or "Tier 3" accounts outperformed "Tier 1."

This suggested that the original scoring model was not predictive — and needed rethinking.
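
The tier mismatch also shows up directly in a simple group-by, using the tier and outcome columns defined in Step 1:

# Average engagement and conversion rates per scoring tier
tier_summary = df.groupby('tier')[['replied_email', 'meeting_booked',
                                   'deal_conversion']].mean()
print(tier_summary)

If the model were predictive, Tier 1 rates would sit clearly above Tiers 2 and 3; in this dataset, they did not.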


Step 3: Feature-Level Signal Strength Analysis

To uncover which specific signals actually drove engagement, we used logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

features = ['employee_count', 'founding_year_recent', 'cx_hiring', 'recent_funding']
# Standardize so coefficient magnitudes are comparable across features
X = StandardScaler().fit_transform(df[features])
y = df['meeting_booked']

# Stratify so the (rare) positive class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Sanity-check generalization on held-out data before reading coefficients
print(classification_report(y_test, model.predict(X_test)))

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

We confirmed that newer companies with recent CX hiring and active CX team growth were far more likely to engage.

Feature Insights

Signal                     Predictive Strength
Founded After 2020         Strong
Recent CX Executive Hire   Strong
CX Team Growth (15%+ MoM)  Strong
Funding Stage Alone        Weak
Employee Count Alone       Weak


Step 4: Reweighting the Scoring Model

Based on the analysis, we:

  • Increased weight on validated signals: founding year, CX hiring, CX growth

  • Decreased weight on weak predictors: funding stage alone, employee count alone

  • Removed non-predictive criteria

The scoring model shifted from theory-based to evidence-backed.
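
As a sketch of what the reweighted model could look like, the function below scores an account from the validated and demoted signals. The weights and signal names are hypothetical, chosen to illustrate the shift, not the client's production values.

# Hypothetical reweighted account score (0-10 scale); weights are
# illustrative only, not the client's production values
def score_account(founded_after_2020: bool,
                  recent_cx_exec_hire: bool,
                  cx_team_growth: bool,
                  late_stage_funding: bool,
                  employee_count_fit: bool) -> float:
    score = 0.0
    # Validated signals carry nearly all of the weight
    score += 3.0 if founded_after_2020 else 0.0
    score += 3.0 if recent_cx_exec_hire else 0.0
    score += 3.0 if cx_team_growth else 0.0
    # Weak predictors survive only as minor tie-breakers
    score += 0.5 if late_stage_funding else 0.0
    score += 0.5 if employee_count_fit else 0.0
    return score

# Example: a young company with a new CX exec but no other signals -> 6.0
print(score_account(True, True, False, False, False))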


🚀 Results: Focused Targeting, Real Outcomes

  • Tripled engagement rates on outbound campaigns

  • 80% reduction in Clay credit utilization (thanks to tighter, more precise lists)

  • Sharper focus on high-fit conversations with decision-makers who were actually ready to engage


🎯 Key Takeaway

Even the best-designed ICP is still a hypothesis until you validate it. By systematically backtesting historical data, we helped this startup move from sophisticated guesswork to measurable, repeatable targeting — unlocking better conversations, higher engagement, and less wasted effort.

Background & Challenge

Company details were redacted in order to respect the privacy of our client.

An early-stage B2B SaaS startup had put thoughtful work into defining their ideal customer profile (ICP). Their scoring model was sophisticated — factoring in company size, support team headcount, funding stage, tech stack, and geography.

But despite the strong logic behind their targeting, outbound wasn’t working.

Engagement was low. Credit utilization for list-building via Clay was high. There was a clear misalignment between who they thought was their best-fit customer and who was actually responding.

The question became:

“Is our ICP scoring hypothesis actually predicting real-world results?”


Moving From Assumptions to Evidence: How We Backtested the ICP

Blossomer led a systematic backtesting process to answer two key questions:

  1. Correlation-Level Insight: Do higher-scoring accounts and leads actually convert at higher rates?

  2. Feature-Level Analysis: Which specific signals are the true drivers of engagement and conversion?

Our goal was to replace guesswork with data — and prove whether the existing scoring model held up when tested against real outcomes.


Step 1: Gather Clean Historical Data

We merged data across HubSpot (Deals, Contacts, Companies) and Apollo (Sequences, Contact Engagement) to create a unified, account-level outcome table. Each row represented a company, enriched with:

Field

Description

account_name

Company name

lead_score

Contact-level score (0–5)

account_score

Account-level score (0–10)

tier

Tier bucket based on scoring

opened_email

1 if any email opened

clicked_email

1 if any email clicked

replied_email

1 if any email replied

meeting_booked

1 if a meeting was booked

deal_conversion

1 if deal reached key stages (Qualified, Piloting, etc.)

job_title

Contact title

Outcome Definition: We labeled a "conversion" as any deal reaching: Appointment Scheduled, Qualified to Buy, Decision Maker Bought-In, Piloting, Contract Sent, or Closed Won.


Step 2: Score vs. Outcome Correlation

We asked: Is there a meaningful relationship between the lead/account scores and actual outcomes like replies or meetings booked?

from scipy.stats import pointbiserialr
import pandas as pd

# Load merged dataset
df = pd.read_csv("scoring_backtest.csv")

# Correlation between lead score and replies
correlation, p_value = pointbiserialr(df['lead_score'], df['replied_email'])
print(f"Lead Score ↔ Reply Correlation: {correlation:.2f} (p={p_value:.3f})")


🔍 What We Found:

  • Correlation between lead score and meetings booked was weak (r < 0.2).

  • Certain "Tier 1" accounts had no engagement at all.

  • Some "Tier 2" or "Tier 3" accounts outperformed "Tier 1."

This suggested that the original scoring model was not predictive — and needed rethinking.


Step 3: Feature-Level Signal Strength Analysis

To uncover which specific signals actually drove engagement, we used logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ['employee_count', 'founding_year_recent', 'cx_hiring', 'recent_funding']
X = df[features]
y = df['meeting_booked']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

We confirmed that newer companies with recent CX hiring and active CX team growth were far more likely to engage.

Feature Insights

Signal

Predictive Strength

Founded After 2020

Strong

Recent CX Executive Hire

Strong

CX Team Growth (15%+ MoM)

Strong

Funding Stage Alone

Weak

Employee Count Alone

Weak


Step 4: Reweighting the Scoring Model

Based on the analysis, we:

  • Increased weight on validated signals: founding year, CX hiring, CX growth

  • Decreased weight on weak predictors: funding stage alone, employee count alone

  • Removed non-predictive criteria

The scoring model shifted from theory-based to evidence-backed.


🚀 Results: Focused Targeting, Real Outcomes

  • Tripled engagement rates on outbound campaigns

  • 80% reduction in Clay credit utilization (thanks to tighter, more precise lists)

  • Sharper focus on high-fit conversations with decision-makers who were actually ready to engage


🎯 Key Takeaway

Even the best-designed ICP is still a hypothesis until you validate it. By systematically backtesting historical data, we helped this startup move from sophisticated guesswork to measurable, repeatable targeting — unlocking better conversations, higher engagement, and less wasted effort.

Background & Challenge

Company details were redacted in order to respect the privacy of our client.

An early-stage B2B SaaS startup had put thoughtful work into defining their ideal customer profile (ICP). Their scoring model was sophisticated — factoring in company size, support team headcount, funding stage, tech stack, and geography.

But despite the strong logic behind their targeting, outbound wasn’t working.

Engagement was low. Credit utilization for list-building via Clay was high. There was a clear misalignment between who they thought was their best-fit customer and who was actually responding.

The question became:

“Is our ICP scoring hypothesis actually predicting real-world results?”


Moving From Assumptions to Evidence: How We Backtested the ICP

Blossomer led a systematic backtesting process to answer two key questions:

  1. Correlation-Level Insight: Do higher-scoring accounts and leads actually convert at higher rates?

  2. Feature-Level Analysis: Which specific signals are the true drivers of engagement and conversion?

Our goal was to replace guesswork with data — and prove whether the existing scoring model held up when tested against real outcomes.


Step 1: Gather Clean Historical Data

We merged data across HubSpot (Deals, Contacts, Companies) and Apollo (Sequences, Contact Engagement) to create a unified, account-level outcome table. Each row represented a company, enriched with:

Field

Description

account_name

Company name

lead_score

Contact-level score (0–5)

account_score

Account-level score (0–10)

tier

Tier bucket based on scoring

opened_email

1 if any email opened

clicked_email

1 if any email clicked

replied_email

1 if any email replied

meeting_booked

1 if a meeting was booked

deal_conversion

1 if deal reached key stages (Qualified, Piloting, etc.)

job_title

Contact title

Outcome Definition: We labeled a "conversion" as any deal reaching: Appointment Scheduled, Qualified to Buy, Decision Maker Bought-In, Piloting, Contract Sent, or Closed Won.


Step 2: Score vs. Outcome Correlation

We asked: Is there a meaningful relationship between the lead/account scores and actual outcomes like replies or meetings booked?

from scipy.stats import pointbiserialr
import pandas as pd

# Load merged dataset
df = pd.read_csv("scoring_backtest.csv")

# Correlation between lead score and replies
correlation, p_value = pointbiserialr(df['lead_score'], df['replied_email'])
print(f"Lead Score ↔ Reply Correlation: {correlation:.2f} (p={p_value:.3f})")


🔍 What We Found:

  • Correlation between lead score and meetings booked was weak (r < 0.2).

  • Certain "Tier 1" accounts had no engagement at all.

  • Some "Tier 2" or "Tier 3" accounts outperformed "Tier 1."

This suggested that the original scoring model was not predictive — and needed rethinking.


Step 3: Feature-Level Signal Strength Analysis

To uncover which specific signals actually drove engagement, we used logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ['employee_count', 'founding_year_recent', 'cx_hiring', 'recent_funding']
X = df[features]
y = df['meeting_booked']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

We confirmed that newer companies with recent CX hiring and active CX team growth were far more likely to engage.

Feature Insights

Signal

Predictive Strength

Founded After 2020

Strong

Recent CX Executive Hire

Strong

CX Team Growth (15%+ MoM)

Strong

Funding Stage Alone

Weak

Employee Count Alone

Weak


Step 4: Reweighting the Scoring Model

Based on the analysis, we:

  • Increased weight on validated signals: founding year, CX hiring, CX growth

  • Decreased weight on weak predictors: funding stage alone, employee count alone

  • Removed non-predictive criteria

The scoring model shifted from theory-based to evidence-backed.


🚀 Results: Focused Targeting, Real Outcomes

  • Tripled engagement rates on outbound campaigns

  • 80% reduction in Clay credit utilization (thanks to tighter, more precise lists)

  • Sharper focus on high-fit conversations with decision-makers who were actually ready to engage


🎯 Key Takeaway

Even the best-designed ICP is still a hypothesis until you validate it. By systematically backtesting historical data, we helped this startup move from sophisticated guesswork to measurable, repeatable targeting — unlocking better conversations, higher engagement, and less wasted effort.

Background & Challenge

Company details were redacted in order to respect the privacy of our client.

An early-stage B2B SaaS startup had put thoughtful work into defining their ideal customer profile (ICP). Their scoring model was sophisticated — factoring in company size, support team headcount, funding stage, tech stack, and geography.

But despite the strong logic behind their targeting, outbound wasn’t working.

Engagement was low. Credit utilization for list-building via Clay was high. There was a clear misalignment between who they thought was their best-fit customer and who was actually responding.

The question became:

“Is our ICP scoring hypothesis actually predicting real-world results?”


Moving From Assumptions to Evidence: How We Backtested the ICP

Blossomer led a systematic backtesting process to answer two key questions:

  1. Correlation-Level Insight: Do higher-scoring accounts and leads actually convert at higher rates?

  2. Feature-Level Analysis: Which specific signals are the true drivers of engagement and conversion?

Our goal was to replace guesswork with data — and prove whether the existing scoring model held up when tested against real outcomes.


Step 1: Gather Clean Historical Data

We merged data across HubSpot (Deals, Contacts, Companies) and Apollo (Sequences, Contact Engagement) to create a unified, account-level outcome table. Each row represented a company, enriched with:

Field

Description

account_name

Company name

lead_score

Contact-level score (0–5)

account_score

Account-level score (0–10)

tier

Tier bucket based on scoring

opened_email

1 if any email opened

clicked_email

1 if any email clicked

replied_email

1 if any email replied

meeting_booked

1 if a meeting was booked

deal_conversion

1 if deal reached key stages (Qualified, Piloting, etc.)

job_title

Contact title

Outcome Definition: We labeled a "conversion" as any deal reaching: Appointment Scheduled, Qualified to Buy, Decision Maker Bought-In, Piloting, Contract Sent, or Closed Won.


Step 2: Score vs. Outcome Correlation

We asked: Is there a meaningful relationship between the lead/account scores and actual outcomes like replies or meetings booked?

from scipy.stats import pointbiserialr
import pandas as pd

# Load merged dataset
df = pd.read_csv("scoring_backtest.csv")

# Correlation between lead score and replies
correlation, p_value = pointbiserialr(df['lead_score'], df['replied_email'])
print(f"Lead Score ↔ Reply Correlation: {correlation:.2f} (p={p_value:.3f})")


🔍 What We Found:

  • Correlation between lead score and meetings booked was weak (r < 0.2).

  • Certain "Tier 1" accounts had no engagement at all.

  • Some "Tier 2" or "Tier 3" accounts outperformed "Tier 1."

This suggested that the original scoring model was not predictive — and needed rethinking.


Step 3: Feature-Level Signal Strength Analysis

To uncover which specific signals actually drove engagement, we used logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ['employee_count', 'founding_year_recent', 'cx_hiring', 'recent_funding']
X = df[features]
y = df['meeting_booked']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

We confirmed that newer companies with recent CX hiring and active CX team growth were far more likely to engage.

Feature Insights

Signal

Predictive Strength

Founded After 2020

Strong

Recent CX Executive Hire

Strong

CX Team Growth (15%+ MoM)

Strong

Funding Stage Alone

Weak

Employee Count Alone

Weak


Step 4: Reweighting the Scoring Model

Based on the analysis, we:

  • Increased weight on validated signals: founding year, CX hiring, CX growth

  • Decreased weight on weak predictors: funding stage alone, employee count alone

  • Removed non-predictive criteria

The scoring model shifted from theory-based to evidence-backed.


🚀 Results: Focused Targeting, Real Outcomes

  • Tripled engagement rates on outbound campaigns

  • 80% reduction in Clay credit utilization (thanks to tighter, more precise lists)

  • Sharper focus on high-fit conversations with decision-makers who were actually ready to engage


🎯 Key Takeaway

Even the best-designed ICP is still a hypothesis until you validate it. By systematically backtesting historical data, we helped this startup move from sophisticated guesswork to measurable, repeatable targeting — unlocking better conversations, higher engagement, and less wasted effort.

Start Building Your Outbound System

Book a free discovery call to understand how we work. You'll receive actionable strategies upfront, then decide if we're the right fit. We move fast, with no lengthy onboarding delays.

We help early-stage B2B SaaS founders land their first 100 customers through expertly built, fully managed outbound systems. Get qualified opportunities without hiring SDRs, and build predictable revenue with systems that scale.

Blossomer, LLC © 2024, All Rights Reserved
