The issue of categorical values in Data Science

Imon Rashid
Nov 18, 2024

Updated: Nov 19, 2024

(The notebook is attached at the bottom of the page to review.)

The Intricacies of Converting Qualitative Data to Quantitative Values

In data science, converting qualitative (categorical) data into quantitative form, for example through one-hot or ordinal encoding, can introduce significant challenges. These transformations assign numerical values to non-numeric categories, which may inadvertently create artificial relationships. For instance, ordinal-encoding roles as 1, 2, 3 might falsely imply a linear progression, as if the gap between roles 1 and 2 were the same as the gap between roles 2 and 3.

Moreover, these conversions often expand the dataset: one-hot encoding, for example, adds a column for every category. The extra dimensions complicate exploratory data analysis (EDA) and feature selection and raise the risk of overfitting. They also frequently force a return to earlier stages, including transformations, visualizations, and correlation analysis, to confirm that the insights are still accurate and meaningful.

In the attached notebook, the dataset proved unsuitable for predictive modeling because it relied heavily on categorical data. Converting these categorical variables into numerical forms (e.g., one-hot or ordinal encoding) introduced artificial relationships and increased dimensionality, leading to skewed insights during feature engineering. The correlation heatmap failed to reveal meaningful correlations because the transformed data lacked inherent quantitative relationships, making it difficult to establish strong predictive signals.

Additionally, the transformation process disrupted the interpretability of key features and required extensive rework of earlier stages like EDA, ultimately undermining the dataset's suitability for accurate and actionable predictions.
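
To illustrate why correlations computed on encoded categories can mislead, here is a minimal sketch. It assumes pandas is available; the rows and both role-to-number mappings are invented for illustration and are not taken from the attached notebook. The same labels are encoded twice with different arbitrary numberings, and the Pearson correlation with the target is computed each time.

```python
import pandas as pd

# Hypothetical leads; the rows are illustrative and not from the notebook.
leads = pd.DataFrame({
    "Role": ["Analyst", "Manager", "Director", "Analyst", "Director", "Manager"],
    "HighValueLead": [0, 0, 1, 0, 1, 1],
})

# Two equally arbitrary numberings of the same three categories.
mapping_a = {"Analyst": 1, "Manager": 2, "Director": 3}
mapping_b = {"Manager": 1, "Director": 2, "Analyst": 3}

# Pearson correlation between the encoded role and the target.
corr_a = leads["Role"].map(mapping_a).corr(leads["HighValueLead"])
corr_b = leads["Role"].map(mapping_b).corr(leads["HighValueLead"])

print(f"mapping_a correlation: {corr_a:.2f}")
print(f"mapping_b correlation: {corr_b:.2f}")
```

With this toy data the first numbering gives a positive correlation and the second a negative one, even though the underlying categories never changed. A heatmap built on such columns reflects the chosen numbering at least as much as the data.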

1. Challenges with Converting Qualitative Data to Quantitative

Skewing Analysis: Converting qualitative (categorical) data into quantitative forms (e.g., one-hot encoding, ordinal encoding) can bias the analysis. Ordinal encoding in particular assigns numbers to non-numeric categories, which can introduce an order and spacing that the categories do not actually have.
- Example: Encoding roles as 1, 2, 3 might suggest a linear relationship where none exists.
Increased Complexity: Conversions often expand the dataset (e.g., via one-hot encoding), making EDA and feature selection more complex. This added complexity increases the risk of overfitting and requires revisiting earlier steps, including transformations, visualizations, and correlation analysis.
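
Both challenges can be seen side by side in a small sketch. It assumes pandas; the role names and the 1, 2, 3 mapping are hypothetical. Ordinal encoding produces a single column with an implied scale, while one-hot encoding removes the scale but adds one column per category.

```python
import pandas as pd

# Hypothetical lead roles; the values are illustrative only.
leads = pd.DataFrame({"Role": ["Analyst", "Manager", "Director", "Manager"]})

# Ordinal encoding: an arbitrary mapping chosen for this sketch.
role_order = {"Analyst": 1, "Manager": 2, "Director": 3}
leads["RoleNumeric"] = leads["Role"].map(role_order)

# A model that treats RoleNumeric as a quantity now "sees" a Director
# as three times an Analyst, which the labels never claimed.
ratio = leads.loc[2, "RoleNumeric"] / leads.loc[0, "RoleNumeric"]

# One-hot encoding avoids the implied order but widens the dataset:
# one new column for every distinct category.
one_hot = pd.get_dummies(leads["Role"], prefix="Role")

print(leads)
print("Director / Analyst ratio:", ratio)
print("One-hot columns:", list(one_hot.columns))
```

With only three roles the expansion is small, but a column with hundreds of distinct values would add hundreds of features, which is where the added EDA effort and overfitting risk come from.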

2. Focus on Collecting Quantitative Data

Emphasizing quantitative data from the start reduces the need for extensive conversions and makes the analysis more reliable. Here's how you can revise your data collection strategy:

a. Incorporate Measurable Metrics

Instead of recording categorical variables (e.g., "Role"), try to capture measurable outcomes directly linked to the role's performance.
Examples:
- Interactions: Number of meetings, calls, or emails sent by the lead.
- Lead Score: A predefined scoring metric based on factors like engagement, intent, and historical data.
- Time in Pipeline: Days the lead spends in each stage of the sales funnel.
- Conversion Likelihood: A confidence score (percentage) based on historical data.
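
As one example, a metric such as Time in Pipeline could be derived from a log of stage changes. The sketch below assumes pandas and a hypothetical table with `lead_id`, `stage`, and `entered_at` columns; a real CRM export will likely use different names.

```python
import pandas as pd

# Hypothetical stage-change log; column names and dates are assumptions.
stage_log = pd.DataFrame({
    "lead_id": [1, 1, 1, 2, 2],
    "stage": ["New", "Qualified", "Proposal", "New", "Qualified"],
    "entered_at": pd.to_datetime([
        "2024-01-01", "2024-01-04", "2024-01-10",
        "2024-01-02", "2024-01-09",
    ]),
})

stage_log = stage_log.sort_values(["lead_id", "entered_at"])

# Days in a stage = time until the same lead enters its next stage.
# The current (last) stage of each lead has no end date yet, so it stays missing.
next_entry = stage_log.groupby("lead_id")["entered_at"].shift(-1)
stage_log["days_in_stage"] = (next_entry - stage_log["entered_at"]).dt.days

print(stage_log)
```

The result is a plain number of days per lead and stage that can go straight into a correlation analysis without any encoding step.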

b. Collect Behavioral Data

Quantify lead behavior instead of relying solely on attributes.
Examples:
- Page views or time spent on your website.
- Response time to emails or messages.
- Frequency of interactions (e.g., repeat visits or touchpoints).
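
Behavioral metrics like these usually start as a raw event log and are aggregated per lead. The following is a rough sketch assuming pandas and a hypothetical `events` table; the event names and the `response_hours` column are invented for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per interaction.
events = pd.DataFrame({
    "lead_id": [1, 1, 1, 2, 2],
    "event": ["page_view", "page_view", "email_reply", "page_view", "email_reply"],
    # Hours the lead took to reply; only meaningful for replies.
    "response_hours": [float("nan"), float("nan"), 5.0, float("nan"), 30.0],
})

# Flag page views so they can be counted with a simple sum.
events["is_page_view"] = events["event"].eq("page_view")

# One row of numeric features per lead.
features = events.groupby("lead_id").agg(
    page_views=("is_page_view", "sum"),
    touchpoints=("event", "count"),
    avg_response_hours=("response_hours", "mean"),
)

print(features)
```

Each resulting column is a count or an average, so it carries a real scale, unlike a label such as "engaged" or "not engaged".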

c. Contextual Metrics for Leads

Replace categorical variables like "Company Size" with specific numerical metrics, such as:
- Revenue or annual budget.
- Number of employees (quantified instead of small/medium/large).
- Industry growth rate (e.g., percent growth over 5 years).
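
A short sketch shows what a small/medium/large label throws away. It assumes pandas; the headcounts and the bin cut-offs (50 and 250 employees) are hypothetical.

```python
import pandas as pd

# Hypothetical headcounts for four companies.
companies = pd.DataFrame({"employees": [12, 60, 240, 1800]})

# Binning into labels, using assumed cut-offs of 50 and 250 employees.
companies["CompanySize"] = pd.cut(
    companies["employees"],
    bins=[0, 50, 250, float("inf")],
    labels=["small", "medium", "large"],
)

print(companies)
```

Here the companies with 60 and 240 employees both end up as "medium", so a fourfold difference in headcount disappears. Collecting the headcount itself keeps that information, and the label can still be derived later if a report needs it, whereas the reverse is not possible.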

d. Binary Indicators for Simple Categories

For variables like "Decision Maker," consider binary indicators (1 for yes, 0 for no) rather than multi-level categorical encodings. This avoids unnecessary complexity while preserving meaning.
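
A binary indicator of this kind might be built as follows. The sketch assumes pandas and a hypothetical free-text "Decision Maker" column; answers that are neither yes nor no are left missing instead of being guessed.

```python
import pandas as pd

# Hypothetical raw answers as they might arrive from a form or CRM export.
leads = pd.DataFrame({"Decision Maker": ["Yes", "no", " YES ", None]})

# Normalise whitespace and case, then map to 1/0.
answers = leads["Decision Maker"].str.strip().str.lower()

# The nullable Int64 dtype keeps unknown answers as missing values
# instead of silently turning them into 0.
leads["IsDecisionMaker"] = answers.map({"yes": 1, "no": 0}).astype("Int64")

print(leads)
```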

3. Refining the Data Collection Process

To ensure future datasets align with your goals, implement a more structured data collection plan:

a. Collaborate with Stakeholders

Work with sales, marketing, and operations teams to identify what metrics are most relevant for lead conversion.
Align the data collection process with practical, measurable KPIs.

b. Design Structured Data Input Forms

Use standardized forms or tools to collect numerical data from the start.
Example: Instead of "Rate engagement (low/medium/high)," ask for a numerical score (e.g., 1–10) or collect raw data like "number of calls."
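
On the input side, a form handler can reject non-numeric or out-of-range answers before they reach the dataset. This is a minimal sketch in plain Python; the function name `parse_engagement_score` and the 1 to 10 range are assumptions for illustration.

```python
def parse_engagement_score(raw: str, low: int = 1, high: int = 10) -> int:
    """Convert a form answer to an integer score, or raise ValueError."""
    # int() raises ValueError for text such as "high" or an empty string.
    value = int(raw.strip())
    if not low <= value <= high:
        raise ValueError(f"score must be between {low} and {high}, got {value}")
    return value


print(parse_engagement_score(" 7 "))

for bad_answer in ["high", "11"]:
    try:
        parse_engagement_score(bad_answer)
    except ValueError as err:
        print("rejected:", bad_answer, "->", err)
```

Rejecting bad input at entry time is usually cheaper than cleaning it up later during EDA.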

c. Incorporate Surveys and Feedback Tools

Use tools like surveys to capture direct feedback in numerical form.
Example: Instead of asking leads for feedback in open-ended text, provide Likert-scale questions ("Rate your satisfaction from 1 to 5").

4. Long-Term Benefits

Consistency: Raw numerical data tends to be more consistent and is less prone to subjective interpretation during feature engineering.
Reduced Rework: Collecting quantitative data minimizes the need for data transformations, reducing effort in EDA, correlation analysis, and feature selection.
Actionable Insights: Quantitative data often leads to clearer, more actionable insights and better alignment with predictive models.

5. Next Steps

Evaluate your existing data collection framework to identify where categorical data can be replaced or supplemented with quantitative metrics.
If historical data collection is biased toward categorical variables, consider running a smaller pilot study with quantitative data collection to validate whether this new approach improves analysis.
Focus on meaningful metrics that directly correlate with lead conversion outcomes, avoiding unnecessary complexity.

Conclusion

While categorical data can be informative, the conversions it requires during feature engineering can introduce biases and unnecessary rework. By redesigning the data collection process to focus on numerical metrics and behavioral data, you'll not only reduce manual effort but also make future analyses and modeling more reliable and interpretable.

1 Comment

Imon Rashid

Nov 20, 2024

Observations:

RoleNumeric and IsDecisionMaker: Correlation: 0.86

Interpretation: A strong positive correlation suggests that as the role becomes more significant (higher numeric value), the likelihood of being a decision-maker increases. This relationship could be expected if higher roles inherently have decision-making authority.

RoleNumeric and HighValueLead: Correlation: 0.42

Interpretation: A moderate positive correlation implies that higher-ranking roles are somewhat associated with high-value leads, but the relationship is not very strong. This could indicate that while senior roles are involved in high-value opportunities, there may be other factors at play.

CompanySizeNumeric and HighValueLead: Correlation: 0.41

Interpretation: This moderate correlation suggests that larger companies are slightly more likely to generate high-value leads. This makes sense if larger organizations…

Imon Rashid

Applied Data Science by MIT Professional Ed

Certified Salesforce Administrator

Certified Scrum Product Owner

Certified Scrum Master

B.Sc Computer Information System, Florida Atlantic University

M.Sc Information Assurance, Capitol Technology University

Author of the book "Business Analysis Fundamentals"
