
The issue of categorical values in Data Science

Updated: Nov 19

By Imon Rashid

November 18, 2024

(All Rights Reserved)

(The notebook is attached at the bottom of the page for review.)



The Intricacies of Converting Qualitative Data to Quantitative Values


In data science, converting qualitative (categorical) data into quantitative form, for example through one-hot or ordinal encoding, can introduce significant challenges. These transformations assign numerical values to non-numeric categories, which may inadvertently create artificial relationships. For instance, encoding roles as 1, 2, 3 implies both an order and equal spacing between the roles, neither of which may exist.
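
As a minimal sketch of this point, the snippet below assigns integer codes to three roles. The role names and codes are hypothetical and are not taken from the notebook. Once the codes exist, arithmetic on them looks meaningful even though nothing in the data supports it.

```python
# Hypothetical role labels; the integer codes are arbitrary assumptions.
roles = ["Analyst", "Manager", "Director"]
ordinal_codes = {role: i + 1 for i, role in enumerate(roles)}  # 1, 2, 3

# The codes imply equal spacing between neighbouring roles...
gap_low = ordinal_codes["Manager"] - ordinal_codes["Analyst"]
gap_high = ordinal_codes["Director"] - ordinal_codes["Manager"]
print(gap_low == gap_high)  # True

# ...and imply that the "average" of Analyst and Director is a Manager.
midpoint = (ordinal_codes["Analyst"] + ordinal_codes["Director"]) / 2
print(midpoint == ordinal_codes["Manager"])  # True
```

A model that treats the column as numeric will use exactly these relationships, whether or not they reflect the business.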

Moreover, these conversions often expand the dataset, since one-hot encoding adds a column for every category. The added dimensionality complicates exploratory data analysis (EDA) and feature selection and raises the risk of overfitting. It frequently means revisiting earlier stages, including transformations, visualizations, and correlation analysis, to ensure accurate and meaningful insights.

In the attached notebook, the dataset proved unsuitable for predictive modeling because it relied heavily on categorical data. Converting these categorical variables into numerical form (e.g., one-hot or ordinal encoding) introduced artificial relationships and increased dimensionality, which skewed insights during feature engineering. The correlation heatmap failed to reveal meaningful relationships because the encoded values carried no inherent quantitative meaning, making it difficult to establish strong predictive signals.
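
One way to see why a correlation heatmap on encoded categories can mislead is that the coefficient depends on which integer each category happens to receive. The sketch below uses made-up data, not the notebook's dataset, and a hand-written Pearson function so that it runs with the standard library alone.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical leads: a role label and whether the lead converted (1) or not (0).
roles = ["Analyst", "Manager", "Director", "Analyst", "Director", "Manager"]
converted = [0, 1, 1, 0, 1, 0]

# Two equally arbitrary integer codings of the same three labels.
coding_a = {"Analyst": 1, "Manager": 2, "Director": 3}
coding_b = {"Manager": 1, "Director": 2, "Analyst": 3}

r_a = pearson([coding_a[r] for r in roles], converted)
r_b = pearson([coding_b[r] for r in roles], converted)
print(round(r_a, 2), round(r_b, 2))  # about 0.82 and -0.41
```

The underlying data is identical in both runs. Only the label-to-number mapping changed, yet the sign of the correlation flipped.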

Additionally, the transformation process disrupted the interpretability of key features and required extensive rework of earlier stages like EDA, ultimately undermining the dataset's suitability for accurate and actionable predictions.


1. Challenges with Converting Qualitative Data to Quantitative

  • Skewing Analysis: Converting qualitative (categorical) data into quantitative forms (e.g., one-hot encoding, ordinal encoding) can lead to biases. These conversions assign numerical values to non-numeric categories, which may unintentionally introduce artificial relationships.

    • Example: Encoding roles as 1, 2, 3 might suggest a linear relationship where none exists.

  • Increased Complexity: Conversions often expand the dataset (e.g., via one-hot encoding), making EDA and feature selection more complex. This added complexity increases the risk of overfitting and requires revisiting earlier steps, including transformations, visualizations, and correlation analysis.
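
To make the expansion concrete, here is a small sketch that counts the columns one-hot encoding would produce. The lead records and field names are hypothetical. In a notebook, a library helper such as pandas.get_dummies would typically do this step.

```python
# Hypothetical lead records with three categorical fields (not the notebook's data).
leads = [
    {"role": "Analyst", "industry": "Retail", "region": "East"},
    {"role": "Manager", "industry": "Finance", "region": "West"},
    {"role": "Director", "industry": "Retail", "region": "North"},
    {"role": "Manager", "industry": "Health", "region": "East"},
]

# One-hot encoding creates one column per distinct (field, value) pair.
columns = sorted({f"{field}={value}" for lead in leads for field, value in lead.items()})

def encode(lead):
    """Return the 0/1 row for one lead across all one-hot columns."""
    present = {f"{field}={value}" for field, value in lead.items()}
    return [1 if col in present else 0 for col in columns]

encoded = [encode(lead) for lead in leads]
print(len(leads[0]), "original columns ->", len(columns), "one-hot columns")
```

Three fields become nine columns with only four rows of data, and each additional category value adds another column to explore and select from.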


2. Focus on Collecting Quantitative Data


Emphasizing quantitative data from the start reduces the need for extensive conversions and makes the analysis more reliable. Here's how you can revise your data collection strategy:

a. Incorporate Measurable Metrics

  • Instead of recording categorical variables (e.g., "Role"), try to capture measurable outcomes directly linked to the role's performance.

  • Examples:

    • Interactions: Number of meetings held, calls made, or emails sent by the lead.

    • Lead Score: A predefined scoring metric based on factors like engagement, intent, and historical data.

    • Time in Pipeline: Days the lead spends in each stage of the sales funnel.

    • Conversion Likelihood: A confidence score (percentage) based on historical data.
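
As an illustration of one such metric, time in pipeline can be derived directly from the dates on which a lead entered each stage. The stage names and dates below are hypothetical.

```python
from datetime import date

# Hypothetical stage-entry dates for one lead, in funnel order.
stage_entries = [
    ("New", date(2024, 1, 2)),
    ("Qualified", date(2024, 1, 9)),
    ("Proposal", date(2024, 1, 23)),
    ("Closed", date(2024, 2, 1)),
]

# Days in a stage = date the next stage began minus date this stage began.
days_in_stage = {
    stage: (next_start - start).days
    for (stage, start), (_, next_start) in zip(stage_entries, stage_entries[1:])
}
total_days = sum(days_in_stage.values())
print(days_in_stage)  # {'New': 7, 'Qualified': 14, 'Proposal': 9}
print(total_days)     # 30
```

The result is already numeric, so it needs no encoding before it is used in EDA or a model.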

b. Collect Behavioral Data

  • Quantify lead behavior instead of relying solely on attributes.

  • Examples:

    • Page views or time spent on your website.

    • Response time to emails or messages.

    • Frequency of interactions (e.g., repeat visits or touchpoints).
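
Behavioral metrics like these can usually be aggregated from a raw event log. The sketch below assumes a hypothetical log format of (lead id, event type, minutes taken to reply); the field names and values are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (lead_id, event_type, minutes_to_reply or None).
events = [
    ("lead_1", "page_view", None),
    ("lead_1", "email_reply", 45),
    ("lead_2", "page_view", None),
    ("lead_1", "page_view", None),
    ("lead_2", "email_reply", 180),
    ("lead_1", "email_reply", 15),
]

page_views = Counter(lead for lead, kind, _ in events if kind == "page_view")
touchpoints = Counter(lead for lead, _, _ in events)

reply_minutes = defaultdict(list)
for lead, kind, minutes in events:
    if kind == "email_reply":
        reply_minutes[lead].append(minutes)

# One numeric feature row per lead that has replied at least once.
features = {
    lead: {
        "page_views": page_views[lead],
        "avg_reply_minutes": sum(times) / len(times),
        "touchpoints": touchpoints[lead],
    }
    for lead, times in reply_minutes.items()
}
print(features["lead_1"])
```

Leads with no replies are left out here to avoid dividing by zero; how to handle them is a design decision for the real dataset.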

c. Contextual Metrics for Leads

  • Replace categorical variables like "Company Size" with specific numerical metrics, such as:

    • Revenue or annual budget.

    • Number of employees (quantified instead of small/medium/large).

    • Industry growth rate (e.g., percent growth over 5 years).
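
The sketch below shows what is lost when a headcount is stored only as a size label. The company names and the cut-offs of 50 and 250 employees are assumptions chosen for illustration.

```python
def size_bucket(employees):
    """Map a headcount to a label using hypothetical cut-offs."""
    if employees < 50:
        return "small"
    if employees < 250:
        return "medium"
    return "large"

# Hypothetical companies and headcounts.
companies = {"Acme": 51, "Globex": 249, "Initech": 12}
buckets = {name: size_bucket(count) for name, count in companies.items()}
print(buckets)  # {'Acme': 'medium', 'Globex': 'medium', 'Initech': 'small'}

# Acme and Globex share a label although one is almost five times larger.
ratio = companies["Globex"] / companies["Acme"]
print(round(ratio, 1))  # 4.9
```

Keeping the raw headcount preserves that difference, and a label can still be derived later if a report needs one.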

d. Binary Indicators for Simple Categories

  • For variables like "Decision Maker," consider binary indicators (1 for yes, 0 for no) rather than multi-level categorical encodings. This avoids unnecessary complexity while preserving meaning.
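
A binary indicator is simple to build and keeps a direct interpretation. In this sketch, with hypothetical answers, the mean of the 0/1 column is the share of leads who are decision makers.

```python
# Hypothetical answers to "Is this contact a decision maker?"
answers = ["Yes", "No", "Yes", "Yes"]

# 1 for yes, 0 for no.
is_decision_maker = [1 if answer == "Yes" else 0 for answer in answers]

# The mean of a 0/1 column is the proportion of 1s.
share = sum(is_decision_maker) / len(is_decision_maker)
print(is_decision_maker, share)  # [1, 0, 1, 1] 0.75
```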


3. Refining the Data Collection Process


To ensure future datasets align with your goals, implement a more structured data collection plan:

a. Collaborate with Stakeholders

  • Work with sales, marketing, and operations teams to identify which metrics are most relevant for lead conversion.

  • Align the data collection process with practical, measurable KPIs.

b. Design Structured Data Input Forms

  • Use standardized forms or tools to collect numerical data from the start.

  • Example: Instead of "Rate engagement (low/medium/high)," ask for a numerical score (e.g., 1–10) or collect raw data like "number of calls."
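
A structured form can enforce the numeric scale at the point of entry. The function below is a minimal sketch; its name and the 1 to 10 range follow the example above but are otherwise assumptions, not part of any specific form tool.

```python
def parse_engagement_score(raw):
    """Return an int from 1 to 10, or raise ValueError for anything else."""
    score = int(raw)  # raises ValueError for text such as "high"
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    return score

print(parse_engagement_score("7"))  # 7

for bad_input in ("high", "11"):
    try:
        parse_engagement_score(bad_input)
    except ValueError as error:
        print("rejected:", error)
```

Rejecting free-text answers such as "high" at input time means no encoding decision has to be made later during feature engineering.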

c. Incorporate Surveys and Feedback Tools

  • Use tools like surveys to capture direct feedback in numerical form.

  • Example: Instead of asking leads for feedback in open-ended text, provide Likert-scale questions ("Rate your satisfaction from 1 to 5").
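
Responses on a 1 to 5 scale can be summarized without any text processing. The sketch below uses hypothetical responses and reports the median and the share of favourable answers, since Likert values are ordered but not necessarily evenly spaced.

```python
from collections import Counter
from statistics import median

# Hypothetical answers to "Rate your satisfaction from 1 to 5".
responses = [4, 5, 3, 4, 2, 5, 4]

distribution = Counter(responses)           # how many leads chose each rating
typical = median(responses)                 # middle rating
favourable = sum(1 for r in responses if r >= 4) / len(responses)

print(sorted(distribution.items()))  # [(2, 1), (3, 1), (4, 3), (5, 2)]
print(typical)                       # 4
print(round(favourable, 2))          # 0.71
```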


4. Long-Term Benefits


  • Consistency: Raw numerical data is recorded on a fixed scale, so it is less prone to subjective interpretation during feature engineering.

  • Reduced Rework: Collecting quantitative data minimizes the need for data transformations, reducing effort in EDA, correlation analysis, and feature selection.

  • Actionable Insights: Quantitative data often leads to clearer, more actionable insights and better alignment with predictive models.


5. Next Steps


  • Evaluate your existing data collection framework to identify where categorical data can be replaced or supplemented with quantitative metrics.

  • If historical data collection is biased toward categorical variables, consider running a smaller pilot study with quantitative data collection to validate whether this new approach improves analysis.

  • Focus on meaningful metrics that directly correlate with lead conversion outcomes, avoiding unnecessary complexity.


Conclusion

While categorical data can be informative, the conversions it requires during feature engineering can introduce biases and unnecessary rework. By redesigning the data collection process to focus on numerical metrics and behavioral data, you'll not only reduce manual effort but also make future analyses and modeling more reliable and interpretable.





1 Comment


Observations:


RoleNumeric and IsDecisionMaker: Correlation: 0.86

Interpretation: A strong positive correlation suggests that as the role becomes more significant (higher numeric value), the likelihood of being a decision-maker increases. This relationship could be expected if higher roles inherently have decision-making authority.

RoleNumeric and HighValueLead: Correlation: 0.42

Interpretation: A moderate positive correlation implies that higher-ranking roles are somewhat associated with high-value leads, but the relationship is not very strong. This could indicate that while senior roles are involved in high-value opportunities, there may be other factors at play.

CompanySizeNumeric and HighValueLead: Correlation: 0.41

Interpretation: This moderate correlation suggests that larger companies are slightly more likely to generate high-value leads. This makes sense if larger organizations…

