1. Executive Summary
This document defines a credit scoring framework that addresses two distinct customer segments (Individuals and SMEs) using a confidence-weighted, multi-source approach. The framework is designed for the Iraqi market, where traditional credit bureau data is sparse or nonexistent.
Core Innovation: Two-Dimensional Confidence
Every data source in this framework is evaluated on two dimensions:
- Expected Performance Benchmark - Typical predictive power observed for this data type in academic literature and industry deployments (measured in Gini coefficient / AUC), treated as a prior expectation not a hard ceiling
- Data Availability Confidence - How much data we have for this specific customer, relative to the minimum threshold needed for reliability
Design Decisions Made
| Decision | Choice | Rationale |
|---|---|---|
| Cold Start Baseline | Psychometric + ID verification | Works with zero financial history |
| Bootstrap Strategy | Simulation/synthetic based on regional benchmarks | No existing default data available |
| Customer Journey | Progressive unlock + milestone-based | Industry standard (Tala/Branch model) |
| Model Architecture | Shared core + segment overlays | Best practice per McKinsey/FICO |
| SME Owner Credit | Decreasing weight over time | 80-90% for startups → 10-20% for established (Section 6.3) |
| Conflict Resolution | Document all three approaches | Implementation decision deferred |
2. Segment-Specific Scoring Framework
2.1 Individual Scoring
Individuals are scored primarily on behavioral signals that indicate willingness and capacity to repay.
Data Sources for Individuals
| Category | Data Points | Iraq Availability | Predictive Value |
|---|---|---|---|
| Telecom | Top-up patterns, bill payments, SIM age, network quality | High (Zain, Asiacell, Korek) | High |
| Psychometric | Conscientiousness, locus of control, impulsivity, fluid intelligence | Universal (quiz-based) | Medium-High |
| Mobile Wallet | Transaction velocity, merchant payments, P2P transfers, balance patterns | Medium (Zain Cash, AsiaHawala) | High |
| Device Behavioral | App usage, battery patterns, form-filling behavior | High (Android SDK) | Medium |
| Identity | INID verification, biometric matching, SIM-binding | High (national infrastructure) | Gatekeeper |
| Bank Transactions | Salary deposits, expense patterns, balance consistency | Low-Medium (limited bank penetration) | Very High (when available) |
Individual Scoring Formula
Fairness Note on Network Score: The Ahl Score (family/network financial health) must be implemented using behavioral metadata only (e.g., contact call/payment patterns), never identity attributes (tribe, sect, region). It requires explicit fairness testing before production to ensure no proxy discrimination, and may be removed if regulatory or fairness concerns arise.
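As a sketch of the confidence-weighted combination described in Sections 1 and 4 (the source names, weights, and 0-1 component scale here are illustrative assumptions, not calibrated values):

```python
def individual_score(source_scores, base_weights, confidences, lo=300, hi=850):
    """Confidence-weighted blend of per-source scores (each in [0, 1]),
    mapped onto the FICO-familiar 300-850 range (Section 2.3).

    Each source's base weight is scaled by its data-availability confidence
    (Section 4.1) and the weights are renormalized, so a missing source
    redistributes its weight rather than dragging the score down.
    """
    eff = {s: base_weights[s] * confidences.get(s, 0.0) for s in source_scores}
    total = sum(eff.values())
    if total == 0:
        return None  # no usable data: cold start path (Section 5)
    blended = sum(source_scores[s] * eff[s] for s in source_scores) / total
    return lo + blended * (hi - lo)

# Illustrative inputs only
scores = {"telecom": 0.7, "psychometric": 0.6, "wallet": 0.8}
weights = {"telecom": 0.35, "psychometric": 0.25, "wallet": 0.40}
conf = {"telecom": 1.0, "psychometric": 1.0, "wallet": 0.5}
print(round(individual_score(scores, weights, conf)))
```

Note how the wallet source's half confidence shrinks its effective weight from 0.40 to 0.20 before renormalization, which is the mechanism Section 6.1 relies on.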
2.2 SME/Company Scoring
SMEs require fundamentally different data because risk drivers differ from individuals.
Data Sources for SMEs
| Category | Data Points | Iraq Availability | Predictive Value |
|---|---|---|---|
| Cash Flow | Bank inflows/outflows, transaction volume, revenue consistency | Medium (requires bank link) | Very High |
| Trade Payments | Supplier payment timing, Days Beyond Terms (DBT) | Low (no trade bureau) | High |
| Owner Personal Credit | Principal's FICO-equivalent, personal debt load | Via individual scoring | High (decreases with maturity) |
| Business Stability | Years in operation, employee count, legal structure | Medium (company registry) | Medium |
| Digital Ledger | POS transactions, e-commerce sales, invoice records | Low-Medium (emerging) | High |
| Sector/Geography | Industry risk factors, regional stability | Available (can be modeled) | Adjustment factor |
SME Scoring Formula
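A minimal sketch of the SME blend, assuming 0-1 component scores and the 0-300 SBSS-style output range from Section 2.3; the owner/business split follows the maturity decay schedule of Section 6.3, and the additive sector/geography adjustment is an assumed mechanism:

```python
def sme_score(business_component, owner_component, owner_weight,
              sector_geo_adjustment=0.0):
    """SME score = owner personal credit blended with business data, plus a
    sector/geography adjustment factor (Section 2.2).

    owner_weight follows the Section 6.3 decay schedule (e.g., ~0.85 for a
    startup, ~0.15 for a 10+ year business). Components are on a 0-1 scale;
    the result is clamped and mapped to 0-300 (SBSS-style).
    """
    blended = owner_weight * owner_component + (1 - owner_weight) * business_component
    return max(0.0, min(1.0, blended + sector_geo_adjustment)) * 300

# Startup: owner credit dominates; mature business: cash-flow data dominates
print(round(sme_score(0.7, 0.5, 0.85)), round(sme_score(0.7, 0.5, 0.15)))
```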
2.3 Segment Comparison
| Dimension | Individuals | SMEs |
|---|---|---|
| Primary Risk Driver | Willingness to repay (character) | Capacity to repay (cash flow) |
| Cold Start Data | Psychometric + telecom | Owner personal + psychometric |
| Score Range | 300-850 (FICO-familiar) | 0-300 (SBSS-style) or 1-100 |
| Key Predictors | Bill payment, stability signals | Revenue trend, trade payments |
| Data Maturity | 6 months telecom minimum | 6-12 months transactions minimum |
| Personal Guarantee | N/A | Required until business credit established |
3. Data Source Tiers & Accuracy Benchmarks
Each data source has an expected performance benchmark—the typical Gini coefficient range observed in academic literature and industry deployments. These are prior expectations, not hard ceilings; actual performance varies based on label definition, product terms, segment mix, feature engineering, and macroeconomic conditions.
Tier 1: Highest Predictive Power (Gini 0.35-0.50)
| Data Source | Standalone Gini | AUC Range | Min Data Points | Source |
|---|---|---|---|---|
| Bank Transaction Data | 0.40-0.50 | 0.70-0.75 | 6 months history | BIS |
| Combined Ensemble (3+ sources) | 0.50-0.65 | 0.75-0.83 | Varies | Djeundje et al. |
| Trade Payment History | 0.35-0.45 | 0.68-0.73 | 12+ tradelines | D&B Research |
Tier 2: Medium Predictive Power (Gini 0.25-0.40)
| Data Source | Standalone Gini | AUC Range | Min Data Points | Source |
|---|---|---|---|---|
| Telecom Data (CDR, top-ups) | 0.30-0.40 | 0.65-0.70 | 6 months | Tala/Branch |
| Psychometric Assessment | 0.25-0.35 | 0.63-0.68 | Single assessment | LenddoEFL, MDPI |
| Mobile Wallet Transactions | 0.30-0.38 | 0.65-0.69 | 3 months | M-Shwari Kenya |
Tier 3: Supplementary/Thin-File (Gini 0.15-0.25)
| Data Source | Standalone Gini | AUC Range | Min Data Points | Source |
|---|---|---|---|---|
| Device Metadata | 0.15-0.25 | 0.58-0.63 | Single app session | CredoLab |
| Social Graph (metadata only) | 0.10-0.20 | 0.55-0.60 | Contact list | Academic |
| Utility Payments | 0.15-0.22 | 0.58-0.61 | 6 months | FICO |
Ensemble Lift Effect
Combining sources lifts performance well beyond any single source's benchmark:
| Combination | Expected Gini | Lift vs Best Single |
|---|---|---|
| Psychometric alone | 0.30 | Baseline |
| Psychometric + Telecom | 0.42 | +40% |
| Psychometric + Telecom + Transactions | 0.55 | +83% |
| Full ensemble (5+ sources) | 0.60-0.65 | +100-117% |
4. Confidence Scoring & Data Requirements
4.1 Confidence Calculation
For each data source, confidence is calculated based on data availability relative to minimum thresholds:
Note: Minimum_Viable (see Section 4.2) is a gating threshold; below it, exclude the source entirely (or set Data_Confidence ≈ 0) even if the formula yields a non-zero value.
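The calculation can be sketched as follows; the linear ramp from a per-source confidence floor at Minimum_Viable up to 1.0 at Optimal is an assumption chosen to match the examples in the Section 4.2 table (e.g., bank transactions at 0.25 for 3 months, 1.0 at 12 months):

```python
def data_confidence(available, minimum_viable, optimal, floor):
    """Confidence in a data source, scaled by availability.

    Below minimum_viable the source is gated out entirely (confidence 0);
    at minimum_viable confidence starts at the per-source floor; it then
    ramps linearly to 1.0 at the optimal threshold.
    """
    if available < minimum_viable:
        return 0.0  # gating: exclude the source (per the note above)
    if available >= optimal:
        return 1.0
    frac = (available - minimum_viable) / (optimal - minimum_viable)
    return floor + (1.0 - floor) * frac

# Bank transactions: min 3 months (floor 0.25), optimal 12 months
print(data_confidence(3, 3, 12, 0.25), data_confidence(12, 3, 12, 0.25))
```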
4.2 Minimum Data Thresholds by Source
| Data Source | Minimum Viable | Optimal | Confidence at Minimum |
|---|---|---|---|
| Psychometric | 1 complete assessment | 1 assessment | 1.0 (binary) |
| Telecom (SIM age) | 6 months | 12+ months | 0.5 at 6mo, 1.0 at 12mo |
| Telecom (top-ups) | 50 transactions | 150+ transactions | Linear scale |
| Bank Transactions | 3 months | 12 months | 0.25 at 3mo, 1.0 at 12mo |
| Mobile Wallet | 30 transactions | 100+ transactions | Linear scale |
| Trade Payments (SME) | 3 tradelines | 10+ tradelines | 0.3 at 3, 1.0 at 10 |
| Device Behavioral | 1 app session | 5+ sessions | 0.2 per session |
4.3 Ground Truth Requirements
For model training and validation:
| Metric | Minimum | Optimal | Notes |
|---|---|---|---|
| Total samples | 3,000 | 10,000+ | For robust model training |
| Default ("bad") samples | 300-450 | 800-1,200 | Critical for learning default patterns |
| Time for validation | 6 months | 12 months | Need to observe repayment outcomes |
| Default rate in sample | 5-15% | 8-12% | Too low = insufficient signal |
Notes:
- Default definition: X DPD (e.g., 60+ for short-tenor microloans; 90+ commonly used in bank portfolios), to be finalized during pilot based on product terms and regulatory guidance.
- Sampling: If observed portfolio default rate is below 5%, use case-control sampling (oversample defaults) for training, then recalibrate PDs to the true base rate before deployment (AUC/Gini unaffected; calibration is).
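The PD recalibration step after case-control sampling can be done with the standard log-odds offset (prior correction); the function below is a generic sketch, not a prescribed implementation:

```python
import math

def recalibrate_pd(p_model, rate_sample, rate_true):
    """Adjust a PD estimated on an oversampled (case-control) training set
    back to the true portfolio base rate.

    Shifts the model's log-odds by the difference between the true and
    sample base-rate log-odds; ranking (AUC/Gini) is unchanged, only the
    probability calibration moves.
    """
    offset = (math.log(rate_true / (1 - rate_true))
              - math.log(rate_sample / (1 - rate_sample)))
    logit = math.log(p_model / (1 - p_model)) + offset
    return 1 / (1 + math.exp(-logit))

# A 50% PD trained on a balanced sample maps back to the 5% base rate
print(recalibrate_pd(0.5, rate_sample=0.5, rate_true=0.05))
```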
5. Cold Start Strategy
5.1 Baseline: Psychometric + ID Verification
When a customer has zero history, the system uses:
1. Identity Verification (Gatekeeper)
   - INID (Unified National Card) biometric match
   - Liveness detection (anti-spoofing)
   - SIM-binding verification
2. Psychometric Assessment (Scoring)
   - 15-30 minute gamified quiz
   - Measures: Conscientiousness, Locus of Control, Impulsivity, Fluid Intelligence
   - Expected Gini: 0.25-0.35 standalone
   - Accuracy: ~70% classification (AdviceRobo benchmark)
5.2 Cold Start Score Interpretation
| Psychometric Score | Risk Tier | Recommended Action |
|---|---|---|
| Top 20% | Low Risk | Approve small initial loan |
| Middle 60% | Medium Risk | Approve micro-loan with tight limits |
| Bottom 20% | High Risk | Decline or request additional data |
5.3 Progression Pathway (Ladder Model)
Based on Tala/Branch industry standard:
| Stage | Data Available | Typical Loan Size | Default Risk* |
|---|---|---|---|
| Cold Start | Psychometric + ID only | $10-50 | 15-25% |
| Warm (1-2 loans repaid) | + Repayment history | $50-150 | 10-15% |
| Established (3-5 loans) | + Telecom + wallet data | $150-300 | 5-10% |
| Mature (6+ loans) | Full profile | $300-500+ | 3-5% |
*Assumption Warning: Default rates shown are illustrative benchmarks from comparable emerging markets (Kenya, Philippines). Actual rates in Iraq will vary significantly based on underwriting policy, economic conditions, and customer selection. These figures should be recalibrated after 6-12 months of portfolio performance data.
Unlock Triggers:
- On-time repayment of current loan
- Early repayment accelerates progression
- Adding new data sources (bank link, employer verification)
- Platform engagement (using wallet features, bill pay)
5.4 Bootstrapping via Simulation
Since no historical default data exists, initial model weights will be:
- Literature-based: Use published Gini coefficients as starting weights
- Regional benchmarks: Adapt models validated in similar markets (Kenya, Philippines, Jordan)
- Conservative bias: Start with tighter approval thresholds, loosen as data accumulates
- Learning portfolio: Small initial loans to generate outcome data within 6-12 months
6. Dynamic Weight Adjustment
6.1 Weight Rebalancing Logic
Weights are not static—they adjust based on:
- Data availability per customer: More data → higher weight for that source
- Cross-source validation: Conflicting sources → both get reduced weight
- Outcome feedback: Sources that predict well get increased weight over time
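The outcome-feedback rule above can be sketched as a multiplicative update (an illustrative heuristic, with an assumed small floor so no source's weight collapses to zero):

```python
def update_weights(weights, source_ginis):
    """Shift weight toward sources that predict well on observed outcomes.

    Each base weight is multiplied by that source's measured Gini on the
    outcome data (floored at 0.01, an arbitrary choice that keeps weak
    sources alive), then weights are renormalized to sum to 1.
    """
    raw = {s: weights[s] * max(source_ginis.get(s, 0.0), 0.01) for s in weights}
    total = sum(raw.values())
    return {s: v / total for s, v in raw.items()}

# Source "a" predicted twice as well as "b", so it gains weight
print(update_weights({"a": 0.5, "b": 0.5}, {"a": 0.4, "b": 0.2}))
```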
6.2 Customer Journey Model
Progressive Unlock + Milestone-Based (Industry Standard)
Re-scoring Triggers:
- Each loan application
- Each repayment (on-time, early, or late)
- New data source added
- Quarterly periodic review (for active customers)
6.3 SME Owner Credit Decay Schedule
For SME scoring, owner personal credit weight decreases as business matures:
| Business Age | Owner Credit Weight | Business Data Weight |
|---|---|---|
| 0-1 years | 80-90% | 10-20% |
| 1-2 years | 60-70% | 30-40% |
| 2-5 years | 40-50% | 50-60% |
| 5-10 years | 20-30% | 70-80% |
| 10+ years | 10-20% | 80-90% |
Transition Logic: Weight shifts when business demonstrates:
- 12+ months of transaction history
- Positive trade payment record
- Stable or growing revenue trend
- No owner-level delinquencies
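The decay schedule can be implemented as a simple lookup; taking the midpoint of each band in the Section 6.3 table is an assumption (the bands themselves leave the exact weight open):

```python
def owner_credit_weight(business_age_years):
    """Owner personal-credit weight by business age, using the midpoints of
    the Section 6.3 bands; the business-data weight is the complement."""
    schedule = [(1, 0.85), (2, 0.65), (5, 0.45), (10, 0.25)]
    for age_cap, weight in schedule:
        if business_age_years < age_cap:
            return weight
    return 0.15  # 10+ years: business data dominates

print(owner_credit_weight(0.5), owner_credit_weight(12))
```

A production version would also apply the transition conditions listed above (transaction history, trade record, revenue trend, no delinquencies) before allowing the weight to step down.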
7. Conflict Resolution Approaches
When data sources provide conflicting signals (e.g., psychometric says "trustworthy" but transactions show erratic spending), three approaches exist:
Approach A: Hybrid Fusion (Best Performing)
How it works:
- Combine related sources early (e.g., all telecom signals → single telecom score)
- Keep dissimilar sources separate (telecom vs psychometric vs transactions)
- Meta-learner resolves conflicts using learned weights
Pros:
- Best empirical performance in research
- Handles non-linear relationships
- Adapts to data characteristics
Cons:
- Complex to implement
- Requires sufficient training data
- Less interpretable
Approach B: Information Value (IV) Weighted Averaging
How it works:
- Calculate IV for each data source during model training
- Higher IV = higher weight in final score
- Conflicts resolved by mathematical averaging
Typical IV-based weights:
| Source | IV Score | Weight |
|---|---|---|
| Transaction data | 0.34 | 34% |
| Bureau data | 0.32 | 32% |
| Psychometric | 0.19 | 19% |
| Device/behavioral | 0.15 | 15% |
Pros:
- Transparent and explainable
- Easy to implement
- Regulatory-friendly
Cons:
- Assumes linear relationships
- May not capture complex interactions
Approach C: Implicit ML Resolution
How it works:
- Feed all features to ensemble model (XGBoost, LightGBM)
- Model learns optimal feature interactions automatically
- SHAP values provide post-hoc explanation
Pros:
- Often best predictive accuracy
- Handles feature interactions
- Discovers unexpected patterns
Cons:
- "Black box" concerns
- Requires explainability layer (SHAP/LIME)
- Risk of overfitting
Recommendation
Start with Approach B (IV-weighted) for interpretability and regulatory acceptance. Transition to Approach A or C as data volume and technical capability mature.
8. Model Architecture Decision
8.1 Chosen Architecture: Shared Core + Segment Overlays
Based on McKinsey and FICO best practices, the recommended architecture is:
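A structural sketch of the shared core + overlays pattern (function names, feature keys, and weights are illustrative assumptions):

```python
from typing import Callable, Dict

def shared_core_features(customer: dict) -> dict:
    """Segment-agnostic features computed once (identity, telecom,
    psychometric) and reused by every overlay (Section 8.2)."""
    return {
        "identity_verified": customer.get("inid_match", False),
        "telecom": customer.get("telecom_score", 0.0),
        "psychometric": customer.get("psychometric_score", 0.0),
    }

def individual_overlay(core: dict, customer: dict) -> float:
    """Individual-specific layer: adds wallet/behavioral signals on top of core."""
    return (0.5 * core["telecom"] + 0.3 * core["psychometric"]
            + 0.2 * customer.get("wallet_score", 0.0))

def sme_overlay(core: dict, customer: dict) -> float:
    """SME-specific layer: cash-flow signals on top of the shared core."""
    return 0.6 * customer.get("cash_flow_score", 0.0) + 0.4 * core["psychometric"]

OVERLAYS: Dict[str, Callable] = {"individual": individual_overlay, "sme": sme_overlay}

def score(segment: str, customer: dict) -> float:
    core = shared_core_features(customer)  # computed once, shared by both segments
    if not core["identity_verified"]:
        raise ValueError("identity gate failed")  # identity is a gatekeeper, not a score input
    return OVERLAYS[segment](core, customer)
```

Adding a new segment (e.g., micro-enterprise) then means registering one new overlay function, with no change to the core.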
8.2 Why This Architecture
| Benefit | Explanation |
|---|---|
| Reduced maintenance | One core to update vs two separate systems |
| Knowledge transfer | Insights from individual scoring improve SME, and vice versa |
| Flexible deployment | Can add new segments (e.g., micro-enterprise) via new overlay |
| Data efficiency | Shared features computed once, used by both |
| Regulatory clarity | Clear separation of segment-specific logic |
8.3 Cascading Data Logic
When data is missing, the system cascades:
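A minimal sketch of the cascade, assuming the source priority follows the tier ordering of Section 3 (most to least predictive) and the confidence gating of Section 4.1:

```python
def cascade(available):
    """Walk the source tiers from most to least predictive and return the
    best available source with non-zero confidence; None means no usable
    data (decline or collect more data)."""
    priority = ["bank_transactions", "mobile_wallet", "telecom",
                "device_behavioral", "psychometric"]
    for source in priority:
        entry = available.get(source)
        if entry and entry["confidence"] > 0:
            return source
    return None

# No bank or wallet data, so the system falls back to telecom
print(cascade({"telecom": {"confidence": 0.5}, "psychometric": {"confidence": 1.0}}))
```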
9. Open Questions & Future Research
9.1 Ground Truth Validation
Challenge: Without historical default data, initial model weights are assumptions.
Proposed Approach:
- Launch with literature-based weights
- Deploy conservative approval thresholds (reject borderline cases)
- Track actual defaults over 6-12 months
- Recalibrate weights based on observed outcomes
- Expand approval gradually as confidence increases
9.2 Concept Drift Monitoring
Credit risk changes over time (economic conditions, fraud evolution). Required:
- Monthly model performance monitoring (PSI, KS stability)
- Trigger-based retraining when drift detected
- COVID-style stress testing for economic shocks
- Fraud pattern updates as new schemes emerge
9.3 Iraq-Specific Calibration Needs
| Factor | Calibration Required |
|---|---|
| Regional risk | Baghdad vs Basra vs Kurdistan multipliers |
| Sector risk | Oil economy vs agriculture vs services |
| Currency volatility | IQD/USD fluctuation impact on ability to pay |
| Cultural factors | Ahl Score validation + repayment culture effects (must avoid identity-based attributes) |
| Seasonal patterns | Ramadan, harvest cycles, government salary timing |
9.4 Data Partnership Priorities
To improve model accuracy, pursue data partnerships in order:
- Critical: Telecom APIs (Zain, Asiacell, Korek)
- Critical: National ID verification (INID integration)
- High: Mobile wallet data (Zain Cash, AsiaHawala)
- High: Qi Card salary data
- Medium: Utility payment history
- Medium: Employer verification networks
9.5 SME Entity Resolution Challenge
SME identity in Iraq is fragmented (inconsistent registration, multiple owner identities, cash-based revenue, loan stacking risk). Required capabilities: Entity resolution linking Business ↔ Owners ↔ Devices ↔ Wallets ↔ Bank Accounts, cross-application duplicate detection, beneficial ownership verification, and velocity rules to prevent stacking. Technical specification to be defined in architecture phase.
10. Citations
Academic Sources
- Djeundje et al. (2021). "Enhancing Credit Scoring with Alternative Data." Expert Systems with Applications. ScienceDirect
- Hlongwane et al. (2024). "Enhancing credit scoring accuracy with a comprehensive evaluation of alternative data." PLOS ONE. PMC
- Weng et al. (2024). "Class imbalance Bayesian model averaging for consumer loan default prediction." Research in International Business and Finance. ScienceDirect
- Feng et al. (2019). "Dynamic weighted ensemble classification for credit scoring using Markov Chain." Applied Intelligence. Springer
- MDPI (2023). "Character Counts: Psychometric-Based Credit Scoring for Underbanked Consumers." MDPI
- World Bank (2022). "Evening the Credit Score: Impact of Psychometric Credit Scoring on Women-Owned Firms." World Bank
Industry Sources
- FICO Blog. "How to Build Credit Risk Models Using AI and Machine Learning." FICO
- Experian (2025). "Blended Credit Scores: A Smarter Approach to Small Business Lending." Experian
- McKinsey & Company (2021). "Designing Next-Generation Credit-Decisioning Models." McKinsey
- CGAP (2019). "Credit Scoring Technical Guide." CGAP
- LenddoEFL. "Scoring Methodology." LenddoEFL
- Cenfri/i2i (2017). "Advancing Financial Inclusion Case Study: Branch." Cenfri
- BIS Papers No. 148. "Digital Innovation for SMEs." BIS
- AFI (2025). "Alternative Data for Credit Scoring." AFI
- Nav. "FICO SBSS Score Explained." Nav
Residual Uncertainties
- Exact IV weights for Iraq: Literature values are from other markets; Iraq-specific calibration needed
- Ahl Score effectiveness: Family/tribal scoring is theoretical; requires validation
- Regulatory acceptance: CBI stance on psychometric and alternative data scoring unknown
- Telecom partnership terms: Data access and pricing not yet negotiated
Appendix A: Understanding the Gini Coefficient
A.1 Historical Origin
| Attribute | Details |
|---|---|
| Named after | Corrado Gini (Italian statistician, 1884-1965) |
| First published | 1912, "Variabilità e mutabilità" (Variability and Mutability) |
| Original purpose | Measuring income inequality in populations |
| Credit adaptation | 1990s-2000s, to measure model discriminatory power |
Original Economics Meaning:
- Gini = 0: Perfect equality (everyone has same income)
- Gini = 1: Perfect inequality (one person has everything)
A.2 What Gini Measures in Credit Scoring
The Gini coefficient measures discriminatory power — how well a score separates defaults from non-defaults.
Gini Interpretation Scale
| Gini Value | Rating | Interpretation | Use Case |
|---|---|---|---|
| 0.00 | Useless | Random coin flip | Reject model |
| 0.10-0.20 | Very Weak | Barely better than random | Supplementary only |
| 0.20-0.30 | Weak | Useful but limited | Thin-file fallback |
| 0.30-0.40 | Decent | Solid predictive power | Single source acceptable |
| 0.40-0.50 | Good | Strong model | Production ready |
| 0.50-0.60 | Very Good | Excellent separation | High-value decisions |
| 0.60+ | Excellent | Rare for single source | Ensemble territory |
| 1.00 | Perfect | Impossible in practice | Theoretical max |
A.3 Relationship to AUC
The Gini coefficient is directly related to AUC (Area Under the ROC Curve):
| AUC | Calculation | Gini | Rating |
|---|---|---|---|
| 0.50 | 2(0.50) − 1 | 0.00 | Random |
| 0.65 | 2(0.65) − 1 | 0.30 | Decent |
| 0.75 | 2(0.75) − 1 | 0.50 | Good |
| 0.85 | 2(0.85) − 1 | 0.70 | Excellent |
A.4 Combining Multiple Data Sources (Ensemble Gini)
Critical Insight: Ensemble Gini is NOT additive.
Why it's not additive: Combined predictive power depends on correlation between sources.
| Correlation | Effect | Example |
|---|---|---|
| High (0.9) | Small lift — sources say the same thing | Two telecom features |
| Moderate (0.5) | Decent lift — some new information | Telecom + psychometric |
| Low (0.1) | Large lift — each adds unique signal | Psychometric + bank transactions |
Simplified Approximation (for uncorrelated sources):
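One rough heuristic sometimes used for approximately uncorrelated sources is the root-sum-of-squares of the individual Ginis (an assumption here, not an exact result; it is consistent with the psychometric + telecom lift of ~0.42 shown in Section A.6, and it breaks down as correlation grows):

```python
import math

def approx_combined_gini(ginis):
    """Root-sum-of-squares approximation of combined Gini for roughly
    uncorrelated sources, capped at the theoretical maximum of 1.0."""
    return min(math.sqrt(sum(g * g for g in ginis)), 1.0)

# Two roughly uncorrelated sources at Gini 0.30 each
print(round(approx_combined_gini([0.30, 0.30]), 2))  # → 0.42
```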
A.5 Practical Ensemble Calculation
In practice, you don't calculate combined Gini mathematically — you measure it empirically:
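The empirical measurement amounts to scoring a holdout set with and without the extra source and computing Gini = 2·AUC − 1 each time. A self-contained sketch using the rank-based (Mann-Whitney) definition of AUC, on tiny made-up data:

```python
def auc(labels, scores):
    """Empirical AUC: probability that a randomly chosen default (label 1)
    receives a higher risk score than a randomly chosen non-default,
    counting ties as half (Mann-Whitney U)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    return 2 * auc(labels, scores) - 1

# Toy holdout: measure a single source, then a 50/50 blend with a second source
labels = [0, 0, 0, 1, 1, 0, 1, 0]
single = [0.2, 0.1, 0.4, 0.7, 0.3, 0.3, 0.5, 0.2]
second = [0.1, 0.3, 0.2, 0.5, 0.8, 0.2, 0.9, 0.4]
blend = [0.5 * a + 0.5 * b for a, b in zip(single, second)]
print(round(gini(labels, single), 2), round(gini(labels, blend), 2))  # → 0.8 1.0
```

The ensemble Gini is whatever the blended score achieves on the holdout; there is no formula to compute it from the standalone Ginis alone.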
A.6 Typical Ensemble Lifts (Empirical)
Based on industry research, here are typical Gini improvements when combining sources:
| Data Sources Combined | Combined Gini | Lift |
|---|---|---|
| Psychometric alone | 0.30 | — |
| + Telecom | 0.42 | +40% |
| + Wallet | 0.48 | +60% |
| + Bank | 0.58 | +93% |
| Full ensemble (5+) | 0.60-0.65 | +100-117% |
A.7 Key Takeaways
The Golden Rule for Adding Data Sources: prioritize sources that combine high standalone predictive power with low correlation to the sources you already use. Each new source should contribute unique signal, not repeat existing signal.
Example: Bank transactions are so valuable because they have:
- High Gini (0.40-0.50)
- Low correlation with psychometric/telecom
- Capture unique financial behavior signals