1. Introduction & Background
Predictive modeling in education, particularly Knowledge Tracing (KT), aims to model a student's evolving knowledge state to forecast future performance and personalize instruction. Traditional methods relying on human interpretation of performance data are prone to cognitive biases (e.g., positivity bias, memory limitations). Computational KT, introduced by Corbett and Anderson (1994), mitigates these biases by inferring knowledge states directly from student interaction data.
While most research prioritizes model accuracy, this paper shifts focus to a critical yet underexplored dimension: algorithmic fairness. Fairness ensures models do not systematically disadvantage groups based on sensitive attributes (e.g., device type, country of origin). In the context of Second Language Acquisition (SLA) via platforms like Duolingo, bias could perpetuate educational inequity.
Core Research Questions: This study evaluates the fairness of KT models across: 1) Different client platforms (iOS, Android, Web), and 2) Learners from developed versus developing countries.
2. Methodology & Experimental Setup
The study employs a comparative analysis framework to evaluate both the predictive performance and fairness of models.
2.1 Datasets: Duolingo Tracks
Three distinct learning tracks from the 2018 Duolingo Shared Task on Second Language Acquisition were used:
- en_es: English speakers learning Spanish.
- es_en: Spanish speakers learning English.
- fr_en: French speakers learning English.
2.2 Predictive Models Evaluated
The study compares two broad classes of models:
- Machine Learning (ML) Models: likely traditional approaches such as Logistic Regression, Random Forests, or Bayesian Knowledge Tracing (BKT); a minimal sketch of a feature-based baseline follows this list.
- Deep Learning (DL) Models: likely sequence models such as Long Short-Term Memory (LSTM) networks or Deep Knowledge Tracing (DKT), which are adept at capturing temporal dependencies in learning sequences.
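The paper does not enumerate the exact architectures behind these classes, so the following is a minimal sketch assuming a feature-based Logistic Regression baseline over hand-engineered interaction features; the feature names and synthetic labels are illustrative, not the study's actual pipeline. A sequence-model counterpart is sketched in Section 4.1.

```python
# Hypothetical feature-based ML baseline for next-answer correctness.
# Feature names are illustrative; the paper does not list its exact features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

# Engineered per-interaction features (e.g., learner's running accuracy,
# days since enrollment, an item-difficulty proxy, position within session).
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.exponential(30, n),
    rng.normal(0, 1, n),
    rng.integers(1, 50, n),
])
y = rng.integers(0, 2, n)  # 1 = next answer correct (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```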
2.3 Fairness Metrics & Evaluation Framework
Fairness was assessed using group fairness metrics. For a binary prediction (e.g., will the student answer the next item correctly?), common metrics include:
- Demographic Parity: Equal prediction rates across groups.
- Equal Opportunity: Equal true positive rates across groups.
- Predictive Parity: Equal precision across groups.
3. Experimental Results & Findings
The analysis yielded four key findings, highlighting trade-offs between accuracy and fairness.
Key Findings at a Glance
- DL Superiority: DL models generally outperformed ML in both accuracy and fairness.
- Mobile Bias: Both ML and DL showed bias favoring mobile (iOS/Android) over web users.
- Development Bias: ML models exhibited stronger bias against learners from developing countries than DL models.
- Context-Dependent Choice: Optimal model choice (DL vs. ML) depends on the specific learning track.
3.1 Performance: Accuracy Comparison
Deep Learning models demonstrated a marked advantage in predictive accuracy across the evaluated tracks. This aligns with the established capability of neural sequence models like DKT to model complex, non-linear learning trajectories more effectively than simpler ML models, as noted in the seminal DKT paper by Piech et al.
3.2 Fairness Across Client Platforms
A consistent and noticeable bias was observed favoring mobile app users (iOS, Android) over web browser users. This could stem from:
- Data quality differences (e.g., interaction patterns, session lengths).
- Unintended correlations between platform choice and learner engagement or socioeconomic factors that are baked into the training data.
3.3 Fairness Across Country Development Levels
Machine Learning algorithms showed a more pronounced bias against learners from developing countries compared to Deep Learning algorithms. This suggests that DL models, with their greater capacity, might be learning more robust, generalizable patterns that are less sensitive to spurious correlations linked to development status.
3.4 Trade-off Analysis: Accuracy vs. Fairness
The study recommends a nuanced, context-specific approach:
- For the en_es and es_en tracks, Deep Learning is the more suitable choice, offering a better balance of accuracy and fairness.
- For the fr_en track, Machine Learning emerged as a more suitable option, potentially due to dataset characteristics where simpler models generalize more fairly.
4. Technical Deep Dive
4.1 Knowledge Tracing Formalism
At its core, KT models a learner's knowledge state as a latent variable that evolves over time. Given a sequence of learner interactions (e.g., exercise attempts) $X = \{x_1, x_2, ..., x_t\}$, the goal is to predict the probability of correctness on the next item, $P(r_{t+1} = 1 | X)$.
Deep Knowledge Tracing (DKT) uses a Recurrent Neural Network (RNN) to model this:
$h_t = \text{RNN}(x_t, h_{t-1})$
$P(r_{t+1} = 1 \mid X) = \sigma(W \cdot h_t + b)$
where $h_t$ is the hidden state representing the knowledge state at time $t$, and $\sigma$ is the sigmoid function.
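A minimal PyTorch sketch of this formulation is given below. It simplifies the original DKT (which outputs a per-skill probability vector) to a single correctness probability per step; the interaction encoding, dimensions, and choice of an LSTM cell are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal DKT-style model: an RNN over encoded interactions with a sigmoid head.
# Simplification: predicts a single P(correct) per step instead of DKT's per-skill vector.
import torch
import torch.nn as nn

class SimpleDKT(nn.Module):
    def __init__(self, n_items: int, hidden: int = 64):
        super().__init__()
        # Encode (item, correctness) pairs as 2 * n_items discrete tokens, as in DKT.
        self.embed = nn.Embedding(2 * n_items, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)  # h_t = RNN(x_t, h_{t-1})
        self.out = nn.Linear(hidden, 1)                        # W . h_t + b

    def forward(self, interactions):           # (batch, time) integer-encoded interactions
        h, _ = self.rnn(self.embed(interactions))
        return torch.sigmoid(self.out(h)).squeeze(-1)          # P(r_{t+1} = 1 | X) at each step

# One training step on synthetic data (binary cross-entropy against next-step correctness).
model = SimpleDKT(n_items=300)
x = torch.randint(0, 600, (8, 25))             # 8 learners, 25 interactions each
target = torch.randint(0, 2, (8, 25)).float()
loss = nn.functional.binary_cross_entropy(model(x), target)
loss.backward()
```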
4.2 Fairness Metrics Formulation
Let $A \in \{0,1\}$ be a sensitive attribute (e.g., $A=1$ for a mobile user, $A=0$ for a web user), let $Y$ be the true outcome, and let $\hat{Y}$ be the model's prediction. Demographic Parity requires:
$P(\hat{Y}=1 | A=1) = P(\hat{Y}=1 | A=0)$
Equal Opportunity (considering correctness as the positive outcome) requires:
$P(\hat{Y}=1 | A=1, Y=1) = P(\hat{Y}=1 | A=0, Y=1)$
The bias observed in the study can be quantified as the difference or ratio between these conditional probabilities for different groups.
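A short sketch of how these gaps might be computed from model outputs follows, covering the three group metrics listed in Section 2.3 plus the difference/ratio quantification described above. The arrays are synthetic placeholders standing in for the model's binarized predictions, the true outcomes, and the sensitive attribute.

```python
# Demographic parity, equal opportunity, and predictive parity gaps between two groups.
# Inputs are synthetic; in practice y_pred comes from the KT model and a is the
# sensitive attribute (e.g., 1 = mobile user, 0 = web user).
import numpy as np

def fairness_gaps(y_true, y_pred, a):
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    rates = {}
    for g in (0, 1):
        m = a == g
        rates[g] = {
            "pred_pos": y_pred[m].mean(),                      # P(Y_hat=1 | A=g)
            "tpr": y_pred[m & (y_true == 1)].mean(),           # P(Y_hat=1 | A=g, Y=1)
            "precision": y_true[m & (y_pred == 1)].mean(),     # P(Y=1 | A=g, Y_hat=1)
        }
    return {
        "demographic_parity_diff": rates[1]["pred_pos"] - rates[0]["pred_pos"],
        "demographic_parity_ratio": rates[0]["pred_pos"] / rates[1]["pred_pos"],
        "equal_opportunity_diff": rates[1]["tpr"] - rates[0]["tpr"],
        "predictive_parity_diff": rates[1]["precision"] - rates[0]["precision"],
    }

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 10_000)
a = rng.integers(0, 2, 10_000)
y_pred = (rng.uniform(size=10_000) < 0.55 + 0.05 * a).astype(int)  # mild bias toward A=1
print(fairness_gaps(y_true, y_pred, a))
```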
5. Analysis Framework & Case Example
Framework for Auditing KT Fairness: Edtech developers can adopt this structured approach:
- Disaggregate Evaluation: Never report only aggregate accuracy. Always calculate performance metrics (accuracy, AUC) and fairness metrics (demographic parity difference, equal opportunity difference) separately for each sensitive subgroup (by platform, country, or gender if available); a sketch of such an audit follows this list.
- Root Cause Analysis: For identified biases, investigate feature correlations. Is "number of sessions" correlated with both platform and prediction outcome? Could proxy variables for socioeconomic status be leaking into the model via behavioral data?
- Mitigation Strategy Selection: Based on the cause, choose a mitigation technique: pre-processing (re-weighting data), in-processing (adding fairness constraints to the loss function, as in approaches from the FAccT, formerly FAT*, research community), or post-processing (calibrating thresholds per group).
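As a concrete starting point for the first step, here is a hedged sketch of a disaggregated audit over a table of model predictions; the column names (`client`, `correct`, `p_correct`) are hypothetical and would need to match the actual log schema.

```python
# Disaggregated evaluation: performance and fairness metrics per sensitive subgroup.
# Column names are hypothetical; each subgroup must contain both outcome classes for AUC.
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def audit(df: pd.DataFrame, group_col: str, label="correct", score="p_correct", thresh=0.5):
    rows = []
    for group, sub in df.groupby(group_col):
        pred = (sub[score] >= thresh).astype(int)
        rows.append({
            group_col: group,
            "n": len(sub),
            "accuracy": accuracy_score(sub[label], pred),
            "auc": roc_auc_score(sub[label], sub[score]),
            "positive_rate": pred.mean(),                 # for demographic parity
            "tpr": pred[sub[label] == 1].mean(),          # for equal opportunity
        })
    report = pd.DataFrame(rows)
    report["dp_gap"] = report["positive_rate"].max() - report["positive_rate"]
    report["eo_gap"] = report["tpr"].max() - report["tpr"]
    return report

# Usage: audit(predictions, "client") and audit(predictions, "country_group")
```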
Case Example - The Mobile Bias: Imagine an LSTM-based KT model trained on Duolingo data shows a 15% higher predicted probability of success for iOS users vs. Web users, holding actual performance constant. Our audit reveals the "time-of-day" feature is a key driver: iOS users practice more in short, frequent bursts (commutes), while Web users have longer, less frequent sessions. The model associates the "commute pattern" with higher engagement and boosts predictions, unfairly penalizing Web users who may learn effectively in different patterns. Mitigation: We could apply a fairness-aware regularization term during training that penalizes the model for differences in prediction distributions between the platform groups, guided by the work of researchers like Zemel et al. on learning fair representations.
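A hedged sketch of the in-processing option mentioned above: a demographic-parity-style penalty on the gap in mean predicted probability between platform groups, added to the usual cross-entropy loss. This is one plausible instantiation for illustration, not the paper's method and not the representation-learning approach of Zemel et al.

```python
# In-processing mitigation sketch: penalize the gap in mean predicted probability
# between platform groups. One plausible regularizer, not the paper's exact method.
import torch
import torch.nn.functional as F

def fair_kt_loss(p_correct, labels, group, lam=1.0):
    """p_correct: predicted probabilities; labels: 0/1 correctness; group: 0 = web, 1 = mobile."""
    bce = F.binary_cross_entropy(p_correct, labels)
    gap = (p_correct[group == 1].mean() - p_correct[group == 0].mean()).abs()
    return bce + lam * gap

# Example with dummy tensors (in training, p_correct would come from the KT model).
p = torch.rand(256, requires_grad=True)
y = torch.randint(0, 2, (256,)).float()
g = torch.randint(0, 2, (256,))
fair_kt_loss(p, y, g, lam=0.5).backward()
```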
6. Critical Analysis & Expert Interpretation
Core Insight: This paper delivers a crucial, uncomfortable truth for the booming EdTech sector: your state-of-the-art knowledge tracing models are likely baking in systemic biases that favor affluent, mobile-first users and developed nations. The pursuit of accuracy has blinded the field to the ethical debt accumulating in its algorithms. The finding that bias persists even in sophisticated Deep Learning models is a sobering counterpoint to the belief that more complex models inherently learn "fairer" representations.
Logical Flow: The authors logically progress from establishing the KT paradigm to exposing its fairness blind spot. Using the well-established Duolingo dataset provides credibility and reproducibility. The bifurcated analysis—platform bias and geopolitical bias—cleverly captures two major axes of digital divide. The comparison between classical ML and modern DL is not just technical but strategic, helping practitioners choose tools with ethical implications in mind.
Strengths & Flaws: The primary strength is its actionable, empirical focus on real-world data and clear, comparative findings. It moves beyond theoretical fairness discussions. However, a significant flaw is the lack of mechanistic explanation. Why does mobile bias occur? Is it data artifact, user behavior difference, or model limitation? The paper diagnoses the disease but offers scant pathology. Furthermore, the suggestion to use ML for the `fr_en` track based on fairness, despite its lower accuracy, presents a real-world dilemma: how much accuracy are we willing to sacrifice for fairness, and who decides?
Actionable Insights: For product leaders and engineers, this study is a mandate for change. First, fairness auditing must become a standard KPI alongside A/B testing for new model deployments, akin to practices advocated by Google's PAIR initiative. Second, the observed biases suggest a need for platform-specific feature engineering or calibration. Perhaps web users require a subtly different predictive model. Third, the research underscores the need for more diverse and representative training data. Collaborations with NGOs or educational bodies in developing regions could help rebalance datasets. Finally, the field must develop and adopt "Fairness-by-Design" KT architectures, integrating constraints from the outset, rather than retrofitting fairness as an afterthought.
7. Future Applications & Research Directions
- Personalized Fairness-Aware Tutoring: Future intelligent tutoring systems (ITS) can dynamically adjust not just for knowledge state, but also to counteract predicted biases. If the system detects a student is from an underrepresented group for which the model is less confident, it could provide more supportive scaffolding or gather more data to reduce uncertainty fairly.
- Cross-Cultural & Cross-Linguistic Model Transfer: Research should explore fairness in transfer learning. Is a KT model trained on English-speaking learners fair when fine-tuned for Spanish speakers? Techniques from domain adaptation could be merged with fairness constraints.
- Explainable Fairness (XFairness): Beyond measuring bias, we need tools to explain which features contribute to unfair outcomes. This aligns with the broader XAI (Explainable AI) movement and is critical for developer trust and effective mitigation.
- Longitudinal Fairness Studies: Does algorithmic bias increase or decrease over a learner's multi-year journey? Longitudinal studies are needed to understand the compounding effects of biased feedback loops in adaptive systems.
- Integration with Learning Science: Future work must bridge the gap with pedagogical theory. What does "fairness" mean from a cognitive load or motivational perspective? Fairness should align with educational equity principles, not just statistical parity.
8. References
- Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253-278.
- Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems, 28.
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T., & Dwork, C. (2013). Learning fair representations. International Conference on Machine Learning (pp. 325-333). PMLR.
- Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1-35.
- Google PAIR. (n.d.). People + AI Guidebook. Retrieved from https://pair.withgoogle.com/
- Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second language acquisition modeling. Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), NAACL-HLT 2018.
- Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org.