Deep Factorization Machines for Knowledge Tracing: Analysis of the 2018 Duolingo SLAM Solution

Analysis of a research paper applying Deep Factorization Machines to the Duolingo Second Language Acquisition Modeling task, exploring its methodology, results, and implications for educational data mining.

1. Introduction & Overview

This paper presents the author's solution to the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (SLAM). The core challenge was knowledge tracing at the word level: predicting whether a student would correctly write the words of a new sentence, given their historical attempt data on thousands of sentences annotated with lexical, morphological, and syntactic features.

The proposed solution utilizes Deep Factorization Machines (DeepFM), a model designed to capture both low-order (linear) and high-order (non-linear) feature interactions. The model achieved an AUC of 0.815, outperforming a logistic regression baseline (AUC 0.774) but falling short of the top-performing model (AUC 0.861) in the competition.

Key Insights

  • Applies a recommendation system model (DeepFM) to the educational data mining problem of knowledge tracing.
  • Demonstrates how traditional models like Item Response Theory (IRT) can be viewed as special cases within a more general factorization framework.
  • Highlights the importance of leveraging rich side information (user, item, skill, linguistic features) for accurate performance prediction.

2. Related Work & Theoretical Background

The paper positions itself within the historical and contemporary landscape of student modeling.

2.1 Item Response Theory (IRT)

Item Response Theory (IRT) is a psychometric framework that models the probability of a correct response as a function of the student's latent ability ($\theta$) and the item's parameters (e.g., difficulty $b$, discrimination $a$). A common model is the 2-parameter logistic (2PL) model:

$P(\text{correct} | \theta) = \frac{1}{1 + e^{-a(\theta - b)}}$

IRT is foundational in standardized testing but traditionally handles simple student-item interactions without rich side information.
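The 2PL model is a one-line function of ability and item parameters. The sketch below (plain Python, with illustrative parameter values) evaluates it:

```python
import math

def irt_2pl(theta, a, b):
    """2PL IRT: probability of a correct response given latent ability
    theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student whose ability equals the item's difficulty answers correctly
# with probability 0.5, regardless of the discrimination parameter.
print(irt_2pl(theta=0.0, a=1.5, b=0.0))  # 0.5
```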

2.2 Knowledge Tracing Evolution

  • Bayesian Knowledge Tracing (BKT): Models the learner as a Hidden Markov Model, tracking the probability of knowing a skill over time.
  • Deep Knowledge Tracing (DKT): Uses Recurrent Neural Networks (RNNs), specifically LSTMs, to model temporal sequences of learner interactions. Piech et al. (2015) demonstrated its potential, but subsequent work (Wilson et al., 2016) showed IRT variants could be competitive.
  • Limitation: Both BKT and early DKT often ignored auxiliary feature information about items and learners.

2.3 Factorization Machines & Wide & Deep Learning

The paper builds on two key ideas from recommender systems:

  1. Factorization Machines (FMs): Proposed by Rendle (2010), FMs model all pairwise interactions between variables using factorized parameters, effectively learning embeddings for categorical features. The prediction for a feature vector $\mathbf{x}$ is:

    $\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$

    where $\mathbf{v}_i$ are latent factor vectors.
  2. Wide & Deep Learning: Proposed by Cheng et al. (2016) at Google, this architecture jointly trains a wide linear model (for memorization) and a deep neural network (for generalization).
  3. DeepFM: Guo et al. (2017) fused these ideas, replacing the wide component with an FM to learn low-order feature interactions automatically, while a DNN learns high-order interactions. This is the model adopted in this paper.
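The FM prediction in item 1 can be computed in O(kn) time via Rendle's reformulation of the pairwise sum. The NumPy sketch below uses arbitrary illustrative parameters, not a trained model:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine score (Rendle, 2010).
    x: (n,) feature vector; w0: global bias; w: (n,) linear weights;
    V: (n, k) latent factor matrix, one k-dim vector v_i per feature."""
    linear = w0 + w @ x
    # Rendle's O(kn) identity for sum_{i<j} <v_i, v_j> x_i x_j:
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    xv = V.T @ x
    interactions = 0.5 * (xv @ xv - ((V ** 2).T @ (x ** 2)).sum())
    return linear + interactions

rng = np.random.default_rng(0)
n, k = 6, 3
x = np.zeros(n); x[[0, 2, 5]] = 1.0   # sparse one-hot-style instance
w = rng.normal(size=n)
V = rng.normal(size=(n, k))
y = fm_predict(x, 0.1, w, V)
```

The identity avoids the explicit double loop over feature pairs, which is what makes FMs practical on the very sparse vectors used here.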

3. DeepFM Model for Knowledge Tracing

The paper adapts the DeepFM architecture for the knowledge tracing task.

3.1 Model Formulation & Architecture

The core idea is to treat each learning interaction (e.g., "user 123 attempts word 'serendipity' within a sentence having feature X") as a sparse feature vector $\mathbf{x}$. The model learns an embedding for every entity (e.g., user_id=123, word='serendipity', feature_X=1).

The final prediction is a probability:

$p(\mathbf{x}) = \psi(y_{FM} + y_{DNN})$

where $\psi$ is a link function (sigmoid $\sigma$ or normal CDF $\Phi$).

  • FM Component: Computes $y_{FM}$ as in the standard FM equation, capturing all pairwise interactions between entity embeddings (e.g., user-word, user-skill, word-skill).
  • Deep Component: A standard feed-forward neural network takes the concatenated entity embeddings as input and computes $y_{DNN}$, capturing complex, high-order feature interactions.

Both components share the same input feature embeddings, making the model efficient and jointly trained.
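A minimal forward pass of this architecture might look as follows. This is an illustrative sketch with made-up shapes and a single hidden layer, not the paper's implementation; the key point is that both components read from the same embedding table `E`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deepfm_forward(active, E, w, w0, W1, b1, w2, b2):
    """One forward pass of a minimal DeepFM sketch (hypothetical shapes).
    active: indices of the entities present in this instance (x_i = 1);
    E: (n, k) shared embedding table used by BOTH components;
    w, w0: linear weights/bias; W1, b1, w2, b2: a one-hidden-layer DNN."""
    emb = E[active]                               # (m, k) active embeddings
    # FM component: bias + linear terms + all pairwise embedding interactions
    s = emb.sum(axis=0)
    y_fm = w0 + w[active].sum() + 0.5 * (s @ s - (emb ** 2).sum())
    # Deep component: MLP over the concatenated active embeddings
    h = np.maximum(0.0, W1 @ emb.ravel() + b1)    # ReLU hidden layer
    y_dnn = w2 @ h + b2
    return sigmoid(y_fm + y_dnn)                  # psi = sigmoid link

rng = np.random.default_rng(1)
n, k, m, hdim = 10, 4, 3, 8
E = rng.normal(scale=0.1, size=(n, k))
p = deepfm_forward(
    active=np.array([0, 4, 7]), E=E,
    w=rng.normal(scale=0.1, size=n), w0=0.0,
    W1=rng.normal(scale=0.1, size=(hdim, m * k)), b1=np.zeros(hdim),
    w2=rng.normal(scale=0.1, size=hdim), b2=0.0,
)
```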

3.2 Feature Encoding & Entity Embeddings

Each instance is encoded into a sparse vector of size $N$, where $N$ is the total number of possible entities across all categorical and continuous feature categories (user, item, skill, time, linguistic tags).

  • Discrete entities: Encoded with a value of 1 if present.
  • Continuous entities (e.g., timestamp): The actual continuous value is used.
  • Absent entities: Encoded as 0.

This flexible encoding allows the model to seamlessly integrate diverse data types from the Duolingo task.
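Under these conventions, a single interaction can be encoded as a sparse map from entity index to value. The field and entity names below are illustrative, not the actual SLAM schema:

```python
def encode_instance(features, index):
    """Encode one interaction as a sparse {entity_index: value} dict.
    `features` maps field -> value; discrete entities get 1.0, while
    fields listed in CONTINUOUS keep their real value (e.g. elapsed time).
    All entity indices not present are implicitly 0."""
    CONTINUOUS = {"days_since_start"}  # hypothetical continuous field
    x = {}
    for field, value in features.items():
        if field in CONTINUOUS:
            x[index[(field,)]] = float(value)
        else:
            x[index[(field, value)]] = 1.0
    return x

# Hypothetical entity index over users, tokens, POS tags, one time feature
index = {("user", "u123"): 0, ("token", "manger"): 1,
         ("pos", "VERB"): 2, ("days_since_start",): 3}
x = encode_instance(
    {"user": "u123", "token": "manger", "pos": "VERB",
     "days_since_start": 4.5}, index)
# x == {0: 1.0, 1: 1.0, 2: 1.0, 3: 4.5}
```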

4. Experimental Setup & Results

4.1 Duolingo SLAM 2018 Task

The task provided sequences of student attempts on foreign language sentences. For each word in a new sentence, the goal was to predict the probability of the student writing it correctly. The dataset included rich linguistic annotations for each word/token.

4.2 Data Preparation & Feature Engineering

To apply DeepFM, the raw sequential data was transformed into a standard feature matrix format. Key steps likely included:

  1. Instance Creation: Each student-word attempt became a single data instance.
  2. Feature Categorization: Identifying categories: user ID, word/token ID, sentence ID, part-of-speech tag, morphological feature, syntactic dependency relation, etc.
  3. Sparse Representation: Converting these categories into the sparse entity vector $\mathbf{x}$.
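Step 1 above (instance creation) might be sketched as follows, assuming a hypothetical record layout for the raw attempts; the actual SLAM data format differs:

```python
def make_instances(attempts):
    """Flatten raw per-sentence attempt records into one instance per
    student-token attempt. The record layout here is assumed for
    illustration, not taken from the SLAM release."""
    instances = []
    for rec in attempts:
        for tok in rec["tokens"]:
            instances.append({
                "user": rec["user"],
                "token": tok["surface"],
                "pos": tok["pos"],
                "label": tok["correct"],   # 1 = written correctly
            })
    return instances

raw = [{"user": "u1", "tokens": [
    {"surface": "je", "pos": "PRON", "correct": 1},
    {"surface": "mange", "pos": "VERB", "correct": 0}]}]
rows = make_instances(raw)   # two instances, one per token
```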

4.3 Performance Results & Analysis

Model Performance (AUC)

  • Logistic Regression Baseline: 0.774
  • DeepFM (Proposed Model): 0.815
  • Top Performing Model (Benchmark): 0.861

Interpretation: DeepFM delivered a 5.3% relative AUC improvement over a strong linear baseline ((0.815 − 0.774) / 0.774 ≈ 5.3%), validating the value of modeling feature interactions. However, the gap to the top model indicates room for architectural improvement or more sophisticated feature engineering.

The paper suggests that DeepFM can subsume traditional IRT models. For example, a simple IRT model can be approximated by the FM component with entities only for user ability and item difficulty, where their interaction term $\langle \mathbf{v}_{user}, \mathbf{v}_{item} \rangle$ captures the $a(\theta - b)$ dynamic.
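This reduction is easy to check numerically: if the only active entities are one user and one item, the FM logit collapses to $w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle$, and with zero embeddings, $w_u = \theta$, and $w_i = -b$ it is exactly the Rasch (1PL) model. A small sketch with illustrative numbers:

```python
import math

def fm_score_user_item(w0, w_user, w_item, v_user, v_item):
    """FM logit when the only active entities are one user and one item."""
    dot = sum(a * b for a, b in zip(v_user, v_item))
    return w0 + w_user + w_item + dot

def rasch(theta, b):
    """1PL (Rasch) IRT probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Zero embeddings, w_user = theta, w_item = -b: the FM collapses to Rasch.
# Non-zero embeddings add the 2PL-style user-item interaction.
theta, b = 0.8, 0.3
logit = fm_score_user_item(0.0, theta, -b, [0.0, 0.0], [0.0, 0.0])
p = 1.0 / (1.0 + math.exp(-logit))
```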

5. Technical Deep Dive & Analysis

Industry Analyst Perspective: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights

5.1 Core Insight & Logical Flow

The paper's fundamental bet is that knowledge tracing is, at its core, a recommendation problem. Instead of recommending movies, you're predicting the "relevance" (correctness) of a knowledge component (word) to a user (student) at a specific context (sentence with features). This reframing is powerful. The logical flow is elegant: 1) Acknowledge the limitation of sequential-only models (DKT) and simple linear models (IRT, LR). 2) Identify the need to model rich, cross-feature interactions (user-skill, skill-context). 3) Import a state-of-the-art recommender system architecture (DeepFM) proven to excel at this exact problem. 4) Validate it beats simple baselines. This is a classic case of cross-pollination from a mature field (recommender systems) to an emerging one (EdTech AI), similar to how computer vision techniques revolutionized medical image analysis.

5.2 Strengths & Critical Flaws

Strengths:

  • Unified Framework: Its greatest theoretical contribution is showing how IRT, FM, and other models exist on a spectrum within this architecture. This is reminiscent of the unifying view provided by models like Transformer in NLP, which subsumed RNNs and CNNs for sequence tasks.
  • Feature Agnosticism: The model can ingest any categorical or continuous feature without extensive pre-processing, a huge practical advantage for messy educational datasets.
  • Strong Baseline Beater: A 0.815 AUC is a solid, production-viable result, convincingly better than the logistic regression baseline.

Critical Flaws & Missed Opportunities:

  • The Elephant in the Room: The 0.861 Benchmark. The paper glosses over why DeepFM fell short. Was it model capacity? Training data? The lack of explicit temporal modeling is a glaring weakness. DeepFM treats each attempt as independent, ignoring the crucial sequence. The winning model likely incorporated temporal dynamics, akin to how WaveNet or temporal convolutions outperform feed-forward models in time-series prediction. This is a major architectural blind spot.
  • Black Box Trade-off: While more interpretable than a pure DNN, the learned embeddings are still opaque. For educational stakeholders, explaining why a prediction was made is often as important as the prediction itself. The paper offers no interpretability tools.
  • Computational Cost: Learning embeddings for every unique entity (every user, every word) can be massive and inefficient for large-scale, dynamic platforms like Duolingo with millions of new users and content items.

5.3 Actionable Insights & Strategic Implications

For EdTech companies and researchers:

  1. Prioritize Feature Engineering Over Model Novelty: This paper's success stemmed more from its feature representation (encoding all side information) than a radically new model. Invest in data infrastructure to capture and serve rich contextual features (time-of-day, device, previous lesson history, engagement metrics).
  2. Hybridize, Don't Just Import: The next step isn't another recommender model. It's DeepFM + Temporal Awareness. Explore architectures like DeepFM with LSTM/GRU towers or Temporal Factorization Machines. Look to work like TiSASRec (Li et al., 2020) that combines self-attention with time intervals for sequential recommendation.
  3. Benchmark Relentlessly Against Simplicity: The fact that a well-tuned IRT variant (Wilson et al., 2016) can compete with DKT is a humbling lesson. Always benchmark against strong, interpretable baselines (IRT, logistic regression with clever features). Complexity must justify its performance lift and computational cost.
  4. Focus on Actionable Outputs: Move beyond prediction AUC. The real value is in prescription. Use the model's pairwise interaction strengths (from the FM component) to identify which skill gaps are most critical for a student or which lesson features are most confusing. Turn diagnostics into personalized learning paths.

6. Analysis Framework & Conceptual Example

Conceptual Framework for Applying DeepFM to a New Educational Dataset:

  1. Define the Prediction Target: Binary (correct/incorrect), or multi-class (partial credit levels).
  2. Inventory All Features (Entities):
    • Student-Level: ID, demographic bucket, overall performance history.
    • Item/Question-Level: ID, knowledge component(s), difficulty rating, format (multiple choice, open-ended).
    • Interaction Context: Timestamp, time spent, attempt number, platform used.
    • External: Lesson ID, teacher ID (in classroom settings).
  3. Construct the Sparse Vector for an Instance:

    Example: Student_S123 attempts Question_Q456 on Knowledge Component "Linear Equations."
    Feature Vector $\mathbf{x}$ would have 1s at indices corresponding to entities: [student=S123, question=Q456, kc=linear_equations, attempt_num=2, ...] and 0s elsewhere.

  4. Model Training & Interpretation:
    • The FM component learns that the interaction $\langle \mathbf{v}_{S123}, \mathbf{v}_{linear\_equations} \rangle$ is strongly negative, indicating this student struggles with this KC.
    • The DNN component might detect a complex pattern: students who struggle with "linear equations" and attempt questions quickly (short time-spent feature) and on mobile devices have an even higher failure rate.
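The diagnostic reading in step 4 can be extracted directly from the FM component: the contribution of any entity pair to the logit is just the dot product of their embeddings. A sketch with made-up embedding values (`V` would come from training in practice):

```python
import numpy as np

def interaction_strength(V, index, a, b):
    """Dot product between two entity embeddings: the FM component's
    contribution of the (a, b) entity pair to the logit. A negative value
    pushes the predicted probability of a correct answer down."""
    return float(V[index[a]] @ V[index[b]])

index = {"student=S123": 0, "kc=linear_equations": 1}
V = np.array([[0.9, -0.4],    # illustrative learned embeddings
              [-0.7, 0.6]])
s = interaction_strength(V, index, "student=S123", "kc=linear_equations")
# s < 0 here: the model reads S123 as struggling with this knowledge component
```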

7. Future Applications & Research Directions

  • Temporal & Sequential Enhancements: Integrating recurrent or attention-based layers (like Transformers) to model the order and timing of learning activities explicitly. Models like SAINT (Choi et al., 2020), which applies deep self-attention separately to exercise and response features, point the way forward.
  • Cross-Domain Knowledge Tracing: Using embeddings from a language model (e.g., BERT) to represent exercise text or student explanations, enabling the model to generalize to unseen exercises based on semantic similarity.
  • Causal Inference for Intervention Design: Moving from correlation (prediction) to causation. Could the model identify not just that a student will fail, but which specific intervention (a video, a hint, a simpler problem) would most likely change that outcome? This connects to the burgeoning field of uplift modeling in personalized education.
  • Federated & Privacy-Preserving Learning: Developing versions of DeepFM that can train on decentralized student data (on individual devices/school servers) without centralizing sensitive information, crucial for ethical EdTech scaling.
  • Integration with Learning Science Theory: Constraining or initializing model parameters based on cognitive theories (e.g., spacing effect, cognitive load theory) to make models more interpretable and theoretically grounded.

8. References

  1. Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., ... & Shah, H. (2016). Wide & deep learning for recommender systems. Proceedings of the 1st workshop on deep learning for recommender systems.
  2. Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction.
  3. Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
  4. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
  5. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation.
  6. Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. Advances in neural information processing systems.
  7. Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining.
  8. Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second language acquisition modeling. Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications.
  9. Vie, J. J., & Kashima, H. (2018). Knowledge tracing machines: Factorization machines for knowledge tracing. arXiv preprint arXiv:1811.03388.
  10. Wilson, K. H., Karklin, Y., Han, B., & Ekanadham, C. (2016). Back to the basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation. Educational Data Mining.
  11. Li, J., Wang, Y., & McAuley, J. (2020). Time interval aware self-attention for sequential recommendation. Proceedings of the 13th International Conference on Web Search and Data Mining.
  12. Choi, Y., Lee, Y., Cho, J., Baek, J., Kim, B., Cha, Y., ... & Kim, S. (2020). Towards an appropriate query, key, and value computation for knowledge tracing. Proceedings of the Seventh ACM Conference on Learning @ Scale.