Second Language Acquisition of Neural Language Models: A Linguistic Analysis of Cross-Lingual Transfer

An analysis of how neural language models acquire a second language (L2), examining the effects of first language (L1) pretraining, language transfer configurations, and linguistic generalization.
study-chinese.com | PDF Size: 0.5 MB

1. Introduction & Overview

This research investigates the second language (L2) acquisition process in neural language models (LMs), shifting focus from the typical first language (L1) acquisition studies. The core question is how prior L1 knowledge influences the efficiency and nature of grammatical knowledge acquisition in a new language (L2). The study designs a human-like L2 learning scenario for bilingual LMs, pretraining them on an L1 (French, German, Russian, Japanese) before exposing them to English as the L2. The goal is to analyze cross-lingual transfer from a linguistic perspective, using grammatical judgment tests to evaluate generalization.

2. Experimental Procedure & Methodology

The methodology follows a three-stage pipeline, as conceptually illustrated in Figure 1 of the PDF:

  1. L1 Pretraining (First Language Acquisition): A monolingual masked language model (e.g., BERT architecture) is pretrained from scratch on a corpus of a single language (L1).
  2. L2 Training (Second Language Acquisition): The L1-pretrained model undergoes further training under a bilingual setting. This involves exposure to English (L2) data. Different configurations are tested, including L2-only monolingual texts and L1-L2 parallel translation pairs.
  3. Evaluation & Analysis: The model's linguistic generalization in L2 is evaluated using the BLiMP benchmark, which tests syntactic abilities. The effect of the L1 choice and training configuration is analyzed.

Training data size is intentionally restricted to simulate a more data-efficient, human-like learning scenario rather than the massive data regimes typical of modern LLMs.
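The three-stage pipeline above can be sketched schematically. The toy token-count "model" and `evaluate` step below are illustrative stand-ins, not the authors' code; the paper trains real BERT-style models and evaluates with BLiMP.

```python
# Schematic sketch of the three-stage pipeline (illustrative only;
# the actual study uses masked LMs, not this toy token-count model).

def pretrain_l1(l1_corpus):
    """Stage 1: 'pretrain' a toy model -- here, just token statistics."""
    model = {}
    for token in l1_corpus:
        model[token] = model.get(token, 0) + 1
    return model

def train_l2(model, l2_corpus):
    """Stage 2: continue training the *same* model on L2 data."""
    for token in l2_corpus:
        model[token] = model.get(token, 0) + 1
    return model

def evaluate(model, test_tokens):
    """Stage 3: toy evaluation -- fraction of test tokens the model has seen."""
    return sum(t in model for t in test_tokens) / len(test_tokens)

model = pretrain_l1(["le", "chat", "dort"])        # L1 = French
model = train_l2(model, ["the", "cat", "sleeps"])  # L2 = English
print(evaluate(model, ["the", "cat", "runs"]))     # 2 of 3 tokens seen
```

The essential point the sketch preserves is ordering: a single model carries its L1-derived state into L2 training, which is what makes transfer (and forgetting) possible.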

3. Inductive Biases in L2 Training Methods

The study first explores how different ways of presenting L2 data affect learning. A key finding is that models trained on L1-L2 translation pairs showed slower L2 grammar acquisition compared to models trained on L2 monolingual texts presented intermittently (e.g., every two epochs). This suggests that direct translation exposure may introduce a confounding inductive bias or processing overhead that hinders pure L2 structural learning, a nuance with implications for designing multilingual training curricula.
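The two data-presentation regimes can be sketched as epoch schedules. The exact composition of the intermittent schedule is an assumption here (the text specifies only that L2 monolingual data appears every two epochs); both functions are illustrative, not the paper's implementation.

```python
# Sketch of the two L2-presentation schedules compared in Sec. 3.
# Assumption: non-L2 epochs in the intermittent regime continue on L1.

def monolingual_schedule(num_epochs, l2_every=2):
    """L2 monolingual text shown intermittently (every `l2_every` epochs)."""
    return ["L2-mono" if epoch % l2_every == 0 else "L1"
            for epoch in range(num_epochs)]

def translation_schedule(num_epochs):
    """L1-L2 parallel translation pairs shown in every epoch."""
    return ["L1-L2-pairs"] * num_epochs

print(monolingual_schedule(6))
print(translation_schedule(3))
```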

4. Effects of L1 Training on L2 Grammar Acquisition

4.1 L1 Knowledge Promotes L2 Generalization

The primary finding is that pretraining on an L1 accelerates and improves linguistic generalization in the L2 (English), compared to a model learning English from scratch. This demonstrates positive transfer, where abstract linguistic representations learned from L1 are beneficial for acquiring L2.

4.2 Differential Effects of L1 Languages

The benefit of L1 pretraining is not uniform. Models with L1s linguistically closer to English (French, German) showed superior L2 generalization compared to those with more distant L1s (Japanese, Russian). This aligns with established human second language acquisition (SLA) theory, such as the Contrastive Analysis Hypothesis, and empirical data on language transfer difficulty (Chiswick & Miller, 2004).

4.3 Grammar-Specific Transfer Effects

Transfer gains varied across grammatical phenomena. The largest improvements from L1 pretraining were observed for morphological and syntactic items (e.g., subject-verb agreement, syntactic islands). Smaller gains were seen for semantic and syntax-semantic interface items (e.g., quantifier scope). This indicates that core structural knowledge transfers more readily than meaning-related constraints.

5. Process Analysis of L2 Acquisition

5.1 Progression of L2 Knowledge Acquisition

Analysis of the learning trajectory revealed two critical insights:

  1. Data Inefficiency: Significant L2 knowledge acquisition did not occur until the model had seen the entire L2 dataset many times (e.g., 50-100 epochs), highlighting a stark contrast with human ability to generalize from few examples.
  2. Catastrophic Interference / L1 Knowledge Degradation: During L2 training, the model's performance on its original L1 tasks degraded. This phenomenon, known as catastrophic forgetting in continual learning, underscores a key non-human-like aspect of current LMs and points to the need for mechanisms to balance source and target linguistic knowledge.
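The interference effect in point 2 can be illustrated with a deliberately simplified model: sequentially updating a probability distribution on L2-only data erodes the mass assigned to L1 tokens. The exponential-moving-average update below is a toy stand-in for gradient descent, not the mechanism in the paper.

```python
# Toy illustration of catastrophic forgetting: continuing training on
# L2-only data collapses the probability of L1 tokens. The EMA update
# is a simplification standing in for gradient-based training.

def update(probs, observed_token, lr=0.3):
    """Move the distribution toward a one-hot target for the observed token."""
    return {tok: (1 - lr) * p + lr * (1.0 if tok == observed_token else 0.0)
            for tok, p in probs.items()}

probs = {"chat": 1.0, "cat": 0.0}   # state after L1 (French) training
for _ in range(10):                 # continue on L2 (English) only
    probs = update(probs, "cat")

print(round(probs["chat"], 3))      # L1 token probability has collapsed
```

Replay buffers or regularization toward the L1 solution (as in continual-learning methods) are the standard countermeasures to exactly this dynamic.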

6. Core Insight & Analyst Perspective

Core Insight: This paper delivers a crucial, often overlooked truth: neural LMs are not magic multilingual learners; they are inefficient statistical memorizers whose "language acquisition" is heavily constrained by data distribution, architectural biases, and catastrophic forgetting. Their "positive transfer" mirrors human SLA only superficially, driven by overlapping statistical regularities rather than cognitive abstraction.

Logical Flow: The authors brilliantly deconstruct the LM language learning process into a controlled, human-analogous experiment (L1 pretrain → L2 exposure). This allows them to isolate variables like L1 typology and training regimen. The logical progression from exploring inductive biases (Sec 3) to measuring transfer effects (Sec 4) and finally diagnosing the learning process itself (Sec 5) is methodologically sound and revealing.

Strengths & Flaws: The study's strength is its rigorous, linguistics-grounded experimental design, moving beyond holistic metrics like perplexity. It provides granular, phenomenon-specific insights. However, its major flaw is scale. Using smaller, controlled data and model sizes is great for scientific isolation but limits direct applicability to today's frontier LLMs (GPT-4, Claude, Gemini) trained on trillion-token corpora. The observed effects might be amplified or diminished at scale. Furthermore, the analysis, while insightful, remains correlational; it doesn't pinpoint the mechanisms of transfer within the model's representations.

Actionable Insights: For practitioners, this research is a clarion call.

  • Curriculum design matters. Don't just dump parallel data; structured, monolingual-heavy L2 exposure might be more efficient initially, as hinted by the translation-pair slowdown.
  • Mind the linguistic distance. Transfer from Japanese to English will be harder than from German; allocate resources and set expectations accordingly.
  • Catastrophic forgetting is a real product risk. Deploying a model fine-tuned on a new language without safeguards can degrade its original capabilities, a critical consideration for multi-region AI products. Companies should invest in continual learning techniques inspired by works like "Continual Lifelong Learning with Neural Networks: A Review" (Parisi et al., 2019) to mitigate this.
  • For researchers, the paper lays a blueprint for more mechanistic interpretability work to understand how grammatical knowledge is encoded and transferred across linguistic boundaries within these models.

7. Technical Details & Mathematical Framework

The study likely employs a standard Masked Language Modeling (MLM) objective, as used in BERT. The core pretraining objective is to maximize the likelihood of reconstructing randomly masked tokens [MASK] given their context.

MLM Objective: For a sequence of tokens $X = (x_1, ..., x_T)$, a random subset of tokens (e.g., 15%) is masked, resulting in a corrupted sequence $\tilde{X}$. The model (parameterized by $\theta$) is trained to predict the original tokens at the masked positions:

$\mathcal{L}_{MLM}(\theta) = - \mathbb{E}_{X \sim \mathcal{D}} \sum_{i \in M} \log P_{\theta}(x_i | \tilde{X})$

where $M$ is the set of masked positions and $\mathcal{D}$ is the training data corpus (first L1, then L2).
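As a minimal numeric check of the objective: the loss is the summed negative log-probability the model assigns to the original tokens at masked positions. The predicted probabilities below are made up purely for illustration.

```python
import math

# Minimal numeric check of the MLM objective: sum of -log P(x_i | X~)
# over masked positions M. The probabilities here are invented.

def mlm_loss(pred_probs, targets, masked_positions):
    """pred_probs[i] maps token -> P_theta(token | corrupted sequence)."""
    return -sum(math.log(pred_probs[i][targets[i]]) for i in masked_positions)

targets = ["the", "cat", "sat"]
pred = {1: {"cat": 0.8, "dog": 0.2}}          # only position 1 was masked
loss = mlm_loss(pred, targets, masked_positions=[1])
print(round(loss, 4))                          # -log(0.8) ~ 0.2231
```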

Transfer Analysis Metric: The key evaluation metric is accuracy on the BLiMP benchmark. The analysis often involves comparing the performance delta ($\Delta Acc$) between an L1-pretrained model and a baseline model trained only on L2:

$\Delta Acc_{L1\rightarrow L2} = Acc_{Model(L1 + L2)} - Acc_{Model(L2\ only)}$

A positive $\Delta Acc$ indicates positive cross-lingual transfer.
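The transfer delta is trivial to compute; the sample accuracies below are invented for illustration, not results from the paper.

```python
# Delta-accuracy from the formula above; sample values are invented.

def transfer_delta(acc_l1_then_l2, acc_l2_only):
    """Positive value => positive cross-lingual transfer."""
    return acc_l1_then_l2 - acc_l2_only

delta = transfer_delta(acc_l1_then_l2=0.78, acc_l2_only=0.71)
print(f"{delta:+.2f}")
```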

8. Experimental Results & Chart Interpretation

While the provided PDF excerpt does not contain specific numerical charts, it describes the results that would typically be visualized:

  • Figure 1 (Conceptual Diagram): Illustrates the three-stage experimental pipeline: different L1 models (Fr, Ge, Ja, Ru) undergoing L1 pretraining, then exposure to L2 (English), followed by testing on the BLiMP benchmark.
  • Hypothetical Performance Curves: One would expect to see line graphs showing L2 (BLiMP) accuracy on the y-axis against L2 training epochs on the x-axis, with separate lines for each L1-pretrained model and an L2-only baseline. The curves for French and German models would likely rise faster and to a higher final plateau than Japanese and Russian models.
  • Hypothetical Bar Charts: Bar charts comparing final BLiMP accuracy across models for different grammatical phenomena (morphology, syntax, semantics). Bars for the L1-pretrained models would be taller than the baseline, with the height difference (transfer gain) being largest for morphology/syntax bars.
  • Forgetting Curve: A potential chart could show L1 task performance (y-axis) declining as L2 training epochs (x-axis) increase, demonstrating catastrophic interference.

9. Analysis Framework: Example Case

Scenario: Analyzing the transfer of knowledge about subject-verb agreement from French (L1) to English (L2).

Framework Application:

  1. Linguistic Alignment: Both French and English require subject-verb agreement in number (e.g., He walks / Il marche vs. They walk / Ils marchent). This structural similarity predicts high potential for positive transfer.
  2. Model Probing: After L1 pretraining, use a diagnostic classifier (probe) on the French model's hidden states to measure how well it represents the "agreement" feature. High accuracy indicates the feature is well-learned in L1.
  3. Transfer Measurement: After L2 training, evaluate the model on English agreement items in BLiMP (e.g., "The key to the cabinets is/*are...", where * marks the ungrammatical option). Compare accuracy to a model without French L1 knowledge.
  4. Attribution Analysis: Use techniques like attention visualization or gradient-based attribution to see if the model uses similar neural pathways/subnetworks for solving agreement in English as it did in French.

Expected Outcome: The French-pretrained model should show superior and faster acquisition of English agreement rules, and probing may show the reactivation of the "agreement-detection" subnetwork learned during French pretraining.
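The minimal-pair judgment in step 3 can be sketched as follows. A model is scored "correct" when it ranks the grammatical sentence above its ungrammatical counterpart; the toy unigram scorer below (with invented log-probabilities) stands in for a real masked-LM pseudo-log-likelihood.

```python
# Sketch of BLiMP-style minimal-pair evaluation. The unigram scorer
# with invented log-probs stands in for a masked-LM sentence score.

UNIGRAM_LOGPROB = {"the": -1.0, "key": -3.0, "is": -2.0, "are": -6.0}

def score(sentence):
    """Sum of toy per-token log-probabilities (OOV tokens get a floor)."""
    return sum(UNIGRAM_LOGPROB.get(tok, -10.0) for tok in sentence.split())

def minimal_pair_accuracy(pairs):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly."""
    return sum(score(good) > score(bad) for good, bad in pairs) / len(pairs)

pairs = [("the key is", "the key are")]
print(minimal_pair_accuracy(pairs))   # 1.0
```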

10. Future Applications & Research Directions

  • Efficient Multilingual Model Training: Informing data curation and training curricula for companies building LLMs for global markets (e.g., Meta, Google). Strategies could involve staged training starting with linguistically related language clusters.
  • Personalized Language Learning Tools: AI tutors that adapt explanations and exercises based on a learner's L1, anticipating specific transfer errors (e.g., warning a Japanese speaker about English articles).
  • Low-Resource Language NLP: Leveraging transfer from a related high-resource L1 to bootstrap models for extremely low-resource languages, a direction highlighted by research at institutions like the Allen Institute for AI.
  • Neurolinguistics & Cognitive Modeling: Using LMs as testable models of human language acquisition hypotheses, potentially refining theories like the Unified Competition Model.
  • Mitigating Catastrophic Forgetting: Developing more robust continual learning algorithms for LLMs, inspired by this study's observation of L1 degradation, ensuring stable multilingual capabilities.
  • Mechanistic Interpretability: A major future direction is to move beyond performance correlations and use advanced interpretability tools (like those from Anthropic's research or OpenAI's microscope efforts) to identify the exact circuits and features that are transferred or interfered with during L2 learning.

11. References

  1. Oba, M., Kuribayashi, T., Ouchi, H., & Watanabe, T. (2023). Second Language Acquisition of Neural Language Models. arXiv preprint arXiv:2306.02920.
  2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
  3. Chiswick, B. R., & Miller, P. W. (2004). Linguistic Distance: A Quantitative Measure of the Distance Between English and Other Languages. Journal of Multilingual and Multicultural Development, 26(1), 1-11.
  4. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54-71.
  5. Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., & Bowman, S. R. (2020). BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8, 377-392.
  6. Papadimitriou, I., & Jurafsky, D. (2020). Pretraining on Non-English Data Improves Cross-lingual Generalization. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics.