Overview

The Nepali GEC system is a two-stage pipeline: (1) a BERT-based Grammatical Error Detection (GED) model labels a sentence as correct or incorrect; (2) for incorrect inputs, a Masked Language Model (MLM) (MuRIL or NepBERTa) generates alternative sentences, which are then filtered by the GED model so that only suggestions it classifies as correct are surfaced. This design follows the process shown in the paper’s Fig. 1–2 and the thesis Figures 3.1–3.2; a sketch of the control flow follows.
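
Concretely, the pipeline can be sketched in a few lines of Python. The helper names `ged_is_correct` and `mlm_candidates` are hypothetical placeholders (fleshed out under the block-diagram sections below), not identifiers from the authors' code.

    def ged_is_correct(sentence: str) -> bool:
        """Stage 1 placeholder: the fine-tuned BERT GED classifier."""
        raise NotImplementedError  # see the GED sketch below

    def mlm_candidates(sentence: str) -> list[str]:
        """Stage 2 placeholder: MuRIL/NepBERTa masked-LM generation."""
        raise NotImplementedError  # see the GEC sketch below

    def correct_sentence(sentence: str) -> list[str]:
        # Stage 1: if the GED already labels the input correct, pass it through.
        if ged_is_correct(sentence):
            return [sentence]
        # Stage 2: generate MLM suggestions, then keep only those the
        # GED re-classifies as correct.
        return [c for c in mlm_candidates(sentence) if ged_is_correct(c)]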

  • Parallel pairs: 8,130,496
  • Error types: 7
  • Best GED accuracy (MuRIL): 91.15%
  • Classifier batch size: 256

Block Diagram — GED (Sequence Classification)

[Figure: GED flow diagram]
Mirrors paper Fig. 1 and thesis Figure 3.1; the classifier is BERT (MuRIL/NepBERTa) plus a dense layer predicting correct vs. incorrect.
Source: Figures described in the paper and thesis.
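
A minimal sketch of this classifier with Hugging Face Transformers, using the public google/muril-base-cased checkpoint as the encoder. The fine-tuned weights, the label order (here label 0 = correct), and the preprocessing are assumptions, not artifacts of the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # BERT encoder + 2-way classification head (correct vs. incorrect).
    tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "google/muril-base-cased", num_labels=2
    )
    model.eval()

    def ged_is_correct(sentence: str) -> bool:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape (1, 2)
        return logits.argmax(dim=-1).item() == 0  # assumed: label 0 = correct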

Block Diagram — GEC Engine (MLM + GED filter)

[Figure: GEC flow diagram]
Matches paper Fig. 2 and thesis Figure 3.2, including both masking strategies and the GED-based filtering.
Source: Figures described in the paper and thesis.
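
A sketch of the two masking strategies under the same assumptions (MuRIL checkpoint, whitespace splitting of the input into words, top-k fill-in); the actual system may differ in how it tokenizes and ranks candidates, and this naive version inserts subword pieces as-is.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
    mlm = AutoModelForMaskedLM.from_pretrained("google/muril-base-cased")
    mlm.eval()

    def mlm_candidates(sentence: str, top_k: int = 3) -> list[str]:
        words = sentence.split()
        variants = []  # (word list containing one [MASK], index of the mask)
        for i in range(len(words)):          # strategy (i): mask each token
            variants.append((words[:i] + [tok.mask_token] + words[i + 1:], i))
        for i in range(len(words) + 1):      # strategy (ii): mask each gap
            variants.append((words[:i] + [tok.mask_token] + words[i:], i))

        suggestions = []
        for toks, mask_idx in variants:
            inputs = tok(" ".join(toks), return_tensors="pt")
            # position of the [MASK] token in the subword sequence
            pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
            with torch.no_grad():
                logits = mlm(**inputs).logits[0, pos]
            for tid in logits.topk(top_k).indices.tolist():
                word = tok.decode([tid]).strip()  # naive: subwords kept as-is
                filled = toks[:mask_idx] + [word] + toks[mask_idx + 1:]
                suggestions.append(" ".join(filled))
        return suggestions

Every string in `suggestions` is then passed back through `ged_is_correct`; only hypotheses the GED classifies as correct are surfaced to the user.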

Corpus — Exact Statistics

The parallel corpus contains 8,130,496 source–target pairs across seven error types. Distribution:

Error Type                    Instances  Percent
Verb inflection               3,202,676   39.39%
Pronouns (subject missing)      316,393    3.89%
Sentence structure            1,001,038   12.31%
Auxiliary verb missing        1,031,388   12.69%
Main verb missing             1,031,274   12.68%
Punctuation errors            1,044,203   12.84%
Homophones                      503,524    6.20%
Total                         8,130,496     100%

Counts and percentages reproduced verbatim from the paper’s Table VIII.

Dataset splits used for GED fine-tuning

Split       Correct sentences  Incorrect sentences  Total sentences
Train               2,568,682            7,514,122       10,082,804
Validation            365,606              405,905          771,511

Training Setup (GED)

  • MuRIL: 1 epoch; AdamW (lr=5e-5, β1=0.9, β2=0.999, ε=1e-8); batch size 256; loss: cross-entropy (a training-loop sketch follows this list).
  • NepBERTa: 2 epochs; same optimizer and batch size.
  • Trainable parameters — MuRIL: 237,557,762; NepBERTa: 109,514,298.
  • Tokenizer vocabulary size — MuRIL: 197,285; NepBERTa: 30,523.
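
A sketch of the GED fine-tuning loop with the hyperparameters listed above. The dummy DataLoader stands in for the tokenized Table IX training split; everything about data handling, and the choice of torch.optim.AdamW rather than another AdamW implementation, is an assumption.

    import torch
    from torch.optim import AdamW
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "google/muril-base-cased", num_labels=2
    )
    # Hyperparameters as reported: lr=5e-5, betas=(0.9, 0.999), eps=1e-8.
    optimizer = AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-8)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Dummy tensors standing in for the tokenized training split.
    dataset = TensorDataset(
        torch.randint(1, 1000, (512, 32)),       # input_ids
        torch.ones(512, 32, dtype=torch.long),   # attention_mask
        torch.randint(0, 2, (512,)),             # labels
    )
    loader = DataLoader(dataset, batch_size=256)  # batch size from the paper

    model.train()
    for epoch in range(1):                        # MuRIL: 1 epoch
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            logits = model(input_ids=input_ids,
                           attention_mask=attention_mask).logits
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()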

Results (GED)

Model     Training Loss  Validation Loss  Accuracy
MuRIL            0.2427           0.2177    91.15%
NepBERTa         0.2776           0.3446    81.73%

Processing time for MLM suggestions grows roughly linearly with sentence length (see paper Fig. 3 and Fig. 4). This is expected: an n-word input yields n replacement masks plus n+1 insertion masks, i.e. 2n+1 hypotheses, each requiring one MLM forward pass.

Implementation Notes

  • Input cleaning: sentences are kept only if they contain 3–20 tokens; non-Devanagari characters are removed; English numerals are converted to Nepali; quote/parenthesis balance checks are applied (a cleaning sketch follows this list).
  • Error generation includes: suffix-swap verb inflections (via POS tagging + lemmatization), homophone substitutions, punctuation perturbations, word-swap structural errors, and subject/verb deletion.
  • Masking for correction: (i) replace each token in turn with [MASK]; (ii) insert a [MASK] into each inter-word gap; all resulting hypotheses are re-scored by the GED model.
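
A sketch of the cleaning filter implementing the rules above. The exact character set, the punctuation retained, and the None-on-reject convention are assumptions beyond what the bullet list states.

    import re

    # Map Western digits to Devanagari digits (०–९).
    TO_NEPALI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")

    def clean(sentence: str) -> str | None:
        sentence = sentence.translate(TO_NEPALI_DIGITS)  # English -> Nepali numerals
        # Keep Devanagari (U+0900–U+097F), whitespace, and basic punctuation.
        sentence = re.sub(r"[^\u0900-\u097F\s,?!'\"()]", "", sentence)
        tokens = sentence.split()
        if not 3 <= len(tokens) <= 20:                   # length filter
            return None
        if sentence.count("(") != sentence.count(")"):   # parenthesis check
            return None
        if sentence.count('"') % 2 != 0:                 # quote check
            return None
        return " ".join(tokens)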

Original PNG Diagrams (for reference)

[Image: GED flow PNG] (kept for parity with the thesis/paper figures)
[Image: GEC flow PNG] (kept for parity with the thesis/paper figures)

Models & Datasets (Hugging Face)

Resources

PDFs are bundled alongside this page for offline reading.

Citation

@inproceedings{aryal2024bert,
  title={BERT-Based Nepali Grammatical Error Detection and Correction Leveraging a New Corpus},
  author={Aryal, Sumit and Jaiswal, Anku},
  booktitle={2024 IEEE International Conference on Intelligent Signal Processing and Effective Communication Technologies (INSPECT)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}

Provenance & Alignment

The block diagrams above reproduce the flows from the paper’s Fig. 1–2 (GED and end-to-end GEC) and the thesis Figures 3.1–3.2. The corpus statistics table matches the paper’s Table VIII, the dataset splits match Table IX, and the results match Table XI.