Real Word Spelling Error Detection and Correction for Urdu Language

Aziz, Romila and Anwar, Muhammad Waqas and Jamal, Muhammad Hasan and Bajwa, Usama Ijaz and Kuc Castilla, Ángel Gabriel and Uc-Rios, Carlos (carlos.uc@unini.edu.mx) and Bautista Thompson, Ernesto (ernesto.bautista@unini.edu.mx) and Ashraf, Imran (2023) Real Word Spelling Error Detection and Correction for Urdu Language. IEEE Access. p. 1. ISSN 2169-3536

Real_Word_Spelling_Error_Detection_and_Correction_for_Urdu_Language.pdf
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Spelling errors are generally of two types: non-word errors and real-word errors. Non-word errors are misspelled words that do not exist in the lexicon, while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. The lexicon-based lookup approach is widely used for non-word errors, but it cannot handle real-word errors because they require contextual information. In contrast to English, real-word error detection and correction for low-resource languages such as Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use it to build a dataset for real-word errors comprising 125,562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84, 0.79, and 0.81, respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%.
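
The correction step described in the abstract (candidate generation by Damerau-Levenshtein distance, followed by n-gram ranking) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the optimal string alignment variant of the distance, a distance threshold of 1, and ranking by bigram counts with the preceding word only. The function names, the toy English lexicon, and the counts are all hypothetical and chosen for readability.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance, optimal string alignment variant.

    Counts insertions, deletions, substitutions, and transpositions
    of two adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def rank_candidates(prev_word, flagged, lexicon, bigrams, max_dist=1):
    """Return lexicon words near `flagged`, best candidate first.

    Assumption: candidates are ordered by how often they follow
    `prev_word` in the bigram counts, then by edit distance, then
    alphabetically to keep the order deterministic.
    """
    candidates = [
        w for w in lexicon
        if w != flagged and osa_distance(flagged, w) <= max_dist
    ]
    return sorted(
        candidates,
        key=lambda w: (-bigrams[(prev_word, w)], osa_distance(flagged, w), w),
    )


# Hypothetical toy data: "form" is a valid word used out of context.
lexicon = {"form", "from", "farm", "firm"}
bigrams = Counter({("letter", "from"): 5, ("letter", "form"): 1})
suggestions = rank_candidates("letter", "form", lexicon, bigrams)
```

Here `suggestions` lists "from" first because it is one transposition away from "form" and is the only candidate seen after "letter" in the toy counts. A fuller model would also use the following word and smoothed probabilities rather than raw counts.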

Item Type: Article
Uncontrolled Keywords: Real-word errors, spelling correction, spelling detection, spell checker
Subjects: Subjects > Engineering
Divisions: Europe University of Atlantic > Research > Scientific Production
Fundación Universitaria Internacional de Colombia > Research > Scientific Production
Ibero-american International University > Research > Scientific Production
Universidad Internacional do Cuanza > Research > Scientific Production
Depositing User: Sr Bibliotecario
Date Deposited: 14 Sep 2023 09:41
Last Modified: 14 Sep 2023 09:41
URI: http://repositorio.funiber.org/id/eprint/8800
