The Quranic Arabic Corpus

Welcome to the Quranic Arabic Corpus, an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology. A treebank is a linguistic resource which collects syntactic trees. These are manually annotated analyses of sentences which can be read by both humans and computers, with different treebanks adopting different theories of syntax. Most recent research in Arabic language computing focuses on modern standard Arabic, and the classical Arabic of the Quran remains relatively unexplored. Almost no attention has been given to traditional Arabic grammar, despite the many volumes written on the subject over the centuries.

The grammar section of the website provides a set of guidelines for annotators who wish to contribute to the project. The approach to syntax used is the traditional Arabic grammar known as iʿrāb (إعراب), which explains inflection and case endings by assigning syntactic functions and semantic roles to words. This is the natural approach to studying Arabic syntax, and it has a long tradition in Arabic linguistics, having been developed over a 1,000-year period. Traditional Arabic grammar is recognized as one of the origins of modern dependency grammar.
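The idea that iʿrāb assigns each word a syntactic function, and that this function accounts for its case ending, can be sketched as a small dependency structure. The following is only an illustration under stated assumptions: the class `Word`, the helpers `root` and `dependents`, the role labels and the transliterated example sentence are all hypothetical, and are not taken from the corpus or its tagset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One word in a dependency analysis (hypothetical structure)."""
    form: str             # transliterated word form
    role: str             # syntactic function (illustrative label)
    head: Optional[int]   # index of the governing word, None for the root
    case: Optional[str]   # case ending associated with the role, if any


# Illustrative sentence: "kataba al-waladu al-darsa" (the boy wrote the lesson).
# In traditional grammar the subject of a verb is nominative and the
# direct object is accusative; both depend on the verb.
SENTENCE: List[Word] = [
    Word("kataba", "verb", None, None),
    Word("al-waladu", "subject", 0, "nominative"),
    Word("al-darsa", "object", 0, "accusative"),
]


def root(words: List[Word]) -> int:
    """Return the index of the single word that has no head."""
    roots = [i for i, w in enumerate(words) if w.head is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root word")
    return roots[0]


def dependents(words: List[Word], head_index: int) -> List[Word]:
    """Return the words governed directly by the word at head_index."""
    return [w for w in words if w.head == head_index]
```

Calling `dependents(SENTENCE, root(SENTENCE))` returns the subject and the object, each carrying the case that its role implies. A real annotation scheme would need far richer labels than this sketch uses.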

Arabic Treebanks

Other Arabic treebank projects, which analyze different texts and follow different approaches to syntax, include:

The Penn Arabic Treebank
The source text is a collection of newswire articles. Annotators use part-of-speech and phrase tags adapted from the English Penn Treebank project (over 400 tags are used). The grammar framework followed is constituent phrase structure grammar.

The Prague Arabic Dependency Treebank
The same collection of newswire articles, but annotated using a dependency grammar instead of constituent phrase structure. The grammar framework used is a variation of dependency grammar called Functional Generative Description, originally developed in Prague in the 1960s.

The Columbia Arabic Treebank
Another re-annotation of the Penn Arabic Treebank newswire articles, but using a simplified dependency grammar which is closer to traditional Arabic grammar. The tagging scheme is designed to allow rapid annotation: the treebank uses only 6 part-of-speech tags and 8 dependency relation types.

Quranic Arabic Dependency Treebank (QADT)

The Quranic Arabic Corpus differs from other Arabic treebanks in three important ways:

The source text is a different form of Arabic. The language of the Holy Quran is considered to be classical Arabic, which differs from the modern standard Arabic in use today. As a central religious text, the Quran also belongs to a different genre from the newspaper articles annotated in other Arabic treebanks. Given the importance of the Quran to Islam worldwide, special care needs to be taken when annotating the text to ensure accuracy against accepted historical analyses, as syntactic analysis can imply meaning. Fortunately, there are numerous books which provide complete grammatical analysis of the Quran in Arabic using traditional grammar, and annotators are encouraged to use them to validate their annotations.
The text contains diacritics and is vowelized. The Arabic text of the Holy Quran contains explicit diacritic marks, and is hence fully vowelized. Modern standard Arabic is usually written without diacritics, so inflections and case endings are inferred by the reader instead of being made explicit in the orthography. Diacritics were first introduced into Arabic writing for the Quran, to reduce possible ambiguity in meaning and to preserve the oral tradition. Full vowelization simplifies morphological and syntactic analysis of the text.
The traditional grammar of iʿrāb (إعراب) is used. The grammar framework used to annotate the syntax of the Quran is traditional Arabic grammar, represented graphically using dependency graphs. The annotation and terminology used agree fully with existing historical grammatical analyses of the Quran. This contrasts with other Arabic treebanks, which follow different grammar frameworks.
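The second point above, about diacritics, can be made concrete with a short sketch that compares a vowelized word with its unvowelized form. It rests on assumptions: the helper names `strip_diacritics` and `list_diacritics` and the example word are hypothetical, and removing every Unicode nonspacing mark (category `Mn`) is a deliberately blunt rule that would also discard other marks found in Quranic orthography.

```python
import unicodedata
from typing import List

# The word "kitabun" (a book, nominative indefinite), written with
# explicit diacritics. Escapes are used so the code points are unambiguous.
VOWELIZED = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"


def strip_diacritics(text: str) -> str:
    """Drop all nonspacing combining marks, leaving only base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def list_diacritics(text: str) -> List[str]:
    """Return the Unicode names of the nonspacing marks, in text order."""
    return [unicodedata.name(ch) for ch in text
            if unicodedata.category(ch) == "Mn"]
```

Applied to `VOWELIZED`, `strip_diacritics` leaves only the four base letters, which is how the word would normally appear in modern text. `list_diacritics` shows what is lost: two short vowels and a final mark that records the case ending, which a reader or parser would otherwise have to infer from context.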
