Arabic has a rich morphology, and a single word can correspond to an entire sentence in English. For example, the Arabic word fajaʿalnāhum (فَجَعَلْنَٰهُمُ) found in verse (23:41) can be translated into the English sentence "and We made them". Such compact syntax is possible because the single word can be divided into 4 distinct morphological segments:
(23:41:4)
fajaʿalnāhum
and We made them
Fig 1. Morphological segmentation for word (23:41:4).
- fa - a prefixed conjunction ("and")
- jaʿal - the stem, a perfect past tense verb ("made") inflected as first person plural
- nā - a suffixed subject pronoun ("We")
- hum - a suffixed object pronoun ("them")
This single-word sentence has VSO (verb-subject-object) order. In general, Arabic is rather flexible with regard to word order, since case endings can be used to determine the role of each word in a sentence. Word order is typically used to emphasize different parts of a sentence. In the Quranic Arabic corpus, a part-of-speech tag has been assigned to each morphological segment that makes up a word. For example, the word above has 4 part-of-speech tags, one for each of its 4 segments:
- CONJ - conjunction
- V - verb
- PRON - pronoun (for the attached subject pronoun)
- PRON - pronoun (a second pronoun segment for the attached object pronoun)
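The segmentation above can be sketched as a small data structure. This is only an illustration: the `Segment` class and the helper functions are hypothetical names, not part of the corpus software, and the segment forms are the transliterations given in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One morphological segment of a word (hypothetical representation)."""
    form: str   # transliterated text of the segment
    pos: str    # part-of-speech tag assigned to the segment
    gloss: str  # English gloss


# Word (23:41:4), segmented as described in the text.
WORD_23_41_4 = [
    Segment("fa", "CONJ", "and"),
    Segment("jaʿal", "V", "made"),
    Segment("nā", "PRON", "We"),
    Segment("hum", "PRON", "them"),
]


def pos_tags(segments):
    """Return the part-of-speech tag of each segment, in order."""
    return [segment.pos for segment in segments]


def surface_form(segments):
    """Join the segment forms back into a single transliterated word."""
    return "".join(segment.form for segment in segments)


print(pos_tags(WORD_23_41_4))      # ['CONJ', 'V', 'PRON', 'PRON']
print(surface_form(WORD_23_41_4))
```

Keeping one tag per segment, rather than one tag per word, mirrors the annotation scheme described here: the word as a whole has no single part of speech.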
Although multiple segments can be fused together into a single word, usually only one segment is identified as the stem. Any segments preceding the stem are prefixes and any segments following the stem are suffixes. Prefix and suffix segments are optional, while the stem segment is the unmodified form of the word. Occasionally a word will have two stems, such as the contraction عَن + مَا = عَمَّ:
Fig 2. A contraction of two stems in word (78:1:1).
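The prefix/stem/suffix rule can be sketched as follows. The function `split_affixes` is a hypothetical helper, and it assumes that the caller already knows which segments are stems and that, when there are two stems, they are adjacent (as in the contraction above).

```python
def split_affixes(segments, stem_indices):
    """Split a word's segments into (prefixes, stems, suffixes).

    segments:     list of segment strings in word order
    stem_indices: set of positions marked as stems (assumed to be adjacent)
    """
    if not stem_indices:
        raise ValueError("a word needs at least one stem segment")
    first, last = min(stem_indices), max(stem_indices)
    prefixes = segments[:first]          # everything before the first stem
    stems = segments[first:last + 1]     # one stem, or two for a contraction
    suffixes = segments[last + 1:]       # everything after the last stem
    return prefixes, stems, suffixes


# Word (23:41:4): one stem at position 1.
print(split_affixes(["fa", "jaʿal", "nā", "hum"], {1}))

# A two-stem contraction: no prefixes or suffixes at all.
print(split_affixes(["ʿan", "mā"], {0, 1}))
```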
Prefixes
As well as part-of-speech tags, multiple inflection features are assigned to each morphological segment, for example features for person, gender and number. The features for prefixes end in + and are shown in figures 3 to 7 below. In contrast, features for suffixes start with +.
Feature | Name | Segment part-of-speech / description |
Al+ | determiner (al) | DET – determiner prefix ("the") |
bi+ | preposition (bi) | P – preposition prefix ("by", "with", "in") |
ka+ | preposition (ka) | P – preposition prefix ("like" or "thus") |
ta+ | preposition (ta) | P – particle of oath prefix used as a preposition ("by Allah") |
sa+ | future particle (sa) | FUT – prefixed particle indicating the future ("they will") |
ya+ | vocative particle (yā) | VOC – a vocative prefix usually translated as "O" |
ha+ | vocative particle (hā) | VOC – a vocative prefix usually translated as "Lo!" |
Fig 3. Features identifying prefixed segments.
Feature | Name | Segment part-of-speech / description |
A:INTG+ | interrogative particle (alif) | INTG – prefixed interrogative particle ("is?", "did?", "do?") |
A:EQ+ | equalization particle (alif) | EQ – prefixed equalization particle ("whether") |
Fig 4. Features identifying the particle alif as a prefix.
Feature | Name | Segment part-of-speech / description |
w:CONJ+ | conjunction (wa) | CONJ – conjunction prefix ("and") |
w:REM+ | resumption (wa) | REM – resumption prefix ("then" or "so") |
w:CIRC+ | circumstantial (wa) | CIRC – circumstantial prefix ("while") |
w:SUP+ | supplemental (wa) | SUP – supplemental prefix ("then" or "so") |
w:P+ | preposition (wa) | P – particle of oath prefix used as a preposition ("by the pen") |
w:COM+ | comitative (wa) | COM – comitative prefix ("with") |
Fig 5. Features identifying the particle wāw as a prefix.
Feature | Name | Segment part-of-speech / description |
f:REM+ | resumption (fa) | REM – resumption prefix ("then" or "so") |
f:CONJ+ | conjunction (fa) | CONJ – conjunction prefix ("and") |
f:RSLT+ | result (fa) | RSLT – result prefix ("then") |
f:SUP+ | supplemental (fa) | SUP – supplemental prefix ("then" or "so") |
f:CAUS+ | cause (fa) | CAUS – cause prefix ("then" or "so") |
Fig 6. Features identifying the particle fa as a prefix.
Feature | Name | Segment part-of-speech / description |
l:P+ | preposition (lām) | P – the letter lām as a prefixed preposition |
l:EMPH+ | emphasis (lām) | EMPH – the letter lām as a prefixed particle used to give emphasis |
l:PRP+ | purpose (lām) | PRP – the letter lām as a prefixed particle used to indicate purpose |
l:IMPV+ | imperative (lām) | IMPV – the letter lām as a prefixed particle used to form an imperative |
Fig 7. Features identifying the particle lām as a prefix.
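The convention that prefix features end in + and suffix features start with + lends itself to a simple check. The sketch below is illustrative only; `affix_kind` and `parse_prefix` are hypothetical helper names, and the letter:TAG split is inferred from the feature names in figures 4 to 7.

```python
def affix_kind(feature):
    """Classify a feature string by the position of its '+' marker."""
    if feature.endswith("+") and not feature.startswith("+"):
        return "prefix"
    if feature.startswith("+") and not feature.endswith("+"):
        return "suffix"
    return "other"


def parse_prefix(feature):
    """Split a prefix feature into (letter, tag).

    'w:CONJ+' gives ('w', 'CONJ'); a feature without a colon,
    such as 'bi+', gives ('bi', None).
    """
    if affix_kind(feature) != "prefix":
        raise ValueError(f"not a prefix feature: {feature!r}")
    body = feature[:-1]                      # drop the trailing '+'
    letter, colon, tag = body.partition(":")
    return letter, (tag if colon else None)


print(affix_kind("Al+"))        # prefix
print(affix_kind("+VOC"))       # suffix
print(parse_prefix("f:RSLT+"))  # ('f', 'RSLT')
print(parse_prefix("bi+"))      # ('bi', None)
```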
Roots and Lemmas
In Arabic and other Semitic languages such as Hebrew, similar words may be grouped together according to a root. This is a sequence of typically 3 or 4 consonants (known as radicals) which together form a triliteral or quadriliteral root. From a single root a wide variety of words may be formed, with distinct yet related meanings. For example from the triliteral root kāf tā bā (ك ت ب) the verb "write" may be formed, as well as its derivatives in Arabic including "writing", "book", "author", "library" and "office".
The concept of a lemma is also used to group similar words together at a finer level of granularity than a root. The lemma groups word-forms that differ only by inflectional (as opposed to derivational) morphology, and do not vary in meaning. Unlike the root, the lemma is an actual word selected to represent the group and is typically the same word as used in dictionary headings. A third feature used to group words together is the SP (special) feature. Certain groups of verbs and particles have special rules in Arabic grammar with regard to case endings and syntactic roles.
Both roots and lemmas are used in the Quranic Arabic corpus so that words may be easily grouped together to form an electronic lexicon of the Quran in classical Arabic. For verbs, only the root (not lemma) is indicated, since the remaining morphological features are sufficient to determine the final form of the verb. Nouns, proper nouns and adjectives have both a root and a lemma. Other parts of speech such as particles only have lemmas (not roots) indicated, since these fall outside of the root + template paradigm. The following table lists the morphological features used to group similar words together. These features make use of extended Buckwalter transliteration:
Feature | Name | Description |
ROOT: | root | Indicates the (usually triliteral) root of a word, for example ROOT:ktb |
LEM: | lemma | Specifies the common lemma for a group of words, for example LEM:kitaAb |
SP: | special | Indicates that the word belongs to a special group, for example SP:<in~ |
Fig 8. Root and lemma features.
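As an illustration of how the grouping features might be read, the sketch below extracts ROOT:, LEM: and SP: values from a list of feature strings. The list-of-strings input and the function name `grouping_features` are assumptions made for this example; the values remain in extended Buckwalter transliteration.

```python
GROUPING_KEYS = ("ROOT", "LEM", "SP")


def grouping_features(features):
    """Collect the ROOT, LEM and SP values found in a list of feature strings."""
    found = {}
    for feature in features:
        key, colon, value = feature.partition(":")
        if colon and key in GROUPING_KEYS:
            found[key] = value
    return found


# A nominal carries both a root and a lemma.
print(grouping_features(["ROOT:ktb", "LEM:kitaAb", "NOM"]))
# {'ROOT': 'ktb', 'LEM': 'kitaAb'}

# A word belonging to a special group.
print(grouping_features(["SP:<in~"]))
# {'SP': '<in~'}
```

A lexicon could then be built by using the extracted root or lemma as a dictionary key and appending each word to the matching group.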
Person, Gender and Number
In Arabic, words may inflect for person, gender and number. Unlike in English, words inflect not only for singular and plural but also for the dual. For example, there is a distinct word-form to represent "two books". In the Quranic Arabic corpus, the features for person, gender and number are combined using a concatenative notation. For example, 3MS represents third person, masculine, singular. Similarly, 2D represents second person, dual. The concept of gender in Arabic grammar may refer to semantic, morphemic or grammatical gender (see the grammar of gender).
Feature | Arabic Name | Values | Description |
person | الاسناد | 1, 2, 3 | first person, second person, third person |
gender | الجنس | M, F | masculine, feminine |
number | العدد | S, D, P | singular, dual, plural |
Fig 9. Features for person, gender and number.
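The concatenative notation can be decoded by reading the code left to right, since the three value sets in figure 9 do not overlap. The following is a minimal sketch; `parse_pgn` is a hypothetical function name, and any component missing from the code is reported as `None`.

```python
PERSON = {"1": "first", "2": "second", "3": "third"}
GENDER = {"M": "masculine", "F": "feminine"}
NUMBER = {"S": "singular", "D": "dual", "P": "plural"}


def parse_pgn(code):
    """Decode a code such as '3MS' or '2D' into person, gender and number."""
    if not code:
        raise ValueError("empty person/gender/number code")
    rest = code
    person = gender = number = None
    # Each component is optional but must appear in this fixed order.
    if rest and rest[0] in PERSON:
        person, rest = PERSON[rest[0]], rest[1:]
    if rest and rest[0] in GENDER:
        gender, rest = GENDER[rest[0]], rest[1:]
    if rest and rest[0] in NUMBER:
        number, rest = NUMBER[rest[0]], rest[1:]
    if rest:
        raise ValueError(f"unexpected characters in code: {code!r}")
    return {"person": person, "gender": gender, "number": number}


print(parse_pgn("3MS"))
# {'person': 'third', 'gender': 'masculine', 'number': 'singular'}
print(parse_pgn("2D"))
# {'person': 'second', 'gender': None, 'number': 'dual'}
```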
Verb Features
The morphological features discussed in this section apply to verbs as well as to their derivatives: the active participle, passive participle and the verbal noun. An important verb feature is the aspect. This is closely related to but distinct from the concept of tense. In Quranic Arabic the aspect of a verb is either perfect, imperfect, or imperative. The perfect roughly corresponds to the past tense in English although there is a distinction: the perfect refers to actions which have been completed. In addition to aspect, verbs in Quranic Arabic are conjugated for mood. Imperfect verbs may be found in the indicative, subjunctive and jussive moods. The indicative mood is the normal "default" mood so that if the mood feature is not tagged then the verb should be considered to be in the indicative mood.
The two other features used for verbs and their derivatives are voice (active or passive) and form. The active voice is the default and if not indicated a verb should be considered to be in the active voice. Verb forms are indicated using roman numerals, as found in Arabic dictionaries, so that (IV) represents a fourth form verb.
Feature | Arabic Name | Description |
PERF | فعل ماض | Perfect verb |
IMPF | فعل مضارع | Imperfect verb |
IMPV | فعل أمر | Imperative verb |
Fig 10. Aspect features.
Feature | Arabic Name | Description |
IND | مرفوع | Indicative mood (default) |
SUBJ | منصوب | Subjunctive mood |
JUS | مجزوم | Jussive mood |
Fig 11. Mood features.
Feature | Arabic Name | Description |
ACT | مبني للمعلوم | Active voice (default) |
PASS | مبني للمجهول | Passive voice |
Fig 12. Voice features.
Feature | Description |
I | First form (default) |
II | Second form |
III | Third form |
IV | Fourth form |
V | Fifth form |
VI | Sixth form |
VII | Seventh form |
VIII | Eighth form |
IX | Ninth form |
X | Tenth form |
XI | Eleventh form |
XII | Twelfth form |
Fig 13. Verb form features.
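The defaults described above (indicative mood, active voice, first form) can be made explicit when reading a verb's tags. The sketch below assumes the tags arrive as a list of strings and accepts verb forms with or without parentheses; `verb_features` is a hypothetical helper, not part of the corpus software.

```python
ASPECTS = {"PERF", "IMPF", "IMPV"}
MOODS = {"IND", "SUBJ", "JUS"}
VOICES = {"ACT", "PASS"}
FORMS = ["I", "II", "III", "IV", "V", "VI",
         "VII", "VIII", "IX", "X", "XI", "XII"]


def verb_features(tags):
    """Resolve aspect, mood, voice and form, filling in the documented defaults."""
    # Untagged mood, voice and form fall back to IND, ACT and I respectively.
    result = {"aspect": None, "mood": "IND", "voice": "ACT", "form": "I"}
    for tag in tags:
        form = tag.strip("()")  # accept both 'IV' and '(IV)'
        if tag in ASPECTS:
            result["aspect"] = tag
        elif tag in MOODS:
            result["mood"] = tag
        elif tag in VOICES:
            result["voice"] = tag
        elif form in FORMS:
            result["form"] = form
    return result


print(verb_features(["PERF"]))
# {'aspect': 'PERF', 'mood': 'IND', 'voice': 'ACT', 'form': 'I'}
print(verb_features(["IMPF", "(IV)", "PASS", "JUS"]))
# {'aspect': 'IMPF', 'mood': 'JUS', 'voice': 'PASS', 'form': 'IV'}
```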
Derived Nouns
In Quranic Arabic, the active participle, passive participle and verbal noun are three types of nominals which are derived directly from verbs. In the Quranic Arabic corpus these are tagged with the noun or adjective part-of-speech tag and include one of three possible derivation features. For example, active participles are tagged in the corpus as POS:N ACT PCPL. The verbal features above that apply to verbs also apply to derived nouns (aspect, mood, voice and form) and are used to indicate the morphology of the original verb that the noun was derived from. Figure 14 below shows the derivation features used to indicate the type of a derived noun:
Feature | Arabic Name | Description |
ACT PCPL | اسم فاعل | Active participle |
PASS PCPL | اسم مفعول | Passive participle |
VN | مصدر | Verbal noun |
Fig 14. Derivation features.
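A derivation feature can be recognized in a space-separated tag string such as POS:N ACT PCPL. The sketch below is an assumption-laden illustration: `derivation_of` is a hypothetical name, and the string "POS:N VN" is an invented example that follows the same pattern as the one given in the text.

```python
DERIVATIONS = {
    "ACT PCPL": "active participle",
    "PASS PCPL": "passive participle",
    "VN": "verbal noun",
}


def derivation_of(tag_string):
    """Return the derivation type named in a tag string, or None if absent."""
    tokens = tag_string.split()
    for i, token in enumerate(tokens):
        if token == "VN":
            return DERIVATIONS["VN"]
        # PCPL only counts when preceded by its voice marker.
        if token == "PCPL" and i > 0 and tokens[i - 1] in ("ACT", "PASS"):
            return DERIVATIONS[tokens[i - 1] + " PCPL"]
    return None


print(derivation_of("POS:N ACT PCPL"))  # active participle
print(derivation_of("POS:N VN"))        # verbal noun
print(derivation_of("POS:N"))           # None
```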
Nominal Features
The feature Al+ is used to denote the prefixed determiner al ("the") attached to nominals (nouns, proper nouns and adjectives). In Arabic there is no indefinite article ("a"/"an" in English). Instead tanwīn is used and diacritics are attached to the end of a word to mark it as indefinite. The features DEF and INDEF are used to indicate the state of a noun as definite or as indefinite respectively (see figure 15 below). Nominals may be found in one of three grammatical cases: the nominative case, the accusative case, and the genitive case (see figure 16):
Feature | Arabic Name | Description |
DEF | معرفة | Definite state |
INDEF | نكرة | Indefinite state |
Fig 15. State features.
Feature | Arabic Name | Description |
NOM | مرفوع | Nominative case |
ACC | منصوب | Accusative case |
GEN | مجرور | Genitive case |
Fig 16. Case features.
Suffixes
In the Quranic Arabic Corpus, three features are used to indicate suffixes. These are attached pronouns, the vocative suffix and the nūn of emphasis. The vocative suffix is denoted by the morphological feature +VOC and is used only with the word allāh to produce the vocative word-form allāhumma. The morphological feature +n:EMPH is used to denote the emphatic usage of nūn as an attached suffix.
Attached pronoun suffixes are identified using the PRON: compound morphological feature. Pronouns attached to nouns are possessive pronouns, and when attached to verbs they are either subject or object pronouns. An attached pronoun may inflect for person, gender and number. A concatenative notation is used with the PRON: tag. For example PRON:3MS represents a third person masculine singular suffixed pronoun. Similarly PRON:2D represents a second person dual suffixed pronoun. See figure 9 above for person, gender and number features.
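The PRON: compound feature can be matched with a regular expression, and the role of the pronoun follows from the part of speech of the word it attaches to. This is a sketch only; `parse_pron` and `pronoun_role` are hypothetical helpers, and the host tags N and V are the noun and verb tags used earlier in this document.

```python
import re

# Optional person, gender and number, in that order, after the PRON: prefix.
PRON_PATTERN = re.compile(r"^PRON:([123])?([MF])?([SDP])?$")


def parse_pron(feature):
    """Return (person, gender, number) codes from a feature such as 'PRON:3MS'."""
    match = PRON_PATTERN.match(feature)
    if match is None or not any(match.groups()):
        raise ValueError(f"not a valid attached pronoun feature: {feature!r}")
    return match.groups()


def pronoun_role(host_pos):
    """Describe the role of an attached pronoun given its host's part of speech."""
    if host_pos == "N":
        return "possessive"
    if host_pos == "V":
        return "subject or object"
    return "unknown"


print(parse_pron("PRON:3MS"))  # ('3', 'M', 'S')
print(parse_pron("PRON:2D"))   # ('2', None, 'D')
print(pronoun_role("N"))       # possessive
```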