JQuranTree supports Unicode deserialization so that the Uthmani Script can be loaded into the orthography model. The decoding process is reversible, and this is tested by the round-trip method: a Unicode encoder serializes the orthography model back into Unicode, and tests check that the original character data is recovered.
Decoding Unicode Character Data
Since the Unicode encoder and decoder need only support the Uthmani Script, it is sufficient that they handle only the characters found in the XML document for that script. The orthography model represents the Quranic text as a sequence of ArabicCharacters. Each character has a CharacterType and zero or more DiacriticTypes. Orthographic characters include letters and other Quranic symbols. The diacritics are tashkīl (vowels).
Unicode decoding is performed using table lookup. For each Unicode character in the Uthmani Script, the orthographic character type and diacritic type are looked up (see Fig 1. below). A sequence of several Unicode characters may be decoded as a single orthographic ArabicCharacter. If table lookup results in a character type, then a new orthographic character is formed, carrying any diacritic type that the same table entry supplies (as for U+0623, which yields Alif with HamzaAbove). Otherwise, if the lookup results in only a diacritic type, then the diacritic is combined with the previous orthographic character.
UNICODE | | ORTHOGRAPHY MODEL | |
Decimal | Hex | Character | Diacritic |
1569 | U+0621 | Hamza | - |
1571 | U+0623 | Alif | HamzaAbove |
1572 | U+0624 | Waw | HamzaAbove |
1573 | U+0625 | Alif | HamzaBelow |
1574 | U+0626 | Ya | HamzaAbove |
1575 | U+0627 | Alif | - |
1576 | U+0628 | Ba | - |
1577 | U+0629 | TaMarbuta | - |
1578 | U+062A | Ta | - |
1579 | U+062B | Tha | - |
1580 | U+062C | Jeem | - |
1581 | U+062D | HHa | - |
1582 | U+062E | Kha | - |
1583 | U+062F | Dal | - |
1584 | U+0630 | Thal | - |
1585 | U+0631 | Ra | - |
1586 | U+0632 | Zain | - |
1587 | U+0633 | Seen | - |
1588 | U+0634 | Sheen | - |
1589 | U+0635 | Sad | - |
1590 | U+0636 | DDad | - |
1591 | U+0637 | TTa | - |
1592 | U+0638 | DTha | - |
1593 | U+0639 | Ain | - |
1594 | U+063A | Ghain | - |
1600 | U+0640 | Tatweel | - |
1601 | U+0641 | Fa | - |
1602 | U+0642 | Qaf | - |
1603 | U+0643 | Kaf | - |
1604 | U+0644 | Lam | - |
1605 | U+0645 | Meem | - |
1606 | U+0646 | Noon | - |
1607 | U+0647 | Ha | - |
1608 | U+0648 | Waw | - |
1609 | U+0649 | AlifMaksura | - |
1610 | U+064A | Ya | - |
1611 | U+064B | - | Fathatan |
1612 | U+064C | - | Dammatan |
1613 | U+064D | - | Kasratan |
1614 | U+064E | - | Fatha |
1615 | U+064F | - | Damma |
1616 | U+0650 | - | Kasra |
1617 | U+0651 | - | Shadda |
1618 | U+0652 | - | Sukun |
1619 | U+0653 | - | Maddah |
1620 | U+0654 | - | HamzaAbove |
1648 | U+0670 | Alif | AlifKhanjareeya |
1649 | U+0671 | Alif | HamzatWasl |
1756 | U+06DC | SmallHighSeen | - |
1759 | U+06DF | SmallHighRoundedZero | - |
1760 | U+06E0 | SmallHighUprightRectangularZero | - |
1762 | U+06E2 | SmallHighMeemIsolatedForm | - |
1763 | U+06E3 | SmallLowSeen | - |
1765 | U+06E5 | SmallWaw | - |
1766 | U+06E6 | SmallYa | - |
1768 | U+06E8 | SmallHighNoon | - |
1770 | U+06EA | EmptyCentreLowStop | - |
1771 | U+06EB | EmptyCentreHighStop | - |
1772 | U+06EC | RoundedHighStopWithFilledCentre | - |
1773 | U+06ED | SmallLowMeem | - |
Fig 1. Unicode decoding table.
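The table lookup described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not JQuranTree's actual implementation: the class `DecoderSketch` and its nested types `CharType`, `Diacritic`, `Entry` and `Letter` are hypothetical stand-ins for the orthography model, and only a handful of rows from Fig 1 are included.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the nested types are hypothetical stand-ins,
// not the real JQuranTree classes.
class DecoderSketch {
    enum CharType { Hamza, Alif, Ba, Lam }
    enum Diacritic { HamzaAbove, Fatha, Kasra, Shadda }

    // One row of the decoding table; either field may be null.
    static final class Entry {
        final CharType type;
        final Diacritic diacritic;
        Entry(CharType type, Diacritic diacritic) {
            this.type = type;
            this.diacritic = diacritic;
        }
    }

    // One decoded orthographic character together with its diacritics.
    static final class Letter {
        final CharType type;
        final EnumSet<Diacritic> diacritics = EnumSet.noneOf(Diacritic.class);
        Letter(CharType type) { this.type = type; }
    }

    // A few rows copied from Fig 1; a full decoder would hold every row.
    static final Map<Character, Entry> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0621', new Entry(CharType.Hamza, null));
        TABLE.put('\u0623', new Entry(CharType.Alif, Diacritic.HamzaAbove));
        TABLE.put('\u0627', new Entry(CharType.Alif, null));
        TABLE.put('\u0628', new Entry(CharType.Ba, null));
        TABLE.put('\u0644', new Entry(CharType.Lam, null));
        TABLE.put('\u064E', new Entry(null, Diacritic.Fatha));
        TABLE.put('\u0650', new Entry(null, Diacritic.Kasra));
        TABLE.put('\u0651', new Entry(null, Diacritic.Shadda));
        TABLE.put('\u0654', new Entry(null, Diacritic.HamzaAbove));
    }

    static List<Letter> decode(String text) {
        List<Letter> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Entry e = TABLE.get(c);
            if (e == null) {
                throw new IllegalArgumentException(
                        String.format("Not in table: U+%04X", (int) c));
            }
            if (e.type != null) {
                // A character type starts a new orthographic character.
                Letter letter = new Letter(e.type);
                if (e.diacritic != null) {
                    letter.diacritics.add(e.diacritic);
                }
                out.add(letter);
            } else {
                // A diacritic-only entry attaches to the previous character.
                if (out.isEmpty()) {
                    throw new IllegalArgumentException("Diacritic with no base character");
                }
                out.get(out.size() - 1).diacritics.add(e.diacritic);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four code points (U+0623, U+064E, U+0628, U+0651) decode to two characters.
        for (Letter letter : decode("\u0623\u064E\u0628\u0651")) {
            System.out.println(letter.type + " " + letter.diacritics);
        }
    }
}
```

In this sketch the input of four Unicode characters yields two orthographic characters, which mirrors the statement that several Unicode characters may decode to a single ArabicCharacter. Rejecting unknown characters with an exception is a design choice of the sketch, not something the text above specifies.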
Algorithm for Encoding into Unicode
Any given subset of the orthographic model could have multiple representations in Unicode. There are two reasons: Unicode allows combining marks to be ordered arbitrarily, and certain combinations of letters and diacritics (e.g. alif and hamza) can be represented as a single Unicode character.
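Both kinds of ambiguity can be observed with the JDK's standard `java.text.Normalizer`. The sketch below is only an illustration of Unicode canonical equivalence; the text above does not say that JQuranTree uses the normalizer, and the class name `EquivalenceSketch` and its helper method are made up for this example.

```java
import java.text.Normalizer;

// Illustrative sketch only: shows that distinct code point sequences can be
// canonically equivalent. Not part of JQuranTree.
class EquivalenceSketch {
    // True when both strings have the same canonical decomposition (NFD).
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // Alif with hamza above: one precomposed character, or letter plus mark.
        String precomposed = "\u0623";
        String decomposed = "\u0627\u0654";
        System.out.println(precomposed.equals(decomposed));
        System.out.println(canonicallyEquivalent(precomposed, decomposed));

        // Ba with shadda and fatha, with the two marks in either order.
        String shaddaFirst = "\u0628\u0651\u064E";
        String fathaFirst = "\u0628\u064E\u0651";
        System.out.println(shaddaFirst.equals(fathaFirst));
        System.out.println(canonicallyEquivalent(shaddaFirst, fathaFirst));
    }
}
```

In each pair the two strings differ as raw character sequences yet compare as canonically equivalent, which is why an encoder needs a fixed convention if its output is to match the original text character for character.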
The encoding algorithm used by JQuranTree was chosen so that round-trip testing is possible, i.e. the Unicode serialization used is exactly reversible. Given the Tanzil XML, deserializing into the orthographic model and then reserializing recovers the original sequence of Unicode characters. The Unicode encoding algorithm is shown in Fig 2. below:
For each ArabicCharacter:
Step 1. If the letter or Quranic symbol has a diacritic that forms a well-known combination, then map the pair onto a single Unicode character. If Hamza above was the diacritic used, then remove it from the list of diacritics still to be considered. The 6 well-known combinations are:
- Alif/Waw/Ya + Hamza above
- Alif + Hamza below
- Alif + Hamzat wasl
- Alif + Khanjareeya (superscript Alif)
Step 2. If Step 1 did not apply, then use the decoding table (Fig 1. above) to determine the Unicode character to use for the letter or Quranic symbol, without its diacritics.
Step 3. Use the decoding table to form Unicode characters out of any remaining diacritics, in the following order:
- Hamza above
- Shadda
- Fathatan
- Dammatan
- Kasratan
- Fatha
- Damma
- Kasra
- Sukun
- Maddah
Fig 2. Unicode encoding algorithm.
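The three steps of Fig 2 can be sketched in Java as shown below. This is a hedged illustration rather than JQuranTree's own encoder: `EncoderSketch`, `CharType`, `Diacritic` and `encode` are hypothetical names, only four letters are modelled, and the order in which the Step 1 combinations are checked is an assumption, since the algorithm does not say what happens when more than one could apply.

```java
import java.util.EnumSet;

// Illustrative sketch only: hypothetical stand-in types, not the JQuranTree API.
class EncoderSketch {
    enum CharType {
        Alif('\u0627'), Ba('\u0628'), Waw('\u0648'), Ya('\u064A');
        final char plain; // Unicode character for the bare letter (Fig 1)
        CharType(char plain) { this.plain = plain; }
    }

    // Declared in the Step 3 output order, so iterating an EnumSet follows it.
    // The last three occur only inside Step 1 combinations ('\0' means no mark).
    enum Diacritic {
        HamzaAbove('\u0654'), Shadda('\u0651'), Fathatan('\u064B'),
        Dammatan('\u064C'), Kasratan('\u064D'), Fatha('\u064E'),
        Damma('\u064F'), Kasra('\u0650'), Sukun('\u0652'), Maddah('\u0653'),
        HamzaBelow('\0'), HamzatWasl('\0'), AlifKhanjareeya('\0');
        final char mark;
        Diacritic(char mark) { this.mark = mark; }
    }

    static String encode(CharType type, EnumSet<Diacritic> diacritics) {
        EnumSet<Diacritic> remaining = EnumSet.copyOf(diacritics);
        StringBuilder sb = new StringBuilder();
        boolean takesHamzaAbove =
                type == CharType.Alif || type == CharType.Waw || type == CharType.Ya;

        // Step 1: a well-known combination maps onto a single Unicode character.
        if (type == CharType.Alif && remaining.contains(Diacritic.HamzaBelow)) {
            sb.append('\u0625');
        } else if (type == CharType.Alif && remaining.contains(Diacritic.HamzatWasl)) {
            sb.append('\u0671');
        } else if (type == CharType.Alif && remaining.contains(Diacritic.AlifKhanjareeya)) {
            sb.append('\u0670');
        } else if (takesHamzaAbove && remaining.remove(Diacritic.HamzaAbove)) {
            // Hamza above is consumed here so that Step 3 does not emit it again.
            sb.append(type == CharType.Alif ? '\u0623'
                    : type == CharType.Waw ? '\u0624' : '\u0626');
        } else {
            // Step 2: no combination applied, so emit the bare letter.
            sb.append(type.plain);
        }

        // Step 3: emit the remaining diacritics in the fixed order.
        for (Diacritic d : remaining) {
            if (d.mark != '\0') {
                sb.append(d.mark);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = encode(CharType.Ba, EnumSet.of(Diacritic.Fatha, Diacritic.Shadda));
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Declaring the diacritics in the Step 3 order lets the `EnumSet` iteration produce the fixed output order without an explicit sort, so Ba with fatha and shadda is written with the shadda first regardless of the order in which the diacritics were added.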