
Java API - Longest Token Example


The example below displays the longest tokens in the Quran. The program measures token length with the getLetterCount accessor, which counts letters only and so excludes Quranic symbols that the orthography model does not classify as letters. The program has two steps. In step 1, an analysis table is used for frequency analysis, counting how many tokens there are of each length. In step 2, a second analysis table lists every occurrence of the longest tokens.
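The two-step pattern described above can be sketched without the library, using only standard Java collections. This is an illustration under stated assumptions, not the JQuranTree API: the class name TwoStepSketch, both helper methods and the four sample words are hypothetical, and String.length() stands in for getLetterCount.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStepSketch {

    // Step 1: count how many words there are of each length (length -> count).
    static TreeMap<Integer, Integer> lengthHistogram(List<String> words) {
        TreeMap<Integer, Integer> histogram = new TreeMap<>();
        for (String word : words) {
            histogram.merge(word.length(), 1, Integer::sum);
        }
        return histogram;
    }

    // Step 2: collect every word whose length equals the given maximum.
    static List<String> longestWords(List<String> words, int maxLength) {
        List<String> longest = new ArrayList<>();
        for (String word : words) {
            if (word.length() == maxLength) {
                longest.add(word);
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // Hypothetical sample input (letters-only transliterations).
        List<String> words = List.of("bsm", "Allh", "AlrHmn", "AlrHym");

        // Display the histogram with the longest lengths first.
        TreeMap<Integer, Integer> histogram = lengthHistogram(words);
        for (Map.Entry<Integer, Integer> row : histogram.descendingMap().entrySet()) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }

        // The largest key of the sorted map is the maximum length.
        int maxLength = histogram.lastKey();
        System.out.println(longestWords(words, maxLength));
    }
}
```

A sorted map makes the maximum length available directly from the grouped counts, which mirrors how the example reads the first row of its sorted group table.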

Java Example

public class LongestTokenExample {

    public static void main(String[] args) {

        // ---------------------------------
        // Step 1. Most common token lengths
        // ---------------------------------
        
        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("TokenLength");

        // Add the length of each token to the table.
        for (Token token : Document.getTokens()) {
            table.add(token.getLetterCount());
        }

        // Group by token length and display results.
        AnalysisTable groupTable = table.group("TokenLength");
        groupTable.sort("TokenLength", SortOrder.Descending);
        System.out.println(groupTable);

        // ----------------------
        // Step 2. Longest tokens
        // ----------------------
        
        // Get the maximum token length.
        int maxTokenLength
            = groupTable.getInteger(0, "TokenLength");

        // Find all tokens of that size.
        AnalysisTable tokenTable = new AnalysisTable(
                "ChapterNumber", "VerseNumber",
                "TokenNumber", "Token");
        for (Token token : Document.getTokens()) {
            if (token.getLetterCount() == maxTokenLength) {
                tokenTable.add(token.getChapterNumber(),
                        token.getVerseNumber(),
                        token.getTokenNumber(),
                        token.removeNonLetters().toBuckwalter());
            }
        }

        // Display tokens.
        System.out.println(tokenTable);
    }
}

Program Output

TokenLength Count
----------- -----
11          4
10          50
9           407
8           2554
7           4626
6           10929
5           14263
4           17495
3           15554
2           11544
1           3

ChapterNumber VerseNumber TokenNumber Token
------------- ----------- ----------- -----
3             17          5           wa{lomusotagofiriyna
4             75          8           wa{lomusotaDoEafiyna
4             127         25          wa{lomusotaDoEafiyna
15            22          8           fa>asoqayona`kumuwhu
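The frequency table above can be cross-checked with a few lines of plain Java. The sketch below copies the rows from the program output, sums the counts and picks the most common length. The class and method names are hypothetical and do not belong to the library.

```java
public class HistogramCheck {

    // Rows copied from the program output: {TokenLength, Count}.
    static final int[][] ROWS = {
        {11, 4}, {10, 50}, {9, 407}, {8, 2554}, {7, 4626}, {6, 10929},
        {5, 14263}, {4, 17495}, {3, 15554}, {2, 11544}, {1, 3}
    };

    // Sum of the Count column.
    static int totalTokens() {
        int total = 0;
        for (int[] row : ROWS) {
            total += row[1];
        }
        return total;
    }

    // Token length with the highest count.
    static int mostCommonLength() {
        int[] best = ROWS[0];
        for (int[] row : ROWS) {
            if (row[1] > best[1]) {
                best = row;
            }
        }
        return best[0];
    }

    public static void main(String[] args) {
        System.out.println("Total tokens: " + totalTokens());            // prints 77429
        System.out.println("Most common length: " + mostCommonLength()); // prints 4
    }
}
```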

Discussion

The above results show that there are 4 tokens in the Quranic text which are each 11 letters long. The fourth token in the above table (chapter 15, verse 22, token 8) includes an alif khanjarīya. This is counted as a letter, being an abbreviation for a full alif. The program uses the removeNonLetters() method to remove Quranic symbols from the tokens before displaying them.
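The transliterations in the table still carry vowel marks, so the count of 11 is not the string length. A minimal sketch of how the count can be reproduced from the Buckwalter strings follows. It assumes that only the standard Buckwalter symbols for short vowels, sukun, shadda and tanwin (a, i, u, o, ~, F, N, K) are skipped, and that every other character, including the alif khanjarīya (`), counts as a letter. This is not the library's getLetterCount implementation, and the class and method names are hypothetical.

```java
public class BuckwalterLetterCount {

    // ASSUMPTION: these Buckwalter symbols are the only non-letters present.
    // a, i, u = short vowels; o = sukun; ~ = shadda; F, N, K = tanwin.
    private static final String DIACRITICS = "aiuoFNK~";

    // Count every character that is not in the diacritic set above.
    static int letterCount(String buckwalter) {
        int count = 0;
        for (char c : buckwalter.toCharArray()) {
            if (DIACRITICS.indexOf(c) < 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Distinct token forms taken from the results table above.
        String[] tokens = {
            "wa{lomusotagofiriyna",
            "wa{lomusotaDoEafiyna",
            "fa>asoqayona`kumuwhu"
        };
        for (String token : tokens) {
            // Each line ends with 11.
            System.out.println(token + " -> " + letterCount(token));
        }
    }
}
```

Other symbols used in the extended transliteration scheme would need to be added to the skipped set before applying this to arbitrary tokens.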

Language Research Group
University of Leeds