Java API - Token Frequency Example

The Java program below contains two examples that list the 10 most frequent tokens in the Quran, first with diacritics and then without. Each example collects its results in an analysis table. Every token is added to the table in Buckwalter transliteration, and the table is then grouped by token so that each distinct form is counted.

Java Example

public class TokenFrequencyExample {

    public static void main(String[] args) {

        // Example #1.
        topTokensWithDiacritics();

        // Example #2.
        topTokensWithoutDiacritics();
    }

    private static void topTokensWithDiacritics() {

        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("Token");

        // Add each token to the table.
        for (Token token : Document.getTokens()) {
            table.add(token.toBuckwalter());
        }

        // Group and display top 10 results.
        AnalysisTable groupTable = table.group("Token");
        groupTable.sort("Count", SortOrder.Descending);
        System.out.println(groupTable.toString(10));
    }

    private static void topTokensWithoutDiacritics() {

        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("Token");

        // Add each token to the table, without diacritics.
        for (Token token : Document.getTokens()) {
            table.add(token.removeDiacritics().toBuckwalter());
        }

        // Group and display top 10 results.
        AnalysisTable groupTable = table.group("Token");
        groupTable.sort("Count", SortOrder.Descending);
        System.out.println(groupTable.toString(10));
    }
}
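
The group-and-sort step used in both examples can be illustrated without the library. The sketch below is not the AnalysisTable implementation; it is a minimal plain-Java equivalent, assuming tokens are already available as transliterated strings. The class name TokenFrequencySketch, the method topTokens, and the sample input are hypothetical and are not part of the Java API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenFrequencySketch {

    // Counts each distinct token and returns the n most frequent entries,
    // highest count first. Ties are broken alphabetically so the order is stable.
    static List<Map.Entry<String, Integer>> topTokens(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the corpus.
        List<String> tokens = Arrays.asList("min", "fiY", "min", "maA", "fiY", "min");
        for (Map.Entry<String, Integer> entry : topTokens(tokens, 2)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Grouping here is a single pass that increments a per-token counter, and the top 10 listing in the program above corresponds to calling topTokens with n set to 10.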

Program Output

Token     Count
-----     -----
fiY       1098
{ll~ahi   828
{l~a*iyna 810
{ll~ahu   733
min       728
maA       711
laA       616
<in~a     609
walaA     605
{ll~aha   592

Token Count
----- -----
mn    2589
Allh  2153
An    1603
fY    1185
mA    1010
lA    812
Al*yn 811
AlA   763
wlA   658
wmA   646

Discussion

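The results above show that, for tokens including diacritics, the most frequent token is fiY. This preposition means "in" and occurs 1098 times. Excluding diacritics, the most frequent token is mn, the unvocalized form of the preposition min ("from"), which occurs 2589 times. This count is much higher than the 728 occurrences of min in the first table because removing diacritics merges all forms that differ only in their diacritics into a single entry. The second example produces these results by calling removeDiacritics() on each token before converting it to Buckwalter transliteration.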
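
To make the effect of diacritic removal concrete, the sketch below strips diacritics directly from a Buckwalter string. It is not the library's removeDiacritics() method, whose rules are not given here. It assumes a rule set that is merely consistent with the rows in the two tables above: the short vowels, sukun, shadda and tanween marks (a, i, u, o, ~, F, N, K) are dropped, and the alif forms { and < are written as a bare A. The class and method names are hypothetical.

```java
public class BuckwalterDiacriticsSketch {

    // Assumption: Buckwalter characters treated as diacritics in this sketch
    // (fatha, kasra, damma, sukun, shadda, fathatan, dammatan, kasratan).
    private static final String DIACRITICS = "aiuo~FNK";

    // Returns the input with diacritics dropped and the alif forms '{' and '<'
    // reduced to a bare alif 'A'. Other characters are copied unchanged.
    static String stripDiacritics(String buckwalter) {
        StringBuilder result = new StringBuilder();
        for (char c : buckwalter.toCharArray()) {
            if (DIACRITICS.indexOf(c) >= 0) {
                continue; // skip diacritic marks
            }
            if (c == '{' || c == '<') {
                result.append('A'); // assumption inferred from the output tables
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Tokens taken from the first output table.
        String[] tokens = {"fiY", "{ll~ahi", "{l~a*iyna", "<in~a", "walaA"};
        for (String token : tokens) {
            System.out.println(token + " -> " + stripDiacritics(token));
        }
    }
}
```

Under these assumed rules, {ll~ahi, {ll~ahu and {ll~aha all reduce to Allh, which is why a single entry in the second table can absorb several entries from the first.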

Language Research Group
University of Leeds