The Java program below contains two examples that list the 10 most frequent tokens in the Quran, with and without diacritics. Each example adds every token to an analysis table in Buckwalter transliteration. Frequency analysis is then performed by grouping the table on its Token column and sorting the groups by count in descending order.
Java Example
public class TokenFrequencyExample {

    public static void main(String[] args) {

        // Example #1.
        topTokensWithDiacritics();

        // Example #2.
        topTokensWithoutDiacritics();
    }

    private static void topTokensWithDiacritics() {

        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("Token");

        // Add each token to the table.
        for (Token token : Document.getTokens()) {
            table.add(token.toBuckwalter());
        }

        // Group and display top 10 results.
        AnalysisTable groupTable = table.group("Token");
        groupTable.sort("Count", SortOrder.Descending);
        System.out.println(groupTable.toString(10));
    }

    private static void topTokensWithoutDiacritics() {

        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("Token");

        // Add each token to the table, without diacritics.
        for (Token token : Document.getTokens()) {
            table.add(token.removeDiacritics().toBuckwalter());
        }

        // Group and display top 10 results.
        AnalysisTable groupTable = table.group("Token");
        groupTable.sort("Count", SortOrder.Descending);
        System.out.println(groupTable.toString(10));
    }
}
Program Output
Token        Count
-----        -----
fiY          1098
{ll~ahi      828
{l~a*iyna    810
{ll~ahu      733
min          728
maA          711
laA          616
<in~a        609
walaA        605
{ll~aha      592

Token        Count
-----        -----
mn           2589
Allh         2153
An           1603
fY           1185
mA           1010
lA           812
Al*yn        811
AlA          763
wlA          658
wmA          646
Discussion
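The results above show that, for tokens including diacritics, the most frequent token is fī (fiY in Buckwalter transliteration). This preposition means "in" and occurs 1098 times. Excluding diacritics, the most frequent token is mn, the undiacritized spelling of the preposition min, meaning "from", which occurs 2589 times. This count is higher than the 728 occurrences of min in the first table because every token that is spelled mn once diacritics are removed falls into the same group. The second example calls the removeDiacritics() method on each token before transliterating it, so that grouping ignores diacritics.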
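The effect of removing diacritics before grouping can be sketched without the library, using only standard Java collections. The sketch below is an illustration under stated assumptions, not the library's implementation: the class DiacriticGroupingSketch, the helpers stripDiacritics, countTokens and top, the set of Buckwalter symbols treated as diacritics, and the sample tokens are all hypothetical. It also leaves alif and hamza forms untouched, whereas the second table above shows them as a plain A (for example {ll~ahi becoming Allh).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiacriticGroupingSketch {

    // Assumption: Buckwalter symbols treated as diacritics in this sketch
    // (short vowels, tanwin, sukun, shadda, dagger alif).
    private static final String DIACRITICS = "aiuoFNK~`";

    // Hypothetical helper: drops the symbols above from a Buckwalter string.
    public static String stripDiacritics(String buckwalter) {
        StringBuilder result = new StringBuilder();
        for (char c : buckwalter.toCharArray()) {
            if (DIACRITICS.indexOf(c) < 0) {
                result.append(c);
            }
        }
        return result.toString();
    }

    // Groups identical tokens and counts each group.
    public static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to n entries, sorted by count in descending order.
    public static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Made-up sample tokens in Buckwalter transliteration, not corpus data.
        List<String> sample = Arrays.asList("min", "man", "fiY", "min", "maA", "mina");

        // Group with diacritics kept.
        System.out.println(top(countTokens(sample), 3));

        // Group after stripping diacritics from every token.
        List<String> stripped = new ArrayList<>();
        for (String token : sample) {
            stripped.add(stripDiacritics(token));
        }
        System.out.println(top(countTokens(stripped), 3));
    }
}
```

On this sample, min is counted twice when diacritics are kept, while min, man and mina all collapse into mn once they are stripped, giving that group a count of four. The same merging explains why counts in the second table are larger than in the first.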