The example below displays the longest tokens in the Quran. The program uses the getLetterCount() accessor to measure token length; this count excludes Quranic symbols that are not letters, as defined by the orthography model. The program has two steps. In step 1, an analysis table is used to perform frequency analysis, showing the number of tokens of each length. In step 2, a second analysis table is used to tabulate occurrences of the longest tokens.
Java Example
public class LongestTokenExample {

    public static void main(String[] args) {

        // ---------------------------------
        // Step 1. Most common token lengths
        // ---------------------------------

        // Create a new analysis table.
        AnalysisTable table = new AnalysisTable("TokenLength");

        // Add the length of each token to the table.
        for (Token token : Document.getTokens()) {
            table.add(token.getLetterCount());
        }

        // Group by token length and display results, longest first.
        AnalysisTable groupTable = table.group("TokenLength");
        groupTable.sort("TokenLength", SortOrder.Descending);
        System.out.println(groupTable);

        // ----------------------
        // Step 2. Longest tokens
        // ----------------------

        // Get the maximum token length (the first row after the descending sort).
        int maxTokenLength = groupTable.getInteger(0, "TokenLength");

        // Find all tokens of that size.
        AnalysisTable tokenTable = new AnalysisTable(
                "ChapterNumber", "VerseNumber", "TokenNumber", "Token");

        for (Token token : Document.getTokens()) {
            if (token.getLetterCount() == maxTokenLength) {
                tokenTable.add(token.getChapterNumber(),
                        token.getVerseNumber(),
                        token.getTokenNumber(),
                        token.removeNonLetters().toBuckwalter());
            }
        }

        // Display tokens.
        System.out.println(tokenTable);
    }
}
Program Output
TokenLength Count
----------- -----
11          4
10          50
9           407
8           2554
7           4626
6           10929
5           14263
4           17495
3           15554
2           11544
1           3

ChapterNumber VerseNumber TokenNumber Token
------------- ----------- ----------- -----
3             17          5           wa{lomusotagofiriyna
4             75          8           wa{lomusotaDoEafiyna
4             127         25          wa{lomusotaDoEafiyna
15            22          8           fa>asoqayona`kumuwhu
Discussion
The results above show that there are 4 tokens in the Quranic text that are each 11 letters long. The 4th token in the second table includes an alif khanjarīya (dagger alif, written ` in the Buckwalter transliteration). This is counted as a letter, being an abbreviation for a full alif. The program uses the removeNonLetters() method to remove Quranic symbols from the tokens before displaying them, and toBuckwalter() to transliterate them.
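To see why each of these tokens counts as 11 letters, the two steps can be approximated on the transliterated strings with standard Java collections alone. The sketch below is only an illustration and does not reproduce the library's orthography model: the class name BuckwalterLetterCount, its methods, and the set of Buckwalter characters treated as diacritics are all assumptions made for this example, and the set covers only the common short vowels, sukun, shadda and tanwin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the library's API.
class BuckwalterLetterCount {

    // ASSUMPTION: Buckwalter characters treated as diacritics, not letters:
    // fatha (a), damma (u), kasra (i), sukun (o), shadda (~), tanwin (F, N, K).
    // The dagger alif (`) is deliberately absent, so it counts as a letter.
    private static final String DIACRITICS = "auio~FNK";

    // Counts the characters of a Buckwalter string that are not diacritics.
    static int letterCount(String buckwalter) {
        int count = 0;
        for (char ch : buckwalter.toCharArray()) {
            if (DIACRITICS.indexOf(ch) < 0) {
                count++;
            }
        }
        return count;
    }

    // Step 1: maps each letter count to the number of tokens of that length,
    // with keys ordered from longest to shortest.
    static TreeMap<Integer, Integer> lengthFrequencies(List<String> tokens) {
        TreeMap<Integer, Integer> frequencies =
                new TreeMap<>(Collections.reverseOrder());
        for (String token : tokens) {
            frequencies.merge(letterCount(token), 1, Integer::sum);
        }
        return frequencies;
    }

    // Step 2: returns the tokens whose letter count equals the maximum.
    static List<String> longestTokens(List<String> tokens) {
        List<String> longest = new ArrayList<>();
        if (tokens.isEmpty()) {
            return longest;
        }
        // The map is in descending order, so the first key is the maximum.
        int maxLength = lengthFrequencies(tokens).firstKey();
        for (String token : tokens) {
            if (letterCount(token) == maxLength) {
                longest.add(token);
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // Sample input: the four tokens from the output above plus one short token.
        List<String> tokens = Arrays.asList(
                "bisomi",
                "wa{lomusotagofiriyna",
                "wa{lomusotaDoEafiyna",
                "wa{lomusotaDoEafiyna",
                "fa>asoqayona`kumuwhu");

        System.out.println(lengthFrequencies(tokens));
        System.out.println(longestTokens(tokens));
    }
}
```

Under these assumptions, each of the four tokens listed in the output yields a count of 11, with the dagger alif in the last token contributing one letter. The real getLetterCount() accessor works on the library's own token representation, so this string-based count should be read as an approximation only.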