Java API - Analysis Table

The AnalysisTable is a general-purpose class which may be used to tabulate, sort, group and export results. An analysis table is organized into a set of rows and columns, with each column having a unique column name. The analysis table may be used as follows:

Step 1.	A new analysis table is created by specifying a list of column names.
Step 2.	Rows are added to the table by a Java program. These are typically the results of a search, or other program you have written to collect data.
Step 3.	After rows have been added, the results can be analysed by performing any of the following operations:
- Sort the data by a column, in ascending or descending order.
- Group the data by a list of columns.
- Display the table to screen, or display the first N rows of the table.
- Export the table to a file, e.g. a tab delimited file or a CSV file.

The analysis table is useful for sorting and displaying results. When writing results to the screen, the table will be correctly formatted and all columns will be aligned. Grouping results allows frequency analysis to be performed, by constructing a frequency table from the original table.
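The column alignment described above can be pictured with a short, self-contained sketch. This is not the AnalysisTable implementation; the class AlignedTableSketch and its format() method are hypothetical names used only for illustration. The sketch assumes each column is padded to the width of its longest cell, and it needs Java 11 or later for String.repeat().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the API described here.
class AlignedTableSketch {

    // Formats a header and rows as left-aligned, space-padded columns.
    static String format(String[] header, List<Object[]> rows) {
        // Each column is as wide as its longest cell, including the header.
        int[] widths = new int[header.length];
        for (int c = 0; c < header.length; c++) {
            widths[c] = header[c].length();
            for (Object[] row : rows) {
                widths[c] = Math.max(widths[c], String.valueOf(row[c]).length());
            }
        }

        StringBuilder out = new StringBuilder();
        appendLine(out, header, widths);

        // Separator line: one run of dashes under each column name.
        String[] dashes = new String[header.length];
        for (int c = 0; c < header.length; c++) {
            dashes[c] = "-".repeat(widths[c]);
        }
        appendLine(out, dashes, widths);

        for (Object[] row : rows) {
            appendLine(out, row, widths);
        }
        return out.toString();
    }

    // Appends one line, padding every cell to its column width.
    private static void appendLine(StringBuilder out, Object[] cells, int[] widths) {
        StringBuilder line = new StringBuilder();
        for (int c = 0; c < cells.length; c++) {
            if (c > 0) {
                line.append(' ');
            }
            String text = String.valueOf(cells[c]);
            line.append(text).append(" ".repeat(widths[c] - text.length()));
        }
        // Remove the padding after the last column.
        out.append(line.toString().stripTrailing()).append('\n');
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {2, 282, 128});
        rows.add(new Object[] {4, 12, 88});
        System.out.print(format(
            new String[] {"ChapterNumber", "VerseNumber", "TokenCount"}, rows));
    }
}
```

Computing the widths in a first pass over all rows is what keeps every column aligned, whatever the length of the individual values.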

A Simple Example

Suppose we wish to find the 5 longest verses in the Quranic text, measuring length by the number of tokens in each verse. We can write a simple Java program to search the orthography model, using an analysis table to collect the results:

// Step 1. Create a new analysis table.
AnalysisTable table = new AnalysisTable(
    "ChapterNumber", "VerseNumber", "TokenCount");

// Step 2. Tabulate the number of tokens in each verse.
for (Verse verse : Document.getVerses()) {
    table.add(
        verse.getChapterNumber(),
        verse.getVerseNumber(),
        verse.getTokenCount());
}

// Step 3. Sort the table, then display the first 5 rows.
table.sort("TokenCount", SortOrder.Descending);
System.out.println(table.toString(5));

In step 1 of the program, we create a new analysis table. The table is created with three columns named ChapterNumber, VerseNumber and TokenCount. At this stage the table is empty and contains no rows.

In step 2, we enumerate through all verses in the Quranic text. For each verse, we use the add() method to add a new row to the analysis table. The row contains 3 values, one for each of the columns defined in step 1. We use the Verse methods getChapterNumber(), getVerseNumber() and getTokenCount() to get the values that make up the row.

In step 3, we sort the table and then display the results. The sort() method is used to sort the table by the TokenCount column, in descending order. This will allow the longest verses to be displayed first. We then use the toString() method to display the table. This method accepts an optional parameter, the number of rows to display. The results of the Java program are shown below:

ChapterNumber VerseNumber TokenCount
------------- ----------- ----------
2             282         128
4             12          88
24            31          78
73            20          78
24            61          76
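For readers who want to see what "sort by a column, then take the first N rows" amounts to, the following sketch does the same thing with plain Java collections. It is not how AnalysisTable is implemented; SortSketch and topByColumn() are hypothetical names, and rows are simplified to int arrays. The input rows are the five rows shown above, in a different order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of the API described here.
class SortSketch {

    // Returns the first n rows after sorting by an integer column, largest first.
    static List<int[]> topByColumn(List<int[]> rows, int column, int n) {
        // Sort a copy so that the caller's list is left untouched.
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((int[] row) -> row[column]).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Columns: ChapterNumber, VerseNumber, TokenCount.
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[] {24, 61, 76});
        rows.add(new int[] {4, 12, 88});
        rows.add(new int[] {24, 31, 78});
        rows.add(new int[] {2, 282, 128});
        rows.add(new int[] {73, 20, 78});

        for (int[] row : topByColumn(rows, 2, 3)) {
            System.out.println(row[0] + ":" + row[1] + " has " + row[2] + " tokens");
        }
    }
}
```

List.sort() is a stable sort, so in this sketch rows with equal token counts keep their original relative order. Whether AnalysisTable.sort() makes the same guarantee is not stated here.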

Grouping Results

Frequency counts can be derived from an analysis table by grouping results. The group() method creates a new analysis table, based on the original table. This method accepts the list of columns that you wish to group on. The new group table will contain the grouped columns, together with an additional column named Count. The Count column contains the number of items in that group, in other words its frequency.

As an example, consider the analysis table in the preceding section, which tabulated the number of tokens in each verse. Suppose we are interested in the frequency of token counts. That is, how many verses contain 1 token, how many contain 2 tokens, and so on. We can derive frequency counts from the original table by grouping the data:

// Group the token count table by number of tokens.
AnalysisTable groupTable = table.group("TokenCount");

// Sort the group table by the Count column, in descending order.
groupTable.sort("Count", SortOrder.Descending);

// Display the first 5 rows of the group table.
System.out.println(groupTable.toString(5));

This program groups the original table by the TokenCount column. The group table will have columns named TokenCount and Count. The Count column is added automatically, and represents the number of verses that have a particular token count. The program then sorts the group table, and displays the first 5 rows. The results below show the 5 most frequent token counts in the Quranic text:

TokenCount Count
---------- -----
4          530
5          419
3          402
6          358
11         345

As can be seen, 530 verses are 4 tokens in length. This is the most common number of tokens per verse. The next result shows that 419 verses contain 5 tokens, and so on.
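Conceptually, grouping is a count of how many rows share the same values in the chosen columns. The sketch below shows that idea with a map from group key to count. It is only an illustration under stated assumptions: GroupSketch and its group() method are hypothetical and are not the AnalysisTable code, and the sample rows in main() are invented values, not corpus data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the API described here.
class GroupSketch {

    // Counts how many rows share the same values in the given column indexes.
    // Keys appear in the order in which each group is first seen.
    static Map<List<Object>, Integer> group(List<Object[]> rows, int... columns) {
        Map<List<Object>, Integer> counts = new LinkedHashMap<>();
        for (Object[] row : rows) {
            // The group key is the list of values in the grouped columns.
            List<Object> key = new ArrayList<>();
            for (int c : columns) {
                key.add(row[c]);
            }
            // Start a new group at 1, or add 1 to an existing group.
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows (chapter, verse, token count), for illustration only.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1, 1, 4});
        rows.add(new Object[] {1, 2, 5});
        rows.add(new Object[] {1, 3, 4});
        rows.add(new Object[] {1, 4, 3});

        // Group on the third column (index 2), the token count.
        group(rows, 2).forEach((key, count) ->
            System.out.println("TokenCount " + key.get(0) + " -> Count " + count));
    }
}
```

Using a list as the key lets the same method group on one column or on several, which mirrors the way group() accepts a list of columns.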

Exporting Results

As well as being displayed on screen, the contents of an analysis table can be saved to disk by writing out a delimited file. The default is a tab-delimited file, but the delimiter is configurable. To export a CSV file, a comma should be specified as the delimiter character. The writeFile() method is used to export an analysis table:

// Write out a tab delimited file (the default).
table.writeFile("data.txt");

// Write a comma separated file.
table.writeFile("data.csv", ',');

// Write out the first 10 rows to a CSV file.
table.writeFile("data2.csv", ',', 10);
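To make the file format concrete, here is a minimal sketch of writing rows as a delimited file with the standard library. It does not reproduce writeFile(): DelimitedExportSketch and its write() method are hypothetical, and the sketch assumes a header line of column names followed by one line per row, with no quoting or escaping of values. The exact layout that writeFile() produces is not described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the API described here.
class DelimitedExportSketch {

    // Writes a header line and at most maxRows data lines, joined by the delimiter.
    // Assumption: values never contain the delimiter, so nothing is quoted.
    static void write(Path file, char delimiter, String[] header,
                      List<Object[]> rows, int maxRows) throws IOException {
        String separator = String.valueOf(delimiter);
        List<String> lines = new ArrayList<>();
        lines.add(String.join(separator, header));

        int limit = Math.min(maxRows, rows.size());
        for (int r = 0; r < limit; r++) {
            List<String> cells = new ArrayList<>();
            for (Object value : rows.get(r)) {
                cells.add(String.valueOf(value));
            }
            lines.add(String.join(separator, cells));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {2, 282, 128});
        rows.add(new Object[] {4, 12, 88});

        // Write to a temporary file, then print what was written.
        Path file = Files.createTempFile("analysis-table", ".csv");
        write(file, ',', new String[] {"ChapterNumber", "VerseNumber", "TokenCount"},
              rows, 10);
        Files.readAllLines(file, StandardCharsets.UTF_8).forEach(System.out::println);
        Files.delete(file);
    }
}
```

Passing '\t' as the delimiter gives a tab-delimited file instead. Values that might contain the delimiter would need quoting, which this sketch deliberately leaves out.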