KWJP Frequency lists

Frequency lists of single words and n-grams have been compiled automatically from the balanced Corpus of Contemporary Polish. They may contain errors, because lemmatisation and morphosyntactic tagging were also performed automatically. Only words written in the Latin alphabet (including diacritics) are taken into account; words may contain hyphens.

Several variants of the lists are available. First, separate lists are compiled for the three main text genres: fiction, factual and news journalism. Second, lists are compiled by type of form: lemmas and text words, further subdivided into case-sensitive and case-insensitive forms. Both the single-word lists and the n-gram lists are restricted to units that occur at least five times in the entire corpus (the number of occurrences within a single genre may be lower).

The columns have the following meanings. The R column contains the rank of the unit, i.e., its position on the list sorted by frequency. The Unit column contains the word or n-gram. If the list is compiled from lemmas, it also has a POS column with the grammatical class (flexeme) assigned to the entry. The list may therefore include homonymous entries assigned to different classes. The F column contains the frequency of the unit in the corpus, and the IPM (items per million) column contains the relative frequency per million words.

The ARF (average reduced frequency) column contains an adjusted frequency: it reduces the frequency of words whose occurrences are clustered closely together in the corpus (in only one or a few texts). Words distributed evenly throughout the corpus have an ARF value relatively close to the frequency (F). The 1-DP column contains the value of the DP dispersion measure (deviation of proportions), scaled so that words with a relatively even distribution in the corpus have values close to 1, and words with a very uneven distribution have values close to 0. More on the ARF and 1-DP measures can be found in the papers listed in the Bibliography.
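The following Python sketch illustrates how IPM, ARF, and 1-DP can be computed for a single unit. It is not the code used to build the lists. It follows the commonly cited textbook definitions of ARF (cyclic gaps between occurrences, each capped at the average gap) and DP (half the summed absolute differences between observed and expected proportions). The function names, the use of token positions, and the way the corpus is split into parts are assumptions made for illustration only.

```python
def ipm(freq, corpus_size):
    """Relative frequency per million tokens."""
    return freq * 1_000_000 / corpus_size


def arf(positions, corpus_size):
    """Average reduced frequency.

    positions: sorted 0-based token positions of the unit in the corpus.
    corpus_size: total number of tokens in the corpus.
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average gap between occurrences
    # The corpus is treated as a circle: the first gap wraps around
    # from the last occurrence to the first one.
    gaps = [positions[0] + corpus_size - positions[-1]]
    gaps += [b - a for a, b in zip(positions, positions[1:])]
    # Each gap counts at most v, so tightly clustered hits add little.
    return sum(min(d, v) for d in gaps) / v


def one_minus_dp(freq_per_part, size_per_part):
    """1 - DP for a unit, given its frequency in each corpus part
    and the size (in tokens) of each part. Assumes the unit occurs
    at least once."""
    total_freq = sum(freq_per_part)
    total_size = sum(size_per_part)
    dp = 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(freq_per_part, size_per_part)
    )
    return 1 - dp
```

Under these assumptions, four occurrences spaced evenly in a 100-token corpus (positions 0, 25, 50, 75) give an ARF of 4.0, equal to F, whereas four adjacent occurrences (positions 0, 1, 2, 3) give an ARF of 1.12. A unit found in only one of two equally sized parts gets a 1-DP of 0.5; a unit split evenly between them gets 1.0.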

The n-gram lists also contain a Dice column with the Dice coefficient, interpreted as a measure of the strength of co-occurrence of two or more words. The coefficient reaches its maximum value of 1 for words that occur only next to each other in texts and never in other contexts. Combinations of words that occur together relatively often but are also used in other contexts have lower values. N-grams consisting of words that rarely occur next to each other and relatively often occur in other contexts have values close to zero.
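As a rough illustration, the sketch below computes the Dice coefficient in its usual form for two words, 2 * f(xy) / (f(x) + f(y)), and extends it to longer n-grams by multiplying the n-gram frequency by n and dividing by the sum of the component word frequencies. Whether the lists use exactly this generalisation for n-grams longer than two words is an assumption; the function name is hypothetical.

```python
def dice(ngram_freq, word_freqs):
    """Dice coefficient of an n-gram.

    ngram_freq: frequency of the whole n-gram.
    word_freqs: frequencies of its component words, one per position.
    """
    n = len(word_freqs)
    return n * ngram_freq / sum(word_freqs)
```

With this formula, two words that each occur 10 times and always together score 1.0, while two words that each occur 1000 times but are adjacent only 10 times score 0.01.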

All columns can be filtered using the fields at the bottom of each column. Columns with numeric values (R, F, IPM, ARF, and 1-DP) can be filtered by range. The POS column can be filtered by grammatical class (a value can be selected from the menu). The Unit column can be filtered by entering any substring of the word being searched for. Regular expressions can be used in this field; the ^ and $ characters mark the beginning and end of a word, respectively. For example, the expression ^szczęś finds all words beginning with szczęś-, and the expression liwy$ finds all words ending in -liwy.
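The two example expressions can be tried outside the interface with any regular expression engine. The sketch below uses Python's re module on a small made-up word list; the regular expression dialect used by the filter fields may differ in details, and filter_units is a hypothetical helper.

```python
import re


def filter_units(units, pattern):
    """Return the units in which the pattern matches anywhere."""
    rx = re.compile(pattern)
    return [u for u in units if rx.search(u)]


# Made-up sample, not taken from the lists.
words = ["szczęście", "szczęśliwy", "nieszczęście", "życzliwy"]

starts = filter_units(words, r"^szczęś")  # words beginning with szczęś-
ends = filter_units(words, r"liwy$")      # words ending in -liwy
```

Here starts holds szczęście and szczęśliwy, and ends holds szczęśliwy and życzliwy. Without the ^ anchor, the pattern szczęś would also match nieszczęście, which is the substring behaviour described above.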

Bibliography