The Corpus of Contemporary Polish, 2011–2020


The new reference corpus of Polish

The Corpus of Contemporary Polish (Korpus Współczesnego Języka Polskiego – KWJP) covers texts from the second decade of the 21st century. It is a large reference corpus and can be used with the same confidence as reference publications, dictionaries and encyclopedias. The size and balance of the corpus are designed so that users can search for words or constructions on the assumption that their frequency, accompanying words and associations (collocations) match those of average language users.

Many users of the new corpus are probably familiar with the National Corpus of Polish (NKJP), the first large, balanced corpus of Polish. We refer to it below, but using KWJP does not require familiarity with it. The NKJP is well balanced but contains texts only up to 2010; in the following years, the Monco.pl corpus of Internet texts, including press services, served as the reference corpus. It was a huge, up-to-date and constantly growing corpus, but it was unbalanced in terms of genres and offered limited grammatical search capabilities. The new corpus can be seen as a continuation of NKJP: together they allow us to trace changes in the language from the early 20th century to the present. At the same time, it is an independent corpus of up-to-date texts that can be used to check typical uses of words and constructions in the language we speak today.

KWJP was developed at the Institute of Computer Science PAS, within the “Digital research infrastructure for the humanities and art sciences” project conducted in 2020–2023 by the DARIAH-PL research consortium.

Corpus structure

The Corpus of Contemporary Polish comprises texts written between 2011 and 2020. In total, we have collected over one billion words, 100 million of which have been included in the balanced corpus. It is made up of 35% news and journalistic texts (mainly from dailies and weeklies), 30% fiction and 35% non-fiction book and periodical texts.
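The proportions above translate into approximate word counts per category. The short Python sketch below only restates that arithmetic; it treats "100 million" as a nominal round figure, and the names BALANCED_SIZE, SHARES and words_per_category are illustrative, not part of any KWJP tool.

```python
# Nominal size of the balanced corpus in words (a round figure from the text).
BALANCED_SIZE = 100_000_000

# Percentage shares of the balanced corpus, as stated in the text.
SHARES = {
    "news and journalistic texts": 35,
    "fiction": 30,
    "non-fiction books and periodicals": 35,
}


def words_per_category(total, shares):
    """Split a total word count according to percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {name: total * pct // 100 for name, pct in shares.items()}


breakdown = words_per_category(BALANCED_SIZE, SHARES)
for name, count in breakdown.items():
    print(f"{name}: about {count // 1_000_000} million words")
```

Under these assumptions, the balanced corpus holds roughly 35 million words of news and journalistic texts, 30 million of fiction and 35 million of non-fiction.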

We sampled press texts at random from individual titles and selected books from various thematic areas, based on readership data from Polskie Badania Czytelnictwa and the National Library.

In selecting the texts, we were not guided by our own literary tastes, political views or substantive assessment of the facts described in the texts. The reference corpus is intended to contain typical texts and to reflect the linguistic habits of average users of Polish, and also to document the diversity of Polish writing in various thematic areas and genres.

The table listing corpus texts can be found in the Texts tab.

Compared to the structure of the balanced NKJP, our corpus lacks spoken and online texts. The Corpus of Contemporary Polish is, in fact, a corpus of Polish in traditional written (edited) genres. It is smaller than the balanced NKJP corpus because, in principle, it is intended to represent only a decade rather than almost a century of Polish writing. Users looking for larger, not necessarily balanced collections of texts can use the full corpus, which consists of all collected data and is dominated by contemporary press. Reference corpora of other languages have a similar structure, e.g. the synchronic corpora of the Czech National Corpus.

Corpus search. Words, grammatical forms, statistics, syntax

The search engine of the Corpus of Contemporary Polish is similar to the one used in Korpusomat. The search system differs slightly from that of NKJP: it is simpler, and it allows users not only to search for words and structures, but also to group the results in frequency order.

The corpus has been automatically enriched with extensive grammatical information, both inflectional and syntactic. This tagging makes it possible to search the corpus for all occurrences of a given lexeme by its base form (lemma), as well as by its grammatical class, the values of its grammatical categories, its direct syntactic head and its use in a phrase of a specific type. Search instructions with examples are available in the User Guide tab. Statistical data for the entire corpus, frequency lists of words and word combinations, and keywords can be viewed in the Frequency lists tab or downloaded from a repository.
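To illustrate what such annotation makes possible, here is a minimal Python sketch of a tagged sentence and a filter over it. It is not the KWJP query language or data format (those are described in the User Guide tab); the Token class, the search function and the simplified labels such as "noun" and "acc" are all hypothetical and chosen only for readability.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    form: str                                   # word as it appears in the text
    lemma: str                                  # base form of the lexeme
    pos: str                                    # grammatical class (simplified label)
    feats: dict = field(default_factory=dict)   # grammatical category values
    head: Optional[int] = None                  # index of the direct syntactic head


# "Ala ma kota" ("Ala has a cat"), annotated by hand for this example.
sentence = [
    Token("Ala", "Ala", "noun", {"case": "nom", "number": "sg"}, head=1),
    Token("ma", "mieć", "verb", {"person": "3", "number": "sg"}, head=None),
    Token("kota", "kot", "noun", {"case": "acc", "number": "sg"}, head=1),
]


def search(sentence, lemma=None, pos=None, head_lemma=None, **feats):
    """Return tokens matching every criterion that was given."""
    hits = []
    for token in sentence:
        if lemma is not None and token.lemma != lemma:
            continue
        if pos is not None and token.pos != pos:
            continue
        if any(token.feats.get(k) != v for k, v in feats.items()):
            continue
        if head_lemma is not None:
            # Tokens without a head (the root) cannot satisfy a head constraint.
            if token.head is None or sentence[token.head].lemma != head_lemma:
                continue
        hits.append(token)
    return hits


by_lemma = [t.form for t in search(sentence, lemma="kot")]
accusative_nouns = [t.form for t in search(sentence, pos="noun", case="acc")]
dependents_of_miec = [t.form for t in search(sentence, head_lemma="mieć")]

# Grouping in frequency order, here by grammatical class.
pos_frequencies = Counter(t.pos for t in sentence).most_common()
```

The same idea carries over to a real corpus: a query combines conditions on the lemma, the grammatical class, the category values and the syntactic head, and the matches can then be counted and sorted by frequency.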

Rules for using the corpus and acknowledgments

The corpus would not have been created without the cooperation of dozens of publishers, editors, libraries and cultural institutions. The full list can be found in the Texts tab. Users who cite examples from the corpus should give the full bibliographic reference (author, title, publisher, year) and the acronym KWJP.

Citing

When using the corpus in research and publications, please cite it in the following manner:

M. Marciniak, W. Kieraś, K. Bojałkowska, P. Borkowski, M. Borys, W. Eźlakowski, W. Guz, Ł. Kobyliński, D. Komosińska, K. Krasnowska-Kieraś, M. Łaziński, M. Miernecka, B. Nitoń, M. Ogrodniczuk, M. Rudolf, A. Tomaszewska, M. Woliński, J. Wołoszyn, B. Wójtowicz, A. Wróblewska, N. Zawadzka-Paluektau: Korpus Współczesnego Języka Polskiego, Instytut Podstaw Informatyki PAN, Warszawa 2023. URL: https://kwjp.pl.

@misc{kwjp:2023,
  author       = "Marciniak, M. and Kieraś, W. and Bojałkowska, K. and Borkowski, P. and Borys, M. and Eźlakowski, W. and Guz, W. and Kobyliński, Ł. and Komosińska, D. and Krasnowska-Kieraś, K. and Łaziński, M. and Miernecka, M. and Nitoń, B. and Ogrodniczuk, M. and Rudolf, M. and Tomaszewska, A. and Woliński, M. and Wołoszyn, J. and Wójtowicz, B. and Wróblewska, A. and Zawadzka-Paluektau, N.",
  title        = "Korpus Współczesnego Języka Polskiego",
  howpublished = "Instytut Podstaw Informatyki PAN, Warszawa",
  year         = "2023",
  note         = "https://kwjp.pl",
}

Project team

  • dr hab. Małgorzata Marciniak — project leader at ICS PAS
  • dr Witold Kieraś
  • dr hab. Maciej Ogrodniczuk
Linguistic team
  • dr Krystyna Bojałkowska
  • mgr Monika Borys
  • mgr Wiktor Eźlakowski
  • dr hab. Wojciech Guz
  • prof. Marek Łaziński
  • mgr Martyna Miernecka
  • mgr Aleksandra Tomaszewska
  • dr Joanna Wołoszyn
  • dr hab. Beata Wójtowicz
  • mgr Natalia Zawadzka-Paluektau
Programming team
  • dr Piotr Borkowski
  • dr Łukasz Kobyliński
  • mgr Dorota Komosińska
  • mgr Katarzyna Krasnowska-Kieraś
  • mgr Bartłomiej Nitoń
  • dr Michał Rudolf
  • dr hab. Marcin Woliński
  • dr Alina Wróblewska