API

Corpus object

class lingcorpora.corpus.Corpus(language, verbose=True)

Instantiate one object of this class per corpus. Searches are run with the search method.

Parameters:
  • language (str) –

    Language ISO 639-3 code for the corpus with combined codes for parallel corpora. List of available corpora with corresponding codes:

    Code          Corpus
    ady           Adyghe corpus
    alb           Albanian corpus
    arm           Eastern Armenian corpus
    bam           Corpus Bambara de Reference
    bua           Buryat corpus
    dan           Danish corpus
    deu           German corpus
    emk           Maninka Automatically Parsed corpus
    est           Estonian corpus
    grk           Modern Greek corpus
    hin           Hindi corpus
    kal           Kalmyk corpus
    kat           Georgian monolingual corpus
    kaz           Almaty corpus of the Kazakh language
    mon           Mongolian corpus
    rus           National Corpus of Russian
    rus_parallel  Parallel subcorpus of National Corpus of Russian Language
    rus_pol       Polish-Russian Parallel Corpus
    tat           Tatar corpus
    udm           Udmurt corpus
    yid           Modern Yiddish corpus
    zho           Center of Chinese Linguistics corpus
    zho_eng       Chinese-English subcorpus of JuKuu corpus
  • verbose (bool, default True) – whether to show a tqdm progress bar.
doc

Documentation for chosen corpus (after instance creation).

Type:str
results

List of all Result objects returned by the search method.

Type:list
failed

List of Result objects where nothing was found.

Type:list
reset_failed()

Reset .failed

retry_failed()

Apply .search() to the failed queries stored in .failed.

search(query, *args, **kwargs)

Queries the corpus and returns a list of Result objects.

Parameters:query (str) – query; for the remaining arguments see params_container.Container

Example

>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('мешок', n_results=10)
>>> rus_results
"мешок": 100%|███████████████████████████████| 10/10 [00:07<00:00,  1.40docs/s]
[Result(query=мешок, N=10, params={'n_results': 10, 'kwic': True, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'main', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': None})]
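
The bookkeeping behind results, failed, reset_failed() and retry_failed() can be pictured with a toy model. The sketch below is not lingcorpora code: ToyCorpus and its backend callable are hypothetical stand-ins, failed holds plain query strings here rather than Result objects, and whether the real retry_failed() clears .failed before retrying is an assumption.

```python
class ToyCorpus:
    """Toy stand-in for the results / failed bookkeeping described above.

    Not lingcorpora code: `backend` is a hypothetical callable that maps
    a query string to a list of hits.
    """

    def __init__(self, backend):
        self.backend = backend
        self.results = []  # every outcome returned by search()
        self.failed = []   # queries whose search found nothing

    def search(self, query):
        hits = self.backend(query)
        self.results.append((query, hits))
        if not hits:
            self.failed.append(query)
        return hits

    def reset_failed(self):
        self.failed = []

    def retry_failed(self):
        # Re-run every failed query; ones that fail again are re-recorded.
        pending, self.failed = self.failed, []
        for query in pending:
            self.search(query)


index = {"tuma": ["hit 1", "hit 2"]}
corp = ToyCorpus(lambda q: index.get(q, []))
corp.search("tuma")
corp.search("missing")
failed_before = list(corp.failed)

index["missing"] = ["late hit"]  # e.g. the source becomes reachable again
corp.retry_failed()
failed_after = list(corp.failed)
print(failed_before, failed_after)
```

In this toy run the query "missing" is recorded as failed after the first pass and is no longer listed once the retry succeeds.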
class lingcorpora.params_container.Container(query, n_results=100, kwic=True, n_left=None, n_right=None, subcorpus=None, get_analysis=False, gr_tags=None, query_language=None, start=0, writing_system=None)

Universal arguments: query, n_results.

Other arguments depend on corpus.

Parameters:
  • query (str or list[str]) – query or queries.
  • n_results (int, default 100) – number of results wanted.
  • kwic (bool, default True) – kwic format (True) or a sentence (False).
  • n_left (int, default None) – number of words / symbols (corpus-specific) in the left context.
  • n_right (int, default None) – number of words / symbols (corpus-specific) in the right context.
  • subcorpus (str, default None) – subcorpus to search in.
  • get_analysis (bool, default False) – whether to download grammatical information if the corpus is annotated.
  • gr_tags (dict, default None) – tags for grammar search.
  • query_language (str) – for parallel corpora, language of the query.
  • start (int, default 0) – result index to start from.
  • writing_system (str, default None) – writing system of results.
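
As a rough illustration of how start and n_results could combine for paged retrieval, the sketch below builds the keyword arguments for successive search calls. The helper page_kwargs is hypothetical and not part of lingcorpora. It assumes that start is a zero-based result index, as its default of 0 suggests, and that the chosen corpus honours start at all, since arguments other than query and n_results depend on the corpus.

```python
def page_kwargs(total, page_size=100):
    """Yield keyword arguments for successive search() calls.

    Hypothetical helper, not part of lingcorpora. Assumes `start` is a
    zero-based result index and `n_results` caps a single call.
    """
    for start in range(0, total, page_size):
        yield {"start": start, "n_results": min(page_size, total - start)}


pages = list(page_kwargs(250))
for kwargs in pages:
    print(kwargs)
```

Each dictionary could then be unpacked into a call such as corp.search(query, **kwargs), where corp is a Corpus instance.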

Working with results

class lingcorpora.result.Result(language, query_params)

An object of this class contains all results found. A Result object is iterable and supports indexing.

Parameters:
  • language (str) – corpus language.
  • query_params (dict) – all other parameters of the search.
results

List of results.

Type:list[Target]
n

Number of results.

Type:int
query

Search query.

Type:str

Example

>>> corp = lingcorpora.Corpus('emk')
>>> results = corp.search('tuma', n_results=10, kwic=False)[0]
>>> results
"tuma": 100%|██████████| 10/10 [00:00<00:00, 11.09docs/s]
Result(query=tuma, N=10, params={'n_results': 10, 'kwic': False, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'cormani-brut-lat', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': ''})
clear()

Overwrites the results attribute with an empty list.

Example

>>> print(results.results)
[Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, )]
>>> results.clear()
>>> print(results.results)
[]
export_csv(filename=None, header=True, sep=';')

Save search result as CSV.

Parameters:
  • filename (str, default None) – name of the file. If None, filename is lang_query_results.csv with omission of disallowed filename symbols.
  • header (bool, default True) – whether to include a header in the table. Header is stored in .__header: ('index', 'text')
  • sep (str, default ';') – cell separator in the csv.
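
A file exported this way can be read back with the standard csv module by passing the same separator. The sketch below first writes a small stand-in file, because the real export needs a live search. The two rows are made up, the filename only imitates the lang_query_results.csv pattern, and UTF-8 encoding and standard CSV quoting are assumptions about what export_csv produces.

```python
import csv
import os
import tempfile

# Stand-in for a file written by export_csv(); the rows are invented
# for illustration, and UTF-8 encoding is an assumption.
path = os.path.join(tempfile.mkdtemp(), "emk_tuma_results.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["index", "text"])
    writer.writerow([0, "first example sentence"])
    writer.writerow([1, "second; with a separator inside"])

# Read it back with the same separator; DictReader takes the column
# names from the header row.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, delimiter=";"))

print(rows[1]["text"])
```

If the file was exported with header=False, csv.reader with the same delimiter would be the closer fit, since there is no header row to take column names from.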
class lingcorpora.target.Target(text, idxs, meta, analysis, gr_tags=None, transl=None, lang=None)

Target contains one item from the result list.

Parameters:
  • text (str) – full sentence / document.
  • idxs (tuple (l, r)) – start and end indexes of the target in self.text, so that self.text[l:r] is the target.
  • meta (str) – sentence / document info (if exists).
  • analysis (list of dicts) – target analysis (parsed).
  • gr_tags (str, default None) – grammatical tags passed by user.
  • transl (str, default None) – text translation (for parallel corpora and dictionaries).
  • lang (str, default None) – translation language (for parallel corpora and dictionaries).

Examples

>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('одеяло', n_results=10, get_analysis=True)[0]
>>> first_hit = rus_results[0]
>>> first_hit
Target(одеяло, Народный костюм: архаика или современность? // «Народное творчество», 2004)
>>> for k, v in vars(first_hit).items():
...     print(k, v)
text  Я, например, для внучки настегала своими руками лоскутное одеяло, зная, что оно будет её оберегать, давать ей энергию.
idxs (59, 65)
meta Народный костюм: архаика или современность? // «Народное творчество», 2004
tags {'lex': ['одеяло'], 'gramm': ['S', 'inan', 'n', 'sg', 'acc', 'disamb'], 'sem': ['r:concr', 't:tool:bedding'], 'flags': ['animred', 'bcomma', 'bmark', 'casered', 'genderred', 'numred']}
transl None
lang None
kwic(left, right, level='word')

Returns the item in kwic (keyword in context) format as a (left context, target, right context) tuple, for further use and csv output.

Parameters:
  • left (int) – length of left context
  • right (int) – length of right context
  • level (str, default 'word') – whether context length is counted in tokens ('word') or in characters ('char').

Examples

>>> first_hit.kwic(left=5, right=5)
('внучки настегала своими руками лоскутное',
'одеяло',
', зная, что оно будет её')

>>> first_hit.kwic(left=30, right=30, level='char')
('егала своими руками лоскутное ', 'одеяло', ', зная, что оно будет её обере')
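
The character-level result above is consistent with plain slicing around idxs. The sketch below is illustrative only and is not the library's implementation: char_kwic is a hypothetical helper, and the text is copied from the printed example on the assumption that it begins with a single space, which is what makes idxs (59, 65) line up with the target. Word-level context depends on how the library tokenizes punctuation, so it is not sketched here.

```python
def char_kwic(text, idxs, left, right):
    """Split text into (left context, target, right context) by characters.

    Illustrative only; not the library's implementation.
    """
    l, r = idxs
    return text[max(0, l - left):l], text[l:r], text[r:r + right]


# Text and idxs from the example above; the text starts with a space.
text = (" Я, например, для внучки настегала своими руками лоскутное одеяло,"
        " зная, что оно будет её оберегать, давать ей энергию.")
left_ctx, target, right_ctx = char_kwic(text, (59, 65), left=30, right=30)
print(repr(target))
```

The max(0, ...) guard keeps the left slice from wrapping around when the target sits closer to the start of the text than the requested context length.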