API¶

Corpus object¶

class lingcorpora.corpus.Corpus(language, verbose=True)¶

Instantiate one object of this class per corpus. Searches are run through the search method.

Parameters:

language (str) –

ISO 639-3 language code of the corpus; parallel corpora use combined codes (for example, rus_pol). Available corpora and their codes:

Code	Corpus
ady	Adyghe corpus
alb	Albanian corpus
arm	Eastern Armenian corpus
bam	Corpus Bambara de Reference
bua	Buryat corpus
dan	Danish corpus
deu	German corpus
emk	Maninka Automatically Parsed corpus
est	Estonian corpus
grk	Modern Greek corpus
hin	Hindi corpus
kal	Kalmyk corpus
kat	Georgian monolingual corpus
kaz	Almaty corpus of the Kazakh language
mon	Mongolian corpus
rus	National Corpus of Russian
rus_parallel	Parallel subcorpus of National Corpus of Russian Language
rus_pol	Polish-Russian Parallel Corpus
tat	Tatar corpus
udm	Udmurt corpus
yid	Modern Yiddish corpus
zho	Center of Chinese Linguistics corpus
zho_eng	Chinese-English subcorpus of JuKuu corpus

verbose (bool, default True) – whether to show a tqdm progress bar.

doc¶

Documentation for the chosen corpus (available after the instance is created).

Type:	str

results¶

List of all Result objects returned by the search method.

Type:	list

failed¶

List of Result objects for queries where nothing was found.

Type:	list

reset_failed()¶: Resets .failed.

retry_failed()¶: Applies .search() to the failed queries stored in .failed.
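
The interplay between results, failed, reset_failed() and retry_failed() can be pictured with a small toy model. This is only a sketch: ToyCorpus, ToyResult and flaky_backend are hypothetical names, no network access is involved, and whether the library clears .failed before retrying is an assumption made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResult:
    """Hypothetical stand-in for a Result: a query plus its hits."""
    query: str
    hits: list = field(default_factory=list)


class ToyCorpus:
    """Toy model of the results / failed bookkeeping (not lingcorpora code)."""

    def __init__(self, backend):
        self.backend = backend  # callable: query -> list of hits
        self.results = []       # every ToyResult returned so far
        self.failed = []        # ToyResults that came back empty

    def search(self, query):
        result = ToyResult(query, self.backend(query))
        self.results.append(result)
        if not result.hits:
            self.failed.append(result)
        return [result]

    def reset_failed(self):
        self.failed = []

    def retry_failed(self):
        # Assumption: take the pending failures, clear the list, search again.
        pending, self.failed = self.failed, []
        return [r for old in pending for r in self.search(old.query)]


# A fake backend that returns nothing on the first call and one hit afterwards.
calls = {"n": 0}


def flaky_backend(query):
    calls["n"] += 1
    return [] if calls["n"] == 1 else [f"hit for {query}"]


corp = ToyCorpus(flaky_backend)
corp.search("tuma")                 # first attempt finds nothing
failed_before = len(corp.failed)
retried = corp.retry_failed()       # second attempt succeeds
failed_after = len(corp.failed)
```

In this model a query that fails again during the retry simply lands back in failed, so retry_failed() can be called repeatedly.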

search(query, *args, **kwargs)¶

Queries the corpus and returns a list of Result objects.

Parameters:	query (str) – the query; for the remaining arguments, see params_container.Container.

Example

>>> import lingcorpora
>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('мешок', n_results=10)
"мешок": 100%|███████████████████████████████| 10/10 [00:07<00:00,  1.40docs/s]
>>> rus_results
[Result(query=мешок, N=10, params={'n_results': 10, 'kwic': True, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'main', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': None})]

class lingcorpora.params_container.Container(query, n_results=100, kwic=True, n_left=None, n_right=None, subcorpus=None, get_analysis=False, gr_tags=None, query_language=None, start=0, writing_system=None)¶

Universal arguments: query, n_results.

Other arguments depend on the corpus.

Parameters:

query (str or list[str]) – query or queries.
n_results (int, default 100) – number of results wanted.
kwic (bool, default True) – KWIC format (True) or a full sentence (False).
n_left (int, default None) – number of words or characters (corpus-specific) in the left context.
n_right (int, default None) – number of words or characters (corpus-specific) in the right context.
subcorpus (str, default None) – subcorpus to search in.
get_analysis (bool, default False) – whether to download grammatical information if the corpus is annotated.
gr_tags (dict, default None) – tags for grammar search.
query_language (str, default None) – for parallel corpora, the language of the query.
start (int, default 0) – result index to start from.
writing_system (str, default None) – writing system of results.
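
To make the defaults concrete, the sketch below mirrors the documented signature in a plain dataclass. ContainerSketch and its queries() helper are hypothetical and are not the library's Container; treating a single string as a one-element list is an assumption based on query accepting str or list[str].

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class ContainerSketch:
    """Hypothetical mirror of the documented signature (not the library class)."""
    query: Union[str, List[str]]
    n_results: int = 100
    kwic: bool = True
    n_left: Optional[int] = None
    n_right: Optional[int] = None
    subcorpus: Optional[str] = None
    get_analysis: bool = False
    gr_tags: Optional[dict] = None
    query_language: Optional[str] = None
    start: int = 0
    writing_system: Optional[str] = None

    def queries(self):
        # Assumption: a single string is treated as a one-element list.
        return [self.query] if isinstance(self.query, str) else list(self.query)


params = ContainerSketch("мешок", n_results=10)
as_dict = asdict(params)          # every argument with its effective value
single = params.queries()
several = ContainerSketch(["мешок", "одеяло"]).queries()
```

Only query and n_results are universal. In the search example for 'rus', the reported params show subcorpus as 'main' even though the signature default is None, which suggests that individual corpora substitute their own defaults.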

Working with results¶

class lingcorpora.result.Result(language, query_params)¶

An object of this class contains all results found. Result objects are iterable and support indexing.

Parameters:

language (str) – corpus language.
query_params (dict) – all other parameters of the search.

results¶

List of results.

Type:	list[Target]

n¶

Number of results.

Type:	int

query¶

Search query.

Type:	str

Example

>>> corp = lingcorpora.Corpus('emk')
>>> results = corp.search('tuma', n_results=10, kwic=False)[0]
"tuma": 100%|██████████| 10/10 [00:00<00:00, 11.09docs/s]
>>> results
Result(query=tuma, N=10, params={'n_results': 10, 'kwic': False, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'cormani-brut-lat', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': ''})

clear()¶

Overwrites the results attribute with an empty list.

Example

>>> print(results.results)
[Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, )]
>>> results.clear()
>>> print(results.results)
[]
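
The container behaviour described for Result (iteration, indexing, clear()) can be illustrated without querying a corpus. The class below is a hypothetical stand-in written for this illustration, not the library's implementation, and it holds plain strings where a real Result holds Target objects.

```python
class ResultSketch:
    """Hypothetical container illustrating iteration, indexing and clear()."""

    def __init__(self, query, items):
        self.query = query
        self.results = list(items)

    def __iter__(self):
        # Iterating over the object walks through the stored hits.
        return iter(self.results)

    def __getitem__(self, index):
        # Indexing (and slicing) is delegated to the underlying list.
        return self.results[index]

    def clear(self):
        # Replace the stored hits with an empty list.
        self.results = []


res = ResultSketch("tuma", ["hit 1", "hit 2", "hit 3"])
first = res[0]
all_hits = [hit for hit in res]
res.clear()
```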

export_csv(filename=None, header=True, sep=';')¶

Saves the search results as a CSV file.

Parameters:

filename (str, default None) – name of the file. If None, the filename is lang_query_results.csv, with disallowed filename symbols omitted.
header (bool, default True) – whether to include a header in the table. The header is stored in .__header: `('index', 'text')`.
sep (str, default ';') – cell separator in the CSV.
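
The default filename and the table layout can be sketched with the standard csv module. Both helpers below are hypothetical, and the set of characters treated as disallowed is an assumption; the library may strip a different set and may write more columns than the two in the documented header.

```python
import csv
import io
import re


def default_filename(lang, query):
    """Build lang_query_results.csv, dropping characters unsafe in filenames."""
    # Assumption: these are the characters considered disallowed.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", query)
    return f"{lang}_{cleaned}_results.csv"


def rows_to_csv(texts, header=True, sep=";"):
    """Render (index, text) rows as CSV text using the given separator."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    if header:
        writer.writerow(("index", "text"))
    for i, text in enumerate(texts):
        writer.writerow((i, text))
    return buf.getvalue()


name = default_filename("rus", "a/b?")
table = rows_to_csv(["x", "y; z"])
```

A cell that contains the separator is quoted by csv.writer, so a sentence with a semicolon in it does not break the column layout.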

class lingcorpora.target.Target(text, idxs, meta, analysis, gr_tags=None, transl=None, lang=None)¶

Target contains one item from the result list.

Parameters:

text (str) – full sentence / document.
idxs (tuple (l, r)) – indexes of the target in self.text, so that the target is self.text[l:r].
meta (str) – sentence / document info (if exists).
analysis (list of dicts) – target analysis (parsed).
gr_tags (str, default None) – grammatical tags passed by user.
transl (str, default None) – text translation (for parallel corpora and dictionaries).
lang (str, default None) – translation language (for parallel corpora and dictionaries).

Examples

>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('одеяло', n_results=10, get_analysis=True)[0]
>>> first_hit = rus_results[0]
>>> first_hit
Target(одеяло, Народный костюм: архаика или современность? // «Народное творчество», 2004)

>>> for k, v in vars(first_hit).items():
...     print(k, v)
text  Я, например, для внучки настегала своими руками лоскутное одеяло, зная, что оно будет её оберегать, давать ей энергию.
idxs (59, 65)
meta Народный костюм: архаика или современность? // «Народное творчество», 2004
tags {'lex': ['одеяло'], 'gramm': ['S', 'inan', 'n', 'sg', 'acc', 'disamb'], 'sem': ['r:concr', 't:tool:bedding'], 'flags': ['animred', 'bcomma', 'bmark', 'casered', 'genderred', 'numred']}
transl None
lang None

kwic(left, right, level='word')¶

Returns the item in KWIC format, as a tuple of left context, target and right context, for further use and CSV output.

Parameters:

left (int) – length of the left context.
right (int) – length of the right context.
level (str, default word) – whether context length is counted in tokens (word) or in characters (char).

Examples

>>> first_hit.kwic(left=5, right=5)
('внучки настегала своими руками лоскутное',
'одеяло',
', зная, что оно будет её')

>>> first_hit.kwic(left=30, right=30, level='char')
('егала своими руками лоскутное ', 'одеяло', ', зная, что оно будет её обере')
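
How idxs and the two counting levels fit together can be sketched with plain string slicing. The function below is a hypothetical re-implementation for illustration only: it assumes that a word is a run of \w characters, and it assumes that the stored sentence starts with a space, as the doubled space in the printed text and the idxs value (59, 65) suggest.

```python
import re


def kwic_sketch(text, idxs, left, right, level="word"):
    """Split text into (left context, target, right context)."""
    start, end = idxs
    before, target, after = text[:start], text[start:end], text[end:]

    if level == "char":
        # Context length counted in characters.
        left_ctx = before[max(0, len(before) - left):]
        right_ctx = after[:right]
        return left_ctx, target, right_ctx

    # Context length counted in words (assumption: a word is a run of \w).
    words_before = list(re.finditer(r"\w+", before))[-left:] if left > 0 else []
    words_after = list(re.finditer(r"\w+", after))[:right]
    left_ctx = before[words_before[0].start():].strip() if words_before else ""
    right_ctx = after[:words_after[-1].end()].strip() if words_after else ""
    return left_ctx, target, right_ctx


# Sentence from the example above; the leading space is an assumption.
text = (" Я, например, для внучки настегала своими руками лоскутное одеяло, "
        "зная, что оно будет её оберегать, давать ей энергию.")
idxs = (59, 65)

word_ctx = kwic_sketch(text, idxs, left=5, right=5)
char_ctx = kwic_sketch(text, idxs, left=30, right=30, level="char")
```

Under these assumptions the sketch yields the same tuples as the two documented calls, but the library's own tokenization may differ for other sentences.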