API
Corpus object
class lingcorpora.corpus.Corpus(language, verbose=True)
Instantiate one object of this class for each corpus. Searches are run through the search method.
Parameters: - language (str) – ISO 639-3 language code of the corpus; parallel corpora use combined codes. Available corpora and their codes:

Code          Corpus
ady           Adyghe corpus
alb           Albanian corpus
arm           Eastern Armenian corpus
bam           Corpus Bambara de Reference
bua           Buryat corpus
dan           Danish corpus
deu           German corpus
emk           Maninka Automatically Parsed corpus
est           Estonian corpus
grk           Modern Greek corpus
hin           Hindi corpus
kal           Kalmyk corpus
kat           Georgian monolingual corpus
kaz           Almaty corpus of the Kazakh language
mon           Mongolian corpus
rus           National Corpus of Russian
rus_parallel  Parallel subcorpus of National Corpus of Russian Language
rus_pol       Polish-Russian Parallel Corpus
tat           Tatar corpus
udm           Udmurt corpus
yid           Modern Yiddish corpus
zho           Center of Chinese Linguistics corpus
zho_eng       Chinese-English subcorpus of JuKuu corpus

- verbose (bool, default True) – whether to show a tqdm progress bar.

doc
Documentation for the chosen corpus (available after instance creation).
Type: str

results
List of all Result objects, each returned by the search method.
Type: list

failed
List of Result objects for which nothing was found.
Type: list

reset_failed()
Resets .failed.

retry_failed()
Applies .search() to the failed queries stored in .failed.
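
The relationship between results, failed, retry_failed() and reset_failed() can be pictured with a small standalone sketch. Everything in it is hypothetical: `QueryLog` and `flaky_backend` are illustration names, the records are plain tuples rather than Result objects, and the code is not taken from lingcorpora. It only shows one plausible way such bookkeeping could behave.

```python
class QueryLog:
    """Hypothetical stand-in for the results / failed bookkeeping."""

    def __init__(self, backend):
        self.backend = backend  # callable: query -> list of hits
        self.results = []       # every (query, hits) record produced so far
        self.failed = []        # records whose hit list was empty

    def search(self, query):
        hits = self.backend(query)
        record = (query, hits)
        self.results.append(record)
        if not hits:
            self.failed.append(record)
        return record

    def retry_failed(self):
        # Empty the failed list first; queries that fail again are re-added by search().
        pending, self.failed = self.failed, []
        return [self.search(query) for query, _ in pending]

    def reset_failed(self):
        self.failed = []


calls = {}


def flaky_backend(query):
    """Fake backend (assumption): returns nothing on the first call per query."""
    calls[query] = calls.get(query, 0) + 1
    return [] if calls[query] == 1 else ["hit for " + query]


log = QueryLog(flaky_backend)
log.search("tuma")            # first attempt finds nothing
failed_before = len(log.failed)
retried = log.retry_failed()  # second attempt succeeds
print(failed_before, len(log.failed), retried)
```

In this sketch a retried query that succeeds leaves failed empty, while every attempt stays recorded in results.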

search(query, *args, **kwargs)
Queries the corpus and returns the results.
Parameters: query (str) – the query; for the remaining arguments see params_container.Container.
Example
>>> import lingcorpora
>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('мешок', n_results=10)
"мешок": 100%|███████████████████████████████| 10/10 [00:07<00:00, 1.40docs/s]
>>> rus_results
[Result(query=мешок, N=10, params={'n_results': 10, 'kwic': True, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'main', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': None})]

class lingcorpora.params_container.Container(query, n_results=100, kwic=True, n_left=None, n_right=None, subcorpus=None, get_analysis=False, gr_tags=None, query_language=None, start=0, writing_system=None)
Universal arguments: query, n_results. Other arguments depend on the corpus.
Parameters: - query (str or list[str]) – query or queries.
- n_results (int, default 100) – number of results wanted.
- kwic (bool, default True) – return results in KWIC format (True) or as full sentences (False).
- n_left (int, default None) – number of words or symbols (corpus-specific) in the left context.
- n_right (int, default None) – number of words or symbols (corpus-specific) in the right context.
- subcorpus (str, default None) – subcorpus to search in.
- get_analysis (bool, default False) – whether to download grammatical information if the corpus is annotated.
- gr_tags (dict, default None) – tags for grammar search.
- query_language (str, default None) – for parallel corpora, language of the query.
- start (int, default 0) – result index to start from.
- writing_system (str, default None) – writing system of results.
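
As a rough illustration of how these defaults fit together, the sketch below mirrors the documented signature in a dataclass. `SearchParams` and its `queries()` helper are hypothetical names introduced here; this is not lingcorpora's Container, only an assumption-based model of a parameter bundle that accepts either one query or a list of queries.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class SearchParams:
    """Hypothetical mirror of the documented Container signature."""

    query: Union[str, List[str]]
    n_results: int = 100
    kwic: bool = True
    n_left: Optional[int] = None
    n_right: Optional[int] = None
    subcorpus: Optional[str] = None
    get_analysis: bool = False
    gr_tags: Optional[dict] = None
    query_language: Optional[str] = None
    start: int = 0
    writing_system: Optional[str] = None

    def queries(self) -> List[str]:
        # Normalise "str or list[str]" to a list so callers can always loop.
        return [self.query] if isinstance(self.query, str) else list(self.query)


params = SearchParams("мешок", n_results=10)
print(params.queries())
print(asdict(params))
```

Only query and n_results are described as universal, so a corpus-specific backend would be free to ignore or override the remaining fields.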
Working with results

class lingcorpora.result.Result(language, query_params)
An object of this class holds all results found for a query. A Result object is iterable and supports indexing.
Parameters: - language (str) – corpus language.
- query_params (dict) – all other parameters of the search.

n
Number of results.
Type: int

query
Search query.
Type: str
Example
>>> import lingcorpora
>>> corp = lingcorpora.Corpus('emk')
>>> results = corp.search('tuma', n_results=10, kwic=False)[0]
"tuma": 100%|██████████| 10/10 [00:00<00:00, 11.09docs/s]
>>> results
Result(query=tuma, N=10, params={'n_results': 10, 'kwic': False, 'n_left': None, 'n_right': None, 'query_language': None, 'subcorpus': 'cormani-brut-lat', 'get_analysis': False, 'gr_tags': None, 'start': 0, 'writing_system': ''})

clear()
Resets the results attribute to an empty list.
Example
>>> print(results.results)
[Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, ), Target(tuma, )]
>>> results.clear()
>>> print(results.results)
[]

export_csv(filename=None, header=True, sep=';')
Saves the search results as a CSV file.
Parameters: - filename (str, default None) – name of the file. If None, the file is named lang_query_results.csv, with disallowed filename symbols omitted.
- header (bool, default True) – whether to include a header row in the table. The header is stored in .__header: ('index', 'text')
- sep (str, default ';') – cell separator in the CSV.
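
The output format described above can be approximated with the standard csv module. The sketch below is an assumption-laden illustration, not the library's code: `default_filename` and `write_rows` are hypothetical helpers, and the set of characters treated as disallowed in filenames is a guess.

```python
import csv
import io
import re

HEADER = ("index", "text")  # header tuple as documented for export_csv


def default_filename(lang, query):
    """Build lang_query_results.csv, dropping characters assumed to be disallowed."""
    raw = "{}_{}_results.csv".format(lang, query)
    return re.sub(r'[\\/:*?"<>|]', "", raw)


def write_rows(texts, stream, header=True, sep=";"):
    """Write one (index, text) row per hit to any text stream."""
    writer = csv.writer(stream, delimiter=sep, lineterminator="\n")
    if header:
        writer.writerow(HEADER)
    for index, text in enumerate(texts):
        writer.writerow((index, text))


buffer = io.StringIO()
write_rows(["first hit", "second; hit"], buffer)
print(default_filename("rus", "что?"))
print(buffer.getvalue())
```

Because ';' is the separator, the csv writer quotes any cell that itself contains a semicolon, which keeps sentences with that character intact when the file is read back.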

class lingcorpora.target.Target(text, idxs, meta, analysis, gr_tags=None, transl=None, lang=None)
A Target holds one item from the result list.
Parameters: - text (str) – full sentence or document.
- idxs (tuple (l, r)) – indices of the target within self.text, so that the target is self.text[l:r].
- meta (str) – sentence or document info (if it exists).
- analysis (list of dicts) – target analysis (parsed).
- gr_tags (str, default None) – grammatical tags passed by the user.
- transl (str, default None) – text translation (for parallel corpora and dictionaries).
- lang (str, default None) – translation language (for parallel corpora and dictionaries).
Examples
>>> import lingcorpora
>>> rus_corp = lingcorpora.Corpus('rus')
>>> rus_results = rus_corp.search('одеяло', n_results=10, get_analysis=True)[0]
>>> first_hit = rus_results[0]
>>> first_hit
Target(одеяло, Народный костюм: архаика или современность? // «Народное творчество», 2004)
>>> for k, v in vars(first_hit).items():
...     print(k, v)
text Я, например, для внучки настегала своими руками лоскутное одеяло, зная, что оно будет её оберегать, давать ей энергию.
idxs (59, 65)
meta Народный костюм: архаика или современность? // «Народное творчество», 2004
tags {'lex': ['одеяло'], 'gramm': ['S', 'inan', 'n', 'sg', 'acc', 'disamb'], 'sem': ['r:concr', 't:tool:bedding'], 'flags': ['animred', 'bcomma', 'bmark', 'casered', 'genderred', 'numred']}
transl None
lang None

kwic(left, right, level='word')
Builds the KWIC (keyword in context) representation of an item, for further use and for CSV output.
Parameters: - left (int) – length of the left context.
- right (int) – length of the right context.
- level (str, default 'word') – whether context length is counted in tokens ('word') or in characters ('char').
Examples
>>> first_hit.kwic(left=5, right=5)
('внучки настегала своими руками лоскутное', 'одеяло', ', зная, что оно будет её')
>>> first_hit.kwic(left=30, right=30, level='char')
('егала своими руками лоскутное ', 'одеяло', ', зная, что оно будет её обере')
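
The difference between the two levels can be reproduced offline with a short sketch. `kwic_window` is a hypothetical helper written for this page, not lingcorpora's implementation. It assumes that character-level context is plain slicing around idxs and that word-level context splits on whitespace, so its handling of punctuation may differ from what Target.kwic returns.

```python
def kwic_window(text, idxs, left, right, level="word"):
    """Return (left context, target, right context) around text[l:r]."""
    l, r = idxs
    target = text[l:r]
    if level == "char":
        # Count context in characters: slice directly around the target.
        return text[max(0, l - left):l], target, text[r:r + right]
    if level == "word":
        # Count context in whitespace-separated tokens (an assumption).
        left_words = text[:l].split()
        right_words = text[r:].split()
        left_ctx = " ".join(left_words[-left:]) if left > 0 else ""
        right_ctx = " ".join(right_words[:right])
        return left_ctx, target, right_ctx
    raise ValueError("level must be 'word' or 'char'")


text = ("Я, например, для внучки настегала своими руками лоскутное одеяло, "
        "зная, что оно будет её оберегать, давать ей энергию.")
start = text.find("одеяло")  # locate the target instead of hard-coding idxs
idxs = (start, start + len("одеяло"))

print(kwic_window(text, idxs, left=30, right=30, level="char"))
print(kwic_window(text, idxs, left=3, right=3))
```

The target indices are computed with str.find so that the example does not depend on the exact whitespace of the stored sentence.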