Making a new API
Main concept
- Each API of a corpus, dictionary, etc. is a `PageParser` class (see below) which has method `.extract()`. `PageParser.extract()` is a generator (see `yield` in Python) of `Target` objects (individual hits). `PageParser` inherits from `Container`, which is a class in `params_container.py` and contains all possible parameters for corpora.
- All `Target` objects are collected in `search` (in the `Corpus` object) into the `Result` object.
- Documentation for users can be found here.
To make a new API
- Make a `PageParser` class
    - It inherits from `Container`, and the `Container` constructor is called in `__init__` (see example below)
    - It has a method `extract()` which yields `Target` objects
    - All other (auxiliary) parameters in `PageParser` should be encapsulated (add two underscores `__` to the start of their names)
- You should pass to the `Target` object the following information:
    - whole sentence (`text`) - string
    - indices (`idxs`) of the target in the sentence: `l` and `r` such that target == `text[l:r]` - tuple
    - metadata (`meta`) (document name, author, year, etc.) - string. If there is no meta, pass an empty string
    - grammar tags (`tags`) - dict. If there are no tags, pass an empty dict
    - for parallel corpora: translation (`transl`) - translation from `queryLanguage` to another language
    - for parallel corpora: language (`lang`) - the other language (not `queryLanguage`) in the example pair
    - Important: if there are several target occurrences in one example, you should split them into separate `Target` objects.
- Write the docstring `__doc__` and the author `__author__` before `PageParser`
- Name the file langcode_corpus.py and place it into the `corpora` directory. langcode stands for the ISO 639-3 code of the language
- For testing purposes, querying data must be provided via a `<dict>` named `TEST_DATA` (see template below for details)
- If you would like to add new search parameters, open `params_container.py` and add the parameter to the arguments (do not forget a default value) and to the attributes.
- Make a pull request and if the API is OK, we will:
    - Add it to the package
    - Include it in the docs
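The rule about `idxs` and about splitting several occurrences into separate objects can be sketched as follows. This is only an illustration: `Hit` is a hypothetical stand-in that mirrors the fields passed to `Target` (it is not the library class), and `split_occurrences` is a made-up helper name. The word-boundary, case-insensitive matching is also an assumption; a real corpus usually reports its own match positions.

```python
import re
from typing import Iterator, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical stand-in mirroring the fields passed to Target."""
    text: str
    idxs: Tuple[int, int]
    meta: str
    tags: dict


def split_occurrences(text: str, word: str, meta: str = "") -> Iterator[Hit]:
    """Yield one Hit per occurrence of `word` in `text`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # match.span() is (l, r) such that text[l:r] is the matched target
        yield Hit(text, match.span(), meta, {})


hits = list(split_occurrences("The cat saw another cat.", "cat", meta="demo"))
for hit in hits:
    l, r = hit.idxs
    print(hit.idxs, hit.text[l:r])
```

Each occurrence gets its own object carrying the same full sentence, so a sentence with two matches produces two hits that differ only in `idxs`.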
API template
from lingcorpora.params_container import Container
from lingcorpora.target import Target

__author__ = ''

# The docs should be in reST format
# The header is the name of the module
# The subheader is 'Search Parameters'
# e.g:
# Mongolian Corpus
# ================
#
# API for Mongolian corpus (http://web-corpora.net/MongolianCorpus/search/).
#
# **Search Parameters**
#
# query: str or list([str])
#     query or queries
# n_results: int, default 100
#     number of results wanted
# ...
#
# Example
# -------
# ...
__doc__ = \
"""
"""

# <dict> of querying data passed to `Corpus.search` as kwargs while testing
# keys and types to be preserved
TEST_DATA = {
    'test_single_query': {'query': <str>, ...},  # {arg: value, ...}
    'test_multi_query': {'query': [<str 1>, <str 2>, ... <str N>], ...}  # {arg: value, ...}
}


class PageParser(Container):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # inner auxiliary attributes:
        # self.__page = None
        # self.__pagenum = 0
        # ...

    def any_method_for_getting_the_results(self):
        pass

    # ...

    def any_method_for_getting_the_results_10(self):
        pass

    def extract(self):
        """
        --- Generator of found occurrences as `Target` types
        Query.search() uses this method---
        """
        # ...
        # for each occurrence found we pass `Target` object,
        # describing the occurrence, to Query.search()
        # for parallel corpora also transl and lang
        for text, idxs, meta, tags in found:
            yield Target(text, idxs, meta, tags)
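
To see how the pieces of the template fit together without installing the package, here is a toy sketch that runs on its own. `Container` and `Target` below are hypothetical stand-ins with only the fields needed for the example (the real classes have more parameters and may differ in signature), and `SENTENCES` is an invented in-memory "corpus" used in place of a web request.

```python
class Container:
    """Hypothetical stand-in for the library's Container (assumed fields only)."""
    def __init__(self, query, n_results=100):
        self.query = query
        self.n_results = n_results


class Target:
    """Hypothetical stand-in for the library's Target."""
    def __init__(self, text, idxs, meta, tags):
        self.text, self.idxs, self.meta, self.tags = text, idxs, meta, tags


# Invented toy data: (sentence, metadata) pairs instead of a real corpus page
SENTENCES = [
    ("A dog barked.", "doc1"),
    ("No match here.", "doc2"),
    ("The dog slept.", "doc3"),
]


class PageParser(Container):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # auxiliary attribute; the double underscore mangles it to _PageParser__seen
        self.__seen = 0

    def __find(self):
        # auxiliary method: first occurrence of the query in each sentence
        for text, meta in SENTENCES:
            start = text.find(self.query)
            if start != -1:
                yield text, (start, start + len(self.query)), meta, {}

    def extract(self):
        # generator of Target objects, stopping after n_results hits
        for text, idxs, meta, tags in self.__find():
            if self.__seen >= self.n_results:
                return
            self.__seen += 1
            yield Target(text, idxs, meta, tags)


results = list(PageParser("dog").extract())
for target in results:
    l, r = target.idxs
    print(target.meta, target.text[l:r])
```

In the real package the collecting step is done by `search`, not by calling `extract()` directly; the direct call here only keeps the sketch short.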