Making a new API
Main concept
- Each API of a corpus, dictionary, etc. is a `PageParser` class (see below) which has a method `.extract()`. `PageParser.extract()` is a generator (see `yield` in Python) of `Target` objects (individual hits). `PageParser` inherits from `Container`, which is a class in `params_container.py` and contains all possible parameters for corpora.
- All `Target` objects are collected in `search` (in the `Corpus` object) into the `Result` object.
- Documentation for users can be found here.
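The flow described above can be sketched without the package installed. The snippet below is only an illustration under stated assumptions: `Target` and `Container` are simplified stand-ins written for this example (not the real lingcorpora classes), the parser reads an in-memory list instead of a corpus website, and `collect` is a hypothetical helper that mimics what a search routine might do with the generator.

```python
from collections import namedtuple

# Stand-in for the package's Target: a simplified assumption, not the real class
Target = namedtuple("Target", ["text", "idxs", "meta", "tags"])


class Container:
    """Stand-in for params_container.Container: it only stores parameters."""

    def __init__(self, query, n_results=100):
        self.query = query
        self.n_results = n_results


class PageParser(Container):
    """Toy parser that searches an in-memory list instead of a website."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # auxiliary attribute, hidden with two leading underscores
        self.__sentences = ["a cat sat", "no match here", "the cat ran"]

    def extract(self):
        # A generator: each hit is yielded as soon as it is found
        for sentence in self.__sentences:
            left = sentence.find(self.query)
            if left != -1:
                yield Target(sentence, (left, left + len(self.query)), "", {})


def collect(parser):
    """Hypothetical helper: drain the generator, keeping at most n_results hits."""
    results = []
    for target in parser.extract():
        if len(results) >= parser.n_results:
            break
        results.append(target)
    return results


hits = collect(PageParser("cat"))
```

Because `extract()` is a generator, the collecting side decides when to stop, so a parser does not need to download more pages than the requested number of results requires.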
To make a new API
- Make a `PageParser` object
  - It inherits from `Container`, and the `Container` constructor is called in `__init__` (see example below)
  - It has a method `extract()` which yields `Target` objects
  - All other (auxiliary) parameters in `PageParser` should be encapsulated (add two underscores `__` to the start of their names)
- You should pass the following information to the `Target` object:
  - whole sentence (`text`) - string
  - indices (`idxs`) of the target in the sentence: `l` and `r` such that target == `text[l:r]` - tuple
  - metadata (`meta`) (document name, author, year, etc.) - string. If there is no meta, pass an empty string
  - grammar tags (`tags`) - dict. If there are no tags, pass an empty dict
  - for parallel corpora: translation (`transl`) - translation from `queryLanguage` to another language
  - for parallel corpora: language (`lang`) - the other language (not `queryLanguage`) in the example pair
  - Important: if there are several target occurrences in one example, you should split them into separate `Target` objects.
- Write the docstring `__doc__` and the author `__author__` before `PageParser`
- Name the file `langcode_corpus.py` and place it into the `corpora` directory. `langcode` stands for the ISO 639-3 code of the language
- For testing purposes, querying data must be provided via a `dict` named `TEST_DATA` (see template below for details)
- If you would like to add new search parameters, open `params_container.py` and add each parameter to the arguments (do not forget a default value) and to the attributes.
- Make a pull request, and if the API is OK, we will:
  - Add it to the package
  - Include it in the docs
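The `idxs` rule and the requirement to split several occurrences into separate `Target` objects can be sketched as follows. This is a minimal illustration, not the package's own code: `Target` is a stand-in namedtuple defined for the example, and `targets_from_sentence` is a hypothetical helper that assumes the target is a literal string that can be located with a regular expression.

```python
import re
from collections import namedtuple

# Stand-in for the package's Target: a simplified assumption, not the real class
Target = namedtuple("Target", ["text", "idxs", "meta", "tags"])


def targets_from_sentence(text, word, meta="", tags=None):
    """Yield one Target per occurrence of `word` in `text` (hypothetical helper)."""
    for match in re.finditer(re.escape(word), text):
        # idxs is a tuple (l, r) such that text[l:r] is the target itself
        yield Target(text, (match.start(), match.end()), meta, dict(tags or {}))


sentence = "The cat saw another cat."
hits = list(targets_from_sentence(sentence, "cat"))
```

Here one example sentence produces two `Target` objects that share the same `text` and differ only in `idxs`. With no metadata or tags available, `meta` falls back to an empty string and `tags` to an empty dict, as the list above asks.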
API template
from lingcorpora.params_container import Container
from lingcorpora.target import Target
__author__ = ''
# The docs should be in reST format
# The header is the name of the module
# The subheader is 'Search Parameters'
# e.g:
# Mongolian Corpus
# ================
#
# API for Mongolian corpus (http://web-corpora.net/MongolianCorpus/search/).
#
# **Search Parameters**
#
# query: str or list([str])
# query or queries
# n_results: int, default 100
# number of results wanted
# ...
#
# Example
# -------
# ...
__doc__ = \
"""
"""
# <dict> of querying data passed to `Corpus.search` as kwargs while testing
# keys and types to be preserved
TEST_DATA = {
    'test_single_query': {'query': <str>, ...},  # {arg: value, ...}
    'test_multi_query': {'query': [<str 1>, <str 2>, ... <str N>], ...}  # {arg: value, ...}
}
class PageParser(Container):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # inner auxiliary attributes:
        # self.__page = None
        # self.__pagenum = 0
        # ...

    def any_method_for_getting_the_results(self):
        pass

    # ...

    def any_method_for_getting_the_results_10(self):
        pass

    def extract(self):
        """
        Generator of found occurrences as `Target` objects.
        `Corpus.search()` uses this method.
        """
        # ...
        # for each occurrence found we yield a `Target` object
        # describing the occurrence to `Corpus.search()`;
        # for parallel corpora also pass transl and lang
        # (`found` is a placeholder for the hits collected by your own methods)
        for text, idxs, meta, tags in found:
            yield Target(text, idxs, meta, tags)
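Before opening a pull request, it may help to check that `TEST_DATA` keeps the keys and value types the template asks for. The sketch below is an assumption about what such a check could look like: `check_test_data` is a hypothetical helper that is not part of the package, and the sample queries and `n_results` values are placeholders chosen for illustration.

```python
# Filled-in TEST_DATA with placeholder values (assumed for this example)
TEST_DATA = {
    'test_single_query': {'query': 'cat', 'n_results': 5},
    'test_multi_query': {'query': ['cat', 'dog'], 'n_results': 5},
}


def check_test_data(data):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    single = data.get('test_single_query', {}).get('query')
    multi = data.get('test_multi_query', {}).get('query')
    # the single-query case expects one string
    if not isinstance(single, str):
        problems.append('test_single_query: query must be a str')
    # the multi-query case expects a non-empty list of strings
    if not (isinstance(multi, list) and multi
            and all(isinstance(q, str) for q in multi)):
        problems.append('test_multi_query: query must be a non-empty list of str')
    return problems


problems = check_test_data(TEST_DATA)
```

An empty `problems` list only says the dictionary has the expected shape; whether the queries actually return hits still depends on the corpus being tested.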