Source author record

Peter-Lucas Jones

Peter-Lucas Jones appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Catalog footprint

What is connected

1 work
1 topic
4 close collaborators

Published work

1 published item

Preprint, 2022, arXiv

The Development of a Labelled te reo Māori-English Bilingual Database for Language Technology

Te reo Māori (referred to here as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual and code-switch between Māori and English. However, few resources exist for Māori language technology, including language detection and code-switch detection for the Māori-English pair. Both English and Māori use Roman-derived orthography, which makes rule-based systems for detecting language and code-switching restrictive. Most Māori language detection is done manually by language experts. This research builds a Māori-English bilingual database of 66,016,807 words with word-level language annotation. The database was built from the New Zealand Parliament Hansard debate reports. Language labels were assigned using language-specific rules and expert manual annotation. Some words have the same spelling in Māori and English but different meanings; these could not be categorised as Māori or English using word-level language rules, so manual annotation was necessary. An analysis of the database is also reported, covering metadata, a year-wise breakdown, frequently occurring words, sentence length and n-grams. The database is a valuable resource for future language and speech technology development in Aotearoa New Zealand, and the labelling methodology can also be applied to other low-resource language pairs.
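
The abstract does not spell out the labelling rules, so the following is only a minimal Python sketch of what word-level, rule-based labelling with a manual-review fallback could look like. It assumes a rule built on standard Māori orthography (five vowels with optional macrons, the consonants h, k, m, n, p, r, t, w and the digraphs ng and wh, with every syllable ending in a vowel). The names `MAORI_WORD`, `AMBIGUOUS`, `label_word` and `label_text` are hypothetical, and the short `AMBIGUOUS` list is illustrative rather than taken from the database.

```python
import re

# Assumed rule: a Māori word is a sequence of syllables, each an optional
# consonant (h k m n p r t w, or the digraphs ng / wh) followed by one or
# more vowels, with or without macrons.
MAORI_WORD = re.compile(r"(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū]+)+")

# Hypothetical, illustrative list of spellings that are valid in both
# languages; the rule cannot separate these, so they go to manual review.
AMBIGUOUS = {"a", "i", "he", "me", "to", "no", "one", "more", "take", "hope"}


def label_word(word):
    """Return 'mi', 'en', or 'ambiguous' for a single word."""
    w = word.lower()
    if not MAORI_WORD.fullmatch(w):
        # Anything that breaks the Māori spelling pattern defaults to English.
        return "en"
    if w in AMBIGUOUS:
        return "ambiguous"
    return "mi"


def label_text(text):
    """Split text into alphabetic tokens and label each one."""
    tokens = re.findall(r"[^\W\d_]+", text)
    return [(token, label_word(token)) for token in tokens]


if __name__ == "__main__":
    for token, label in label_text("Kia ora Mr Speaker, he tangata pai"):
        print(token, label)
```

In this sketch, words such as "he" or "one" satisfy the Māori spelling pattern and are also English words, which is the kind of case the abstract says required expert manual annotation. A rule of this kind will also mislabel English words that happen to fit the pattern but are not in the ambiguity list, so it is a starting point rather than a substitute for expert review.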