Integrating Unicode CLDR: Translation with Integrated Data

August 11, 2017

I have finally implemented translation of date strings with the new integrated data. I worked on it for much longer than I expected: I had hoped to finish translating date strings without numeral words much earlier, so that I could start on the numeral parser as soon as possible. It took longer because I did not think the problem through clearly, started with a complicated solution, and only later moved to a simpler one. As I mentioned in my last blog post, the problem was to translate relative date strings, which are of two types: those with no digits, like 'yesterday', and those with digits, which are stored as regex patterns like '(\\d+) day ago'.

Currently dateparser uses translations for 'ago' and 'in' along with other words, and relative dates are translated in the same way as other date strings. Dates are first split by the numbers within them and then by the known words stored in the translation data for the language; the resulting tokens are then translated and joined to form the translated string.

To translate with relative strings, I first thought of substituting them with their translations, separating the translated strings from the rest of the date, and storing the pieces in two dictionaries, translated and not_translated, with indices in the original date string as keys and tokens (translated and not translated respectively) as values. The tokens in the not_translated dictionary would then be translated using the approach dateparser currently uses, and all tokens joined to form the translated date. This method was unnecessarily complicated and inefficient.

Then I realized that date strings could be split by the stored relative strings in the same way they are split by known words. With the new approach, dates are split first by relative strings, and then the parts other than relative strings are split first by numbers and then by known words in the dictionary. After the date is split into tokens, each token is translated with a compiled relative pattern if it matches one, and otherwise with the translations of known words (including relative words without numbers) in the dictionary. I am satisfied with this implementation of translation, although some tests are still failing and need to be resolved.
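To make the new approach concrete, here is a minimal sketch of the split-then-translate idea in Python. It is not the actual dateparser code: the French sample data, the names RELATIVE_PATTERNS and KNOWN_WORDS, and the helper functions are all hypothetical and only illustrate the order of the steps (split by relative strings, then by numbers, then by known words, then translate each token).

```python
import re

# Hypothetical sample data for one language (French). This is an
# assumption for illustration, not dateparser's real data format.
RELATIVE_PATTERNS = {
    r"il y a (\d+) jours?": r"\1 day ago",
    r"dans (\d+) jours?": r"in \1 day",
}
KNOWN_WORDS = {
    "hier": "yesterday",  # relative word without digits
    "janvier": "january",
}

COMPILED_RELATIVE = [(re.compile(p), t) for p, t in RELATIVE_PATTERNS.items()]
RELATIVE_SPLITTER = re.compile("|".join("(?:%s)" % p for p in RELATIVE_PATTERNS))
NUMBER_SPLITTER = re.compile(r"\d+")
WORD_SPLITTER = re.compile(
    "|".join(r"\b%s\b" % re.escape(w)
             for w in sorted(KNOWN_WORDS, key=len, reverse=True))
)


def split_keep(regex, text):
    """Split text on regex matches, keeping the matches as tokens."""
    tokens, pos = [], 0
    for match in regex.finditer(text):
        if match.start() > pos:
            tokens.append(text[pos:match.start()])
        tokens.append(match.group(0))
        pos = match.end()
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens


def tokenize(date_string):
    """Split by relative strings, then by numbers, then by known words."""
    tokens = []
    for part in split_keep(RELATIVE_SPLITTER, date_string):
        if RELATIVE_SPLITTER.fullmatch(part):
            tokens.append(part)  # keep a relative string as one token
            continue
        for piece in split_keep(NUMBER_SPLITTER, part):
            tokens.extend(split_keep(WORD_SPLITTER, piece))
    return tokens


def translate_token(token):
    """Use a relative pattern if one matches, else the known-word table."""
    for regex, template in COMPILED_RELATIVE:
        match = regex.fullmatch(token)
        if match:
            return match.expand(template)
    return KNOWN_WORDS.get(token, token)  # unknown tokens pass through


def translate(date_string):
    return "".join(translate_token(t) for t in tokenize(date_string))


print(translate("il y a 2 jours"))
print(translate("hier 14:30"))
```

Keeping the matched relative string as a single token is what lets a pattern with digits be translated as a whole, before the number splitting has a chance to break it apart.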

I will resolve the failing tests and keep adding new tests regularly. At the same time, I have to implement the numeral parser to parse dates with numerals in word form.
