Integrate Unicode CLDR: Working with Retrieved Data

I have been working on scripts and modifying the data for quite some time, and the data now seems ready to be used for parsing dates. The data retrieved from Unicode CLDR has been divided into two parts: numeral_translation_data, which will be used specifically to parse numerals, and date_translation_data, which will be used to parse date strings once any numerals in them have been converted. The existing data that various contributors have added to dateparser has been reduced so that it supplements the CLDR data: only the portion that is not already covered by CLDR remains as supplementary data, and contributors will continue to maintain it in the future.

The date translation data consists of translations for months, weekdays and periods, the date order, and translations for relative-type dates. It is stored per language, and for each language, locale-specific data is stored for every locale of that language. Locale-specific data uses the same format as the language data, and only the portion that differs from or adds to the language data is stored. It works on the same principle as the supplementary data for a language. To load a given locale for parsing, the CLDR data for the locale's language is loaded first, then it is extended with the supplementary data for that language, and finally the locale-specific data is combined with this base language data to form the complete locale data.
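The loading order described above can be sketched roughly as follows. This is only an illustration, not the actual dateparser code: the function names `merge_layers` and `load_locale_data`, the dictionary keys, and the sample values are all hypothetical, and the merge rules (lists extended, other values overridden) are an assumption about how the layers combine.

```python
from copy import deepcopy


def merge_layers(base, extra):
    """Return a new dict in which `extra` extends or overrides `base`.

    Assumed merge rules (hypothetical): nested dicts are merged
    recursively, lists are extended without duplicates, and any
    other value in `extra` replaces the one in `base`.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        current = merged.get(key)
        if isinstance(value, dict) and isinstance(current, dict):
            merged[key] = merge_layers(current, value)
        elif isinstance(value, list) and isinstance(current, list):
            merged[key] = current + [v for v in value if v not in current]
        else:
            merged[key] = deepcopy(value)
    return merged


def load_locale_data(cldr_language, supplementary_language, locale_specific):
    """Build complete locale data from the three layers, in order."""
    # Step 1 and 2: CLDR language data extended with supplementary data.
    language_data = merge_layers(cldr_language, supplementary_language)
    # Step 3: locale-specific differences applied on top.
    return merge_layers(language_data, locale_specific)


# Hypothetical sample data for the language "en" and the locale "en-GB".
cldr_en = {"date_order": "MDY", "january": ["january", "jan"]}
supplementary_en = {"january": ["jan."]}
en_gb = {"date_order": "DMY"}

locale_data = load_locale_data(cldr_en, supplementary_en, en_gb)
```

With this sample data, the locale keeps every translation of "january" from both language layers and only the date order is replaced by the locale-specific value. The inputs are copied rather than modified, so the same language data can be reused for other locales.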

Now that the data is ready to be used for parsing, the next task is to modify the codebase to load the data as required and to support parsing with locales. I have started work on loading, and after that I will make the necessary changes in the parser to support locales and parsing with the new data. After that comes what seems to me the most difficult part: creating a numeral parser that converts numerals written in words to integers, which will then be used to rewrite date strings containing numerals before they are parsed. I have divided my remaining work into two parts. First, I will make the necessary changes to support parsing dates without numerals and write the relevant tests. Later, I will work on the numeral parser and make the changes needed to support parsing dates with numerals.
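To give an idea of what the numeral step is meant to do, here is a very naive sketch. It is not the planned implementation: the `NUMERAL_WORDS` table is a tiny hypothetical stand-in for numeral_translation_data, `replace_numerals` is a made-up name, and simply adding up consecutive numeral words is an assumption that only works for simple cases such as "twenty one".

```python
# Hypothetical, English-only stand-in for numeral_translation_data.
NUMERAL_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "thirty": 30,
}


def replace_numerals(date_string, numerals=NUMERAL_WORDS):
    """Replace runs of numeral words in `date_string` with digits.

    Consecutive numeral words are summed ("twenty one" -> 21), which
    is a simplification and not a general numeral grammar.
    """
    result = []
    pending = None  # running total of the current run of numeral words
    for token in date_string.split():
        value = numerals.get(token.lower())
        if value is not None:
            pending = value if pending is None else pending + value
            continue
        if pending is not None:
            # The run of numeral words has ended; emit it as digits.
            result.append(str(pending))
            pending = None
        result.append(token)
    if pending is not None:
        result.append(str(pending))
    return " ".join(result)


converted = replace_numerals("twenty one March 2017")
```

The converted string ("21 March 2017" here) is what would then be handed to the date parser. A real numeral parser has to deal with multipliers such as "hundred", hyphenated forms, and language-specific rules, which is why this part is left for the second half of the work.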
