Integrating Unicode CLDR: Initial Phase
It has been a while since I last wrote, and I have been busy working on my project. As I was free during the GSoC bonding period, I started early to give myself a head start. I began with scripts to retrieve translation data from Unicode CLDR. That worked out well: by the time the GSoC period started, I had already written a major part of the scripts and retrieved most of the data required for parsing. At the same time I started writing tests for dateparser, and test coverage has increased a fair amount. After the GSoC period started, I made further changes to the scripts to resolve some issues, such as correcting the date order of languages and storing the data in order. Now that I have added numeral data as well, the retrieved data contains every component needed to translate dates, as set out in my proposal. The challenge now is to use this data to translate dates effectively and efficiently.
Currently dateparser uses a dictionary-based method to translate date strings, in which translations are mapped to the corresponding English words. This approach works as long as we don't have to deal with numerals written out in words. The problem is fairly obvious: unlike months and weekdays, which are finite in number, numbers are infinite, and even if we restrict ourselves to a finite range, it is not feasible to store translations for that many numbers. Fortunately, storing every number is not necessary, because numerals in every language follow a pattern that repeats itself. To store numeral translation data, that pattern needs to be encoded, as is done for the numerals in cldr-rbnf, the repository of numeral translation data in Unicode CLDR. The problem at hand is to work out an approach that uses this encoded data to translate numerals, and that works in conjunction with the dictionary approach used for the other words in a date string.
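To make the idea concrete, here is a minimal sketch of how a pattern-based numeral parser might sit alongside a dictionary lookup. Everything in it is an assumption for illustration: the tiny French word lists, the "tens [et] unit" pattern, and the names parse_numeral and translate are hypothetical, and none of it reflects the actual cldr-rbnf encoding or dateparser's internals.

```python
# Toy data, assumed for illustration only (not the real CLDR or dateparser format).
WORDS = {"janvier": "january", "lundi": "monday"}  # plain dictionary lookup
UNITS = {"un": 1, "deux": 2, "trois": 3, "quatre": 4, "cinq": 5,
         "six": 6, "sept": 7, "huit": 8, "neuf": 9}
TENS = {"dix": 10, "vingt": 20, "trente": 30}


def parse_numeral(tokens, i):
    """Try to read a numeral starting at tokens[i].

    Returns (value, next_index), or (None, i) if no numeral starts here.
    The only pattern handled is: tens, optionally "et", then a unit.
    """
    if tokens[i] in TENS:
        value = TENS[tokens[i]]
        j = i + 1
        # "vingt et un" -> 21
        if j + 1 < len(tokens) and tokens[j] == "et" and tokens[j + 1] in UNITS:
            return value + UNITS[tokens[j + 1]], j + 2
        # "vingt deux" (from "vingt-deux") -> 22
        if j < len(tokens) and tokens[j] in UNITS:
            return value + UNITS[tokens[j]], j + 1
        return value, j
    if tokens[i] in UNITS:
        return UNITS[tokens[i]], i + 1
    return None, i


def translate(date_string):
    """Translate numerals by pattern and every other word by dictionary."""
    tokens = date_string.lower().replace("-", " ").split()
    out = []
    i = 0
    while i < len(tokens):
        value, next_i = parse_numeral(tokens, i)
        if value is not None:
            out.append(str(value))
            i = next_i
        else:
            # Unknown words are passed through unchanged.
            out.append(WORDS.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)


print(translate("lundi vingt et un janvier"))  # monday 21 january
```

The point of the sketch is only the division of labour: a handful of numeral words plus one composition rule covers many numbers, while months and weekdays stay in the dictionary. Real languages need far more rules (and irregular forms such as "onze"), which is exactly what the encoded CLDR data is meant to supply.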