Integrating Unicode CLDR: Parsing with Retrieved Data

As I mentioned in the last blog post, I have been modifying the codebase so that dateparser can use the new data to parse date strings. Most of the new translation data is similar to the data used previously; the only major difference is in how relative date strings are handled. The previous data used translations of "ago" and "in" to translate relative date strings like "10 years ago", "15 hours 10 minutes ago" and "in 10 minutes and 8 seconds". The data retrieved from Unicode CLDR instead contains translations for complete relative expressions, so that they translate directly to forms like "** year ago" or "in ** month", which can then be used for parsing. Initially I thought direct translations like these would be enough, but then I realised that they do not cover many relative dates. These include dates that contain more than one date field, like "1 year 3 months ago", and dates that also include a time, like "in 3 days 2 pm". Currently dateparser is able to parse such dates, and many others that contain relative dates as substrings, so if they were not translated correctly they would no longer be parsed.
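To make the limitation concrete, here is a minimal sketch of whole-string translation. The French entries and the names DIRECT and translate_direct are made up for illustration and are not dateparser's actual data or API; "{0}" simply marks the number slot.

```python
import re

# Hypothetical French entries in the spirit of the retrieved data,
# written by hand for this sketch. "{0}" is the number slot.
DIRECT = {
    "il y a {0} ans?": "{0} year ago",
    "dans {0} mois": "in {0} month",
}


def translate_direct(date_string):
    """Translate only when the whole string is a single relative expression."""
    for pattern, template in DIRECT.items():
        match = re.fullmatch(pattern.format(r"(\d+)"), date_string)
        if match:
            return template.format(match.group(1))
    return None  # no entry covers the complete string


print(translate_direct("il y a 10 ans"))       # 10 year ago
print(translate_direct("il y a 1 an 3 mois"))  # None: two date fields
print(translate_direct("dans 3 mois 14:00"))   # None: a time follows
```

The last two strings are the kind that a purely direct translation would leave untranslated.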

The obvious solution is to use regex to translate the relative substrings of a date string, translate the other portions separately, and then combine the pieces to form the complete translated date. I have used regex both to translate the relative substrings and to split the date string, so that the remaining parts can be translated with the current approach. Another change is in the translation of numeral digits, for which the Python built-in function int() comes in handy. Instead of using the digit translations stored in Unicode CLDR, it is more convenient and practical to use int() itself, as it can convert most numeral digits. It handles the digits of positional numeral systems, but it cannot be used for numeral systems that are not positional, like Chinese numerals. So the numeral digits in the date string are converted first, before the rest of the date string is translated. I am still working on this and expect to finish by the end of this week, so that dateparser will be able to parse dates without numeral words. After that I will start work on the numeral parser and on dates with numeral words.
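A minimal sketch of this split-and-translate idea, assuming a tiny hand-written French table. RELATIVE, WORDS, normalize_digits and translate are hypothetical names for illustration, not the actual dateparser code, and the sketch only handles relative substrings with a single date field.

```python
import re

# Hypothetical French entries, written by hand for this sketch.
# "{0}" marks the number slot.
RELATIVE = {
    "il y a {0} ans?": "{0} year ago",
    "dans {0} jours?": "in {0} day",
}
# Hypothetical word table standing in for the existing word-by-word translation.
WORDS = {"et": "and", "heures": "hours"}

# One alternation of all relative patterns, wrapped in a capturing group
# so that re.split keeps the matched substrings in its result.
SPLIT_RE = re.compile("(" + "|".join(p.format(r"\d+") for p in RELATIVE) + ")")


def normalize_digits(date_string):
    """Map every Unicode decimal digit to its ASCII form using int()."""
    return re.sub(r"\d", lambda m: str(int(m.group())), date_string)


def translate_relative(chunk):
    """Return the English form of a relative chunk, or None if it is not one."""
    for pattern, template in RELATIVE.items():
        match = re.fullmatch(pattern.format(r"(\d+)"), chunk)
        if match:
            return template.format(match.group(1))
    return None


def translate(date_string):
    """Convert digits, then translate relative chunks and other words separately."""
    parts = []
    for chunk in SPLIT_RE.split(normalize_digits(date_string)):
        relative = translate_relative(chunk)
        if relative is not None:
            parts.append(relative)
        else:
            # Everything else goes through the plain word table.
            parts.extend(WORDS.get(word, word) for word in chunk.split())
    return " ".join(parts)
```

With these tables, translate("dans ३ jours 14:05") gives "in 3 day 14:05": the Devanagari digit is converted first, the relative substring is translated by regex, and the time is passed through untouched. Digits are converted one character at a time so that leading zeros, as in "14:05", survive. A Chinese numeral such as "十" is not a decimal digit, so the regex skips it and it is left unchanged.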
