GSoC 2017 : My Project

I have finally been selected for Google Summer of Code 2017 with the Python Software Foundation. The Python Software Foundation (PSF) serves as an umbrella organisation for various sub-orgs whose projects contribute to the development of the Python language. The sub-org I am going to work with is Scrapinghub.

About Scrapinghub

Scrapinghub, as its name suggests, deals primarily with scraping the web. It is a company concerned with information retrieval and the processing that follows, i.e., both data extraction and data processing after retrieval. Of its various projects in these areas, the one I am going to work on belongs to the second group: dateparser, a Python library that parses dates in various languages and formats.

About dateparser

dateparser is a Python library that parses dates written in various forms and languages into a common format. It currently supports 29 languages and a variety of formats. The translation data for these languages has been contributed by the community and is used to translate dates before parsing them. In essence, dateparser builds a dictionary from the translation data, uses it to detect the language and translate the date, and then tries different parsing methods on the translated date.
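To make that pipeline concrete, here is a toy sketch of the detect, translate, then parse idea. It is not dateparser's actual code: the `TRANSLATIONS` table, the function names, and the list of formats are all made up for illustration, and the real translation data is far larger.

```python
from datetime import datetime

# Hypothetical, heavily trimmed translation data (an assumption for this
# sketch); the real library's data files are much larger.
TRANSLATIONS = {
    "fr": {"janvier": "january", "mars": "march", "août": "august"},
    "es": {"enero": "january", "marzo": "march", "agosto": "august", "de": ""},
}


def detect_language(tokens):
    """Return the language whose dictionary covers the most tokens, or None."""
    best_lang, best_hits = None, 0
    for lang, words in TRANSLATIONS.items():
        hits = sum(1 for token in tokens if token in words)
        if hits > best_hits:
            best_lang, best_hits = lang, hits
    return best_lang


def translate(tokens, lang):
    """Replace known words with English equivalents, dropping filler words."""
    words = TRANSLATIONS.get(lang, {})
    translated = [words.get(token, token) for token in tokens]
    return " ".join(token for token in translated if token)


def parse_date(text):
    """Detect the language, translate to English, then try a few formats."""
    tokens = text.lower().split()
    lang = detect_language(tokens)
    english = translate(tokens, lang)
    for fmt in ("%d %B %Y", "%B %d %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(english, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None


print(parse_date("15 août 2017"))
print(parse_date("3 de marzo de 2016"))
```

The sketch assumes the default C locale, so that `%B` matches English month names. Detection here is just a count of known words per language, which is only a stand-in for whatever the library really does.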

My Project : Integrate Unicode CLDR database with dateparser

dateparser currently supports only a few languages, and supporting many more requires a larger database of translation data. A language is also used differently depending on the territory, so the way dates in that language should be parsed varies by territory too. Complete information for parsing dates must therefore take account of both language and territory, which is what locales provide. A major drawback of dateparser is that it lacks a mechanism for defining and working with locales. This project aims to extend support to all locales in the Unicode Common Locale Data Repository (CLDR), a standard repository of locale-specific data. With Unicode CLDR integrated, dateparser will be able to support more than 500 locales and more than 200 languages.
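A small example of why language alone is not enough: the same numeric date means different things in US and British English. The sketch below is only an illustration of the idea. The `DATE_ORDER` table and `parse_numeric_date` are hypothetical names, and the table is a hand-written stand-in for locale data, not anything read from CLDR.

```python
from datetime import datetime

# Hypothetical locale table (an assumption for this sketch): the format
# used for numeric dates in two English locales.
DATE_ORDER = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}


def parse_numeric_date(text, locale):
    """Parse a numeric date using the day/month order of the given locale."""
    return datetime.strptime(text, DATE_ORDER[locale])


print(parse_numeric_date("03/04/2017", "en-US"))  # read as 4 March 2017
print(parse_numeric_date("03/04/2017", "en-GB"))  # read as 3 April 2017
```

Both locales share one language, so a parser that only knows "English" has no way of choosing between the two readings.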
