Posts

GSoC 2017: Work Summary

Google Summer of Code 2017 is finally coming to an end. It was an amazing experience to be part of this prestigious program, and I learned a lot about open source development in these three months.

Work Summary

I worked on dateparser, a project under Scrapinghub, a sub-org of the Python Software Foundation (PSF), that deals with parsing dates in various languages and formats. The objective of the project was to integrate the translation data of all locales in the Unicode Common Locale Data Repository (CLDR), a standard repository of locale-specific data, with the existing translation data in dateparser. Here is a brief outline of the work done on dateparser during GSoC 2017:

Work Completed

1. Retrieved translation data from Unicode CLDR
Scripts were written to retrieve translation data from the Unicode CLDR GitHub repository. The translation data for dates and numerals were stored separately.

2. Ordered languages by population
The languages were...

Integrating Unicode CLDR: Translation with Integrated Data

I have finally implemented translation of date strings with the new integrated data. I worked on it for much longer than I expected: I had hoped to finish translating date strings without numeral words much earlier and to get started on the numeral parser as soon as possible. It took longer because I did not think the problem through clearly, started with a complicated solution and only later moved to simpler ones. As I mentioned in my last blog post, the problem was to translate relative date strings, which are of two types: those that have no digits, like 'yesterday', and those that have digits and are stored as regex patterns, like '(\\d+) day ago'. Currently dateparser uses translations for 'ago' and 'in' along with other words, and relative dates are translated in the same way as other date strings. First dates are split by the numbers within them and then by known wo...
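The two kinds of relative date entries described above can be sketched roughly as follows. This is only an illustration, not dateparser's actual implementation: the French entries and the names SIMPLE_TRANSLATIONS, PATTERN_TRANSLATIONS and translate_relative are assumptions made up for the example.

```python
import re

# Hypothetical French translation data, invented for this illustration.
# Entries without digits map directly to an English phrase.
SIMPLE_TRANSLATIONS = {
    "hier": "yesterday",
    "demain": "tomorrow",
}

# Entries with digits are stored as regex patterns; the captured number
# is carried over into the English replacement.
PATTERN_TRANSLATIONS = {
    r"il y a (\d+) jours?": r"\1 day ago",
    r"dans (\d+) jours?": r"in \1 day",
}


def translate_relative(date_string):
    """Return the English form of a relative date string, or None."""
    text = date_string.strip().lower()
    # First type: no digits, plain dictionary lookup.
    if text in SIMPLE_TRANSLATIONS:
        return SIMPLE_TRANSLATIONS[text]
    # Second type: digits present, try each stored regex pattern.
    for pattern, replacement in PATTERN_TRANSLATIONS.items():
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(replacement)
    return None
```

Here translate_relative("il y a 3 jours") gives "3 day ago", a normalized form in the style of '(\\d+) day ago' that a later parsing step could consume.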

Integrating Unicode CLDR: Parsing with Retrieved Data

As I mentioned in the last blog post, I have been modifying the codebase so that dateparser can use the new data to parse date strings. Most of the new data for date translation is similar to the data previously used for translation; the only major difference is in dealing with relative-type date strings. The previous data used translations for 'ago' and 'in' to translate relative-type date strings like "10 years ago", "15 hours 10 minutes ago" and "in 10 minutes and 8 seconds". The data retrieved from Unicode CLDR, however, contains translations for complete date strings, so that they translate directly to formats like "** year ago" or "in ** month", which can then be used for parsing. Initially I thought direct translations like these would work, but then I realised that many relative dates are not covered by them. These include dates that contain more than one date field, like "1 year 3 months ago", and dates that includ...
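The coverage gap described above can be seen in a small sketch. It assumes, hypothetically, that each CLDR-style entry has to match the whole date string; the patterns and helper names below are invented for illustration (in English, for simplicity) and are not taken from CLDR or dateparser.

```python
import re

# Hypothetical whole-string patterns in the style of "** year ago".
WHOLE_STRING_PATTERNS = {
    r"(\d+) years? ago": "year",
    r"(\d+) months? ago": "month",
    r"in (\d+) months?": "month",
}


def match_whole_string(date_string):
    """Return (amount, unit) if a single pattern covers the whole string."""
    for pattern, unit in WHOLE_STRING_PATTERNS.items():
        match = re.fullmatch(pattern, date_string)
        if match:
            return int(match.group(1)), unit
    return None


def split_fields(date_string):
    """Word-level alternative: collect every '<number> <unit>' pair."""
    pairs = re.findall(r"(\d+) (year|month)s?", date_string)
    return [(int(amount), unit) for amount, unit in pairs]
```

A single-field string such as "10 years ago" is covered by one whole-string pattern, while "1 year 3 months ago" matches none of them. Handling such strings needs the words to be treated field by field, as split_fields does.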

Integrating Unicode CLDR: Adding Locale Support

The translation data is now ready to be used for translating date strings. It combines data from Unicode CLDR with supplementary data contributed by many individuals. Initially the data from Unicode CLDR was stored as JSON files and the supplementary data as YAML files; the two sources were to be loaded separately and then combined before being used to translate dates. After discussion with my mentor, we agreed on a better approach that he suggested: storing the data directly in Python modules. This has the advantage that importing data from Python modules is faster than loading it from JSON and YAML files. And since we no longer have to combine the CLDR and supplementary data at run time, storing data in Python modules proves to be much more efficient. Now that the data is ready, it is time to make the necessary modifications in the codebase to support l...
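The idea of combining the two sources once and storing the result as a Python module can be sketched as below. This is a rough illustration under stated assumptions, not the project's actual script: the sample data, the module name fr.py, the variable name info and the helper functions are all hypothetical, and every value is assumed to be a list of strings.

```python
import importlib.util
import pprint
import tempfile
from pathlib import Path

# Hypothetical data for one language, invented for this illustration.
cldr_data = {"january": ["janvier", "janv."]}
supplementary_data = {"january": ["jan"], "ago": ["il y a"]}


def combine(cldr, supplementary):
    """Merge the two sources once, ahead of time, CLDR entries first."""
    combined = {key: list(values) for key, values in cldr.items()}
    for key, values in supplementary.items():
        merged = combined.setdefault(key, [])
        for value in values:
            if value not in merged:
                merged.append(value)
    return combined


def write_module(data, path):
    """Write the combined data as a Python module defining `info`."""
    path.write_text("info = " + pprint.pformat(data) + "\n", encoding="utf-8")


def load_module(path):
    """Import the generated module from its file path."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "fr.py"
    write_module(combine(cldr_data, supplementary_data), module_path)
    # At run time only an import is needed; no JSON/YAML parsing or merging.
    loaded = load_module(module_path).info
```

The merge happens when the module is generated, so code that later imports the module gets the already combined dictionary.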

Integrate Unicode CLDR: Working with Retrieved Data

I have been working on scripts and modifying data for quite some time, and it seems the data is ready to be used for parsing dates. The data retrieved from Unicode CLDR has been divided into two parts: numeral_translation_data, which will be used specifically to parse numerals, and date_translation_data, which will be used to parse date strings after any numerals in them have been parsed. The existing data that was contributed by various individuals to dateparser has been reduced so that it supplements the data retrieved from CLDR, i.e., only the portion that is not included in the CLDR data remains as supplementary data, which contributors will continue to modify in future. The data for date translation consists of translations for months, weekdays and periods, the date order, and translations for relative-type dates. This data is stored for different languages and for each language, locale-specific data is stored for ...
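Reducing the existing data to a supplement can be pictured with a small sketch. It assumes, hypothetically, that both sources map a key to a list of translations; the sample data and the function name supplementary_only are made up for the example and do not reflect the real data layout.

```python
# Hypothetical translation data, invented for this illustration.
existing_data = {"january": ["janvier", "janv", "jan"], "ago": ["il y a"]}
cldr_data = {"january": ["janvier", "janv."]}


def supplementary_only(existing, cldr):
    """Keep only the translations in `existing` that `cldr` lacks."""
    supplementary = {}
    for key, values in existing.items():
        known = set(cldr.get(key, []))
        missing = [value for value in values if value not in known]
        if missing:
            supplementary[key] = missing
    return supplementary


supplementary_data = supplementary_only(existing_data, cldr_data)
```

With this sample data, "janvier" is dropped because CLDR already provides it, while "janv", "jan" and the whole "ago" entry remain as supplementary data.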

Integrating Unicode CLDR: Initial Phase

It has been some time since I last wrote, and I have been working on my project. As I was free during the GSoC bonding period, I started working on my project early to get a head start. I began with scripts to retrieve translation data from Unicode CLDR. That worked out well: by the time the GSoC period started, I had already written a major part of the scripts and had retrieved most of the data required for parsing. At the same time I started to write tests for dateparser, and test coverage has increased a fair amount. After the GSoC period started I made further changes to the scripts to resolve some issues, such as correcting the date order of languages and storing data in order. Now that I have added numeral data as well, the retrieved data is complete, with all the components required to translate dates as set out in my proposal. The challenge now is to use this data to translate dates effectively and efficiently. Currently dateparser uses a dictionary based metho...

GSoC 2017: My Project

I have finally been selected for Google Summer of Code 2017 at the Python Software Foundation. The Python Software Foundation (PSF) serves as an umbrella organisation comprising various sub-orgs whose projects contribute to the development of the Python language. The sub-org I am going to work with is Scrapinghub.

About Scrapinghub

Scrapinghub, as its name suggests, deals primarily with scraping the web. It is a company concerned with information retrieval and the later manipulation of that information, i.e., it deals with both data extraction and data processing after retrieval. It has various projects on these topics, and the one I am going to work on deals with data processing after retrieval: dateparser, a Python library that primarily deals with parsing dates in various languages and formats.

About dateparser

dateparser is a Python library that is used to parse various forms of dates in different languages to a common for...