indic-trans
===========

|travis| |coverage| |CircleCI| |Documentation Status|

----

The project aims to add a state-of-the-art transliteration module for cross-transliteration among all Indian languages, including English and Urdu.

The module currently supports the following languages:

  * Hindi
  * Bengali
  * Gujarati
  * Punjabi
  * Malayalam
  * Kannada
  * Tamil
  * Telugu
  * Oriya
  * Marathi
  * Assamese
  * Konkani
  * Bodo
  * Nepali
  * Urdu
  * English

Links & References
------------------

* `Official source code repo <https://github.com/libindic/indic-trans>`_
* `HTML documentation <http://indic-trans.readthedocs.org>`_
* `Transliteration Blog <http://irshadbhat.github.io/gsoc>`_
* Mailing list: silpa-discuss@nongnu.org
* IRC channel: ``#silpa`` at ``irc.freenode.net``

Installation
------------

Dependencies
^^^^^^^^^^^^

`indictrans`_ requires `cython`_ and `SciPy`_.

.. _`indictrans`: https://github.com/libindic/indic-trans

.. _`cython`: http://docs.cython.org/src/quickstart/install.html

.. _`SciPy`: http://www.scipy.org/install.html

Clone & Install
^^^^^^^^^^^^^^^

::

    Clone the repository:
        git clone https://github.com/libindic/indic-trans.git
        ------------------------OR--------------------------
        git clone https://github.com/irshadbhat/indic-trans.git

    Change to the cloned directory and install:
        cd indic-trans
        pip install -r requirements.txt
        pip install .

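After installation, a quick import check can confirm that the dependencies named above can be located. This is a minimal sketch and not part of the project: the helper ``missing_modules`` is hypothetical, and ``Cython``, ``scipy`` and ``indictrans`` are assumed to be the import names the packages install under.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the dependencies and for the package itself.
    required = ["Cython", "scipy", "indictrans"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules; it does not import them or verify their versions.
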
Examples
--------

1. From Console:
^^^^^^^^^^^^^^^^

.. parsed-literal::

    indictrans --help

    -h, --help          show this help message and exit
    -v, --version       show program's version number and exit
    -s, --source        select language (3 letter ISO-639 code) {hin, guj, pan,
                        ben, mal, kan, tam, tel, ori, eng, mar, nep, bod, kok,
                        asm, urd}
    -t, --target        select language (3 letter ISO-639 code) {hin, guj, pan,
                        ben, mal, kan, tam, tel, ori, eng, mar, nep, bod, kok,
                        asm, urd}
    -b, --build-lookup  build lookup to fasten transliteration
    -m, --ml            use ML system for transliteration
    -r, --rb            use rule-based system for transliteration
    -i, --input         <input-file>
    -o, --output        <output-file>


    Example:

        indictrans < hindi.txt --source hin --target eng --build-lookup > hindi-rom.txt
        indictrans < roman.txt --source eng --target hin --build-lookup > roman-hin.txt

If the input text contains repeated words, as raw text generally does, make sure to set ``build_lookup``. As the name indicates, this builds a lookup table of transliterated words and so avoids transliterating the same word more than once. This saves a lot of time when the input corpus is large.

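The idea behind such a lookup can be sketched in a few lines. This is an illustration of the caching technique only, not the library's internal code: ``transliterate_with_lookup`` and ``fake_transliterate`` are hypothetical names, and the stand-in function simply upper-cases its input so that the sketch runs without ``indictrans``.

```python
def transliterate_with_lookup(words, transliterate_word):
    """Transliterate each word, calling `transliterate_word` once per distinct word."""
    lookup = {}  # source word -> transliterated word
    output = []
    for word in words:
        if word not in lookup:
            # First occurrence: do the (expensive) work and remember the result.
            lookup[word] = transliterate_word(word)
        output.append(lookup[word])
    return output


# Stand-in for an expensive transliteration call; it records every invocation.
calls = []


def fake_transliterate(word):
    calls.append(word)
    return word.upper()


result = transliterate_with_lookup("ke bich ke par ke".split(), fake_transliterate)
print(result)  # ['KE', 'BICH', 'KE', 'PAR', 'KE']
print(calls)   # ['ke', 'bich', 'par']
```

Five input words trigger only three calls, because ``ke`` is served from the lookup after its first occurrence.
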
Note that ``ml`` and ``rb`` are mutually exclusive arguments. If neither is set, the system defaults to ``rb``.

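The behaviour described in this note can be reproduced with the standard ``argparse`` module. The following is a sketch under stated assumptions, not the project's actual argument parser: the program name ``indictrans-sketch`` and the helpers ``build_parser`` and ``selected_system`` are hypothetical, and the language codes are copied from the help text above.

```python
import argparse

# Language codes as listed in the help output above.
LANGS = ["hin", "guj", "pan", "ben", "mal", "kan", "tam", "tel",
         "ori", "eng", "mar", "nep", "bod", "kok", "asm", "urd"]


def build_parser():
    """Build a parser in which --ml and --rb cannot be combined."""
    parser = argparse.ArgumentParser(prog="indictrans-sketch")
    # Both languages are required here to avoid assuming any defaults.
    parser.add_argument("-s", "--source", choices=LANGS, required=True)
    parser.add_argument("-t", "--target", choices=LANGS, required=True)
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-m", "--ml", action="store_true",
                       help="use ML system for transliteration")
    group.add_argument("-r", "--rb", action="store_true",
                       help="use rule-based system for transliteration")
    return parser


def selected_system(args):
    """Return 'ml' only when --ml was given; otherwise fall back to 'rb'."""
    return "ml" if args.ml else "rb"


args = build_parser().parse_args(["--source", "hin", "--target", "eng"])
print(selected_system(args))  # rb
```

Passing both ``--ml`` and ``--rb`` to this parser makes ``argparse`` exit with a usage error, which is how it enforces mutual exclusion.
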
2. Using Python:
^^^^^^^^^^^^^^^^

.. code:: python

    >>> from indictrans import Transliterator
    >>> trn = Transliterator(source='hin', target='eng', build_lookup=True)
    >>>
    >>> hin = """कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री
    ... जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक समानता
    ... है. ये सभी अलग-अलग कारणों से भारतीय जनता पार्टी के राज्यसभा सांसद
    ... सुब्रमण्यम स्वामी के निशाने पर हैं. उनके जयललिता और सोनिया गांधी के
    ... पीछे पड़ने का कारण कथित भ्रष्टाचार है."""
    >>>
    >>> eng = trn.transform(hin)
    >>> print(eng)
    congress party adhyaksh sonia gandhi, tamilnadu kii mukhyamantri
    jayalalita or reserve bank ke governor raghuram rajan ke bich ek samanta
    he. ye sabhi alag-alag kaarnon se bhartiya janata party ke rajyasabha saansad
    subramanyam swami ke nishane par hai. unke jayalalita or sonia gandhi ke
    peeche padane kaa kaaran kathith bhrashtachar he.
    >>>
    >>> trn = Transliterator(source='eng', target='hin')
    >>>
    >>> hin_ = trn.transform(eng)
    >>>
    >>> print(hin_)
    कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री
    जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक समनता
    है. ये सभी अलग-अलग कारनों से भारतीय जनता पार्टी के राज्यसभा सांसद
    सुब्रमण्यम स्वामी के निशाने पर हैं. उनके जयललिता और सोनिया गांधी के
    पीछे पड़ने का कारण कथित भ्रष्टाचार है.
    >>>

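For longer inputs it can be convenient to transliterate line by line. The helper below is a sketch that relies only on the ``transform`` method shown above; ``transliterate_lines`` and ``UpperCaseStub`` are hypothetical names, and the stub stands in for a real ``Transliterator`` so that the example runs without the library installed.

```python
import io


def transliterate_lines(transliterator, lines):
    """Yield the transliteration of each line, leaving blank lines untouched."""
    for line in lines:
        text = line.rstrip("\n")
        # Skip the transliterator for empty or whitespace-only lines.
        yield transliterator.transform(text) if text.strip() else text


class UpperCaseStub:
    """Stand-in exposing the same `transform` method as a Transliterator."""

    def transform(self, text):
        return text.upper()


source = io.StringIO("pehli pankti\n\ndusri pankti\n")
for out in transliterate_lines(UpperCaseStub(), source):
    print(out)
```

With the library installed, an object such as ``Transliterator(source='hin', target='eng', build_lookup=True)`` would presumably take the place of the stub, and ``source`` could be any open text file.
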
3. K-Best Transliterations
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    >>> from indictrans import Transliterator
    >>> r2i = Transliterator(source='eng', target='mal', decode='beamsearch')
    >>> words = '''sereleskar morocco calendar bhagyalakshmi bhoolokanathan
    ...         medical ernakulam kilometer vitamin management university
    ...         naukuchiatal'''.split()
    >>> for word in words:
    ...     print('%s -> %s' % (word,
    ...                         '  '.join(r2i.transform(word, k_best=5))))
    ...
    sereleskar -> സേറെലേസ്കാര്  സെറെലേസ്കാര്  സേറെലേസ്കാര  സെറെലേസ്കാര  സേറെലേസ്കര്
    morocco -> മൊറോക്കോ  മൊറോക്ഡോ  മൊരോക്കോ  മോറോക്കോ  മൊറോക്കൂ
    calendar -> കേലെന്ദര  കേലെന്ഡര  കേലെന്ദ്ര  കേലെന്ദാര  കേലെന്ഡ്ര
    bhagyalakshmi -> ഭാഗ്യലക്ഷ്മീ  ഭാഗ്യലക്ഷ്മി  ഭഗ്യലക്ഷ്മീ  ഭാഗ്യാലക്ഷ്മീ  ഭഗ്യലക്ഷ്മി
    bhoolokanathan -> ഭൂലോകനാഥന  ഭൂലോകാനാഥന  ഭൂലോക്കനാഥന  ബൂലോകനാഥന  ഭൂലോകനാതന
    medical -> മെഡിക്കല്  മെഡിക്കലും  മെഡിക്കില്  മ്മഎഡിക്കല്  മേഡിക്കല്
    ernakulam -> എറണാകുളം  ഈറണാകുളം  എറണാകുലം  എറണാകുളഅം  എറണാകുളാം
    kilometer -> കിലോമീറ്റര്  കിലോഈറ്റര്  കിലോമീറ്റ്ര്  കിലോമീറ്ററ്  കിലോമീടര്
    vitamin -> വിറ്റാമിന്  വിറ്റമിന്  വൈറ്റാമിന്  വിതാമിന്  വിതആമിന്
    management -> മാനേജ്മെന്റ്  മാനേജ്ഞ്മെന്റ്  മാനേഗ്മെന്റ്  മാംനേജ്മെന്റ്  മാനേജ്മെതുറ്
    university -> യൂണിവേഴ്സിറ്റി  യൂണിവേര്സിറ്റി  യുണിവേഴ്സിറ്റി  യൂനിവേഴ്സിറ്റി  യൂണിവേഴ്സിറ്റീ
    naukuchiatal -> നകുചിയാറ്റാള്  നകുചിയാറ്റാല്  നകുചിയാറ്റാല  നകുചിയാറ്റള്  നകുചിയറ്റാള്

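One possible use of a k-best list is to prefer a candidate that appears in a known vocabulary. The sketch below is an illustration only: ``pick_candidate`` is a hypothetical helper, the placeholder strings stand in for real transliterations, and the candidates are assumed to arrive ordered from best to worst.

```python
def pick_candidate(candidates, vocabulary):
    """Return the first candidate found in `vocabulary`, else the top-ranked one."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    for candidate in candidates:
        if candidate in vocabulary:
            return candidate
    # No candidate is attested; fall back to the highest-ranked suggestion.
    return candidates[0]


# Placeholder strings stand in for the output of transform(word, k_best=...).
candidates = ["cand_a", "cand_b", "cand_c"]
print(pick_candidate(candidates, {"cand_b", "other"}))  # cand_b
print(pick_candidate(candidates, set()))                # cand_a
```

Using a ``set`` for the vocabulary keeps each membership check constant time on average, which matters when many words are filtered this way.
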
Cite
----

If you use this code for a publication, please cite the following paper::

    @inproceedings{Bhat:2014:ISS:2824864.2824872,
     author = {Bhat, Irshad Ahmad and Mujadia, Vandan and Tammewar, Aniruddha and Bhat, Riyaz Ahmad and Shrivastava, Manish},
     title = {IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search},
     booktitle = {Proceedings of the Forum for Information Retrieval Evaluation},
     series = {FIRE '14},
     year = {2015},
     isbn = {978-1-4503-3755-7},
     location = {Bangalore, India},
     pages = {48--53},
     numpages = {6},
     url = {http://doi.acm.org/10.1145/2824864.2824872},
     doi = {10.1145/2824864.2824872},
     acmid = {2824872},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {Information Retrieval, Language Identification, Language Modeling, Perplexity, Transliteration},
    }

----

|travis| |coverage| |CircleCI| |Documentation Status|

.. |travis| image:: https://travis-ci.org/libindic/indic-trans.svg?branch=master
   :target: https://travis-ci.org/libindic/indic-trans
   :alt: travis-ci build status

.. |coverage| image:: https://coveralls.io/repos/github/libindic/indic-trans/badge.svg?branch=master
   :target: https://coveralls.io/github/libindic/indic-trans?branch=master
   :alt: coveralls.io coverage status

.. |CircleCI| image:: https://circleci.com/gh/libindic/indic-trans.svg?style=svg
    :target: https://circleci.com/gh/libindic/indic-trans

.. |Documentation Status| image:: https://readthedocs.org/projects/indic-trans/badge/?version=latest
    :target: http://indic-trans.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status