BytePairEncoding

Julia implementation of Byte Pair Encoding for NLP

BytePairEncoding.jl

A pure Julia implementation of the Byte Pair Encoding (BPE) method from the subword neural machine translation paper. It is a port of the original Python package subword-nmt. BytePairEncoding.jl supports different tokenization methods (with the help of WordTokenizers.jl). Call set_tokenizer([your tokenize function]) and then learn the BPE map with that tokenizer.
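For readers new to the method, the following is a minimal, self-contained sketch of how BPE merges can be learned from a word list. It is not the package's implementation: every function name here (word_symbols, pair_counts, most_frequent, merge_pair, learn_bpe) is hypothetical, the lexicographic tie-break is an assumption made only to keep the result deterministic, and the num_sym and min_freq keywords merely mirror the option names described in the API section.

```julia
# Illustrative sketch only; these helpers are hypothetical, not the package's API.
const ENDSYM = "</w>"

# Split a (non-empty) word into one symbol per character and mark the last one.
function word_symbols(word::AbstractString)
    syms = [string(c) for c in word]
    syms[end] *= ENDSYM
    return syms
end

# Count adjacent symbol pairs, weighted by how often each word occurs.
function pair_counts(vocab)
    counts = Dict{Tuple{String,String},Int}()
    for (syms, freq) in vocab
        for i in 1:length(syms)-1
            pair = (syms[i], syms[i+1])
            counts[pair] = get(counts, pair, 0) + freq
        end
    end
    return counts
end

# Pick the most frequent pair; ties go to the lexicographically smaller pair
# (an assumption, so that the result does not depend on Dict ordering).
function most_frequent(counts)
    best, bestfreq = nothing, 0
    for (pair, f) in counts
        if f > bestfreq || (f == bestfreq && isless(pair, best))
            best, bestfreq = pair, f
        end
    end
    return best, bestfreq
end

# Replace every occurrence of `pair` in `syms` with the merged symbol.
function merge_pair(syms::Vector{String}, pair::Tuple{String,String})
    out = String[]
    i = 1
    while i <= length(syms)
        if i < length(syms) && syms[i] == pair[1] && syms[i+1] == pair[2]
            push!(out, pair[1] * pair[2])
            i += 2
        else
            push!(out, syms[i])
            i += 1
        end
    end
    return out
end

# Learn at most `num_sym` merges; stop once the best pair is rarer than `min_freq`.
function learn_bpe(words::Vector{String}; num_sym::Int=10, min_freq::Int=2)
    vocab = Dict{Vector{String},Int}()
    for w in words
        syms = word_symbols(w)
        vocab[syms] = get(vocab, syms, 0) + 1
    end
    merges = Tuple{String,String}[]
    for _ in 1:num_sym
        counts = pair_counts(vocab)
        isempty(counts) && break
        best, freq = most_frequent(counts)
        freq < min_freq && break
        push!(merges, best)
        newvocab = Dict{Vector{String},Int}()
        for (syms, f) in vocab
            merged = merge_pair(syms, best)
            newvocab[merged] = get(newvocab, merged, 0) + f
        end
        vocab = newvocab
    end
    return merges
end

merges = learn_bpe(["low", "low", "low", "lower", "lowest"]; num_sym=3)
```

On this toy input the first learned merge is ("l", "o"), because that pair occurs in all five words.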

Table of Contents

  1. BytePairEncoding.jl
  2. API
    1. Unicode Normalization
  3. Examples
  4. Roadmap

API

  • BPELearner([vocabulary files]; num_sym, min_freq, endsym, normalizer)
    • holds the learning configuration.
      • num_sym: how many pairs to generate.
      • min_freq: frequency threshold for learned pairs.
      • endsym: the symbol that separates internal pairs from the last pair; if set, set_endsym(endsym) is invoked automatically.
      • normalizer: normalizer type; the default is identity (no normalization). See the Unicode Normalization section for how to define one.
    • add!(::BPELearner, newfile)
      • add a new file to the learner.
    • learn!(::BPELearner)
      • learn the BPE map.
    • emit(::BPELearner, output_filename)
      • generate the BPE map file.
  • Bpe(bpefile; glossaries, merge, sepsym, endsym, normalizer)
    • holds the BPE encoding configuration.
      • merge: how many pairs to load.
      • sepsym: separator symbol for internal pairs, default is "".
      • endsym: end symbol of the last pair, default is "</w>".
      • have_header: whether to skip the first line, default is true.
      • glossaries: a list of glossaries; both Regex and String are supported.
      • normalizer: normalizer type; the default is identity (no normalization). See the Unicode Normalization section for how to define one.
    • process_line(::Bpe, line): segment a given line, then join the segments into a new line; leading and trailing whitespace is preserved.
    • segment(::Bpe, line): segment a line into a list of segments.
    • segment_token(::Bpe, token::String): segment a given token or a list of tokens.
  • set_endsym(::String): set the end symbol, default is "</w>".
  • set_tokenizer(func): set the tokenizer function, default is nltk_word_tokenize.
  • whitespace_tokenize(str): simply the split(str) function, for use with set_tokenizer.
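To make the roles of the end symbol, the merge list, and whitespace tokenization more concrete, here is a rough sketch of how a learned merge list might be applied to segment tokens and lines. It does not call the package: segment_word and segment_line are hypothetical names, the merge list is a made-up toy example, and the package's own segment and process_line may well differ in detail (for instance in how glossaries, sepsym, and surrounding whitespace are handled).

```julia
# Illustrative sketch only; not the package's implementation.
const ENDSYM = "</w>"

# Hypothetical toy merge list, in the order the merges were learned.
merges = [("l", "o"), ("lo", "w</w>"), ("lo", "w")]

# Segment one (non-empty) token by repeatedly applying the earliest-learned
# merge that matches any adjacent pair of symbols.
function segment_word(word::AbstractString, merges)
    rank = Dict(pair => i for (i, pair) in enumerate(merges))
    syms = [string(c) for c in word]
    syms[end] *= ENDSYM                      # mark the last symbol of the word
    while length(syms) > 1
        best_i, best_rank = 0, typemax(Int)
        for i in 1:length(syms)-1
            r = get(rank, (syms[i], syms[i+1]), typemax(Int))
            if r < best_rank
                best_i, best_rank = i, r
            end
        end
        best_i == 0 && break                 # no known pair left to merge
        syms = vcat(syms[1:best_i-1],
                    [syms[best_i] * syms[best_i+1]],
                    syms[best_i+2:end])
    end
    return syms
end

# Tokenize on whitespace with split(line), segment each token, and join the
# segments with single spaces.
function segment_line(line::AbstractString, merges)
    pieces = reduce(vcat, [segment_word(w, merges) for w in split(line)]; init=String[])
    return join(pieces, " ")
end

segment_word("lower", merges)        # ["low", "e", "r</w>"]
segment_line("low lower", merges)    # "low</w> low e r</w>"
```

Unlike process_line as described above, this sketch collapses all whitespace to single spaces.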

Unicode Normalization

Unicode normalization is supported through the following type and function:

  • UtfNormalizer
    • wrapper type around Julia's built-in Unicode normalization function
      • UtfNormalizer(::Symbol): supports :NFC, :NFD, :NFKC, :NFKD, :NFKC_CF
      • UtfNormalizer([option_names=all_default_false]): options (stable, compat, compose, decompose, stripignore, rejectna, newline2ls, newline2ps, newline2lf, stripcc, casefold, lump, stripmark); usage example: UtfNormalizer(stable=true, compose=true)
    • normalize(::AbstractNormalizer, ::String): normalize the given string with the specified normalizer.
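Since UtfNormalizer is described as a wrapper around Julia's built-in normalization, it may help to see what the underlying standard-library call does. The sketch below uses Unicode.normalize from the Unicode standard library directly rather than UtfNormalizer; note that in the standard library the compose option defaults to true, whereas the list above states that all UtfNormalizer options default to false.

```julia
using Unicode

# Normal forms, selected by symbol.
nfkc = Unicode.normalize("\ufb01", :NFKC)     # the "fi" ligature becomes plain "fi"
nfc  = Unicode.normalize("e\u0301", :NFC)     # "e" + combining acute becomes "\u00e9"
nfd  = Unicode.normalize("\u00e9", :NFD)      # and back to two code points

# Individual options, selected by keyword.
folded   = Unicode.normalize("JuLiA", casefold=true)         # case folding
stripped = Unicode.normalize("J\u00faLiA", stripmark=true)   # remove diacritical marks
```

Applying the same normalizer when learning the BPE map and when segmenting keeps visually identical strings from being treated as different symbol sequences.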

Examples

                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.0.2 (2018-11-08)
     _/ |\__'_|_|_|\__'_|  |
    |__/                   |
    
    julia> using BytePairEncoding
    
    julia> using WordTokenizers
    
    julia> set_tokenizer(nltk_word_tokenize)
    tokenize (generic function with 1 method)
    
    julia> norm = UtfNormalizer(:NFKC)
    UtfNormalizer(14)
    
    julia> vocabfiles = ["./data/.....", "./another/data/....." ...]
    
    julia> bper = BPELearner(vocabfiles, 1000; normalizer=norm)
    BPELearner(num_sym=1000, min_freq=2, endsym="</w>", normailzer=UtfNormalizer)
    
    julia> learn!(bper)
    
    julia> emit(bper, "./bpe.out")
    "./bpe.out"
    
    julia> bpe = Bpe("./bpe.out"; normalizer=norm)
    Bpe(merge=-1, sepsym="", endsym="</w>", num_glossaries=0, normalizer=UtfNormalizer)
    
    julia> sample_sent =  "It's interesting that technology often works as a servant for us, yet frequently we become a
     servant to it. E-mail is a useful tool but many feel controlled by this new tool. The average business person is g
    etting about 80e-mails per day and many feel that about 80% of the messages in their \"Inbox\" are of little or no
           value. So, I have four suggestions to help you to become better at \"Erasing E-mail\".\n1.Get off the lists.
     The best way to deal with a problem is to never have it. If you are receiving a lot of unwanted e-mails, ask to be
     removed from the various lists. This would include your inclusion in unwanted lists.\n2.\"Unlisted address\". Just
     as you keep an \"unlisted\" telephone number that you share only with those whom you want to have direct access to
    , you might want to get a separate e-mail address only for the important communications you wish to receive.\n"
    "It's interesting that technology often works as a servant for us, yet frequently we become a servant to it. E-mail
     is a useful tool but many feel controlled by this new tool. The average business person is getting about 80e-mails
     per day and many feel that about 80% of the messages in their \"Inbox\" are of little or no\nvalue. So, I have fou
    r suggestions to help you to become better at \"Erasing E-mail\".\n1.Get off the lists. The best way to deal with a
     problem is to never have it. If you are receiving a lot of unwanted e-mails, ask to be removed from the various li
    sts. This would include your inclusion in unwanted lists.\n2.\"Unlisted address\". Just as you keep an \"unlisted\"
     telephone number that you share only with those whom you want to have direct access to, you might want to get a se
    parate e-mail address only for the important communications you wish to receive.\n"
    
    julia> first(split_sentences(sample_sent))
    "It's interesting that technology often works as a servant for us, yet frequently we become a servant to it."
    
    julia> segment(bpe, ans)
    42-element Array{String,1}:
     "I"        
     "t</w>"    
     "'"        
     "s</w>"    
     "inter"    
     "est"      
     "ing</w>"  
     "that</w>" 
     "t"        
     "ec"       
     "h"        
     "no"       
     "lo"       
     "g"        
     "y</w>"    
     "of"       
     "ten</w>"  
     "works</w>"
     "as</w>"   
     "a</w>"    
     ⋮          
     "us</w>"   
     ",</w>"    
     "y"        
     "et</w>"   
     "f"        
     "re"       
     "qu"       
     "ent"      
     "ly</w>"   
     "we</w>"   
     "b"        
     "ecom"     
     "e</w>"    
     "a</w>"    
     "serv"     
     "ant</w>"  
     "to</w>"   
     "it</w>"   
     ".</w>"    
    
    julia> for sentence ∈ split_sentences(sample_sent)
               println(process_line(bpe, sentence))
           end
    I t</w> ' s</w> inter est ing</w> that</w> t ec h no lo g y</w> of ten</w> works</w> as</w> a</w> serv ant</w> for<
    /w> us</w> ,</w> y et</w> f re qu ent ly</w> we</w> b ecom e</w> a</w> serv ant</w> to</w> it</w> .</w>
    E - ma il</w> is</w> a</w> us e ful</w> tool</w> but</w> many</w> fe el</w> cont ro l led</w> by</w> this</w> new</
    w> tool</w> .</w>
    T he</w> a ver age</w> b us in ess</w> pers on</w> is</w> g et ting</w> about</w> 8 0 e - ma il s</w> p er</w> day<
    /w> and</w> many</w> fe el</w> that</w> about</w> 8 0</w> %</w> of</w> the</w> m es sa ges</w> in</w> their</w> ` `
    </w> In bo x</w> ' '</w> are</w> of</w> l it t le</w> or</w> no</w>
    value</w> .</w>
    S o</w> ,</w> I</w> have</w> f our</w> su g g es tions</w> to</w> help</w> you</w> to</w> b ecom e</w> bet ter</w>
    at</w> ` `</w> E r as ing</w> E - ma il</w> ' '</w> .</w>
    1 . G et</w> of f</w> the</w> li sts</w> .</w>
    T he</w> b est</w> way</w> to</w> de al</w> with</w> a</w> pro bl em</w> is</w> to</w> n ever</w> have</w> it</w> .
    </w>
    I f</w> you</w> are</w> recei ving</w> a</w> l ot</w> of</w> un w an ted</w> e - ma il s</w> ,</w> as k</w> to</w>
    be</w> re mo ved</w> from</w> the</w> vari ous</w> li sts</w> .</w>
    T his</w> would</w> incl u de</w> your</w> incl us i on</w> in</w> un w an ted</w> li sts</w> .</w>
    2 .</w> ' '</w> U n li sted</w> ad d ress</w> ' '</w> .</w>
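    J ust</w> as</w> you</w> ke ep</w> an</w> ` `</w> un li sted</w> ' '</w> t el e ph one</w> numb er</w> that</w> you
    </w> sh are</w> only</w> with</w> those</w> who m</w> you</w> want</w> to</w> have</w> di rec t</w> acc ess</w> to<
    /w> ,</w> you</w> might</w> want</w> to</w> get</w> a</w> se par ate</w> e - ma il</w> ad d ress</w> only</w> for</
    w> the</w> im por t ant</w> comm un ic ations</w> you</w> w ish</w> to</w> receive</w> .</w>

    julia> 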

Roadmap

  • add more interfaces and functions
  • add pre-learned BPE maps
  • support different BPE formats
  • support custom normalization
  • support Google SentencePiece
  • maybe add to Embeddings.jl with bpemb: pre-trained BPE embeddings

First Commit

12/03/2018

Commits

36 commits