Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines

168 points by julien_c 5 years ago | 42 comments
  • julien_c 5 years ago
    TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

    Main features:
    - Encode 1GB in 20sec
    - Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
    - Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
    - Written in Rust with bindings for Python and node.js

    Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...

    To install:
    - Rust: https://crates.io/crates/tokenizers
    - Python: pip install tokenizers
    - Node: npm install tokenizers
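
    Here's a rough sketch of typical usage from Python (a hedged example: the exact API can differ between versions, and the vocab file name is just a placeholder):

        # Load a WordPiece tokenizer from a local vocab file (e.g. a BERT vocab)
        # and encode a sentence into ids / tokens / character offsets.
        from tokenizers import BertWordPieceTokenizer

        tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
        encoding = tokenizer.encode("Hello, y'all! How are you?")
        print(encoding.ids)      # integer ids fed to the model
        print(encoding.tokens)   # wordpiece strings, e.g. ['[CLS]', 'hello', ',', ...]
        print(encoding.offsets)  # (start, end) character offsets back into the raw text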

    • mark_l_watson 5 years ago
      I love the work done and made freely available by both spaCy and HuggingFace.

      I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex (and not so understandable) ones I sold as products and used to pull in lots of consulting work.

      I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang), a Lisp that sits on top of Python) for spaCy, Hugging Face, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.

      • dunefox 5 years ago
        Hybrid symbolic and NN will be my next area of hobby research; I'm currently getting my master's degree in NLP. Do you have a few good resources to get started with or read about?
        • dzink 5 years ago
          Very interested in working on symbolic and deep learning projects for NLP as well.
        • screye 5 years ago
          I can't believe the level of productivity this Hugging Face team has.

          They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

          Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just based on the quality of the team. Now, if only I could figure out how to put a few thousand $ into a Series A startup as just some guy.

          • manojlds 5 years ago
            > Idk what their monetization plan is as a startup

            > put a few thousand $ in a series-A

            Not a good idea.

            • screye 5 years ago
              I see them as an acqui-hire target. Especially from Facebook, since they are geographically so close to the FAIR labs in NY, or from Google, getting integrated into Google AI like DeepMind did (esp. since Google uses a ton of Transformers anyway).

              I can't think of many small teams that can be acquired and can build a company's ML infrastructure as fast as this team.

              If they have the money for it, OCI and Azure may also be keeping a lookout for them.

          • ZeroCool2u 5 years ago
            We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer[1]?

            1. https://spacy.io/usage/linguistic-features#tokenization

            • LunaSea 5 years ago
              It used to be that pre-deep-learning tokenizers would extract ngrams (n-token-sized chunks), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.

              Is this possible using HuggingFace (or another word embedding based library)?

              I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.

              • brockf 5 years ago
                Most implementations are actually moving in the opposite direction. Previously, there was a tendency to aggregate words into phrases to better capture the "context" of a word. Now, most approaches are splitting words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older, "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.
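
                As a small illustration (a sketch using the transformers library's pretrained BERT tokenizer; the exact splits depend on the vocabulary), longer or rarer words get broken into sub-word pieces:

                    # Sub-word (WordPiece) splitting with a pretrained BERT vocabulary.
                    from transformers import BertTokenizer

                    tok = BertTokenizer.from_pretrained("bert-base-uncased")
                    print(tok.tokenize("tokenization"))   # e.g. ['token', '##ization']
                    print(tok.tokenize("unaffordable"))   # splits into several '##' pieces
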
                • LunaSea 5 years ago
                  > multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts

                  Indeed. Do you have an example of a library or snippet that demonstrates this?

                  My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768 (I believe) dimensional space but don't contain queryable temporal information, no?

                  I like ngrams as a sort of untagged / unlabelled entity.

                  • PeterisP 5 years ago
                    When using BERT (and all the many things like it, such as the earlier ELMo and ULMFiT and the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you provide all the tokens in a sequence as input. You don't get an "embedding for word foobar in position 123"; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional "embedding for word foobar in position 123, conditional on all the particular other words that were before and after it". Including very long-distance relations.

                    One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.

                    It's kind of the other way around compared to word2vec-style systems. Before, you used to have a 'thin' embedding layer that's essentially just a lookup table, followed by a bunch of complex layers of neural networks (e.g. multiple Bi-LSTMs followed by a CRF); in the 'current style' you have "thick embeddings", which means running the input through all the many transformer layers of a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
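
                    A rough sketch of that 'thick embeddings, thin head' pattern using the transformers library and PyTorch (the model name, head size, and exact API are assumptions that may vary by version):

                        import torch
                        from transformers import BertModel, BertTokenizer

                        tok = BertTokenizer.from_pretrained("bert-base-uncased")
                        bert = BertModel.from_pretrained("bert-base-uncased")
                        head = torch.nn.Linear(768, 2)  # the thin custom layer: "glorified linear regression"

                        inputs = tok("The quick brown fox", return_tensors="pt")
                        with torch.no_grad():
                            hidden = bert(**inputs).last_hidden_state  # (batch, seq_len, 768) contextual embeddings
                        logits = head(hidden[:, 0, :])                 # classify from the [CLS] position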

                    • visarga 5 years ago
                      > Do you have an example of a library or snippet that demonstrates this?

                      All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.

                      The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot-product ranking fast (such as Annoy).
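
                      For example, a minimal sketch with random vectors standing in for real contextual embeddings, using Annoy for the fast dot-product ranking:

                          import numpy as np
                          from annoy import AnnoyIndex

                          dim = 768
                          embeddings = np.random.randn(1000, dim).astype("float32")  # stand-ins for real embeddings

                          index = AnnoyIndex(dim, "dot")            # dot-product similarity
                          for i, vec in enumerate(embeddings):
                              index.add_item(i, vec)
                          index.build(10)                           # 10 trees

                          print(index.get_nns_by_vector(embeddings[0], 5))  # ids of the 5 nearest embeddings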

                • useful 5 years ago
                  Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.

                  SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.

                  • hnaccy 5 years ago
                    Great! Just did a quick test and got a 6-7x speedup on tokenization.
                    • clmnt 5 years ago
                      Mind sharing what tests you ran & with which setup? Thanks!
                    • orestis 5 years ago
                      Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems to be outdated.
                      • rococode 5 years ago
                        They don't use huggingface, but some of the modern approaches for topic modeling use variational auto-encoders, see:

                        Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame

                        VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire

                        • samcodes 5 years ago
                          Thanks! I hadn't seen VAMPIRE! So stoked to see a new approach to topic modeling. SVD etc. are very much a local max.
                        • ogrisel 5 years ago
                          Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), LDA, or NMF, are probably just fine for extracting topics (soft clustering).
                          • ogrisel 5 years ago
                            The reason is that you do not need to finely understand the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use bag-of-words representations (e.g. TF-IDF) as their input.
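
                            A minimal sketch of that kind of pipeline in scikit-learn (tiny placeholder corpus; TF-IDF followed by NMF as the soft clustering step):

                                from sklearn.feature_extraction.text import TfidfVectorizer
                                from sklearn.decomposition import NMF

                                docs = ["the cat sat on the mat",
                                        "dogs and cats make good pets",
                                        "stocks fell sharply on monday",
                                        "markets rallied after the earnings report"]

                                tfidf = TfidfVectorizer(stop_words="english")
                                X = tfidf.fit_transform(docs)

                                nmf = NMF(n_components=2, random_state=0)
                                doc_topics = nmf.fit_transform(X)            # soft topic memberships per document
                                terms = tfidf.get_feature_names_out()
                                for k, component in enumerate(nmf.components_):
                                    top = component.argsort()[-3:][::-1]     # top words per topic
                                    print(f"topic {k}:", [terms[i] for i in top])
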
                            • orestis 5 years ago
                              It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.

                              On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear ways to compose them together. Any resources out there that help make sense of things?

                            • oddnearfuture 5 years ago
                              With the appropriate amount of data, of course.
                          • echelon 5 years ago
                            I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

                            What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

                            What cool problems are there?

                            • visarga 5 years ago
                              > What problems can you solve with NLP?

                              It's mostly understanding text and generating text. You can do named entity extraction, question answering, summarisation, dialogue bots, information extraction from semi-structured documents such as tables and invoices, spelling correction, typing auto-suggestions, document classification and clustering, topic discovery, part-of-speech tagging, syntactic trees, language modelling, image description and image question answering, entailment detection (whether two statements support one another), coreference resolution, entity linking, intent detection and slot filling, building large knowledge bases (databases of subject-relation-object triplets), spam detection, toxic message detection, ranking search results in search engines, and many many more.

                              • mraison 5 years ago
                                I believe many folks are particularly attracted to NLP because the Turing test [1] is an NLP problem.

                                [1] https://en.m.wikipedia.org/wiki/Turing_test

                                • brokensegue 5 years ago
                                  Disagree. I think its value is mostly unrelated to that.
                                • starpilot 5 years ago
                                  All of the above; it's like asking what problems you can solve with math. HuggingFace's transformers are said to be a Swiss Army knife for NLP. I haven't worked with them yet, but the main fundamental utility seems to be generating fixed-length vector representations of words. Word2vec started this, but the vectors have gotten much better with stuff like BERT.
                                  • Isn0gud 5 years ago
                                    I thought transformers were mainly used for multi-word embeddings?!
                                  • crawdog 5 years ago
                                    There's a lot! Sentence detection and part-of-speech (POS) tagging, to name a couple. These can be used to determine key concepts in documents that lack metadata. For example, you could cluster on common phrases to identify relationships in the data.
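
                                    For instance, a quick sketch with spaCy (assuming the small English model is installed):

                                        import spacy

                                        nlp = spacy.load("en_core_web_sm")
                                        doc = nlp("HuggingFace released a tokenizer library. It is written in Rust.")

                                        for sent in doc.sents:           # sentence detection
                                            print(sent.text)
                                        for token in doc:                # part-of-speech tags
                                            print(token.text, token.pos_)
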
                                  • m0zg 5 years ago
                                    Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?
                                    • 5 years ago
                                      • virtuous_signal 5 years ago
                                        I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...
                                      • manojlds 5 years ago
                                        Title is off? Should mention Tokenizers as the project.
                                        • rsp1984 5 years ago
                                          What does tokenization (of strings, I guess) do?
                                          • wyldfire 5 years ago
                                            The README [1] shows a great example:

                                            The sentence "Hello, y'all! How are you 😁 ?" is tokenized into words and sub-word pieces. Those tokens are then encoded into integers representing their identity in the model's vocabulary (the emoji isn't in the vocabulary, so it becomes [UNK]).

                                                >>> output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
                                                Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
                                                >>> print(output.ids, output.tokens, output.offsets)
                                                [101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
                                                ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
                                                [(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]
                                            
                                            But there's also good detail in the source [2] which says, "A Tokenizer works as a pipeline, it processes some raw text as input and outputs an Encoding. The various steps of the pipeline are: ...."

                                            [1] https://github.com/huggingface/tokenizers#quick-examples-usi...

                                            [2] https://github.com/huggingface/tokenizers/tree/master/tokeni...
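
                                            And here's a rough sketch of assembling such a pipeline with the Python bindings (the component names follow later versions of the library, so treat the details as assumptions):

                                                from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

                                                # normalizer -> pre-tokenizer -> model, then train on some text
                                                tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
                                                tokenizer.normalizer = normalizers.Lowercase()
                                                tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

                                                trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
                                                tokenizer.train_from_iterator(["Hello, y'all! How are you?"] * 100, trainer)

                                                print(tokenizer.encode("Hello, how are you?").tokens)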

                                          • tarr11 5 years ago
                                            Why is this company called HuggingFace?
                                            • itronitron 5 years ago
                                              I assume it is a reference to the movie Alien.