A multimodal dataset with one trillion tokens

224 points by kulikalov 11 months ago | 52 comments
  • punnerud 11 months ago
    More info on the Salesforce blog: https://blog.salesforceairesearch.com/mint-1t/
    • j7ake 11 months ago
      Wow, did not expect Salesforce to be behind this.

      It’s basically free advertising for technical people to join Salesforce.

      • jszymborski 11 months ago
        Salesforce has long been involved in publishing quality NLP papers, especially during Stephen Merity's tenure.

        Smerity's papers are some of my favourites. Check out

        https://ar5iv.labs.arxiv.org/html/1708.02182

        And my all-time favourite

        https://ar5iv.labs.arxiv.org/html/1911.11423

        • nighthawk454 11 months ago
          Hey, thanks for those Smerity links. I hadn't run across his work yet; the second one in particular looks great.
        • 0xDEADFED5 11 months ago
          Salesforce produced one of the best Llama-3(8B) finetunes, IMO: SFR-Iterative-DPO-LLaMA-3-8B-R

          Hopefully they do something with Llama-3.1

          • gotaran 11 months ago
            I’m skeptical of the caliber of talent at Salesforce given the unusable state of their core product.
            • paxys 11 months ago
              The people building CRM software aren't also the ones doing AI research. The two have nothing to do with each other.
        • supermatt 11 months ago
          I haven't trained any LLMs, so please accept my comment with all the naivety with which it is given - but in the "examples of MINT multimodal documents" graphic at the top of the README, it feels to me as though the labeling (for the images on the left) couldn't be much worse? Is this normal for these datasets? How are we able to build such powerful models with such poor-quality data?
          • whiplash451 11 months ago
            Deep learning is robust to massive label noise [1]

            Not to say that data quality does not matter, but these noisy sets are still very useful.

            [1] https://arxiv.org/abs/1705.10694

            • sigmoid10 11 months ago
              Minor correction: Deep learning using gradient descent is incredibly robust to noise. If you know the mathematics, this also makes sense intuitively: gradients of incorrect labels will generally point in random directions, whereas the "truth" points in a specific direction (and I explicitly mean truth in the sense of what is portrayed consistently as fact in the dataset, not the real world truth). So when you accumulate gradients, you will end up with a net effect that moves weights only towards the consistent answers.
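
              A toy numpy sketch of that intuition (my own illustration, assuming a simple logistic model and a chunk of labels flipped uniformly at random):

              ```
              # Toy sketch: average gradients of a logistic model when 40% of labels
              # are replaced at random. The noisy gradients roughly cancel, while the
              # clean ones accumulate in a consistent direction.
              import numpy as np

              rng = np.random.default_rng(0)
              n, d = 10_000, 2
              X = rng.normal(size=(n, d))
              true_w = np.array([2.0, -1.0])
              y = (X @ true_w > 0).astype(float)   # the consistent "truth" in the data

              flip = rng.random(n) < 0.4           # 40% of labels replaced at random
              y_noisy = np.where(flip, rng.integers(0, 2, n).astype(float), y)

              p = 1 / (1 + np.exp(-(X @ np.zeros(d))))   # predictions at w = 0 (all 0.5)
              err = p - y_noisy
              grad_clean = (err[~flip][:, None] * X[~flip]).mean(axis=0)
              grad_noisy = (err[flip][:, None] * X[flip]).mean(axis=0)

              print(grad_clean)   # points opposite to true_w, so descent moves toward it
              print(grad_noisy)   # close to zero
              ```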

              Since gradient descent is by far the most popular algorithm, it's easy to conflate these two things. But there are other approaches that don't handle noise so well.

            • zsyllepsis 11 months ago
              I think the labels could be much, much worse. They could contain straight noise, just completely random text - not even words. They could also contain plausible, factual text which otherwise has no relationship with the image.

              I think most commonly image datasets like this consist of images and their captions, with the presumption that the content author had _some_ reason for associating the two. The goal of the model is to learn that association. And with a _lot_ of examples, to learn nuanced representations.

              In the third image, for example, we see some kind of text on a material. The caption mentions "Every year he rides for someone we know, touched by cancer". Perhaps the model is fed another example of bicycle races, with similar imagery of racing bibs. Perhaps it's fed another of a race that specifically mentions it's a charity ride to raise money for cancer. Perhaps....

              You get the idea. Alone, each example provides only vague connections between the image and the caption. But when you have a ton of data it becomes easier to separate noise from a weak signal.

            • naveen99 11 months ago
              Copyright and intellectual property are directly at odds with these types of efforts, and have been losing to Linux, GNU, GitHub, Wikipedia, MIT OpenCourseWare, YouTube, LLMs and their datasets. But copyright did slay Napster, PirateBay, Anna’s Archive etc…
              • larodi 11 months ago
                This will all be quite dated in 10-20 years. Common information will be free as it was in the 90s, but valuable information will then probably cost even more, and will be illegal to obtain or possess 99.9% of the time.
                • littlestymaar 11 months ago
                  Until individual countries start realizing that protecting copyright is costing them lots of potential economic growth coming from IA, and the IA business starts lobbying more than the copyright business, at which point the law will just change.

                  Intellectual property is a fairly recent invention in economic history, and it only happened because it benefited the elite. If the balance of power changes, so will the law.

                  • larodi 11 months ago
                    That's apparent. My point being that IA will put the final nail in the coffin, as it's retelling information in a way that evades copyright in many cases.
                    • layer8 11 months ago
                      IA?
                  • kridsdale1 11 months ago
                    Pirate Bay is thriving.
                    • Narhem 11 months ago
                      The people who can afford to give do so knowing the alternative is what amounts to a prison lifestyle for their children.

                      Any ground that can be broken should be.

                    • sva_ 11 months ago
                      Does it make sense to measure a dataset in tokens? Shouldn't it be tokenizer-agnostic? E.g., the OpenAI tokenizer encodes roughly 4 characters per token, but I could also have a tokenizer that does 1 character per token, leading to a ~4x increase in token count (relative to the OpenAI tokenizer).
                      • anas-awadalla 11 months ago
                        Hello! Totally agree that tokens will be model-dependent. We chose to calculate tokens using the GPT-2 tokenizer as that is a common metric used by other datasets like FineWeb. So this should roughly give you a sense of how large the data is in comparison to others. We report other metrics too, like the number of documents and number of images.
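
                        For anyone who wants to reproduce that kind of count, a rough sketch using tiktoken's GPT-2 encoding (the sample documents below are made up; in practice you'd stream text out of the dataset shards):

                        ```
                        # Rough sketch: counting dataset size in GPT-2 tokens with tiktoken.
                        import tiktoken

                        enc = tiktoken.get_encoding("gpt2")
                        docs = [
                            "MINT-1T is a multimodal interleaved dataset.",
                            "Token counts depend on which tokenizer you pick.",
                        ]
                        total = sum(len(enc.encode(doc)) for doc in docs)
                        print(total)
                        ```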
                        • reverius42 11 months ago
                          How does the GPT-2 tokenizer deal with non-text input? This dataset is multimodal but I thought GPT-2 was text only.
                      • ks2048 11 months ago
                        It looks like it contains data from CommonCrawl and ArXiv. It's not clear what kind of processing they did, but sometimes these releases seem like just repackaging existing datasets with your own name on them. It's not hard to get bulk downloads from these sources directly.

                        I thought CommonCrawl truncated files at 1MB. I wonder if the PDFs for CommonCrawl were re-fetched from the URLs. That could be useful if they provide a simple way to get those full files.

                        • anas-awadalla 11 months ago
                          Hello! Creator of MINT here.

                          We do a lot of pre-processing of CommonCrawl (which in its raw form isn’t all that useful for training models). This includes heuristics to remove low-quality text and images, and deduplication of documents, paragraphs, and images. All of these are crucial to achieve good training performance.

                          On your point regarding PDFs, we actually don’t constrain ourselves to the 1MB files and do our own downloading of PDFs!
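
                          As a rough illustration of the deduplication idea only (a toy sketch, not our actual pipeline), exact paragraph-level dedup can be as simple as hashing normalized paragraphs and keeping first occurrences:

                          ```
                          # Toy sketch, not the MINT-1T pipeline: exact paragraph-level dedup
                          # by hashing whitespace-normalized, lowercased paragraphs.
                          import hashlib

                          def dedup_paragraphs(documents):
                              seen = set()
                              out = []
                              for doc in documents:
                                  kept = []
                                  for para in doc.split("\n\n"):
                                      key = hashlib.sha256(" ".join(para.lower().split()).encode()).hexdigest()
                                      if key not in seen:
                                          seen.add(key)
                                          kept.append(para)
                                  out.append("\n\n".join(kept))
                              return out

                          print(dedup_paragraphs(["Hello world.\n\nShared boilerplate.", "Shared boilerplate.\n\nFresh text."]))
                          ```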

                          • ks2048 11 months ago
                            I see. Thanks for the reply. I opened one of the tar files and see now how it has extracted the text into json files.
                        • wsc981 11 months ago
                          So, I read the blog post and checked the GitHub page, but I don't have a clear picture yet. I am still kinda new to the LLM space.

                          What would the use-case be for this model? What are the advantages over something like Llama?

                          • Tepix 11 months ago
                            It's a dataset to train models, not a model.
                          • stealthcat 11 months ago
                            Marketed as “multimodal”, but it's actually just text and images.

                            A multimodal dataset should be multimedia: text, audio, images, video, and optionally more, like sensor readings and robot actions.

                            • optimalsolver 11 months ago
                              How effective would modeling raw byte sequences be, with the individual bytes as the "tokens", and a vocabulary of 256 elements?

                              You could then train on any kind of digital data.
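
                              (For what it's worth, the encoding side of that setup is close to trivial; a quick sketch:)

                              ```
                              # Byte-level "tokenization": any digital data becomes integers in 0..255,
                              # so the vocabulary has exactly 256 elements.
                              text = "héllo"                        # works the same for image or audio bytes
                              tokens = list(text.encode("utf-8"))
                              print(tokens)                         # [104, 195, 169, 108, 108, 111]
                              print(bytes(tokens).decode("utf-8"))  # héllo
                              ```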

                              • derefr 11 months ago
                                Due to the way tokenization usually works with LLMs (using BPE — Byte Pair Encoding), there's usually already a 256-element set of embeddings within the token-space that represents "raw bytes." You could say that this 256-element set is "pre-seeded" into any BPE encoding — and will remain part of the encoding as long as at least one document in the dataset used to determine the tokenization uses each byte at least once in a non-high-frequency-suffix-predictable way.

                                These tokens are also already very much in use by the tokenizer — they get emitted in sequences, to encode single Unicode codepoints that weren't common enough in the dataset to get their own tokens, and so instead require multiple tokens to represent them. I believe most tokenizers (e.g. tiktoken) just take the UTF-8 byte-sequences underlying these codepoints and encode them literally as sequences of the above 256-element set.

                                If you're curious, here's the definition of the encoding used by most modern LLMs, in newline-delimited "[base64 of raw input byte sequence] [tokenID to encode as]" format: https://openaipublic.blob.core.windows.net/encodings/cl100k_... . If you decode it, you can observe that the rest of the 256-element single-byte embedding space gets mapped to tokenIDs immediately following those of the ASCII printables.
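
                                You can poke at that byte fallback with tiktoken itself; a small sketch (the codepoint is an arbitrary uncommon one, so it presumably has no dedicated token):

                                ```
                                # Small sketch: an uncommon codepoint likely falls back to byte-level tokens
                                # in cl100k_base; decode_single_token_bytes shows the raw bytes behind each.
                                import tiktoken

                                enc = tiktoken.get_encoding("cl100k_base")
                                ids = enc.encode("ᚠ")   # a runic character, U+16A0
                                print(ids)
                                print([enc.decode_single_token_bytes(t) for t in ids])
                                ```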

                                • nodja 11 months ago
                                  Somewhat inefficient for text, very inefficient for images, especially if you work in pixel space. The max context a model today has been trained on is 1M tokens, which takes up a lot of memory. Even if context were not an issue, generating a 1000x1000 image would take ~3 hours at 100 tokens/s inference.

                                  Google has trained an encoder/decoder LLM on bytes, called ByT5 [1].

                                  [1] https://huggingface.co/google/byt5-xxl
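
                                  The back-of-envelope arithmetic behind that image estimate (assuming one byte per pixel; raw RGB in pixel space would roughly triple it):

                                  ```
                                  # Back-of-envelope: byte-per-pixel generation of a 1000x1000 image
                                  # at 100 tokens/s. Raw RGB would be ~3x longer.
                                  pixels = 1000 * 1000
                                  seconds = pixels / 100
                                  print(seconds / 3600)   # ~2.8 hours
                                  ```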

                                  • Tostino 11 months ago
                                    I think the work on multi-token prediction[0] within a single turn could be a significant development that makes byte-level tokenization models more practical. This approach allows the model to predict multiple tokens in parallel, potentially addressing the efficiency concerns raised about byte-level models.

                                    By predicting multiple tokens simultaneously, it could significantly speed up inference time, especially for tasks that require generating large amounts of data (like images). This could help mitigate the performance bottleneck mentioned in the parent comment about generating a 1000x1000 image.

                                    [0] https://ar5iv.labs.arxiv.org/html/2404.19737

                                  • akrymski 11 months ago
                                    Forget bytes, go for bits. Vocab of size 2. At a theoretical level all of AI comes down to a classifier that is able to predict the next bit given a string of bits. Check out Tsetlin Machines. At some point we will be doing it in hardware.

                                    https://byte-gpt.github.io/

                                    • kulikalov 11 months ago
                                      Sounds inefficient. It’s like predicting the boiling point of a kettle by measuring the speed of individual molecules of water.
                                      • BizarroLand 11 months ago
                                        That would be surprisingly easy with 1st year calculus as long as you were willing to accept a small degree of inaccuracy.
                                    • donnyg 11 months ago
                                      • joshuamcginnis 11 months ago
                                        You might be interested in reading up on DNA sequence LLM models and tooling.
                                        • botro 11 months ago
                                          Can you recommend any resources for this?
                                        • hansvm 11 months ago
                                          It has pros and cons, like anything else.

                                          Tokenization:

                                          - Serves as a form of compression. The main benefit of that is supporting longer sequences for any given context window. As a side benefit, it squeezes about the same amount of "information" into each token -- meaning you don't have to add any terms to your model to account for such an imbalance (or even test whether that hyperparameter matters).

                                          - Allows you to insert stuff other than the raw data into your stream of "tokens" to the LLM. For something like a chatbot, that could be as simple as a prefix to whoever's talking next (e.g., system, user, model). You similarly probably want control characters to denote the end of a sequence. If you have multi-modal content (e.g., text + images), you need some way to delimit the transition between those. All of those problems could mostly be solved with an appropriate encoding scheme, but that's basically tokenization by a different name (in that it's a transformation from one set of tokens to another that you have to apply to every input).

                                          You can solve that second problem trivially with just a vocabulary of 256 "byte" tokens plus O(1) control tokens, so that's not a huge deal in practice, just a point worth mentioning if we're talking about actually naively encoding bytes.
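
                                          A minimal sketch of that vocabulary (not any particular model's scheme): 256 byte tokens with a handful of control tokens appended after them.

                                          ```
                                          # Minimal sketch: 256 byte tokens plus a few control tokens
                                          # marking roles and sequence boundaries.
                                          CONTROL = {"<bos>": 256, "<eos>": 257, "<user>": 258, "<model>": 259}

                                          def encode(text, role="<user>"):
                                              return [CONTROL[role]] + list(text.encode("utf-8")) + [CONTROL["<eos>"]]

                                          def decode(tokens):
                                              return bytes(t for t in tokens if t < 256).decode("utf-8", errors="replace")

                                          ids = encode("hi there")
                                          print(ids)           # role token, raw UTF-8 bytes, end-of-sequence token
                                          print(decode(ids))   # hi there
                                          ```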

                                          The first problem is more interesting. One observation is that if, for your particular problem, tokenization doesn't offer much compression, the difference won't matter much, or will even favor raw bytes over tokenization when the tokenization isn't tailored to your particular data. IIRC there was something about Hebrew text floating around as an example of raw-byte models performing better than tokenized models.

                                          Another observation is that if your particular model has any form of compression for redundant state space (not true of any sort of vanilla transformer, mostly not true for any major competitor, technically possible regardless), especially if the cost of processing a token isn't substantially greater than the cost per byte of tokenizing an input, you also don't buy anything from tokenization. You're absolutely able to feed that raw data in and let the model handle the details.

                                          On the flip side, suppose you're handling vanilla English text with a vanilla transformer. You can support something like 50x longer sequences basically for free by adding tokenization. You'd be silly not to.

                                          Image transformers are slightly different in some sense, at least in typical implementations. The tokenization is lossy (not injective), and the de-tokenization must therefore have the opposite property (not a function -- or, since it is a function, it either doesn't reproduce every possible input image patch or has randomness to at least match the right distribution hopefully). They're often called the same thing, but I view that as something different from tokenization. Certain categories of problems (much like the English text example above) are made drastically cheaper by the process. Others (unlike the English text example above) are rendered impossible by the loss of information. A byte vocabulary makes those theoretically possible again, but you suddenly need a way to handle the "entropy per byte" problem which you didn't have to care about before.

                                          Maybe one last idea, fuzzy detokenization (like in image transformers) has a notable advantage in spec adherence. Outputting an image and then letting some other hand-written code convert that to a png is much more likely to produce something usable than outputting a png directly, byte by byte. The whole thing is probabilistic, and the flurry of strategies you've seen along the lines of "decode while greedily adhering to a schema (json being the canonical example everyone wants to use for some reason, if you want to search for it)" produce the wrong output distribution, often drastically so, by virtue of the biased sampling on something only correct because of its conditional probabilities. I'm not sure exactly how big of a model you need (or how tailored of a loss function) to make a model reliably output correct, large png files, but the current SOTA isn't there yet for general-purpose problems.

                                          In practice, people have made some byte-token models. They vary from "meh" to SOTA depending on the problem. On most problems, they're much more expensive than tokenized solutions. Interestingly, when they're SOTA they tend to be among the cheaper solutions.

                                          I've been chipping away at some new model architectures, and something kind of like a byte-token solution is pretty suitable for those, largely because the model itself offers that compression you would otherwise obtain from tokenization. I'll finish and release them one of these years. For transformers though, the byte-token solution is usually only interesting insofar as proving people's suspicions. Results are fine, not amazing, except in special cases.

                                        • brianjking 11 months ago
                                          What's the license though?
                                          • dpifke 11 months ago
                                            From https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML#l...:

                                            We release MINT-1T under a CC-BY-4.0 license, designating it primarily as a research artifact. While the dataset is freely available, users are responsible for ensuring its legal use in commercial settings. Users must independently verify compliance with applicable laws before employing MINT-1T for commercial purposes.

                                            Same page includes this caveat:

                                            Potential Legal and Ethical Concerns: While efforts were made to respect robots.txt files and remove sensitive information, there may still be content that individuals did not explicitly consent to include.

                                            • paxys 11 months ago
                                              Ah yes, the "if you get busted for copyright violations it's not our problem" license.
                                          • EGreg 11 months ago
                                            License: None

                                            Means we can’t legally use it?

                                            • thomashop 11 months ago
                                              ```We release MINT-1T under a CC-BY-4.0 license, designating it primarily as a research artifact. While the dataset is freely available, users are responsible for ensuring its legal use in commercial settings. Users must independently verify compliance with applicable laws before employing MINT-1T for commercial purposes.```
                                                • benreesman 11 months ago
                                                    Salesforce quietly does some truly tier-one stuff. They don’t showboat it, which makes them seem more, not less, serious, at least from my seat.

                                                    They use Bazel and shit, which is an acid test for being professionals; it’s a real shop.

                                                  The Magnificent 7 are about to get the taste slapped out of their mouth by skittish momentum guys and their chattels on Sand Hill Road. I look forward to the space this week will create for shops like Salesforce.