Comparing Large Language Models
A colleague pointed me to this site, which he found from an unsourced Medium post.
https://llm.garden/
This led me to a meta-search about this site and found other list-of-lists of LLMs. Here's one:
https://www.tensorops.ai/post/where-can-i-find-a-list-of-llms
First, back to LLM Garden:
I get the distinct impression (from the mediocrity of its design?) that this site was generated from (or by!) chatbot instruction. That aside, given the "at speed" nature of the LLM market that OpenAI, Google, and Meta have initiated, this is going to be a difficult list for them to keep current. This is probably also true of many of the other lists found at TensorOps. The advantage will go to the one that spends tokens crawling the web for the new hotness. At this writing (23 June '23) FinnLLM is listed from 1 June, but Orca has ostensibly made more impact and that paper was released four days later. Also, IMO, their "Description" field holds data that should be sorted into proper fields. But you didn't ask for a critique of this list. Its purposes may be manifold: tracking market competitors, applicability to task, open vs. commercial licensing, etc. But how do you actually compare them?
First-order comparison is similarity. There's a natural tendency to want to group things: bracket them, then compare brackets and tease out winners or a "final four". But we have to know that it's "apples to apples" when doing that. By "like ChatGPT", using this list's available fields, do we just mean architecture, parameter count, and context length? Or are we going to need other lists, created by folks who know more about most or all of these other models, in order to have a hunch or opinion (or data?) by which to make that kind of comparison with an OpenAI product?
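The bracketing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration using only the kind of surface fields a list like LLM Garden exposes; the model names and spec values are placeholders, not real entries.

```python
# Hypothetical sketch: first-order "apples to apples" bracketing using only
# surface fields (architecture, params, context length). All entries below
# are illustrative placeholders.

MODELS = {
    "model-a": {"architecture": "decoder", "params_b": 7,   "context": 2048},
    "model-b": {"architecture": "decoder", "params_b": 13,  "context": 4096},
    "model-c": {"architecture": "encoder", "params_b": 0.3, "context": 512},
}

def bracket_by_architecture(models):
    """Group models into brackets by architecture alone."""
    brackets = {}
    for name, spec in models.items():
        brackets.setdefault(spec["architecture"], []).append(name)
    return brackets

print(bracket_by_architecture(MODELS))
# the two decoder models land in one bracket, the encoder model in another
```

The catch, of course, is that a bracket built this way tells you the models share a coarse shape, not that they are actually comparable.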
Just the classification of the architectures as encoder-only, decoder-only, or encoder-decoder is one slice, but not a complete view. In the last couple of months there have been explosions of innovation in post-base-training using open-sourced LLMs. Should this list have entries for autoregressive (GPT) vs. bidirectional (BERT)?
Further, there's little in this listing about the contributions to each LLM that are at least as influential: pretraining methods and data quality, type, and training time (convergence thresholds, etc.); instruction tuning; and, not least, alignment: the use of RLHF, student regret minimization, and alignment goals (correctness, impartiality, prompt rejection). In short, this is a superficial list.
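To make the missing dimensions concrete, here is a hypothetical sketch of the shape a less superficial entry might take. The field names and example values are assumptions for illustration only, not data from any real list.

```python
# Hypothetical sketch: an entry schema that records the dimensions the
# surface lists omit (pretraining, instruction tuning, alignment).
from dataclasses import dataclass, field

@dataclass
class LLMEntry:
    name: str
    architecture: str              # encoder / decoder / encoder-decoder
    params_b: float                # parameter count, in billions
    context_length: int
    # the dimensions typically missing from list-of-lists entries:
    pretraining_data: str          # data quality / type notes
    training_notes: str            # convergence thresholds, training time
    instruction_tuned: bool
    alignment_method: str          # e.g. "RLHF", or "none"
    alignment_goals: list = field(default_factory=list)

# Illustrative placeholder entry:
entry = LLMEntry(
    name="example-model",
    architecture="decoder",
    params_b=7,
    context_length=2048,
    pretraining_data="web crawl, deduplicated",
    training_notes="unknown",
    instruction_tuned=True,
    alignment_method="RLHF",
    alignment_goals=["correctness", "impartiality", "prompt rejection"],
)
```

Even a schema like this only helps if the people who trained the models actually publish the values, which brings us to the car analogy.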
We're kind of at the point (especially with so much leading work happening inside corporations holding their cards close) that comparing LLMs by the public data amounts to comparing ICE cars by engine displacement, horsepower, zero-to-60 times, and price. You might reveal larger and more salient differences if you knew about tire size and type, or engine design, to say nothing of seat comfort, legroom, visibility and, of course, 'does it have CarPlay?' 🙂